Introduction to web scraping in Python

Introduction

In this tutorial, you will learn how to build a web scraper using Python. You will scrape Stack Overflow to get questions along with their stats.

Python is a high-level programming language designed to be easy to read and simple to implement. It is open source, which means it is free to use, even for commercial applications.

Web scraping is a technique for extracting data from websites. Most websites only display their data for viewing in a browser; they do not offer a way to save a copy of that data for personal use. The only option then is to manually copy and paste the data, a tedious job that can take hours or sometimes days to complete. Web scraping automates this process, so that instead of manually copying data from websites, a scraping program performs the same task in a fraction of the time.

NB: Before you scrape a site, check its terms and conditions to be sure scraping is not prohibited. For example, eBay sued Bidder's Edge for scraping its listings.

Python in this piece refers to Python 3.x versions.

Setting Up The Environment

You will use two important libraries for web scraping: requests and BeautifulSoup.

  • The requests library makes a GET request to a web server, which downloads the HTML contents of a web page for us.
  • The BeautifulSoup library parses the HTML and extracts information from it.

To install these libraries, run:

pip install requests beautifulsoup4

There are three basic steps to web scraping:

  • Fetching the host site
  • Parsing and extracting information
  • Saving the information

Fetching The Host Site’s Content

Fetching a site's content is straightforward in Python: it is as simple as performing a GET request. For example, look at the code below:

import requests

site = requests.get('https://stackoverflow.com/')

In the code above, you imported the requests library and used its get function to fetch the site to be scraped. The variable site now holds a Response object.
To check whether the GET request was successful before performing any actions, you can check the status code:

if site.status_code == 200:
    print(site.content)

Since the response object exposes a number of properties such as status_code, content, and headers, you can always use the status code as a condition to decide whether or not to parse the response.
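For instance, here is a quick way to inspect a few of these properties (the values shown in the comments are illustrative):

print(site.status_code)              # e.g. 200
print(site.headers['Content-Type'])  # e.g. 'text/html; charset=utf-8'
print(len(site.content))             # size of the raw HTML in bytes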

To find out more about the various properties exposed by the response object, you can check the official requests documentation.

Parsing and extracting the information

Now that you have the requests library working as it should, it's time to parse the content of the response and extract the information you need from the site.

There are two ways to extract data from the response object:

  • CSS selectors
  • The find and find_all functions

CSS Selectors

BeautifulSoup objects support searching a page via CSS selectors using the select method. You can use CSS selectors to find all the questions on the Stack Overflow home page like this:

from bs4 import BeautifulSoup

content = BeautifulSoup(site.content, 'html.parser')
questions = content.select('.question-summary')

Looking at the code block above, you will notice that you imported BeautifulSoup and used it to parse the site's content with the built-in HTML parser. While there are third-party parsers that can be installed and configured, I will stick to the default HTML parser for this piece.

One other question you might ask now is: where did the class question-summary come from?

To use CSS selectors, or even the find and find_all methods of BeautifulSoup, you have to know the structure of the HTML that holds the element you want to draw information from. A good method to do this is to inspect the element you want, and get its class from the developer tools. In our case, every question is wrapped in a class called question-summary.
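To make the structure concrete, here is a simplified, hypothetical sketch of the markup (the real page contains more attributes and nesting, and its class names may change over time):

from bs4 import BeautifulSoup

# Simplified stand-in for one question on the Stack Overflow home page
html = '''
<div class="question-summary">
  <div class="votes"><div class="mini-counts"><span>42</span></div></div>
  <div class="status"><div class="mini-counts"><span>3</span></div></div>
  <div class="views"><div class="mini-counts"><span>1200</span></div></div>
  <a class="question-hyperlink" href="/questions/1">How do I parse HTML?</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.select('.question-summary .question-hyperlink')[0].get_text())
# How do I parse HTML?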

Next, you will get the topic, URL, views, answers, and votes for each question. Look at the code below:

for question in questions:
    topic = question.select('.question-hyperlink')[0].get_text()
    url = question.select('.question-hyperlink')[0].get('href')
    views = question.select('.views .mini-counts span')[0].get_text()
    answers = question.select('.status .mini-counts span')[0].get_text()
    votes = question.select('.votes .mini-counts span')[0].get_text()

If you take a look at the code above, you should notice three main things:

  • I indexed the result with [0] because the select method always returns a list, even when it matches a single element (see the select_one sketch after this list).
  • I used the get_text() method, which returns the text content of a single element.
  • I used the get('href') method. The get method can read any attribute from an HTML element; here, I wanted the href attribute.
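As a side note, BeautifulSoup also provides select_one, which returns the first match directly (or None if nothing matches), so you can skip the [0] indexing:

topic_link = question.select_one('.question-hyperlink')
if topic_link is not None:
    topic = topic_link.get_text()
    url = topic_link.get('href')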

If you print each topic, URL, views, answers, and votes to the terminal, you will notice that the printed information tallies with the information on the website.
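A quick sanity check inside the loop might look like this:

print(topic, url, views, answers, votes, sep=' | ')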

Using the find and find_all functions

Another way to parse an HTML page is to use the find and find_all methods. These two methods can also get you any information you want from a webpage, as they allow you to search by tag, id, or class name. First, you will need to import the BeautifulSoup library and initialize it with the HTML parser:

from bs4 import BeautifulSoup

content = BeautifulSoup(site.content, 'html.parser')
questions = content.find_all(class_='question-summary')

If you look at the code block above, you notice only the third line is different from the snippet we have in the CSS selectors section.

In the line where you defined the questions variable, you will notice it is similar to what you did with CSS selectors, except that you called the find_all method. You will also notice the class_ argument: it tells find_all to return every element that carries the given class (the trailing underscore is there because class is a reserved keyword in Python). If you want to find an element by its ID instead, you pass the plain id argument, as sketched below.
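Here are a few illustrative lookups (the tag names and the id value are hypothetical, not taken from Stack Overflow's markup):

links = content.find_all('a')                                   # every anchor tag on the page
container = content.find('div', id='questions')                 # first div whose id is 'questions'
summaries = content.find_all('div', class_='question-summary')  # every div with this class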

Next, you will get the topic, URL, views, answers, and votes for each question. Look at the code below:

for question in questions:
    topic = question.find(class_='question-hyperlink').get_text()
    url = question.find(class_='question-hyperlink').get('href')
    views = question.find(class_='views').find(class_='mini-counts').find('span').get_text()
    answers = question.find(class_='status').find(class_='mini-counts').find('span').get_text()
    votes = question.find(class_='votes').find(class_='mini-counts').find('span').get_text()

Looking at the code block above, you will notice:

  • I do not have to select the first element manually, as find returns only the first element that matches.
  • For nested elements such as views, answers, and votes, I had to chain multiple find calls to reach the inner span.
  • The get and get_text methods were used here as well; they are not specific to CSS selectors.

NOTE: While each of the two examples above achieved its aim using only one approach, either CSS selectors or the find and find_all methods, you can also combine the two, as seen below:

questions = content.select('.question-summary')
for question in questions:
    topic = question.find(class_='question-hyperlink').get_text()

Bringing the code together

When all is said and done, it is nice to see what each method looks like in full. Here is the CSS selector version:

import requests
from bs4 import BeautifulSoup

site = requests.get('https://stackoverflow.com/')
if site.status_code == 200:
    content = BeautifulSoup(site.content, 'html.parser')
    questions = content.select('.question-summary')
    for question in questions:
        topic = question.select('.question-hyperlink')[0].get_text()
        url = question.select('.question-hyperlink')[0].get('href')
        views = question.select('.views .mini-counts span')[0].get_text()
        answers = question.select('.status .mini-counts span')[0].get_text()
        votes = question.select('.votes .mini-counts span')[0].get_text()

Here is what the find and find_all version looks like:

import requests
from bs4 import BeautifulSoup

site = requests.get('https://stackoverflow.com/')
if site.status_code == 200:
    content = BeautifulSoup(site.content, 'html.parser')
    questions = content.find_all(class_='question-summary')
    for question in questions:
        topic = question.find(class_='question-hyperlink').get_text()
        url = question.find(class_='question-hyperlink').get('href')
        views = question.find(class_='views').find(class_='mini-counts').find('span').get_text()
        answers = question.find(class_='status').find(class_='mini-counts').find('span').get_text()
        votes = question.find(class_='votes').find(class_='mini-counts').find('span').get_text()

Saving the information

The real motive behind scraping any site is to save the information somewhere. It might be a local database such as MySQL, a JSON file or even a CSV document. Here, you will save the information into a CSV file.

The easiest way to save the parsed data into a CSV file is to create an empty list, append to it as you scrape, and then, at the end, write the list of data into the CSV file. Take a look at this:

import csv
import requests
from bs4 import BeautifulSoup

data_list = []
site = requests.get('https://stackoverflow.com/')
if site.status_code == 200:
    content = BeautifulSoup(site.content, 'html.parser')
    questions = content.select('.question-summary')
    for question in questions:
        topic = question.select('.question-hyperlink')[0].get_text()
        url = question.select('.question-hyperlink')[0].get('href')
        views = question.select('.views .mini-counts span')[0].get_text()
        answers = question.select('.status .mini-counts span')[0].get_text()
        votes = question.select('.votes .mini-counts span')[0].get_text()
        # Collect each question's fields as one row
        new_data = {"topic": topic, "url": url, "views": views, "answers": answers, "votes": votes}
        data_list.append(new_data)
    # newline='' lets the csv module control line endings itself
    with open('selector.csv', 'w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=["topic", "url", "views", "answers", "votes"], delimiter=';')
        writer.writeheader()
        for row in data_list:
            writer.writerow(row)

If you look at the code block above, you will notice it is similar to the earlier CSS selector code, except that:

  • A new import of the csv library was added.
  • An empty list called data_list was declared at the beginning of the code.
  • A dictionary of the items currently being scraped is built and appended to data_list.
  • A new CSV file is created, and the DictWriter class of the csv library writes the headers for the data (newline='' is passed to open so the csv module can handle line endings itself).
  • A loop writes every row of data_list (each row is a dict) into the CSV file.
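To verify the output, you can read the file back with csv.DictReader, using the same delimiter:

import csv

with open('selector.csv', newline='') as file:
    reader = csv.DictReader(file, delimiter=';')
    for row in reader:
        print(row['topic'], row['votes'])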

Alternatively, here is the end product using the find and find_all methods:

import csv
import requests
from bs4 import BeautifulSoup

data_list = []
site = requests.get('https://stackoverflow.com/')
if site.status_code == 200:
    content = BeautifulSoup(site.content, 'html.parser')
    questions = content.find_all(class_='question-summary')
    for question in questions:
        topic = question.find(class_='question-hyperlink').get_text()
        url = question.find(class_='question-hyperlink').get('href')
        views = question.find(class_='views').find(class_='mini-counts').find('span').get_text()
        answers = question.find(class_='status').find(class_='mini-counts').find('span').get_text()
        votes = question.find(class_='votes').find(class_='mini-counts').find('span').get_text()
        # Collect each question's fields as one row
        new_data = {"topic": topic, "url": url, "views": views, "answers": answers, "votes": votes}
        data_list.append(new_data)
    # newline='' lets the csv module control line endings itself
    with open('find.csv', 'w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=["topic", "url", "views", "answers", "votes"], delimiter=';')
        writer.writeheader()
        for row in data_list:
            writer.writerow(row)
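If you would rather save JSON, as mentioned earlier, the same data_list of dicts can be dumped with the standard json module (the filename questions.json is just an example):

import json

with open('questions.json', 'w') as f:
    json.dump(data_list, f, indent=2)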

Conclusion

In this little piece, you found out how to scrape data from a website. You also learned that scraping some sites may be illegal, and that you should check their terms and conditions before scraping. You learned about CSS selectors as well as the find and find_all methods of the BeautifulSoup library. Finally, you discovered how to save the scraped data into a CSV file.

The codebase for this tutorial is available here.