How to Crawl a Web Page with Scrapy and Python 3

Web scraping, web crawling, web harvesting, and web data extraction are synonyms referring to the act of mining data from web pages across the Internet. Web scrapers, or web crawlers, are tools that go over web pages programmatically, extracting the required data. This data, which usually consists of large sets of text, can be used for analytical purposes, to understand products, or simply to satisfy one’s curiosity about a certain web page.

If you are wondering how to go about web crawling, this tutorial will show you the basics of web scraping through a simple data set. You should be able to follow along regardless of your level of programming expertise. For the practical example, we will be using our CloudSigma blog and extracting information about the tutorials on our blog page. By the time you reach this tutorial’s conclusion, you will have a functioning web scraper built with Python 3 that crawls several pages of our blog section and displays the data on your screen.

Using the knowledge from creating this basic web scraper, you can then expand on it and create your own web scrapers. This should be fun, so let’s begin!

Prerequisites

This is a hands-on tutorial, so you should have a local development environment for Python 3 to follow along well. First, you can refer to our tutorial on how to install Python 3 and set up a local programming environment on Ubuntu.

Scrapy

Web scraping involves two steps: the first is finding and downloading web pages; the second is crawling through those pages and extracting information from them.

There are a number of ways and libraries that can be used to build a web scraper from scratch in many programming languages. However, this may bring issues in the future when your web scraper becomes complex, or when you need to crawl multiple pages with different settings and patterns at a time. It may be quite a heavy task figuring out how to transform your scraped data between different formats such as CSV, XML, or JSON.

While some may appreciate the challenge of building their own web scraper from scratch, it is better not to reinvent the wheel and instead build on top of an existing library that handles all of those issues. We will be using Scrapy, a Python library, together with Python 3 to implement the web scraper in this tutorial. Scrapy is an open-source tool and one of the most popular and powerful Python web scraping libraries. It was built to handle the common functionality that all scrapers need, so you don’t have to implement it yourself every time you write a web crawler. With Scrapy, the process of building a scraper becomes easy and fun.

Scrapy is available from PyPI (the Python Package Index), a community-owned repository that hosts most Python packages. When you install and set up Python 3 on your local development environment, the pip package installer is installed along with it, and you can use pip to install packages from PyPI.

Step 1: How to Build a Simple Web Scraper

First, to install Scrapy, run the following command:
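Depending on your setup, the command may be pip or pip3; run it inside your virtual environment if you created one:

  pip install scrapy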

Optionally, you may follow the Scrapy official installation instructions from the documentation page. If you have successfully installed Scrapy, create a folder for the project using a name of your choice:
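For example, using scrapy-crawler as a placeholder name:

  mkdir scrapy-crawler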

Navigate into the folder and create the main file for the code. This file will hold all the code for this tutorial:
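Assuming the placeholder folder name from above and main.py as the file name used in the rest of this tutorial:

  cd scrapy-crawler
  touch main.py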

If you wish, you can create the file using your text editor or IDE instead of the above command.

Next, open the file, and let’s start by creating a basic scraper that uses Scrapy. We will create a Python class that extends scrapy.Spider, a basic spider class provided by Scrapy. This class will have two required attributes, as defined below:

  • name — a string name to identify the spider (you may enter a name of your choice).
  • start_urls — a list of URLs to start crawling from. We will start with a single URL.

Add the following code snippet in the opened file to create the basic spider:
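  import scrapy


  class CloudSigmaCrawler(scrapy.Spider):
      name = "cloudsigma_crawler"
      start_urls = ["https://blog.cloudsigma.com/blog/"]

Note that the spider does not define a parse() method yet; we will add one in Step 2.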

Below is an explanation of each line of code:

The first line imports Scrapy, allowing us to use the various classes that the package provides.

In the next line, we extend the Spider class provided by Scrapy and create a subclass called CloudSigmaCrawler. By extending a class (here, Spider), we get access to that class’s properties and methods, which we can now use in our code. In this case, the Spider class has methods and behaviors that define how to follow URLs and extract data from web pages. However, it doesn’t know which URLs to follow or what data to extract. By extending it, we provide it with the required information. To learn more about subclassing and extending, read up on object-oriented programming principles.

In our CloudSigmaCrawler, we define the required attributes. First, we name our spider cloudsigma_crawler. Then, we provide a single URL to start from: https://blog.cloudsigma.com/blog/. Opening this URL takes you to page 1 of the CloudSigma blog, which lists some of our many tutorials.

Time to test the scraper. You have a few options. If you are using an IDE, for example the PyCharm Community Edition from JetBrains, it probably comes with a button that you can click to run the script. Another option is the typical way of running Python files from the command line: python path/to/file.py, or py path/to/file.py. A third option is Scrapy’s own command-line interface, which is designed to help you start a scraper. Enter the following command to start the scraper:
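  scrapy runspider main.py

The runspider command works on a standalone spider file such as our main.py; if you later create a full Scrapy project, you would use scrapy crawl cloudsigma_crawler from inside that project instead.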

Depending on the library version of Scrapy that you installed, you should see an output that is something like the following:

[Screenshot: Scrapy spider startup output]

As you can see, the output is quite long so we just picked some parts. Here is what happened when you executed the command:

  • The scraper was initialized, loading the additional components and extensions it needs to follow and read data from URLs.
  • Using the URL provided in the start_urls list, it grabbed the HTML from the page. This is similar to the process your browser follows when opening a web page.
  • After grabbing the HTML, it passed it to the parse method, which we haven’t defined yet. Since parse doesn’t do anything for now, the spider simply exits without processing anything. We will define the behavior of the parse method in the next step.

Step 2: How to Extract Data from a Page

In Step 1, we implemented a basic scraper that grabs an HTML page but does nothing with it afterward. In this section, we will add instructions for extracting data. On the CloudSigma blog page we want to extract data from, you will notice a few things, such as:

  • The header, present on all pages.
  • The navigation menu and search filter box.
  • The actual list of the tutorials in a grid format.

Viewing the source code of the HTML page you intend to scrape gives you a general idea of the structure of the page, which helps you write the scraper. You can view the source code by right-clicking on the page and selecting View Page Source, or by pressing Ctrl + U. Here is a snippet of the source code:
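In simplified form (attributes and extra markup trimmed), the relevant part of the page looks roughly like this, with one <article> element per tutorial:

  <article id="..." class="post ...">
      ... featured image, title, and caption for one tutorial ...
  </article>
  <article id="..." class="post ...">
      ... the next tutorial ...
  </article>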

As you can see, each blog tutorial is enclosed within an HTML tag called <article>. Scraping the page will involve two steps. The first step is grabbing each blog tutorial by looking at the parts of the page containing the data we want. The second step is pulling the data we want out of each tutorial identified by that HTML tag.

Scrapy identifies the data to grab based on the selectors you provide. We can use selectors to find one or more elements on a page and get the data within the elements. Scrapy has support for XPath and CSS selectors.

From the source we viewed earlier, CSS selectors seem to be the easier option, so that is what we will go with, as they will help us find all the tutorials on the page. From the HTML source code, each tutorial is marked with the CSS class called post. CSS class names are selected with .class_name (dot class_name), so we will use .post as our CSS selector. Inside our main.py scraper source code, we will pass the .post selector to the response object’s css() method, so that your file now looks like this:
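  import scrapy


  class CloudSigmaCrawler(scrapy.Spider):
      name = "cloudsigma_crawler"
      start_urls = ["https://blog.cloudsigma.com/blog/"]

      def parse(self, response):
          # each tutorial on the page is wrapped in an element with the .post CSS class
          for tutorial in response.css(".post"):
              # the individual fields are extracted in the next snippet
              pass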

This code snippet grabs all the tutorials on the page specified in start_urls and sets up a loop over them. In the next step, we will extract and display data inside that loop. If you examine the source code of the CloudSigma blog again, you will see that the title of each tutorial is stored within an <a> tag that sits inside an <h2> tag, for example:
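In simplified form (the URL and title text here are placeholders), that markup looks like:

  <div class="entry-wrap">
      <div class="entry-header">
          <h2><a href="https://blog.cloudsigma.com/example-tutorial/">Example Tutorial Title</a></h2>
      </div>
  </div>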

Each tutorial object we loop over exposes a css() method that we can pass a selector to in order to locate and extract child elements. For this example, we want to extract the title, which is enclosed inside the <a> tag. This tag is inside the <h2> tag, inside the .entry-header class, inside the .entry-wrap class. We can pass these CSS selectors to the css() method to extract the title. Modify the code to look like this:
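Yielding a dictionary from parse() is how a Scrapy spider hands the scraped items back to the framework:

  import scrapy


  class CloudSigmaCrawler(scrapy.Spider):
      name = "cloudsigma_crawler"
      start_urls = ["https://blog.cloudsigma.com/blog/"]

      def parse(self, response):
          for tutorial in response.css(".post"):
              yield {
                  # ::text fetches the text inside the <a> tag rather than the tag itself
                  "title": tutorial.css(".entry-wrap .entry-header h2 a::text").extract_first(),
              }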

The trailing comma after extract_first() is not a typo, as we will be adding more entries to this dictionary shortly.

Some points to note from the source code above include:

  • ::text appended to the selector – this is a CSS pseudo-selector that instructs the code to fetch the text inside the tag and not the tag itself.
  • extract_first() method call within the object – instructs the code to only pick the first element that matches the selector. Hence, we get a string rather than a list of elements.

Next, save the file and run the code by entering the following command in your terminal:
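  scrapy runspider main.py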

In the output, you should see the titles of the tutorials:

[Screenshot: tutorial titles in the output]

We can keep expanding on this by adding more selectors to get other details about a tutorial, such as the tutorial’s URL, featured image, and caption.

Let’s examine the HTML code for a single tutorial again:
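Stripped down to the parts we care about (URLs, text, and any class name not discussed in this tutorial are placeholders), a single tutorial entry looks roughly like this:

  <article id="..." class="post ...">
      <div class="...">
          <a href="https://blog.cloudsigma.com/example-tutorial/">
              <img data-lazy-src="https://blog.cloudsigma.com/.../featured-image.jpg" alt="..." />
          </a>
      </div>
      <div class="entry-wrap">
          <div class="entry-header">
              <h2><a href="https://blog.cloudsigma.com/example-tutorial/">Example Tutorial Title</a></h2>
          </div>
          <div class="...">
              <p>A short caption summarizing what the tutorial covers ...</p>
          </div>
      </div>
  </article>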

We want to extract three more pieces of information: the tutorial’s URL, featured image, and caption.

  • From the code snippet above, the image for the blog is stored inside the data-lazy-src attribute of an img tag inside an <a> tag inside a div tag at the start of the blog tutorial. We can use a CSS selector to grab the value like we did with the tutorial titles.
  • Getting the tutorial’s URL is straightforward, as we have the <a> tag inside the <div> element.
  • The caption is enclosed inside the <p> tag which is inside the <div> tag.

We will be using the CSS classes to get what we want. Let’s modify the code so that it looks like this:
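Here is one way to write those selectors; the exact class names and nesting may differ slightly from what you see in the live page source, so adjust the selectors to match:

  import scrapy


  class CloudSigmaCrawler(scrapy.Spider):
      name = "cloudsigma_crawler"
      start_urls = ["https://blog.cloudsigma.com/blog/"]

      def parse(self, response):
          for tutorial in response.css(".post"):
              yield {
                  "title": tutorial.css(".entry-wrap .entry-header h2 a::text").extract_first(),
                  # the title link also carries the tutorial's URL
                  "url": tutorial.css(".entry-wrap .entry-header h2 a::attr(href)").extract_first(),
                  # the featured image lives in the data-lazy-src attribute of the <img> tag
                  "image": tutorial.css("div a img::attr(data-lazy-src)").extract_first(),
                  # the caption is the first paragraph inside the tutorial's wrapper
                  "caption": tutorial.css(".entry-wrap p::text").extract_first(),
              }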

Save the changes and run the code with the following command:
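  scrapy runspider main.py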

You will see more data in the output, like the URL, image, and caption that we added:

[Screenshot: tutorial titles, URLs, images, and captions in the output]

That’s all for crawling a single page. Next, let’s see how we can create a scraper that follows links.

Step 3: How to Crawl Multiple Pages

Up to this point, we have created a scraper that can get data from a single page. However, we want more than that: a spider that can follow links and extract data from multiple pages of a website programmatically.

If you go to the bottom of the CloudSigma blog page, you will notice the pagination links, and a small arrow pointing to the right indicating the next page. Here’s the HTML code snippet:
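In simplified form (the page URLs shown are illustrative), the pagination markup looks like this on page 1:

  <div class="x-pagination">
      <ul>
          <li><span class="current">1</span></li>
          <li><a href="https://blog.cloudsigma.com/blog/page/2/">2</a></li>
          <li><a href="https://blog.cloudsigma.com/blog/page/3/">3</a></li>
          ...
          <li><a class="prev-next" href="https://blog.cloudsigma.com/blog/page/2/">&raquo;</a></li>
      </ul>
  </div>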

The snippet shows several page navigation links within <li> tags, under the <div> tag with the .x-pagination CSS class. Our focus is on the link pointing to the next page. It is in the <a> tag of the last <li> in the <ul> tag.

The link pointing to the next page has the .prev-next class on its <a> tag, as seen in the snippet above. However, if you move to the next page, you will notice that the links to both the previous and the next page carry this CSS class. Consider this snippet for page 2:
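Again simplified, with illustrative URLs:

  <div class="x-pagination">
      <ul>
          <li><a class="prev-next" href="https://blog.cloudsigma.com/blog/">&laquo;</a></li>
          <li><a href="https://blog.cloudsigma.com/blog/">1</a></li>
          <li><span class="current">2</span></li>
          <li><a href="https://blog.cloudsigma.com/blog/page/3/">3</a></li>
          ...
          <li><a class="prev-next" href="https://blog.cloudsigma.com/blog/page/3/">&raquo;</a></li>
      </ul>
  </div>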

If we use the Scrapy extract_first() method, it will work on the first page. However, when it reaches the next page, it will pick the first link with the .prev-next class, which, in the snippet above, points back to the first page. This would result in a loop. Hence, we will use the Scrapy extract() method instead. This method extracts all the elements matching a selector and puts them in an array. From this array, we can pick the last element, which contains the actual link pointing to the next page. Modify your code to look like this:
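The data selectors stay as they were; only the next-page handling at the end of parse() is new:

  import scrapy


  class CloudSigmaCrawler(scrapy.Spider):
      name = "cloudsigma_crawler"
      start_urls = ["https://blog.cloudsigma.com/blog/"]

      def parse(self, response):
          for tutorial in response.css(".post"):
              yield {
                  "title": tutorial.css(".entry-wrap .entry-header h2 a::text").extract_first(),
                  "url": tutorial.css(".entry-wrap .entry-header h2 a::attr(href)").extract_first(),
                  "image": tutorial.css("div a img::attr(data-lazy-src)").extract_first(),
                  "caption": tutorial.css(".entry-wrap p::text").extract_first(),
              }

          # extract() returns every link with the .prev-next class (previous and next) as a list
          next_page = response.css("a.prev-next::attr(href)").extract()
          if next_page:
              # the last element in the list is the link to the next page
              yield scrapy.Request(next_page[-1], callback=self.parse)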

Let’s walk through the code for the next_page selection.

We first define a selector for the next and previous page links. We then use the extract() method to extract the URLs and put them in an array. On page 2 and beyond, the next_page variable will be an array with two elements, like:
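  # illustrative URLs: the first element is the previous-page link, the second is the next-page link
  ['https://blog.cloudsigma.com/blog/', 'https://blog.cloudsigma.com/blog/page/3/']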

Since we are navigating to the next page, we pick the last element in the array with next_page[-1].

The if block checks whether the next_page variable contains anything; if it does, it calls the scrapy.Request() method. In our code, we instruct this method to crawl the page at the provided URL and pass the response back to the parse() method, so that we can parse it, extract the data, and repeat the process for the next page. This continues until the scraper no longer finds a link to a next page, i.e. until the if check fails, at which point it stops.

Save your code and run it. You will notice that the iteration continues looping through the pages as it finds more pages to scrape. This is how you would define a scraper that follows links on a website. Our example is quite straightforward. We are only going to a page, finding the link to the next page, and repeating the process. In other use cases, you may want to follow tags or links that point to external sources and more. Here is the completed source code for the Python 3 basic web scraper:
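As noted earlier, you may need to tweak the data selectors to match the exact markup you see in the page source:

  import scrapy


  class CloudSigmaCrawler(scrapy.Spider):
      name = "cloudsigma_crawler"
      start_urls = ["https://blog.cloudsigma.com/blog/"]

      def parse(self, response):
          # each tutorial on the page is wrapped in an element with the .post CSS class
          for tutorial in response.css(".post"):
              yield {
                  "title": tutorial.css(".entry-wrap .entry-header h2 a::text").extract_first(),
                  "url": tutorial.css(".entry-wrap .entry-header h2 a::attr(href)").extract_first(),
                  "image": tutorial.css("div a img::attr(data-lazy-src)").extract_first(),
                  "caption": tutorial.css(".entry-wrap p::text").extract_first(),
              }

          # follow the pagination: the last .prev-next link points to the next page
          next_page = response.css("a.prev-next::attr(href)").extract()
          if next_page:
              yield scrapy.Request(next_page[-1], callback=self.parse)

If you want to keep the scraped data rather than just print it to the terminal, Scrapy can write it to a file for you, for example with scrapy runspider main.py -o tutorials.json.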

Conclusion

In this tutorial, we built a basic web scraper that can crawl the CloudSigma blog directory and display some information about the blog tutorials, in only about 27 lines of code.

Of course, this is possible because we built it on top of the Scrapy Python library. This is only a foundation that should help you build more complex scrapers that follow more tags, search results of websites, and more. You can check out Scrapy’s official docs for more information on working with Scrapy.

Happy Computing!