Introduction To Web Scraping



Nov 17, 2020

Our Beginner's Guide to Web Scraping

The internet has become such a powerful tool because there is so much information out there. Many marketers, web developers, investors, and data scientists use web scraping to collect online data to help them make valuable decisions.

But if you’re not sure how to use a web scraper tool, it can be intimidating and discouraging. The goal of this beginner's guide is to introduce web scraping to people who are new to it or who don't know exactly where to start.

We’ll even go through an example together to give you a basic understanding of it, so we recommend downloading our free web scraping tool so you can follow along.

So, let’s get into it.

Introduction to Web Scraping

First, it's important to discuss what web scraping is and what you can do with it. Whether this is your first time hearing about web scraping, or you've heard of it but have no idea what it is, this beginner's guide will help you discover what web scraping is capable of doing!

What is Web Scraping?

Web scraping, also known as web harvesting, is a powerful tool that can help you collect data online and transfer the information into an Excel, CSV, or JSON file to help you better understand the information you’ve gathered.

Although web scraping can be done manually, this can be a long and tedious process. That’s why using data extraction tools is preferred when scraping online data, as they can be more accurate and more efficient.

Web scraping is incredibly common and can be used to create APIs out of almost any website.

How do web scrapers work?


Automatic web scraping can be simple but also complex at the same time. But once you understand it and get the hang of it, it’ll become a lot easier. Just like anything in life, practice makes perfect.

The web scraper will be given one or more URLs to load before scraping. The scraper then loads the entire HTML code for the page in question. More advanced scrapers will render the entire website, including CSS and JavaScript elements.

Then the scraper will either extract all the data on the page or specific data selected by the user before the project is run.

Ideally, you want to go through the process of selecting which data you want to collect from the page. This can be text, images, prices, ratings, ASINs, addresses, URLs, etc.

Once you have selected everything you want to extract, you can then export it to an Excel/CSV file to analyze all of the data. Some advanced web scrapers can convert the data into a JSON file, which can be used as an API.

If you want to learn more, you can read our guide on What is Web Scraping and what it’s used for.

Is Web Scraping Legal?

With you being able to extract public information from competitors or other websites, is web scraping legal?

Any publicly available data that can be accessed by everyone on the internet can be legally extracted.

The data has to meet these 3 criteria for it to be legally extracted:

  • User has made the data public
  • No account required for access
  • Not blocked by robots.txt file

As long as it follows these 3 rules, it's legal!
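On the third point, you can check a site's robots.txt yourself; it lives at the root of the domain (e.g. example.com/robots.txt). As a hypothetical illustration, a rule like this blocks all bots from a /private/ section while leaving the rest of the site open:

    User-agent: *
    Disallow: /private/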

You can learn more about the rules of web scraping here: Is web scraping legal?

Web scraping for beginners

Now that we understand what web scraping is and how it works, let’s see it in action to get the hang of it!

For this example, we are going to extract all of the blog posts ParseHub has created, how long they take to read, who wrote them, and their URLs. We're not sure what you'll use this information for, but we want to show you what you can do with web scraping and how easy it can be!

First, download our free web scraping tool.

You’ll need to set up ParseHub on your desktop so here’s the guide to help you: Downloading and getting started.

Once ParseHub is ready, we can begin scraping data.

If it’s your first time using ParseHub, we recommend following the tutorial just to give you an idea of how it works.

But let’s scrape an actual website like our Blog.

For this example, we want to extract all of the blogs we have written, the URL of the blog, who wrote the blog, and how long it takes to read.

Your first web scraping project

1. Open up ParseHub and create a new project by selecting “New Project”

2. Copy this URL: https://www.parsehub.com/blog/ and place it in the text box on the left-hand side and then click on the “Start project on this URL” button.

3. Once the page is loaded on ParseHub there will be 3 sections:

  • Command Section
  • The web page you're extracting from
  • Preview of what the data will look like

The command section is where you tell the software what you want to do, whether that's a click, making a selection, or using one of ParseHub's advanced features.

4. To begin extracting data, you will need to click on what exactly you want to extract, in this case, the blog title. Click on the first blog title you see.

Once clicked, the selection you made will turn green. ParseHub will then make suggestions of what it thinks you want to extract.

The suggested data will be in a yellow container. Click on a title that is in a yellow container then all blog titles will be selected. Scroll down a bit to make sure there is no blog title missing.

Now that you have some data, you can see a preview of what it will look like when it's exported.

5. Let’s rename our selection to something that will help us keep our data organized. To do this, just double click on the selection, the name will be highlighted and you can now rename it. In this case, we are going to name it “blog_name”.

Quick note: when renaming your selections or data, use names without spaces, e.g. “Blog names” won't work but “blog_names” will.

Now that all blog titles are selected, we also want to extract who wrote them, and how long they take to read. We will need to make a relative selection.

6. On the left sidebar, click the PLUS (+) sign next to the blog name selection and choose the Relative Select command.

7. Using the Relative Select command, click on the first blog name and then the author. You will see an arrow connect the two selections. You should see something like this:

Let’s rename the relative selection to blog_author.

Since we don’t need the image URL let’s get rid of it. To do this you want to click on the expand button on the “relative blog_author” selection.

Now select the trash can beside “extract blog_author”

8. Repeat steps 6 and 7 to get the length of the blog; you won't need to delete the URL since we are extracting text. Let's name this selection “blog_length”.

It should look like this.

Since our blog is a scrolling page (scroll to load more) we will need to tell the software to scroll to get all the content.

If you were to run the project now you would only get the first few blogs extracted.

9. To do this, click on the PLUS (+) sign beside the page selection and choose the Select command. You will need to select the main element of the page; in this case, it will look like this.

10. Once you have the main div selected, you can add the Scroll function. To do this, on the left sidebar, click the PLUS (+) sign next to the main selection, click on Advanced, then select the Scroll function.

You will need to tell the software how many times to scroll; depending on how big the blog is, you may need a bigger number. But for now, let’s set it to 5 times and make sure it's aligned to the bottom.

If you still need help with the scroll option you can click here to learn more.

We will need to move the main scroll option above blog names. It should look like this now:

11. Now that we have everything we want extracted, we can let ParseHub do its magic. Click on the “Get data” button.

12. You’ll be taken to this page.

You can test your extraction to make sure it’s working properly. For bigger projects, we recommend doing a test run first. But for this project let's press “run” so ParseHub can extract the online data.

13. This project shouldn’t take too long. Once ParseHub is done extracting the data, you can download it and export it as a CSV/Excel file, JSON, or through an API. For this project, we just need a CSV/Excel file.

And there you have it! You’ve completed your first web scraping project. Pretty simple huh? But ParseHub can do so much more!

What else can you do with web scraping?

Now that we’ve scraped our blog (and movie titles, if you did the tutorial), you can try to implement web scraping in more of a business-related setting. Our mission is to help you make better decisions, and to make better decisions you need data.

ParseHub can help you make valuable decisions by doing efficient competitor research, brand monitoring and management, lead generation, finding investment opportunities and many more!

Whatever you choose to do with web scraping, ParseHub can help!

Check out our other blog posts on how you can use ParseHub to help grow your business. We’ve split our blog posts into different categories depending on what kind of information you're trying to extract and the purpose of your scraping.

Ecommerce website/ Competitor Analysis / Brand reputation

Lead Generation

Brand Monitoring and Investing Opportunities

Closing Thoughts

There are many ways web scraping can help with your business and every day many businesses are finding creative ways to use ParseHub to grow their business! Web scraping is a great way to collect the data you need, but can be a bit intimidating at first if you don’t know what you’re doing. That’s why we wanted to create this beginner's guide to web scraping to help you gain a better understanding of what it is, how it works, and how you can use web scraping for your business!

If you have any trouble, you can visit our help center or blog to help you navigate ParseHub, or contact support with any inquiries.

Learn more about web scraping

If you want to learn more about web scraping and elevate your skills, you can check out our free web scraping course! Once completed, you'll get a certification to show off your new skills and knowledge.

Happy Scraping!

Introduction To Web Scraping Using Python

Introduction

In this post, we'll cover how to scrape Newegg using python, lxml and requests. Python is a great language that anyone can pick up quickly and I believe it's also one of the more readable languages, where you can quickly scan the code to determine what it is doing.

Just look at this loop with auto incrementing index:
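For example, a loop built on Python's enumerate, which pairs each item with an auto-incrementing index:

    gpus = ["RTX 3080", "RTX 3090"]
    # enumerate yields (index, item) pairs, no manual counter needed
    for index, gpu in enumerate(gpus):
        print(index, gpu)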

We'll scrape Newegg with the use case of monitoring prices and inventory, especially the RTX 3080 and RTX 3090.

Setting up

We're going to work in a virtual python environment which helps us address dependencies and versions separately for each application / project. Let's create a virtual environment in our home directory and install the dependencies we need.

Make sure you are running at least python 3.6; 3.5 has reached end of support.
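Here's one way to set it up (the environment name is just a suggestion; the dependencies are the two libraries we use in this post):

    # create and activate a virtual environment, then install our dependencies
    python3 -m venv ~/scraping-env
    source ~/scraping-env/bin/activate
    pip install requests lxml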

Let's create the following folders and files.
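Based on the files referenced throughout this post, the project layout looks roughly like this (the __init__.py is our addition so core imports cleanly as a package):

    ~/intro-web-scraping/
        core/
            __init__.py
            crawler.py
            scraper.py
            utils.py
        newegg/
            __main__.py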

We created a __main__.py file; this lets us run the Newegg scraper by running the newegg package with the following command (nothing should happen right now):
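    # -m runs newegg/__main__.py; run it from the project root so core/ is importable
    cd ~/intro-web-scraping
    python -m newegg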

Crawling the content

We need to write code that can crawl the content, by crawl I mean fetch or download the HTML from the target website. Our first target is Newegg, this website doesn't seem to require javascript for the data we need. We'll get into rendering javascript in a future post that covers headless scraping using requests-html on Google Places.

Open core/crawler.py which we created earlier. Now, we'll begin by requesting the HTML content from Newegg's domain.
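A first pass can be as simple as a module-level request that just prints the status code; here's a minimal sketch using requests:

    # core/crawler.py
    import requests

    # fetch Newegg's home page and report the HTTP status code
    response = requests.get("https://www.newegg.com")
    print(response.status_code)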

In newegg/__main__.py we can import crawler and the code above will execute.
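Since the request in crawler.py runs at module level, importing the module is enough to trigger it:

    # newegg/__main__.py
    from core import crawler  # the import alone executes the request above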

Remember you can execute and test your code with the previous python command in your terminal (must be run in the root folder ~/intro-web-scraping).

It looks like the request succeeded; the status code should have been printed to your terminal with a success of 200. Let's clean up the code to make it reusable and define a function for returning the response text.

In core/crawler.py we'll define a crawl_html function (we want to reuse it and this lets us redefine where the HTML comes from in the future).
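A minimal version of that function looks like this:

    # core/crawler.py
    import requests

    def crawl_html(url):
        """Fetch the given URL and return the raw HTML as text."""
        response = requests.get(url)
        return response.text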

In newegg/__main__.py we'll use the function, you can run it and see the HTML being printed. We use an uppercased variable NEWEGG_URL to define a constant - something that shouldn't change.
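Something like this:

    # newegg/__main__.py
    from core.crawler import crawl_html

    NEWEGG_URL = "https://www.newegg.com"  # a constant: something that shouldn't change

    html = crawl_html(NEWEGG_URL)
    print(html)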

Scraping the data we need

Now that we have access to the HTML content from Newegg, we want a way to pull out stock information and prices for the RTX 3080 and RTX 3090. Let's find the page on Newegg that has that information first.

Navigate to https://www.newegg.com/p/pl?N=100007709%20601357282 in your browser and you'll see we have filters applied for RTX 30 series.

We'll take that path and append it to our NEWEGG_URL. We do this using f-strings in python, which is a way to interpolate variables in strings.
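For example (RTX_PATH is our name for the path):

    # newegg/__main__.py
    NEWEGG_URL = "https://www.newegg.com"
    RTX_PATH = "/p/pl?N=100007709%20601357282"  # the RTX 30 series filter from above

    # f-strings interpolate variables directly into the string
    crawl_url = f"{NEWEGG_URL}{RTX_PATH}"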

From this URL we can start scraping the data we need. Let's start by creating a few useful functions in the file core/scraper.py. These functions wrap lxml and handle some of the type conversions to make it easier for us to work with the data.
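A reasonable sketch of those wrappers (parse_html and get_text are our names for them):

    # core/scraper.py
    from lxml import html

    def parse_html(text):
        """Parse raw HTML text into an lxml element tree."""
        return html.fromstring(text)

    def get_text(tree, selector):
        """Return the stripped text content of every node matching an XPath selector."""
        return [node.text_content().strip() for node in tree.xpath(selector)]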

Finding the data

We'll first try to get the prices with XPath. I highly recommend you use XPath instead of CSS selectors, as it is much more declarative and expressive. You can use this simple cheat sheet to quickly find out how to specify selectors. A more in-depth guide can be found at librarycarpentry.

Open your chrome browser and visit the crawl url we defined earlier: https://www.newegg.com/p/pl?N=100007709%20601357282.

Press F12 on your keyboard or open the developer console by right-clicking one of the prices on the page and selecting inspect.

Using XPath

We'll use the inspector and practice our XPath to figure out how to get all prices on the page (there are 29 items listed). This selector: //li[contains(@class, 'price-current')] grabs all relevant prices.

With the selector in hand, let's modify our newegg/__main__.py entry file by adding a new function to grab the prices.
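Using the helpers from core/scraper.py, the entry file might now look like this:

    # newegg/__main__.py
    from core.crawler import crawl_html
    from core.scraper import parse_html, get_text

    NEWEGG_URL = "https://www.newegg.com"
    RTX_PATH = "/p/pl?N=100007709%20601357282"

    def get_rtx_prices(tree):
        """Grab the text of every current-price element on the listing page."""
        return get_text(tree, "//li[contains(@class, 'price-current')]")

    tree = parse_html(crawl_html(f"{NEWEGG_URL}{RTX_PATH}"))
    print(get_rtx_prices(tree))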

We should see a list of raw price strings printed, each with a stray HTML entity at the end.

Let's clean up this extra HTML entity appearing at the end of our prices with a utility function. We'll make use of re for regex and unescape from the html module to clean up our data. We need to check whether the input contains numbers in order to account for the COMING SOON labels. We'll keep this logic encapsulated in get_rtx_prices by mapping over each item and then converting the result back to a list (map returns an iterator).
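Here's a sketch of that cleanup (clean_price is our name for the utility, and placing it in core/utils.py is our assumption):

    # core/utils.py
    import re
    from html import unescape

    def clean_price(text):
        """Unescape HTML entities and strip stray characters from a price string."""
        text = unescape(text).strip()
        if not re.search(r"\d", text):
            # no digits means a label like COMING SOON; pass it through untouched
            return text
        # keep the currency symbol, digits, commas and decimal part
        match = re.search(r"\$?[\d,]+(?:\.\d+)?", text)
        return match.group(0) if match else text

    # back in newegg/__main__.py:
    from core.utils import clean_price

    def get_rtx_prices(tree):
        prices = get_text(tree, "//li[contains(@class, 'price-current')]")
        # map returns an iterator, so convert it back to a list
        return list(map(clean_price, prices))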

Let's grab the item names.
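On the listing page each title is an anchor element; a selector like the one below worked at the time of writing (the item-title class is an assumption, so verify it in the inspector):

    # newegg/__main__.py
    def get_rtx_names(tree):
        """Each product title lives in an <a class="item-title"> element."""
        return get_text(tree, "//a[contains(@class, 'item-title')]")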

We also want the link to the item.
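The link is just the href attribute on the same anchor, and XPath can return attributes directly:

    # newegg/__main__.py
    def get_rtx_links(tree):
        """@href returns the attribute value itself rather than an element."""
        return tree.xpath("//a[contains(@class, 'item-title')]/@href")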

More complex XPath

Next we want the stock information (out of stock or in stock). To do this we need to add another function called get_children_text to core/scraper.py. This will allow us to specify a parent selector and a child selector, which will return the first child that matches. If our parent selector has many matches it will try to find a matching child and if it does not find one it will return None. In our case we have many parent matches but some of them may not contain the OUT OF STOCK element.

In core/scraper.py add the new function.
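A sketch matching the behavior described above:

    # core/scraper.py
    def get_children_text(tree, parent_selector, child_selector):
        """For each parent match, return the first matching child's text, or None."""
        results = []
        for parent in tree.xpath(parent_selector):
            # child_selector should be relative, e.g. starting with .//
            children = parent.xpath(child_selector)
            results.append(children[0].text_content().strip() if children else None)
        return results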

Back in newegg/__main__.py we can add the stock selector.
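Something like the following, where both class names are assumptions about Newegg's markup that you should confirm in the inspector:

    # newegg/__main__.py
    from core.scraper import get_children_text

    def get_rtx_stock(tree):
        """Out-of-stock items carry a promo flag inside their item cell."""
        return get_children_text(
            tree,
            "//div[contains(@class, 'item-cell')]",   # one match per product
            ".//p[contains(@class, 'item-promo')]",   # present only when OUT OF STOCK
        )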

We also want the product id; having it can help us track changes to the product in the future. Here's how we can find the item id on the page.

To do this, we'll add another function to our scraper. Because we are using the text() function of XPath, we are asking for the text node directly, which ignores the neighboring strong label node in the tree.

Let's add get_nodes to our core/scraper.py module.
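get_nodes is the thinnest wrapper of the three; with a text() selector it returns bare text nodes (the item-stock selector below is a guess at the markup, so check it against the page):

    # core/scraper.py
    def get_nodes(tree, selector):
        """Return whatever the XPath matches: elements, attributes or text nodes."""
        return tree.xpath(selector)

    # back in newegg/__main__.py:
    def get_rtx_ids(tree):
        # text() asks for the text node itself, skipping the sibling <strong> label
        return [t.strip() for t in get_nodes(tree, "//div[contains(@class, 'item-stock')]/text()")]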

Our final output structure

Let's put it all together now to generate the final structure for our output which will contain basic stock information, price, product name, product id and product link.

This is what our newegg/__main__.py should look like now.
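Here's the whole entry file assembled from the pieces above (again, the selectors and helper names are our sketch of the original):

    # newegg/__main__.py
    from core.crawler import crawl_html
    from core.scraper import parse_html, get_text, get_children_text, get_nodes
    from core.utils import clean_price

    NEWEGG_URL = "https://www.newegg.com"
    RTX_PATH = "/p/pl?N=100007709%20601357282"

    def get_rtx_items(tree):
        """Combine the parallel lists into one record per product."""
        names = get_text(tree, "//a[contains(@class, 'item-title')]")
        links = tree.xpath("//a[contains(@class, 'item-title')]/@href")
        prices = list(map(clean_price, get_text(tree, "//li[contains(@class, 'price-current')]")))
        stock = get_children_text(
            tree,
            "//div[contains(@class, 'item-cell')]",
            ".//p[contains(@class, 'item-promo')]",
        )
        ids = [t.strip() for t in get_nodes(tree, "//div[contains(@class, 'item-stock')]/text()")]
        return [
            # items without a promo flag are assumed to be in stock
            {"id": i, "name": n, "price": p, "stock": s or "IN STOCK", "link": l}
            for i, n, p, s, l in zip(ids, names, prices, stock, links)
        ]

    tree = parse_html(crawl_html(f"{NEWEGG_URL}{RTX_PATH}"))
    items = get_rtx_items(tree)
    print(items)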

We've omitted some of the results for readability, but the output should total 29 products as of this post.

Saving our data

With our data in hand, we can quickly save it for analysis later - it's not hard to imagine what else is possible when you have the data you want. We could monitor the price changes of these items, their stock status or when new items are added.

Let's add two csv utility functions to our core/utils.py file. We will write one to transform our scraped output into proper csv lines and another to write the csv output.
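A sketch of the two helpers (the names are ours):

    # core/utils.py
    import csv

    def to_csv_rows(items):
        """Turn a list of dicts into a header row followed by one row per item."""
        header = list(items[0].keys())
        return [header] + [[item[key] for key in header] for item in items]

    def write_csv(path, rows):
        """Write the rows out as a csv file."""
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)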

We can use it in our newegg/__main__.py file and just save the output we receive from get_rtx_items. First import the utils at the top of the file.
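Assuming the helper names from our sketch above:

    # newegg/__main__.py
    from core.utils import clean_price, to_csv_rows, write_csv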

Now let's use our utility function at the bottom of our Newegg scraper to save the output and complete the full web scraping cycle - crawling, scraping and saving the output.
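And at the very bottom of the file (rtx_items.csv is just our pick for the filename):

    # newegg/__main__.py
    write_csv("rtx_items.csv", to_csv_rows(items))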

Checking the output

We can open the csv file to view the output which is saved in the folder we created at the beginning ~/intro-web-scraping.

Wrapping up

From this guide, we should have learned most of what I believe are the web scraping basics:

  1. Crawling content (using requests)
  2. Scraping relevant data (lxml and XPath)
  3. Saving the output (writing to a csv file)

What we didn't cover:

  1. Headers
  2. Proxies (residential, data center, tor)
  3. Headless browsers
  4. Bot detection (fingerprinting)
  5. Throttling
  6. Captcha (recaptcha, image based input)

In a future post, we will scrape a website which requires javascript rendering and we'll make use of the requests-html python library to render the page and execute javascript.

Hopefully you found this post enlightening; web scraping has some really creative use cases that are not so obvious. Till next time!