In the previous chapter, we saw how to use the web inspector to intercept raw data files. This allows us to read from them directly rather than deal with the data in HTML format.
But there will be many instances when you'll need to parse raw HTML. The Ruby gem Nokogiri makes reading raw HTML nearly as easy as working with pre-parsed XML and JSON.
Nokogiri
The Nokogiri gem is a fantastic library that serves virtually all of our HTML scraping needs. Once you have it installed, you will likely use it for the remainder of your web-crawling career.
Installing Nokogiri
Unfortunately, it can be a pain to install because it has various other dependencies, libxml2 among them, that may or may not have been correctly installed on your system.
Follow the official Nokogiri installation guide here.
Hopefully, this step is as painless as typing `gem install nokogiri`. If not, start Googling for the error message that you're getting. In a later edition of this book, I'll try to go into more detail on the installation process. But for now, I'm just going to wish you godspeed on this task.
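A successful install looks something like this (the version and platform will vary):

```
$ gem install nokogiri
Fetching nokogiri-1.11.0-x86_64-linux.gem
Successfully installed nokogiri-1.11.0-x86_64-linux
1 gem installed
```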
For the remainder of this section, assume that the first two lines of every script are:
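```ruby
require 'nokogiri'
require 'open-uri'
```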
Opening a page with Nokogiri and open-uri
Passing the contents of a webpage to the Nokogiri parser is not much different from opening a regular text file.
If the webpage is stored as a file on your hard drive, you can pass it in like so:
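```ruby
# the filename here is hypothetical
page = Nokogiri::HTML(open("some-webpage.html"))
```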
The Nokogiri::HTML constructor takes in the opened file's contents and wraps it in a special Nokogiri data object.
The open-uri module
If the webpage is live on a remote site, like http://en.wikipedia.org/, then you'll want to include the open-uri module, which is part of the standard Ruby distribution but must be explicitly required:
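```ruby
require 'open-uri'
# on newer Rubies (3.0+), use URI.open instead of open
page = Nokogiri::HTML(open("http://en.wikipedia.org/"))
```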
What open-uri does for us is encapsulate all the work of making an HTTP request into the open method, making the operation as simple as opening a file on our own hard drive.
Using rest-client
You can also use the RestClient gem as we've done before. All the Nokogiri::HTML constructor needs is raw HTML as a string.
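A quick sketch, assuming the rest-client gem is installed:

```ruby
require 'rest-client'
require 'nokogiri'

# RestClient.get returns the response body, which Nokogiri parses as a string
page = Nokogiri::HTML(RestClient.get("http://en.wikipedia.org/"))
```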
Nokogiri and CSS selectors
CSS – Cascading Style Sheets – are how web designers define the look of a group of HTML elements. It has its own syntax but can be mixed in with HTML (the typical use case, though, is to load CSS files externally from the HTML, so that web designers can work on the CSS separately).
Without CSS, this is how you would make all the <a> elements (i.e. the links) the color red on a given webpage:
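```html
<!-- without CSS, every link must be styled one at a time with an inline style attribute -->
You can <a href="http://www.apple.com/" style="color: red;">click here</a> to get to Apple's website.
<a href="http://www.microsoft.com/" style="color: red;">Click here</a> to get to Microsoft.
Or you can visit <a href="http://www.w3.org/" style="color: red;">w3.org</a> to learn more about the World Wide Web.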
This is the resulting effect:
You can click here to get to Apple's website. Click here to get to Microsoft.
Or you can visit w3.org to learn more about the World Wide Web.
The use of CSS allows designers to apply a style across a group of elements, thus eliminating the need to define the styles of every HTML element. This example shows how a CSS selector targets all <a> elements in a single line:
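```css
/* one rule makes every <a> element on the page red */
a { color: red; }
```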
What do style tags have to do with web scraping? Nokogiri's css method allows us to target individual elements or groups of HTML elements using CSS selectors. No worries if you're not an expert on CSS. It's enough to recognize the basic syntax.
We'll be working with this simple example webpage. If you view its source, you'll see this markup:
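Here's a hypothetical reconstruction of that markup, based on the selectors used in the rest of this chapter:

```html
<html>
  <head>
    <title>My webpage</title>
  </head>
  <body>
    <div id="funstuff">
      <ul>
        <li><a href="http://example.com/fun">A fun link</a></li>
        <li data-category="news">
          <a data-category="news" href="http://example.com/news-1"><strong>Breaking</strong> news item</a>
        </li>
        <li data-category="news">
          <a data-category="news" href="http://example.com/news-2">Another news item</a>
        </li>
      </ul>
    </div>
    <div id="references">
      <p><a href="http://en.wikipedia.org/">Wikipedia</a></p>
      <p><a href="http://www.w3.org/">W3C</a></p>
    </div>
  </body>
</html>
```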
A table of syntax
Here's a convenient table that shows all the syntax I'll cover in this section. Each row pairs the description of a selection with the syntax to do it.
Assume that this code has been run before each of the syntax calls:
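```ruby
require 'nokogiri'
require 'open-uri'

# the URL of the sample page is hypothetical
page = Nokogiri::HTML(open("http://example.com/sample-page.html"))
```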
| Description | Syntax |
| --- | --- |
| The `<title>` element | `page.css('title')` |
| All `<li>` elements | `page.css('li')` |
| The text of the first `<li>` element | `page.css('li')[0].text` |
| The URL of the second `<li>` element's link | `page.css('li')[1].css('a')[0]['href']` |
| The `<li>` elements with a `data-category` of `news` | `page.css("li[data-category='news']")` |
| The `<div>` element with an id of `funstuff` | `page.css('div#funstuff')` |
| The `<a>` elements nested inside the `<div>` that has an id of `references` | `page.css('div#references a')` |
The rest of this chapter explains the selectors in a little more detail. But feel free to refer back to this table. Knowing the CSS selectors is just a matter of a little memorization.
Selecting an element
Simply pass the name of the element you want into the Nokogiri document object's css method:
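```ruby
page.css('title')
#=> a NodeSet containing one Nokogiri::XML::Element, not the string "My webpage"
```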
The css method does not return the text of the target element, i.e. 'My webpage'. It returns an array – more specifically, a Nokogiri data object that is a collection of Nokogiri::XML::Element objects. These Element objects have a variety of methods, including text, which does return the text contained in the element:
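```ruby
page.css('title')[0].text   #=> "My webpage"
page.css('title')[0].name   #=> "title"
```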
The name method simply returns the name of the element, which we already know since we specified it in the css call: 'title'.
Note that even though there is only one title element, the method returns it as an array of one element, so we still need to specify the first element using array notation.
Get an attribute of an element
One of the most common web-scraping tasks is extracting URL's from links, i.e. anchor tags: <a>. The attributes of an element are provided in Hash form:
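```ruby
first_link = page.css('a')[0]
first_link['href']      #=> "http://example.com/fun" (from the hypothetical sample above)
first_link.attributes   #=> a Hash mapping attribute names to their attribute objects
```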
Here's what that first anchor tag looks like in markup form:
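```html
<!-- from the hypothetical sample markup above -->
<a href="http://example.com/fun">A fun link</a>
```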
Limiting selectors
You'll often want to limit the scope of what Nokogiri grabs. For example, you may not want all the links on a given page, especially if many of those links are just sidebar navigation links.
Using select for a collection
In some cases, this can be done by combining the Nokogiri parser results with the Enumerable select method. If you noticed in the sample HTML code, there are two anchor tags with data-category attributes equal to 'news'. This is how we would use select to grab only those:
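```ruby
news_links = page.css('a').select { |link| link['data-category'] == 'news' }
news_links.length   #=> 2
```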
Select elements by attributes
So select works. But the ideal solution is to not have Nokogiri's css method pull in unwanted elements in the first place. And that requires a little more knowledge of CSS selectors.
For the above example, we can use CSS selectors to specify attribute values. I won't go into detail here, but suffice it to say, it requires a tad more memorization on your part:
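```ruby
news_links = page.css("a[data-category='news']")   # only the news anchors
news_links.class                                    #=> Nokogiri::XML::NodeSet, not Array
```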
The last line above demonstrates one advantage of doing the filtering with the css method and CSS selectors rather than Array.select: you don't have to convert the Nokogiri NodeSet object into an Array. Keeping it as a NodeSet allows you to keep calling...well...more NodeSet-specific methods. The following code calls css twice – once to gather the anchor links and then to gather any bolded elements (which use the <strong> tag) that are within those links:
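```ruby
page.css("a[data-category='news']").css('strong')
```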
Again, this will only target <strong> elements within <a> tags that have a data-category attribute set to 'news' (whew, that was a mouthful). Any other <strong> elements that aren't children of such anchor tags won't be selected.
The id and class attributes

In our sample HTML, there are two div tags, each with its own id attribute.
The class and id attributes are the most common way of specifying individual and groups of elements. There's nothing inherently special about them; they're just commonly accepted attributes that web designers use to label HTML elements.
- id
- Only one element on a page should have a given id attribute (though this rule is broken all the time). The CSS selector for an id is a pound sign (#) followed by the id's name, as shown in the sketch after this list.
- class
- The main difference between id and class is that many elements can have the same class. The CSS selector for a class is a period (.) followed by the class's name, also shown below.
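A quick sketch of both, using the hypothetical sample page (the `.news` class here is illustrative, not from the sample):

```ruby
page.css('#funstuff')   # the single element with id="funstuff"
page.css('.news')       # every element whose class attribute includes "news"
```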
Nested elements
Rather than call css twice, as in this example:
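```ruby
page.css('div#references').css('a')
```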
You can refer to nested elements with a single CSS selector:
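```ruby
page.css('div#references a')
```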
To specify elements within another element, separate the element names with a space. For example, the following selector would select all image tags that are within anchor tags:
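```ruby
page.css('a img')
```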
To select all image tags within anchor tags that themselves are within div tags, you would do this:
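```ruby
page.css('div a img')
```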
In the hypothetical sample below, I've marked in comments which elements the above selector would grab:
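```html
<div>
  <a href="http://example.com/"><img src="photo.jpg"></a>   <!-- selected by 'div a img' -->
  <img src="banner.jpg">                                    <!-- not selected: not inside an <a> -->
</div>
<a href="http://example.com/"><img src="logo.jpg"></a>      <!-- not selected: not inside a <div> -->
```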
Exercise: Select nested CSS elements
Referring back to our sample HTML, write a selector that chooses only the anchor tags in the div that has the id of 'references'. Print out the text within the anchor tag followed by its URL.
Solution
Given the sample HTML, my CSS selector could have been div#references p a since all the anchor tags were also within paragraph tags. I could've also done #references a, as there is only one element with that particular id.
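A sketch of the solution, using the hypothetical sample markup from earlier:

```ruby
page.css('div#references a').each do |link|
  puts "#{link.text}: #{link['href']}"
end
```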
XPath selectors
Nokogiri's css method will serve most of your needs. For webpages that require more precise selectors, you can jump into the world of XPath syntax and utilize Nokogiri's xpath method. XPath can handle certain kinds of selections more gracefully than CSS selectors.
I'll add working examples to this section in a later update. XPath is one of those mini-languages that, like regular expressions, are very useful for a specific purpose and can be learned on a need-to-know basis.
W3.org has a thorough reference. w3schools.com has a solid tutorial. jQuery creator John Resig wrote a short comparison of CSS and XPath.
Nokogiri and your web inspector
So how does the web inspector come into play? It makes it easy to find the right CSS selector.
Take a look at the sample webpage again. Inspect one of the links.
Notice the CSS selectors listed below the tabs (in Chrome/Safari, it's along the bottom of the panel). In Firebug, you can simply right-click and pick Copy CSS Path, and you have yourself the CSS selector for that link.
Exercise: Print out Wikipedia summary labels
Visit the Wikipedia entry for HTML: http://en.wikipedia.org/wiki/HTML
And highlight one of the labels in the top-right summary box, such as 'Filename extension'.
Use your web inspector to find the CSS selector for all the labels in the summary box. Use Nokogiri to select the elements and print out their text content.
You may find Firebug's inspector to be more useful as it allows you to right-click on the CSS selector listing and copy it to the clipboard directly.
Solution
Using the web inspector, we see that the CSS selector for the category labels is:
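It looks something like this (a hypothetical reconstruction; the exact path will vary by browser and page version):

```
html.client-firefox body div#content table.infobox tbody tr th
```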
However, this is not quite accurate for our needs. With <table> elements, browsers will often insert a <tbody> element around the table content, even if such an element does not exist in the actual source. Also, the '.client-firefox' class most likely appears only when you visit the site with Firefox, which doesn't apply when we retrieve the page using a Ruby script.
In fact, there's usually no reason to include the html or body CSS selectors, as it's a given that most content you want is between those tags.
So, using the CSS selector from the web inspector – but omitting the html, body, and tbody parts of it – we get:
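```ruby
# a sketch: this assumes the labels are the <th> cells of Wikipedia's infobox table
page = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/HTML'))
page.css('table.infobox tr th').each do |label|
  puts label.text
end
```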
The output is the list of the summary box's labels, such as 'Filename extension'.
As easy as it looks
There's not much more to scraping HTML with Nokogiri. The web inspector helps guide you to the right CSS selectors. Nokogiri's css method does the rest of the work. In the next chapter, we'll see how to put all of what we've learned into practice by scraping real-world websites.
Chances are you know a bit about screen scraping and already have an opinion on it, but if you don’t here is a quick summary:
Here is what you need to know about screen scraping. Screen scraping is taking something you can see on your computer, typically in a browser, and making it accessible inside your code so you can store it or do some sort of operation on it.
Some people love it, some people hate it. Oftentimes both groups feel this way for the same reasons. It is an equally loved and equally reviled art and science. Some love it because it seems to be more of an art, as it can take a lot of creativity in pattern generating. Others dislike it for this same reason.
Screen scraping makes you dependent on code outside of your control. This can happen in other programming situations, but many of those changes are made to avoid upsetting people (i.e. rational versioning). This isn't the case with screen scraping: you're taking someone's code, which most likely was only intended to be viewed in a browser and absorbed by human beings, and feeding it to a computer.
Why do such a thing? Because sometimes you have to. Sometimes there isn’t a better way. Sometimes screen scraping is the better way. It all depends on your situation. Basically you’ll screen scrape any time that you need data that is viewable by a human being, but hasn’t yet been formatted or delivered in a way that a computer might like.
Once upon a time this used to mean stocks and banking data inside of terminals. This can still mean that today, but primarily when you hear screen scraping we’re talking about the web which means that chances are we’re really talking about HTML.
Install
We're going to use Nokogiri, a Ruby gem (also a Japanese saw), to help us parse the HTML (or XML). You may have heard that it's hard to install if you've used it before, but that's not really true anymore. The header of the Nokogiri website contains the install command: `sudo gem install nokogiri`. If you need more help than that, check out their installation guide.
Nokogiri will let you get your HTML or XML from pretty much anywhere you'd like. You can get your data from a file, from a string, from stdin, or most likely from the web. That's what we'll be doing; like normal people, we get our beer from the store.
Once you give Nokogiri data you have to tell it what to do with it. That is, you have to tell it which nodes you want and what you want done with them. Nokogiri will let you edit documents, which means you can add or delete nodes, but we're going to stick with grabbing data out of them for now.
You can communicate with Nokogiri in a few different ways. One is with XPath, the other with CSS selectors. Be warned that Nokogiri doesn't always speak CSS selectors as well as you can. Also, XPath is more powerful than CSS selectors, but can also be more complicated. As always, which you use is up to you, as there is no "best" answer.
First we’ll take a look at doing things the XPath way and then we’ll look at CSS selectors. Like your year of foreign language in college, we’re going to work with immersion. That is, we’ll develop the XPath together with a real example, but we’re not going to look at a tutorial or table or other stuff that you probably don’t really need to know right now.
If you disagree with me, you're always free to check out some XPath tutorials. Don't worry, I'll link to them later.
Finding Caffeine
I’m a caffeine addict. This has only gotten worse over the years. I’ve resigned myself to this and decided that I need new and interesting caffeine delivery systems, preferably at a good price point.
This of course has led me to ThinkGeek. They have all kinds of edibles and goodies and fun stuff. They're also a perfect example of why you might want to screen scrape. They've got a few RSS feeds, but none of them tell us exactly what we need to know.
So we’re going to get a list of all the items in the caffeine and edibles category on the site and display them in a terminal with pricing info.
You may want to check out the page before we start, but you don't have to. This is a good time to talk a little more about the downsides to screen scraping, namely that you're counting on the page not to change, or at least not to change during the time that you need to get the data.
Whether this applies to you is a toss-up. Many big sites don't change often; it's up to you to take that into account when you decide if screen scraping is the way to go. As an aside, in the time that I was writing and editing this article, the site we're scraping changed slightly, causing me to change the XPath query we're going to use, but not by much. Not to worry though, I'll show you what I'm working off of at the time of this writing.
Getting Items
To begin with, we're going to grab the names of all the items. This is important because, believe it or not, there are days where I don't want to just order random chemicals or food to stuff in my face.
So here’s what an item “looks” like in the HTML:
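```html
<!-- a hypothetical reconstruction of one item's markup, matching the patterns described below -->
<div class="product">
  <a href="/product/caffeine/abc1/"><h4>Caffeinated Candy</h4></a>
  <p>$9.99</p>
</div>
```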
We're going to look for repeated patterns and develop a rule set that is as consistent as possible, and then use Nokogiri to apply that rule set to the HTML. If that sounded confusing, don't worry; it just means we'll make an XPath and then give it to Nokogiri.

For example, in screen scraping we could be looking at things like: "the second link in every paragraph" or even "all of the bold text that is not in a table". Things like that. Once we've found our pattern, we'll translate it into XPath for Nokogiri to act on.
ThinkGeek has made it pretty easy on us, with the `div` class of `product`. Big pattern giveaway there. This is something to keep in mind: people designing the sites you want to scrape often need things organized in the same or similar manner as you do.

Continuing down the tree, we see that all of the products are links. Since we're trying to develop the most accurate patterns, we can check that all of the product names are in `h4` tags inside of those links. Going through the code you'll see that this is always the case. So far so good, sounds pretty specific.
The next step in developing any pattern is to look for what could break it. Call it an outlier or a boundary condition, we’re just hunting for things we left out, or things we’re catching that we didn’t want to.

Here's a good one. Every product that is the last item in a row has a different class:
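```html
<!-- hypothetical: the last item in each row carries an extra class -->
<div class="product last">
  <a href="/product/caffeine/xyz9/"><h4>Energy Mints</h4></a>
  <p>$4.99</p>
</div>
```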
This means in order to get the name of the products, we'd say:

English: Starting at the root of the document, look in every `div` that has a class name containing the word 'product'. Inside that, find a link. In that link, find the `h4` text.

XPath: `//div[contains(@class,'product')]/a/h4`
Why the `contains` in there? The XPath equality operator only matches complete values, in this case a string. XPath only matches whole class names, so `div[@class='product']` would not match the last item in each row (whose class is `product last`) as you might expect.
As with most things in programming there is of course more than one way to do it. XPath allows us to be verbose and very specific as we just demonstrated, but that doesn't always mean we need to be. It's possible to say exactly what we mean without using too many words.
Now that we've developed the verbose pattern, we can review the code and our statement and realize that there is no time where an `h4` tag shows up inside of a link inside of a `div` that isn't a product name. That means less chance for ambiguity, which means easier pattern recognition and easier screen scraping.

This means, keeping the same English, we could say:

English: Starting at the root of the document, look in every `div` that has a class name containing the word 'product'. Inside that, find a link. In that link, find the `h4` text.

XPath: `//div/a/h4`
Getting Prices
Now that we've retrieved all the product names, it's time to get the prices. Here's that example item again:
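```html
<!-- the same hypothetical item from before -->
<div class="product">
  <a href="/product/caffeine/abc1/"><h4>Caffeinated Candy</h4></a>
  <p>$9.99</p>
</div>
```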
We can use some of what we learned already. We see that we're still going to be looking inside that `div`, but in this case pricing information seems to be contained in `p` tags.
Let's look again for anything that will break our pattern. Checking the items on sale is a good start:
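```html
<!-- hypothetical: a sale item adds a styled, struck-through original price -->
<div class="product">
  <a href="/product/caffeine/abc1/"><h4>Caffeinated Candy</h4></a>
  <p style="text-decoration: line-through;">$12.99</p>
  <p>$9.99</p>
</div>
```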
Let's also check things that are out of stock:
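```html
<!-- hypothetical: an out-of-stock item adds a styled notice -->
<div class="product">
  <a href="/product/caffeine/def2/"><h4>Buzz Cola</h4></a>
  <p>$3.99</p>
  <p style="color: red;">Currently out of stock</p>
</div>
```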
This makes things a little trickier: we still have our `div` classes that we can use, but we can't say that every `p` inside of a `div` is going to give us what we need. This is especially true with styling information or times when an item is out of stock.
That doesn’t mean that we can’t find a pattern though. This is a good time to review what we know:
- Every item's price is contained as a child of a `div` whose class contains the word `product`.
- Prices are contained in paragraph tags.
- Not all paragraph tags that are children of the `div` contain the price.
- Tags that we don't want have a `style` attribute.
English: Starting at the root of the document, take all the `div`s whose class contains the word `product` and get the text that is contained inside the paragraph tag that doesn't have a style attribute.

XPath: `//div[contains(@class,'product')]/p[not(@style)]/text()`
The XPath is a bit different this time. Each time we've used XPath we've been after text (as opposed to `id` or `class` information), but only now are we using `text()`. `text()` is a "node test", as it's called in XPath lingo, that allows you to match, well, text nodes only. We're also using the XPath operator `not` to eliminate tags that have style attributes.
For this example, I purposefully chose an XPath that parses the price regardless of stock. How much an item runs for is more broadly useful than whether a specific site has it in stock.
Now that we know how to say what we want in XPath, we still need to work with Nokogiri. While you can require only Nokogiri, this doesn't make a whole lot of sense if you're trying to get HTML or XML from the web like we are, so we're going to require `nokogiri` and `open-uri`:
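```ruby
require 'nokogiri'
require 'open-uri'
```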
Next, we need to tell Nokogiri where to get our document. We'll use the `Nokogiri::HTML` module to do that:
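```ruby
# the category URL is hypothetical; ThinkGeek's URLs may have changed
doc = Nokogiri::HTML(open('http://www.thinkgeek.com/interests/caffeine/'))
```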
Now we need to tell Nokogiri what part of the document we want, starting with item names. We'll do that by using the `xpath` method, which checks each node against the XPath query:
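```ruby
names = doc.xpath("//div[contains(@class,'product')]/a/h4")
```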
Once the data is obtained, how it's used will vary from project to project, of course, but let's take a look at a typical example: storing and displaying.
To store the names in an items array:
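```ruby
items = doc.xpath("//div[contains(@class,'product')]/a/h4").map { |node| node.text.strip }
```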
and again with the prices in their own array:
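```ruby
prices = doc.xpath("//div[contains(@class,'product')]/p[not(@style)]/text()").map { |node| node.text.strip }
prices.delete('')
```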
We use `prices.delete('')` because some of the nodes will be blank. This is another thing to consider when screen scraping: not all the data will be in the format you need; sometimes it needs to be massaged a bit.
So to put it all together, we come up with something like:
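```ruby
require 'nokogiri'
require 'open-uri'

# hypothetical category URL; the page's markup may have changed since this was written
url = 'http://www.thinkgeek.com/interests/caffeine/'
doc = Nokogiri::HTML(open(url))

items  = doc.xpath("//div[contains(@class,'product')]/a/h4").map { |n| n.text.strip }
prices = doc.xpath("//div[contains(@class,'product')]/p[not(@style)]/text()").map { |n| n.text.strip }
prices.delete('')

# pair each item name with its price and print them
items.zip(prices).each do |name, price|
  puts "#{name}: #{price}"
end
```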
Resources
Want to know everything about XPath?
Here is a tutorial.
Looking for lots of data to play with? The US Gov't has all kinds of stuff if you're into that kind of thing.
Have a question or a story about your best use of screen scraping? Share it in the comments, I’d love to hear it!