
Excerpted from Mining the Social Web, 2nd Edition (pages 212-216)

Part I. A Guided Tour of the Social Web

5. Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

5.2. Scraping, Parsing, and Crawling the Web

5.2.1. Breadth-First Search in Web Crawling

This section contains some detailed content and analysis about how web crawls can be implemented and is not essential to your understanding of the content in this chapter (although you will likely find it interesting and edifying). If this is your first reading of the chapter, feel free to save it for next time.

The basic algorithm for a web crawl can be framed as a breadth-first search, which is a fundamental technique for exploring a space that’s typically modeled as a tree or a graph given a starting node and no other known information except a set of possibilities. In our web crawl scenario, our starting node would be the initial web page and the set of neighboring nodes would be the other pages that are hyperlinked.

There are alternative ways to search the space, with a depth-first search being a common alternative to a breadth-first search. The particular choice of one technique versus another often depends on available computing resources, specific domain knowledge, and even theoretical considerations. A breadth-first search is a reasonable approach for exploring a sliver of the Web. Example 5-3 presents some pseudocode that illustrates how it works, and Figure 5-1 provides some visual insight into how the search would look if you were to draw it out on the back of a napkin.

Example 5-3. Pseudocode for a breadth-first search

Create an empty graph
Create an empty queue to keep track of nodes that need to be processed
Add the starting point to the graph as the root node
Add the root node to the queue for processing


Repeat until some maximum depth is reached or the queue is empty:
    Remove a node from the queue
    For each of the node's neighbors:
        If the neighbor hasn't already been processed:
            Add it to the queue
            Add it to the graph
            Create an edge in the graph that connects the node and its neighbor
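The pseudocode in Example 5-3 can be sketched in a few lines of Python. The function and parameter names here (breadth_first_crawl, get_neighbors) are illustrative, not from the book's sample code; in a real crawl, get_neighbors would fetch a page and return the URLs it links to.

```python
from collections import deque

def breadth_first_crawl(root, get_neighbors, max_depth):
    """Build a graph by breadth-first expansion from a root node.

    get_neighbors is caller-supplied; in a real crawl it would fetch
    a page and return the URLs it links to. The graph is returned as
    an adjacency dict mapping each node to its discovered neighbors.
    """
    graph = {root: []}                 # the root node starts the graph
    queue = deque([(root, 0)])         # (node, depth) pairs to process
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:         # stop expanding at the depth bound
            continue
        for neighbor in get_neighbors(node):
            if neighbor not in graph:  # neighbor not already processed
                graph[neighbor] = []
                queue.append((neighbor, depth + 1))
                graph[node].append(neighbor)  # edge: node -> neighbor
    return graph

# Toy usage: every node expands into two synthetic children.
g = breadth_first_crawl("x", lambda n: [n + "0", n + "1"], max_depth=2)
print(len(g))  # 7 nodes: the root, 2 at depth 1, and 4 at depth 2
```

The deque gives O(1) removal from the front of the queue, which is what makes the traversal breadth-first: nodes are processed in the order they were discovered, one full depth level at a time.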

We generally haven’t taken quite this long of a pause to analyze an approach, but breadth-first search is a fundamental tool you’ll want to have in your belt and understand well.

In general, there are two criteria you should always consider when examining an algorithm: efficiency and effectiveness (or, to put it another way: performance and quality).

Standard performance analysis of any algorithm generally involves examining its worst-case time and space complexity: in other words, the amount of time it would take the program to execute, and the amount of memory required for execution over a very large data set. The breadth-first approach we’ve used to frame a web crawl is essentially a breadth-first search, except that we’re not actually searching for anything in particular because there are no exit criteria beyond expanding the graph out either to a maximum depth or until we run out of nodes. If we were searching for something specific instead of just crawling links indefinitely, that would be considered an actual breadth-first search. Thus, a more common variation of a breadth-first search is called a bounded breadth-first search, which imposes a limit on the maximum depth of the search just as we do in this example.

For a breadth-first search (or breadth-first crawl), both the time and space complexity can be bounded in the worst case by b^d, where b is the branching factor of the graph and d is the depth. If you sketch out an example on paper, as in Figure 5-1, and think about it, this analysis quickly becomes more apparent.


Figure 5-1. In a breadth-first search, each step of the search expands the depth by one level until a maximum depth or some other termination criterion is reached


If every node in a graph had five neighbors, and you only went out to a depth of one, you’d end up with six nodes in all: the root node and its five neighbors. If all five of those neighbors had five neighbors too and you expanded out another level, you’d end up with 31 nodes in all: the root node, the root node’s five neighbors, and five neighbors for each of the root node’s neighbors. Table 5-1 provides an overview of how b^d grows for a few sizes of b and d.

Table 5-1. Example branching factor calculations for graphs of varying depths

Branching factor | Nodes for depth = 1 | Nodes for depth = 2 | Nodes for depth = 3 | Nodes for depth = 4 | Nodes for depth = 5
2                | 3                   | 7                   | 15                  | 31                  | 63
3                | 4                   | 13                  | 40                  | 121                 | 364
4                | 5                   | 21                  | 85                  | 341                 | 1,365
5                | 6                   | 31                  | 156                 | 781                 | 3,906
6                | 7                   | 43                  | 259                 | 1,555               | 9,331
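As a sanity check, the rows of Table 5-1 can be reproduced with a few lines of Python (total_nodes is a name invented here, not from the book): each entry is the geometric series 1 + b + b^2 + ... + b^d.

```python
def total_nodes(b, d):
    """Total nodes in a complete tree with branching factor b expanded
    to depth d: 1 + b + b**2 + ... + b**d."""
    return sum(b**i for i in range(d + 1))

# Reproduce the rows of Table 5-1.
for b in range(2, 7):
    print(b, [total_nodes(b, d) for d in range(1, 6)])
```

The sum has the closed form (b^(d+1) - 1) / (b - 1), which is dominated by its last term; that is why the b^d bound on the worst case is a reasonable summary of the growth.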

Figure 5-2 provides a visual for the values displayed in Table 5-1.

Figure 5-2. The growth in the number of nodes as the depth of a breadth-first search increases



While the previous comments pertain primarily to the theoretical bounds of the algorithm, one final consideration worth noting is the practical performance of the algorithm for a data set of a fixed size. Mild profiling of a breadth-first implementation that fetches web pages would likely reveal that the code is primarily I/O bound, in the sense that the vast majority of time is spent waiting for a library call to return content to be processed. In situations in which you are I/O bound, a thread pool is a common technique for increasing performance.
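As a rough illustration of why a thread pool helps when a crawler is I/O bound, the sketch below simulates slow page fetches with time.sleep; the fetch function, the URLs, and the placeholder page content are all hypothetical stand-ins for real HTTP requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for fetching a page: sleeps to simulate waiting on the
    network, which is where an I/O-bound crawler spends its time."""
    time.sleep(0.2)                    # simulated network latency
    return url, "<html>...</html>"     # placeholder page content

urls = ["http://example.com/page%d" % i for i in range(8)]

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(fetch, urls))  # fetches run concurrently
elapsed = time.time() - start

# With 8 workers the 0.2s waits overlap, so the total is close to 0.2s
# rather than the roughly 1.6s a sequential loop would take.
print(len(results))
```

Because the threads spend nearly all their time blocked on I/O rather than computing, Python's global interpreter lock is not a bottleneck here, which is exactly why a thread pool is the common remedy for I/O-bound workloads.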
