Mining the Web - Discovering Knowledge from Hypertext Data

by Soumen Chakrabarti

Elsevier Textbooks, 2002

ISBN: 9780080511726, 344 pages


CHAPTER 2 CRAWLING THE WEB

The World Wide Web, or the Web for short, is a collection of billions of documents written in a way that enables them to cite each other using hyperlinks, which is why they are a form of hypertext. These documents, or Web pages, are typically a few thousand characters long, written in a diversity of languages, and cover essentially all topics of human endeavor. Web pages are served through the Internet using the hypertext transfer protocol (HTTP) to client computers, where they can be viewed using browsers. HTTP is built on top of the transmission control protocol (TCP), which provides reliable delivery of data streams from one computer to another across the Internet.

Throughout this book, we shall study how automatic programs can analyze hypertext documents and the networks induced by the hyperlinks that connect them. To do so, it is usually necessary to fetch the pages to the computer where those programs will be run. This is the job of a crawler (also called a spider, robot, or bot). In this chapter we will study in detail how crawlers work. If you are more interested in how pages are indexed and analyzed, you can skip this chapter with hardly any loss of continuity.

I will assume that you have basic familiarity with computer networking using TCP, to the extent of writing code to open and close sockets and read and write data using a socket. We will focus on the organization of large-scale crawlers, which must handle millions of servers and billions of pages.

2.1 HTML and HTTP Basics


Web pages are written in a tagged markup language called the hypertext markup language (HTML). HTML lets the author specify layout and typeface, embed diagrams, and create hyperlinks. A hyperlink is expressed as an anchor tag with an href attribute, which names another page using a uniform resource locator (URL), like this:
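(A minimal sketch; the anchor text is illustrative, and the href carries the URL dissected below.)

    <a href="http://www.cse.iitb.ac.in/">The IIT Bombay CSE department home page</a>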

In its simplest form, the target URL contains a protocol field (http), a server hostname (www.cse.iitb.ac.in), and a file path (/, the “root” of the published file system).
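
For instance, Python's standard urllib can pull these fields apart; this is a sketch, not anything the book prescribes:

    from urllib.parse import urlsplit

    parts = urlsplit("http://www.cse.iitb.ac.in/")
    print(parts.scheme)      # 'http'               : the protocol field
    print(parts.hostname)    # 'www.cse.iitb.ac.in' : the server hostname
    print(parts.path)        # '/'                  : the file path, the "root"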

A Web browser such as Netscape Communicator or Internet Explorer will let the reader click the computer mouse on the hyperlink. The click is translated transparently by the browser into a network request to fetch the target page using HTTP.

A browser will fetch and display a Web page given a complete URL like the one above, but to reveal the underlying network protocol, we will (ab)use the telnet command available on UNIX machines, as shown in Figure 2.1. First the telnet client (as would any Web browser) has to resolve the server hostname www.cse.iitb.ac.in to an Internet address of the form 144.16.111.14 (called an IP address, IP standing for Internet protocol) in order to contact the server using TCP. The mapping from name to address is done using the Domain Name Service (DNS), a distributed database of name-to-IP mappings maintained at known servers [202]. Next, the client connects to port 80, the default HTTP port, on the server. The underlined text in the figure is entered by the user (a Web browser supplies this transparently). The slanted text is called the MIME header. (MIME stands for multipurpose Internet mail extensions, and is a metadata standard for email and Web content transfer.) The ends of the request and response headers are indicated by the sequence CR-LF-CR-LF, a double newline, shown as blank lines in the figure.
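
The same exchange can be driven from a short program instead of telnet. The following minimal sketch (Python, not from the book) assumes the host above is reachable and performs the same steps: DNS lookup, TCP connection to port 80, an HTTP/1.0 request, and reading the MIME header along with the page:

    import socket

    host = "www.cse.iitb.ac.in"                  # hostname from the example URL
    ip = socket.gethostbyname(host)              # DNS lookup: hostname -> IP address

    sock = socket.create_connection((ip, 80))    # TCP connection to the default HTTP port
    request = ("GET / HTTP/1.0\r\n"              # request line: fetch the root path "/"
               "Host: " + host + "\r\n"
               "\r\n")                           # CR-LF-CR-LF ends the request header
    sock.sendall(request.encode("ascii"))

    chunks = []
    while True:                                  # read until the server closes the connection
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()

    response = b"".join(chunks)
    header, _, body = response.partition(b"\r\n\r\n")   # the MIME header ends at the blank line
    print(header.decode("latin-1"))                     # status line and response headers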

Browsing is a useful but restrictive means of finding information. Given a page with many links to follow, it would be slow and painstaking to explore them in search of a specific information need. A better option is to index all the text so that information needs may be satisfied by keyword searches (as in library catalogs). To perform indexing, we need to fetch all the pages to be indexed using a crawler.

FIGURE 2.1 Fetching a Web page using telnet and HTTP.

2.2 Crawling Basics


How does a crawler fetch “all” Web pages? Before the advent of the Web, traditional text collections such as bibliographic databases and journal abstracts were provided to the indexing system directly, say, on magnetic tape or disk. In contrast, there is no catalog of all accessible URLs on the Web. The only way to collect URLs is to scan collected pages for hyperlinks to other pages that have not been collected yet. This is the basic principle of crawlers. They start from a given set of URLs, progressively fetch and scan them for new URLs (outlinks), and then fetch these pages in turn, in an endless cycle. New URLs found thus represent potentially pending work for the crawler. The set of pending work expands quickly as the crawl proceeds, and implementers prefer to write this data to disk to relieve main memory as well as guard against data loss in the event of a crawler crash. There is no guarantee that all accessible Web pages will be located in this fashion; indeed, the crawler may never halt, as pages will be added continually even as it is running. Apart from outlinks, pages contain text; this is submitted to a text indexing system (described in Section 3.1) to enable information retrieval using keyword searches.
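
As a concrete illustration, here is a minimal sketch (Python, not the book's code) of this fetch-scan-enqueue cycle; the seed URLs, the page limit, and the in-memory frontier are simplifications of what a real crawler would use:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href attribute of every anchor tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)      # pending work; a real crawler spills this to disk
        seen = set(seed_urls)            # URLs already queued, to avoid redundant fetches
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except Exception:
                continue                 # skip pages that cannot be fetched or decoded
            fetched += 1
            # ... the page text would be handed to the indexer here (Section 3.1) ...
            parser = LinkExtractor()
            parser.feed(page)
            for link in parser.links:
                outlink = urljoin(url, link)         # resolve relative links against the base URL
                if outlink.startswith("http") and outlink not in seen:
                    seen.add(outlink)
                    frontier.append(outlink)
        return fetched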

It is quite simple to write a basic crawler, but a great deal of engineering goes into industry-strength crawlers that fetch a substantial fraction of all accessible Web documents. Web search companies such as AltaVista, Northern Light, and Inktomi do publish white papers on their crawling technologies, but piecing together the technical details is not easy. There are only a few documents in the public domain that give some detail, such as a paper about AltaVista’s Mercator crawler [108] and a description of Google’s first-generation crawler [26]. Based partly on such information, Figure 2.2 should be a reasonably accurate block diagram of a large-scale crawler.

The central function of a crawler is to fetch many pages at the same time, in order to overlap the delays involved in

1. Resolving the hostname in the URL to an IP address using DNS
2. Connecting a socket to the server and sending the request
3. Receiving the requested page in response

together with the time spent in scanning pages for outlinks and saving pages to a local document repository. Typically, for short pages, DNS lookup and socket connection take up a large portion of the processing time, which depends on round-trip times on the Internet and cannot generally be reduced by buying more bandwidth.

The entire life cycle of a page fetch, as listed above, is managed by a logical thread of control. This need not be a thread or process provided by the operating system, but may be specifically programmed for this purpose for higher efficiency. In Figure 2.2 this is shown as the “Page fetching context/thread,” which starts with DNS resolution and finishes when the entire page has been fetched via HTTP (or some error condition arises). After the fetch context has completed its task, the page is usually stored in compressed form to disk or tape and also scanned for outgoing hyperlinks (hereafter called “outlinks”). Outlinks are checked into a work pool. A load manager checks out enough work from the pool to maintain network utilization without overloading it. This process continues until the crawler has collected a “sufficient” number of pages. It is difficult to define “sufficient” in general. For an intranet of moderate size, a complete crawl may well be possible. For the Web, there are indirect estimates of the number of publicly accessible pages, and a crawler may be run until a substantial fraction is fetched. Organizations with limited networking or storage resources may need to stop the crawl for lack of space, or in order to rebuild their indices frequently enough to be useful.
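
The following is a minimal sketch (Python asyncio, not the book's implementation) of keeping many fetch contexts in flight at once; the semaphore plays the role of a crude load manager, and each context issues the raw HTTP/1.0 request of Figure 2.1:

    import asyncio
    from urllib.parse import urlsplit

    async def fetch(url, limiter):
        """One logical fetch context: DNS lookup, connect, request, receive."""
        async with limiter:                          # check out work only while under the load limit
            parts = urlsplit(url)
            reader, writer = await asyncio.open_connection(parts.hostname, 80)
            request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(parts.path or "/", parts.hostname)
            writer.write(request.encode("ascii"))
            await writer.drain()
            response = await reader.read()           # headers plus page body, until the server closes
            writer.close()
            await writer.wait_closed()
            return url, len(response)

    async def fetch_many(urls, max_in_flight=100):
        limiter = asyncio.Semaphore(max_in_flight)   # bound the number of fetches in flight
        tasks = [fetch(u, limiter) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

    # Example: asyncio.run(fetch_many(["http://www.cse.iitb.ac.in/"]))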

FIGURE 2.2 Typical anatomy of a large-scale crawler.

2.3 Engineering Large-Scale Crawlers


In the previous section we discussed a basic crawler. Large-scale crawlers that send requests to millions of Web sites and collect hundreds of millions of pages need a great deal of care to achieve high performance. In this section we will discuss the important performance and reliability considerations for a large-scale crawler. Before we dive into the details, it will help to list the main concerns:

Because a single page fetch may involve several seconds of network latency, it is essential to fetch many pages (typically hundreds to thousands) at the same time to utilize the network bandwidth available.
Many simultaneous fetches are possible only if the DNS lookup is streamlined to be highly concurrent, possibly replicated on a few DNS servers.
Multiprocessing or multithreading provided by the operating system is not the best way to manage the multiple fetches, owing to high overheads. The best bet is to explicitly encode the state of a fetch context in a data structure and use asynchronous sockets, which do not block the process/thread using them but can be polled to check for completion of network transfers.
Care is needed to extract URLs and eliminate duplicates to reduce redundant fetches and to avoid “spider traps”—hyperlink graphs constructed carelessly or malevolently to keep a crawler trapped in that graph, fetching what can potentially be an infinite...