Suchen und Finden
CHAPTER 2 CRAWLING THE WEB
The World Wide Web, or the Web for short, is a collection of billions of documents written in a way that enables them to cite each other using hyperlinks, which is why they are a form of hypertext. These documents, or Web pages, are typically a few thousand characters long, written in a diversity of languages, and cover essentially all topics of human endeavor. Web pages are served through the Internet using the hypertext transport protocol (Http) to client computers, where they can be viewed using browsers. Http is built on top of the transport control protocol (TCP), which provides reliable data streams to be transmitted from one computer to another across the Internet.
Throughout this book, we shall study how automatic programs can analyze hypertext documents and the networks induced by the hyperlinks that connect them. To do so, it is usually necessary to fetch the pages to the computer where those programs will be run. This is the job of a crawler (also called a spider, robot, or bot). In this chapter we will study in detail how crawlers work. If you are more interested in how pages are indexed and analyzed, you can skip this chapter with hardly any loss of continuity.
I will assume that you have basic familiarity with computer networking using TCP, to the extent of writing code to open and close sockets and read and write data using a socket. We will focus on the organization of large-scale crawlers, which must handle millions of servers and billions of pages.
2.1 HTML and HTTP Basics
Web pages are written in a tagged markup language called the hypertext markup language (HTML). HTML lets the author specify layout and typeface, embed diagrams, and create hyperlinks. A hyperlink is expressed as an anchor tag with an href attribute, which names another page using a uniform resource locator (URL), like this:
In its simplest form, the target URL contains a protocol field (http), a server hostname (www.cse.iitb.ac.in), and a file path (/, the “root” of the published file system).
A Web browser such as Netscape Communicator or Internet Explorer will let the reader click the computer mouse on the hyperlink. The click is translated transparently by the browser into a network request to fetch the target page using Http.
A browser will fetch and display a Web page given a complete URL like the one above, but to reveal the underlying network protocol, we will (ab)use the telnet command available on UNIX machines, as shown in Figure 2.1. First the telnet client (as well as any Web browser) has to resolve the server hostname www.cse.iitb.ac.in to an Internet address of the form 144.16.111.14 (called an IP address, IP standing for Internet protocol) to be able to contact the server using TCP. The mapping from name to address is done using the Domain Name Service (DNS), a distributed database of name-to-IP mappings maintained at known servers [202]. Next, the client connects to port 80, the default Http port, on the server. The underlined text is entered by the user (this is transparently provided by Web browsers). The slanted text is called the MIME header. (MIME stands for multipurpose Internet mail extensions, and is a metadata standard for email and Web content transfer.) The ends of the request and response headers are indicated by the sequence CR-LF-CR-LF (double newline, written in and shown as the blank lines).
Browsing is a useful but restrictive means of finding information. Given a page with many links to follow, it would be unclear and painstaking to explore them in search of a specific information need. A better option is to index all the text so that information needs may be satisfied by keyword searches (as in library catalogs). To perform indexing, we need to fetch all the pages to be indexed using a crawler.
FIGURE 2.1 Fetching a Web page using telnet and Http.
2.2 Crawling Basics
How does a crawler fetch “all” Web pages? Before the advent of the Web, traditional text collections such as bibliographic databases and journal abstracts were provided to the indexing system directly, say, on magnetic tape or disk. In contrast, there is no catalog of all accessible URLs on the Web. The only way to collect URLs is to scan collected pages for hyperlinks to other pages that have not been collected yet. This is the basic principle of crawlers. They start from a given set of URLs, progressively fetch and scan them for new URLs (outlinks), and then fetch these pages in turn, in an endless cycle. New URLs found thus represent potentially pending work for the crawler. The set of pending work expands quickly as the crawl proceeds, and implementers prefer to write this data to disk to relieve main memory as well as guard against data loss in the event of a crawler crash. There is no guarantee that all accessible Web pages will be located in this fashion; indeed, the crawler may never halt, as pages will be added continually even as it is running. Apart from outlinks, pages contain text; this is submitted to a text indexing system (described in Section 3.1) to enable information retrieval using keyword searches.
It is quite simple to write a basic crawler, but a great deal of engineering goes into industry-strength crawlers that fetch a substantial fraction of all accessible Web documents. Web search companies like AltaVista, Northern Light, Inktomi, and the like do publish white papers on their crawling technologies, but piecing together the technical details is not easy. There are only a few documents in the public domain that give some detail, such as a paper about AltaVista’s Mercator crawler [108] and a description of Google’s first-generation crawler [26]. Based partly on such information, Figure 2.2 should be a reasonably accurate block diagram of a large-scale crawler.
The central function of a crawler is to fetch many pages at the same time, in order to overlap the delays involved in together with time spent in scanning pages for outlinks and saving pages to a local document repository. Typically, for short pages, DNS lookup and socket connection take a large portion of the processing time, which depends on round-trip times on the Internet and is generally unmitigated by buying more bandwidth.
The entire life cycle of a page fetch, as listed above, is managed by a logical thread of control. This need not be a thread or process provided by the operating system, but may be specifically programmed for this purpose for higher efficiency. In Figure 2.2 this is shown as the “Page fetching context/thread,” which starts with DNS resolution and finishes when the entire page has been fetched via Http (or some error condition arises). After the fetch context has completed its task, the page is usually stored in compressed form to disk or tape and also scanned for outgoing hyperlinks (hereafter called “outlinks”). Outlinks are checked into a work pool. A load manager checks out enough work from the pool to maintain network utilization without overloading it. This process continues until the crawler has collected a “sufficient” number of pages. It is difficult to define “sufficient” in general. For an intranet of moderate size, a complete crawl may well be possible. For the Web, there are indirect estimates of the number of publicly accessible pages, and a crawler may be run until a substantial fraction is fetched. Organizations with less networking or storage resources may need to stop the crawl for lack of space, or to build indices frequently enough to be useful.
FIGURE 2.2 Typical anatomy of a large-scale crawler.
2.3 Engineering Large-Scale Crawlers
In the previous section we discussed a basic crawler. Large-scale crawlers that send requests to millions of Web sites and collect hundreds of millions of pages need a great deal of care to achieve high performance. In this section we will discuss the important performance and reliability considerations for a large-scale crawler. Before we dive into the details, it will help to list the main concerns:
Alle Preise verstehen sich inklusive der gesetzlichen MwSt.