
World Wide Web 2 (1999) 219–229

Mercator: A scalable, extensible Web crawler


Allan Heydon and Marc Najork
Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, CA 94301, USA
E-mail: {heydon,najork}@pa.dec.com

This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java. Scalable Web crawlers are an important
component of many Web services, but their design is not well-documented in the literature. We enumerate the major components of any
scalable Web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator.
We also describe Mercator’s support for extensibility and customizability. Finally, we comment on Mercator’s performance, which we
have found to be comparable to that of other crawlers for which performance numbers have been published.

1. Introduction

Designing a scalable Web crawler comparable to the ones used by the major search engines is a complex endeavor. However, due to the competitive nature of the search engine business, there are few papers in the literature describing the challenges and tradeoffs inherent in Web crawler design. This paper's main contribution is to fill that gap. It describes Mercator, a scalable, extensible Web crawler written entirely in Java.

By scalable, we mean that Mercator is designed to scale up to the entire Web, and has been used to fetch tens of millions of Web documents. We achieve scalability by implementing our data structures so that they use a bounded amount of memory, regardless of the size of the crawl. Hence, the vast majority of our data structures are stored on disk, and small parts of them are stored in memory for efficiency.

By extensible, we mean that Mercator is designed in a modular way, with the expectation that new functionality will be added by third parties. In practice, it has been used to create a snapshot of the Web pages on our corporate intranet, to collect a variety of statistics about the Web, and to perform a series of random walks of the Web [Henzinger et al. 1999].

One of the initial motivations of this work was to collect statistics about the Web. There are many statistics that might be of interest, such as the size and the evolution of the URL space, the distribution of Web servers over top-level domains, the lifetime and change rate of documents, and so on. However, it is hard to know a priori exactly which statistics are interesting, and topics of interest may change over time. Mercator makes it easy to collect new statistics – or more generally, to be configured for different crawling tasks – by allowing users to provide their own modules for processing downloaded documents. For example, when we designed Mercator, we did not anticipate the possibility of using it for the random walk application cited above. Despite the differences between random walking and traditional crawling, we were able to reconfigure Mercator as a random walker without modifying the crawler's core, merely by plugging in modules totaling 360 lines of Java source code.

The remainder of the paper is structured as follows. The next section surveys related work. Section 3 describes the main components of a scalable Web crawler, the alternatives and tradeoffs in their design, and the particular choices we made in Mercator. Section 4 describes Mercator's support for extensibility. Section 5 describes some of the Web crawling problems that arise due to the inherent anarchy of the Web. Section 6 reports on performance measurements and Web statistics collected during a recent extended crawl. Finally, section 7 offers our conclusions.

2. Related work

Web crawlers – also known as robots, spiders, worms, walkers, and wanderers – are almost as old as the Web itself [Koster]. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic [Gray]. Several papers about Web crawling were presented at the first two World Wide Web conferences [Eichmann 1994; McBryan 1994; Pinkerton 1994]. However, at the time, the Web was two to three orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's Web.

Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility.




The Google search engine is a distributed system that uses multiple machines for crawling [Brin and Page 1998; Google]. The crawler consists of five functional components running in different processes. A URL server process reads URLs out of a file and forwards them to multiple crawler processes. Each crawler process runs on a different machine, is single-threaded, and uses asynchronous I/O to fetch data from up to 300 Web servers in parallel. The crawlers transmit downloaded pages to a single StoreServer process, which compresses the pages and stores them to disk. The pages are then read back from disk by an indexer process, which extracts links from HTML pages and saves them to a different disk file. A URL resolver process reads the link file, derelativizes the URLs contained therein, and saves the absolute URLs to the disk file that is read by the URL server. Typically, three to four crawler machines are used, so the entire system requires between four and eight machines.

The Internet Archive also uses multiple machines to crawl the Web [Burner 1997; InternetArchive]. Each crawler process is assigned up to 64 sites to crawl, and no site is assigned to more than one crawler. Each single-threaded crawler process reads a list of seed URLs for its assigned sites from disk into per-site queues, and then uses asynchronous I/O to fetch pages from these queues in parallel. Once a page is downloaded, the crawler extracts the links contained in it. If a link refers to the site of the page it was contained in, it is added to the appropriate site queue; otherwise it is logged to disk. Periodically, a batch process merges these logged "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process.

In the area of extensible Web crawlers, Miller and Bharat's SPHINX system [Miller and Bharat 1998] provides some of the same customizability features as Mercator. In particular, it provides a mechanism for limiting which pages are crawled, and it allows customized document processing code to be written. However, SPHINX is targeted towards site-specific crawling, and therefore is not designed to be scalable.

3. Architecture of a scalable Web crawler

The basic algorithm executed by any Web crawler takes a list of seed URLs as its input and repeatedly executes the following steps. Remove a URL from the URL list, determine the IP address of its host name, download the corresponding document, and extract any links contained in it. For each of the extracted links, ensure that it is an absolute URL (derelativizing it if necessary), and add it to the list of URLs to download, provided it has not been encountered before. If desired, process the downloaded document in other ways (e.g., index its content). This basic algorithm requires a number of functional components:

• a component (called the URL frontier) for storing the list of URLs to download,
• a component for resolving host names into IP addresses,
• a component for downloading documents using the HTTP protocol,
• a component for extracting links from HTML documents, and
• a component for determining whether a URL has been encountered before.

This section describes how Mercator refines this basic algorithm, and the particular implementations we chose for the various components. Where appropriate, we comment on design alternatives and the tradeoffs between them.
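
The basic algorithm can be restated as a short Java sketch. The interfaces and names below (Frontier, Resolver, and so on) are hypothetical stand-ins for the components just enumerated, not Mercator's actual classes:

import java.util.List;

// A minimal sketch of the basic crawling loop, not Mercator's actual code.
// The component interfaces below are hypothetical stand-ins for the ones
// enumerated in the text.
public class BasicCrawler {
    interface Frontier {                       // stores URLs to download
        String remove();                       // returns null when empty
        void add(String url);
    }
    interface Resolver { String resolve(String host); }            // host -> IP
    interface Downloader { byte[] fetch(String url, String ip); }  // e.g., HTTP GET
    interface LinkExtractor { List<String> extract(byte[] doc, String baseUrl); }
    interface UrlSeenTest { boolean addIfNew(String url); }        // true if unseen

    private final Frontier frontier;
    private final Resolver resolver;
    private final Downloader downloader;
    private final LinkExtractor extractor;
    private final UrlSeenTest seen;

    BasicCrawler(Frontier f, Resolver r, Downloader d, LinkExtractor e, UrlSeenTest s) {
        frontier = f; resolver = r; downloader = d; extractor = e; seen = s;
    }

    void crawl(List<String> seeds) {
        for (String s : seeds) {
            if (seen.addIfNew(s)) frontier.add(s);
        }
        String url;
        while ((url = frontier.remove()) != null) {
            String host = java.net.URI.create(url).getHost();
            String ip = resolver.resolve(host);               // DNS lookup
            byte[] doc = downloader.fetch(url, ip);           // download the document
            if (doc == null) continue;                        // fetch failed
            for (String link : extractor.extract(doc, url)) { // links already absolute
                if (seen.addIfNew(link)) frontier.add(link);  // URL-seen test
            }
            // optionally: process the document in other ways (e.g., index it)
        }
    }
}
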
3.1. Mercator's components

Figure 1 shows Mercator's main components. Crawling is performed by multiple worker threads, typically numbering in the hundreds. Each worker repeatedly performs the steps needed to download and process a document. The first step of this loop is to remove an absolute URL from the shared URL frontier for downloading.

Figure 1. Mercator’s main components.



An absolute URL begins with a scheme (e.g., "http"), which identifies the network protocol that should be used to download it. In Mercator, these network protocols are implemented by protocol modules. The protocol modules to be used in a crawl are specified in a user-supplied configuration file, and are dynamically loaded at the start of the crawl. The default configuration includes protocol modules for HTTP, FTP, and Gopher. As suggested by the tiling of the protocol modules in figure 1, there is a separate instance of each protocol module per thread, which allows each thread to access local data without any synchronization.

Based on the URL's scheme, the worker selects the appropriate protocol module for downloading the document. It then invokes the protocol module's fetch method, which downloads the document from the Internet into a per-thread RewindInputStream (or RIS for short). A RIS is an I/O abstraction that is initialized from an arbitrary input stream, and that subsequently allows that stream's contents to be re-read multiple times.

Once the document has been written to the RIS, the worker thread invokes the content-seen test to determine whether this document (associated with a different URL) has been seen before. If so, the document is not processed any further, and the worker thread removes the next URL from the frontier.

Every downloaded document has an associated MIME type. In addition to associating schemes with protocol modules, a Mercator configuration file also associates MIME types with one or more processing modules. A processing module is an abstraction for processing downloaded documents, for instance extracting links from HTML pages, counting the tags found in HTML pages, or collecting statistics about GIF images. Like protocol modules, there is a separate instance of each processing module per thread. In general, processing modules may have side-effects on the state of the crawler, as well as on their own internal state.

Based on the downloaded document's MIME type, the worker invokes the process method of each processing module associated with that MIME type. For example, the Link Extractor and Tag Counter processing modules in figure 1 are used for text/html documents, and the GIF Stats module is used for image/gif documents.

By default, a processing module for extracting links is associated with the MIME type text/html. The process method of this module extracts all links from an HTML page. Each link is converted into an absolute URL, and tested against a user-supplied URL filter to determine if it should be downloaded. If the URL passes the filter, the worker performs the URL-seen test, which checks if the URL has been seen before, namely, if it is in the URL frontier or has already been downloaded. If the URL is new, it is added to the frontier.

The above description omits several important implementation details. Designing data structures that can efficiently handle hundreds of millions of entries poses many engineering challenges. Central to these concerns is the need to balance memory use and performance. In the remainder of this section, we describe how we address this time-space tradeoff in several of Mercator's main data structures.

3.2. The URL frontier

The URL frontier is the data structure that contains all the URLs that remain to be downloaded. Most crawlers work by performing a breadth-first traversal of the Web, starting from the pages in the seed set. Such traversals are easily implemented by using a FIFO queue.

In a standard FIFO queue, elements are dequeued in the order they were enqueued. In the context of Web crawling, however, matters are complicated by the fact that it is considered socially unacceptable to have multiple HTTP requests pending to the same server. If multiple requests are to be made in parallel, the queue's remove operation should not simply return the head of the queue, but rather a URL close to the head whose host has no outstanding request.

To implement this politeness constraint, the default version of Mercator's URL frontier is actually implemented by a collection of distinct FIFO subqueues. There are two important aspects to how URLs are added to and removed from these queues. First, there is one FIFO subqueue per worker thread. That is, each worker thread removes URLs from exactly one of the FIFO subqueues. Second, when a new URL is added, the FIFO subqueue in which it is placed is determined by the URL's canonical host name. Together, these two points imply that at most one worker thread will download documents from a given Web server. This design prevents Mercator from overloading a Web server, which could otherwise become a bottleneck of the crawl.
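
A minimal, purely in-memory sketch of this politeness scheme is shown below; it assigns each URL to a subqueue by hashing its host name, so that each server is only ever served by a single worker thread. It is an illustration only, and omits the disk-backed buffering that Mercator's real frontier uses (described next):

import java.util.ArrayDeque;
import java.util.Queue;
import java.net.URI;

// Sketch of a polite URL frontier: one FIFO subqueue per worker thread,
// with each URL assigned to a subqueue by hashing its host name. Unlike
// Mercator's real frontier, this version keeps everything in memory.
public class PoliteFrontier {
    private final Queue<String>[] subqueues;

    @SuppressWarnings("unchecked")
    public PoliteFrontier(int numWorkerThreads) {
        subqueues = new Queue[numWorkerThreads];
        for (int i = 0; i < numWorkerThreads; i++) {
            subqueues[i] = new ArrayDeque<>();
        }
    }

    // All URLs with the same host land in the same subqueue, so at most
    // one worker thread ever downloads from a given server.
    public synchronized void add(String absoluteUrl) {
        String host = URI.create(absoluteUrl).getHost();
        int i = Math.floorMod(host.hashCode(), subqueues.length);
        subqueues[i].add(absoluteUrl);
    }

    // Worker thread t only ever removes from its own subqueue.
    public synchronized String remove(int workerThreadId) {
        return subqueues[workerThreadId].poll();   // null if the subqueue is empty
    }
}
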
In actual World Wide Web crawls, the size of the crawl's frontier numbers in the hundreds of millions of URLs. Hence, the majority of the URLs must be stored on disk. To amortize the cost of reading from and writing to disk, our FIFO subqueue implementation maintains fixed-size enqueue and dequeue buffers in memory; our current implementation uses buffers that can hold 600 URLs each.

As described in section 4 below, the URL frontier is one of Mercator's "pluggable" components, meaning that it can easily be replaced by other implementations. For example, one could implement a URL frontier that uses a ranking of URL importance (such as the PageRank metric [Brin and Page 1998]) to order URLs in the frontier set. Cho, Garcia-Molina, and Page have performed simulated crawls showing that such ordering can improve crawling effectiveness [Cho et al. 1998].

3.3. The HTTP protocol module

The purpose of a protocol module is to fetch the document corresponding to a given URL using the appropriate network protocol. Network protocols supported by Mercator include HTTP, FTP, and Gopher.

Courteous Web crawlers implement the Robots Exclusion Protocol, which allows Web masters to declare parts of their sites off limits to crawlers [RobotsExclusion]. The Robots Exclusion Protocol requires a Web crawler to fetch a special document containing these declarations from a Web site before downloading any real content from it. To avoid downloading this file on every request, Mercator's HTTP protocol module maintains a fixed-sized cache mapping host names to their robots exclusion rules. By default, the cache is limited to 2^18 entries, and uses an LRU replacement strategy.
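
Such a bounded, LRU-replaced cache can be sketched in a few lines with java.util.LinkedHashMap in access order. The RobotsRules type below is a hypothetical placeholder for the parsed exclusion rules; this is not Mercator's implementation:

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a bounded cache mapping host names to robots exclusion rules,
// evicting the least-recently-used entry once the capacity is exceeded.
// RobotsRules is a hypothetical placeholder for the parsed rules.
public class RobotsCache {
    public static final class RobotsRules { /* parsed robots.txt rules would go here */ }

    private final Map<String, RobotsRules> cache;

    public RobotsCache(final int capacity) {        // e.g., 1 << 18 entries
        // accessOrder = true keeps the map ordered from least to most recently used
        this.cache = new LinkedHashMap<String, RobotsRules>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, RobotsRules> eldest) {
                return size() > capacity;           // evict the LRU entry
            }
        };
    }

    public synchronized RobotsRules get(String host) { return cache.get(host); }

    public synchronized void put(String host, RobotsRules rules) { cache.put(host, rules); }
}
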
Our initial HTTP protocol module used the HTTP support provided by the JDK 1.1 Java class libraries. However, we soon discovered that this implementation did not allow us to specify any timeout values on HTTP connections, so a malicious Web server could cause a worker thread to hang indefinitely. Also, the JDK implementation was not particularly efficient. Therefore, we wrote our own "lean and mean" HTTP protocol module; its requests time out after 1 minute, and it has minimal synchronization and allocation overhead.

3.4. Rewind input stream

Mercator's design allows the same document to be processed by multiple processing modules. To avoid reading a document over the network multiple times, we cache the document locally using an abstraction called a RewindInputStream (RIS).

A RIS is an input stream with an open method that reads and caches the entire contents of a supplied input stream (such as the input stream associated with a socket). A RIS caches small documents (64 KB or less) entirely in memory, while larger documents are temporarily written to a backing file. The RIS constructor allows a client to specify an upper limit on the size of the backing file as a safeguard against malicious Web servers that might return documents of unbounded size. By default, Mercator sets this limit to 1 MB. In addition to the functionality provided by normal input streams, a RIS also provides a method for rewinding its position to the beginning of the stream, and various lexing methods that make it easy to build MIME-type-specific parsers.

Each worker thread has an associated RIS, which it reuses from document to document. After removing a URL from the frontier, a worker passes that URL to the appropriate protocol module, which initializes the RIS from a network connection to contain the document's contents. The worker then passes the RIS to all relevant processing modules, rewinding the stream before each module is invoked.
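
The following sketch captures the core of such a rewindable stream: open reads and caches the supplied stream (up to a limit), and rewind resets the read position. Mercator's actual RIS additionally spills documents larger than 64 KB to a backing file and provides lexing methods, both of which are omitted here:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a rewindable input stream: open() reads and caches the contents
// of a supplied stream (up to a size limit), and rewind() resets the read
// position to the beginning.
public class RewindableStream {
    private byte[] contents = new byte[0];
    private InputStream reader = new ByteArrayInputStream(contents);

    public void open(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) > 0) {
            if (buf.size() + n > maxBytes) break;   // safeguard against unbounded documents
            buf.write(chunk, 0, n);
        }
        contents = buf.toByteArray();
        rewind();
    }

    public void rewind() { reader = new ByteArrayInputStream(contents); }

    public int read() throws IOException { return reader.read(); }

    public int read(byte[] b, int off, int len) throws IOException {
        return reader.read(b, off, len);
    }
}
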
3.5. Content-seen test

Many documents on the Web are available under multiple, different URLs. There are also many cases in which documents are mirrored on multiple servers. Both of these effects will cause any Web crawler to download the same document contents multiple times. To prevent processing a document more than once, a Web crawler may wish to perform a content-seen test to decide if the document has already been processed. Using a content-seen test makes it possible to suppress link extraction from mirrored pages, which may result in a significant reduction in the number of pages that need to be downloaded (section 5 below elaborates on these points). Mercator includes just such a content-seen test, which also offers the side benefit of allowing us to keep statistics about the fraction of downloaded documents that are duplicates of pages that have already been downloaded.

The content-seen test would be prohibitively expensive in both space and time if we saved the complete contents of every downloaded document. Instead, we maintain a data structure called the document fingerprint set that stores a 64-bit checksum of the contents of each downloaded document. We compute the checksum using Broder's implementation [Broder 1993] of Rabin's fingerprinting algorithm [Rabin 1981]. Fingerprints offer provably strong probabilistic guarantees that two different strings will not have the same fingerprint. Other checksum algorithms, such as MD5 and SHA, do not offer such provable guarantees, and are also more expensive to compute than fingerprints.

When crawling the entire Web, the document fingerprint set will obviously be too large to be stored entirely in memory. Unfortunately, there is very little locality in the requests made on the document fingerprint set, so caching such requests has little benefit. We therefore maintain two independent sets of fingerprints: a small hash table kept in memory, and a large sorted list kept in a single disk file.

The content-seen test first checks if the fingerprint is contained in the in-memory table. If not, it has to check if the fingerprint resides in the disk file. To avoid multiple disk seeks and reads per disk search, Mercator performs an interpolated binary search of an in-memory index of the disk file to identify the disk block on which the fingerprint would reside if it were present. It then searches the appropriate disk block, again using interpolated binary search. We use a buffered variant of Java's random access files, which guarantees that searching through one disk block causes at most two kernel calls (one seek and one read). We use a customized data structure instead of a more generic data structure such as a B-tree because of this guarantee. It is worth noting that Mercator's ability to be dynamically configured (see section 4 below) would easily allow someone to replace our implementation with a different one based on B-trees.

When a new fingerprint is added to the document fingerprint set, it is added to the in-memory table. When this table fills up, its contents are merged with the fingerprints on disk, at which time the in-memory index of the disk file is updated as well. To guard against races, we use a readers-writer lock that controls access to the disk file. Threads must hold a read share of the lock while reading from the file, and must hold the write lock while writing to it.
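
A simplified sketch of this two-level design appears below. The sorted disk file is represented by a sorted in-memory array, and the interpolated binary search over an index of disk blocks is reduced to an ordinary binary search; only the overall structure (small in-memory table, large sorted list, readers-writer lock, periodic merge) follows the description above:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the document fingerprint set: new fingerprints go into a small
// in-memory table; when it fills up, it is merged into a large sorted list
// (a sorted array here, standing in for Mercator's sorted disk file).
// A readers-writer lock guards the merged list, as in the text.
public class FingerprintSet {
    private final int memTableCapacity;
    private final Set<Long> memTable = new HashSet<>();
    private long[] sortedList = new long[0];                 // stands in for the disk file
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public FingerprintSet(int memTableCapacity) { this.memTableCapacity = memTableCapacity; }

    // Returns true if the fingerprint was already present (content seen before).
    public boolean containsOrAdd(long fingerprint) {
        synchronized (memTable) {
            if (memTable.contains(fingerprint)) return true;
        }
        lock.readLock().lock();
        try {
            if (Arrays.binarySearch(sortedList, fingerprint) >= 0) return true;
        } finally {
            lock.readLock().unlock();
        }
        synchronized (memTable) {
            memTable.add(fingerprint);
            if (memTable.size() >= memTableCapacity) merge();
        }
        return false;
    }

    // Merge the in-memory table into the sorted list (the "disk" copy).
    private void merge() {
        lock.writeLock().lock();
        try {
            long[] merged = new long[sortedList.length + memTable.size()];
            System.arraycopy(sortedList, 0, merged, 0, sortedList.length);
            int i = sortedList.length;
            for (long fp : memTable) merged[i++] = fp;
            Arrays.sort(merged);
            sortedList = merged;
            memTable.clear();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
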

3.6. URL filters

The URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded. Before adding a URL to the frontier, the worker thread consults the user-supplied URL filter. The URL filter class has a single crawl method that takes a URL and returns a boolean value indicating whether or not to crawl that URL. Mercator includes a collection of different URL filter subclasses that provide facilities for restricting URLs by domain, prefix, or protocol type, and for computing the conjunction, disjunction, or negation of other filters. Users may also supply their own custom URL filters, which are dynamically loaded at start-up.
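
The filter mechanism can be sketched as follows. Mercator defines the filter as an abstract class with a single crawl method; the sketch below uses a modern Java interface with lambdas for brevity, and the combinators mirror the conjunction, disjunction, and negation facilities mentioned above:

// Sketch of the URL filter mechanism: a filter decides whether a URL should
// be crawled, and filters can be combined with boolean operators. The names
// below are illustrative, not Mercator's actual class names.
public interface UrlFilter {
    boolean crawl(String url);

    // Restrict crawling to URLs that start with a given prefix.
    static UrlFilter prefix(String prefix) {
        return url -> url.startsWith(prefix);
    }

    default UrlFilter and(UrlFilter other) { return url -> crawl(url) && other.crawl(url); }
    default UrlFilter or(UrlFilter other)  { return url -> crawl(url) || other.crawl(url); }
    default UrlFilter negate()             { return url -> !crawl(url); }
}

A filter restricted to a single site but excluding one subtree could then be composed as UrlFilter.prefix("http://www.example.com/").and(UrlFilter.prefix("http://www.example.com/private/").negate()), where example.com is of course just a placeholder.
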
3.7. Domain name resolution

Before contacting a Web server, a Web crawler must use the Domain Name Service (DNS) to map the Web server's host name into an IP address. DNS name resolution is a well-documented bottleneck of most Web crawlers. This bottleneck is exacerbated in any crawler, like Mercator or the Internet Archive crawler, that uses DNS to canonicalize the host names of newly discovered URLs before performing the URL-seen test on them.

We tried to alleviate the DNS bottleneck by caching DNS results, but that was only partially effective. After some probing, we discovered that the Java interface to DNS lookups is synchronized. Further investigation revealed that the DNS interface on most flavors of Unix (i.e., the gethostbyname function provided as part of the Berkeley Internet Name Domain (BIND) distribution [BIND]) is also synchronized. This meant that only one DNS request on an uncached name could be outstanding at once. The cache miss rate is high enough that this limitation causes a bottleneck.

To work around these problems, we wrote our own multi-threaded DNS resolver class and integrated it into Mercator. The resolver forwards DNS requests to a local name server, which does the actual work of contacting the authoritative server for each query. Because multiple requests can be made in parallel, our resolver can resolve host names much more rapidly than either the Java or Unix resolvers.

This change led to a significant crawling speedup. Before making the change, performing DNS lookups accounted for 87% of each thread's elapsed time. Using our custom resolver reduced that elapsed time to 25%. (Note that the actual number of CPU cycles spent on DNS resolution is extremely low. Most of the elapsed time is spent waiting on remote DNS servers.) Moreover, because our resolver can perform resolutions in parallel, DNS is no longer a bottleneck; if it were, we would simply increase the number of worker threads.
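
The serialization described above was a property of the JDK 1.1 libraries and of gethostbyname; on a modern JDK, simply issuing lookups from a pool of threads is enough to keep many resolutions in flight. The sketch below illustrates only that idea and is not Mercator's resolver, which forwards queries to a local name server itself:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of issuing DNS lookups in parallel from a pool of threads, so that
// no single slow lookup blocks the other worker threads. This assumes a
// modern JDK in which InetAddress lookups are not globally serialized.
public class ParallelResolver {
    private final ExecutorService pool;

    public ParallelResolver(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    // Submit a lookup; the caller can continue and collect the result later.
    public Future<String> resolve(String hostName) {
        return pool.submit(() -> {
            try {
                return InetAddress.getByName(hostName).getHostAddress();
            } catch (UnknownHostException e) {
                return null;                      // treat unresolvable hosts as misses
            }
        });
    }

    public void shutdown() { pool.shutdown(); }
}
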
3.8. URL-seen test

In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link before adding it to the URL frontier. (An alternative design would be to instead perform the URL-seen test when the URL is removed from the frontier, but this approach would result in a much larger frontier.)

To perform the URL-seen test, we store all of the URLs seen by Mercator in canonical form in a large table called the URL set. Again, there are too many entries for them all to fit in memory, so like the document fingerprint set, the URL set is stored mostly on disk.

To save space, we do not store the textual representation of each URL in the URL set, but rather a fixed-sized checksum. Unlike the fingerprints presented to the content-seen test's document fingerprint set, the stream of URLs tested against the URL set has a non-trivial amount of locality. To reduce the number of operations on the backing disk file, we therefore keep an in-memory cache of popular URLs. The intuition for this cache is that links to some URLs are quite common, so caching the popular ones in memory will lead to a high in-memory hit rate.

In fact, using an in-memory cache of 2^18 entries and the LRU-like clock replacement policy, we achieve an overall hit rate on the in-memory cache of 66.2%, and a hit rate of 9.5% on the table of recently-added URLs, for a net hit rate of 75.7%. Moreover, of the 24.3% of requests that miss in both the cache of popular URLs and the table of recently-added URLs, about 1/3 produce hits on the buffer in our random access file implementation, which also resides in user-space. The net result of all this buffering is that each membership test we perform on the URL set results in an average of 0.16 seek and 0.17 read kernel calls (some fraction of which are served out of the kernel's file system buffers). So, each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set. These savings are purely due to the amount of URL locality (i.e., repetition of popular URLs) inherent in the stream of URLs encountered during a crawl.

However, due to the prevalence of relative URLs in Web pages, there is a second form of locality in the stream of discovered URLs, namely, host name locality. Host name locality arises because many links found in Web pages are to different documents on the same server. Unfortunately, computing a URL's checksum by simply fingerprinting its textual representation would cause this locality to be lost. To preserve the locality, we compute the checksum of a URL by merging two independent fingerprints: one of the URL's host name, and the other of the complete URL. These two fingerprints are merged so that the high-order bits of the checksum derive from the host name fingerprint. As a result, checksums for URLs with the same host component are numerically close together. So, the host name locality in the stream of URLs translates into access locality on the URL set's backing disk file, thereby allowing the kernel's file system buffers to service read requests from memory more often. On extended crawls in which the size of the URL set's backing disk file significantly exceeds the size of the kernel's file system buffers, this technique results in a significant reduction in disk load, and hence, in a significant performance improvement.
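
The bit-packing idea can be sketched as follows. Mercator uses 64-bit Rabin fingerprints; to stay self-contained, the sketch substitutes CRC32 (a far weaker hash), so it illustrates only how the host fingerprint is placed in the high-order bits:

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of the locality-preserving URL checksum: the high-order bits are
// derived from a fingerprint of the host name alone, the low-order bits from
// a fingerprint of the complete URL, so checksums of URLs on the same host
// are numerically close together.
public final class UrlChecksum {
    private UrlChecksum() {}

    private static long hash32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();                          // 32-bit value in a long
    }

    public static long checksum(String absoluteUrl) {
        String host = URI.create(absoluteUrl).getHost();
        long hostBits = hash32(host);                   // shared by all URLs on this host
        long urlBits = hash32(absoluteUrl);
        // High 32 bits from the host, low 32 bits from the full URL.
        return (hostBits << 32) | urlBits;
    }
}
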
The Internet Archive crawler implements the URL-seen test using a different data structure called a Bloom filter [Bloom 1970]. A Bloom filter is a probabilistic data structure for set membership testing that may yield false positives. The set is represented by a large bit vector. An element is added to the set by computing n hash functions of the element and setting the corresponding bits. An element is deemed to be in the set if the bits at all n of the element's hash locations are set. Hence, a document may incorrectly be deemed to be in the set, but false negatives are not possible.

The disadvantage to using a Bloom filter for the URL-seen test is that each false positive will cause the URL not to be added to the frontier, and therefore the document will never be downloaded. The chance of a false positive can be reduced by making the bit vector larger. The Internet Archive crawler uses 10 hash functions and a separate 32 KB bit vector for each of the sites currently being crawled. For the batch phase in which non-local URLs are merged, a much larger 2 GB bit vector is used. As the Web grows, the batch process will have to be run on a machine with larger amounts of memory, or disk-based data structures will have to be used. By contrast, our URL-seen test does not admit false positives, and it uses a bounded amount of memory, regardless of the size of the Web.
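
For concreteness, a minimal Bloom filter looks like the sketch below. The way the n hash functions are derived (double hashing from two base hashes) is a standard construction and is not taken from the Internet Archive crawler:

import java.util.BitSet;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Minimal Bloom filter sketch: adding an element sets n bits chosen by n hash
// functions; an element is reported present only if all n bits are set, so
// false positives are possible but false negatives are not.
public class BloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    private int[] baseHashes(String element) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(element.getBytes(StandardCharsets.UTF_8));
            int h1 = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16) | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
            int h2 = ((d[4] & 0xff) << 24) | ((d[5] & 0xff) << 16) | ((d[6] & 0xff) << 8) | (d[7] & 0xff);
            return new int[] { h1, h2 | 1 };            // make the second hash odd
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);                // MD5 is always available
        }
    }

    public void add(String element) {
        int[] h = baseHashes(element);
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h[0] + i * h[1], numBits));
        }
    }

    public boolean mightContain(String element) {
        int[] h = baseHashes(element);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h[0] + i * h[1], numBits))) return false;
        }
        return true;
    }
}
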
3.9. Synchronous vs. asynchronous I/O

Both the Google and Internet Archive crawlers use single-threaded crawling processes and asynchronous I/O to perform multiple downloads in parallel. In contrast, Mercator uses a multi-threaded process in which each thread performs synchronous I/O. These two techniques are different means for achieving the same effect.

The main advantage to using multiple threads and synchronous I/O is that it leads to a much simpler program structure. Switching between threads of control is left to the operating system's thread scheduler, rather than having to be coded in the user program. The sequence of tasks executed by each worker thread is much more straightforward and self-evident.

One strength of the Google and the Internet Archive crawlers is that they are designed from the ground up to scale to multiple machines. However, whether a crawler is distributed or not is orthogonal to the choice between synchronous and asynchronous I/O. It would not be too difficult to adapt Mercator to run on multiple machines, while still using synchronous I/O and multiple threads within each process.

3.10. Checkpointing

A crawl of the entire Web takes weeks to complete. To guard against failures, Mercator writes regular snapshots of its state to disk. These snapshots are orthogonal to Mercator's other disk-based data structures. An interrupted or aborted crawl can easily be restarted from the latest checkpoint.

We define a general checkpointing interface that is implemented by the Mercator classes constituting the crawler core. User-supplied protocol or processing modules are also required to implement the checkpointing interface.

Checkpoints are coordinated using a global readers-writer lock. Each worker thread acquires a read share of the lock while processing a downloaded document. At regular intervals, typically once a day, Mercator's main thread acquires the write lock, so it is guaranteed to be running in isolation. Once it has acquired the lock, the main thread arranges for the checkpoint methods to be called on Mercator's core classes and all user-supplied modules.
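
Using java.util.concurrent, the coordination just described can be sketched as follows; the Checkpointable interface is a stand-in for Mercator's checkpointing interface:

import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of checkpoint coordination: each worker holds a read share of a
// global readers-writer lock while processing a document; the main thread
// acquires the write lock before checkpointing, so no document is in flight
// while state is written to disk.
public class CheckpointCoordinator {
    public interface Checkpointable { void checkpoint() throws Exception; }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<Checkpointable> modules;

    public CheckpointCoordinator(List<Checkpointable> modules) { this.modules = modules; }

    // Called by a worker thread around the processing of one document.
    public void processDocument(Runnable work) {
        lock.readLock().lock();
        try {
            work.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Called periodically (e.g., once a day) by the main thread.
    public void checkpoint() throws Exception {
        lock.writeLock().lock();                  // waits for in-flight documents
        try {
            for (Checkpointable m : modules) {
                m.checkpoint();                   // core classes and user-supplied modules
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
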
4. Extensibility

As mentioned previously, Mercator is an extensible crawler. In practice, this means two things. First, Mercator can be extended with new functionality. For example, new protocol modules may be provided for fetching documents according to different network protocols, or new processing modules may be provided for processing downloaded documents in customized ways. Second, Mercator can easily be reconfigured to use different versions of most of its major components. In particular, different versions of the URL frontier (section 3.2), document fingerprint set (section 3.5), URL filter (section 3.6), and URL set (section 3.8) may all be dynamically "plugged into" the crawler. In fact, we have written multiple versions of each of these components, which we employ for different crawling tasks.

Making an extensible system such as Mercator requires three ingredients:

• The interface to each of the system's components must be well-specified. In Mercator, the interface to each component is defined by an abstract class, some of whose methods are also abstract. Any component implementation is required to be a subclass of this abstract class that provides implementations for the abstract methods.
• A mechanism must exist for specifying how the crawler is to be configured from its various components. In Mercator, this is done by supplying a configuration file to the crawler when it starts up. Among other things, the configuration file specifies which additional protocol and processing modules should be used, as well as the concrete implementation to use for each of the crawler's "pluggable" components. When Mercator is started, it reads the configuration file, uses Java's dynamic class loading feature to dynamically instantiate the necessary
components, plugs these instances into the appropriate places in the crawler core's data structures, and then begins the crawl. Suitable defaults are defined for all components.
• Sufficient infrastructure must exist to make it easy to write new components. In Mercator, this infrastructure consists of a rich set of utility libraries together with a set of existing pluggable components. An object-oriented language such as Java makes it easy to construct a new component by subclassing an existing component and overriding some of its methods. For example, our colleague Mark Manasse needed information about the distribution of HTTP 401 return codes across Web servers. He was able to obtain this data by subclassing Mercator's default HTTP protocol component, and by using one of our standard histogram classes to maintain the statistics. As a result, his custom HTTP protocol component required only 58 lines of Java source code.
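
The configuration-driven instantiation described in the second ingredient above can be sketched with Java's dynamic class loading. The property syntax and the Frontier interface below are illustrative only; the paper does not specify Mercator's configuration file format:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Sketch of configuration-driven component loading: the concrete class to use
// for a pluggable component is named in a configuration file and instantiated
// via Java's dynamic class loading.
public class ComponentLoader {
    public interface Frontier {
        void add(String url);
        String remove();
    }

    public static Frontier loadFrontier(String configFile)
            throws IOException, ReflectiveOperationException {
        Properties config = new Properties();
        try (FileInputStream in = new FileInputStream(configFile)) {
            config.load(in);
        }
        // e.g., frontier.class=com.example.RandomWalkFrontier (hypothetical names)
        String className = config.getProperty("frontier.class",
                                              "com.example.DefaultFrontier");
        Class<?> cls = Class.forName(className);         // dynamic class loading
        return (Frontier) cls.getDeclaredConstructor().newInstance();
    }
}
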
uments from any given Web server at once; moreover, it
To demonstrate Mercator’s extensibility, we now describe maximizes the number of busy worker threads within the
some of the extensions we have written. limits of this guarantee. In particular, all worker threads

4.1. Protocol and processing modules

By default, Mercator will crawl the Web by fetching documents using the HTTP protocol, extracting links from documents of type text/html. However, aside from extracting links, it does not process the documents in any way. To fetch documents using additional protocols or to process the documents once they are fetched, new protocol and processing modules must be supplied.

The abstract Protocol class includes only two methods. The fetch method downloads the document corresponding to a given URL, and the newURL method parses a given string, returning a structured URL object. The latter method is necessary because URL syntax varies from protocol to protocol. In addition to the HTTP protocol module, we have also written protocol modules for fetching documents using the FTP and Gopher protocols.

The abstract Analyzer class is the superclass for all processing modules. It defines a single process method. This method is responsible for reading the document and processing it appropriately. The method may make method calls on other parts of the crawler (e.g., to add newly discovered URLs to the frontier). Analyzers often keep private state or write data to the disk.
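
The shape of these two extension points can be sketched as follows. The method signatures are guesses based on the prose above (the paper does not list them), and the tag-counting analyzer is a toy in the spirit of Mercator's TagCounter module:

import java.io.InputStream;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Sketch of the two extension points described in the text: an abstract
// Protocol with fetch and newURL, and an abstract Analyzer with process.
abstract class Protocol {
    // Download the document corresponding to the given URL.
    public abstract InputStream fetch(URL url) throws IOException;
    // Parse a protocol-specific URL string into a structured URL object.
    public abstract URL newURL(String spec) throws IOException;
}

abstract class Analyzer {
    // Read the (rewindable) document stream and process it appropriately.
    public abstract void process(URL url, InputStream document) throws IOException;
}

// A toy analyzer that counts occurrences of HTML tag names in the document.
class SimpleTagCounter extends Analyzer {
    final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void process(URL url, InputStream document) throws IOException {
        try (Scanner s = new Scanner(document, "UTF-8")) {
            while (s.findWithinHorizon("<([a-zA-Z][a-zA-Z0-9]*)", 0) != null) {
                String tag = s.match().group(1).toLowerCase();
                counts.merge(tag, 1, Integer::sum);
            }
        }
    }
}
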
We have written a variety of different Analyzer subclasses. Some of these analyzers keep statistics, such as our GifStats and TagCounter subtypes. Other processing modules simply write the contents of each downloaded document to disk. Our colleague Raymie Stata has written such a module; the files it writes are in a form suitable to be read by the AltaVista indexing software. As another experiment, we wrote a WebLinter processing module that runs the Weblint program on each downloaded HTML page to check it for errors, logging all discovered errors to a file. These processing modules range in size from 70 to 270 lines of Java code, including comments; the majority are less than 100 lines long.

4.2. Alternative URL frontier implementation

In section 3.2, we described one implementation of the URL frontier data structure. However, when we ran that implementation on a crawl of our corporate intranet, we discovered that it had the drawback that multiple hosts might be assigned to the same worker thread, while other threads were left idle. This situation is more likely to occur on an intranet because intranets typically contain a substantially smaller number of hosts than the internet at large.

To restore the parallelism lost by our initial URL frontier implementation, we wrote an alternative URL frontier component that dynamically assigns hosts to worker threads. Like our initial implementation, the second version guarantees that at most one worker thread will download documents from any given Web server at once; moreover, it maximizes the number of busy worker threads within the limits of this guarantee. In particular, all worker threads will be busy so long as the number of different hosts in the frontier is at least the number of worker threads. The second version is well-suited to host-limited crawls (such as intranet crawls), while the initial version is preferable for internet crawls, since it does not have the overheads required by the second version to maintain a dynamic mapping from host names to worker threads.

4.3. Configuring Mercator as a random walker

We have used Mercator to perform random walks of the Web in order to gather a sample of Web pages; the sampled pages were used to measure the quality of search engines [Henzinger et al. 1999]. A random walk starts at a random page taken from a set of seeds. The next page to fetch is selected by choosing a random link from the current page. The process continues until it arrives at a page with no links, at which time the walk is restarted from a new random seed page. The seed set is dynamically extended by the newly discovered pages, and cycles are broken by performing random restarts every so often.

Performing a random walk of the Web is quite different from an ordinary crawl for two reasons. First, a page may be revisited multiple times during the course of a random walk. Second, only one link is followed each time a page is visited. In order to support random walking, we wrote a new URL frontier class that does not maintain a set of all added URLs, but instead records only the URLs discovered on each thread's most recently fetched page. Its remove method selects one of these URLs at random and returns it. To allow pages to be processed multiple times, we also replaced the document fingerprint set by a new version that never rejects documents as already having been seen. Finally, we subclassed the default LinkExtractor class to perform extra logging. The new classes are "plugged into" the crawler core at runtime using the extension mechanism
described above. In total, the new classes required for random walking amount to 360 lines of Java source code.

5. Crawler traps and other hazards

In the course of our experiments, we had to overcome several pitfalls that would otherwise cause Mercator to download more documents than necessary. Although the Web contains a finite number of static documents (i.e., documents that are not generated on-the-fly), there are an infinite number of retrievable URLs. Three frequent causes of this inflation are URL aliases, session IDs embedded in URLs, and crawler traps. This section describes those problems and the techniques we use to avoid them.

5.1. URL aliases

Two URLs are aliases for each other if they refer to the same content. There are four causes of URL aliases:

• Host name aliases
Host name aliases occur when multiple host names correspond to the same IP address. For example, the host names coke.com and cocacola.com both correspond to the host whose IP address is 208.134.241.178. Hence, every document served by that machine's Web server has at least three different URLs (one using each of the two host names, and one using the IP address). Before performing the URL-seen test, we canonicalize a host name by issuing a DNS request containing a CNAME ("canonical name") and an A ("addresses") query. We use the host's canonical name if one is returned; otherwise, we use the smallest returned IP address.
It should be mentioned that this technique may be too aggressive for some virtual domains. Many Internet service providers (ISPs) serve up multiple domains from the same Web server, using the "Host" header field of the HTTP request to recover the host portion of the URL. If an ISP does not provide distinct CNAME records for these virtual domains, URLs from these domains will be collapsed into the same host space by our canonicalization process.
• Omitted port numbers
If a port number is not specified in a URL, a protocol-specific default value is used. We therefore insert the default value, such as 80 in the case of HTTP, before performing the URL-seen test on it.
• Alternative paths on the same host
On a given host, there may be multiple paths to the same file. For example, the two URLs http://www.digital.com/index.html and http://www.digital.com/home.html both refer to the same file. One common cause of this phenomenon is the use of symbolic links within the server machine's file system.
• Replication across different hosts
Finally, multiple copies of a document may reside on different Web servers. Mirror sites are a common instance of this phenomenon, as are multiple Web servers that access the same shared file system.

In the latter two cases, we cannot avoid downloading duplicate documents. However, we do avoid processing all but the first copy by using the content-seen test described in section 3.5. During an eight-day crawl described in section 6 below, 8.5% of the documents we fetched were duplicates. Had we not performed the content-seen test, the number of unnecessary downloads would have been much higher, since we would have followed links from duplicate documents to other documents that are likely to be duplicates as well.
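
One of the canonicalization steps listed above, inserting the protocol-specific default port, can be sketched with java.net.URI. Host-name canonicalization via CNAME and A queries is omitted, since it requires issuing DNS requests:

import java.net.URI;
import java.net.URISyntaxException;

// Sketch of one canonicalization step: insert the protocol-specific default
// port when a URL omits it, so that, e.g., http://host/path and
// http://host:80/path map to the same canonical form before the URL-seen test.
public final class PortNormalizer {
    private PortNormalizer() {}

    public static String normalize(String url) throws URISyntaxException {
        URI u = new URI(url);
        int port = u.getPort();
        if (port == -1) {                                   // no explicit port
            if ("http".equalsIgnoreCase(u.getScheme()))        port = 80;
            else if ("ftp".equalsIgnoreCase(u.getScheme()))    port = 21;
            else if ("gopher".equalsIgnoreCase(u.getScheme())) port = 70;
            else return url;                                // unknown scheme: leave as-is
        }
        URI canonical = new URI(u.getScheme(), u.getUserInfo(), u.getHost(),
                                port, u.getPath(), u.getQuery(), u.getFragment());
        return canonical.toString();
    }
}
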
section. Thus, Mercator’s document fingerprinting tech-
Before performing the URL-seen test, we canonical-
nique prevents excessive downloading due to embedded
ize a host name by issuing a DNS request containing
session IDs, although it is not powerful enough to auto-
a CNAME (“canonical name”) and an A (“addresses”)
matically detect and remove them.
query. We use the host’s canonical name if one is re-
turned; otherwise, we use the smallest returned IP ad-
5.3. Crawler traps
dress.
It should be mentioned that this technique may be too A crawler trap is a URL or set of URLs that cause a
aggressive for some virtual domains. Many Internet ser- crawler to crawl indefinitely. Some crawler traps are unin-
vice providers (ISPs) serve up multiple domains from tentional. For example, a symbolic link within a file system
the same Web server, using the “Host” header field of can create a cycle. Other crawler traps are introduced in-
the HTTP request to recover the host portion of the URL. tentionally. For example, people have written traps using
If an ISP does not provide distinct CNAME records for CGI programs that dynamically generate an infinite Web of
these virtual domains, URLs from these domains will documents. The motivations behind such traps vary. Anti-
be collapsed into the same host space by our canonical- spam traps are designed to catch crawlers used by “Internet
ization process. marketeers” (better known as spammers) looking for e-mail
• Omitted port numbers addresses, while other sites use traps to catch search engine
If a port number is not specified in a URL, a protocol- crawlers so as to boost their search ratings.
specific default value is used. We therefore insert the We know of no automatic technique for avoiding crawler
default value, such as 80 in the case of HTTP, before traps. However, sites containing crawler traps are easily
performing the URL-seen test on it. noticed due to the large number of documents discovered
• Alternative paths on the same host there. A human operator can verify the existence of a trap
On a given host, there may be multiple paths to the and manually exclude the site from the crawler’s purview
same file. For example, the two URLs http://www. using the customizable URL filter described in section 3.6.
digital.com/index.html and http://www.
digital.com/home.html both refer to the same 6. Results of an extended crawl
file. One common cause of this phenomenon is the use
of symbolic links within the server machine’s file sys- This section reports on Mercator’s performance during
tem. an eight-day crawl, and presents some statistics about the
Web collected during that crawl. In our analysis of Mercator's performance, we contrast it with the performance of the Google and Internet Archive crawlers. We make no attempt to adjust for different hardware configurations since the papers describing the two other crawlers do not contain enough information to do so. Our main intention in presenting this performance data is to convey the relative speeds of the various crawlers.

6.1. Performance

Our production crawling machine is a Digital Ultimate Workstation with two 533 MHz Alpha processors, 2 GB of RAM, 118 GB of local disk, and a 100 Mbit/sec FDDI connection to the Internet. We run Mercator under srcjava, a Java runtime developed at our lab [Ghemawat]. Running on this platform, a Mercator crawl run in May 1999 made 77.4 million HTTP requests in 8 days, achieving an average download rate of 112 documents/sec and 1,682 KB/sec.

These numbers indicate that Mercator's performance compares favorably with that of the Google and the Internet Archive crawlers. The Google crawler is reported to have issued 26 million HTTP requests over 9 days, averaging 33.5 docs/sec and 200 KB/sec [Brin and Page 1998]. This crawl was performed using four machines running crawler processes, and at least one more machine running the other processes. The Internet Archive crawler, which also uses multiple crawler machines, is reported to fetch 4 million HTML docs/day, the average HTML page being 5 KB [Smith 1997]. This download rate is equivalent to 46.3 HTML docs/sec and 231 KB/sec. It is worth noting that Mercator fetches not only HTML pages, but documents of all other MIME types as well. This effect more than doubles the size of the average document downloaded by Mercator as compared to the other crawlers.

Achieving the performance numbers described above required considerable optimization work. In particular, we spent a fair amount of time overcoming performance limitations of the Java core libraries [Heydon and Najork 1999].

We used DCPI, the Digital Continuous Profiling Infrastructure [DCPI], to measure where Mercator spends CPU cycles. DCPI is a freely available tool that runs on Alpha platforms. It can be used to profile both the kernel and user-level processes, and it provides CPU cycle accounting at the granularity of processes, procedures, and individual instructions. We found that Mercator spends 37% of its cycles in JIT-compiled Java bytecode, 19% in the Java runtime, and 44% in the Unix kernel. The Mercator method accounting for the most cycles is the one that fingerprints the contents of downloaded documents (2.3% of all cycles).

6.2. Selected Web statistics

We now present statistics collected about the Web during the eight-day crawl described above.

One interesting statistic is the distribution of outcomes from HTTP requests. Roughly speaking, each URL removed from the frontier causes a single HTTP request. However, there are two wrinkles to that equation, both related to the Robots Exclusion Protocol [RobotsExclusion]. First, before fetching a document, Mercator has to verify whether it has been excluded from downloading the document by the target site's robots.txt file. If the appropriate robots.txt data is not in Mercator's cache (described in section 3.3), it must be downloaded, causing an extra HTTP request. Second, if the robots.txt file indicates that Mercator should not download the document, no HTTP request is made for the document. Table 1 relates the number of URLs removed from the frontier to the total number of HTTP requests.

Table 1
Relationship between the total number of URLs removed from the frontier and the total number of HTTP requests.

  No. of URLs removed           76,732,515
+ No. of robots.txt requests     3,675,634
− No. of excluded URLs           3,050,768
= No. of HTTP requests          77,357,381

1.8 million of the 77.4 million HTTP requests (2.3%) did not result in a response, either because the host could not be contacted or some other network failure occurred. Table 2 gives a breakdown of the HTTP status codes for the remaining 75.6 million requests. We were somewhat surprised at the relatively low number of 404 status codes; we had expected to discover a higher percentage of broken links.

Table 2
Breakdown of HTTP status codes.

Code   Meaning                     Number     Percent
200    OK                      65,790,953      87.03%
404    Not found                5,617,491       7.43%
302    Moved temporarily        2,517,705       3.33%
301    Moved permanently          842,875       1.12%
403    Forbidden                  322,042       0.43%
401    Unauthorized               223,843       0.30%
500    Internal server error       83,744       0.11%
406    Not acceptable              81,091       0.11%
400    Bad request                 65,159       0.09%
       Other                       48,628       0.06%
       Total                   75,593,531      100.0%

Of the 65.8 million documents that were successfully downloaded, 80% were between 1 K and 32 K bytes in size. Figure 2 is a histogram showing the document size distribution. In this figure, the documents are distributed over 21 bins labeled with exponentially increasing document sizes; a document of size n is placed in the rightmost bin with a label not greater than n.

According to our content-seen test, 8.5% of the successful HTTP requests (i.e., those with status code 200) were duplicates¹. Of the 60 million unique documents that were successfully downloaded, the vast majority were HTML pages, followed in popularity by GIF and JPEG images.

¹ This figure ignores the successful HTTP requests that were required to fetch robots.txt files.

Figure 2. Histogram of the sizes of successfully downloaded documents.

Table 3 shows the distribution of the most popular MIME types.

Table 3
Distribution of MIME types.

MIME type                   Number     Percent
text/html               41,490,044       69.2%
image/gif                10,729,326      17.9%
image/jpeg                4,846,257       8.1%
text/plain                  869,911       1.5%
application/pdf             540,656       0.9%
audio/x-pn-realaudio        269,384       0.4%
application/zip             213,089       0.4%
application/postscript      159,869       0.3%
other                       829,410       1.4%
Total                    59,947,946      100.0%

7. Conclusions

Scalable Web crawlers are an important component of many Web services, but they have not been very well documented in the literature. Building a scalable crawler is a non-trivial endeavor because the data manipulated by the crawler is too big to fit entirely in memory, so there are performance issues relating to how to balance the use of disk and memory. This paper has enumerated the main components required in any scalable crawler, and it has discussed design alternatives for those components.

In particular, the paper described Mercator, an extensible, scalable crawler written entirely in Java. Mercator's design features a crawler core for handling the main crawling tasks, and extensibility through protocol and processing modules. Users may supply new modules for performing customized crawling tasks. We have used Mercator for a variety of purposes, including performing random walks on the Web, crawling our corporate intranet, and collecting statistics about the Web at large.

Although our use of Java as an implementation language was somewhat controversial when we began the project, we have not regretted the choice. Java's combination of features – including threads, garbage collection, objects, and exceptions – made our implementation easier and more elegant. Moreover, when run under a high-quality Java runtime, Mercator's performance compares well to other Web crawlers for which performance numbers have been published.

Mercator's scalability design has worked well. It is easy to configure the crawler for varying memory footprints. For example, we have run it on machines with memory sizes ranging from 128 MB to 2 GB. The ability to configure Mercator for a wide variety of hardware platforms makes it possible to select the most cost-effective platform for any given crawling task.

Mercator's extensibility features have also been successful. As mentioned above, we have been able to adapt Mercator to a variety of crawling tasks, and as stated earlier, the new code was typically quite small (tens to hundreds of lines). Java's dynamic class loading support meshes well with Mercator's extensibility requirements, and its object-oriented nature makes it easy to write variations of existing modules using subclassing. Writing new modules is also simplified by providing a range of general-purpose reusable classes, such as classes for recording histograms and other statistics.

Mercator is scheduled to be included in the next version of the AltaVista Search Intranet product [AltaVista], a version of the AltaVista software sold mostly to corporate clients who use it to crawl and index their intranets.

Acknowledgements

Thanks to Sanjay Ghemawat for providing the high-performance Java runtime we use for our crawls.

References

AltaVista, "AltaVista Software Search Intranet Home Page," altavista.software.digital.com/search/intranet.
BIND, "Berkeley Internet Name Domain (BIND)," www.isc.org/bind.html.
Bloom, B. (1970), "Space/Time Trade-Offs in Hash Coding with Allowable Errors," Communications of the ACM 13, 7, 422–426.
Brin, S. and L. Page (1998), "The Anatomy of a Large-Scale Hypertextual Web Search Engine," In Proceedings of the Seventh International World Wide Web Conference, pp. 107–117.
Broder, A. (1993), "Some Applications of Rabin's Fingerprinting Method," In Sequences II: Methods in Communications, Security, and Computer Science, R. Capocelli, A. De Santis, and U. Vaccaro, Eds., Springer-Verlag, pp. 143–152.

Burner, M. (1997), "Crawling Towards Eternity: Building an Archive of the World Wide Web," Web Techniques Magazine 2, 5.
Cho, J., H. Garcia-Molina, and L. Page (1998), "Efficient Crawling Through URL Ordering," In Proceedings of the Seventh International World Wide Web Conference, pp. 161–172.
DCPI, "Digital Continuous Profiling Infrastructure," www.research.digital.com/SRC/dcpi/.
Eichmann, D. (1994), "The RBSE Spider – Balancing Effective Search Against Web Load," In Proceedings of the First International World Wide Web Conference, pp. 113–120.
Ghemawat, S., "srcjava home page," www.research.digital.com/SRC/java/.
Google, "Google! Search Engine," google.stanford.edu/.
Gray, M., "Internet Growth and Statistics: Credits and Background," www.mit.edu/people/mkgray/net/background.html.
Henzinger, M., A. Heydon, M. Mitzenmacher, and M.A. Najork (1999), "Measuring Index Quality Using Random Walks on the Web," In Proceedings of the Eighth International World Wide Web Conference, pp. 213–225.
Heydon, A. and M. Najork (1999), "Performance Limitations of the Java Core Libraries," In Proceedings of the 1999 ACM Java Grande Conference, pp. 35–41.
InternetArchive, "The Internet Archive," www.archive.org/.
Koster, M., "The Web Robots Pages," info.webcrawler.com/mak/projects/robots/robots.html.
McBryan, O.A. (1994), "GENVL and WWWW: Tools for Taming the Web," In Proceedings of the First International World Wide Web Conference, pp. 79–90.
Miller, R.C. and K. Bharat (1998), "SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers," In Proceedings of the Seventh International World Wide Web Conference, pp. 119–130.
Pinkerton, B. (1994), "Finding What People Want: Experiences with the WebCrawler," In Proceedings of the Second International World Wide Web Conference.
Rabin, M.O. (1981), "Fingerprinting by Random Polynomials," Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University.
RobotsExclusion, "The Robots Exclusion Protocol," info.webcrawler.com/mak/projects/robots/exclusion.html.
Smith, Z. (1997), "The Truth About the Web: Crawling Towards Eternity," Web Techniques Magazine 2, 5.
