Wednesday, May 07, 2008

Crawling is harder than it looks

The best paper award at WWW 2008 went to a paper on large-scale crawling titled "IRLbot: Scaling to 6 Billion Pages and Beyond" (PDF) by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov.

I have never been that interested in crawling, so I almost missed this talk. But, I am glad I didn't.

The paper is a fascinating look at the difficulty of doing very large scale web crawling, focusing on avoiding web spam and pathological link structures that could trap a crawler, making sure to not overwhelm crawled webservers, and using high performance techniques for duplicate detection.

Some extended excerpts:
The web has changed significantly since the days of early crawlers, mostly in the area of dynamically generated pages and web spam. With server-side scripts that can create infinite loops, high-density link farms, and unlimited number of hostnames, the task of web crawling has changed from simply doing a BFS scan of the WWW to deciding in real time which sites contain useful information and giving them higher priority as the crawl progresses.

The first performance bottleneck we faced was caused by the complexity of verifying uniqueness of URLs and their compliance with robots.txt. As N scales into many billions ... [prior] algorithms ... no longer keep up with the rate at which new URLs are produced by our crawler (i.e., up to 184K per second) ... [A] new technique called Disk Repository with Update Management (DRUM) ... can store large volumes of arbitrary hashed data on disk and implement very fast check, update, and check+update operations using bucket sort ... DRUM can be thousands of times faster than prior disk-based methods.

In order to determine the legitimacy of a given domain, we use a very simple algorithm based on the number of incoming links from assets that spammers cannot grow to infinity. Our algorithm, which we call Spam Tracking and Avoidance through Reputation (STAR), dynamically allocates the budget of allowable pages for each domain and all of its subdomains in proportion to the number of in-degree links from other domains.

We ran IRLbot on a [single] quad-CPU AMD Opteron 2.6 GHz server (16 GB RAM, 24-disk RAID-5) attached to a 1-gb/s link ... [for a] total active crawling span of 41.27 days. During this time, IRLbot attempted 7,606,109,371 connections and received 7,437,281,300 valid HTTP replies ... IRLbot ended up with N = 6,380,051,942 responses with status code 200 and content-type text/html.

The average download rate during this crawl was 319 mb/s (1,789 pages/s) with the peak 10-minute average rate of 470 mb/s (3,134 pages/s). The crawler received 143 TB of data, out of which 254 GB were robots.txt files, and transmitted 1.8 TB of HTTP requests. The parser processed 161 TB of HTML code.
At very large scale and from a single box, those are pretty remarkable crawling rates, 2k pages/second. The details of the bottlenecks they encountered and how they overcame them made this paper a quite enjoyable read.

I did end with a question about part of the work. The authors say (in 7.1 and Figure 6) that the probability of finding a unique page never dropped below .11 and, earlier (in 1.4) say that "we believe a good fraction of the 35B URLs not crawled in this experiment [contain] useful content." However, they define a unique page as the same URL, so it seems like it could easily be the case that those 35B URLs seen but not crawled could have duplicate or near duplicate content to the 6B crawled.

Update: One topic the IRLbot paper ignored was how frequently we should recrawl a page. Another WWW 2008 paper, "Recrawl Scheduling Based on Information Longevity" (PDF) by Chris Olston and Sandeep Pandey, has a nice overview of that issue and then extends the prior work by focusing on the persistence of a web page. The authors end up arguing that it is not worth recrawling pages that change very rapidly very frequently because it is impossible to ever get the index to match the actual content of the page.

Update: On the question of what content is useful, both in terms of crawling and recrawling, I always find myself wondering how much we should care about pages that never get visited by anyone. Shouldn't our focus in broadening a crawl beyond 6B pages and in recrawling pages in our index depend on how much users seem to be interested in the new content?

6 comments:

vicaya said...

Crawling is hard for sure.

Kudos to the authors for publishing the paper, but the techniques described in the paper just rediscovered (with minor interesting variations) a small fraction of tricks used by G and Y for years (it's sad that with all those PhDs they have, they don't publish anything new in this area, in the name of "competitive advantages").

6 billion+ valid responses recorded in 40+ days is not that impressive, esp. when they were not storing all the content. The numbers are pretty good (average 319mbps on a 1gb uplink. A state of the art web crawler these days should be able to saturate any pipe in existence) but really didn't blow me away given the hardware spec. The comment in the paper about scaling to multiple nodes is naive at best.

Anonymous said...

@vicaya:

Even if the paper brings nothing new when compared to the current state of the art in the industry, this contribution is very important.

What is not published is not available for future generations, or even developing nations. Science needs to look only to publicly available knowledge. Knowledge inside Google or Yahoo! is restricted to a minority.

vicaya said...

@anon: Agreed. It's an important contribution. I just lamented the fact that neither G nor Y published anything on crawling since the altavista papers.

AFAIK, both G & Y's crawls are now driven by click data, with G even using GA and AS data. So, yes, nobody is catching up to G anytime soon, unless they commit seppuku themselves.

Anonymous said...

Vicaya should probably keep in mind that Google/Yahoo use clusters of thousands of servers to crawl the web. Also the paper says the crawler was limited by the university bandwidth.

I would imagine that with a mere $20M invested into the crawler, this project could do 6B pages in 1 day.

majorlinux said...

"...G even using GA and AS data"

What does 'AS' refer to ?

Praveen said...

AdSense...