Comments on Geeking with Greg: Crawling is harder than it looks

2008-06-04T06:17:00.000-07:00

AdSense...

2008-05-24T01:12:00.000-07:00

"...G even using GA and AS data"

What does 'AS' refer to?

2008-05-20T00:27:00.000-07:00

Vicaya should probably keep in mind that Google/Yahoo use clusters of thousands of servers to crawl the web. Also, the paper says the crawler was limited by the university's bandwidth.

I would imagine that with a mere $20M invested into the crawler, this project could do 6B pages in 1 day.

2008-05-09T19:22:00.000-07:00

@anon: Agreed. It's an important contribution. I was just lamenting the fact that neither G nor Y has published anything on crawling since the AltaVista papers.

AFAIK, both G & Y's crawls are now driven by click data, with G even using GA and AS data. So, yes, nobody is catching up to G anytime soon, unless they commit seppuku themselves.

2008-05-08T02:24:00.000-07:00

@vicaya:

Even if the paper brings nothing new when compared to the current state of the art in the industry, this contribution is very important.

What is not published is not available to future generations, or even to developing nations. Science can build only on publicly available knowledge. Knowledge inside Google or Yahoo! is restricted to a minority.

2008-05-07T21:38:00.000-07:00

Crawling is hard for sure.

Kudos to the authors for publishing the paper, but the techniques described in it just rediscover (with minor interesting variations) a small fraction of the tricks G and Y have used for years. It's sad that, with all those PhDs they have, they don't publish anything new in this area, in the name of "competitive advantages".

6 billion+ valid responses recorded in 40+ days is not that impressive, esp. when they were not storing all the content. The numbers are pretty good (an average of 319 Mbps on a 1 Gbps uplink, though a state-of-the-art web crawler these days should be able to saturate any pipe in existence) but really didn't blow me away given the hardware spec. The comment in the paper about scaling to multiple nodes is naive at best.
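For anyone who wants to sanity-check those figures, here is a rough back-of-the-envelope sketch in Python. It treats "6 billion+" and "40+ days" as exactly 6e9 pages and 40 days, and it assumes all of the average bandwidth went to fetching those pages, so the per-page size is only an upper-bound estimate. The function name `crawl_stats` and its parameters are made up for illustration and are not from the paper.

```python
SECONDS_PER_DAY = 86_400


def crawl_stats(pages, days, avg_mbps, uplink_mbps):
    """Return (pages per second, average bytes per page, uplink utilization).

    Assumes every transferred byte belongs to one of the counted pages,
    which overstates the real average page size.
    """
    seconds = days * SECONDS_PER_DAY
    pages_per_sec = pages / seconds
    # Convert megabits per second to bytes per second.
    bytes_per_sec = avg_mbps * 1_000_000 / 8
    bytes_per_page = bytes_per_sec / pages_per_sec
    utilization = avg_mbps / uplink_mbps
    return pages_per_sec, bytes_per_page, utilization


# Figures quoted in the comment above, rounded down to the stated minimums.
rate, size, util = crawl_stats(pages=6e9, days=40, avg_mbps=319, uplink_mbps=1000)
print(f"{rate:,.0f} pages/s, ~{size / 1000:.0f} kB/page, {util:.0%} of the uplink")
```

Under those assumptions the crawl averages roughly 1,700 pages per second and uses about a third of the uplink, which is the gap the comment is pointing at.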