Wednesday, February 01, 2006

Relevance rank and broad queries

I was playing with a few broad search queries recently and the results surprised me.

I was expecting any query that returns most of the Web (such as [the]) to yield links to the sites with the highest PageRank. Such an indiscriminate query would seem to provide very little basis for doing much else.

So, I would expect Google search results for the most popular English words ([the], [of], [to], [and]) to be the same as each other, and the same as the search results for [* *] (any two words separated by a space), [the * the] ("the" followed by some words followed by "the"), and [1..1000] (any number between 1 and 1000).

As you can see by clicking on those links, all of those queries return most of the Web, between 5B pages (for [the * the]) and 20B pages (for [and] and [to]). But they return very different results.

The top result for [the] is "The Onion". The top result for [of] is "The Library of Congress". For [to], "Welcome to the White House". For [and], NASA.

The results for [1..1000] seem to be the closest to what I was expecting. They show links to Netscape, Mozilla, Microsoft, IE, Macromedia Flash, and Apple Quicktime. Perhaps these are the most linked-to sites on the Web?

But, no, even those do not appear to be in PageRank order. For example, the first search result, the Netscape site, only has a PageRank of 8/10 and the Macromedia Flash site a PageRank of 5/10. Huh, again, not quite what I was expecting.

Curious. Why would these results differ so wildly? All of these pages contain these words. Why the difference in what makes it to the top?

Perhaps we should look at what other sites do as well. What do the results look like for [the], [of], [to], and [and] on Yahoo Search? They also differ from each other and from Google's. In fact, these results seem even stranger, with Crate and Barrel making it to the top on [and] and some site called To-Done being the top for [to]. Hmm...

Only MSN Search behaves even remotely close to what I was expecting. Results for [of], [to], [the], and [and] are fairly similar.

Looking at the Google results again, it may be the case that page title and text in links count for a fair amount. "The Onion" may get its prime spot on a search for [the] because a lot of people link to it as The Onion. But does that explain why the White House gets top billing for [to] when only the page title ("Welcome to the White House") has that word?

Perhaps this is just a spot where small weightings deep in the Google guts make a nonsensical difference. When there's so little information about searcher intent -- a search for [the] or [and] -- it matters little what you show. Small tweaks here and there that might make a difference when intentions are clearer are probably just revealing themselves in odd ways for these broad queries.
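
To make that guess concrete, here is a toy sketch in Python. It is not Google's algorithm. The pages, PageRank values, link text, and weights are all made up for illustration. It only shows that when every page matches the query, small bonuses for a title match or a link-text match can decide the top spot, so [the] and [to] come out in different orders even though a PageRank-only ranking would be identical for both.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    name: str
    pagerank: float  # query-independent score on a 0..10 scale (invented values)
    title: str
    anchors: list = field(default_factory=list)  # text of inbound links (invented)


def contains(term, text):
    """True if term appears as a whole word in text, ignoring case."""
    return term in text.lower().split()


def score(page, term, w_pr=1.0, w_title=2.0, w_anchor=3.0):
    """Hypothetical score: PageRank plus small title and link-text bonuses."""
    title_hit = 1.0 if contains(term, page.title) else 0.0
    if page.anchors:
        hits = sum(contains(term, a) for a in page.anchors)
        anchor_frac = hits / len(page.anchors)
    else:
        anchor_frac = 0.0
    return w_pr * page.pagerank + w_title * title_hit + w_anchor * anchor_frac


def rank(pages, term, **weights):
    """Return page names ordered from highest to lowest score."""
    ordered = sorted(pages, key=lambda p: score(p, term, **weights), reverse=True)
    return [p.name for p in ordered]


# Assume every page below contains both "the" and "to" somewhere in its body,
# so the query itself excludes nothing.
PAGES = [
    Page("bigportal", 9.0, "Big Portal Home", ["Big Portal", "portal"]),
    Page("onion", 7.0, "The Onion", ["The Onion", "the onion", "Onion"]),
    Page("whitehouse", 8.0, "Welcome to the White House",
         ["White House", "whitehouse.gov"]),
]

if __name__ == "__main__":
    print(rank(PAGES, "the"))
    print(rank(PAGES, "to"))
    # With the bonuses switched off, only PageRank is left.
    print(rank(PAGES, "the", w_title=0.0, w_anchor=0.0))
```

In this made-up data, the link-text bonus lifts the Onion page for [the], and the title bonus alone lifts the White House page for [to]. Turning both bonuses off gives the plain PageRank order I was originally expecting for every one of these queries.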

Nevertheless, I thought it was curious. Not what I expected to see.


Dave said...

What would someone want to find by searching for [the], [of], or [to]? I suppose there could be organizations with those acronyms for a name, or perhaps non-English languages for which those are not stop words.

As you said, given the complete lack of information about the searcher's intent, you may as well return pages in random order. The actual results might only be interesting to a reverse engineer(er?)...

Anonymous said...

Indeed, PageRank has next to zero effect on search engine placement. PageRank affects how much Google Juice you have to *give* as opposed to how much placement love you *receive*. I actually just published a long article on this yesterday, complete with vacuum tests: Lessons From The Roundabout SEO Test.