Monday, July 07, 2008

Google Toolbar data and the actual surfer model

There were a few interesting developments in the past couple weeks from Google that appear to relate to their ability to track much of the movements on the web.

As MG Siegler from VentureBeat noted, among many others, Google Trends now offers a feature that shows the traffic many websites get, much like Alexa does using data from the Alexa toolbar. The new functionality also shows similar pages and queries: people who visited site X also visited Y, and people who visited site X also searched for Y.

Danny Sullivan, when talking about similar features Google launched in a new tool called Google Ad Planner, wrote:
I specifically asked ... [if] the [Google] toolbar is NOT [being used], and [a] "secret sauce" reply ... is all I got.

That makes me think that toolbar data IS being used. In particular, the focus on Google Analytics data feels like a sideshow. Google can't rely on Google Analytics as a core data source for this information, because of the simple reason that not every site runs it. In contrast, using Google Toolbar data would give them a nearly complete sample of all sites out there.
Erick Schonfeld at TechCrunch followed up with a post, "Is Google Ad Planner Getting Its Data From The Google Toolbar?"

There isn't much new about using this data in the way Google Trends has revealed. Alexa and others have been doing it with their toolbars for many years. What is new is that Google Toolbar is installed much more widely, including on every Dell computer and in every installation of Adobe Flash.

I have no information about whether Google is using their toolbar data, but I have a hard time believing they could resist it. Not only can it be used for things like Google Trends, but it could have a dramatic impact on core web search.

PageRank, the original core of Google's search relevance, is an analysis of the links between websites that simulates someone randomly surfing across the links, the so-called random surfer model.
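As a rough illustration of the random surfer model, here is a minimal Python sketch of PageRank computed by repeated iteration. The three-page link graph, the damping factor of 0.85, and the function name `pagerank` are assumptions chosen for illustration; this is the textbook form of the idea, not Google's production system.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate PageRank for a small link graph.

    links maps each page to the list of pages it links to.
    The surfer follows a random outgoing link with probability
    `damping`, and otherwise jumps to a page chosen uniformly at random.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            outlinks = links.get(page, [])
            if outlinks:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a links to b and c, b links to c, c links to a.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(example_links)
```

Note that every outgoing link on a page gets an equal share of that page's rank. The model has no idea which links people actually click.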

With toolbar data, you no longer need this approximation of a random surfer model. You have the actual surfer model.

You know exactly how people move across the web. You know which sites are popular and which sites are never visited. You know which links are traversed and how often. You know everything about where people go and what people want.
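To make that concrete, here is a small Python sketch, assuming toolbar data arrives as one ordered list of visited pages per browsing session. The sample trails, the page names, and the helper name `traffic_model` are all hypothetical; nothing here reflects how Google actually stores or processes toolbar logs.

```python
from collections import Counter, defaultdict


def traffic_model(trails):
    """Build an empirical surfer model from observed browsing trails.

    trails is a list of sessions, each an ordered list of visited pages.
    Returns (popularity, link_prob):
      popularity[page]       = fraction of all observed visits that hit page
      link_prob[src][target] = fraction of observed moves out of src that went to target
    """
    visits = Counter()
    transitions = defaultdict(Counter)

    for trail in trails:
        visits.update(trail)
        # Count each observed move between consecutive pages in a session.
        for source, target in zip(trail, trail[1:]):
            transitions[source][target] += 1

    total_visits = sum(visits.values())
    popularity = {page: count / total_visits for page, count in visits.items()}
    link_prob = {
        source: {t: c / sum(targets.values()) for t, c in targets.items()}
        for source, targets in transitions.items()
    }
    return popularity, link_prob


# Hypothetical sessions, purely for illustration.
sample_trails = [
    ["news", "blog", "shop"],
    ["news", "shop"],
    ["blog", "shop"],
    ["news", "shop"],
]
popularity, link_prob = traffic_model(sample_trails)
```

In this toy data, two of the three observed moves out of "news" go to "shop", so that link gets twice the weight of the link to "blog", where a random surfer model would treat them equally. A real system would also have to deal with sampling bias and deliberate manipulation, which this sketch ignores.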

Data like that should allow tremendous advances in relevance. It is hard for me to believe that Google would not be using their toolbar data, not just for Google Trends, but also in search and advertising.

Please see also my earlier post, "Ranking using Indiana University's user traffic", which discusses a paper from WSDM 2008 that attempts to supplement the PageRank random surfer model with an actual surfer model.

12 comments:

jeremy Pickens said...

..including on every Dell computer and in every installation of Adobe Flash.

Didn't the software industry already go through this ten years ago, and collectively determine that bundled software is kinda.. evil?

rubens said...

I didn't find any authorization for this in the privacy policy of Google Toolbar.

Hellblazer said...

Greg, what is the penetration percentage of Google Tool Bar? Single digits? Double? High double?

Seems to me that if you're claiming that they no longer need a model, in that they have the actual system tracked, I'd like to think there's some actual statistical reasoning behind that claim. Unless Google Toolbar is used by the median surfer, they better have darn near 100% penetration... Something that I doubt...

Eric Goldman said...

As you point out, Google is going to have to tap into toolbar data at some point (if it isn't already) because desktop activities are the best source of information about consumer preferences. Eric.

Greg Linden said...

Hellblazer, I'm not sure why you think they'd need 100% penetration? You don't think they could do the actual surfer model with a sample?

On your question, I don't know what the penetration of Google Toolbar is. I doubt anyone outside of Google knows.

However, the penetration of Adobe Flash appears to be 98.8%. Google Toolbar is installed by default for Windows/IE users whenever Flash is installed or updated from Adobe's website, so I suspect Google Toolbar penetration is quite good.

Greg Linden said...

Ruben, I don't want to get into a discussion of Google Toolbar's privacy policy, but let me at least point out that it does say, "Uses: We process your requests in order to operate and improve the Google Toolbar and other Google services" (emphasis added).

Hellblazer said...

I'm not sure why you think they'd need 100% penetration? You don't think they could do the actual surfer model with a sample?

Well, sure - for a model. But I thought you were making a far stronger claim in - you know - saying that they didn't need a model and that they now had data on the real thing.

I doubt anyone outside of Google knows.

But this would seem to be critical to the understanding of their model. Sample bias is one of the nastiest things in building such a model. If it's a bunch of bleeding edge types who never click on ads (I haven't and don't know anyone who ever does, for example) then the model derived from such data would have a profound impact on its usefulness for - well - ads...

jeremy said...

Greg, from the link you sent in your original blogpost:

For Adobe Reader and Adobe Flash Player, the Google Toolbar is only being offered from the Adobe website and is not part of the Reader or Flash Player installers. Visitors to the Reader Download Center and Flash Player Download Center on Adobe.com may be presented with the option to install the Google Toolbar as part of the download process.

So it appears that Google Toolbar is not automatically installed when a Windows user installs Flash. The user is given the option of saying no. So Google Toolbar penetration has got to be less than 98.8%.

Gojomo said...

Sure, the sample of users who let toolbars report their every click elsewhere is likely biased. (It might be biased in a useful way, though -- these trusting folk may be the best ad-clickers and online-spenders, too!)

But they can get other samples many other ways. Even Analytics, AdSense, or widget insert gives insight into the slice of web traffic passing through the instrumented pages (possibly including both inbound and outbound destinations). Every Google-hosted service (Blogger, Pages, Appspot, AJAX Libraries API, etc.) gives another glimpse.

These slices could be correlated with toolbar data and Google registrations to understand exactly what kinds of users (by geography, browser, tastes, click habits, implied-demographics, etc.) the toolbar misses.

They can also purchase bulk user browsing data from ISPs. Even anonymized, it's another broad sample that people can't opt-out of, and which can be contrasted with toolbar data to get an idea of the relative biases of each dataset.

Even without anywhere near 100% coverage, the samples could be excellent for Google's purposes -- understanding searchers and buyers -- and let them weight the edges of the web graph by the actual propensity of people to follow those edges.

Anonymous said...

Another issue that has popped up with the new Google Website Trends is the fact that Google is hiding information from its own domains.

Try getting Website trends for blogspot.com or google.com. It is blocked.

I think this is a very worrying sign that clearly shows where Google draws the line.

As someone said,

"if the user is #1 for Google, then Google is #0."

This is an important decision that clearly confronts people with the fact that they control information and they can decide which information to show and which not to show.

burtonator said...

Sample bias is one issue you're going to hit if you plan on just using toolbar data... this has been mentioned in the comments already.

One thing to note is that if you use an 'actual surfer model' you turn control of ranking pages over to users who can easily become spammers.

The number of botnets now is non-trivial and one could easily launch a rank augmentation crawl.

Note that in the WSDM paper you cited they are crawling from a fairly sane university network which wouldn't necessarily exhibit this type of botnet activity.

martind said...

Yeah, AdSense is probably a big source of attention data at this point, even if not as detailed as via their toolbar.

Also a potential source: the Firefox phishing protection service, which can be configured to check every visited site against a Google service (as opposed to the default setting, which just checks a local blacklist.)