Friday, September 01, 2006

Google Analytics and Bigtable

Another tidbit I found curious in the Google Bigtable paper was the massive size of the Google Analytics data set stored in Bigtable.

The paper says that 250 terabytes of Google Analytics data are stored in Bigtable. That's more than all the images for Google Earth (71T). It is the second largest data set in Bigtable, behind only the 850T of the Google crawl.

Why is it so big? The way I had assumed Google Analytics worked is that it maintained only the summary data for each website. That would be a very small amount of data, nowhere near 250T.

Instead, it appears Google Analytics keeps all the information about user behavior on all sites using Google Analytics permanently, online, and available for various analyses. That would explain 250T of data.

What data does Google Analytics collect? From the Google Analytics help page:
Google Analytics uses a first-party cookie and JavaScript code to collect information about visitors and to track your advertising campaign data.

Google Analytics anonymously tracks how visitors interact with a website, including where they came from, what they did on a site, and whether they completed any of the site's conversion goals.

Analytics also keeps track of your e-commerce data, and combines this with campaign and conversion information to provide insight into the performance of your advertising campaigns.
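As an aside, the collection mechanism itself is tiny. From memory, the snippet site owners paste into their pages looks roughly like this (a sketch, not copied from Google's docs; the "UA-XXXXXX-X" account id is a placeholder):

```html
<!-- Google Analytics / Urchin tracking snippet (sketch; account id is a placeholder) -->
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script>
<script type="text/javascript">
_uacct = "UA-XXXXXX-X";  // the site's Analytics account id
urchinTracker();         // sets the first-party cookie and logs the pageview to Google
</script>
```

Every page load on every participating site fires one of these, which is how the data set grows so fast.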
The data would be quite useful. The Google crawl only gives information about the web pages out there. Google search logs only give information about what people search for and what search results they click on.

Google Analytics data would tell Google what people are doing on other websites, including how often they go to the site, where they came from, and what they do when they get there. It could be quite useful as part of determining the relevance of sites and how people transition between sites.

See also my previous posts, "Google Personalized Search and Bigtable" and "Google Bigtable paper".


Anonymous said...

Greg: You are absolutely right, GA will keep all the data for every visit (session) for all its websites and it would very easily trump any other source of data that Google has. My hypothesis is that GA will overtake even the search crawl in terms of size in due course, for many reasons. :)

Thanks (your blog is quite insightful).

chad said...

what i don't understand is why search personalization is more effective than drilldown. so... let's say google has a specialized index tailored towards medical. if i have a profile that indicates i'm a medical specialist, fine, give me biased results. but... if i have a session going and they've never seen me before, i might be some kind of phd researcher. i type in a few terms, and it can then bias my results if i keep refining. right? where's the win here, exactly?

Greg Linden said...

Hi, Avinash. I think you are right that Google Analytics data will be the largest data set in a year or two. Thanks, it is good to hear you enjoy my weblog.

Chad, the idea behind personalized search is that it can supplement your search with additional information without requiring you to explicitly enter more information. More gain for less work.

I agree that tools to help people refine a query are helpful, but they require effort. I also think advanced search is quite good for helping people find what they want, but only 1% of people use advanced search. Why so few? Because it requires more effort.

We need to help people find what they need with as little effort as possible.

Anonymous said...

Thanks for digging these gems out of google's papers.

I've been using Google Analytics on some sites (the blog above is a journal of some of that work). But I haven't yet talked about the use of analytics.

Reading your post brings to mind the interesting privacy implications of a web service heavy world -- where a single page load is hitting and pinging all kinds of services that the user may not realize.

chad said...

seems to me though that personalized search on a google style engine doesn't have a big win available. all it takes is one query and suddenly the engine has enough info to crunch a bunch of data and figure out the user is interested in X topic a lot... and if the user really is looking for a certain topic a lot, there isn't a huge difference between biasing results for them and letting an anonymous user just go over to that section of the google index via a starter query.

not sure if i'm being clear here but it feels like a solution looking for a problem.

if a user really needs to find some info they'll do a drilldown anyway.

personalized search probably makes more sense in the context of an RSS aggregator type experience where a user is constantly looking for stuff to be directed at them even though they aren't explicitly searching for it. thats useful.

Meme chose said...

I search in different modes depending on my purpose, and sometimes other people use my computer. My problem with personalization is that it's not transparent - once the search results I see are monkeyed with I have no audit trail to see what they did - I've lost the sense that I'm seeing what any other web user can see; the web is no longer a common shared resource.

Is it enough to flush my cookies on a regular basis in order to ensure access to an 'unbiased/unpersonalized' search experience, or do you think they are going to start customizing search results by IP address?

Maybe some of this activity by Google and others is ultimately self-defeating, by promoting the tendency to erase cookies, or may even end up placing a premium on access to dynamically allocated IP addresses?

Anonymous said...

Hi Greg,

I hope you realize Analytics can be used by any site that chooses to sign up for it, and that includes many large sites. Look for the urchin script in the html code of these websites.
Over months, that shows why the database grows.
And it is free.