Tuesday, May 09, 2006

Wikipedia and databases

Brion Vibber from Wikimedia gave an interesting talk recently at Google about Wikipedia. Most of the talk was about scaling Wikipedia under the recent massive spike in demand.

Their scaling strategy relies heavily on a large caching layer that focuses on caching entire HTML pages. The idea here is that the vast majority of accesses to Wikipedia are anonymous reads, so the same pages can be served up to those people.

This does work pretty well -- apparently, 78% of accesses are served from the Squid caches and another 7% on top of that get served from Memcached pages -- but it appears they basically have to skip the cached page if anything on the page is different, including for logged-in users.

During the talk, I was a little surprised they were not more focused on caching at the data layer, focusing on making the underlying databases serve data rapidly instead of trying to avoid the databases.

If I understand it right, the Wikipedia architecture has all the Wikipedia data thrown into one giant database. Everything is in there, all the tables, even the large indexes for full text search. Then, they tossed on a few slave databases that appear to take replicated copies of the entire database.

All this data on one machine appears to mean they are hitting disk a lot on the database servers.

That doesn't seem necessary. The entire Wikipedia database appears to be a few hundred gigabytes. It should be possible to get most of that in-memory in a horizontally partitioned database cluster of a couple dozen machines.
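As a rough back-of-envelope check on that claim, here is a minimal sketch. The specific numbers are my assumptions, not figures from the talk: 300 GB stands in for "a few hundred gigabytes" and 24 machines for "a couple dozen".

```python
def per_machine_gb(total_gb: float, machines: int, replicas: int = 1) -> float:
    """Data each machine must hold if the database is split evenly.

    replicas is the number of full copies of every shard kept in the cluster.
    """
    return total_gb * replicas / machines


# Assumed figures, not measurements of the real Wikipedia database.
TOTAL_GB = 300   # "a few hundred gigabytes"
MACHINES = 24    # "a couple dozen machines"

single_copy = per_machine_gb(TOTAL_GB, MACHINES)
two_copies = per_machine_gb(TOTAL_GB, MACHINES, replicas=2)
print(f"one copy:   {single_copy} GB per machine")
print(f"two copies: {two_copies} GB per machine")
```

Under those assumptions each box holds 12.5 GB for a single copy, or 25 GB if every shard is kept twice for redundancy. Whether that fits in memory depends on the hardware, which I am not assuming anything about here.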

In an ideal world, we'd be talking about something like the Google Cluster, where shards of the data are distributed across many machines and accessed in parallel, or a big virtual database like I craved in an earlier post.

But, let's stick with low hanging fruit here. So, to start, I would pull text search out of MySQL. Yes, I know, it's so lazily convenient to use MySQL full text search, but the performance doesn't seem to be where it needs to be. Moreover, it almost certainly is desirable for something like search to have its own dedicated hardware, not to be competing for resources with an RDBMS on the same box.

Then, I'd partition the data. Get it so each box only has a shard of the data big enough that the disk is quiet. I really am tempted to start rambling about something other than a simple partition of the existing MySQL tables and using MySQL replication here, but we're talking about low hanging fruit, so I'll stick with MySQL and what is simple to get done.
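To make the idea of a simple partition concrete, here is a minimal sketch of routing pages to shards by hashing the page title. This is only an illustration, not how Wikipedia lays out its data: the function names, the choice of page title as the partition key, and the shard count of 24 are all hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 24  # assumption: "a couple dozen" database boxes


def shard_for(title: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a page title to a shard number using a stable hash.

    hashlib is used instead of the built-in hash() so the mapping is the
    same in every process and on every machine.
    """
    digest = hashlib.md5(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(titles, num_shards: int = NUM_SHARDS):
    """Group titles by the shard that would hold them."""
    groups = defaultdict(list)
    for title in titles:
        groups[shard_for(title, num_shards)].append(title)
    return dict(groups)


if __name__ == "__main__":
    pages = ["Gerotor", "MySQL", "Squid (software)", "Memcached"]
    for shard, titles in sorted(group_by_shard(pages).items()):
        print(shard, titles)
```

Each shard here would be an ordinary MySQL database with its own replicas, and the application would call something like shard_for to decide which one to query. One known drawback of plain modulo hashing is that changing the number of shards moves most keys, so a real deployment would likely want a lookup table or consistent hashing instead.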

If almost all the Wikipedia database accesses never touched a disk, I suspect performance of the database layer might be good enough that parts of that HTML caching layer become unnecessary or even counterproductive. If so, the freed up boxes could be shifted from the HTML caching layer to the database layer, improving performance and scalability even further.

Near the end of the talk, Brion seemed to suggest that Wikipedia was going to focus their scaling efforts on boosting cache hit rates, more work on that last layer right before the client. It appeared to me that it might involve some fairly complicated work to figure out exactly when parts of the coarse-grained, HTML cache are valid.

I wonder if the focus might be better spent on the data layer, getting the performance there to the point that caching further out becomes unnecessary.

I have to say, it is great fun looking at this kind of large scale problem. I wish I could dig in deeper and pore over profile data for the site.

The efforts of the tiny Wikipedia development team are really impressive. Not only have they managed to scale to crushing loads, but they have also done it on an unbelievably meager budget. They have much deserved loyalty from a community that wants to see them continue to thrive and grow.

Update: Interesting March 2006 messages ([1] [2]) on the Wikitech mailing list from Brion Vibber show that Wikipedia is entertaining the idea of switching to MySQL Cluster. An excerpt:
We don't support MySQL Cluster at this time, as it's currently limited in many ways. (Everything must fit in memory, can't have certain field types, etc.)

Currently waiting for the upcoming versions which allow disk-backed data, etc.

No point in rushing in when our own contributors who are MySQL employees are telling us to wait for it to mature a bit.
MySQL Cluster looks tempting but, as Brion said, it wouldn't work right now, since not everything would fit in memory within the maximum cluster size and other restrictions. Probably better to look at manually partitioning the data to get it in memory.

See also my post, "MySQL Cluster and my big virtual database".


Anonymous said...

Full text search on wikipedia is currently done by Lucene and by redirecting users to yahoo, google and other local search engines.

Greg Linden said...

Thanks, Anonymous. I did not realize Wikipedia was already using Lucene. It looks like that was a recent change, as of April 2005. Definitely a good idea.

Redirecting to web search engines is a way to quickly and easily resolve the search performance problem, but I suspect it would cause searchers to miss recent updates to Wikipedia. Is that not the case?

Anonymous said...

Google seems to cache it very, very quickly. I added the Gerotor article and it was added to Google almost immediately.

Anonymous said...

In fact, the Google search is much more up-to-date than the Lucene search, which can take months before it's updated.

Greg Linden said...

Then I guess we're back to where we started. Search does need some serious work at Wikipedia.

The search should be fast, efficient, and be able to see updates to Wikipedia in real time.

If moving from MySQL full text to Lucene caused the search index to be so stale as to be useless -- better off querying Google, as you two said -- then that clearly doesn't work.

So, why is the Lucene index stale? Why can't it be updated in near real-time or supplemented with another search index that is updated in near-real time?

Anonymous said...

Believing that we run a single database cluster is quite a false assumption. I'd suggest checking the posts / presentations about our Wikipedia data layout on my site :-)