Googler Daniel Tunkelang recently wrote a post, "LinkedIn Search: A Look Beneath the Hood", that has slides from a talk by LinkedIn engineers along with some commentary on LinkedIn's search architecture.
What makes LinkedIn search so interesting is that it combines real-time updates (the "time between when user updates a profile and being able to find him/herself by that update need to be near-instantaneous"), faceted search (">100 OR clauses", "NOT support", complex boolean logic, some facets are hierarchical, some are dynamic over time), and personalized relevance ranking of search results (ordered by distance in your LinkedIn social graph).
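To get a feel for what those facet queries look like, here is a minimal sketch in the classic Lucene 3.x-era API (my own illustration, with made-up field names like "company" and "industry"; this is not LinkedIn's code): selected facet values become a large OR, exclusions become NOT clauses, and both are ANDed with the keyword query.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

import java.util.List;

public class FacetQuerySketch {

    // Hypothetical field names ("company", "industry") for illustration only.
    public static Query build(Query keywordQuery,
                              List<String> companyIds,
                              List<String> excludedIndustries) {
        // One SHOULD clause per selected facet value; the talk mentions queries
        // with >100 OR clauses, and Lucene's default BooleanQuery clause limit
        // (1024) caps how far this scales without raising it.
        BooleanQuery companyFilter = new BooleanQuery();
        for (String id : companyIds) {
            companyFilter.add(new TermQuery(new Term("company", id)), Occur.SHOULD);
        }

        BooleanQuery query = new BooleanQuery();
        query.add(keywordQuery, Occur.MUST);   // the free-text part of the search
        query.add(companyFilter, Occur.MUST);  // facet restriction: OR of selected values
        for (String industry : excludedIndustries) {
            // "NOT support": exclude results matching an unwanted facet value
            query.add(new TermQuery(new Term("industry", industry)), Occur.MUST_NOT);
        }
        return query;
    }
}
```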
LinkedIn appears to use a combination of aggressive partitioning, keeping data in-memory, and a lot of custom code (mostly modifications to Lucene, some of which have been released open source) to handle these challenges. One interesting tidbit is that, going against current conventional wisdom, LinkedIn appears to use caching only minimally, preferring to spend their effort and machine resources on making sure they can recompute results quickly rather than on hiding poor performance behind caching layers.
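As a rough illustration of the scatter-gather side of that architecture, here is a hypothetical sketch (the names and structure are mine, not LinkedIn's) of fanning a query out to in-memory partitions in parallel and merging the partial results, rather than consulting a result cache:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScatterGatherSketch {

    // Each partition holds a slice of the index entirely in memory.
    interface Partition {
        List<Long> search(String query, int limit);
    }

    private final List<Partition> partitions;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    ScatterGatherSketch(List<Partition> partitions) {
        this.partitions = partitions;
    }

    // Fan the query out to every partition in parallel, then merge.
    List<Long> search(final String query, final int limit) throws Exception {
        List<Future<List<Long>>> futures = new ArrayList<Future<List<Long>>>();
        for (final Partition p : partitions) {
            futures.add(pool.submit(new Callable<List<Long>>() {
                public List<Long> call() {
                    return p.search(query, limit);
                }
            }));
        }
        List<Long> merged = new ArrayList<Long>();
        for (Future<List<Long>> f : futures) {
            merged.addAll(f.get()); // a real system would re-rank here, not just concatenate
        }
        return merged.subList(0, Math.min(limit, merged.size()));
    }
}
```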
6 comments:
Hi Greg:
This is John Wang from LinkedIn who gave the talk and also a follower of your blog.
A reason we can't rely on caching is the rate at which the underlying index is changing (for our realtime requirements). The conventional way of "caching" would not work in our case.
-John
Yep, totally agree, John. In fact, I think caching is badly overdone by many out there. It's not uncommon to have caching layers as large as the database layer, but those resources likely would produce better speedups and less complexity if allocated to the database layer.
I have some older posts -- "Put that database in memory" and "Replication, caching, and partitioning", for example -- if anyone is interested in hearing more from me on that topic.
I'd love to understand how you handle index updates at that frequency with indexes that large. My experience with Lucene indexes with millions of documents is that updates take non-trivial amounts of time, like dozens of seconds.
I agree that it's silly to use a complex caching architecture when the main concern is I/O and everything can fit into an in-memory index / database. My concern here is that some of LinkedIn's queries sound CPU-intensive, or at least require lots of random access. I have over 400k degree-two connections, and I'm sure lots of people have more. Sounds like something you don't want to recompute on every query--but you do have to keep it somewhat up to date as the network changes.
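To make the cost being described concrete, here is a toy sketch (purely illustrative, not LinkedIn's code) of computing a degree-two connection set by unioning connection lists; with 400k+ second-degree members, redoing this from scratch on every query would touch a lot of adjacency data:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NetworkDistanceSketch {

    // memberId -> ids of direct (degree-one) connections
    private final Map<Long, Set<Long>> connections = new HashMap<Long, Set<Long>>();

    // The degree-two set is the union of every direct connection's own list.
    // This union is the part that is expensive to redo from scratch per query.
    public Set<Long> degreeTwoSet(long memberId) {
        Set<Long> firstDegree = connections.get(memberId);
        if (firstDegree == null) {
            return new HashSet<Long>();
        }
        Set<Long> result = new HashSet<Long>(firstDegree);
        for (long friend : firstDegree) {
            Set<Long> friendsOfFriend = connections.get(friend);
            if (friendsOfFriend != null) {
                result.addAll(friendsOfFriend);
            }
        }
        result.remove(memberId); // don't count yourself
        return result;
    }
}
```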
Hi Jim:
Hate to hijack Greg's blog. Email me directly if you want to chat in detail about our Lucene customizations.
We did open source our indexing engine: http://code.google.com/p/zoie.
-John
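For readers wondering how a realtime index can avoid the slow-reopen problem described above, here is a rough sketch of the general pattern (Lucene 3.x-era API assumed; this illustrates the idea, not Zoie's actual implementation): fresh updates go into a small in-memory index that is cheap to reopen, while the large disk index changes slowly, and searches read both through a MultiReader.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Illustration only; not Zoie's actual implementation.
public class RealtimeIndexSketch {

    private final Directory diskDir;                        // large, slowly merged index
    private final RAMDirectory ramDir = new RAMDirectory(); // small index of fresh updates
    private final IndexWriter ramWriter;

    public RealtimeIndexSketch(Directory diskDir) throws Exception {
        this.diskDir = diskDir;
        this.ramWriter = new IndexWriter(ramDir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void update(Document doc) throws Exception {
        ramWriter.addDocument(doc); // visible on the next reader reopen
        ramWriter.commit();
    }

    public IndexSearcher searcher() throws Exception {
        // Reopening a tiny RAM index is fast, which keeps update-to-search latency low;
        // the big disk index only needs to be reopened when it is merged.
        IndexReader ram = IndexReader.open(ramDir);
        IndexReader disk = IndexReader.open(diskDir);
        return new IndexSearcher(new MultiReader(new IndexReader[] { ram, disk }));
    }
}
```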
I agree with Daniel. It's a mystery why they avoid caching for some queries. What's more surprising is that he says caching == cheating. Maybe John can elaborate more on this.