Tuesday, April 03, 2007

Recap of recent posts

There has been a flurry of long posts on papers and lectures here in the last week. It might have been a bit overwhelming. I would not be surprised if your eyes glazed over on a Monday morning -- ugh, too much to read -- and the posts passed you by.

But there is some really good stuff in there. In case you missed them, I wanted to highlight a few key posts on some particularly interesting topics:

"Knowledge extraction from search queries" talks about a couple of papers out of Google on extracting facts from the Web. Question answering -- correctly answering questions such as "How old is Larry Page?" -- is an important and promising path to improving web search. This Google work is unusual in that it proposes using query logs to help with knowledge extraction and question answering.

"The end of federated search?" and "Google and the deep web" discuss Google's efforts to crawl the deep web, data normally hidden in private databases behind HTML forms. Deep web data would make web search more comprehensive and, because the data is often well structured, could be particularly useful for improving question answering. The key part of the Google work is that it rejects the common technique of accessing deep web data in real time, instead proposing to copy everyone else's data to Google's servers.

"More on data center in a trailer" talks about Microsoft's and others' efforts to factory-install thousands of computers in a shipping container and the efficiencies gained from that approach.


Andrew Hitchcock said...

I enjoyed the eye-glazing posts and would like more :). I've been reading through the links and watching the videos in the last few days.

Also, speaking of data centers in a truck, I saw one of those Sun systems parked in E1 (the large UW parking lot near Husky Stadium) on Monday.

Anonymous said...

To some extent, deep Web providers have already been doing this to facilitate Google (and other search engine) crawlers.

They essentially "print" HTML pages from their databases and make them accessible to crawlers.

Yellow Page providers are a good example. Do a search like "plumbers 22201" and you'll see content that has been surfaced by the database owners. Newspaper archives are another example. (