Thursday, July 01, 2004

Text-only option from Google cache

Google Blogscoped noticed that Google now allows you to see a text-only version of cached pages. For example, try the normal and text-only cached versions of AnandTech. If a site is down or very slow, this is a convenient feature.

This reminds me of a prototype that I wrote a few years ago that allowed people to browse the web entirely using the Google cache. All URLs were rewritten on the fly to point back to the Google cache, so, once you started browsing the web using the tool, you stayed on Google cache for all your web browsing. It was a cute idea and worked fairly well, but I didn't pursue it beyond the prototype.
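
To make the rewriting idea concrete, here is a minimal Python sketch of how links in a fetched page could be pointed back at the cache. It is not the prototype's actual code. The `CACHE_PREFIX` URL pattern, the function names, and the regex-based approach are all assumptions for illustration; a real tool would want a proper HTML parser and would also need to handle forms, frames, and scripts.

```python
import re
from urllib.parse import quote, urljoin

# Assumed cache URL pattern, for illustration only.
CACHE_PREFIX = "http://www.google.com/search?q=cache:"

# Matches the href attribute of an <a> tag with a quoted value.
HREF_RE = re.compile(
    r'(<a\b[^>]*?\bhref\s*=\s*)(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)


def to_cache_url(url):
    """Wrap an absolute URL in the (assumed) cache URL pattern."""
    return CACHE_PREFIX + quote(url, safe="")


def rewrite_links(html, base_url):
    """Rewrite every http(s) link in html to point at the cache."""

    def replace(match):
        prefix, quote_char, href = match.groups()
        # Resolve relative links against the page they came from.
        absolute = urljoin(base_url, href)
        if not absolute.startswith(("http://", "https://")):
            # Leave mailto:, javascript:, and similar links alone.
            return match.group(0)
        return f"{prefix}{quote_char}{to_cache_url(absolute)}{quote_char}"

    return HREF_RE.sub(replace, html)


page = '<a href="/news">News</a> <a href="mailto:me@example.com">Mail</a>'
print(rewrite_links(page, "http://example.com/"))
```

Because every rewritten link itself leads to a page that gets rewritten the same way, following any link keeps you inside the cache.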

While you're at Google Blogscoped, check out their little EgoBot toy. I love the answer to "What is... personalization?"

I'm curious how EgoBot is implemented. Again, a couple of years ago, I had a prototype called "The Oracle of Google" that took a question, executed a query against Google, and then did some simple natural language processing to try to extract likely answers from the blurbs in the Google results. It worked surprisingly well. I'm guessing this does something similar.
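
For a rough feel of the "extract likely answers from the blurbs" step, here is a small Python sketch. It is not the Oracle's actual code, and it makes no claim about how EgoBot works. It assumes the result blurbs have already been fetched as plain strings, and it swaps in a very crude technique: count short phrases that recur across blurbs, ignoring stopwords and words already in the question. The names `candidate_answers`, `tokenize`, and `STOPWORDS` are hypothetical.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; a real system would use a larger one.
STOPWORDS = {
    "a", "an", "and", "are", "by", "for", "in", "is", "of", "on",
    "the", "to", "was", "what", "when", "where", "who", "with",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def candidate_answers(question, snippets, max_len=3, top_n=3):
    """Rank phrases of up to max_len words by how many snippets contain them."""
    ignore = STOPWORDS | set(tokenize(question))
    counts = Counter()
    for snippet in snippets:
        tokens = tokenize(snippet)
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if any(token in ignore for token in gram):
                    continue
                seen.add(" ".join(gram))
        # Each snippet votes at most once for a given phrase.
        counts.update(seen)
    # Most votes first; break ties in favor of longer phrases, then alphabetically.
    ranked = sorted(
        counts.items(),
        key=lambda item: (-item[1], -len(item[0].split()), item[0]),
    )
    return ranked[:top_n]


blurbs = [
    "Hamlet was written by William Shakespeare around 1600.",
    "William Shakespeare wrote Hamlet, a tragedy.",
    "The play Hamlet is by William Shakespeare.",
]
print(candidate_answers("Who wrote Hamlet?", blurbs))
```

The idea is that the right answer tends to show up in many blurbs, while noise words vary from one result to the next.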

If you're interested in question answering using search, there's some research work that might be useful to you. One of my favorite papers is by Kwok, Etzioni, and Weld. It's called "Scaling Question Answering to the Web" and it describes a more principled approach to this problem, leveraging WordNet and some other publicly available tools for the natural language processing. They had problems with scalability -- the natural language processing was quite expensive -- but it's a fascinating approach to the problem.

Update: Microsoft Research has been doing a fair amount of work on question answering. Here's a paper I found particularly interesting.
