Saturday, February 10, 2007

Better understanding through big data

An interesting tidbit from Googler (and recent ACM Fellow) Peter Norvig in an interview with Matt Marshall:
The way to get better understanding of text is through statistics rather than through hand-crafted grammars and lexicons.

The statistical approach is cheaper, faster, more robust, easier to internationalize, and so far more effective.
This reminds me a bit of what Peter said in some of his recent talks:
Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm ... is performing better than the best algorithm on less training data.

Worry about the data first before you worry about the algorithm.

Having more machines is a very important part because it allows us to turn around the experiments much faster than the other guys ... It's the -- gee, I have an idea, I think we should change this -- and we can get the answer in two hours which I think is a big advantage over someone else who takes two days.
Learning from big data is what Google's infrastructure was built to do. It is particularly obvious in their machine translation results, but it also affects everything else they do, from search to ad targeting to personalization.

The rest of the interview is worth reading. Unfortunately, it is appended to yet another hype piece on vaporware from PowerSet, but just ignore the first part of the article and get to the good stuff at the end.

Update: Matthew Hurst has some interesting thoughts after reading Peter's interview:
The huge redundancy in ... documents suggests approaches to serving the user that don't require the perfect analysis of every document.

The basic [paradigm] of text mining ... the one document at a time pipeline ... is limiting. It fails to leverage redundancy ... [and assumes] that perfection is required at every step.

The key to cracking the problem open is the ability to measure, or estimate, the confidence in the results ... Given 10 different ways in which the same information is presented, one should simply pick the results which are associated with the most confident outcome - and possibly fix the other results in that light.
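To make the idea of exploiting redundancy concrete, here is a minimal sketch in Python. It is only an illustration under assumed inputs, not Matthew Hurst's implementation: the tuple layout `(doc_id, field, value, confidence)`, the function names, and the sample data are all hypothetical, and the confidence scores are assumed to come from some upstream extractor.

```python
def most_confident(extractions):
    """For each field, keep the value with the highest confidence.

    `extractions` is a list of (doc_id, field, value, confidence) tuples.
    This layout is an assumption made for the sketch.
    """
    best = {}
    for _doc_id, field, value, confidence in extractions:
        if field not in best or confidence > best[field][1]:
            best[field] = (value, confidence)
    return best


def reconcile(extractions):
    """Overwrite every document's value with the most confident one."""
    best = most_confident(extractions)
    return [(doc_id, field, best[field][0])
            for doc_id, field, _value, _confidence in extractions]


# Hypothetical extractions of the same fact from three documents.
sample = [
    ("doc1", "founded", "1998", 0.90),
    ("doc2", "founded", "1989", 0.40),  # low-confidence, likely a misread
    ("doc3", "founded", "1998", 0.75),
]

winners = most_confident(sample)
fixed = reconcile(sample)
```

The point of the sketch is that no single document needs to be analyzed perfectly: the low-confidence reading from `doc2` is simply replaced by the value seen with the highest confidence elsewhere.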
Update: I also liked the challenge Matthew Hurst described in this later post:
A more sophisticated search engine would be explicit about ambiguity (rather than let the user and documents figure this out for themselves) and would take information from many sources to resolve ambiguity, recognize ambiguity and synthesize results.


Anonymous said...

What have you got against PowerSet? That a little startup is challenging the status quo?

Greg Linden said...

Just that they are getting so much press with vaporware. Reminds me of Riya.

Maybe their product really will be a revolution in NLP when it launches, but it is hard to believe.

Paul Kedrosky and Danny Sullivan also express a lot of skepticism about PowerSet's claims if you are interested in taking a look at their thoughts.

Anonymous said...


It'd be great if you could write your thoughts describing, comparing and contrasting some of the major natural language projects out there - Oren Etzioni's KnowItAll, Doug Lenat's Cyc, Powerset, and other notable ones that I might have missed.

I know about them, I understand at a shallow level what they do, but I don't know where philosophically all these people stand (for example, Peter Norvig favors statistical learning over linguistic/semantic understanding), and how these products stack up in terms of NLP evolution.

I am sure such a post would be interesting to a number of your readers.

Greg Linden said...

Hi, Pranav. Thanks, that is flattering, but I am not sure I know enough about NLP to be confident about my ability to do that. Moreover, many of these projects -- Powerset and Cyc in particular -- are not generally accessible to the public, so they are hard to evaluate. Sorry that I cannot be more helpful on this.

Anonymous said...

I do not see why the two approaches are so incompatible. Why not apply NLP to the dataset, first, and then compute and utilize statistics of the recognized grammatical forms? Think of NLP as just "smart" pre-processing. Smart pre-processing + big data has got to work better than big data alone. Smart pre-processing makes big data more robust.

Despite the current media hype around Powerset, I am very hesitant to dismiss them immediately and outright, as so much of the Google-fanbased blogosphere has done. Powerset may never "kill" Google, but they do have the potential to shake up some of our collective, previously held biases.

It may be time for a little history lesson. We all tend to think that Google invented hyperlink analysis for the purpose of improving ad hoc retrieval results. But remember that Eugene Garfield first used this concept in the 1970s. He treated citations in scientific papers as "hyperlinks". Papers that received a lot of incoming citation "links" ("EugeneRank" or "Garfield-Juice") were ranked higher than other papers with equally good keyword matches, but fewer incoming citations.
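As a rough illustration of that ranking idea, here is a minimal Python sketch. It uses plain incoming-citation counts as a tie-breaker among papers with equal keyword matches; it is not Garfield's actual method and not PageRank, and the function name, the data layout, and the sample papers are all hypothetical.

```python
from collections import Counter


def rank_papers(papers, citations, query_terms):
    """Rank papers by keyword matches, then by incoming citation count.

    papers:      dict mapping paper id -> text (assumed layout)
    citations:   list of (citing_id, cited_id) pairs (assumed layout)
    query_terms: list of query words
    """
    # Count incoming citation "links" for each paper.
    incoming = Counter(cited for _citing, cited in citations)

    def keyword_score(text):
        words = set(text.lower().split())
        return sum(1 for term in query_terms if term.lower() in words)

    scored = [(keyword_score(text), incoming[pid], pid)
              for pid, text in papers.items()]
    # Drop papers that match no query term at all.
    scored = [s for s in scored if s[0] > 0]
    # More keyword matches first, then more citations, then id for stability.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [pid for _score, _cites, pid in scored]


# Hypothetical corpus: A and B match the query equally well.
papers = {
    "A": "protein folding simulation",
    "B": "protein folding experiments",
    "C": "galaxy formation survey",
}
citations = [("C", "B"), ("A", "B"), ("B", "A")]

ranking = rank_papers(papers, citations, ["protein", "folding"])
```

Here papers A and B tie on keywords, so B ranks first because it has two incoming citations to A's one, and C is dropped because it matches no query term.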

However, this work was not widely adopted or furthered until 20 years later. The community as a whole did not see the value in this work until Google proved it on a mass scale. And I am sure that, in those 20 years, there were skeptics like Danny Sullivan and Paul Kedrosky saying things like "for twenty years now we have known about citation link analysis, and I have yet to see it used in any mass, proven manner" and "citation link analysis won't work; people want information, not popularity contests".

So when I see Danny dismissing natural language search because people only type in queries such as "beach" and "heide klum", or because he has been listening to the hype for 10 years, I become skeptical of the skeptic. Link analysis was pooh-poohed (or at least ignored) for 20 years before it launched a $150 billion company.

Powerset still might not succeed. But it is a sad day when our first reaction is to dismiss the attempt, or to say "it has been tried before, therefore it will never work", or to say "people will never search for information in any other manner than typing in 1-3 keywords, so we should never try and offer anything different".

Some folks are taking a more tempered approach, and not saying that they are against Powerset, but instead adopting a "wait and see" attitude. I guess that is a little better, but personally I think our attitude should be more generous than "wait and see". I think we should be actively encouraging and cheering any and all search companies, Google included when it happens, that are trying to challenge the status quo. We are at such an early stage in search, even though it has been around for 40 years offline and 10 years online, that we should be going out of our way to actively encourage this sort of work.

The blogosphere should be abuzz with ideas and chatter about how we can take concepts and lessons learned from NLP and actually create a better search engine, instead of dismissing, or merely "waiting and seeing" about, anyone who tries.

Greg Linden said...

Hi, Jeremy. I agree that we should actively encourage and cheer search companies and research projects that challenge the status quo.

My problem is more with the absurd press hype that Powerset has encouraged. The product is vaporware, so pronouncements of "changing the very nature of search" or "unprecedented consumer search capabilities" seem questionable. The entire thing reminds me of when Riya overpromised and underdelivered on their facial recognition technology.

I share your excitement and enthusiasm over the promise of NLP in search. I agree that it would be fantastic if Powerset can deliver on the claims they have made.

Anonymous said...

Mmm.. yes. I am totally sympathetic to hype-sensitivity. I do not like it, either, and am not trying to defend it.

But there are many others out there, not yourself but others, that are reacting strictly against the Powerset message, rather than against the hype. Danny's comments and critiques, for example, are against the very notion of NLP search, saying that users won't type more than 2-3 words. Where is NLP needed in the query "beach", he asks.

With no criticism of Danny, personally, it is that sort of thinking that I wish we could get away from. It just feels like, "why build an airplane? No person has ever flown before, so what would we need an airplane for?" Know what I mean?

Greg Linden said...

Hey, Jeremy. I read Danny's comments a little differently, I think.

I think he was reacting to the hype, saying that progress on NLP is likely to be slow and that PR claims of miraculous progress should be met with skepticism.

On Danny's concern that people will only enter a couple of words, I think he is talking about the short term and the difficulty of changing behavior. That says more about Powerset's likelihood of immediately "changing the very nature of search" than about the long-term prospects for major innovation in search because of NLP advances.

Personally, I see NLP as one of the most promising paths for making substantial progress in search. Yet, as Danny pointed out, the field is littered with the remains of those who made triumphant claims about NLP, and we may want to be careful of believing those who have not yet proven their promises.

I don't think it is inconsistent to be skeptical of the PR while enthusiastic about the prospects, is it?

Anonymous said...

No, it is not inconsistent at all, to be both skeptical and enthusiastic at the same time. Hype totally turns me off, too. I just didn't hear any enthusiasm at all in Danny's comments. But maybe that's just the way I read it.

In fact, I had another post in this thread a few days ago that did not go through, in which I confessed some hypocrisy in this whole matter. I called for more enthusiasm amidst all the skepticism for NLP, but in all my other posts on your blog I express skepticism without enthusiasm for personalization. I need to be more simultaneously enthusiastic about personalization, too :-)

I am just feeling like there is a little too much hype with personalization right now, too. I have seen it tried (in the form of things like user modeling, etc.) over and over, in academic papers, throughout the past 10-15 years to little or no avail. And I have yet to see anything convincing from Google, other than artificial examples about Miami dolphins versus oceanic dolphins, which is really not a personalization issue.

So I think we are each just reacting to hype sore spots, in one form or another.