Monday, June 26, 2006

Foundations of Statistical NLP

Every once in a while, I come across a book so good that I feel foolish for not reading it sooner.

"Foundations of Statistical Natural Language Processing" by Chris Manning and Hinrich Schutze is a remarkable survey, not only in breadth, but also in its deep, critical analysis of each technique's strengths and weaknesses.

I was particularly excited by the focus on practical approaches with massive data sets.

For example, when discussing clustering, the authors warn of efficiency issues with hierarchical clustering or EM algorithms and say that K-means "should probably be used first ... because its results are often sufficient."

On text categorization, they talk about other techniques, but then point out that k nearest neighbor (kNN) is a "simple method that often performs well."
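
As a rough illustration of why kNN counts as a "simple method", here is a minimal sketch of kNN text categorization. It is not taken from the book: the bag-of-words representation, the cosine similarity measure, the function names, and the toy training set are all assumptions chosen to keep the example short.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, labeled_docs, k=3):
    """Return the majority label among the k training documents most similar to text."""
    query = bag_of_words(text)
    ranked = sorted(
        labeled_docs,
        key=lambda doc: cosine(query, bag_of_words(doc[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set, made up for illustration only.
TRAIN = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the team", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stocks fell as rates rose", "finance"),
    ("the market rallied on earnings", "finance"),
]
```

Calling knn_classify("the team scored a goal", TRAIN) picks the label by majority vote among the three closest training documents. There is no training step at all, which is much of the appeal; the cost is that every classification compares the input against the whole training set.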

When discussing latent semantic indexing (LSI) and other forms of dimensionality reduction, they mention that pseudo feedback -- query expansion by adding terms from the top results for a search for the original query -- can be cheaper and more effective, depending on your needs.
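
The query expansion step described above can be sketched in a few lines. This is a toy under stated assumptions, not the book's formulation: the term-count ranker in search() stands in for a real retrieval engine, and the function names, the stopword list, and the sample documents are hypothetical.

```python
from collections import Counter

def tokens(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def search(query_terms, docs, top_n):
    """Rank documents by total occurrences of the query terms (stand-in for a real ranker)."""
    scored = [(sum(tokens(d).count(t) for t in set(query_terms)), d) for d in docs]
    scored = [(score, d) for score, d in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_n]]

def expand_query(query, docs, top_n=2, extra_terms=2, stopwords=frozenset()):
    """Add the most frequent new terms from the top-ranked results to the query."""
    original = tokens(query)
    top_docs = search(original, docs, top_n)
    counts = Counter(
        t
        for d in top_docs
        for t in tokens(d)
        if t not in original and t not in stopwords
    )
    return original + [t for t, _ in counts.most_common(extra_terms)]

# Toy collection and stopword list, made up for illustration only.
STOPWORDS = frozenset({"the", "in", "and", "for"})
DOCS = [
    "jaguar speed in the wild jaguar habitat",
    "jaguar habitat and rainforest wild cats",
    "used car prices for the jaguar sedan",
    "python tutorial for beginners",
]

expanded = expand_query("jaguar habitat", DOCS, top_n=2, extra_terms=1, stopwords=STOPWORDS)
```

The expanded term list would then be run through the search a second time. Nothing here requires factoring a large term-document matrix, which is one reason this approach can be cheaper than LSI.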

They criticize hidden Markov models (HMMs) because of the large state spaces required to model many real-world problems, but discuss variants that try to mitigate this issue. They then follow up when talking about part-of-speech tagging by offering transformational (rule-based) approaches as a fast and effective alternative to HMMs.

The book is also full of examples of the ambiguity of language, especially in the sections on disambiguation, parsing, and machine translation. The authors tease you with what looks like a problem with a simple solution, then offer examples that tear apart your naive attempts at cleverness.

Though they focus on very large data sets, Manning and Schutze do not see massive data as the entire solution. At one point, when talking about N-gram models, they say:
"One might hope that by collecting much more data that the problem of data sparseness would simply go away ... In practice it is never a general solution to the problem. While there are a limited number of frequent events in language, there is a seemingly never ending tail to the probability distribution of rarer and rarer events, and we can never collect enough data to get to the end of the tail."
Despite the huge amount of text out there, not everything that can be said already has been said.
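
To make "data sparseness" concrete, the sketch below measures what fraction of the bigrams in a new piece of text never occurred in the training text. It is only a toy: the function names and the whitespace tokenization are assumptions, and a pair of short strings cannot demonstrate the authors' claim about very large corpora. It just shows the quantity they are talking about.

```python
def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

def unseen_bigram_rate(train_text, test_text):
    """Fraction of bigrams in test_text that never occur in train_text."""
    seen = set(bigrams(train_text.lower().split()))
    test = bigrams(test_text.lower().split())
    if not test:
        return 0.0
    return sum(1 for pair in test if pair not in seen) / len(test)
```

An unsmoothed bigram model assigns probability zero to every one of those unseen pairs, and therefore to any sentence containing one. That is why a long tail of rare events matters even when each individual event is very unlikely.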

A great book. I only regret that I took so long to read it.

Update: Some good discussion in the comments for this post. Don't miss the link to Peter Norvig's book reviews.


jeff.dalton said...

Good review.

I've read a few chapters of the book and found it extremely interesting and helpful.

Another complementary book is: Speech and Language Processing by Jurafsky and Martin

Have you read it? If so, how do you think it compares?

Greg Linden said...

Hi, Jeff. I haven't read Jurafsky and Martin, so I can't really comment on it.

I can say that Manning and Schutze are focused on processing very large data sets. I'm not sure other textbooks that cover some of the same algorithms would have that emphasis.

Anonymous said...

Peter Norvig reviews both of the books and a bunch of others -- quite insightful:

(If the link below doesn't work, google for "norvig reviews".)
Peter Norvig's reviews @


Greg Linden said...

Thanks, Anonymous. Peter Norvig's reviews are interesting!

I was particularly keen to see what he said about the Jurafsky and Martin text compared to Manning and Schutze. Peter wrote, "If your needs are more focused on the algorithms for lower-level text processing with statistical techniques, then Manning and Schutze is far more comprehensive. If you're a serious student or professional in NLP, you just have to have both."

I noticed that Peter also wrote a very positive review for Managing Gigabytes. That is another of my favorites.

Another book that I really enjoyed that I don't think Peter mentions is Applied Cryptography by Bruce Schneier. It's a remarkable text.

Anonymous said...

The Norvig review is what sold me on the Manning and Schutze book about four or five months ago.

I must add that this has been one of the most fun-to-read tech books I've read in ~3 years. Overall it's a close second to AIMA in the level of pleasure, challenge, and knowledge I've gained from it.

I did enjoy Managing Gigabytes, but I certainly didn't expect compression to be the topic I enjoyed most.

Greg, if you find any other gems please share! :)

Martyloo said...

I used Manning and Schutze as an undergraduate around 2001, and it was originally published in 1999, making it 7 years old. It was a great book back then: is it ready for an update?