I like my ideas so much I am going to post my comm...

2006-08-24T07:57:00.000-07:00

I like my ideas so much I am going to post my comments on my blog at techrepublic.

Yes, I am very humble. ;-)

"... may have to parse and minimally understand......

2006-08-24T07:34:00.000-07:00

"... may have to parse and minimally understand..."

This is a really interesting problem. I have given a little thought (I really mean just a little, some people have made a lifetime study of this) to the complexities of natural language parsing, and at first, I thought it would be fairly easy to do. Yes, you can laugh and throw things now if you want.

I started to think of a simple parser. What if it analysed one sentence at a time and categorised each word by type, for instance noun, preposition, etc.? To determine if a sentence is valid, you just see if the sequence of word types falls into a recognisable sentence pattern.
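A rough sketch of that simple parser, in Python. The lexicon, the word-type labels and the accepted patterns below are all made up for illustration; a real one would need a far bigger word list.

```python
# Hypothetical toy lexicon: word -> word type. Not a real dictionary.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "printer": "NOUN", "blog": "NOUN",
    "eats": "VERB", "reads": "VERB",
    "blue": "ADJ", "quiet": "ADJ",
    "on": "PREP", "under": "PREP",
}

# Hypothetical set of word-type sequences we accept as sentences.
PATTERNS = {
    ("DET", "NOUN", "VERB"),
    ("DET", "NOUN", "VERB", "DET", "NOUN"),
    ("DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"),
    ("DET", "NOUN", "VERB", "PREP", "DET", "NOUN"),
}


def word_types(sentence):
    """Map each word of one sentence to its type (UNKNOWN if not in the lexicon)."""
    words = sentence.lower().rstrip(".!?").split()
    return tuple(LEXICON.get(word, "UNKNOWN") for word in words)


def looks_valid(sentence):
    """A sentence passes if its word types match one of the known patterns."""
    return word_types(sentence) in PATTERNS
```

With this toy data, `looks_valid("The dog reads the blog.")` passes and `looks_valid("Dog the reads")` does not.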

Then I realised how easy it would be to fake this out in one of two ways: one, by doing what the spammers are already doing, namely stealing bits from others' blogs, or two, by using a very simplistic Mad Libs style generator that just makes nonsense that is grammatically correct.
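To show how little effort the second trick takes, here is a minimal sketch of such a generator in Python. The word pools and the single template are invented for the example.

```python
import random

# Hypothetical word pools, one per word type.
WORDS = {
    "DET": ["the", "a"],
    "ADJ": ["blue", "quiet", "metal"],
    "NOUN": ["printer", "aerosol", "blog"],
    "VERB": ["eats", "reads", "paints"],
}

# One grammatical template: determiner, adjective, noun, verb, determiner, noun.
TEMPLATE = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]


def madlib(rng=random):
    """Fill every slot of the template with a random word of the right type."""
    words = [rng.choice(WORDS[slot]) for slot in TEMPLATE]
    return " ".join(words).capitalize() + "."
```

Every sentence it produces has a valid shape, and none of them has to mean anything.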

I started to think about language, and how we learn it. I came to realise that language is how we express everything we understand about the world. It is not just nouns and adjectives, it is emotions, style, subtle shades of meaning. One word can have several different meanings depending on its context.

To simply try and put all this knowledge into a huge database will not help either, because people add facts to their knowledge base differently depending on pre-existing worldviews, and sometimes a convincing argument can sway their worldview. A computer cannot do this, and (to my limited thinking) will battle to simulate it.

My theory on preventing spam is that, instead of putting the extremely difficult task of natural language processing on our own computers, which are very ill-suited for the task, we should push it onto the blog spamming software.

CAPTCHA images and simple validation schemes are problematic for exactly that reason: they are simple for computers to parse. They don't require intelligence and experience, merely crude pattern recognition. The only way to foil auto readers is to add noise (distortion, background noise), which also makes life more difficult for the humans.

What about this instead: a simple database with a pool of words, a pool of questions, and a linking table which puts words in certain categories. You pose a question like "Type in the colour from the list above" and list words like "aerosol, blue, printer". Screen readers can read it. People can solve this easily, not so the spam bots. Yes, they can just try and post with every word in the list, but you can blacklist or tarpit IPs with frequent failed tries.
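Here is a minimal sketch of that scheme in Python with an in-memory SQLite database. The table names, the extra words, the function names and the failure threshold are all my own assumptions, not a finished design.

```python
import random
import sqlite3
from collections import Counter

# Hypothetical schema: word pool, question pool, and a linking table.
SCHEMA = """
CREATE TABLE words (word TEXT PRIMARY KEY);
CREATE TABLE questions (category TEXT PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE word_category (word TEXT NOT NULL, category TEXT NOT NULL);
"""


def build_db():
    """Create the three tables and load a tiny example data set."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    pool = ("aerosol", "blue", "printer", "red", "spoon", "green")
    db.executemany("INSERT INTO words VALUES (?)", [(w,) for w in pool])
    db.execute("INSERT INTO questions VALUES (?, ?)",
               ("colour", "Type in the colour from the list above"))
    db.executemany("INSERT INTO word_category VALUES (?, ?)",
                   [("blue", "colour"), ("red", "colour"), ("green", "colour")])
    return db


def make_challenge(db, rng=random, decoys=2):
    """Pick a question, one word that answers it, and some words that do not."""
    category, question = rng.choice(
        db.execute("SELECT category, text FROM questions ORDER BY category").fetchall())
    answers = [row[0] for row in db.execute(
        "SELECT word FROM word_category WHERE category = ? ORDER BY word", (category,))]
    others = [row[0] for row in db.execute(
        "SELECT word FROM words WHERE word NOT IN "
        "(SELECT word FROM word_category WHERE category = ?) ORDER BY word", (category,))]
    answer = rng.choice(answers)
    shown = rng.sample(others, decoys) + [answer]
    rng.shuffle(shown)
    return {"question": question, "words": shown, "answer": answer}


def check(challenge, reply):
    """Compare the reply to the expected word, ignoring case and spacing."""
    return reply.strip().lower() == challenge["answer"]


MAX_FAILURES = 3      # hypothetical threshold before an IP is blocked
failures = Counter()  # failed attempts per IP address


def submit(ip, challenge, reply):
    """Accept or reject a comment attempt, blocking IPs that keep failing."""
    if failures[ip] >= MAX_FAILURES:
        return "blocked"
    if check(challenge, reply):
        return "accepted"
    failures[ip] += 1
    return "rejected"
```

A bot that guesses its way through the list burns one failure per wrong word, so with only three words shown it gets very few tries before the block kicks in. A real site would want the counter to expire over time and to live somewhere more durable than a Python dictionary.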

What's more, the system is easily upgradeable. Spammers starting to write programs to find colours? Just update your database. Now start asking which words are metal objects. Ask for the third last word in the original blog post. The possibilities are endless, and so darned easy to code.
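The "third last word" question, for example, could look something like this in Python. What counts as a word (letters, digits and apostrophes here) is my own assumption, and so are the function names.

```python
import re


def third_last_word(post_text):
    """Return the third last word of the post, lowercased."""
    # Assumption: a word is any run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", post_text)
    if len(words) < 3:
        raise ValueError("post is too short for this question")
    return words[-3].lower()


def check_answer(post_text, reply):
    """Compare the reply to the expected word, ignoring case and spacing."""
    return reply.strip().lower() == third_last_word(post_text)
```

For a post ending "Blog spam is a hard problem to solve." the expected answer is "problem".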

This type of test is very difficult for computers, but simple for humans. I say let's use computers' weaknesses against the spammers, not put them in a position to use computers' strengths against us.

Comments on Geeking with Greg: Blog spam at AIRWeb
