Saturday, December 10, 2022

ML and flooding the zone with crap

Wisdom of the few is often better than wisdom of the crowds.

If the crowd is shilled and fake, most of the data isn't useful for machine learning. To be useful, you have to pull out the scarce wisdom in the sea of noise.

Gary Marcus looked at this in his latest post, "AI's Jurassic Park moment". Gary talks about how ChatGPT makes it much cheaper to produce huge amounts of reasonable-sounding bullshit and post it on community sites, then he said:

For Stack Overflow, the issue is literally existential. If the website is flooded with worthless code examples, programmers will no longer go there, its database of over 30 million questions and answers will become untrustworthy, and the 14 year old website will die.
StackOverflow added:
Overall, because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking or looking for correct answers.

The primary problem is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good and the answers are very easy to produce. There are also many people trying out ChatGPT to create answers, without the expertise or willingness to verify that the answer is correct prior to posting. Because such answers are so easy to produce, a large number of people are posting a lot of answers.

There was a 2009 SIGIR paper, "The Wisdom of the Few", that cleverly pointed out that a lot of this is unnecessary. For recommender systems, trending algorithms, reviews, and rankers, only the best data is needed to produce high quality results. Once you use the independent, reliable, high quality opinions, adding more big data can easily make things worse. Less is more, especially in the presence of adversarial attacks on your recommender system.

When using behavior data, ask what would happen if you could sort by usefulness to the ML algorithm and users. You'd go down the sorted list, then stop at some point when the output no longer improved. That stopping point would be very early if a lot of the data is crap.

In today's world, with fake crowds and shills everywhere, wisdom of the crowds fails. Data of unknown quality or provable spam should be freely ignored. Only use reliable, independent behavior data as input to ML.

No comments: