Monday, December 04, 2023

The failure of big data

For decades, the focus in machine learning has been big data.

More data beats better algorithms, said an early 2001 result from Banko and Brill at Microsoft Research. This was hugely influential on the ML community. For years, most people found it was roughly true that if you get more data, ML works better.

Those days have come to an end. Nowadays, big data often is worse. This is because low quality or manipulated data wrecks everything.

There was a quiet assumption behind big data that any bad data in the big data is noise that averages out. That is wrong in most real world data where the bad data is skewed.

This problem is acute with user behavior data, like clicks, likes, links, or ratings. ML trying to use user behavior is trying to do wisdom of crowds, summarizing the opinions of many independent sources to produce useful information.

Adversaries can purposely skew user behavior data. When they do, using that data will yield terrible results in ML algorithms because the adversaries are able to make the algorithms show whatever they like. That includes the important ranking algorithms for search, trending, and recommendations that we use every day to find information on the internet.

ML using behavior data is doing wisdom of crowds. Wisdom of crowds assumes the crowd is full of real, unbiased, non-coordinating voices. It doesn't work when the crowd is not real. In cases where you are not sure, it's better to discard much of the data, anything not reliable.

Better data often beats big data if you measure by what is useful to people. ML needs data from reliable, representative, independent, and trustworthy sources to produce useful results. If you aren't sure about the reliability, throw that data out, even if you are throwing most of the data away in the end. Seek useful data, not big data.

No comments: