Saturday, March 20, 2021

Wisdom of the trusted

Flood-the-zone disinformation is a problem for crowdsourced data. Wisdom of the crowds, mass amateurization, and rejection of gatekeepers no longer work when coordinated disinformation campaigns overwhelm rankers, recommenders, and content with shills and spam.

Two decades ago, a lot of us underestimated the negative effects of lower costs for communication and information sharing. While cheaper communication was good, it also made propaganda, shilling, and manipulation far easier, and our defenses against disinformation campaigns proved weak.

We in tech were overly idealistic about what would happen as the cost of information and communication dropped. Many thought propaganda would be harder as people could now easily access the truth.

But you can't source reviews from your customers anymore if the vast majority of reviews are paid shills. You can't rank using usage data if ratings and clicks are mostly fake.

Crowdsourced information, including web crawls, reviews, and commentary, only works when almost everyone is independent and unbiased. Coordinated disinformation breaks crowdsourcing.

Flood-the-zone shouldn't have been a surprise, but it was. Propaganda and manipulation are winning because we still treat inauthentic behavior as real.

While there is plenty of mostly-deserved love for big data, often less is more when you live in an adversarial, flood-the-zone world. Wisdom of the crowds has an assumption of independence between agents, which now has been broken by coordinated disinformation campaigns.
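
To make the independence point concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the function names, the true value of 3.0, the Gaussian noise for honest raters, and the counts of honest raters and shills. It only shows the shape of the problem, not a model of any real review system.

```python
import random
import statistics

def crowd_estimate(ratings):
    """Plain wisdom-of-the-crowd estimate: the mean of all ratings."""
    return statistics.mean(ratings)

def simulate(true_value=3.0, honest=200, shills=0, shill_rating=5.0, seed=0):
    """Return the crowd estimate when `shills` coordinated raters join `honest` ones.

    Honest raters err independently around the true value, so their errors
    tend to cancel. Shills all submit the same rating, so their errors are
    perfectly correlated and never cancel.
    """
    rng = random.Random(seed)
    ratings = [true_value + rng.gauss(0, 1.0) for _ in range(honest)]
    ratings += [shill_rating] * shills
    return crowd_estimate(ratings)

clean = simulate(shills=0)      # independent raters only
flooded = simulate(shills=800)  # same honest raters, outnumbered 4 to 1

print(f"independent crowd: {clean:.2f}")
print(f"flooded crowd:     {flooded:.2f}")
```

With only independent raters, the average lands close to the true value. Once the coordinated block outnumbers them, the average mostly reports what the shills wanted, and adding more data of the same kind does not help.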

If you are looking at garbage, there is no information. Adding disinformation to good data only makes things worse. It's like making a milkshake, then eyeing a huge putrid sack of night soil nearby. Sure, you could add some of that to what you made, but even a little is going to make it worse. If there is crap everywhere, you might want to stick with what you can prove to be good.

Polling, one of the oldest forms of crowdsourced information, has been impacted too. The trend in recent years is that low response rates and shilling make it so expensive to poll that Pew Research gets better data cheaper by forming and managing a large paid panel of recruited respondents.

For those working in machine learning, for those trying to work with big data, reputation and reputable sources have to be the response in a flood-the-zone world. When most of the data is bad, how you filter your data becomes the most important thing.

We have a big challenge ahead: countering disinformation using reputation and lack of reputation. In a flood-the-zone world, most data out there is now bad to useless. Isolating the useful requires skepticism toward data. Like TrustRank, start with everything untrusted, bad until proven good.
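
As a rough sketch of the TrustRank idea, the code below propagates trust outward from a small hand-vetted seed set, so anything the seeds never reach stays at zero. It is a simplified illustration, not the published algorithm in full: the function name, the damping value, the iteration count, and the toy graph are all assumptions, and trust flowing into pages with no outgoing links is simply dropped.

```python
def trust_scores(links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from hand-vetted seeds along outgoing links.

    links: dict mapping each node to the list of nodes it links to.
    Every non-seed node starts at zero trust: bad until proven good.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    # Only seeds receive trust directly; it is split evenly among them.
    seed_share = {
        n: (1.0 / len(trusted_seeds) if n in trusted_seeds else 0.0)
        for n in nodes
    }
    scores = dict(seed_share)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * seed_share[n] for n in nodes}
        for source, targets in links.items():
            if not targets:
                continue  # simplification: trust at dead ends is dropped
            share = damping * scores[source] / len(targets)
            for target in targets:
                nxt[target] += share
        scores = nxt
    return scores

# Hypothetical toy graph: a vetted site and a blog link to each other.
# A spam ring links to itself and to the blog, but nothing trusted
# links into the ring.
links = {
    "vetted_site": ["blog"],
    "blog": ["vetted_site"],
    "spam_a": ["spam_b", "blog"],
    "spam_b": ["spam_a"],
}
scores = trust_scores(links, trusted_seeds={"vetted_site"})
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name:12s} {scores[name]:.3f}")
```

The spam ring can link to trusted pages as much as it likes. Because trust only flows along links out of already-trusted nodes, the ring earns nothing until something trusted vouches for it.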

Reviews should discard anything even resembling a shill, giving visibility only to reviews from independent and trustworthy customers. Recommender systems and rankers should focus on the data from and related to proven sources, and weight anything unknown as questionable at best and likely worthless. Most crowdsourced data for machine learning, from clicks to content, is going to have to be viewed with skepticism.
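
One way to picture this for reviews is sketched below. The function name, the trust scores, the 0.5 threshold, and the choice to treat unknown reviewers as zero trust are all hypothetical. How you would actually earn or measure trust (verified purchases, account history, and so on) is the hard part and is not shown.

```python
def trusted_rating(reviews, reviewer_trust, unknown_trust=0.0, min_trust=0.5):
    """Average star ratings, counting only reviewers with enough trust.

    reviews: list of (reviewer_id, stars) pairs.
    reviewer_trust: dict of reviewer_id -> trust in [0, 1] (hypothetical).
    Returns None when no review clears the trust threshold.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for reviewer_id, stars in reviews:
        trust = reviewer_trust.get(reviewer_id, unknown_trust)
        if trust < min_trust:
            continue  # discard unknown or low-trust reviewers entirely
        weighted_sum += trust * stars
        total_weight += trust
    if total_weight == 0:
        return None
    return weighted_sum / total_weight

# Two known customers and fifty unknown accounts all posting five stars.
reviews = [("alice", 2), ("bob", 3)] + [(f"shill_{i}", 5) for i in range(50)]
reviewer_trust = {"alice": 0.9, "bob": 0.6}

naive = sum(stars for _, stars in reviews) / len(reviews)
trusted = trusted_rating(reviews, reviewer_trust)
print(f"naive average:   {naive:.2f}")
print(f"trusted average: {trusted:.2f}")
```

The trusted average throws away most of the data and is better for it. Returning None when nobody clears the bar is deliberate: showing no rating is more honest than showing one built from accounts you know nothing about.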

Inauthentic behavior and coordinated disinformation campaigns have shilled wisdom of the crowd to death. For reliable big data in a flood-the-zone world, it will have to be wisdom of the trusted.