Tuesday, December 19, 2023

Book excerpt: Wisdom of the trustworthy

(This is an excerpt from drafts of my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It")

Wisdom of the crowds is the idea that combining the opinions of many people often yields a very useful result, usually one that is better and more accurate than almost any of the individual opinions on their own.

Wisdom of the trustworthy is the same idea, with one addition: only use the opinions of accounts that are provably real, independent people.

Discard all opinions from known shills, spammers, and propagandists. Then also discard all the opinions from accounts that might be shills, spammers, or propagandists, even if you cannot be sure. Only use proven, trustworthy behavior data.

Wisdom of the trustworthy is a necessary reaction to online anonymity. Accounts are not people. In fact, many people have multiple accounts, often for reasons that have nothing to do with trying to manipulate algorithms. But the ease of creating accounts, and difficulty of verifying that accounts are actual people, means that we should be skeptical of all accounts.

On an internet where anyone can create and control hundreds or even thousands of accounts, trust should be hard to gain and easy to lose. New accounts should not be considered reliable. Over time, if an account behaves independently, interacts with other accounts in a normal manner, does not shill or otherwise coordinate with others, and doesn’t show robot behaviors such as liking or sharing posts at rates not possible for humans, it may start to be trusted.

The moment an account engages in anything that resembles coordinated shilling, all trust should be lost, and the account should go back to untrusted. Trust should be hard to gain and easy to lose.
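To make this concrete, here is a minimal sketch of what tracking per-account trust might look like, with trust slow to accumulate and instantly reset by bot-like or coordinated behavior. The signal names and thresholds are illustrative assumptions, not a description of any real platform's system.

# Illustrative sketch only: signals and thresholds are assumptions,
# not how any real platform computes trust.

class AccountTrust:
    """Tracks a per-account trust score that is slow to rise and quick to fall."""

    def __init__(self):
        self.score = 0.0                 # new accounts start untrusted
        self.trusted_threshold = 0.8

    def observe_normal_activity(self, days_of_independent_behavior):
        # Trust accrues slowly with sustained, human-plausible, independent behavior.
        self.score = min(1.0, self.score + 0.01 * days_of_independent_behavior)

    def observe_bot_like_rate(self, actions_per_minute):
        # Liking or sharing at rates not possible for humans resets trust.
        if actions_per_minute > 30:
            self.score = 0.0

    def observe_coordination(self, coordinated_with_known_shills):
        # Anything resembling coordinated shilling drops the account back to untrusted.
        if coordinated_with_known_shills:
            self.score = 0.0

    def is_trusted(self):
        return self.score >= self.trusted_threshold

acct = AccountTrust()
acct.observe_normal_activity(days_of_independent_behavior=90)
print(acct.is_trusted())    # True after months of normal, independent behavior
acct.observe_coordination(coordinated_with_known_shills=True)
print(acct.is_trusted())    # False again: trust is easy to lose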

Wisdom of the trustworthy makes manipulation much more costly, time-consuming, and inefficient for spammers, scammers, propagandists, and other adversaries. Creating a pile of accounts that feed into wisdom of the crowd algorithms would no longer be easy or cheap. It would become a slow, cumbersome process of trying to get the accounts trusted, only to have them ignored again the moment they shilled anything.

Over a decade ago, a paper called “Wisdom of the Few” showed that recommender systems can do as well using only a much smaller number of carefully selected experts as they would using all available data on every user. The insight was that high quality data often outperforms noisy data, especially when the noisy data is not merely noisy but actively manipulated by adversaries. Less is more, the researchers argued, if less means only using provably good data for recommendations.
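As a rough illustration of that less-is-more idea, imagine a recommender that ignores the full crowd and averages only the ratings of a small, vetted set of experts. This is a toy sketch of the general approach, with made-up names and data, not the method from the paper.

# Toy sketch of "wisdom of the few": predict from a small vetted expert set
# instead of from every account. Data and names are made up for illustration.

def predict_rating(item, expert_ratings):
    """Average only the ratings from vetted experts; ignore everyone else."""
    votes = [rating for (expert, rating) in expert_ratings.get(item, [])]
    return sum(votes) / len(votes) if votes else None

expert_ratings = {"movie_42": [("expert_a", 4.0), ("expert_b", 5.0), ("expert_c", 4.5)]}
print(predict_rating("movie_42", expert_ratings))    # 4.5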

Big data has become a mantra in computer science. The more data, the better, the thinking goes, spurred on by an early result from Michele Banko and Eric Brill at Microsoft Research showing that accuracy on a natural language task depended much more on how much training data was used than on which algorithm was used. Results just kept getting better and better the more data they used. As others found similar results in search, recommender systems, and other machine learning tasks, big data became popular.

But big data cannot mean corrupted data. Data that has random noise is usually not a problem; averaging over large amounts of the data usually eliminates the issue. But data that has been purposely skewed by an adversary with a different agenda is very much a problem.
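A tiny worked example shows the difference. With random noise, the average of many honest ratings lands near the true value; add a block of coordinated fake ratings and the average moves wherever the manipulator wants. The numbers are made up for illustration.

# Random noise averages out; adversarial skew does not.
import random

random.seed(0)
true_quality = 3.0

# Honest ratings with random noise: the average converges near the true value.
honest = [true_quality + random.gauss(0, 1.0) for _ in range(10_000)]
print(round(sum(honest) / len(honest), 2))    # roughly 3.0

# Add 2,000 coordinated fake five-star ratings: more data, but a worse answer.
skewed = honest + [5.0] * 2_000
print(round(sum(skewed) / len(skewed), 2))    # roughly 3.3, pulled toward the shills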

In search, web spam has been a problem since the beginning, including widespread manipulation of the PageRank algorithm invented by the Google founders. PageRank worked by counting the links between web pages as votes. People created links between pages on the Web, and each of those links could be viewed as a vote that some person thought a page was interesting. By recursively looking at who linked to whom, and who linked to them, the idea was that wisdom could be found in the crowd of people who created those links.
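Here is a minimal sketch of the idea, using the standard power-iteration formulation of PageRank on a toy graph; the graph and constants are illustrative, not anyone's production system.

# Minimal PageRank sketch: links act as votes, and votes flow recursively.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share    # each link passes on a share of its vote
        rank = new_rank
    return rank

web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))    # "c" ends up with the most votes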

It wasn’t long until people started creating lots of links and lots of pages, all pointing at a page they wanted amplified by the search engines. This was the beginning of link farms, massive collections of pages that effectively voted that some target page was super interesting and should be shown high up in the search results.

The solution to web spam was TrustRank. TrustRank starts with a small number of trusted websites, then extends trust only to the sites they link to. Untrusted or unknown websites are largely ignored when ranking. Only the votes from trusted websites count in determining what search results to show to people. A related idea, Anti-TrustRank, starts with all the known spammers, shills, and other bad actors, and marks them and everyone who associates with them as untrusted.
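Here is a simplified sketch of the seed-and-propagate idea behind TrustRank, where trust flows outward from a hand-picked seed set and decays with distance. The seed set, decay factor, and toy graph are assumptions for illustration.

# Simplified TrustRank-style propagation: only sites reachable from trusted
# seeds ever accumulate trust, no matter how many links spammers create.

def trust_rank(links, seeds, decay=0.85, iterations=20):
    trust = {site: (1.0 if site in seeds else 0.0) for site in links}
    for _ in range(iterations):
        new_trust = {site: (1.0 if site in seeds else 0.0) for site in links}
        for site, outlinks in links.items():
            if not outlinks or trust[site] == 0.0:
                continue
            share = decay * trust[site] / len(outlinks)
            for target in outlinks:
                new_trust[target] += share    # trusted sites pass on a little trust
        trust = new_trust
    return trust

web = {"seed.org": ["goodnews.com"], "goodnews.com": ["blog.net"],
       "blog.net": [], "linkfarm.biz": ["linkfarm.biz"]}
print(trust_rank(web, seeds={"seed.org"}))
# linkfarm.biz can vote for itself all it wants; with no path from a seed,
# it gets no trust and is ignored when ranking.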

E-mail spam had a similar solution. Nowadays, e-mail from trusted senders, which includes major companies and people you have interacted with in the past, reliably reaches your inbox. E-mail from unknown senders is viewed skeptically, sometimes allowed into your inbox and sometimes not, depending on what the senders have done in the past and what others seem to think of the e-mails they send, but it often goes straight to spam. And e-mail from untrusted senders you never see at all; it is blocked or sent straight to spam without any chance of landing in your inbox.

The problem on social media is severe. Professor Fil Menczer described how “Social media users have in past years become victims of manipulation by astroturf causes, trolling and misinformation. Abuse is facilitated by social bots and coordinated networks that create the appearance of human crowds.” The core problem is bad actors creating a fake crowd. They are pretending to be many people and effectively stuffing the ballot box of algorithmic popularity.

For wisdom of the crowd algorithms such as rankers and recommenders, only use proven trustworthy behavior data. Big data is useless if the data is manipulated. It should not be possible to use fake and controlled accounts to get propaganda trending and picked up by rankers and recommenders.

To avoid manipulation, any behavior data that may involve coordination should be discarded and not used by the algorithms. Discard all unknown or known bad data. Keep only known good data. Shilling will kill the credibility and usefulness of a recommender.
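In practice, that filtering can happen before any behavior data reaches the ranker or recommender. Here is a hedged sketch, with hypothetical field names and an arbitrary threshold, of keeping only events from accounts with proven trust.

# Keep only likes, shares, and clicks from trusted accounts before they ever
# feed a ranker or recommender. Field names and threshold are assumptions.

def filter_events(events, trust_scores, trusted_threshold=0.8):
    return [event for event in events
            if trust_scores.get(event["account_id"], 0.0) >= trusted_threshold]

events = [
    {"account_id": "alice", "action": "like", "item": "article_1"},
    {"account_id": "bot_1", "action": "like", "item": "scam_post"},
    {"account_id": "bot_2", "action": "share", "item": "scam_post"},
]
trust_scores = {"alice": 0.9}    # unknown accounts default to zero trust
print(filter_events(events, trust_scores))    # only alice's like survives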

It should be wisdom of independent reliable people. Do not try to find wisdom in big corrupted crowds full of shills.

There is no free speech issue with only considering authentic data when computing algorithmic amplification. CNN analyst and Yale lecturer Asha Rangappa described it well: “Social media platforms like Twitter are nothing like a real public square ... in the real public square ... all of the participants can only represent one individual – themselves. They can’t duplicate themselves, or create a fake audience for the speaker to make the speech seem more popular than it really is.” However, on Twitter, “the prevalence of bots, combined with the amplification features on the platform, can artificially inflate the ‘value’ of an idea ... Unfortunately, this means that mis- and disinformation occupy the largest ‘market share’ on the platform.”

Professor David Kirsch echoed this concern that those who create fake crowds are able to manipulate people and algorithms on social media. Referring to the power of fake crowds to amplify, Kirsch said, “It matters who stands in the public square and has a big megaphone they’re holding, [it’s] the juice they’re able to amplify their statements with.”

Unfortunately, on the internet, amplified disinformation is particularly effective. For example, bad actors can use a few dozen very active controlled accounts to create the appearance of unified opinion in a discussion forum, shouting down anyone who disagrees and controlling the conversation. Combined with well-timed likes and shares from multiple controlled accounts, they can overwhelm organic activity.

Spammers can make their own irrelevant content trend. These bad actors create manufactured popularity and consensus with all their fake accounts.

Rangappa suggested a solution based on a similar idea to prohibiting collusion in the free market: “In the securities market, for example, we prohibit insider trading and some forms of coordinated activity because we believe that the true value of a company can only be reflected if its investors are competing on a relatively level playing field. Similarly, to approximate a real marketplace of ideas, Twitter has to ensure that ideas can compete fairly, and that their popularity represents their true value.”

In the Facebook Papers published in The Atlantic, Adrienne LaFrance talked about the problem at Facebook, saying the company “knows that repeat offenders are disproportionately responsible for spreading misinformation ... It could tweak its algorithm to prevent widespread distribution of harmful content ... It could also automatically throttle groups when they’re growing too fast, and cap the rate of virality for content that’s spreading too quickly ... Facebook could shift the burden of proof toward people and communities to demonstrate that they’re good actors — and treat reach as a privilege, not a right ... It could do all of these things.”

Former Facebook data scientist Jeff Allen similarly proposed, “Facebook could define anonymous and unoriginal content as ‘low quality’, build a system to evaluate content quality, and incorporate those quality scores into their final ranking ‘value model’.” Allen went on to add in ideas similar to TrustRank, saying that accounts that produce high quality content should be trusted, accounts that spam and shill should be untrusted, and then trust could be part of ranking.
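Here is a small sketch of what folding trust and quality into a final ranking value model might look like, in the spirit of Allen's suggestion; the weights and signals are assumptions for illustration, not Facebook's actual value model.

# Engagement alone can be manufactured by fake crowds. Multiplying by author
# trust and content quality means untrusted, low-quality sources get little reach.

def ranking_value(engagement_score, author_trust, quality_score):
    return engagement_score * author_trust * quality_score

print(ranking_value(engagement_score=1000, author_trust=0.9, quality_score=0.8))     # 720.0
print(ranking_value(engagement_score=1000, author_trust=0.05, quality_score=0.2))    # 10.0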

Allen was concerned about the current state of Facebook, and warned of the long-term retention and growth problems Facebook and Twitter later experienced: “The top performing content is dominated by spammy, anonymous, and unoriginal content ... The platform is easily exploited. And while the platform is vulnerable, we should expect exploitative actors to be heavily [exploiting] it.”

Bad actors running rampant is not inevitable. As EFF and Stack Overflow board member Anil Dash said, fake accounts and shilling are "endemic to networks that are thoughtless about amplification and incentives. Intentionally designed platforms have these issues, but at a manageable scale."

Just as web spam and email spam were reduced to almost nothing by carefully considering how to make them less effective, and just as many community sites like Stack Overflow and Medium are able to counter spam and hate, Facebook and other social media websites can too.

When algorithms are manipulated, everyone but the spammers loses. Users lose because the quality of the content is worse, with shilled scams and misinformation appearing above content that is actually popular and interesting. The business loses because its users are less satisfied, eventually causing retention problems and hurting long-term revenue.

The idea of only using trustworthy accounts in wisdom of the crowd algorithms has already been proven to work. Similar ideas are widely used already for reducing web and email spam to nuisance levels. Wisdom of the trustworthy should be used wherever and whenever there are problems with manipulation of wisdom of the crowd algorithms.

Trust should not be easy to get. New accounts are easy for bad actors to create, so they should be viewed with skepticism. Unknown or untrusted accounts should have their content downranked, and their actions should be mostly ignored by ranker and recommender algorithms. If social media companies did this, then shilling, spamming, and propaganda by bad actors would pay off far less often, making many of those efforts too costly to continue.

In the short-term, with the wrong metrics, it looks great to allow bots, fake accounts, fake crowds, and shilling. Engagement numbers go up, and you see many new accounts. But it’s not real. These aren’t real people who use your product, helpfully interact with other people, and buy things from your advertisers. Allowing untrustworthy accounts and fake crowds hurts customers, advertisers, and the business in the long-term.

Only trustworthy accounts should be amplified by algorithms. And trust should be hard to get and easy to lose.
