Wednesday, December 27, 2023

Book excerpt: First pages of the book

(This is an excerpt from my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It". The first sentence and first page of a book hook readers in. This book starts with an entertaining tale about algorithms and their importance at the beginning of Amazon.com)

The old brick building for Amazon headquarters in 1997 was in a grimy part of downtown Seattle across from a needle exchange and Wigland.

There were only a couple dozen of us, the software engineers. We sat at desks made of unfinished four-by-fours bolted to what should have been a door. Exhausted from work, sometimes we slept on the floor of our tiny offices.

In my office, from the look of the carpet, somebody had spilled coffee many times. A soft blue glow from a screen showing computer code lit my face. I turned to find Jeff Bezos in my doorway on his hands and knees.

He touched his forehead down to the filthy floor. Down and up again, his face disappeared and reappeared as he bowed.

He was chanting: “I am not worthy. I am not worthy.”

What could cause the founder of Amazon, soon to be one of the world’s richest men, to bow down in gratitude to a 24-year-old computer programmer? An algorithm.

Algorithms are computer code, often created in the wee hours by some geek in a dingy room reeking of stale coffee. Algorithms can be enormously helpful. And they can be equally harmful. Either way, they choose what billions of people see online every day. Algorithms are power.

What do algorithms do? Imagine you are looking for something good to read. You remember your friend recently read the book Good Omens and liked it. You go to Amazon and search for [good omens]. What happens next?

Your casually dashed-off query immediately puts thousands of computers to work just to serve you. Search and ranker algorithms run their computer code in milliseconds, then the computers talk to each other about what they found for you. Working together, the computers comb through billions of potential options, filtering and sorting among them, to surface what you might be seeking.
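
(An aside for curious readers: below is a tiny sketch, in Python, of the fan-out, filter, sort, and merge pattern a product search like this might use. The catalog shards, titles, and popularity scores are invented for illustration; a real search system is vastly more complex.)

    # Toy sketch of "fan out to shards, filter, sort, merge" for a product search.
    # Everything here is hypothetical illustration, not any real system.
    from concurrent.futures import ThreadPoolExecutor

    CATALOG_SHARDS = [
        [{"title": "Good Omens", "popularity": 98},
         {"title": "Gardening Omens", "popularity": 3}],
        [{"title": "Good Omens (TV series)", "popularity": 99},
         {"title": "Good Eats", "popularity": 40}],
    ]

    def search_shard(shard, query):
        # Each machine filters its slice of the catalog and sorts its own matches.
        terms = query.lower().split()
        matches = [item for item in shard
                   if all(term in item["title"].lower() for term in terms)]
        return sorted(matches, key=lambda item: item["popularity"], reverse=True)

    def search(query, k=10):
        # Fan the query out to every shard in parallel, then merge the partial results.
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda shard: search_shard(shard, query),
                                     CATALOG_SHARDS))
        merged = [item for partial in partials for item in partial]
        return sorted(merged, key=lambda item: item["popularity"], reverse=True)[:k]

    print(search("good omens"))  # the TV series first, then the book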

And look at that! The book Good Omens is the second thing Amazon shows you, just below the recent TV series. That TV series looks fun too. Perhaps you’ll watch that later. For now, you click on the book.

As you look at the Good Omens book, more algorithms spring into action looking for more ways to help you. Perhaps there are similar books you might enjoy? Recommender algorithms follow their instructions, sorting through what millions of other customers found to show you what “customers who liked Good Omens also liked.” Maybe one of those is enticing enough to get you to click “buy now”.
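
(For the technically curious: a minimal sketch of what “customers who liked Good Omens also liked” can mean under the hood, using co-occurrence counts over purchase histories. The data and names are made up, and Amazon’s actual recommender is far more sophisticated.)

    # Minimal item-to-item sketch: count how often other items are bought by
    # customers who also bought the given item. Purely illustrative data.
    from collections import Counter

    purchases = {
        "alice": {"Good Omens", "Small Gods", "American Gods"},
        "bob":   {"Good Omens", "American Gods"},
        "carol": {"Good Omens", "Small Gods"},
        "dave":  {"Dune", "Hyperion"},
    }

    def also_liked(item, k=3):
        # Tally items that co-occur with `item` across customer purchase histories.
        counts = Counter()
        for basket in purchases.values():
            if item in basket:
                counts.update(basket - {item})
        return [title for title, _ in counts.most_common(k)]

    print(also_liked("Good Omens"))  # e.g. ['Small Gods', 'American Gods']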

And that’s why Jeff Bezos was on my office floor, laughing and bowing.

The percentage of Amazon sales that come through recommender algorithms is much higher than what you’d expect. In fact, it’s astounding. For years, about a third of Amazon’s sales came directly through content suggested by Amazon’s recommender algorithms.

Most of the rest of Amazon’s revenue comes through Amazon’s search and ranking algorithms. In total, nearly all of Amazon’s revenue comes from content suggested, surfaced, found, and recommended by algorithms. At a brick-and-mortar bookstore, a clerk might help you find the book you are looking for. At “Earth’s Biggest Bookstore”, algorithms find or suggest nearly everything people buy.

How algorithms are optimized, and what they show to people, is worth billions every year. Even small changes can make enormous differences.

Jeff Bezos celebrated that day because what algorithms show to people matters. How algorithms are tuned, improved, and optimized matters. It can change a company’s fortunes.

One of Amazon’s software engineers had just found an improvement that made the recommender algorithms much more effective. So there Jeff was, bobbing up and down. Laughing. Celebrating. All because of how important recommender algorithms were to Amazon and its customers.

Tuesday, December 19, 2023

Book excerpt: Wisdom of the trustworthy

(This is an excerpt from drafts of my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It")

Wisdom of the crowds is the idea that combining the opinions of a lot of people will often get a very useful result, usually one that is better and more accurate than almost any of the individual opinions on its own.

Wisdom of the trustworthy is the same idea, with one addition: only use the opinions of accounts that are provably real, independent people.

Discard all opinions from known shills, spammers, and propagandists. Then also discard all opinions from accounts that may be shills, spammers, and propagandists, even if you cannot be sure. Only use proven, trustworthy behavior data.
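
As a rough sketch of the idea, the aggregation is exactly the same as ordinary wisdom of the crowds; the only change is filtering the input down to proven trustworthy accounts before counting anything. The account names and votes below are hypothetical.

    # Minimal sketch of "wisdom of the trustworthy": the same vote counting as
    # wisdom of the crowds, but only over proven trustworthy accounts.
    from collections import Counter

    trusted = {"account_17", "account_42", "account_88"}

    votes = [
        ("account_17", "post_a"), ("account_42", "post_a"), ("account_88", "post_b"),
        ("new_account_3", "post_c"),                               # unknown: not yet trusted
        ("known_shill_9", "post_c"), ("known_shill_9", "post_c"),  # known bad actor
    ]

    def trustworthy_popularity(votes, trusted):
        # Discard every vote that does not come from a proven trustworthy account.
        return Counter(post for account, post in votes if account in trusted)

    print(trustworthy_popularity(votes, trusted))  # post_c gets no credit at all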

Wisdom of the trustworthy is a necessary reaction to online anonymity. Accounts are not people. In fact, many people have multiple accounts, often for reasons that have nothing to do with trying to manipulate algorithms. But the ease of creating accounts, and difficulty of verifying that accounts are actual people, means that we should be skeptical of all accounts.

On an internet where anyone can create and control hundreds or even thousands of accounts, trust should be hard to gain and easy to lose. New accounts should not be considered reliable. Over time, if an account behaves independently, interacts with other accounts in a normal manner, does not shill or otherwise coordinate with others, and doesn’t show robot behaviors such as liking or sharing posts at rates not possible for humans, it may start to be trusted.

The moment an account engages in anything that resembles coordinated shilling, all trust should be lost, and the account should go back to untrusted. Trust should be hard to gain and easy to lose.
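
A minimal sketch of that policy might look like the following, with trust accruing in small increments over time and collapsing back to zero at the first sign of coordination or automation. The increments and threshold are invented placeholders, not recommended values.

    # Sketch: trust is hard to gain (small increments) and easy to lose (one
    # coordinated or bot-like action resets it to zero). Values are illustrative.
    def update_trust(trust, behaved_normally, looks_coordinated, looks_automated):
        if looks_coordinated or looks_automated:
            return 0.0                     # easy to lose: one strike resets trust
        if behaved_normally:
            return min(1.0, trust + 0.01)  # hard to gain: slow accumulation
        return trust

    def is_trusted(trust, threshold=0.5):
        return trust >= threshold

    trust = 0.0                            # new accounts start untrusted
    for _ in range(60):                    # a long run of normal, independent behavior
        trust = update_trust(trust, behaved_normally=True,
                             looks_coordinated=False, looks_automated=False)
    print(is_trusted(trust))               # True after sustained normal behavior

    trust = update_trust(trust, behaved_normally=True,
                         looks_coordinated=True, looks_automated=False)
    print(is_trusted(trust))               # False again the moment it shills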

Wisdom of the trustworthy makes manipulation much more costly, time-consuming, and inefficient for spammers, scammers, propagandists, and other adversaries. No longer would it be easy or cheap to create a pile of accounts whose activity feeds wisdom of the crowd algorithms. Now it would be a slow, cumbersome process, trying to get the accounts trusted, then having them ignored again the moment they shilled anything.

Over a decade ago, a paper called “Wisdom of the Few” showed that recommender systems can do as well using only a much smaller number of carefully selected experts as they would using all available data on every user. The insight was that high quality data often outperforms noisy data, especially when the noisy data is not merely noisy but actually manipulated by adversaries. Less is more, the researchers argued, if less means only using provably good data for recommendations.

Big data has become a mantra in computer science. The more data, the better, it is thought, spurred on by an early result by Michele Banko and Eric Brill at Microsoft Research that showed that accuracy on a natural language task depended much more on how much training data was used than on which algorithm they used. Results just kept getting better and better the more data they used. As others found similar things in search, recommender systems, and other machine learning tasks, big data became popular.

But big data cannot mean corrupted data. Data with random noise is usually not a problem; averaging over large amounts of data tends to eliminate the issue. But data that has been purposely skewed by an adversary with a different agenda is very much a problem.

In search, web spam has been a problem since the beginning, including widespread manipulation of the PageRank algorithm invented by the Google founders. PageRank worked by counting the links between web pages as votes. People created links between pages on the Web, and each of those links could be viewed as a vote that some person thought a page was interesting. By recursively looking at who linked to whom, and who linked to those pages in turn, the idea was that wisdom could be found in the crowd of people who created the links.
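
For the curious, here is a minimal power-iteration sketch of that idea: each link passes a share of its page’s votes along to the pages it points to, over and over, until the scores settle. The toy graph and constants are illustrative only, not any production system.

    # Minimal PageRank sketch: links are votes, passed along recursively.
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share  # each link passes on a share of its votes
            rank = new_rank
        return rank

    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(toy_web))  # "c" collects the most link votes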

It wasn’t long before people started creating lots of links and lots of pages pointing at whatever page they wanted amplified by the search engines. This was the beginning of link farms, massive collections of pages that effectively voted that their target was super interesting and should be shown high up in the search results.

The solution to web spam was TrustRank. TrustRank starts with a small number of trusted seed websites, then extends trust only to the sites those trusted sites link to. Untrusted or unknown websites are largely ignored when ranking. Only the votes from trusted websites count in determining what search results to show to people. A related idea, Anti-TrustRank, starts with all the known spammers, shills, and other bad actors, and marks them and everyone they associate with as untrusted.
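
A minimal sketch of the TrustRank idea follows: the same sort of link analysis, except the computation restarts only from a small hand-picked set of trusted seeds, so trust flows outward from them and never reaches an isolated link farm. The graph and seed set are hypothetical.

    # Minimal TrustRank sketch: like PageRank, but the "random restart" happens
    # only at trusted seed sites, so only pages reachable from seeds earn trust.
    def trustrank(links, seeds, damping=0.85, iterations=50):
        pages = list(links)
        restart = {page: (1.0 / len(seeds) if page in seeds else 0.0) for page in pages}
        trust = dict(restart)
        for _ in range(iterations):
            new_trust = {page: (1.0 - damping) * restart[page] for page in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                share = damping * trust[page] / len(outlinks)
                for target in outlinks:
                    new_trust[target] += share  # trust flows only along links
            trust = new_trust
        return trust

    toy_web = {"seed": ["good_site"], "good_site": ["seed"],
               "link_farm_1": ["spam_page"], "spam_page": ["link_farm_1"]}
    print(trustrank(toy_web, seeds={"seed"}))  # the link farm gets essentially no trust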

E-mail spam had a similar solution. Nowadays, trusted senders, which include major companies and people you have interacted with in the past, reach your inbox reliably. Unknown senders are viewed skeptically, sometimes allowed into your inbox and sometimes not, depending on what they have done in the past and what others seem to think of the e-mails they send, but their messages often go straight to spam. And untrusted senders you never see at all; their e-mails are blocked or sent straight to spam with no chance of being featured in your inbox.
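
A highly simplified sketch of that tiered treatment of senders might look like this. The sender lists and the spam-score threshold are placeholders, not any real mail provider’s policy.

    # Sketch of tiered sender reputation: trusted goes to the inbox, untrusted
    # is blocked, and unknown senders are judged on signals like a spam score.
    TRUSTED_SENDERS = {"friend@example.com", "billing@bigcompany.example"}
    BLOCKED_SENDERS = {"winner@lottery-scam.example"}

    def route_email(sender, spam_score):
        if sender in TRUSTED_SENDERS:
            return "inbox"
        if sender in BLOCKED_SENDERS:
            return "blocked"
        # Unknown senders: judged case by case on reputation and content signals.
        return "inbox" if spam_score < 0.5 else "spam"

    print(route_email("friend@example.com", spam_score=0.9))        # inbox
    print(route_email("stranger@unknown.example", spam_score=0.7))  # spam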

The problem on social media is severe. Professor Fil Menczer described how “Social media users have in past years become victims of manipulation by astroturf causes, trolling and misinformation. Abuse is facilitated by social bots and coordinated networks that create the appearance of human crowds.” The core problem is bad actors creating a fake crowd. They are pretending to be many people and effectively stuffing the ballot box of algorithmic popularity.

For wisdom of the crowd algorithms such as rankers and recommenders, only use proven trustworthy behavior data. Big data is useless if the data is manipulated. It should not be possible to use fake and controlled accounts to get propaganda trending and picked up by rankers and recommenders.

To avoid manipulation, any behavior data that may involve coordination should be discarded and not used by the algorithms. Discard all unknown or known bad data. Keep only known good data. Shilling will kill the credibility and usefulness of a recommender.

It should be wisdom of independent reliable people. Do not try to find wisdom in big corrupted crowds full of shills.

There is no free speech issue with only considering authentic data when computing algorithmic amplification. CNN analyst and Yale lecturer Asha Rangappa described it well: “Social media platforms like Twitter are nothing like a real public square ... in the real public square ... all of the participants can only represent one individual – themselves. They can’t duplicate themselves, or create a fake audience for the speaker to make the speech seem more popular than it really is.” However, on Twitter, “the prevalence of bots, combined with the amplification features on the platform, can artificially inflate the ‘value’ of an idea ... Unfortunately, this means that mis- and disinformation occupy the largest ‘market share’ on the platform.”

Professor David Kirsch echoed this concern that those who create fake crowds are able to manipulate people and algorithms on social media. Referring to the power of fake crowds to amplify, Kirsch said, “It matters who stands in the public square and has a big megaphone they’re holding, [it’s] the juice they’re able to amplify their statements with.”

Unfortunately, on the internet, amplified disinformation is particularly effective. For example, bad actors can use a few dozen very active controlled accounts to create the appearance of unified opinion in a discussion forum, shouting down anyone who disagrees and controlling the conversation. Combined with well-timed likes and shares from multiple controlled accounts, they can overwhelm organic activity.

Spammers can make their own irrelevant content trend. These bad actors create manufactured popularity and consensus with all their fake accounts.

Rangappa suggested a solution based on a similar idea to prohibiting collusion in the free market: “In the securities market, for example, we prohibit insider trading and some forms of coordinated activity because we believe that the true value of a company can only be reflected if its investors are competing on a relatively level playing field. Similarly, to approximate a real marketplace of ideas, Twitter has to ensure that ideas can compete fairly, and that their popularity represents their true value.”

In the Facebook Papers published in The Atlantic, Adrienne LaFrance talked about the problem at Facebook, saying the company “knows that repeat offenders are disproportionately responsible for spreading misinformation ... It could tweak its algorithm to prevent widespread distribution of harmful content ... It could also automatically throttle groups when they’re growing too fast, and cap the rate of virality for content that’s spreading too quickly ... Facebook could shift the burden of proof toward people and communities to demonstrate that they’re good actors — and treat reach as a privilege, not a right ... It could do all of these things.”

Former Facebook data scientist Jeff Allen similarly proposed, “Facebook could define anonymous and unoriginal content as ‘low quality’, build a system to evaluate content quality, and incorporate those quality scores into their final ranking ‘value model’.” Allen went on to add in ideas similar to TrustRank, saying that accounts that produce high quality content should be trusted, accounts that spam and shill should be untrusted, and then trust could be part of ranking.

Allen was concerned about the current state of Facebook, and warned of the long-term retention and growth problems Facebook and Twitter later experienced: “The top performing content is dominated by spammy, anonymous, and unoriginal content ... The platform is easily exploited. And while the platform is vulnerable, we should expect exploitative actors to be heavily [exploiting] it.”

Bad actors running rampant is not inevitable. As EFF and Stack Overflow board member Anil Dash said, fake accounts and shilling is “endemic to networks that are thoughtless about amplification and incentives. Intentionally designed platforms have these issues, but at a manageable scale.”

Just as web spam and email spam were reduced to almost nothing by carefully considering how to make them less effective, and just as many community sites like Stack Overflow and Medium are able to counter spam and hate, Facebook and other social media websites can too.

When algorithms are manipulated, everyone but the spammers loses. Users lose because the quality of the content is worse, with shilled scams and misinformation appearing above content that is actually popular and interesting. The business loses because its users are less satisfied, eventually causing retention problems and hurting long-term revenue.

The idea of only using trustworthy accounts in wisdom of the crowd algorithms has already been proven to work. Similar ideas are widely used already for reducing web and email spam to nuisance levels. Wisdom of the trustworthy should be used wherever and whenever there are problems with manipulation of wisdom of the crowd algorithms.

Trust should not be easy to get. New accounts are easy for bad actors to create, so they should be viewed with skepticism. Unknown or untrusted accounts should have their content downranked, and their actions should be mostly ignored by ranker and recommender algorithms. If social media companies did this, then shilling, spamming, and propaganda by bad actors would pay off far less often, making many of those efforts too costly to continue.

In the short-term, with the wrong metrics, it looks great to allow bots, fake accounts, fake crowds, and shilling. Engagement numbers go up, and you see many new accounts. But it’s not real. These aren’t real people who use your product, helpfully interact with other people, and buy things from your advertisers. Allowing untrustworthy accounts and fake crowds hurts customers, advertisers, and the business in the long-term.

Only trustworthy accounts should be amplified by algorithms. And trust should be hard to get and easy to lose.

Wednesday, December 13, 2023

Extended book excerpt: Computational propaganda

(This is a long excerpt about manipulation of algorithms by adversaries from my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It")

Inauthentic activity is designed to manipulate social media. It exists because there is a strong incentive to manipulate wisdom of the crowd algorithms. If someone can get recommended by algorithms, they can get a lot of free attention because they now will be the first thing many people see.

For adversaries, a successful manipulation is like a free advertisement, seen by thousands or even millions. On Facebook, Twitter, YouTube, Amazon, Google, and most other sites on the internet, adversaries have a very strong incentive to manipulate these companies’ algorithms.

For some governments, political parties, and organizations, the incentive to manipulate goes beyond merely shilling some content for the equivalent of free advertising. These adversaries engage thousands of controlled accounts over long periods of time in disinformation campaigns.

The goal is to promote a point of view, shut down those promoting other points of view, obfuscate unfavorable news and facts, and sometimes even create whole other realities that millions of people believe are true.

These efforts by major adversaries are known as “computational propaganda.” Computational propaganda unites many terms — “information operations,” “information warfare,” “influence operations,” “online astroturfing,” “cybertufing,” “disinformation campaigns,” and many others — and is defined as “the use of automation and algorithms in the manipulation of public opinion.”

More simply, computational propaganda is an attempt to give “the illusion of popularity” by using a lot of fake accounts and fake followers to make something look far more popular than it actually is. It creates “manufactured consensus,” the appearance that many people think something is interesting, true, and important when, in fact, it is not.

It is propaganda by stuffing the ballot box. The trending algorithm on Twitter and the recommendation engine on Facebook look at what people are sharing, liking, and commenting on as votes, votes for what is interesting and important. But “fringe groups that were five or 10 people could make it look like they were 10 or 20,000 people,” reported PBS’ The Facebook Dilemma. “A lot of people sort of laughed about how easy it was for them to manipulate social media.” They run many “accounts on Facebook at any given time and use them to manipulate people.”

This is bad enough when it is done for profit, to amplify a scam or just to try to sell more of some product. But when governments get involved, especially autocratic governments, reality itself can start to warp under sustained efforts to confuse what is real. “It’s anti-information,” said historian Heather Cox Richardson. Democracies rely on a common understanding of facts, of what is true, to function. If you can get even a few people to believe something that is not true, it changes how people vote, and can even “alter democracy.”

The scale of computational propaganda is what makes it so dangerous. Large organizations and state-sponsored actors are able to sustain thousands of controlled accounts pounding out the same message over long periods of time. They can watch how many real people react to what they do, learn what is working and what is failing to gain traction, and then adapt, increasing the most successful propaganda.

Scale is what turns misinformation and disinformation into computational propaganda. Stanford Internet Observatory’s Renée DiResta provided an excellent explanation in The Yale Review: “Misinformation and disinformation are both, at their core, misleading or inaccurate information; what separates them is intent. Misinformation is the inadvertent sharing of false information; the sharer didn’t intend to mislead people and genuinely believed the story. Disinformation, by contrast, is the deliberate creation and sharing of information known to be false. It’s a malign narrative that is spread deliberately, with the explicit aim of causing confusion or leading the recipient to believe a lie. Computational propaganda is a suite of tools or tactics used in modern disinformation campaigns that take place online. These include automated social media accounts that spread the message and the algorithmic gaming of social media platforms to disseminate it. These tools facilitate the disinformation campaign’s ultimate goal — media manipulation that pushes the false information into mass awareness.”

The goal of computational propaganda is to bend reality, to make millions believe something that is not true is true. DiResta warned: “As Lenin purportedly put it, ‘A lie told often enough becomes the truth.’ In the era of computational propaganda, we can update that aphorism: ‘If you make it trend, you make it true.’”

In recent years, Russia was particularly effective at computational propaganda. Adversaries created fake media organizations that looked real, created fake accounts with profiles and personas that looked real, and developed groups and communities to the point they had hundreds of thousands of followers. Russia was “building influence over a period of years and using it to manipulate and exploit existing political and societal divisions,” DiResta wrote in the New York Times.

The scale of this effort was remarkable. “About 400,000 bots [were] engaged in the political discussion about the [US] Presidential election, responsible for roughly 3.8 million tweets, about one-fifth of the entire conversation,” said USC researchers.

Only later was the damage at all understood. In the book Zucked, Roger McNamee summarized the findings: “Facebook disclosed that 126 million users had been exposed to Russian interference, as well as 20 million users on Instagram ... The user number represents more than one-third of the US population, but that grossly understates its impact. The Russians did not reach a random set of 126 million people on Facebook. Their efforts were highly targeted. On the one hand, they had targeted people likely to vote for Trump with motivating messages. On the other, they identified subpopulations of likely Democratic voters who might be discouraged from voting ... In an election where only 137 million people voted, a campaign that targeted 126 million eligible voters almost certainly had an impact.”

These efforts were highly targeted, trying to pick out parts of the US electorate that might be susceptible to their propaganda. The adversaries worked over a long period of time, adapting as they discovered what was getting traction.

By late 2019, as reported by MIT Technology Review, “all 15 of the top pages targeting Christian Americans, 10 of the top 15 Facebook pages targeting Black Americans, and four of the top 12 Facebook pages targeting Native Americans were being run by ... Eastern European troll farms.”

These pages “reached 140 million US users monthly.” They achieved this extraordinary reach not by people seeking them out on their own, but by manipulating Facebook’s “engagement-hungry algorithm.” These groups were so large and so popular because “Facebook’s content recommendation system had pushed [them] into their news feeds.” Facebook’s optimization process for their algorithms was giving these inauthentic actors massive reach for their propaganda.

As Facebook data scientists warned inside of the company, “Instead of users choosing to receive content from these actors, [Facebook] is choosing to give them an enormous reach.” Real news, trustworthy information from reliable sources, took a back seat to this content. Facebook was amplifying these troll farms. The computational propaganda worked.

The computational propaganda was not limited to Facebook. The efforts spanned many platforms, trying the same tricks everywhere, looking for flaws to exploit and ways to extend their reach. The New York Times reported that the Russian “Internet Research Agency spread its messages not only via Facebook, Instagram and Twitter ... but also on YouTube, Reddit, Tumblr, Pinterest, Vine and Google+” and others. Wherever they were most successful, they would do more. They went wherever it was easiest and most efficient to spread their false message to a mass audience.

It is tempting to question how so many people could fall for this manipulation. How could over a hundred million Americans, and hundreds of millions of people around the world, see propaganda and believe it?

But this propaganda did not obviously look like Russian propaganda. The adversaries would impersonate Americans using fake accounts with descriptions that appeared to be authentic on casual inspection. Most people would have no idea they were reading a post or joining a Facebook Group that was created by a troll farm.

Instead “they would be attracted to an idea — whether it was guns or immigration or whatever — and once in the Group, they would be exposed to a steady flow of posts designed to provoke outrage or fear,” said Roger McNamee in Zucked. “For those who engaged frequently with the Group, the effect would be to make beliefs more rigid and more extreme. The Group would create a filter bubble, where the troll, the bots, and the other members would coalesce around an idea floated by the troll.”

The propaganda was carefully constructed, using amusing memes and emotion-laden posts to lure people in, then using manufactured consensus through multiple controlled accounts to direct and control what people saw afterwards.

Directing and controlling discussions requires only a small number of accounts if they are well-timed and coordinated. Most people reading a group are passive, not actively posting. And far more people read than like, comment, or reshare.

Especially if adversaries do the timing well to get the first few comments and likes, then “as few as 1 to 2 percent of a group can steer the conversation if they are well-coordinated. That means a human troll with a small army of digital bots—software robots—can control a large, emotionally engaged Group.” If any real people start to argue or point out that something is not true, they can be drowned out by the controlled accounts simultaneously slamming them in the comments, creating an illusion of consensus and keeping the filter bubble intact.

This spanned the internet, on every platform and across seemingly legitimate websites. Adversaries tried many things to see what worked. When something gained traction, they would “post the story simultaneously on an army of Twitter accounts” along with their controlled accounts saying, “read the story that the mainstream media doesn’t want you to know about.” If any real journalist eventually wrote about the story, “The army of Twitter accounts—which includes a huge number of bots—tweets and retweets the legitimate story, amplifying the signal dramatically. Once a story is trending, other news outlets are almost certain to pick it up.”

In the most successful cases, what starts as propaganda becomes misinformation, with actual American citizens unwittingly echoing Russian propaganda, now mistakenly believing a constructed reality was actually real.

By no means was this limited to the United States or to Russians. Many large-scale adversaries, including governments, political campaigns, multinational corporations, and organizations, are engaging in computational propaganda. What they have in common is using thousands of fake, hacked, controlled, or paid accounts to rapidly create messages on social media and the internet. They create manufactured consensus around their message and flood confusion around what is real and what is not. They have been seen “distorting political discourse, including in Albania, Mexico, Argentina, Italy, the Philippines, Afghanistan, South Korea, Bolivia, Ecuador, Iraq, Tunisia, Turkey, Taiwan, Paraguay, El Salvador, India, the Dominican Republic, Indonesia, Ukraine, Poland and Mongolia,” wrote the Guardian.

Computational propaganda is everywhere in the world. It “has become a regular tool of statecraft,” said Princeton Professor Jacob Shapiro, “with at least 51 different countries targeted by government-led online influence efforts” in the last decade.

An example in India is instructive. In the 2019 general election in India, adversaries used “hundreds of WhatsApp groups,” fake accounts, hacked and hijacked accounts, and “Tek Fog, a highly sophisticated app” to centrally control activity on social media. In a published paper, researchers wrote that adversaries “were highly effective at producing lasting Twitter trends with a relatively small number of participants.” This computational propaganda amplified “right-wing propaganda … making extremist narratives and political campaigns appear more popular than they actually are.” They were remarkably effective: “A group of public and private actors working together to subvert public discourse in the world’s largest democracy by driving inauthentic trends and hijacking conversations across almost all major social media platforms.”

Another recent example was in Canada, the so-called “Siege of Ottawa.” In the Guardian, Arwa Mahdawi wrote about how it came about: “It’s an astroturfed movement – one that creates an impression of widespread grassroots support where little exists – funded by a global network of highly organised far-right groups and amplified by Facebook ... Thanks to the wonders of modern technology, fringe groups can have an outsize influence ... [using] troll farms: organised groups that weaponise social media to spread misinformation.”

Computational propaganda “threatens democracies worldwide.” It has been “weaponized around the world,” said MIT Professor Sinan Aral in the book The Hype Machine. In the 2018 general elections in Sweden, a third of politics-related hashtagged tweets “were from fake news sources.” In the 2018 national elections in Brazil, “56 percent of the fifty most widely shared images on [popular WhatsApp] chat groups were misleading, and only 8 percent were fully truthful.” In the 2019 elections in India, “64 percent of Indians encountered fake news online.” In the Philippines, there was a massive propaganda effort against Maria Ressa, a journalist “working to expose corruption and a Time Person of the Year in 2018.” Every democracy around the world is seeing adversaries using computational propaganda.

The scale is what makes computational propaganda so concerning. The actors behind computational propaganda are often well funded, with considerable resources to bring to bear in achieving their aims.

Remarkably, there is now enough money involved that there are private companies “offering disinformation-for-hire services.” Computational propaganda “has become more professionalised and is now produced on an industrial scale.” It is everywhere in the world. “In 61 countries, we found evidence of political parties or politicians running for office who have used the tools and techniques of computational propaganda,” said researchers at University of Oxford. The way they work is always the same. “Automated accounts are often used to amplify certain narratives while drowning out others ... in order to game the automated systems social media companies use.” It is spreading propaganda using manufactured consensus at industrial scale.

Also concerning is that computational propaganda can target just the most vulnerable and the most susceptible and still achieve its aims. In a democracy, the difference between winning an election and losing is often just a few percentage points.

To change the results of an election, you don’t have to influence everyone. The target of computational propaganda is usually “only 10-20% of the population.” Swaying even a fraction of this audience by convincing them to vote in a particular way or discouraging them from voting at all “can have a resounding impact,” shifting all the close elections favorably, and leading to control of a closely-contested government.

To address the worldwide problem of computational propaganda, it is important to understand why it works. Part of why computational propaganda works is the story of why propaganda has worked throughout history. Computational propaganda floods what people see with a particular message, creating an illusion of consensus while repeating the same false message over and over again.

This feeds the common belief fallacy, even if the number of controlled accounts is relatively small, by creating the appearance that everyone believes this false message to be true. It creates a firehose of falsehood, flooding people with the false message, creating confusion about what is true or not, and drowning out all other messages. And the constant repetition, seeing the message over and over, fools our minds using the illusory truth effect, which tends to make us believe things we have seen many times before, “even if the idea isn’t plausible and even if [we] know better.”

As Wharton Professor Ethan Mollick wrote, “The Illusionary Truth Effect supercharges propaganda on social media. If you see something repeated enough times, it seems more true.” Professor Mollick went on to say that studies found it works on the vast majority of people even when the information isn’t plausible and merely five repetitions were enough to start to make false statements seem true.

The other part of why computational propaganda works is algorithmic amplification by social media algorithms. Wisdom of the crowd algorithms, which are used in search, trending, and recommendations, work by counting votes. They look for what is popular, or what seems to be interesting to people like you, by looking at what people seemed to have enjoyed in the recent past.

When the algorithms look for what people are enjoying, they assume each account is a real person and that each person is acting independently. When adversaries create many fake accounts or coordinate between many controlled accounts, they are effectively voting many times, fooling the algorithms with an illusion of consensus.

What the algorithm thought was popular and interesting turns out to be shilled. The social media post is not really popular or interesting, but the computational propaganda effort made it look to the algorithm that it is. And so the algorithm amplifies the propaganda, inappropriately showing it to many more people, and making the problem far worse.
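
A small sketch makes the mechanics plain. A naive trending count treats every like as an independent vote; a more defensive version collapses each coordinated cluster of accounts to a single vote. The posts, accounts, and cluster labels below are invented, and detecting the clusters is, of course, the hard part.

    # Sketch: coordinated accounts fool a naive popularity count; collapsing
    # each coordinated cluster to one vote blunts the attack. Toy data only.
    from collections import Counter

    likes = [
        ("post_real", "user_1"), ("post_real", "user_2"), ("post_real", "user_3"),
        ("post_shilled", "bot_1"), ("post_shilled", "bot_2"),
        ("post_shilled", "bot_3"), ("post_shilled", "bot_4"),
    ]
    # Assume upstream coordination detection has clustered bot_1..bot_4 together.
    cluster_of = {"bot_1": "troll_farm_A", "bot_2": "troll_farm_A",
                  "bot_3": "troll_farm_A", "bot_4": "troll_farm_A"}

    naive_trending = Counter(post for post, _ in likes)

    deduped = {(post, cluster_of.get(user, user)) for post, user in likes}
    robust_trending = Counter(post for post, _ in deduped)

    print(naive_trending)   # the shilled post looks most popular
    print(robust_trending)  # each cluster counts once; the real post wins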

Both people using social media and the algorithms picking what people see on social media are falling victim to the same technique, manufactured consensus, the propagandist creating “illusory notions of ... popularity because of this same automated inflation of the numbers.” It is adversaries using bots and coordinated accounts to mimic real users.

“They can drive up the number of likes, re-messages, or comments associated with a person or idea,” wrote the authors of Social Media and Democracy. “Researchers have catalogued political bot use in massively bolstering the social media metrics.”

The fact that they are only mimicking real users is important to addressing the problem. They are not real users, and they don’t behave like real users.

For example, when the QAnon conspiracy theory was growing rapidly on Facebook, it grew using “minimally connected bulk group invites. One member sent over 377,000 group invites in less than 5 months.” There were very few people responsible. According to reporter David Gilbert, there are “a relatively few number of actors creating a large percentage of the content.” He said a “small group of users has been able to hijack the platform.”

To shill and coordinate between many accounts pushing propaganda, adversaries have to behave in ways that are not human. Bots and other accounts that are controlled by just a few people all “pounce on fake news in the first few seconds after it’s published, and they retweet it broadly.” The initial spreaders of the propaganda “are much more likely to be bots than humans” and often will be the same accounts, superspreaders of propaganda, acting over and over again.

Former Facebook data scientist Sophie Zhang talked about this in a Facebook internal memo, reported by BuzzFeed: “thousands of inauthentic assets ... coordinated manipulation ... network[s] of more than a thousand actors working to influence ... The truth was, we simply didn’t care enough to stop them.” Despairing about the impact of computational propaganda on people around the world, Zhang went on to lament, “I have blood on my hands.”

Why do countries, and especially authoritarian regimes, create and promote propaganda? Why do they bother?

The authors of the book Spin Dictators write that, in recent years, because of globalization, post-industrial development, and technology changes, authoritarian regimes have “become less bellicose and more focused on subtle manipulation. They seek to influence global opinion, while co-opting and corrupting Western elites.”

Much of this is simply that, in recent decades, it has become cheaper and more effective to maintain power through manipulation and propaganda, partly because of the lower cost of communication, such as disinformation campaigns on social media, and partly because the economic benefits of openness raise the cost of resorting to violence.

“Rather than intimidating citizens into submission, they use deception to win the people over.” Nowadays, propaganda is easier and cheaper. “Their first line of defense, when the truth is against them, is to distort it. They manipulate information ... When the facts are good, they take credit for them; when bad, they have the media obscure them when possible and provide excuses when not. Poor performance is the fault of external conditions or enemies ... When this works, spin dictators are loved rather than feared.”

Nowadays, it is cheaper to become loved than feared. “Spin dictators manipulate information to boost their popularity with the general public and use that popularity to consolidate political control, all while pretending to be democratic.”

While not all manipulation of wisdom of the crowd algorithms is state actors, adversarial states are a big problem: “The Internet allows for low-cost, selective censorship that filters information flows to different groups.” Propaganda online is cheap. “Social networks can be hijacked to disseminate sophisticated propaganda, with pitches tailored to specific audiences and the source concealed to increase credibility. Spin dictators can mobilize trolls and hackers ... a sophisticated and constantly evolving tool kit of online tactics.”

Unfortunately, internet “companies are vulnerable to losing lucrative markets,” so they are not always quick to act when they discover countries manipulating their rankers and recommender algorithms; authoritarian governments often play to this fear by threatening retaliation or loss of future business in the country.

Because “the algorithms that decide what goes viral” are vulnerable to shilling, it is also easy for spin dictators to “use propaganda to spread cynicism and division.” And “if Western publics doubt democracy and distrust their leaders, those leaders will be less apt to launch democratic crusades around the globe.” Moreover, they can spread the message that “U.S.-style democracy leads to polarization and conflict” and corruption. This reduces the threats to an authoritarian leader and reinforces their own popularity.

Because the manipulation consists entirely of adversaries trying to increase their own visibility, downranking or removing accounts involved in computational propaganda carries little business risk. New accounts and any account involved in shilling, coordination, or propaganda could largely be ignored for the purpose of algorithmic amplification, and repeat offenders could be banned entirely.

Computational propaganda exists because it is cost effective to do at large scale. Increasing the cost of propaganda reaching millions of people may be enough to vastly reduce its impact. As Sinan Aral writes in the book The Hype Machine, “We need to cut off the financial returns to spreading misinformation and reduce the economic incentive to create it in the first place.”

While human susceptibility to propaganda is difficult to solve, a big part of the problem of computational propaganda on the internet today comes down to how easy it is for adversaries to manipulate wisdom of the crowd algorithms and have their propaganda cheaply and efficiently amplified.

Writing in the Washington Post, Will Oremus blamed recommendation and other algorithms for making it far too easy for the bad guys. “The problem of misinformation on social media has less to do with what gets said by users than what gets amplified — that is, shown widely to others — by platforms’ recommendation software,” he said. Raising the cost of manipulating the recommendation engine is key to reducing the effectiveness of computational propaganda.

Wisdom of the crowds depends on the crowd consisting of independent voices voting independently. When that assumption is violated, adversaries can force the algorithms to recommend whatever they want. Computational propaganda uses a combination of bots and many controlled accounts, along with so-called “useful idiot” shills, to efficiently and effectively manipulate trending, ranker, and recommender algorithms.

Allowing their platforms to be manipulated by computational propaganda makes the experience on the internet worse. University of Oxford researchers found that “globally, disinformation is the single most important fear of internet and social media use and more than half (53%) of regular internet users are concerned about disinformation [and] almost three quarters (71%) of internet users are worried about a mixture of threats, including online disinformation, fraud and harassment.” At least in the long-term, it is in everyone’s interest to reduce computational propaganda.

When adversaries have their bots and coordinated accounts like, share, and post, none of that is authentic activity. None of that shows that people actually like the content. None of that content is actually popular or interesting. It is all manipulation of the algorithms and only serves to make relevance and the experience worse.

Sunday, December 10, 2023

Book excerpt: The rise and fall of wisdom of the crowds

(This is an excerpt from drafts of my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It")

Wisdom of the crowds is the epiphany that combining many people's opinions is useful, often more useful than expert opinions.

Customer reviews summarize what thousands of people think of movies, books, and everything else you might want to buy. Customer reviews can be really useful for knowing if you want to buy something you've never tried before.

When you search the internet, what thousands of people clicked on before you helps determine what you see. Most of the websites on the internet are useless or scams; wisdom of the crowd filters all that out and helps you find what you need.

When you read the news online, you see news first that other people think is interesting. What people click determines what information you see about what's going on in the world.

Algorithms on the internet take the wisdom of the crowds to a gargantuan scale. Algorithms process all the data, boiling it down, until you get millions of people helping millions of people find what they need.

It sounds great, right? And it is. But once you use wisdom of the crowds, scammers come in. They see dollar signs in fooling those algorithms. Scammers profit from faking crowds.

When manipulated, wisdom of the crowds can promote scams, misinformation, and propaganda. Spammers clog up search engines until we can't see anything but scams. Online retailers are filled with bogus positive customer reviews of counterfeit and fraudulent items. The bad guys astroturf everything using fake crowds. Foreign operatives are able to flood the zone on social media with propaganda using thousands of fake accounts.

What we need is an internet that works for us. We need an internet that is useful and helpful, where we can find what we need without distractions and scams. Wisdom of the crowds and the algorithms that use wisdom of the crowds are the key to getting us there. But wisdom of the crowds can fail.

It's tricky to get right. Good intentions can lead to destructive outcomes. When executives tell their teams to optimize for clicks, they discover far too late that going down that path optimizes for scams and hate. When teams use big data, they're trying to make their algorithms work better, but they often end up sweeping up manipulated data that skews their results toward crap. Understanding why wisdom of the crowds fails and how to fix it is the key to getting us the internet we want.

The internet has come a long way. In the mid-1990s, it was just a few computer geeks. Nowadays, everyone in the world is online. There have been hard lessons learned along the way. These are the stories of unintended consequences.

Well-intentioned efforts to tell teams to increase engagement caused misinformation and spam. Experimentation and A/B testing helped some teams help customers, but also accidentally sent other teams down dark paths of harming customers. Attempts to improve algorithms can easily go terribly wrong.

The internet has grown massively. During all of that growth, many internet companies struggled with figuring out how to make a real business. At first, Google had no revenue and no idea how to make money off web search. At first, Amazon had no profits and it was unclear if it ever would.

Almost always, people at tech companies had good intentions. We were scrambling to build the right thing. What we ended up building was not always the right thing. The surprising reason for this failure is what gets built depends not so much on the technology but the incentives people have.

Friday, December 08, 2023

Book excerpt: Manipulating likes, comments, shares, and follows

(This is an excerpt from drafts of my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It")

“The systems are phenomenally easy to game,” explained Stanford Internet Observatory’s Renée DiResta.

The fundamental idea behind the algorithms used by social media is that “popular content, as defined by the crowd” should rise to the top. But “the crowd doesn’t have to be real people.”

In fact, adversaries can get these algorithms to feature whatever content they want. The process is easy and cheap, just pretend to be many people: “Bots and sockpuppets can be used to manipulate conversations, or to create the illusion of a mass groundswell of grassroots activity, with minimal effort.”

Whatever they want — whether it is propaganda, scams, or just flooding-the-zone with disparate and conflicting misinformation — can appear to be popular, which trending, ranker, and recommender algorithms will then dutifully amplify.

“The content need not be true or accurate,” DiResta notes. All this requires is a well-motivated small group of individuals pretending to be many people. “Disinformation-campaign material is spread via mass coordinated action, supplemented by bot networks and sockpuppets (fake people).”

Bad actors can amplify propaganda on a massive scale, reaching millions, cheaply and easily, from anywhere in the world. “Anyone who can gather enough momentum from sharing, likes, retweets, and other message-amplification features can spread a message across the platforms’ large standing audiences for free,” DiResta continued in an article for Yale Review titled "Computational Propaganda": “Leveraging automated accounts or fake personas to spread a message and start it trending creates the illusion that large numbers of people feel a certain way about a topic. This is sometimes called ‘manufactured consensus’.”

Another name for it is astroturf. Astroturf is feigning popularity by using a fake crowd of shills. It's not authentic. Astroturf creates the illusion of popularity.

There are even businesses set up to provide the necessary shilling, hordes of fake people on social media available on demand to like, share, and promote whatever you may want. As described by Sarah Frier in the book No Filter: “If you searched [get Instagram followers] on Google, dozens of small faceless firms offered to make fame and riches more accessible, for a fee. For a few hundred dollars, you could buy thousands of followers, and even dictate exactly what these accounts were supposed to say in your comments.”

Sarah Frier described the process in more detail. “The spammers ... got shrewder, working to make their robots look more human, and in some cases paying networks of actual humans to like and comment for clients.” They found “dozens of firms” offering these services of “following and commenting” to make content falsely appear to be popular and thereby get free amplification by the platforms’ wisdom of the crowd algorithms. “It was quite easy to make more seemingly real people.”

In addition to creating fake people by the thousands, it is easy to find real people who are willing to be paid to shill, some of whom would even “hand over the password credentials” for their account, allowing the propagandists to use their account to shill whenever they wished. For example, there were sites where bad actors could “purchase followers and increase engagement, like Kicksta, Instazood, and AiGrow. Many are still running today.” And in discussion groups, it was easy to recruit people who, for some compensation, “would quickly like and comment on the content.”

Bad actors manipulate likes, comments, shares, and follows because it works. When wisdom of the crowd algorithms look for what is popular, they pick up all these manipulated likes and shares, thinking they are real people acting independently. When the algorithms feature manipulated content, bad actors get what is effectively free advertising, the coveted top spots on the page, seen by millions of real people. This visibility, this amplification, can be used for many purposes, including foreign state-sponsored propaganda or scams trying to swindle.

Professor Fil Menczer studies misinformation and disinformation on social media. In our interview, he pointed out that it is not just wisdom of the crowd algorithms that fixate on popularity, but a “cognitive/social” vulnerability that “we tend to pay attention to items that appear popular … because we use the attention of other people as a signal of importance.”

Menczer explained: “It’s an instinct that has evolved for good reason: if we see everyone running we should run as well, even if we do not know why.” Generally, it does often work to look at what other people are doing. “We believe the crowd is wise, because we intrinsically assume the individuals in the crowd act independently, so that the probability of everyone being wrong is very low.”

But this is subject to manipulation, especially online on social media “because one entity can create the appearance of many people paying attention to some item by having inauthentic/coordinated accounts share that item.” That is, if a few people can pretend to be many people, they can create the appearance of a popular trend, and fool our instinct to follow the crowd.

To make matters worse, there often can be a vicious cycle where some people are manipulated by bad actors, and then their attention, their likes and shares, is “further amplified by algorithms.” Often, it is enough to merely start some shilled content trending, because “news feed ranking algorithms use popularity/engagement signals to determine what is interesting/engaging and then promote this content by ranking it higher on people’s feeds.”

Adversaries manipulating the algorithms can be clever and patient, sometimes building up their controlled accounts over a long period of time. One low cost method of making a fake account look real and useful is to steal viral content and share it as your own.

In an article titled “Those Cute Cats Online? They Help Spread Misinformation,” New York Times reporters described one method of how new accounts manage to quickly gain large numbers of followers. The technique involves reposting popular content, such as memes that previously went viral, or cute pictures of animals: “Sometimes, following a feed of cute animals on Facebook unknowingly signs [people] up” for misinformation. “Engagement bait helped misinformation actors generate clicks on their pages, which can make them more prominent in users’ feeds in the future.”

Controlling many seemingly real accounts, especially accounts that have real people following them to see memes and cute pictures of animals, allows bad actors to “act in a coordinated fashion to increase influence.” The goal, according to researchers at Indiana University, is to create a network of controlled shills, many of which might be unwitting human participants, that are “highly coordinated, persistent, homogeneous, and fully focused on amplifying” scams and propaganda.

This is not costless for social media companies. Not only are people directly misled, and even sometimes pulled into conspiracy theories and scams, but amplifying manipulated content including propaganda rather than genuinely popular content will “negatively affect the online experience of ordinary social media users” and “lower the overall quality of information” on the website. Degradation of the quality of the experience can be hard for companies to see, only eventually showing up in poor retention and user growth when customers get fed up and leave in disgust.

Allowing fake accounts, manipulation of likes and shares, and shilling of scams and propaganda may hurt the business in the long-term, but, in the short-term, it can mean advertising revenue. As Karen Hao reported in MIT Technology Review, “Facebook isn’t just amplifying misinformation. The company is also funding it.” While some adversaries manipulate wisdom of the crowd algorithms in order to push propaganda, some bad actors are in it for the money.

Social media companies allowing this type of manipulation does generate revenue, but it also reduces the quality of the experience, filling the site with unoriginal content, republished memes, and scams. Hao detailed how it works: “Financially motivated spammers are agnostic about the content they publish. They go wherever the clicks and money are, letting Facebook’s news feed algorithm dictate which topics they’ll cover next ... On an average day, a financially motivated clickbait site might be populated with ... predominantly plagiarized ... celebrity news, cute animals, or highly emotional stories—all reliable drivers of traffic. Then, when political turmoil strikes, they drift toward hyperpartisan news, misinformation, and outrage bait because it gets more engagement ... For clickbait farms, getting into the monetization programs is the first step, but how much they cash in depends on how far Facebook’s content-recommendation systems boost their articles.”

The problem is that this works. Adversaries have a strong incentive to manipulate social media’s algorithms if it is easy and profitable.

But “they would not thrive, nor would they plagiarize such damaging content, if their shady tactics didn’t do so well on the platform,” Hao wrote. “One possible way Facebook could do this: by using what’s known as a graph-based authority measure to rank content. This would amplify higher-quality pages like news and media and diminish lower-quality pages like clickbait, reversing the current trend.” The idea is simple, that authoritative, trustworthy sources should be amplified more than untrustworthy or spammy sources.
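
A minimal sketch of that idea, assuming some upstream system (TrustRank-like) has already assigned each source an authority score: rank by engagement discounted by authority, so high-engagement clickbait from a low-authority page no longer wins automatically. The pages and numbers are invented.

    # Sketch: weight engagement by source authority so low-quality, high-engagement
    # clickbait ranks below higher-quality sources. Scores are illustrative.
    posts = [
        {"title": "Plagiarized outrage bait", "engagement": 9000, "authority": 0.05},
        {"title": "Original local reporting", "engagement": 1200, "authority": 0.90},
    ]

    def rank_with_authority(posts):
        # Score each post by engagement discounted by how authoritative its source is.
        return sorted(posts, key=lambda p: p["engagement"] * p["authority"], reverse=True)

    for post in rank_with_authority(posts):
        print(post["title"])  # the higher-authority page now outranks the clickbait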

Broadly, this type of manipulation is spam, much like the spam that technology companies have dealt with for years in email and on the Web. If social media spam were not cost-effective, it would not exist. Like with web spam and email spam, the key with social media spam is to make it less effective and less efficient. As Hao suggested, manipulating wisdom of the crowd algorithms could be made less profitable by viewing likes and shares from less trustworthy accounts with considerable skepticism. If the algorithms did not amplify this content as much, it would be much less lucrative to spammers.

Inside of Facebook, data scientists proposed something similar. Billy Perrigo at Time magazine reported that Facebook “employees had discovered that pages that spread unoriginal content, like stolen memes that they’d seen go viral elsewhere, contributed to just 19% of page-related views on the platform but 64% of misinformation views.” Facebook data scientists “proposed downranking these pages in News Feed ... The plan to downrank these pages had few visible downsides ... [and] could prevent all kinds of high-profile missteps.”

What the algorithms show is important. The algorithms can amplify a wide range of interesting and useful content that enhances discovery and keeps people on the platform.

Or the algorithms can amplify manipulated content, including hate speech, spam, scams, and misinformation. That might make people click now in outrage, or fool them for a while, but it will cause people to leave in disgust eventually.

Tuesday, December 05, 2023

Book excerpt: Bonuses and promotions causing bad incentives

(This is an excerpt from my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It")

Bonuses are a powerful incentive, and technology companies are using them more than ever. Most technology companies cap salaries and instead deliver most of their employees’ compensation through bonuses and stock grants.

These bonuses are often tied to key metrics. For example, imagine that if you deploy a change to the recommendation algorithms that boosts revenue by a fraction of a percent, you get the maximum bonus, a windfall of a million dollars in your pocket.

What are you going to do? You’re going to try to get that bonus. In fact, you’ll do anything you can to get that bonus.

The problem comes when the criteria for earning the bonus aren’t exactly right. It doesn’t matter that they are mostly right (increasing revenue is mostly right as a goal); what matters is whether there is any way, any way at all, to get that bonus without helping the company and its customers.

Imagine you find a way to increase revenue by biasing the recommendations toward outright scams, snake oil salesmen selling fake cures to the desperate. Just a twiddle to the algorithms and those scams show up just a bit more often, and that nudges the revenue just that much higher, at least when you tested it for a couple days.

Do you roll out this new scammy algorithm to everyone? Should everyone see more of these scams? And what happens to customers, and the company, if people see all these scams?

But that bonus. That tasty, tasty bonus. One million dollars. Surely, if you weren’t supposed to do this, they wouldn’t give you that bonus. Would they? This has to be the right thing. Isn’t it?

People working within technology companies have to make decisions like this every day. Examples abound of ways to generate more revenue that ultimately harm the company, including increasing the size and number of paid promotions, surfacing salacious or otherwise inappropriate content, allowing deceptive sales pitches, promoting lower-quality items that pay a commission, spamming people with takeover or pop-up advertising, and stoking strong emotions such as hatred.

As an article in Wired titled “15 Months of Fresh Hell Inside Facebook” described, this is a real problem. There can easily be “perverse incentives created by [the] annual bonus program, which pays people in large part based on the company hitting growth targets.”

“You can do anything, no matter how crazy the idea, as long as you move the goal metrics,” added Facebook whistleblower Frances Haugen. If you tell people their bonus depends on moving goal metrics, they will do whatever it takes to move those metrics.

This problem is why some tech companies reject using bonuses as a large part of their compensation. As Netflix founder and CEO Reed Hastings explained, “The risk is that employees will focus on a target instead of spotting what’s best for the company in the present moment.”

When talking about bonuses in our interview, a former executive who worked at technology startups gave the example of teams meeting their end-of-quarter quotas by discounting, which undermines pricing strategy and can hurt the company in the long term. He also told of an executive who forced through a deal that was bad for the company because signing it ensured he hit his quarterly licensing goal and got his bonus. When challenged by the CEO, this other executive defended his choice by saying he was not given the luxury of long-term thinking.

“We learned that bonuses are bad for business,” Netflix CEO Reed Hastings said. “The entire bonus system is based on the premise that you can reliably predict the future, and that you can set an objective in any given moment that will continue to be important down the road.”

The problem is that people will work hard to get a bonus, but it is hard to set criteria for bonuses that cannot be abused in some way. People will try many, many things seeking to find something that wins the windfall the company is dangling in front of them. Some of those innovations might be real. But others may actually cause harm, especially over long periods of time.

As Reed Hastings went on to say, what companies need to be able to do is “adapt direction quickly” and have creative freedom to do the right thing for the company, not to focus on what “will get you that big check.” It’s not just how much you pay people, it’s also how you pay them.

Similarly, the people working on changing and tuning algorithms want to advance in their careers. How people are promoted, who is promoted, and for what reason creates incentives. Those incentives ultimately change what wisdom of the crowd algorithms do.

If people are promoted for helping customers find and discover what they need and keeping customers satisfied, people inside the company have more incentive to target those goals. If people are promoted for getting people to click more regardless of what they are clicking, then those algorithms are going to get more clicks, so more people get those promotions.

In the book An Ugly Truth, the authors found Facebook “engineers were given engagement targets, and their bonuses and annual performance reviews were anchored to measurable results on how their products attracted more users or kept them on the site longer.” Performance reviews and promotions were tied with making changes that kept people engaged and clicking. “Growth came first,” they found. “It’s how people are incentivized on a day-to-day basis.”

Who gets good performance reviews and promotions determines which projects get done. If a project that reduces how often people see disinformation from adversaries is both hard and gets poor performance reviews for its team, many people will abandon it. If another project that promotes content that makes people angry gets its team promoted because they increased engagement, then others will look over and say, that looks easy, I can do that too.

In the MIT Technology Review article “How Facebook Got Addicted to Spreading Misinformation,” Karen Hao described the incentives: “With their performance reviews and salaries tied to the successful completion of projects, employees quickly learned to drop those that received pushback and continue working on those dictated from the top down.”

The optimization of these algorithms is a series of steps, each one a small choice, about what people should and shouldn’t do. Often, the consequences can be unintended, which makes it that much more important for executives to check frequently if they are targeting the right goals. As former Facebook Chief Security Officer Alex Stamos said, “Culture can become a straightjacket” and force teams down paths that eventually turn out to be harmful to customers and the company.

Executives need to be careful of the bonus and promotion incentives they create for how their algorithms are tuned and optimized. What the product does depends on what incentives teams have.

Monday, December 04, 2023

The failure of big data

For decades, the focus in machine learning has been big data.

More data beats better algorithms, suggested a hugely influential 2001 result from Banko and Brill at Microsoft Research. For years, most people found it roughly true that if you get more data, ML works better.

Those days have come to an end. Nowadays, big data often is worse, because low quality or manipulated data wrecks everything.

Behind big data was a quiet assumption that any bad data is random noise that averages out. That is wrong for most real-world data, where the bad data is skewed in one direction and does not cancel.
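A toy simulation shows why the skew matters; the numbers below are made up for illustration. Symmetric noise around a true value averages out, but adversarial data pushed in one direction drags the average wherever the adversary wants.

```python
# Toy illustration of why "bad data averages out" fails when the bad
# data is skewed. Symmetric random noise cancels; adversarial data
# pushed toward one extreme does not.
import random

random.seed(0)
true_rating = 3.0

# 10,000 honest ratings with symmetric noise around the true value.
honest = [true_rating + random.uniform(-1, 1) for _ in range(10_000)]

# 2,000 fake ratings, all pushed to the maximum.
fake = [5.0] * 2_000

print(sum(honest) / len(honest))                # ~3.0: the noise cancels
print(sum(honest + fake) / len(honest + fake))  # ~3.3: the skew does not
```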

This problem is acute with user behavior data such as clicks, likes, links, and ratings. ML built on user behavior is doing wisdom of the crowds, summarizing the opinions of many independent sources to produce useful information.

Adversaries can purposely skew user behavior data. When they do, using that data will yield terrible results in ML algorithms because the adversaries are able to make the algorithms show whatever they like. That includes the important ranking algorithms for search, trending, and recommendations that we use every day to find information on the internet.

Wisdom of the crowds assumes the crowd is full of real, unbiased, non-coordinating voices. It does not work when the crowd is not real. When you are not sure, it is better to discard much of the data, keeping only what is reliable.

Better data often beats big data if you measure by what is useful to people. ML needs data from reliable, representative, independent, and trustworthy sources to produce useful results. If you aren't sure about the reliability, throw that data out, even if you are throwing most of the data away in the end. Seek useful data, not big data.
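In code, the idea is simple even if the reliability signals themselves are hard to build. The sketch below is a hypothetical filter, not a real system: the is_reliable rule and data shapes are invented for illustration, and the point is only that unreliable behavior data gets thrown out before training or aggregation, even if that means discarding most of it.

```python
# Minimal sketch of "better data beats big data": keep only behavior
# data from sources judged reliable before feeding it to a model or an
# aggregate such as a popularity score. The checks below are
# hypothetical placeholders for whatever reliability signals a real
# system has (account age, independence, past accuracy, and so on).

def is_reliable(event, known_good_accounts):
    # Hypothetical rule: keep events only from accounts we already
    # trust and that are not flagged as coordinated.
    return event["account_id"] in known_good_accounts and not event.get("flagged")

def filter_training_data(events, known_good_accounts):
    kept = [e for e in events if is_reliable(e, known_good_accounts)]
    # It is fine to discard most of the data; what remains is the data
    # for which the wisdom-of-the-crowds assumptions actually hold.
    return kept
```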

Friday, December 01, 2023

Book excerpt: Manipulating customer reviews

(This is an excerpt from my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It")

Amazon is the place people shop online. Over 40% of all US e-commerce spending was on Amazon.com in recent years.

Amazon also is the place for retailers to list their products for sale. Roughly 25% of all US e-commerce spending in recent years came from third-party marketplace sellers using the Amazon.com website to sell their goods. Amazon is the place for merchants who want to be seen by customers.

Because the stakes are so high, sellers have a strong incentive to have positive reviews of their products. Customers not only look at the reviews before buying, but also filter what they search for based on the reviews.

“Reviews are meant to be an indicator of quality to consumers,” Zoe Schiffer wrote for The Verge, “[and] they also signal to algorithms whose products should rise to the top.”

For example, when a customer searches on Amazon for [headphones], there are tens of thousands of results. Most customers will only look at the first few of those results. The difference between being one of the top results for that search for headphones and being many clicks down the list can make or break a small manufacturer.
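A deliberately naive ranking score shows why. This is not Amazon’s actual ranker; the formula and numbers are invented for illustration. If a score mixes query relevance with average star rating and review count, then buying a few hundred five-star reviews directly buys a better position in the results.

```python
# A naive, made-up ranking score (not Amazon's real ranker) showing how
# inflated ratings and review counts can move a product up the results.
import math

def naive_rank_score(relevance, avg_stars, num_reviews):
    # relevance: how well the product matches the query, in [0, 1]
    # avg_stars: average review rating, 1..5
    # num_reviews: total review count
    return relevance * (avg_stars / 5.0) * math.log10(num_reviews + 1)

honest = naive_rank_score(relevance=0.9, avg_stars=4.2, num_reviews=120)
shilled = naive_rank_score(relevance=0.9, avg_stars=4.9, num_reviews=620)
print(honest, shilled)  # the shilled listing outranks the honest one
```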

As Wired put it in an article titled “How Amazon’s Algorithms Curated a Dystopian Bookstore”: “Amazon shapes many of our consumption habits. It influences what millions of people buy, watch, read, and listen to each day. It’s the internet’s de facto product search engine — and because of the hundreds of millions of dollars that flow through the site daily, the incentive to game that search engine is high. Making it to the first page of results for a given product can be incredibly lucrative.”

But there is a problem. “Many curation algorithms can be gamed in predictable ways, particularly when popularity is a key input. On Amazon, this often takes the form of dubious accounts coordinating.”

The coordination of accounts often takes the form of paying people to write positive reviews whether they have used the item or not. It is not hard to recruit people to write a bogus positive review. A small payment and being allowed to keep the product for free is usually enough. There are even special discussion forums where people wait to be offered the chance to post a false positive review, ready and available recruits for the scam.

BuzzFeed described the process in detail in an investigative piece, “Inside Amazon’s Fake Review Economy.” They discuss “a complicated web of subreddits, invite-only Slack channels, private Discord servers, and closed Facebook groups.” They went on to detail how “sellers typically pay between $4 to $5 per review, plus a refund of the product ... [and] reviewers get to keep the item for free.”

Why do merchants selling on Amazon do this? As Nicole Nguyen explained in that BuzzFeed article, “Being a five-star product is crucial to selling inventory at scale in Amazon’s intensely competitive marketplace — so crucial that merchants are willing to pay thousands of people to review their products positively.”

Only one product can appear at the top of an Amazon search for [headphones]. And the top result will be the one most customers see and buy. It is winner take all.

“Reviews are a buyer’s best chance to navigate this dizzyingly crowded market and a seller’s chance to stand out from the crowd ... Online customer reviews are the second most trusted source of product information, behind recommendations from family and friends ... The best way to make it on Amazon is with positive reviews, and the best way to get positive reviews is to buy them.”

Because so few customers leave reviews, and even fewer leave positive reviews, letting the natural process take its course means losing to another less scrupulous merchant who is willing to buy as many positive reviews as they need. The stakes are high, and those who refuse to manipulate the reviews usually lose.

“Sellers trying to play by the rules are struggling to stay afloat amid a sea of fraudulent reviews,” Nguyen wrote. It is “really hard to launch a product without them.”

More recently, Facebook Groups have grown in popularity, generally and as a way to recruit people to write fake reviews. UCLA researchers described in detail how it works, finding “23 [new] fake review related groups every day. These groups are large and quite active, with each having about 16,000 members on average, and 568 fake review requests posted per day per group. Within these Facebook groups, sellers can obtain a five-star review that looks organic.” They found the cost of buying a fake review to be quite cheap, “the cost of the product itself,” because “the vast majority of sellers buying fake reviews compensate the reviewer by refunding the cost of the product via a PayPal transaction after the five-star review has been posted,” with only a small number of sellers also offering money on top of the refund.

Washington Post reporters also found “fraudulent reviews [often] originate on Facebook, where sellers seek shoppers on dozens of networks, including Amazon Review Club and Amazon Reviewers Group, to give glowing feedback in exchange for money or other compensation.”

You might think that manipulating reviews, and using those fake reviews to get featured in search and recommendations, would carry some cost for sellers if they were caught. However, Brad Stone in Amazon Unbound found that “sellers [that] adopted deceitful tactics, like paying for reviews on the Amazon website” faced almost no penalties. “If they got caught and their accounts were shut down, they simply opened new ones.”

Manipulating reviews, search rankings, and recommendations hurts Amazon customers. Reviews have long been viewed as a useful and trusted way to figure out what to buy on Amazon, and fake reviews threaten to undermine that trust.

“It’s easy to manipulate ratings or recommendation engines, to create networks of sockpuppets with the goal of subtly shaping opinions, preying on proximity bias and confirmation bias,” wrote Stanford Internet Observatory’s Renee DiResta. Sockpuppets are fake accounts pretending to be real people. When bad actors create many sockpuppets, they can use those fake accounts to feign popularity and dominate conversations. “Intentional, deliberate, and brazen market manipulation, carried out by bad actors gaming the system for profit ... can have a profound negative impact.”

The bad guys manipulate ranking algorithms through a combination of fake reviews and coordinated activity between accounts. A group of people, all working together to manipulate the reviews, can change what algorithms like the search ranker or the recommendation engine think are popular. Wisdom of the crowd algorithms, including reviews, require all the votes to be independent, and coordinated shilling breaks that assumption.
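One simple way to look for that kind of coordination, sketched below with made-up data and thresholds rather than any platform’s real policy, is to flag pairs of accounts whose reviewed products overlap far more than independent shoppers’ would.

```python
# Minimal sketch of spotting coordinated shilling: reviews are supposed
# to be independent, so pairs of accounts that keep reviewing the same
# products together are suspicious. Thresholds and data are illustrative.
from itertools import combinations

def suspicious_pairs(reviews_by_account, min_shared=5):
    """reviews_by_account: dict mapping account_id -> set of reviewed product_ids.
    Returns account pairs whose reviewed products overlap heavily."""
    flagged = []
    # Pairwise comparison is O(n^2); fine for a sketch, not for production.
    for a, b in combinations(reviews_by_account, 2):
        shared = reviews_by_account[a] & reviews_by_account[b]
        if len(shared) >= min_shared:
            flagged.append((a, b, len(shared)))
    return flagged

reviews = {
    "acct1": {"p1", "p2", "p3", "p4", "p5", "p6"},
    "acct2": {"p1", "p2", "p3", "p4", "p5", "p9"},  # overlaps heavily with acct1
    "acct3": {"p7", "p8"},
}
print(suspicious_pairs(reviews))  # [('acct1', 'acct2', 5)]
```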

Nowadays, Amazon seems to be saturated with fake reviews. The Washington Post found that “for some popular product categories, such as Bluetooth headphones and speakers, the vast majority of reviews appear to violate Amazon’s prohibition on paid reviews.”

This hurts both Amazon customers and other merchants trying to sell on Amazon. “Sellers say the flood of inauthentic reviews makes it harder for them to compete legitimately and can crush profits.” Added one retailer interviewed by the Washington Post, “These days it is very hard to sell anything on Amazon if you play fairly.”

Of course, this also means the reviews no longer indicate good products. Items with almost entirely 5-star reviews may be “inferior or downright faulty products.” Customers are “left in the dark” using “seemingly genuine reviews” but end up buying “products of shoddy quality.” As BuzzFeed warned, “These reviews can significantly undermine the trust that consumers and the vast majority of sellers and manufacturers place in Amazon, which in turn tarnishes Amazon’s brand.”

Long-term harm to customer trust could eventually lead people to shop on Amazon less. Consumer Reports, in an article titled “Hijacked Reviews on Amazon Can Trick Shoppers,” went as far as to warn against relying on the average review score at all: “Fraudulent reviews are a well-known pitfall for shoppers on Amazon ... never rely on just looking at the number of reviews and the average score ... look at not only good reviews, but also the bad reviews.”

Unfortunately, Amazon executives may have to see growth and sales problems, due to lack of customer trust in the reviews, before they are willing to put policies in place to change the incentives for sellers. For now, as Consumer Reports said, Amazon's customer reviews can no longer be trusted.