Saturday, October 07, 2023

Book excerpt: Metrics chasing engagement

(This is an excerpt from the draft of my book. Please let me know if you like it and want more.)

Let’s say you are in charge of building a social media website like Facebook. And you want to give your teams a goal, a target, some way to measure that what they are about to launch on the website is better than what came before.

One metric you might think of is how much people engage with the website. You might think: every click, every like, every share, you can measure those. The more the better! We want people clicking, liking, and sharing as much as possible. Right?

So you tell all your teams, get people clicking! The more likes the better! Let’s go!

Teams are always looking for ways to optimize their metrics, constantly changing algorithms to do it. If you tell your teams to optimize for clicks, what you will see is that soon the recommender and ranker algorithms will change what they show. Up at the top of any recommendations and search results will be the posts and news predicted to get the most clicks.

Outside of the company, people will also notice and change what they do. They will say, this article I posted didn’t get much attention. But this one, wow, everyone clicked on it and reshared it. And people will create more of whatever does well on your site now that your teams have changed the algorithms.

All sounds great, right? What could go wrong?

The problem is what attracts the most clicks. What you are likely to click on are things that provoke strong emotions, such as hatred, disbelief, anger, or lust. This means what gets the most clicks tends to be sensationalistic, provocative, pornographic, or simply false. The truth is boring. Posts of your Aunt Mildred’s flowers might make you happy, but they won’t get a click. That post with scurrilous lies about some dastardly other, though, that likely will get engagement.

Cecilia Kang and Sheera Frenkel wrote a book about Facebook, An Ugly Truth. In it, they describe the problem with how Facebook optimized its algorithms: “Over the years, the platform’s algorithms had gotten more sophisticated at identifying the material that appealed most to individual users and were prioritizing it at the top of their feeds. The News Feed operated like a finely tuned dial, sensitive to that photograph a user lingered on longest, or the article they spent the most time reading. Once it had established that the user was more likely to view a certain type of content, it fed them as much of it as possible.”

The content the algorithms fed to people, the content the algorithms chose to put on top and amplify, was not what made people content and satisfied. It was whatever would provide a click right now. And what would provide a click right now was often enraging lies.

“Engagement was 50 percent higher than in 2018 and 10 percent higher than in 2017,” wrote Sinan Aral, author of the book The Hype Machine. “Each piece of content is scored according to our probabilities of engaging with it, across the several dozen engagement measures. Those engagement probabilities are aggregated into a single relevance score. Once the content is individually scored (Facebook’s algorithm considers about two thousand pieces of content for you every time you open your newsfeed), it’s ranked and shown in your feed in order of decreasing relevance.”
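To make the mechanism in that quote concrete, here is a minimal sketch in Python. It is not Facebook’s code; the candidate posts, the predicted probabilities, and the weights are all invented for illustration. The point is only the shape of the computation: predict engagement probabilities for each post, collapse them into one relevance score, and sort.

```python
# A minimal sketch (not Facebook's code) of scoring candidates by predicted engagement
# probabilities, aggregating them into one relevance score, and ranking by it.
# The candidate posts, probabilities, and weights below are all invented.

candidates = {
    "aunt_mildreds_flowers": {"p_like": 0.05, "p_comment": 0.01, "p_share": 0.005},
    "outrage_bait_article":  {"p_like": 0.20, "p_comment": 0.15, "p_share": 0.12},
    "local_news_story":      {"p_like": 0.08, "p_comment": 0.03, "p_share": 0.02},
}

weights = {"p_like": 1.0, "p_comment": 2.0, "p_share": 3.0}  # chosen by the platform

def relevance(predictions):
    """Collapse per-action engagement probabilities into a single relevance score."""
    return sum(weights[action] * p for action, p in predictions.items())

# Rank in order of decreasing relevance; only the top few items ever get seen.
feed = sorted(candidates, key=lambda post: relevance(candidates[post]), reverse=True)
print(feed)  # ['outrage_bait_article', 'local_news_story', 'aunt_mildreds_flowers']
```

Notice that nothing in this computation asks whether a post is true, useful, or good for the person reading it. The only question is whether it is likely to be clicked, commented on, or shared.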

Most people will not read past the top few items in search results or recommendations, so what is at the top is what matters most. In this case, because content was scored and ordered by likelihood of engagement, what got amplified was the most sensationalistic content.

Once bad actors outside of Facebook discovered the weaknesses of the metrics behind the algorithms, they exploited them. As Karen Hao described in an article titled “Troll Farms Reached 140M Americans,” these are “easily exploited engagement based ranking systems … At the heart of Feed ranking, there are models that predict the probability a user will take an engagement action. These are colloquially known as P(like), P(comment), and P(share).” That is, the models predict the probability that people will like the content, the probability that they will share it, and so forth. Hao cited an internal report from Facebook that said these “models heavily skew toward content we know to be bad.” Bad content includes hate speech, lies, and plagiarized content.

“Bad actors have learned how to easily exploit the systems,” said former Facebook data scientist Jeff Allen. “Basically, whatever score a piece of content got in the models when it was originally posted, it will likely get a similar score the second time it is posted … Bad actors can scrape … and repost … to watch it go viral all over again.” Anger, outrage, lies, and hate, all of those performed better on engagement metrics. They don’t make people satisfied. They make people more likely to leave in disgust than keep coming back. But they do make people likely to click right now.

It is by no means necessary to optimize for short-term engagement. Sarah Frier, in her book No Filter, describes how Instagram, in its early years, looked at what was happening at Facebook and made a different choice: “They decided the algorithm wouldn’t be formulated like the Facebook news feed, which had a goal of getting people to spend more time on Facebook … They knew where that road had led Facebook. Facebook had evolved into a mire of clickbait … whose presence exacerbated the problem of making regular people feel like they didn’t need to post. Instead Instagram trained the program to optimize for ‘number of posts made.’ The new Instagram algorithm would show people whatever posts would inspire them to create more posts.” Optimizing for the number of posts made could also create bad incentives, such as encouraging spamming. The important thing is to consider the incentives created by the metrics you pick and to question whether your current metrics are the best thing for the long term of your business.

YouTube is an example of a company that picked problematic metrics years ago, but then questioned what was happening, noticed the problem, and fixed its metrics in recent years. Researchers noted problems with YouTube’s recommender system amplifying terrible content many years ago; in recent years, they have mostly concluded that while YouTube still hosts hate speech and other harmful content, it no longer algorithmically amplifies it.

The problem started, as described by the authors of the book System Error, when a Vice President at YouTube “wrote an email to the YouTube executive team arguing that ‘watch time, and only watch time’ should be the objective to improve at YouTube … He equated watch time with user happiness: if a person spends hours a day watching videos on YouTube, it must reveal a preference for engaging in that activity.” The executive went on to claim, “When users spend more of their valuable time watching YouTube videos, they must perforce be happier with those videos.”

It is important to realize that YouTube is a giant optimization machine, with teams and systems targeting whatever metric it is given to maximize that metric. In the paper “Deep Neural Networks for YouTube Recommendations,” YouTube researchers describe it: “YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence … In a live experiment, we can measure subtle changes in click-through rate, watch time, and many other metrics that measure user engagement … Our goal is to predict expected watch time given training examples that are either positive (the video impression was clicked) or negative (the impression was not clicked).”
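As a rough sketch of what that training setup looks like, consider the toy example below. The data is invented and the model is an ordinary logistic regression rather than the deep network in the paper; weighting the positive (clicked) impressions by their watch time, so that the learned odds track expected watch time, is my reading of the paper’s weighted logistic regression trick, not a description of YouTube’s production system.

```python
# Toy sketch: predict expected watch time from clicked / not-clicked impressions.
# Invented data, simplified model; not YouTube's actual system.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 10_000, 16
X = rng.normal(size=(n, d))                    # features of (user, video) impressions
clicked = rng.random(n) < 0.1                  # positive example = impression was clicked
watch_time = np.where(clicked, rng.exponential(300.0, size=n), 0.0)  # seconds watched

# Weight positives by watch time and negatives by 1, so the learned odds
# approximate expected watch time rather than just click probability.
model = LogisticRegression(max_iter=1000)
model.fit(X, clicked, sample_weight=np.where(clicked, watch_time, 1.0))

# At serving time, candidates would be ranked by this score: more predicted
# watch time means a higher position in the recommendations.
p = model.predict_proba(X[:5])[:, 1]
predicted_watch_time = p / (1 - p)             # odds as a proxy for expected watch time
print(predicted_watch_time)
```

The metric being maximized is still just a watch-time prediction; nothing in it distinguishes a video someone is glad they watched from one that merely kept them glued to the screen.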

The problem is that a recommendation algorithm optimized for immediate watch time, an engagement metric, tends to surface sensationalistic, scammy, and extreme content, including hate speech. As BuzzFeed reporters wrote in an article titled “We Followed YouTube’s Recommendation Algorithm Down the Rabbit Hole”: “YouTube users who turn to the platform for news and information — more than half of all users, according to the Pew Research Center — aren’t well served by its haphazard recommendation algorithm, which seems to be driven by an id that demands engagement above all else.”

The reporters described a particularly egregious case: “How many clicks through YouTube’s Up Next recommendations does it take to go from an anodyne PBS clip about the 116th United States Congress to an anti-immigrant video from a designated hate organization? Thanks to the site’s recommendation algorithm, just nine.” But the problem was not isolated to a small number of examples. At the time, there was a “high percentage of users who say they’ve accepted suggestions from the Up Next algorithm — 81%.” The problem was the optimization engine behind the recommender algorithms: “It’s an engagement monster.”

The “algorithm decided which videos YouTube recommended that users watch next; the company said it was responsible for 70 percent of the one billion hours a day people spent on YouTube. But it had become clear that those recommendations tended to steer viewers toward videos that were hyperpartisan, divisive, misleading or downright false.” The problem was optimizing for an engagement metric like watch time.

Why does this happen? In any company, in any organization, you get what you measure. When you tell your teams to optimize for a certain metric, and that they will get bonuses and promotions if they do, they will optimize the hell out of that metric. As Bloomberg reporters wrote in an article titled “YouTube Executives Ignored Warnings,” “Product tells us that we want to increase this metric, then we go and increase it … Company managers failed to appreciate how [it] could backfire … The more outrageous the content, the more views.”

This problem was made substantially worse at YouTube by outright manipulation of its wisdom-of-the-crowd algorithms by adversaries, who effectively stuffed the ballot box for what is popular and good with votes from fake or controlled accounts. As Guardian reporters wrote, “Videos were clearly boosted by a vigorous, sustained social media campaign involving thousands of accounts controlled by political operatives, including a large number of bots … clear evidence of coordinated manipulation.”

The algorithms optimized for engagement, but they were perfectly happy to optimize for fake engagement, clicks and views from accounts that were all controlled by a small number of people. By pretending to be a large number of people, adversaries could easily make whatever they wanted appear popular, and then get it amplified by a recommender algorithm greedy for more engagement.
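Here is a minimal sketch of why that works against any ranker that treats raw engagement counts as votes. Everything in it is made up (the video names, the account counts, the naive popularity ranking); it only illustrates the ballot-stuffing mechanism, not any platform’s real pipeline.

```python
# Toy illustration of ballot stuffing: a handful of controlled accounts can make
# their content look like the most "popular" item to a ranker that just counts engagement.
from collections import Counter

# Hypothetical organic engagement events: (account, video) pairs.
organic = [(f"user{i}", "cat_video") for i in range(500)] + \
          [(f"user{i}", "news_clip") for i in range(300)]

# 50 controlled accounts each engage with the same video 20 times.
bots = [(f"bot{i % 50}", "propaganda_clip") for i in range(1000)]

def rank_by_engagement(events):
    """Rank videos by raw engagement count, the naive popularity signal."""
    return Counter(video for _, video in events).most_common()

print(rank_by_engagement(organic))           # [('cat_video', 500), ('news_clip', 300)]
print(rank_by_engagement(organic + bots))    # 'propaganda_clip' now tops the list with 1000
```

A ranker that cannot tell a thousand engagements from fifty accounts apart from a thousand engagements from a thousand people will happily amplify the former.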

In a later article, “Fiction is Outperforming Reality,” Paul Lewis at the Guardian wrote, “YouTube was six times more likely to recommend videos that aided Trump than his adversary. YouTube presumably never programmed its algorithm to benefit one candidate over another. But based on this evidence, at least, that is exactly what happened … Many of the videos appeared to have been pushed by networks of Twitter sock puppets and bots.” That is, Trump videos were not actually better to recommend, but manipulation by bad actors using a network of fake and controlled accounts caused the recommender to believe that it should recommend those videos. Ultimately, the metrics they picked, metrics that emphasized immediate engagement rather than the long term, were at fault.

“YouTube’s recommendation system has probably figured out that edgy and hateful content is engaging.” As sociologist Zeynep Tufekci described it, “This is a bit like an autopilot cafeteria in a school that has figured out children have sweet teeth, and also like fatty and salty foods. So you make a line offering such food, automatically loading the next plate as soon as the bag of chips or candy in front of the young person has been consumed.” If the target of the optimization is engagement, the algorithms will be tuned over time to automatically show the most engaging content, whether it contains useful information or is full of lies and anger.

The algorithms were “leading people down hateful rabbit holes full of misinformation and lies at scale.” Why? “Because it works to increase the time people spend on the site” watching videos.

Later, YouTube stopped optimizing for watch time, but only years after seeing how much harmful content its algorithms recommended. At the time, chasing engagement metrics changed both what people watched on YouTube and what videos got produced for YouTube. As one YouTube creator said, “We learned to fuel it and do whatever it took to please the algorithm.” Whatever metric the algorithm was optimizing for, creators did whatever it took to please it. Pick the wrong metrics and the wrong things will happen, for customers and for the business.

(This was an excerpt from the draft of my book. Please let me know if you like it and want more.)
