Thursday, January 04, 2024

Book excerpt: Data and metrics determine what algorithms do

(This is an excerpt from drafts of my book, "Algorithms and Misinformation: Why Wisdom of the Crowds Failed the Internet and How to Fix It")

Wisdom of the crowd algorithms, including rankers and recommenders, work from data about what people like and do. Teams inside tech companies gather user behavior data, then tune and optimize algorithms to maximize measurable targets.

Data quality and team incentives control what the algorithms produce and how useful it is. When the behavior data or goal metrics are bad, the outcome will be bad. When the wisdom of the crowds data is trustworthy and the algorithms are optimized for the long term, algorithms like recommenders will be useful and helpful.

Queensland University Professor Rachel Thomas warned that “unthinking pursuit of metric optimization can lead to real-world harms, including recommendation systems promoting radicalization .... The harms caused when metrics are overemphasized include manipulation, gaming, a focus on short-term outcomes to the detriment of longer-term values ... particularly when done in an environment designed to exploit people’s impulses and weaknesses.”

The problem is that “metrics tend to overemphasize short-term concerns.” Thomas gave as an example the problems that YouTube had before 2017 because, years earlier, it had picked “watch time” (how long people spend watching videos) as a proxy metric for user satisfaction. An algorithm that tries to pick videos people will watch right now will tend to show anything to get a click, including risqué videos or lies that get people angry. So YouTube struggled with its algorithms amplifying sensationalistic videos and scams. These clickbait videos looked great on short-term metrics like watch time but repelled users in the long term.
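
To make the gap between a short-term proxy and a long-term goal concrete, here is a toy sketch in Python. It is not how YouTube or any real recommender works. The video titles, the numbers, and the `return_probability` field are all made up for illustration, and the "long-term value" formula is just one simple assumption about how to weigh watch time against whether the viewer comes back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    title: str
    expected_watch_minutes: float  # short-term proxy metric
    return_probability: float      # hypothetical chance the viewer comes back next week


def rank_by_watch_time(videos):
    """Rank purely on the short-term proxy."""
    return sorted(videos, key=lambda v: v.expected_watch_minutes, reverse=True)


def rank_by_long_term_value(videos):
    """Rank on watch time weighted by the (assumed) chance the viewer returns."""
    return sorted(
        videos,
        key=lambda v: v.expected_watch_minutes * v.return_probability,
        reverse=True,
    )


# Entirely invented numbers, chosen only to show the two rankings disagreeing.
CATALOG = [
    Video("Outrage bait", 12.0, 0.30),
    Video("Useful tutorial", 9.0, 0.80),
    Video("Scam giveaway", 10.0, 0.10),
]

short_term_top = rank_by_watch_time(CATALOG)[0].title
long_term_top = rank_by_long_term_value(CATALOG)[0].title
print(short_term_top, "|", long_term_top)
```

With these invented numbers, the watch-time ranker puts the outrage video first while the long-term ranker puts the tutorial first. The algorithm is the same sort in both cases. Only the target changed.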

“AI is very effective at optimizing metrics,” Thomas said. Unfortunately, if you pick the wrong metrics, AI will happily optimize for the wrong thing. “The unreasonable effectiveness of metric optimization in current AI approaches is a fundamental challenge to the field and yields an inherent contradiction: solely optimizing metrics leads to far from optimal outcomes.”

Unfortunately, it’s impossible to get a perfect success metric for algorithms. Not only are metrics “just a proxy for what you really care about,” but also all “metrics can, and will be gamed.” The goal has to be to make the success metrics as good as possible and keep fixing the metrics as they drift away from the real goal of the long-term success of the company. Only by constantly fixing the metrics will teams optimize the algorithms to help the company grow and profit over the years.

A classic article by Steven Kerr, “On the folly of rewarding A while hoping for B,” was originally published back in 1975. The author wrote: “Many managers seek to establish simple, quantifiable standards against which to measure and reward performance. Such efforts may be successful in highly predictable areas within an organization, but are likely to cause goal displacement when applied anywhere else.”

Machine learning algorithms need a target. Teams need to have success metrics for algorithms so they know how to make them better. But it is important to recognize that metrics are likely to be wrong and to keep trying to make them better.

You get what you measure. When managers pick a metric, there are almost always rewards and incentives tied to that metric. Over time, as people optimize for the metric, you will get that metric maximized, often at the expense of everything else, and often harming the true goals of the organization.

Kerr went on to say, “Explore what types of behavior are currently being rewarded. Chances are excellent that ... managers will be surprised by what they find -- that firms are not rewarding what they assume they are.” When Kerr's article was republished in 1995, an editor summarized this as, “It’s the reward system, stupid!”

Metrics are hard to get right, especially because they often end up being a moving target over time. The moment you put a metric in place, people both inside and outside the company will start to find ways to succeed against that metric, often finding cheats and tricks that move the metric without helping customers or the company. It's as Goodhart’s Law says: “When a measure becomes the target, it ceases to be an effective measure.”

One example familiar to all of us is the rapid growth of clickbait headlines -- “You won’t believe what happens next” -- that provide no value but try to get people to click. This happened because the headline writers were rewarded for getting a click, whether or not they did it through deception. When what the organization optimizes is getting a click, teams will drive clicks.

Often companies pick poor success metrics such as clicks just because it is too hard to measure the things that matter most. Long-term metrics that try to be good proxies for what we really care about such as retention, long-term growth, long-term revenue, and customer satisfaction can be costly to measure. And, because of Goodhart’s Law, the metrics will not work forever and will need to be changed over time. Considerable effort is necessary.

Many leaders don’t realize the consequences of not putting in that effort. You will get what you measure. Unless you reward teams for the long-term growth and profitability of the company, teams will not optimize for the success of the company or shareholders.

What can companies do? Professor Thomas went on to say that companies should “use a slate of metrics to get a fuller picture and reduce gaming” which can “keep metrics in their place.” The intent is that gaming of one metric may be visible in another, so a slate with many metrics may show problems that otherwise might be missed. Another idea is changing metrics frequently, which also can reduce gaming and provides an opportunity to adjust metrics so they are closer to the true target.
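
As a rough illustration of how a slate of metrics can expose gaming, here is a small Python sketch. It assumes every metric is "higher is better" and that the earlier values are nonzero. The function name `flag_possible_gaming`, the metric names, the numbers, and the 2% tolerance are all hypothetical choices for this example, not something from Thomas's paper.

```python
def flag_possible_gaming(before, after, target, guardrails, tolerance=0.02):
    """Return the guardrail metrics that dropped by more than `tolerance`
    (as a relative change) during a period when the target metric rose.

    Assumes higher is better for every metric and `before` values are nonzero.
    """
    if after[target] <= before[target]:
        return []  # the target did not improve, so there is nothing to explain
    flagged = []
    for name in guardrails:
        relative_change = (after[name] - before[name]) / before[name]
        if relative_change < -tolerance:
            flagged.append(name)
    return flagged


# Invented numbers: clicks jump while retention quietly falls.
before = {"click_through_rate": 0.040, "seven_day_retention": 0.50, "satisfaction_score": 4.20}
after = {"click_through_rate": 0.055, "seven_day_retention": 0.44, "satisfaction_score": 4.18}

warnings = flag_possible_gaming(
    before,
    after,
    target="click_through_rate",
    guardrails=["seven_day_retention", "satisfaction_score"],
)
print(warnings)
```

In this made-up case, the click-through rate looks like a win on its own, but the slate flags the drop in retention. A check like this does not prove gaming. It only tells a team where to look.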

Getting this wrong causes a lot of harm to the company and sometimes to others as well. “A modern AI case study can be drawn from recommendation systems,” Thomas writes. “Platforms are rife with attempts to game their algorithms, to show up higher in search results or recommended content, through fake clicks, fake reviews, fake followers, and more.”

“It is much easier to measure short-term quantities [such as] click-through rates,” Thomas said. But “many long-term trends have a complex mix of factors and are tougher to quantify.” There is a substantial risk if teams, executives, and companies get their metrics wrong. “Facebook has been the subject of years’ worth of ... scandals ... which is now having a longer-term negative impact on Facebook’s ability to recruit new engineers” and grow among younger users.

As Googler and AI expert François Chollet once said, “Over a short time scale, the problem of surfacing great content is an algorithmic problem (or a curation problem). But over a long time scale, it's an incentive engineering problem.”

It is the optimization of the algorithms, not the algorithms themselves, that determines what they show. It is incentives, rewards, and metrics that determine what wisdom of the crowd algorithms do. That is why metrics and incentives are so important.

Get the metrics wrong, and the long-term costs for the company — stalled growth, poor retention, poor reputation, regulatory risk — become worse and worse. Because the algorithms are optimized over time, it is important to be constantly fixing the data and metrics to make sure they are trustworthy and doing the right thing. Trustworthy data and long-term metrics lead to algorithms that minimize scams and maximize long-term growth and profits.
