Monday, September 14, 2009

Experiments and performance at Google and Microsoft

Despite frequently appearing together at conferences, it is fairly rare to see public debate on technology and technique between people from Google and Microsoft. A recent talk on A/B testing at Seattle Tech Startups is a fun exception.

In the barely viewable video of the talk, the action starts at the Q&A around 1:28:00. The presenters of the two talks, Googler Sandra Cheng and Microsoft's Ronny Kohavi, aggressively debate the importance of performance when running weblabs, with others chiming in as well. Oddly, it appears to be Microsoft, not Google, arguing for faster performance.

Making this even more amusing is that both Sandra and Ronny cut their experimenting teeth at Amazon.com. Sandra Cheng is now the product manager in charge of Google Website Optimizer. Ronny Kohavi now runs the experimentation team at Microsoft. Amazon is Experimentation U, it seems.

By the way, if you have not seen it, the paper "Online Experimentation at Microsoft" (PDF) that was presented at a workshop at KDD 2009 has great tales of experimentation woe at the Redmond giant. Section 7 on "Cultural Challenges" is particularly worth a read.

22 comments:

Unknown said...

Great paper. Thanks for pointing out the link.

jeremy said...

Remember back in 2006, your blogpost about Marissa Mayer claiming that A/B testing led to the conclusion that users don't want to see more than 10 web page results per search?

Well, despite that data-driven experimentation, the new Google Fast Flip interface shows, by default, 30 web page results per search. (see: http://fastflip.googlelabs.com/search?q=health+care)

I don't get it. If A/B testing already showed the former to be true, why are they now not implementing that knowledge -- and instead doing something different?

It can't be because users changed, can it? The web changes quickly; users don't. People are people.

Or is this common in A/B testing.. draw a conclusion from some experiment, and then revisit that experiment every few years, just in case you were wrong the first time?

I don't get it.

Greg Linden said...

Not sure it's the same test, Jeremy: search results in Fast Flip versus search results in web search.

In particular, I suspect a major difference is that web search generates context-sensitive snippets for each result, making it take longer to generate 30 results than 10. Fast Flip just shows titles and standard-sized images (which render after the entire page has displayed), so there may be little or no time delay to show 30 rather than 10.

I don't know for sure, but I would assume Google A/B tested how many results to show in Fast Flip and concluded that 30 is best.

jeremy said...

In particular, I suspect a major difference is that web search generates context-sensitive snippets for each result, making it take longer to generate 30 results than 10.

So you're saying that if web search could produce results just as quickly as in Fast Flip, people would also want 30 results per page?

Greg Linden said...

If I remember it correctly, the story Marissa Mayer tells is that they decided to run the A/B test because they had evidence suggesting that people wanted more than 10 search results for web search. When they ran the test live, the problem wasn't that people didn't like more results on the page, nor that there was some paradox-of-choice problem going on. The problem turned out to be that putting more results on the page slowed down the page.

jeremy said...

Ok, so again, if their A/B test for result set size only told them about speed, rather than about number of results per page, why didn't they do another test in which they actually kept the speed constant, but varied the number of results?

One possible way to do this would be to hardcode the first 10 results, as normal, and then AJAX in the remaining 20 results. By the time the user glances at the first few regular results, the remaining 20 will have filled in, with no perceptual difference to the user.
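To be concrete, here is a rough sketch of what I'm picturing, with an invented endpoint, response shape, and element id (nothing I know Google to actually expose):

    // Results 1-10 are assumed to be rendered server-side as usual. This
    // hypothetical script then fetches results 11-30 in the background and
    // appends them, so the user never perceives the extra latency.
    interface Result { title: string; url: string; }

    async function fillRemainingResults(query: string): Promise<void> {
      const resp = await fetch(
        `/search/more?q=${encodeURIComponent(query)}&start=10&count=20`);
      const extra: Result[] = await resp.json();
      const list = document.getElementById("results");
      if (!list) return;
      for (const r of extra) {
        const li = document.createElement("li");
        const link = document.createElement("a");
        link.href = r.url;
        link.textContent = r.title;
        li.appendChild(link);
        list.appendChild(li);
      }
    }

    // Kick off the background fetch once the first 10 results are on screen.
    window.addEventListener("DOMContentLoaded", () => {
      const q = new URLSearchParams(window.location.search).get("q") ?? "";
      void fillRemainingResults(q);
    });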

I guess the conclusion I draw from these stories is that it is very difficult to control for the variables that you want to control for. Not that you can't measure what you want to measure (an objection that Kohavi tried to dispel in the paper), but that you can't separate out what you want to measure from the dozens of other factors that are happening. Speed. Change blindness. User experience and familiarity with the existing system, etc.

Ronny Kohavi said...

Jeremy, in an online controlled experiment you evaluate the idea as implemented. The two are coupled.
One of the things I often emphasize as a problem with controlled experiments is that we can tell you how much better/worse your treatment did, but not why.
In Section 6.1.2 of http://exp-platform.com/hippo_long.aspx, we warned about this exact situation: your idea may be good, but the delay may cause it to lose. The "answer" for whether to launch the feature or not is, however, correct: do not launch the feature as implemented if it loses in a controlled experiment.
What you can do is to try and control for more factors if you believe the idea is fundamentally good. In this case, your proposal to use AJAX is a different implementation that may be superior and may help in testing the hypothesis that timing was the critical factor.
It is not uncommon for treatments to fail because of a poor implementation (e.g., bugs), not because the idea itself is bad. One needs to be careful not to over-generalize from: "this implementation of the idea didn't work" to "the idea is bad." You’ll see many people “explain” why an experiment lost or won, but the reasons are speculation in most cases.
The other thing I emphasize a lot (including in the Seattle Tech Startup talk) is that getting a number is easy, but getting a number you can trust is much harder. I'll state, for example, that I have little trust in Marissa's Web 2.0 claim (at least as described by Greg in http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html) that the half second made a 20% difference. Everything we know tells us that while 500msec is hurting users, its impact is much, much smaller than 20%. See for example the joint talk by Microsoft and Google at Velocity 2009: http://en.oreilly.com/velocity2009/public/schedule/detail/8523
In recent presentations, I’ve seen more realistic numbers: http://www.siliconbeat.com/2009/06/24/google-reveals-speed-as-a-secret-to-its-success/

-- Ronny Kohavi

jeremy said...

Ronny:

The "answer" for whether to launch the feature or not is, however, correct: do not launch the feature as implemented if it loses in a controlled experiment.

This is my main question: How do you know how well the experiment really is controlled, and whether you've controlled for what you think you're controlling for?

Ancillary question: Often you have to make a lot of assumptions and ad hoc interpretations of what various signals/data/information in your experiment mean. At the end of the day, how do you know that you're really testing what you want to test, or whether you're just testing your a priori interpretations of the data?

For example, this whole thing about traffic dropping. What you're really trying to measure is user satisfaction. Correct? At the end of the day, that is the objective function that you're trying to improve. So there is an (implicit? explicit?) assumption here that less traffic means lower user satisfaction. User needs are not being satisfied well enough, and therefore users are doing fewer queries.

But you can't know satisfaction, not without asking every single person. So you make an assumption, an interpretive leap, and equate traffic volume with satisfaction. The more traffic there is, the better a new A/B tested feature is, and vice versa. Correct?

Well, how do you know that is true? What if.. what if.. when you return 30 results and speed goes down by 500ms (or whatever), the following is happening:

(Explanation 1) The users spend more time scrolling down through the ranked list, and see the answer to their question at rank 17 or 24, and therefore don't have to ask a second question. It's called "good abandonment", and it might happen more often when you show 30 results than when you show 10. Therefore, traffic drops because the info need is being satisfied in the 1 query x 30 results, rather than in 3 queries x 10 results each. Fewer queries = less traffic.

(Explanation 2) The users aren't getting better answers in the top 30, but they (unconsciously?) have noticed the speed drop. And this speed drop has caused a subtle shift in their behavior. Instead of just querying before thinking, as many tend to do when the speed is high (like jackrabbit driving), users start contemplating a little bit more, before they enter the query. This leads them to ask better questions. When they ask better questions, they get better answers, and thus have to do fewer total queries, because they can get their answer in 2 queries rather than their original 3. Therefore, traffic drops.

In both cases, traffic dropped, but satisfaction actually went up!

So it seems to me that all you're doing when A/B testing is measuring your metric: traffic. You've not actually measured user satisfaction. You've had to make the assumption that traffic = satisfaction, but your A/B test is not capable of telling you whether or not that assumption is true.

Right?

Or do I need to go back and read your paper a little more carefully again, i.e. do you feel that you've already adequately addressed the concerns (and even specific examples?) that I am raising?

Aside: Apropos of nothing, I do find it interesting that you don't trust Mayer's numbers. She repeated them incessantly for almost two years in various talks. Hmm.

Anonymous said...

Jeremy asked: How do you know how well the experiment really is controlled, and whether you've controlled for what you think you're controlling for?

The beauty of a randomized design is that you don't have to worry about controlling for all factors: you're randomizing over them. If the stock market makes a difference, or the weather, the variants tested get approximately the same mix of people across these factors. (Statistical tests need to be run to make sure the result is statistically significant, to reduce the probability of the unlikely event where "approximately the same" isn't good enough.)
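As a rough sketch of the mechanics (made-up numbers and function names, not any production system's code): assign each user to a variant by hashing a stable id, then test whether the difference in a binary metric is bigger than chance would explain.

    // Deterministic assignment: a user always lands in the same variant.
    // e.g. assignVariant("user-12345") returns the same answer every time.
    function assignVariant(userId: string): "A" | "B" {
      let h = 0;
      for (const c of userId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
      return h % 2 === 0 ? "A" : "B";
    }

    // Two-proportion z-test on a binary outcome (e.g. "user came back").
    function twoProportionZ(successA: number, nA: number,
                            successB: number, nB: number): number {
      const pA = successA / nA, pB = successB / nB;
      const pooled = (successA + successB) / (nA + nB);
      const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
      return (pB - pA) / se;
    }

    // |z| > 1.96 corresponds roughly to p < 0.05 for a two-sided test.
    const z = twoProportionZ(5200, 10000, 5350, 10000);
    console.log(`z = ${z.toFixed(2)}, significant at 5%: ${Math.abs(z) > 1.96}`);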

Your assumption that the OEC for search is obviously "traffic" is incorrect. Sites should determine their OEC carefully. For example, an OEC (or better, a factor in the OEC) for search may be number of visits per user, where a visit ends when there's a gap of X minutes (say 10) between queries. In this case, if users are coming back more often, your winning treatment is probably better. Coming up with a good OEC is a really hard problem.
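For example, the visits-per-user factor above might be computed along these lines (a sketch only; the input format is invented):

    // A "visit" ends when there is a gap of more than gapMinutes between a
    // user's queries. Input: one user's query timestamps in milliseconds.
    function countVisits(queryTimesMs: number[], gapMinutes: number = 10): number {
      if (queryTimesMs.length === 0) return 0;
      const gapMs = gapMinutes * 60 * 1000;
      const sorted = [...queryTimesMs].sort((a, b) => a - b);
      let visits = 1;
      for (let i = 1; i < sorted.length; i++) {
        if (sorted[i] - sorted[i - 1] > gapMs) visits++;
      }
      return visits;
    }

    // Average visits per user for one variant of the experiment.
    function visitsPerUser(queryLogByUser: Map<string, number[]>): number {
      let total = 0;
      for (const times of queryLogByUser.values()) total += countVisits(times);
      return queryLogByUser.size === 0 ? 0 : total / queryLogByUser.size;
    }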

-- Ronny

jeremy said...

Your assumption that the OEC for search is obviously "traffic" is incorrect. Sites should determine their OEC carefully. For example, an OEC (or better, a factor in the OEC) for search may be number of visits per user, where a visit ends when there's a gap of X minutes (say 10) between queries. In this case, if users are coming back more often, your winning treatment is probably better.

I never concretely defined "traffic"; I was requoting Mayer's words, not mine. Well, at one point above I did say "fewer queries", but even that is vague. So fine, let's use your definition: Instead of saying number of queries, let's say number of "query sessions, where 10 minute gaps define boundaries".

That concrete definition doesn't change my question: How do you know you're not just testing your measurements, rather than testing user satisfaction?

Example: I don't like to waste my time in front of the computer. I like to get in, get something done, and leave. If I can't find what I need immediately using search, I get frustrated, leave for a while, then come back and try again. And again. Multiple sessions on the same search topic.

However, if I get 30 results back, rather than 10, I might find what I need more often, because I am more likely to continually scroll than I am to click to the next page. And so I won't have to get frustrated, give up, and come back half an hour later to try again.

I will have fewer query sessions, and will be happier, because I can get on with my life.

So depending on how you choose your OEC, the winner might appear to be the loser, and vice versa. I appreciate that picking the OEC is hard. But I still don't understand how you really know whether what you've picked is measuring what you think it is.

I also don't see how randomization helps. Suppose everyone engages in fewer query sessions if shown 30 rather than 10 results. So you randomize, and randomly pick 50% of the people to see the existing 10 results/page (A), and 50% of the people to see 30 results/page (B). And then, surprise, you find that the number of query sessions goes down for group B. It's statistically different than (A). But was it a success, or a failure? That all comes back to your a priori interpretation of what more vs. fewer sessions means, does it not?

I appreciate not letting the HiPPO decide what design to use. But aren't you just shifting the problem from the HiPPO deciding the design, to the HiPPO deciding the a priori interpretation of a measurement of a design? Doesn't the HiPPO still choose whether more = better or fewer = better?

I read your paper, the one that Greg cites above. It either didn't address this issue, or I was too dense to understand the point where you were addressing it (I'm not ruling out the latter explanation :-). In case it's the former, though, do you have another paper you could point me to that talks about this? Thanks.

Ronny said...

Jeremy> aren't you just shifting the problem from the HiPPO deciding the design, to the HiPPO deciding the a priori interpretation of a measurement of a design

The business can decide what it's optimizing for (the OEC), and then align behind it and optimize for it. Just like a business has to pick its strategy, you can never "prove" that it's optimal.

Devoting significant resources to a hard problem once allows you to then use the decision to guide product development and many other decisions. You're absolutely right that it may not be perfect and you can usually find edge cases where the OEC is off. However, remember that even if it is mostly (e.g., directionally, or better, approximately) correct, it provides immense value and experiments will help optimize it.

Consider the alternative: features launch without a clear understanding of what metrics they are supposed to improve. What's the alternative that you're proposing?

-- Ronny

jeremy said...

Ronny,

The business can decide what it's optimizing for (the OEC), and then align behind it and optimize for it. Just like a business has to pick its strategy, you can never "prove" that it's optimal...You're absolutely right that it may not be perfect and you can usually find edge cases where the OEC is off. However, remember that even if it is mostly (e.g., directionally, or better, approximately) correct, it provides immense value and experiments will help optimize it.

My concern here is not proving whether or not something is optimal. My concern is how you know that something is even directionally/approximately correct, as you say. Depending on whether you choose your OEC to be "more (of some user action) is better" versus "less (of that same action) is better", you'll be pointing in two completely opposite directions. And optimizing and tweaking your software into two completely different directions.

It doesn't matter how much you average or randomize, because your OEC will still point you in one or the other opposite direction, depending on what you choose it to mean, a priori.

I have no doubt that experiments will help you increase/decrease whatever OEC metric that you pick. But what is the difference to the posterior distribution of final outcomes between an intuitively-engineered system with no A/B testing (e.g. traditional MS culture) and a heavily A/B-engineered system, using intuitively-chosen OECs to guide the direction?

The most important driving force that shapes your software is still intuition (or the HiPPO). You're just inserting that intuition (HiPPO) into different points in the whole process chain..early versus late.

I'm not proposing an alternative. I'm simply asking whether the point at which the HiPPO is inserted into the development process (early on via choosing the OEC vs. later on via choosing the final design) makes a difference to the final outcome. A real difference, a statistically significant difference.

jeremy said...

And to verify whether or not there is a difference in early vs. late insertion, I would have to see some sort of distribution over OECs. For example, let's do a hypothetical comparison:

[Pre Hoc]
Suppose a company has two prominent HiPPOs, and each HiPPO has two ideas or intuitions about what to use as an OEC. So there are four OEC candidates. The two HiPPOs fight it out politically and eventually choose an OEC (let's call it OEC_X) and use OEC_X to A/B test six different interface designs. Design #4 wins.

[Post Hoc]
The same company has these same two HiPPOs. The same six interface design candidates are created. And then each of these six designs is politically (rather than experimentally) argued about and ranked: Let's say design #3 wins.

[Discussion]
Now, it's true that with the OEC that was actually chosen (OEC_X), the non-experimental intuition failed to match the experimental numbers...exactly the thing that you demonstrated at your CIKM 2008 talk and the reason given for doing A/B testing.

However, using a different OEC (OEC_Y), design #3 wins. For the exact same experiment, the exact same numbers, OEC_Y interprets the results differently. And had the HiPPO chosen OEC_Y instead of OEC_X, the experimental results would have matched the non-experimental intuitions. And if it was also the case that design #3 would have won, had the HiPPO chosen OEC_Z and OEC_W also, then I would probably have more confidence in the post hoc intuition of the HiPPO (non-experimental design selection) than I would in the OEC_X-driven experiment.

So that's what I'm trying to understand, here. I'm trying to understand exactly how stable the posterior distribution over designs #1-#6 is, given various HiPPO-selected prior distributions on OECs. If design #3 tends to win experimentally under most if not all reasonable, HiPPO-created, intuitively-chosen OECs, then I would be convinced of the efficacy of A/B testing. However, if the interface that wins experimentally varies wildly with but the slightest tweak in OEC, then your experimental data is telling you more about your OEC than it is telling you about your interface design candidates.
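The check I have in mind would look something like the following, with scores invented purely to illustrate the shape of it:

    // Invented metric values for six designs under four candidate OECs.
    // The question is whether the winning design is stable as the OEC
    // changes, or whether it flips with an equally reasonable choice.
    const designs = ["#1", "#2", "#3", "#4", "#5", "#6"];
    const scores: Record<string, number[]> = {
      OEC_W: [0.41, 0.44, 0.47, 0.45, 0.40, 0.42],
      OEC_X: [0.30, 0.33, 0.34, 0.38, 0.31, 0.32],  // here design #4 wins
      OEC_Y: [0.52, 0.55, 0.61, 0.58, 0.51, 0.54],  // here design #3 wins
      OEC_Z: [0.22, 0.25, 0.27, 0.24, 0.21, 0.23],
    };

    for (const [oec, vals] of Object.entries(scores)) {
      const best = vals.indexOf(Math.max(...vals));
      console.log(`${oec}: winning design is ${designs[best]}`);
    }
    // If the winner agrees across most reasonable OECs, the experiment is
    // telling you about the designs; if it flips, it is mostly telling you
    // about the OEC you happened to pick.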

How do you know which is which? How do you know if your experiment is really testing your designs vs. testing your OEC(s)?

Again, you and Greg know orders of magnitude more about this than I do. So if either of you could point me to a paper.. a book.. I'd appreciate it.

jeremy said...

I'm still scratching my head over this question. Greg, do you know of any papers/etc that would help me out?

Greg Linden said...

Hi, Jeremy. I think you're asking for something that's really difficult to do, an experiment with business practices to prove which is best.

This reminds me of the dream of political scientists that we might be able to replay historical events with different incentives or actors to see how it affects outcomes. It would be incredibly useful, but just can't be done.

Likewise, business books are full of difficult to verify claims about how managers should behave, only a few of which are backed by actual studies, and even those are only over historical outcomes, not anything resembling a random trial.

All that being said, I understand your point that management intuition is entering these decisions somewhere, whether it is in the definition of the metric for which we are optimizing or in management simply dictating all outcomes from the beginning.

However, some of these strategies involve more use of information than others. At least with experimental platforms, there can be a transparent debate about the data and what is best for the company. When decisions are made from the gut of the leadership, it is unclear if we are getting an outcome because of information or just because some executive had a particularly gassy day.

That may not be fair -- I don't mean to be flippant -- so let me try to respond to your request for references.

There is work in the business literature looking at historical outcomes of companies that use more bottom-up, information- and data-driven styles of management and how they tend to produce superior outcomes relative to competitors. Jeffrey Pfeffer's work (such as "Hard Facts, Dangerous Half-Truths, & Total Nonsense"), for example, is a good read and could offer plenty of references to trace if you want to pursue this.

jeremy said...

All that being said, I understand your point that management intuition is entering these decisions somewhere, whether it is in the definition of the metric for which we are optimizing or in management simply dictating all outcomes from the beginning.

Yes, exactly. That's really the core of my question.

And yes, it's true that some insertion points into the process allow for more collected information than others. But when there is still a question about what that information means, it puts a lot of uncertainty for me into the whole process.

Last November at CIKM 2008 I was talking to Scott Huffman of Google about this whole issue of how they can tell the difference between a search with no click because the search was a failure.. and a search with no click because the search was a success -- the desired information was found in the summary snippet and the user never needed to click.

Well, I was asking a bunch of folks from Yahoo! and MS too, not just Scott. He was just one of many.

But Scott then went on and submitted a paper on the topic a few months later, to SIGIR 2009:

http://research.google.com/pubs/pub35486.html

And so not only was this paper the first time I'd heard anyone from the A/B tested web world talk about the non-uniformity of a signal like abandonment, it also revealed that a lot of the statistics that you get in that A/B sort of web environment are approximations, educated guesses, and HiPPO-driven expert intuition. At the end of the day, the information that you have is still in large part filtered through the intuitions of that HiPPO.

So yes, you do have more information. But what that information means can (is?) still the subject of heated discussion.

I guess I just hear everyone say how things are so much easier, because you can now have discussions *with* the data. But for (I don't know how many) things, it seems that you still have to have discussions *about* the data, in the first place. And then you're back into the realm of opinion-driven arguments. An executive might still decide to go with one interpretation of the data over another, because he or she is having a gassy day (as you say).

I appreciate that A/B testing cannot, top-down, tell you how to choose your business goals. And Kohavi mentions that in the papers of his that I've read. I appreciate the offer, but I'm not so interested in the top-down vs. bottom-up debate.

Rather, once you've picked your top-down goal, I'm interested in figuring out how you really know that your bottom up data is telling you anything interesting.. or whether you've simply exchanged one set of eyeglasses/filters for another, and are seeing the data in a way that pre-confirms the goal that you've chosen.

jeremy said...

..because here's the thing (let me give a concrete example using, say, Google): After 11 years of Google existence, the search interface has never become any more complicated than it is now.

That is such a huge, important thing that I would assume that the level of simplicity in the interface wasn't just taken as gospel truth. I would assume that it has had the hell tested out of it, in an A/B manner. Lots of different, more complicated interfaces have been tested, right?

And yet, despite cosmetic changes (font size, pixel width, logo, subtle shades of blue), nothing has changed. For 11 years. No real complexity has been added.

Is that really the sole result of A/B testing? Did they really do hundreds of tests in which they chose a non-trivial increased complexity feature as their B version and conclude 300 of 300 times that the original version A wins?

I'm really asking, because I don't know. But that seems really quite amazing, don't you think? It seems rather unlikely, statistically. Even assuming that the "simpler is generally better" mantra is true, 300 of 300 (or however many tests have been performed) is a little too perfect of a score. Even the Mac OS interface has become more complicated over the past 11 years.
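Just as a back-of-the-envelope check on that intuition (both the 300 and the per-test win rate are numbers I'm making up): if each "add complexity" experiment independently had even a modest chance of winning, the odds of every single one losing are astronomically small.

    // Purely illustrative arithmetic; both inputs are assumptions.
    const pWin = 0.1;          // assumed chance any one complexity test wins
    const experiments = 300;   // my hypothetical count of such tests
    const allLose = Math.pow(1 - pWin, experiments);
    console.log(allLose.toExponential(2));  // ~1.9e-14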

It makes me wonder at least a little bit whether or not there isn't something else going on. Is there a top-level decision overriding, or driving the interpretation of, A/B tests on search interface complexity? Perhaps out of a HiPPO desire to adhere to a particular branding vision?

I say all this cautiously, because of course I don't know for sure. It again just seems highly improbable that every single A/B tested non-trivial complexity change to the search interface has failed. Dozens of tests a year, for 11 years. Every single one, failed?

And so I naturally have to at least question the efficacy of A/B testing. As an external observer of this 11-year-long process, I have a choice between two explanatory hypotheses about the lack of search interface change: Hypothesis 1 is that all 300 of 300 experiments really did fail to show improvement in the B condition. Hypothesis 2 is that a HiPPO somehow "drove" enough of the experiments, and was able to lead them in such a way as to pre-interpret their outcomes (by choosing the metric?) so that A always wins.

Occam's razor leads me to prefer the HiPPO hypothesis.

Am I crazy to even think this way? I know you didn't mean to be flippant, and you were not. I also do not mean to be flippant. But as an outside observer, someone who has never worked at Amazon, Google, etc. I see A/B testing from the outside, and have to at least ask these questions.

Greg Linden said...

Right, I think the reason that more complexity has not been added is because attempts to add more complexity have failed to show value in A/B tests.

I can't really see the idea that HiPPOs are somehow rigging the A/B tests. Many metrics are reported in these tests and the outcomes hotly debated. HiPPOs can still argue their point, but the transparency is there, and that sunlight on the process is what makes it difficult for a HiPPO to arbitrarily dictate an outcome.

There is another criticism of A/B tests that you do not raise above, that they are incremental and path dependent. A possible explanation for more complex interfaces not winning A/B tests is that people need to be trained on more complex interfaces before liking them. In fact, as Marti Hearst says in her new book, Search User Interfaces, people need to be trained even on our current simple search interface -- most have a tendency to enter queries in natural language at first and not to iterate on failed queries -- so there is good reason to expect significant transition costs in trying to move to what might ultimately be a better interface. A/B tests do have difficulty compensating for that issue and correctly measuring the long-term impact of a major interface change.

jeremy said...

I can't really see the idea that HiPPOs are somehow rigging the A/B tests.

It's not that the HiPPOs would be rigging any tests, injecting fake data, etc. That's not what I meant at all. Rather, I meant that it is quite possible to produce the results that you want to see (and even do so subconsciously / non-maliciously!) by framing the questions that you ask in a particular way.

Let me let someone else argue the case much better than I can. Check out this TED talk by Dan Ariely:

http://www.ted.com/talks/dan_ariely_asks_are_we_in_control_of_our_own_decisions.html

It's 17 minutes long, but if you want to cut to the chase, here are the few minutes in the video that you should watch:

2:40 - 5:00 minutes is the intro/setup.
5:00 - 6:00 minutes gives the pitch
6:00 - 9:15 minutes delivers the payload, with the main punchline at 6:55 to 8 minutes.

So you really only have to listen to about 5.5 minutes, 2:40 to 9:15.

And, while you're listening, think about the DMV organ donor form in relation to a HiPPO choosing an OEC for the interpretation of the raw data results of an A/B test.

Depending on how the HiPPO sets up the question, your raw data could be interpreted/pointed in completely different directions.

And again, I'm not even saying that the HiPPO is doing / would do this maliciously or intentionally. Only that it can/likely does happen.

jeremy said...

Many metrics are reported in these tests and the outcomes hotly debated.

Ok, this is not a point that often gets made in all the papers and presentations and discussions of A/B testing that I have heard. Rather, folks like Kohavi talk about how you pick your evaluation metric ahead of time, so when the raw numbers come rolling in, there is no debate.

If what you're saying is that the evaluation metric itself isn't fixed at all, but rather hotly debated even after the numbers have come in, then A/B testing seems much more sensible. It's like I was asking above, about the OECs W, X, Y, and Z. Instead of choosing one, ahead of time, it sounds like you look at all of them, against all your raw data. And when you do that, then you essentially get a clearer picture over your OEC posterior distribution.

So you've led me in a step toward being more of a believer :-)

Still, that's not the way it gets talked about. I don't think any of the Kohavi papers that I read mention this. They really should.

(And it would also be interesting to have a discussion about whether, even if there is hot debate about the meanings of the numbers, people are in general still not blinded/fooled by illusions in the data, the way Ariely talks about in the talk that I link to above.)

jeremy said...

A/B tests do have difficulty compensating for that issue and correctly measuring the long-term impact of a major interface change.

Interesting, yes!

I was talking to a friend of mine at one of the major search engines the other day, and I raised this exact point.

His response was that search engines know how to deal with it, that compensating for habituated user behaviors is not a problem at all, because they have statistical ways of dealing with that effect.

I didn't really agree, but I don't have the experience that you and others have, to know for sure.

I feel semi-validated that my intuition wasn't totally incorrect...

jeremy said...

There is another criticism of A/B tests that you do not raise above, that they are incremental and path dependent. A possible explanation for more complex interfaces not winning A/B tests is that people need to be trained on more complex interfaces before liking them.

Greg --

I had just one more quick thought.

I fully accept that there is nothing inherent in the A/B test itself that requires the B-condition to be only an incremental improvement over the A-condition. It's a point that I heard Kohavi make at his CIKM 2008 talk, and again in his papers. It makes sense that you can A/B test anything that you want.

My concern, however, is that the further, conceptually, your B is from your A, the less valid your test will be. And the reason may not be just because users need training on the new, complex interface in order to get up to speed, to make an equal comparison.

It could instead be that the larger the leap that your B version takes away from your A version, the more the task or goal itself changes. Not because you're trying to make the task change; only that it does. And when the task or goal changes, it is no longer possible to have a single OEC that is able to compare apples with apples.

Simple example: Precision-oriented (navigational) search vs. recall-oriented (and exploratory) search. It could very well be that a simpler interface is better for precision-oriented searches, and a more complex interface is better for recall-oriented searches. However, your OEC doesn't (isn't able to?) tease these differing goals apart. So if your OEC is oriented toward measuring "goodness" of precision-oriented searches, the more complex interface will never win -- because the more complex interface is addressing a different task.

In that case, it doesn't matter how familiar users are with the newer, complex interface. They could be experts with the new interface. But when the interface evolves by such leaps as to enable the users to solve a different task, and the OEC does not evolve with it, you will forever have this mismatch. Heck, the OEC may even be incapable of evolving with it, because A and B are forever distinct, separate tasks, unable to be unified by a single OEC.

So is that a reasonable thought? Is this an issue that A/B experts like yourself have dealt with? Or am I coming out of left field? Again, I fully accept that one can A/B test any two interfaces/algorithms that one wants to. But the larger the leap, the more the task itself may change, and the more impossible it becomes to capture the overall goodness of the system using the same OEC for both A and B. And yet a single OEC is what Kohavi says one needs in order to make the A/B test valid.