Sensitivity Improvement of a Linear Combination of Metrics in A/B Tests

In early 2017, at WSDM'2017 (the 10th ACM International Conference on Web Search and Data Mining), Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov presented the paper "Learning Sensitive Combinations of A/B Test Metrics". Here we briefly explain its essence.

The studied subject. Online controlled experiments, also known as A/B tests, are widely used by modern search engines, Yandex in particular, to assess the quality of changes to their algorithms, user interfaces, and products in general. An A/B test compares two variants of a service at a time, usually its current version A (control) and a new one B (treatment), by exposing them to two groups of users. The goal of the experiment is to detect the causal (treatment) effect of the service update on its performance in terms of a user behavior metric that is assumed to correlate with the quality of the service. We continually develop new metrics that surpass the existing ones. This is challenging because an appropriate metric should satisfy two crucial properties: directionality and sensitivity.

On the one hand, the value of a metric must have a clear interpretation and, more importantly, a clear directional interpretation: the sign of the detected treatment effect should align with the positive or negative impact of the treatment on user experience. A metric with acceptable directionality allows analysts to be confident in their conclusions about the change in the system's quality, in particular about the sign and magnitude of that change. Many user behavior metrics, even popular ones, may result in contradictory interpretations, and their use in practice may be misleading.

On the other hand, the metric must be sensitive: it has to detect the difference between versions A and B at a high level of statistical significance in order to distinguish an existing treatment effect from the noise observed when no effect exists. A more sensitive metric allows analysts to make decisions in a larger number of cases, for example when a subtle change of the service is being tested or when only a small amount of traffic is affected by the change. Improving sensitivity is also important for optimizing the resources used by the experimentation platform, since a less sensitive metric requires more user traffic to reach the same level of statistical significance.
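
To make the notion of sensitivity concrete, here is a minimal sketch that scores an A/B test by the z-score of the difference in means between treatment and control. The function name `z_score` and the choice of a Welch-style statistic are our assumptions for illustration; the paper may define sensitivity differently.

```python
import math
import numpy as np

def z_score(control, treatment):
    """Welch-style z-score of the difference in means (B minus A).

    A larger absolute value means the metric separates the two
    versions from noise more confidently, i.e. it is more sensitive.
    """
    a = np.asarray(control, dtype=float)
    b = np.asarray(treatment, dtype=float)
    delta = b.mean() - a.mean()                      # estimated treatment effect
    # Standard error of the difference, using unbiased sample variances.
    se = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return delta / se

# Synthetic illustration (made-up numbers): the same effect of +0.1,
# measured by a low-noise metric and by a high-noise metric.
rng = np.random.default_rng(0)
quiet = z_score(rng.normal(10.0, 2.0, 5000), rng.normal(10.1, 2.0, 5000))
noisy = z_score(rng.normal(10.0, 6.0, 5000), rng.normal(10.1, 6.0, 5000))
print("low-noise metric z-score: ", quiet)
print("high-noise metric z-score:", noisy)
```

With the effect held fixed, the score shrinks as the per-user variance grows, and grows with the square root of the number of users. This is why a less sensitive metric needs more traffic.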

The main research question, which had not previously been explored in the A/B testing scenario: how can we learn metric combinations that (a) agree with a ground-truth metric and (b) are more sensitive than it?

Main contributions. In the paper presented at the conference WSDM'2017, the authors

  1. formulate the problem of finding a sensitive metric combination as a data-driven machine learning problem;
  2. propose two intuitive approaches to address the problem:
    • a geometric approach which has a tight connection to Fisher's LDA;
    • an optimization approach for learning sensitive combinations of metrics;
  3. conduct an extensive evaluation study, assessing the performance of the proposed approaches across eight ground-truth user engagement metrics (including Sessions per User and Absence Time) on a large-scale dataset of real-life A/B experiments at Yandex. The results suggest that considerable sensitivity improvements over the ground-truth metrics can be achieved by using the proposed approaches.
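
To illustrate the connection to Fisher's LDA mentioned above, the following sketch computes the textbook LDA direction for a single experiment: weights w proportional to S^{-1}(mean_B - mean_A), where S is the pooled within-group covariance of the per-user metric vectors. This is not the authors' algorithm, which learns from a dataset of many experiments and must also agree with a ground-truth metric. The function name `lda_weights` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def lda_weights(x_a, x_b):
    """Fisher's LDA direction for two groups of per-user metric vectors.

    x_a, x_b: arrays of shape (n_users, n_metrics) for control and treatment.
    Returns unit-norm weights w; the combined metric for a user is x @ w.
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    delta = x_b.mean(axis=0) - x_a.mean(axis=0)      # per-metric treatment effect
    n_a, n_b = x_a.shape[0], x_b.shape[0]
    cov_a = np.atleast_2d(np.cov(x_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(x_b, rowvar=False))
    # Pooled within-group covariance of the metrics.
    pooled = ((n_a - 1) * cov_a + (n_b - 1) * cov_b) / (n_a + n_b - 2)
    w = np.linalg.solve(pooled, delta)               # w is proportional to pooled^{-1} delta
    return w / np.linalg.norm(w)

# Synthetic example (made-up data): two correlated metrics,
# with a treatment effect on the first metric only.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
x_a = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
x_b = rng.multivariate_normal([0.05, 0.0], cov, size=2000)
w = lda_weights(x_a, x_b)
print("weights:", w)
```

The combined per-user value `x @ w` can then be tested for significance like any single metric. When metrics are correlated, the second metric can receive a nonzero weight even though it carries no effect, because subtracting it cancels shared noise. Weights fitted and evaluated on the same experiment overfit, so in practice they would need to be learned on held-out experiments.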

Learn more about the study in the full text of the article: "Learning Sensitive Combinations of A/B Test Metrics".