Learning Sensitive Combinations of A/B Test Metrics

Online search evaluation, and A/B testing in particular, is an irreplaceable tool for modern search engines. Typically, online experiments last for several days or weeks and require a considerable portion of the search traffic. This restricts their usefulness and applicability. To alleviate the need for large sample sizes in A/B experiments, several approaches have been proposed. Primarily, these approaches are based on increasing the sensitivity (informally, the ability to detect changes with fewer observations) of the evaluation metrics. Such sensitivity improvements are achieved by applying variance reduction methods, e.g. stratification and control covariates. However, the ability to learn sensitive metric combinations that (a) agree with the ground-truth metric, and (b) are more sensitive, was not explored in the A/B testing scenario. In this work, we aim to close this gap. We formulate the problem of finding a sensitive metric combination as a data-driven machine learning problem and propose two intuitive optimization approaches to address it. Next, we perform an extensive experimental study of our proposed approaches. In our experiments, we use a dataset of 118 A/B tests performed by Yandex and study eight state-of-the-art ground-truth user engagement metrics, including Sessions per User and Absence Time. Our results suggest that considerable sensitivity improvements over the ground-truth metrics can be achieved by using our proposed approaches.
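The abstract mentions control covariates as a standard variance reduction method. As background only (this is not the paper's proposed approach, and all names here are illustrative), a minimal sketch of the widely used covariate-adjustment idea: subtract from the metric its linear projection onto a pre-experiment covariate, which preserves the metric's mean while shrinking its variance when the covariate is correlated with the metric.

```python
import numpy as np

def covariate_adjust(y, x):
    """Variance-reduced metric via a control covariate.

    y: per-user metric values observed during the experiment
    x: pre-experiment covariate for the same users (e.g. pre-period
       sessions), assumed correlated with y and unaffected by treatment
    """
    # Optimal linear coefficient: theta = cov(x, y) / var(x)
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    # Subtracting theta * (x - mean(x)) leaves the mean of y unchanged
    # but removes the variance explained by x.
    return y - theta * (x - x.mean())

# Synthetic illustration (not data from the paper):
rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, size=10_000)           # pre-experiment covariate
y = 0.8 * x + rng.normal(0.0, 1.0, size=10_000)  # metric correlated with x
y_adj = covariate_adjust(y, x)
print(np.var(y_adj, ddof=1) < np.var(y, ddof=1))  # True: variance shrinks
```

The same mean with lower variance directly translates into a smaller sample size needed to detect a given treatment effect, which is the sensitivity notion the abstract refers to.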
Published in
ACM Conference on Web Search and Data Mining
20 Feb 2017