Online search evaluation, and A/B testing in particular, is an irreplaceable tool for modern search engines. Typically, online experiments last for several days or weeks and require a considerable portion of the search traffic. This restricts their usefulness and applicability. To alleviate the need for large sample sizes in A/B experiments, several approaches have been proposed. Primarily, these
approaches are based on increasing the sensitivity (informally, the ability to detect changes with fewer observations) of the evaluation metrics. Such sensitivity improvements are achieved by applying variance reduction methods, e.g., stratification and control covariates. However, the ability to learn sensitive metric combinations that (a) agree with the ground-truth metric, and (b) are more sensitive, has not been explored in the A/B testing scenario.
In this work, we aim to close this gap. We formulate the problem of finding a sensitive metric combination as a data-driven machine learning problem and propose two intuitive optimization approaches to address it. Next, we perform an extensive experimental study of our proposed approaches. In our experiments, we use a dataset of 118 A/B tests performed by Yandex and study eight state-of-the-art ground-truth user engagement metrics, including Sessions per User
and Absence Time. Our results suggest that considerable sensitivity improvements over the ground-truth metrics can be achieved by using our proposed approaches.
ACM Conference on Web Search and Data Mining
20 Feb 2017
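The core idea of the abstract — learning a combination of metrics that is more sensitive than the ground-truth metric alone, by exploiting correlated metrics as variance-reducing covariates — can be illustrated with a minimal sketch. Note this is an assumed toy formulation, not the paper's actual method or data: it searches for a linear combination weight vector that maximizes the average squared t-statistic across a set of synthetic historical A/B tests, where one metric carries the treatment effect and a second, correlated metric acts as a control covariate.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_stat(w, control, treatment):
    """Two-sample t-statistic of the combined per-user metric m·w."""
    c, t = control @ w, treatment @ w
    se = np.sqrt(c.var(ddof=1) / len(c) + t.var(ddof=1) / len(t))
    return (t.mean() - c.mean()) / se

# Synthetic history of hypothetical A/B tests with two per-user metrics:
# metric 0 carries a small treatment effect plus shared user-level noise;
# metric 1 is that same noise (a covariate that can cancel variance).
tests = []
for _ in range(20):
    effect, n = 0.05, 500
    noise_c, noise_t = rng.normal(size=n), rng.normal(size=n)
    control = np.column_stack(
        [noise_c + rng.normal(scale=0.3, size=n), noise_c])
    treatment = np.column_stack(
        [noise_t + effect + rng.normal(scale=0.3, size=n), noise_t])
    tests.append((control, treatment))

def avg_sq_t(w):
    """Average squared t-statistic over the historical tests."""
    return np.mean([t_stat(w, c, t) ** 2 for c, t in tests])

# With two metrics, the weight direction is a single angle, so a coarse
# grid search over the half-circle suffices (t^2 is sign-invariant).
angles = np.linspace(0.0, np.pi, 181)
best = max(angles, key=lambda a: avg_sq_t(np.array([np.cos(a), np.sin(a)])))
w_best = np.array([np.cos(best), np.sin(best)])

baseline = avg_sq_t(np.array([1.0, 0.0]))   # ground-truth metric alone
learned = avg_sq_t(w_best)                  # learned combination
print("ground-truth metric avg t^2:", baseline)
print("learned combination avg t^2:", learned)
```

The learned combination effectively subtracts the shared noise component, so the same treatment effect yields a larger average t-statistic — i.e., higher sensitivity at the same sample size, which is the property the paper optimizes for.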