Online Evaluation for Effective Web Service Development

Full-day tutorial at the Web Conference 2018

on Tuesday, 24 April

The development of most leading web services and software products today is guided by data-driven decisions. These decisions rest on evaluation, which sustains a steady stream of updates in terms of both quality and quantity. Large internet companies use online evaluation on a day-to-day basis and at a large scale, and the number of smaller companies using A/B testing in their development cycle is also growing. Web development across the board depends strongly on the quality of experimentation platforms. In this tutorial, we give an overview of the state-of-the-art methods underlying everyday evaluation pipelines at some of the leading internet companies.

We invite software engineers, designers, analysts, and service or product managers (beginners, advanced specialists, and researchers alike) to join us at The Web Conference 2018, which will take place in Lyon from 23 to 27 April, to learn how to make web service development data-driven and how to do it effectively.

The extended abstract and the full list of references are available in the following overview article. If you wish to refer to the tutorial in your publication, please cite this paper.


Introduction
— Problem statement: evaluation of ongoing updates of a web service
— Online vs offline evaluation
— Main approaches for online evaluation: A/B testing, interleaving, observational studies

Part 1: Statistical foundation
— Statistics for online experiments: 101 (statistical hypothesis testing, causal relationship)

Part 2: Experimentation pipeline and workflow in the light of industrial practice
— Conducting an A/B experiment: the Yandex way (what should be analyzed before starting an experiment, experiment review, decision making based on results)
— Cases, pitfalls, lessons learned

Part 3: Development of online metrics
— Main components of an online metric
— Main metric properties (sensitivity and directionality)
— Evaluation criteria beyond difference of averages (periodicity, trends, quantiles, etc.)
— Product-driven ideas for metrics (loyalty and interaction metrics, dwell-time-based metric patching, session metrics and session division)
— Effective criteria for ratio metrics
— Reducing noise in metric measurements

Part 4: Interleaving for online ranking evaluation
— Classic interleaving methods (including their comparison to other evaluation methods)
— Optimized interleaving
— Multi-leaving

Part 5: Machine learning driven A/B testing
— Randomized experiment vs observational study
— Variance reduction based on subtraction of prediction
— Heterogeneous treatment effect
— Learning sensitive metric combinations
— Future-prediction-based metrics
— Smart scheduling of online experiments
— Stopping experiments early: sequential testing


Roman Budylin

Experimentation Pipeline, Yandex

Alexey Drutsa

Research Department, Yandex

Gleb Gusev

Research Department, Yandex

Eugene Kharitonov


Pavel Serdyukov

Research Department, Yandex

Igor Yashkov

Experimentation Pipeline, Yandex


Introduction & Part 1
"Statistical foundation"


Part 2
"Experimentation pipeline and workflow in the light of industrial practice"

Part 3
"Development of online metrics"


Part 4
"Interleaving for online ranking evaluation"


Part 5
"Machine learning driven A/B testing"

