Practice of Efficient Data Collection via Crowdsourcing at Large-Scale

Tutorial at KDD 2019 on Thursday, 8th August

In this tutorial, we share part of our unique industrial experience in efficient data labelling via crowdsourcing, presented by leading researchers and engineers from Yandex. Most ML projects require training data, and often this data can only be obtained through human labelling. Moreover, as more applications of AI appear, more nontrivial tasks for collecting human-labelled data arise. Producing such data at large scale requires building a technological pipeline, which includes solving issues related to quality control and smart distribution of tasks among workers.

We will introduce data labelling via public crowdsourcing marketplaces and present the key components of efficient label collection. This will be followed by a practical session, where participants will choose one of several real label collection tasks, experiment with settings for the labelling process, and launch their label collection project on Yandex.Toloka, one of the largest crowdsourcing marketplaces. The projects will be run on real crowds during the tutorial session. Finally, participants will receive feedback on their projects and practical advice on making them more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labelled data efficiently.


Alexey Drutsa

Crowdsourcing & Research Departments, Yandex

Valentina Fedorova

Research Department, Yandex

Olga Megorskaya

Crowdsourcing Department, Yandex

Evfrosiniya Zerminova

Crowdsourcing Department, Yandex


— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex's experience in crowdsourcing

Part I: Main components of data collection via crowdsourcing
— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques

Part II: Label collection projects to be done (practical session)
— Dataset and required labels
— Discussion: how to collect labels?
— Data labelling pipeline for implementation

Part III: Introduction to Yandex.Toloka for requesters
— Main types of instances
— Project: creation & configuration
— Pool: creation & configuration
— Tasks: uploading & golden set creation
— Statistics in flight and download of results

Part IV: Setting up and running label collection projects (practical session)
— You create, configure, and run data labelling projects on real performers in real time

Part V: Theory on efficient aggregation, incremental relabelling, and pricing
— Aggregation models
— Incremental relabelling to save money
— Performance-based pricing

Part VI: Discussion of results from the projects and conclusions
— Results of your projects
— Extensions to work on after the tutorial
— References to literature and other tutorials









