Practice of Efficient Data Collection via Crowdsourcing: Aggregation, Incremental Relabelling, and Pricing

Full-day Tutorial at WSDM 2020

Room: Sage
Starts at 9:00 on Monday, 3rd February

In this tutorial, we present a portion of the unique industry experience in efficient data labelling via crowdsourcing, shared by leading researchers and engineers from Yandex. The majority of ML projects require training data, and often this data can only be obtained by human labelling. Moreover, as more applications of AI appear, more nontrivial tasks for collecting human-labelled data arise. Producing such data at large scale requires building a technological pipeline, which includes solving issues related to quality control and smart distribution of tasks between performers.

We will introduce data labelling via public crowdsourcing marketplaces and present the key components of efficient label collection. This will be followed by a practice session, where participants will choose one of the real label collection tasks, experiment with settings for the labelling process, and launch their label collection project on Yandex.Toloka, one of the largest crowdsourcing marketplaces. The projects will be run on real crowds within the tutorial session. Finally, participants will receive feedback on their projects and practical advice on making them more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labelled data efficiently.

Speakers

Alexey Drutsa

Crowdsourcing Department, Yandex

Valentina Fedorova

Research Department, Yandex

Dmitry Ustalov

Crowdsourcing Department, Yandex

Olga Megorskaya

Crowdsourcing Department, Yandex

Evfrosiniya Zerminova

Crowdsourcing Department, Yandex

Daria Baidakova

Crowdsourcing Department, Yandex

Program


09:00 - 09:20 Introduction
— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex's experience in crowdsourcing



09:20 - 10:00 Part I: Main components of data collection via crowdsourcing
— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques


10:00 - 10:30 Coffee Break


10:30 - 10:55 Part II: Label collection projects to be done (practice session)
— Dataset and required labels
— Discussion: how to collect labels?
— Data labelling pipeline for implementation


10:55 - 11:05 Part III: Introduction to Yandex.Toloka for requesters
— Main types of instances
— Project: creation & configuration
— Pool: creation & configuration
— Tasks: uploading & golden set creation
— Statistics in flight and results downloading


11:05 - 12:30 Part IV: Setting up and running label collection projects (practice session)
— You
› create
› configure
› run data labelling projects on real performers in real time


12:30 - 14:00 Lunch Break


14:00 - 14:35 Part V: Interface & quality control
— Detailed examination of quality control techniques
— Comprehensive overview of best practices for creating a functional interface


14:35 - 15:00 Part VI: Theory on efficient aggregation
— Aggregation models


15:00 - 15:30 Coffee Break


15:30 - 16:30 Part VII: Setting up and running label collection projects cont. (practice session)
— You
› create
› configure
› run data labelling projects on real performers in real time


16:30 - 16:50 Part VIII: Theory on incremental relabelling and pricing
— Incremental relabelling to save money
— Performance-based pricing


16:50 - 17:00 Part IX: Discussion of results from the projects and conclusions
— Results of your projects
— Ideas for further work and research
— References to literature and other tutorials

Slides

Introduction



 

Part 1
"Main components of data collection via crowdsourcing"
 

Part 2
"Label collection projects to be done"

 

Part 3
"Introduction to Yandex.Toloka for requesters"
 

Part 4
"Setting up and running label collection projects"
 

Part 5
"Effective quality control and task interface: details"
 

Part 6
"Theory on aggregation"

 

Part 7
"Setting up and running label collection projects cont."
 

Part 8
"Theory on incremental relabelling and pricing"

 

Part 9
"Discussion of the projects’ results" & Conclusions
 

Instructions

Step-by-step instruction

 
