Crowdsourcing Practice for Efficient Data Labeling: Aggregation, Incremental Relabeling, and Pricing

Tutorial at SIGMOD/PODS 2020

Room: Online
Starts: 08:00 on Sunday, 14th June

In this tutorial, leading researchers and engineers from Yandex share part of their unique industry experience in efficient data labeling via crowdsourcing.

We will introduce data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. This will be followed by a practice session, in which participants will choose one of several real label collection tasks, experiment with settings for the labeling process, and launch their label collection project on one of the largest crowdsourcing marketplaces. The projects will run on real crowds during the tutorial session. While the crowd performers annotate the projects set up by the attendees, we will present the major theoretical results in efficient aggregation, incremental relabeling, and dynamic pricing. We will also discuss the strengths, weaknesses, and applicability of these methods to real-world tasks, summarizing our five years of research and industrial expertise in crowdsourcing. Finally, participants will receive feedback on their projects and practical advice on how to make them more efficient.

We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labeled data efficiently.

Speakers

Alexey Drutsa

Crowdsourcing Department, Yandex

Valentina Fedorova

Crowdsourcing Department, Yandex

Dmitry Ustalov

Crowdsourcing Department, Yandex

Olga Megorskaya

Crowdsourcing Department, Yandex

Evfrosiniya Zerminova

Crowdsourcing Department, Yandex

Daria Baidakova

Crowdsourcing Department, Yandex

Program

08:00 – 08:15 Introduction
— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex's experience in crowdsourcing

08:15 – 08:35 Part 1: Main components of data collection via crowdsourcing
— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques

08:35 – 08:45 Part 2: Yandex.Toloka requester interface
— Main types of instances
— Project: creation & configuration
— Pool: creation & configuration
— Tasks: uploading & golden set creation
— Statistics in flight and results downloading

08:45 – 09:00 Part 3: Brainstorming pipeline
— Dataset and required labels
— Discussion: how to collect labels?
— Data labeling pipeline for implementation

09:00 – 10:00 Part 4: Practice I and II (starting to collect labels)
— You
» create
» configure
» run data labeling projects with real performers in real time

10:00 – 10:30 Part 5: Theory on efficient aggregation
— Aggregation models
— Incremental relabeling
— Dynamic pricing

10:30 – 11:00 Break

11:00 – 11:20 Part 6: Practice III (finishing label collection)

11:20 – 11:30 Part 7: Conclusion
— Results of your projects
— Ideas for further work and research
— References to literature and other tutorials

Slides

Introduction

Part 1
«Main components of data collection via crowdsourcing»

Part 2
«Introduction to Yandex.Toloka for requesters»

Part 3
«Brainstorming the pipeline»

Part 4
«Setting up and running label collection projects»

Part 5
«Theory on efficient aggregation»

Part 6
«Setting up and running label collection projects (cont.)»

Part 7
«Discussion of the projects’ results and conclusion»

Instruction

Step-by-step instruction
