In this tutorial, leading researchers and engineers from Yandex share part of their industry experience in efficient data labeling via crowdsourcing.
We will introduce data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. A practice session follows, in which participants will choose one of several real label collection tasks, experiment with settings for the labeling process, and launch their label collection project on one of the largest crowdsourcing marketplaces. The projects will run on real crowds during the tutorial session. While the crowd performers annotate the projects set up by the attendees, we will present the major theoretical results in efficient aggregation, incremental relabeling, and dynamic pricing. We will also discuss their strengths, weaknesses, and applicability to real-world tasks, summarizing our five years of research and industrial expertise in crowdsourcing. Finally, participants will receive feedback on their projects and practical advice on how to make them more efficient.
We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labeled data and do it efficiently.
08:00 – 08:15 Introduction
— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex's experience in crowdsourcing
08:15 – 08:35 Part 1: Main components of data collection via crowdsourcing
— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques
08:35 – 08:45 Part 2: Yandex.Toloka requester interface
— Main types of instances
— Project: creation & configuration
— Pool: creation & configuration
— Tasks: uploading & golden set creation
— In-flight statistics and downloading results
08:45 – 09:00 Part 3: Brainstorming pipeline
— Dataset and required labels
— Discussion: how to collect labels?
— Data labeling pipeline for implementation
09:00 – 10:00 Part 4: Practice I and II (starting label collection)
— You
» create
» configure
» run data labeling projects with real performers in real time
10:00 – 10:30 Part 5: Theory on efficient aggregation
— Aggregation models
— Incremental relabeling
— Dynamic pricing
10:30 – 11:00 Break
11:00 – 11:20 Part 6: Practice III (finishing label collection)
11:20 – 11:30 Part 7: Conclusion
— Results of your projects
— Ideas for further work and research
— References to literature and other tutorials
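
As a small illustration of the aggregation topic in Part 5, the sketch below shows plain majority voting, a common baseline for combining overlapping crowd labels. It is not taken from the tutorial materials and is not necessarily one of the aggregation models covered there. The function name `majority_vote`, the `(task_id, performer_id, label)` input format, and the alphabetical tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Aggregate crowd labels by simple majority vote.

    `labels` is an iterable of (task_id, performer_id, label) tuples.
    This input format is an assumption made for this sketch.
    Returns a dict mapping each task_id to its most frequent label.
    """
    votes = defaultdict(Counter)
    for task_id, _performer_id, label in labels:
        votes[task_id][label] += 1

    aggregated = {}
    for task_id, counter in votes.items():
        top_count = max(counter.values())
        # Tie-breaking rule (assumed): pick the smallest label in sort order
        # so the result is deterministic.
        aggregated[task_id] = min(
            label for label, count in counter.items() if count == top_count
        )
    return aggregated


if __name__ == "__main__":
    # Hypothetical toy data: three performers label task t1, two label task t2.
    toy_labels = [
        ("t1", "performer_a", "cat"),
        ("t1", "performer_b", "cat"),
        ("t1", "performer_c", "dog"),
        ("t2", "performer_a", "dog"),
        ("t2", "performer_b", "cat"),
    ]
    print(majority_vote(toy_labels))
```

Majority voting treats every performer as equally reliable. Weighting votes by each performer's accuracy on a golden set (see Part 2) is one possible refinement.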