Crowdsourcing Natural Language Data at Scale: 
A Hands-On Tutorial

Tutorial at

Room: Online
Starts: 4pm PDT

Date: June 6

In this tutorial, we share unique industry experience in efficient natural language data annotation via crowdsourcing, presented by leading researchers and engineers from Yandex. We will introduce data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. This will be followed by a hands-on practice session in which participants address a real-world language resource production task, experiment with settings for the labeling process, and launch their own label collection project on one of the largest crowdsourcing marketplaces. The projects will run on a real crowd during the tutorial session. We will also present useful quality control techniques and give attendees an opportunity to discuss their own annotation ideas.


Alexey Drutsa

Crowdsourcing Department, Yandex

Valentina Fedorova

Crowdsourcing Department, Yandex

Dmitry Ustalov

Crowdsourcing Department, Yandex

Olga Megorskaya

Crowdsourcing Department, Yandex

Daria Baidakova

Crowdsourcing Department, Yandex

Time Zone: PDT

16:00 – 16:15 Introduction
— The concept of data labeling via crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex's experience with crowdsourcing

16:15 – 16:45 Part 1: Key Components for Efficient Data Collection 
— Decomposition for effective pipeline
— Task instruction & interface: best practices
— Quality control techniques
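As a taste of the quality control techniques covered in Part 1, the sketch below is a minimal, hypothetical example (not part of the tutorial materials): it measures each performer's accuracy on control ("golden") tasks with known answers and keeps only performers above a threshold.

```python
import pandas as pd

# Hypothetical assignments: each row is one performer's answer to a task;
# the 'golden' column holds the known answer for control tasks.
df = pd.DataFrame({
    'performer': ['p1', 'p1', 'p2', 'p2', 'p3', 'p3'],
    'task':      ['t1', 't2', 't1', 't2', 't1', 't2'],
    'output':    ['cat', 'dog', 'cat', 'cat', 'dog', 'cat'],
    'golden':    ['cat', 'dog', 'cat', 'dog', 'cat', 'dog'],
})

# Accuracy of each performer on control tasks.
df['correct'] = df['output'] == df['golden']
accuracy = df.groupby('performer')['correct'].mean()

# Keep only performers whose control accuracy is at least 0.7;
# the threshold here is an arbitrary illustration.
trusted = accuracy[accuracy >= 0.7].index
df_trusted = df[df['performer'].isin(trusted)]
```

The same accuracy scores can also drive automated rules on a crowdsourcing platform, e.g. suspending performers whose control accuracy drops below the threshold.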

16:45 – 17:45 Part 2: Practice Part I
— Dataset and required labels
— Discussion: how to collect labels?
— Data labeling pipeline for implementation
— You will:
» create,
» configure,
» and run data labeling projects with real performers in real time

17:45 – 18:30 Break

18:30 – 19:15 Part 3: Advanced Techniques
— Aggregation models
— Incremental relabeling
— Dynamic pricing
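To make the aggregation and incremental relabeling ideas in Part 3 concrete, here is a minimal hand-rolled sketch in plain pandas (Crowd-Kit ships ready-made aggregation models; this hypothetical example only illustrates the principle): majority voting picks the most frequent label per task, and the winning label's vote share serves as a simple confidence score for deciding which tasks need more labels.

```python
import pandas as pd

# Hypothetical overlapping labels: three performers answered each task.
df = pd.DataFrame({
    'task':      ['t1', 't1', 't1', 't2', 't2', 't2'],
    'performer': ['p1', 'p2', 'p3', 'p1', 'p2', 'p3'],
    'output':    ['cat', 'cat', 'dog', 'dog', 'dog', 'dog'],
})

# Majority vote: the most frequent label per task wins.
agg = (df.groupby('task')['output']
         .agg(lambda labels: labels.mode().iloc[0])
         .rename('result'))

# Vote share of the winning label: a crude confidence score.
# Incremental relabeling would request extra labels only for tasks
# whose confidence falls below some threshold.
confidence = df.groupby('task')['output'].apply(
    lambda labels: (labels == labels.mode().iloc[0]).mean())
```

In practice, more sophisticated models (e.g. Dawid–Skene-style performer-skill estimation) replace the uniform vote with skill-weighted answers, and dynamic pricing adjusts the reward per task based on difficulty and required quality.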

19:15 – 19:45 Part 4: Practice Part II
— Finishing up label collection
— Results aggregation

19:45 – 20:00 Part 5: Conclusion 
— Results of your projects
— Ideas for further work and research
— References to literature and other tutorials
— Q&A




Text Aggregation Example
Below is an example of how to aggregate crowdsourced texts using the Crowd-Kit library for Python.
!pip install pandas
!pip install -U crowd-kit sentence-transformers nltk

import json

import pandas as pd
from crowdkit.aggregation.hrrasa import TextHRRASA
from sentence_transformers import SentenceTransformer

# Load the raw assignments exported from Toloka and drop empty rows.
df_toloka = pd.read_csv('assignments.tsv', sep='\t', dtype=str)
df_toloka.dropna(how='all', inplace=True)
df_toloka['INPUT:audio'] = df_toloka['INPUT:audio'].apply(json.loads)
df_toloka['OUTPUT:result'] = df_toloka['OUTPUT:result'].apply(json.loads)

# Exclude golden (control) rows: they carry known answers and are used
# for quality control, not for aggregation.
df_gold = df_toloka[~df_toloka['GOLDEN:result'].isna()]
print(f'{len(df_gold)} golden row(s) excluded out of {len(df_toloka)}')
df_toloka.drop(df_gold.index, inplace=True)

# Keep only the columns Crowd-Kit expects and rename them accordingly.
df = df_toloka[['INPUT:audio', 'OUTPUT:result', 'ASSIGNMENT:worker_id']].copy()
df.columns = ['task', 'output', 'performer']
df.sort_values('task', inplace=True)

# Aggregate the transcriptions with HRRASA, using sentence embeddings
# to measure similarity between the performers' outputs.
encoder = SentenceTransformer('paraphrase-distilroberta-base-v1')
hrrasa = TextHRRASA(encoder.encode)
df_agg = hrrasa.fit_predict(df).reset_index(name='result')

# Sort before saving so the output file is ordered by task.
df_agg.sort_values('task', inplace=True)
df_agg.to_csv('assignments-hrrasa.tsv', sep='\t', index=False)



Materials
— Training Pool Dataset
— Main Pool Dataset
— Text Aggregation Jupyter Notebook Code
