Yandex.Toloka Open Datasets

Toloka is a major source of human-labeled data for machine learning tasks. Thousands of Toloka performers make millions of evaluations across hundreds of tasks every day. Machine learning research and experiments require large volumes of high-quality data, which is why we have started publishing open datasets for academic research in various subject areas.

Please note: All materials are intended for non-commercial use. You must indicate that the data was obtained using Yandex.Toloka. If you plan to use the datasets for commercial purposes, obtain consent from Yandex by contacting toloka@support.yandex.com.

Toloka Persona Chat Rus

This dataset of 10,000 dialogues is intended to help dialogue systems researchers develop approaches for training chatbots. Prepared in collaboration with MIPT’s Neural Networks and Deep Learning Lab, the dataset contains profiles describing each participant's personality, along with dialogues between the research participants. A chatbot trained on the dataset should be able to communicate on behalf of a given persona and get to know people by chatting with them on general topics.
File name       Description
filename.tsv    Profiles
filename.tsv    Dialogues

ZIP-archive, 2 files, 6.2 Mb

Toloka Aggregation Relevance 2

Researchers can use this dataset to explore different methods of quality control in crowdsourcing. The dataset contains around 0.5 million anonymized crowdsourced labels that were collected in the "Relevance 2 Gradations" project in 2016. It includes the labels from individual performers and golden labels that help to measure the quality of their answers. In other words, it records how each performer evaluated a particular document and, in some cases, whether that answer was correct. By studying this dataset, you can find out how the opinions of individual performers affect the quality of the final assessment, which aggregation model is most effective, and how many opinions you need in order to get an accurate answer.

The main quality metric is the accuracy of aggregated labels, estimated as the percentage of aggregated labels that match the golden labels on the golden set.
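As an illustration of this metric, here is a minimal Python sketch that aggregates labels by simple majority vote and scores the result against golden labels. Majority vote is only one possible aggregation model, chosen here for brevity. The function names, the (performer, task, label) tuple layout, and the toy data are assumptions made for the example and do not describe the actual column layout of the files.

```python
from collections import Counter, defaultdict


def majority_vote(crowd_labels):
    """Aggregate (performer, task, label) triples into one label per task."""
    votes = defaultdict(Counter)
    for _performer, task, label in crowd_labels:
        votes[task][label] += 1
    # Ties are broken by taking the smallest label, so the result is deterministic.
    return {
        task: min(counts, key=lambda lab: (-counts[lab], lab))
        for task, counts in votes.items()
    }


def accuracy(aggregated, golden):
    """Share of golden-set tasks whose aggregated label equals the golden label."""
    scored = [task for task in golden if task in aggregated]
    if not scored:
        return 0.0
    hits = sum(aggregated[task] == golden[task] for task in scored)
    return hits / len(scored)


# Toy data (hypothetical): three performers, binary relevance labels.
crowd = [
    ("w1", "t1", 1), ("w2", "t1", 1), ("w3", "t1", 0),
    ("w1", "t2", 0), ("w2", "t2", 1), ("w3", "t2", 0),
    ("w1", "t3", 1), ("w2", "t3", 0),
]
golden = {"t1": 1, "t2": 1}  # only some tasks have golden labels

aggregated = majority_vote(crowd)
score = accuracy(aggregated, golden)
print(aggregated, score)
```

Only tasks that appear in the golden set contribute to the score, so task t3 in the toy data is aggregated but not evaluated.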

File name            Description
crowd_labels.tsv     Tasks & performers' responses
golden_labels.tsv    Golden labels

ZIP-archive, 2 files, 2.8 Mb

Toloka Aggregation Relevance 5

This dataset is similar to the previous one, but the labels were collected in the "Relevance 5 Gradations" project, where performers assessed the relevance of a document to a query on a five-point scale rather than making a binary choice. Some tasks in this dataset have more than one golden label. In these cases, all the golden labels are considered equally correct.

The main quality metric is the accuracy of aggregated labels, estimated as the percentage of aggregated labels that match one of the golden labels for the given task in the golden set. In addition to the crowdsourced labels, the dataset includes information about performers who were banned. For each banned performer, the reason for the ban is given as one of four ban types (details about each ban type are not provided). The dataset contains more than 1 million labels.
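The multi-golden variant of the metric can be sketched in Python as follows. An aggregated label counts as correct if it matches any of the task's golden labels. The helper names and data structures are hypothetical, and whether to discard labels from banned performers before aggregating is an analysis choice shown here only as an option, not something the dataset prescribes.

```python
def accuracy_multi_golden(aggregated, golden):
    """aggregated: task -> label; golden: task -> set of acceptable labels."""
    scored = [task for task in golden if task in aggregated]
    if not scored:
        return 0.0
    # A task is a hit if its aggregated label is among the golden labels.
    hits = sum(aggregated[task] in golden[task] for task in scored)
    return hits / len(scored)


def drop_banned(crowd_labels, banned_performers):
    """Optionally filter out (performer, task, label) triples from banned performers."""
    return [row for row in crowd_labels if row[0] not in banned_performers]


# Toy data (hypothetical): labels on a five-point scale.
aggregated = {"q1": 4, "q2": 2, "q3": 5}
golden = {"q1": {4, 5}, "q2": {3}, "q3": {5}}
score = accuracy_multi_golden(aggregated, golden)

crowd = [("w1", "q1", 4), ("w2", "q1", 1)]
kept = drop_banned(crowd, banned_performers={"w2"})
print(score, kept)
```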

File name            Description
crowd_labels.tsv     Tasks & performers' responses
golden_labels.tsv    Golden labels
bans.tsv             Banned performers

ZIP-archive, 3 files, 6.4 Mb

Lexical Relations from the Wisdom of the Crowd (LRWC)

This dataset contains the opinions of native Russian speakers about the relationship between a generic term (hypernym) and a specific instance of it (hyponym). It was assembled by Dmitry Ustalov in 2017. A set of the 300 most frequent nouns was extracted from the Russian National Corpus. Then each method or resource (including RuThes and RuWordNet) produced at most five hypernyms for each noun, where possible. This resulted in 10,600 unique non-empty subsumption pairs, each annotated by seven different performers who were native Russian speakers and at least 20 years old as of February 1, 2017. As a result, 4,576 of the 10,600 pairs were annotated as positive, while the remaining 6,024 were annotated as negative. Interestingly, the performers were more confident in the negative answers than in the positive ones.
File name                               Description
lrwc-1.1-assignments.tsv                Input data
toloka-isa-50-skip-300-train-hit.tsv    Training tasks
lrwc-1.1-aggregated.tsv                 Aggregated responses

ZIP-archive, 3 files, 2.1 Mb

Human-Annotated Sense-Disambiguated Word Contexts for Russian

This dataset contains human-annotated sense identifiers for 2,562 contexts of 20 words used in the RUSSE'2018 shared task on Word Sense Induction and Disambiguation for the Russian language. It was assembled by Dmitry Ustalov in 2017. The human annotators were trained on 80 pre-annotated contexts, and each of the 2,562 contexts was then annotated by 9 different annotators. After the annotation, every context was additionally inspected (“curated”) by the organizers of the shared task.
File name                                          Description
tasks-train.tsv                                    Training tasks
tasks-test.tsv                                     Main tasks
assignments_01-12-2017.tsv.xz                      Full results
aggregated_results_pool_1036853__2017_12_01.tsv    Aggregated results
agreement.txt                                      Annotator agreement report
report-curated.tsv.xz & tasks-eval.tsv.xz          Curated report
tasks-eval.tsv.xz                                  Supplementary file
bts-rnc-crowd.tsv                                  Final aggregated dataset

ZIP-archive, 9 files, 2.4 Mb

Toloka Business ID Recognition

For this dataset, we prepared 10,000 photos of information signs outside of businesses and a text file with the INN (Taxpayer Identification Number) and OGRN (Business Registration Number) codes shown on the signs. This data can be used for training a computer vision model to recognize number sequences in images. The dataset was provided by Yandex Business Directory.

How we collected the data

First, we launched a task in the Yandex.Toloka mobile app that asked performers to go to a specific address marked on the map, find the organization, and take a photo of its information sign. We use field tasks like this to keep the Yandex Business Directory up to date.

Then the quality of completed tasks was checked by other performers. The photos containing the INN and OGRN codes were sent for recognition. Toloka performers typed out the numbers from the photos, and then we processed the results and formed a dataset.

File name       Description
photos.zip      Photos of business signs
inn-ogrn.tsv    INN / OGRN

Full data: ZIP-archive, 9.5 Gb
Sample: ZIP-archive, 191 Mb

Toloka Aggregation Features

The dataset contains about 60K crowdsourced labels for 1K tasks and ground truth labels for almost all the tasks. The task was to classify websites into 5 categories by the presence of adult content on them. Additionally, each task has 52 real-valued features that can be used to predict its category.

The main quality metric is accuracy of aggregated labels, which is estimated as the percentage of the aggregated labels that match the golden labels for the golden set.

File name            Description
crowd_labels.tsv     Crowdsourced labels
golden_labels.tsv    Ground truth labels
features.tsv         Features of tasks

ZIP-archive, 3 files, 0.45 Mb
Tue Nov 05 2019 23:19:53 GMT+0300 (Moscow Standard Time)