Datasets

Toloka Aggregation Features
The dataset contains about 60K crowdsourced labels for 1K tasks, along with ground truth labels for almost all of the tasks. The task was to classify websites into 5 categories according to the presence of adult content. Additionally, each task has 52 real-valued features that can be used to predict its category.
June 9, 2019
Yandex.Market, Learning to filter for sorting by price
The dataset contains 30K queries submitted to Yandex.Market for which the user chose to sort results by price in ascending order. For each query, it includes up to 500 of the cheapest documents with relevance scores.
June 7, 2019
Toloka Business ID Recognition
The dataset contains 10,000 photos of information signs outside of businesses and a text file with the INN (Taxpayer Identification Number) and OGRN (Business Registration Number) codes shown on the signs.
May 23, 2019
Toloka Persona Chat Rus
This dataset of 10,000 dialogues is intended to help dialogue system researchers develop approaches for training chatbots.
April 5, 2019
Toloka Aggregation Relevance 2
The dataset contains around 0.5 million anonymized crowdsourced labels that were collected in the "Relevance 2 Gradations" project in 2016.
April 5, 2019