Yandex Research Datasets

Heterophilous graph datasets
Graph machine learning
Oleg Platonov
Denis Kuznedelev
Michael Diskin
Artem Babenko
Liudmila Prokhorenkova
A graph dataset is called heterophilous if nodes prefer to connect to other nodes that are not similar to them. For example, in financial transaction networks, fraudsters often perform transactions with non-fraudulent users, and in dating networks, most connections are between people of opposite genders. Learning under heterophily is an important subfield of graph ML, so diverse and reliable benchmarks are essential.
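
To make the definition concrete, here is a minimal sketch of edge homophily, one common way to quantify how often edges join similar nodes. The choice of this measure, the function name `edge_homophily`, and the toy transaction graph are all assumptions for illustration; the source does not say which measure the benchmark uses.

```python
def edge_homophily(edges, labels):
    """Return the fraction of edges whose two endpoints share a label.

    Values near 1 indicate a homophilous graph; values near 0 indicate
    a heterophilous one.
    """
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy transaction graph: nodes 0 and 1 are fraudsters,
# nodes 2-4 are regular users. Fraudsters mostly transact with regular users.
labels = {0: "fraud", 1: "fraud", 2: "regular", 3: "regular", 4: "regular"}
edges = [(0, 2), (0, 3), (1, 3), (1, 4), (2, 3)]

ratio = edge_homophily(edges, labels)
print(f"edge homophily: {ratio:.2f}")
```

Only one of the five edges in this toy graph connects nodes with the same label, so the ratio is low, which is the pattern the fraud example describes.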

We propose a benchmark of five diverse heterophilous graphs that come from different domains and exhibit a variety of structural properties. Our benchmark includes a word dependency graph (Roman-empire), a product co-purchasing network (Amazon-ratings), a synthetic graph emulating the Minesweeper game (Minesweeper), a crowdsourcing platform worker network (Tolokers), and a question-answering website interaction network (Questions).
Shifts Dataset
Distributional shift, Uncertainty estimation, Tabular data, Machine translation, Natural language processing
Andrey Malinin
Neil Band
Yarin Gal
Mark J. F. Gales
Alexander Ganshin
German Chesnokov
Alexey Noskov
Andrey Ploskonosov
Liudmila Prokhorenkova
Ivan Provilkov
Vatsal Raina
Vyas Raina
Denis Roginskiy
Mariya Shmatova
Panos Tigas
Boris Yangel
The Shifts Dataset contains curated and labeled examples of real, 'in-the-wild' distributional shifts across three large-scale tasks: tabular weather prediction, machine translation, and vehicle motion prediction. This is the data used in the Shifts Challenge 2021. Dataset shift is ubiquitous in all of these tasks and modalities.
Text-to-Image dataset for billion-scale similarity search
Nearest neighbor search, Natural language processing, Computer vision
Dmitry Baranchuk
Artem Babenko
The Yandex Text-to-Image (T2I) dataset was collected to foster research in billion-scale nearest neighbor search (NNS) when the query distribution differs from the indexing one. In particular, this dataset addresses the cross-domain setting: a user specifies a textual query and requests the search engine to retrieve the most relevant images to the query. Notably, current large-scale indexing methods perform poorly in this setting. Therefore, novel high-performance indexing solutions that are robust to out-of-domain queries are in high demand.

The dataset represents a snapshot of the Yandex visual search engine and contains 1 billion 200-dimensional image embeddings for indexing. The image embeddings are produced by the SE-ResNeXt-101 model. The embeddings for textual queries are extracted by a variant of the DSSM model.
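
As an illustration of the cross-domain search task, the sketch below runs an exact (brute-force) top-k search of a text query embedding against a small set of image embeddings. This kind of exhaustive scan is typically used only to compute ground truth for evaluating approximate indexes. Inner product as the similarity measure, the function names, and the 3-dimensional toy vectors (standing in for the 200-dimensional ones) are assumptions for illustration, not details confirmed by the text above.

```python
import heapq


def inner_product(a, b):
    """Similarity between two embeddings (assumed measure for this sketch)."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, database, k):
    """Return the indices of the k database vectors most similar to the query.

    Scans every vector, so the cost grows linearly with the database size.
    """
    scored = ((inner_product(query, vec), idx) for idx, vec in enumerate(database))
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Hypothetical toy data: four "image" embeddings and one "text" query embedding.
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
text_query = [0.9, 0.5, 0.0]

top2 = exact_top_k(text_query, image_embeddings, k=2)
print(top2)
```

A linear scan like this does not scale to the full dataset: if the vectors were stored as 32-bit floats (an assumption), 1 billion 200-dimensional embeddings would occupy about 800 GB before any index structure is built, which is why approximate indexing methods are needed.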

Read more about the data format and how to download the dataset in the related post.