Dmitry Baranchuk

Artem Babenko

The Yandex Text-to-Image (T2I) dataset was collected to foster research in billion-scale nearest neighbor search (NNS) when the query distribution differs from the distribution of the indexed data. In particular, the dataset addresses the cross-domain setting: a user specifies a textual query and asks the search engine to retrieve the images most relevant to it. Notably, current large-scale indexing methods perform poorly in this setting, so new high-performance indexing solutions that are robust to out-of-domain queries are in high demand.

The dataset is a snapshot of the Yandex visual search engine and contains 1 billion 200-dimensional image embeddings for indexing. The image embeddings are produced by the SE-ResNeXt-101 model. The embeddings for textual queries are extracted by a variant of the DSSM model.
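To make the cross-domain setting concrete, here is a minimal sketch of exact (brute-force) search over 200-dimensional vectors, the kind of ground-truth baseline that approximate indexes are usually compared against. Several things here are assumptions rather than facts from this page: inner product as the similarity measure, 4-byte float32 components in the size estimate, and the synthetic Gaussian vectors that stand in for the real image and text embeddings. The function names are hypothetical and not part of any released tooling.

```python
import heapq
import random

DIM = 200  # embedding dimensionality stated for the dataset


def raw_size_bytes(num_vectors, dim=DIM, bytes_per_component=4):
    """Size of the uncompressed vectors.

    The 4-byte default assumes float32 storage (an assumption, not stated here).
    """
    return num_vectors * dim * bytes_per_component


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, database, k):
    """Return indices of the k database vectors with the largest inner product.

    Results are ordered from most to least similar. Every vector is scanned,
    so the cost is linear in the database size.
    """
    scored = ((inner_product(query, vec), idx) for idx, vec in enumerate(database))
    return [idx for _, idx in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins: "image" vectors and a "text" query drawn from a
    # different distribution, mimicking the query/index mismatch.
    database = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(1000)]
    query = [rng.gauss(0.5, 2.0) for _ in range(DIM)]

    print("top-10 indices:", exact_top_k(query, database, k=10))
    print("raw size of 1B vectors (GB):", raw_size_bytes(10**9) / 1e9)
```

A linear scan like this is only practical on small samples. Under the float32 assumption, the raw vectors alone occupy hundreds of gigabytes at the billion scale, which is why compact approximate indexes are needed in the first place.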

Read more about the data format and how to download the dataset in the related post.

To encourage future developments of scalable similarity search algorithms, Yandex releases two billion-scale datasets that can serve as representative benchmarks for researchers from the machine learning and algorithmic communities interested in efficient similarity search. Both datasets are released under the CC BY 4.0 license.

Benchmarks for Billion-Scale Similarity Search

The Yandex Research team regularly contributes to the computer vision research community, mostly in the fields of image retrieval and generative modelling.

Computer vision

Language is one of the key forms of communication. We study methods of language representation and understanding to simplify human-computer interactions.

Natural language processing 

Nearest neighbor search is a long-standing problem arising in a large number of machine learning applications, such as recommender services, information retrieval, and others.

Nearest neighbor search

Text-to-Image dataset for billion-scale similarity search