Benchmarks for Billion-Scale Similarity Search



To encourage further development of scalable similarity search algorithms, Yandex releases two billion-scale datasets that can serve as representative benchmarks for researchers from the machine learning and algorithmic communities interested in efficient similarity search. Both datasets are released under the CC BY 4.0 license.

Deep1B

The Deep1B dataset (Babenko et al., CVPR'2016) consists of 10^9 image embeddings produced as the outputs of the last fully-connected layer of the GoogLeNet model, which was pretrained on the ImageNet classification task. The embeddings are then compressed by PCA to 96 dimensions and l2-normalized. The similarity measure for Deep1B is the Euclidean distance.
The entire database of 10^9 image embeddings can be downloaded here. We also provide a smaller subset of 10M embeddings for debugging, an additional learning set of 350M embeddings, and a query set containing 10K embeddings. For all queries, the top-100 groundtruth nearest neighbors w.r.t. the database set are precomputed, and their indices can be downloaded here.
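The post-processing described above (PCA to 96 dimensions followed by l2-normalization) can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the function name, the SVD-based PCA, and the small epsilon guarding against zero-length vectors are all assumptions made for illustration.

```python
import numpy as np

def pca_then_l2_normalize(features, out_dim=96):
    """Project features onto their top principal directions, then scale rows to unit length.

    Illustrative only: `features` is a hypothetical (n, d) array of raw network outputs.
    """
    features = np.asarray(features, dtype=np.float64)
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:out_dim].T
    # l2-normalize each row; the epsilon avoids division by zero.
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.maximum(norms, 1e-12)
```

At real scale the PCA basis would presumably be fitted on a sample rather than on all vectors at once; the sketch ignores that concern.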
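For sanity checks on a small slice of the data, exact nearest neighbors under the Euclidean distance can be computed by brute force. The sketch below is an assumption about how one might do this with NumPy, not the procedure used to produce the released groundtruth; `l2_topk` is a hypothetical helper, and it holds the full query-by-database distance matrix in memory, so it only suits small inputs.

```python
import numpy as np

def l2_topk(database, queries, k=100):
    """Exact top-k neighbor indices under Euclidean distance (brute force).

    Requires k <= number of database vectors.
    """
    database = np.asarray(database)
    queries = np.asarray(queries)
    # Squared distances via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2.
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)
    x_norms = (database ** 2).sum(axis=1)
    dists = q_norms - 2.0 * queries @ database.T + x_norms
    # Select the k smallest distances per query, then sort only those k.
    idx = np.argpartition(dists, k - 1, axis=1)[:, :k]
    order = np.argsort(np.take_along_axis(dists, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)
```

Ranking by squared distance gives the same order as ranking by distance, so the square root is skipped.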

Text-to-Image-1B

The Text-to-Image-1B dataset contains data from both textual and visual modalities, a setup typical of cross-modal retrieval tasks, where database and query vectors can have different distributions in the shared representation space. In Text-to-Image-1B, the database consists of image embeddings produced by the SE-ResNeXt-101 model, and the queries are textual embeddings produced by a variant of the DSSM model. The mapping to the shared representation space is learned by minimizing a variant of the triplet loss using clickthrough data. The similarity measure for this dataset is the inner product. The dimensionality of both image and textual embedding vectors is 200.
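Because the similarity measure here is the inner product, the best matches are the database vectors with the largest scores rather than the smallest distances. A minimal brute-force sketch, assuming NumPy arrays small enough to fit in memory (`inner_product_topk` is a hypothetical helper, not part of any released code):

```python
import numpy as np

def inner_product_topk(database, queries, k=100):
    """Exact top-k database indices by inner product, highest score first.

    Requires k <= number of database vectors.
    """
    scores = np.asarray(queries) @ np.asarray(database).T
    # Negate so that argpartition/argsort pick the largest scores.
    idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)
```

Unlike the Euclidean case, a vector is not guaranteed to be its own best match under the inner product, since vectors with larger norms can score higher.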

The entire database of 10^9 image embeddings can be downloaded here. We also provide two smaller subsets of 1M and 10M image embeddings for preliminary experiments and debugging. 100K textual query embeddings can be downloaded here, and the indices of the top-100 groundtruth w.r.t. the database set are provided here.
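The precomputed groundtruth indices make it possible to score an approximate search method. The page does not prescribe a metric; the sketch below assumes the commonly used k-recall@k (the share of the true top-k neighbors that appear among the first k returned results, averaged over queries), and `recall_at_k` is a hypothetical helper name.

```python
import numpy as np

def recall_at_k(retrieved, groundtruth, k=100):
    """Average fraction of true top-k neighbors found among the first k retrieved indices."""
    retrieved = np.asarray(retrieved)[:, :k]
    groundtruth = np.asarray(groundtruth)[:, :k]
    # Count the overlap per query, ignoring order within the top k.
    hits = sum(
        len(set(found.tolist()) & set(truth.tolist()))
        for found, truth in zip(retrieved, groundtruth)
    )
    return hits / groundtruth.size
```

Both arguments are expected as (number of queries, at least k) arrays of database indices.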

Data format

All embedding data is stored as contiguous arrays of float32 values, and the groundtruth is stored as a contiguous array of int32 values. Sample Python code for reading the data is available.
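Independently of the official sample, a contiguous array can be read with NumPy roughly as follows. This is a sketch under stated assumptions: the caller supplies the row width (96 for Deep1B, 200 for Text-to-Image-1B, 100 for the top-100 groundtruth), and `offset_bytes` is a hypothetical parameter for skipping a file header, since the description above does not say whether the files carry one.

```python
import numpy as np

def read_flat_array(path, dim, dtype=np.float32, offset_bytes=0, max_rows=None):
    """Read a contiguous binary array and view it as rows of length `dim`.

    offset_bytes: bytes to skip at the start of the file (assumption: 0, no header).
    max_rows: optional cap on the number of rows, useful for very large files.
    """
    count = -1 if max_rows is None else max_rows * dim
    data = np.fromfile(path, dtype=dtype, count=count, offset=offset_bytes)
    if data.size % dim != 0:
        raise ValueError(f"{data.size} values do not divide evenly into rows of {dim}")
    return data.reshape(-1, dim)
```

For the groundtruth files, pass `dtype=np.int32`. A full billion-vector file will not fit in the memory of most machines (10^9 vectors of 96 float32 values take about 384 GB), so reading a bounded number of rows with `max_rows` is the safer default.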

Contacts

Feel free to contact us at ask-research@yandex-team.ru with any questions.

Tue May 04 2021 22:17:03 GMT+0300 (Moscow Standard Time)