Benchmarks for Billion-Scale Similarity Search

To encourage the development of scalable similarity search algorithms, Yandex releases two billion-scale datasets that can serve as representative benchmarks for researchers from the machine learning and algorithms communities interested in efficient similarity search. Both datasets are released under the CC BY 4.0 license.

Deep1B

The Deep1B dataset (Babenko et al., CVPR'2016) consists of 10^9 image embeddings produced as the outputs of the last fully-connected layer of the GoogLeNet model, which was pretrained on the ImageNet classification task. The embeddings are compressed by PCA to 96 dimensions and L2-normalized. The similarity measure for Deep1B is the Euclidean distance.
The entire database of 10^9 image embeddings can be downloaded here. We also provide a smaller subset of 10M embeddings for debugging, an additional learning set of 350M embeddings, and a query set containing 10K embeddings. For all queries, the top-100 ground-truth nearest neighbors w.r.t. the database set are precomputed, and their indices can be downloaded here.
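
To illustrate the task, below is a minimal brute-force sketch (not the official evaluation code; all names and shapes are illustrative) of top-100 retrieval under the Euclidean metric, the way ground truth could be reproduced on a small subset:

import numpy as np

def knn_euclidean(database, queries, k=100):
    """Indices of the k nearest database vectors (Euclidean) for each query."""
    # Rank by ||d||^2 - 2<d, q>; the per-query term ||q||^2 does not affect the ordering.
    d_norms = (database ** 2).sum(axis=1)
    scores = d_norms[None, :] - 2.0 * queries @ database.T   # (n_queries, n_db)
    idx = np.argpartition(scores, k, axis=1)[:, :k]          # k smallest, unordered
    rows = np.arange(queries.shape[0])[:, None]
    return idx[rows, np.argsort(scores[rows, idx], axis=1)]  # sort the k survivors

# Toy usage with Deep1B-like dimensionality (96); Deep1B vectors are L2-normalized.
db = np.random.randn(10_000, 96).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
queries = np.random.randn(5, 96).astype(np.float32)
neighbors = knn_euclidean(db, queries)                       # shape (5, 100)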

Text-to-Image-1B

The Text-to-Image-1B dataset contains data from two modalities, textual and visual, as is common in cross-modal retrieval tasks, where the database and query vectors can have different distributions in a shared representation space. In Text-to-Image-1B, the database consists of image embeddings produced by the SE-ResNeXt-101 model, and the queries are textual embeddings produced by a variant of the DSSM model. The mapping into the shared representation space is learned by minimizing a variant of the triplet loss on clickthrough data. The similarity measure for this dataset is the inner product, and both the image and textual embedding vectors have dimensionality 200.

The entire database of 10^9 image embeddings can be downloaded here. We also provide two smaller subsets of 1M and 10M image embeddings for preliminary experiments and debugging. 100K textual query embeddings can be downloaded here, and the indices of the top-100 ground-truth neighbors w.r.t. the database set are provided here. An additional 50M textual query embeddings are provided as training data for learnable indexing structures.
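
Since the metric here is the inner product, a brute-force baseline and a Recall@100 check against the provided ground truth might look like the following sketch (illustrative names only, not the official evaluation code):

import numpy as np

def top_k_inner_product(database, queries, k=100):
    """Indices of the k database vectors with the largest inner product per query."""
    scores = queries @ database.T                            # (n_queries, n_db)
    idx = np.argpartition(-scores, k, axis=1)[:, :k]         # k largest, unordered
    rows = np.arange(queries.shape[0])[:, None]
    return idx[rows, np.argsort(-scores[rows, idx], axis=1)]

def recall_at_k(retrieved, groundtruth, k=100):
    """Average fraction of the true top-k neighbors that were retrieved."""
    hits = sum(np.intersect1d(r[:k], g[:k]).size for r, g in zip(retrieved, groundtruth))
    return hits / (len(groundtruth) * k)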

Data format

All embedding data is stored in the .fbin format:

[num_vectors (uint32), vector_dim (uint32), vector_array (num_vectors × vector_dim float32 values)]

The ground truth is stored in the .ibin format:

[num_vectors (uint32), vector_dim (uint32), vector_array (num_vectors × vector_dim int32 values)]

A sample of Python code to read and write the data is available here.
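
For reference, a minimal sketch of such a reader and writer (file names are illustrative; the chunked or memory-mapped reading needed for the full 10^9-vector files, e.g. via np.memmap, is omitted):

import numpy as np

def read_bin(path, dtype):
    """Read a .fbin (dtype=np.float32) or .ibin (dtype=np.int32) file into a 2-D array."""
    with open(path, "rb") as f:
        num_vectors, dim = (int(x) for x in np.fromfile(f, dtype=np.uint32, count=2))
        data = np.fromfile(f, dtype=dtype, count=num_vectors * dim)
    return data.reshape(num_vectors, dim)

def write_bin(path, array):
    """Write a 2-D array with the [num_vectors, vector_dim, data] layout described above."""
    with open(path, "wb") as f:
        np.asarray(array.shape, dtype=np.uint32).tofile(f)
        array.tofile(f)

# e.g. queries = read_bin("queries.fbin", np.float32)
#      gt = read_bin("groundtruth.ibin", np.int32)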

Contacts

Feel free to contact us at ask-research@yandex-team.ru with any questions.
