Benchmarks for Billion-Scale Similarity Search

Deep1B

The Deep1B dataset (Babenko et al., CVPR 2016) consists of 1B image embeddings taken from the output of the last fully-connected layer of a GoogLeNet model pretrained on the ImageNet classification task. The embeddings are then compressed to 96 dimensions with PCA and L2-normalized. The similarity measure for Deep1B is the Euclidean distance.
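Because the vectors are normalized to unit length, squared Euclidean distance and inner product are tied by the identity ||a - b||^2 = 2 - 2<a, b>, so ranking by smallest distance and ranking by largest inner product give the same order. The following is a minimal sketch of that identity on toy 3-dimensional vectors; the helper names are illustrative and not part of any dataset tooling.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for 96-dimensional Deep1B embeddings.
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

# For unit vectors the two quantities below agree up to rounding.
print(squared_l2(a, b), 2.0 - 2.0 * dot(a, b))
```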

The entire database of 1B image embeddings can be downloaded here. We also provide a smaller subset of 10M embeddings for debugging, an additional learning set of 350M embeddings, and a query set containing 10K embeddings. For every query, the top-100 groundtruth nearest neighbors w.r.t. the database set are precomputed, and their indices can be downloaded here.
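One common way to use precomputed groundtruth indices is to score an approximate search method by recall@k: the fraction of the true top-k neighbors that appear among the k results returned. The page does not prescribe an evaluation metric, so the sketch below is only an assumption about typical usage, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(found, groundtruth, k):
    """Average fraction of the true top-k ids present in the returned top-k ids."""
    total = 0.0
    for found_ids, true_ids in zip(found, groundtruth):
        total += len(set(found_ids[:k]) & set(true_ids[:k])) / k
    return total / len(groundtruth)

# Toy data: two queries, three groundtruth ids each (the real files hold 100 per query).
groundtruth = [[7, 2, 9], [4, 1, 0]]
found = [[7, 9, 5], [1, 4, 0]]

print(recall_at_k(found, groundtruth, k=3))
```

Order within the top-k does not matter under this definition; only set membership is counted.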

Text-to-Image-1B

The Text-to-Image-1B dataset contains data from both textual and visual modalities, as is typical of cross-modal retrieval tasks, where database and query vectors can have different distributions in the shared representation space. In Text-to-Image-1B, the database consists of image embeddings produced by the SE-ResNeXt-101 model, and the queries are textual embeddings produced by a variant of the DSSM model. The mapping to the shared representation space is learned by minimizing a variant of the triplet loss on clickthrough data. The similarity measure for this dataset is the inner product. Both image and textual embedding vectors have 200 dimensions.
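With the inner product as the similarity measure, a larger score means a closer match, so the search keeps the highest-scoring vectors rather than the nearest ones. The sketch below is an exhaustive (brute-force) top-k search over a toy database, meant only to pin down what the measure computes; it is not a method suggested by the source, and the function names are illustrative.

```python
import heapq

def inner_product(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_inner_product(query, database, k):
    """Return indices of the k database vectors with the largest inner product."""
    scores = [inner_product(query, vec) for vec in database]
    # nlargest orders by descending score; ties keep the lower index first.
    return heapq.nlargest(k, range(len(database)), key=lambda i: scores[i])

# Toy 2-dimensional stand-ins for the 200-dimensional embeddings.
database = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0], [-1.0, -1.0]]
query = [1.0, 1.0]

top2 = top_k_inner_product(query, database, k=2)
print(top2)  # scores are 1, 3, 4, -2, so this prints [2, 1]
```

Unlike the unit-norm Euclidean case, the score here grows with vector length, so a long vector can outrank one that points in a more similar direction.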

The entire database of 1B image embeddings can be downloaded here. We also provide two smaller subsets of 1M and 10M image embeddings for preliminary experiments and debugging. 100K textual query embeddings can be downloaded here, and the indices of the top-100 groundtruth w.r.t. the database set are provided here. An additional 50M textual query embeddings are provided as training data for learnable indexing structures.

Data format

All embedding data is stored in .fbin format:

[num_vectors (uint32), vector_dim (uint32), vector_array (float32)]
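The layout above amounts to an 8-byte header followed by num_vectors * vector_dim float32 values. The following is a minimal standard-library sketch of a round trip through that layout. It assumes little-endian byte order and row-major storage (one vector after another), neither of which is stated explicitly here, and `write_fbin`, `read_fbin`, and `fbin_size_bytes` are hypothetical helper names.

```python
import array
import os
import struct
import sys
import tempfile

def write_fbin(path, vectors):
    """Write equal-length float vectors as .fbin (little-endian assumed)."""
    num_vectors, dim = len(vectors), len(vectors[0])
    flat = array.array("f", [x for vec in vectors for x in vec])
    if sys.byteorder != "little":
        flat.byteswap()
    with open(path, "wb") as f:
        f.write(struct.pack("<II", num_vectors, dim))  # 8-byte header
        flat.tofile(f)

def read_fbin(path):
    """Read a whole .fbin file into a list of vectors (small files only)."""
    with open(path, "rb") as f:
        num_vectors, dim = struct.unpack("<II", f.read(8))
        flat = array.array("f")
        flat.fromfile(f, num_vectors * dim)
    if sys.byteorder != "little":
        flat.byteswap()
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(num_vectors)]

def fbin_size_bytes(num_vectors, dim):
    """Expected file size: header plus 4 bytes per float32 value."""
    return 8 + num_vectors * dim * 4

path = os.path.join(tempfile.mkdtemp(), "tiny.fbin")
vectors = [[0.5, -1.0, 2.0], [4.0, 0.25, 8.0]]
write_fbin(path, vectors)
print(read_fbin(path), os.path.getsize(path))

# Under this layout, 1B vectors of 96 dimensions occupy roughly 384 GB.
print(fbin_size_bytes(10**9, 96))
```

Reading a full billion-vector file into Python lists this way is not practical; the sketch is meant to make the byte layout concrete, and a memory-mapped array would be the more usual choice at that scale.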

The groundtruth is stored in .ibin format:

[num_vectors (uint32), vector_dim (uint32), vector_array (int32)]
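Since every row has a fixed size, a range of rows can be read by seeking past the header and the preceding rows instead of loading the whole file. The sketch below does this for a tiny .ibin file built in place. It assumes little-endian byte order and row-major storage, and `read_ibin_rows` is a hypothetical helper name.

```python
import array
import os
import struct
import sys
import tempfile

def read_ibin_rows(path, start, count):
    """Read up to `count` rows of an .ibin file, beginning at row `start`."""
    with open(path, "rb") as f:
        num_vectors, dim = struct.unpack("<II", f.read(8))
        count = max(0, min(count, num_vectors - start))  # clamp to file bounds
        f.seek(8 + start * dim * 4)  # skip header and the rows before `start`
        flat = array.array("i")
        assert flat.itemsize == 4, "this sketch needs a 4-byte C int"
        flat.fromfile(f, count * dim)
    if sys.byteorder != "little":
        flat.byteswap()
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(count)]

# Build a toy groundtruth file: 3 queries, 2 neighbor ids each.
path = os.path.join(tempfile.mkdtemp(), "tiny.ibin")
with open(path, "wb") as f:
    f.write(struct.pack("<II", 3, 2))
    f.write(struct.pack("<6i", 10, 11, 20, 21, 30, 31))

print(read_ibin_rows(path, start=1, count=2))
```

The same offset arithmetic applies to .fbin files, since both formats use 4-byte elements after an identical header.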

Sample of Python code to read and write the data.

Contacts

Feel free to contact us at ask-research@yandex-team.ru with any questions.