To encourage future development of scalable similarity search algorithms, Yandex releases two billion-scale datasets that can serve as representative benchmarks for researchers from the machine learning and algorithmic communities interested in efficient similarity search. Both datasets are released under the CC BY 4.0 license.
The Deep1B dataset (Babenko et al., CVPR'2016) consists of 10^9 image embeddings produced as the outputs of the last fully-connected layer of the GoogLeNet model, pretrained on the ImageNet classification task. The embeddings are then compressed by PCA to 96 dimensions and l2-normalized. The similarity measure for Deep1B is the Euclidean distance.
The entire database of 10^9 image embeddings can be downloaded here. We also provide a smaller subset of 10M embeddings for debugging, an additional learning set of 350M embeddings, and a query set containing 10K embeddings. For all queries, the top-100 groundtruth nearest neighbors w.r.t. the database set are precomputed, and their indices can be downloaded here.
The Text-to-Image-1B dataset contains data from both the textual and visual modalities, as is typical of cross-modal retrieval tasks, where database and query vectors can have different distributions in a shared representation space. In Text-to-Image-1B, the database consists of image embeddings produced by the Se-ResNext-101 model, and the queries are textual embeddings produced by a variant of the DSSM model. The mapping to the shared representation space is learned by minimizing a variant of the triplet loss on clickthrough data. The similarity measure for this dataset is the inner product. The dimensionality of both the image and textual embedding vectors is 200.
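Unlike the Euclidean case, inner-product similarity is not a metric, but exact top-k retrieval is still straightforward as a brute-force baseline. A rough NumPy sketch on toy random data (the 200-dimensional shape mirrors this dataset; `top_k_inner_product` is a hypothetical helper, not part of the release):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: 200-dim database and query vectors, as in Text-to-Image-1B.
# Note: vectors are deliberately NOT normalized; similarity is inner product.
db = rng.standard_normal((10_000, 200)).astype(np.float32)
queries = rng.standard_normal((5, 200)).astype(np.float32)

def top_k_inner_product(queries, db, k=100):
    """Exact top-k neighbors by inner product (brute-force baseline)."""
    scores = queries @ db.T                          # (num_queries, num_db)
    # argpartition avoids fully sorting the database axis.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Order the k candidates by score, descending.
    rows = np.arange(queries.shape[0])[:, None]
    order = np.argsort(-scores[rows, part], axis=1)
    return part[rows, order]

ids = top_k_inner_product(queries, db, k=10)
assert ids.shape == (5, 10)
```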
The entire database of 10^9 image embeddings can be downloaded here. We also provide two smaller subsets of 1M and 10M image embeddings for preliminary experiments and debugging. 100K textual query embeddings can be downloaded here, and the indices of the top-100 groundtruth neighbors w.r.t. the database set are provided here. An additional 50M textual query embeddings are provided as training data for learnable indexing structures.
All embedding data is stored in the .fbin format: [num_vectors (uint32), vector_dim (uint32), vector_array (float32)]. The groundtruth is stored in the .ibin format: [num_vectors (uint32), vector_dim (uint32), vector_array (int32)]. A sample of Python code to read and write the data is provided.
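The linked sample is not reproduced here, but a minimal reader/writer for the layout described above might look like the following NumPy sketch (`read_bin` and `write_bin` are hypothetical helper names):

```python
import os
import tempfile

import numpy as np

def write_bin(path, arr):
    """Write a 2-D array in the .fbin/.ibin layout:
    [num_vectors (uint32), vector_dim (uint32), flat row-major data]."""
    n, d = arr.shape
    with open(path, "wb") as f:
        np.asarray([n, d], dtype=np.uint32).tofile(f)
        arr.tofile(f)

def read_bin(path, dtype):
    """Read a .fbin (dtype=np.float32) or .ibin (dtype=np.int32) file."""
    with open(path, "rb") as f:
        n, d = np.fromfile(f, dtype=np.uint32, count=2)
        # Cast to Python int: n * d overflows uint32 at billion scale
        # (e.g. 10^9 vectors x 96 dims).
        data = np.fromfile(f, dtype=dtype, count=int(n) * int(d))
    return data.reshape(int(n), int(d))

# Round-trip a tiny toy array through the format.
vectors = np.random.default_rng(0).standard_normal((3, 4)).astype(np.float32)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.fbin")
    write_bin(path, vectors)
    restored = read_bin(path, np.float32)
    assert np.array_equal(restored, vectors)
```

For billion-scale files, `np.memmap` over the data region (offset 8 bytes for the header) avoids loading everything into RAM.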
Feel free to contact us at firstname.lastname@example.org with any questions.