- October 12, 2022Research
Graph-based nearest neighbor search
- April 26, 2021Research
Benchmarks for Billion-Scale Similarity Search
Nearest neighbor search
Nearest neighbor search is a long-standing problem arising in a large number of machine learning applications, such as recommender services, information retrieval, and others.
Graph-based Nearest Neighbor Search in Hyperbolic SpacesICLR,
The nearest neighbor search (NNS) problem is widely studied in Euclidean space, and graph-based algorithms are known to outperform other approaches for this task. However, hyperbolic geometry often allows for better data representation in various domains, including graphs, words, and images. In this paper, we show that graph-based approaches are also well suited for hyperbolic geometry. From a theoretical perspective, we rigorously analyze the time and space complexity of graph-based NNS, assuming that an n-element dataset is uniformly distributed within a d-dimensional ball of radius R in the hyperbolic space of curvature -1. Under some conditions on R and d, we derive the time and space complexity of graph-based NNS and compare the obtained results with known guarantees for the Euclidean case. Interestingly, in the dense setting (d << log(n)) and under some assumptions on the radius R, graph-based NNS has lower time complexity in the hyperbolic space. This agrees with our experiments: we consider datasets embedded in hyperbolic and Euclidean spaces and show that graph-based NNS can be more efficient in the hyperbolic space. We also demonstrate that graph-based methods outperform other existing baselines for hyperbolic NNS. Overall, our theoretical and empirical analysis suggests that graph-based NNS can be considered a default approach for similarity search in hyperbolic spaces.
Results of the NeurIPS’21 Challenge on Billion-Scale Approximate Nearest Neighbor SearchNeurIPS Competitions,
Despite the broad range of algorithms for Approximate Nearest Neighbor Search, most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points. However, deploying recent advances in embedding based techniques for search, recommendation and ranking at scale require ANNS indices at billion, trillion or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale vis-à-vis their hardware cost. This competition compares ANNS algorithms at billion-scale by hardware cost, accuracy and performance. We set up an open source evaluation framework and leaderboards for both standardized and specialized hardware. The competition involves three tracks. The standard hardware track T1 evaluates algorithms on an Azure VM with limited DRAM, often the bottleneck in serving billion-scale indices, where the embedding data can be hundreds of GigaBytes in size. It uses FAISS as the baseline. The standard hardware track T2 additional allows inexpensive SSDs in addition to the limited DRAM and uses DiskANN as the baseline. The specialized hardware track T3 allows any hardware configuration, and again uses FAISS as the baseline. We compiled six diverse billion-scale datasets, four newly released for this competition, that span a variety of modalities, data types, dimensions, deep learning models, distance functions and sources. The outcome of the competition was ranked leaderboards of algorithms in each track based on recall at a query throughput threshold. Additionally, for track T3, separate leaderboards were created based on recall as well as cost-normalized and power-normalized query throughput.
Graph-based Nearest Neighbor Search: From Practice to TheoryICML,
Graph-based approaches are empirically shown to be very successful for the nearest neighbor search (NNS). However, there has been very little research on their theoretical guarantees. We fill this gap and rigorously analyze the performance of graph-based NNS algorithms, specifically focusing on the low-dimensional (d << log n) regime. In addition to the basic greedy algorithm on nearest neighbor graphs, we also analyze the most successful heuristics commonly used in practice: speeding up via adding shortcut edges and improving accuracy via maintaining a dynamic list of candidates. We believe that our theoretical insights supported by experimental analysis are an important step towards understanding the limits and benefits of graph-based NNS algorithms.
Text-to-Image dataset for billion-scale similarity search
Yandex Text-to-Image (T2I) dataset is collected to foster the research in billion-scale nearest neighbor search (NNS) when query distribution differs from the indexing one. In particular, this dataset addresses the cross-domain setting: a user specifies a textual query and requests the search engine to retrieve the most relevant images to the query. Notably, current large-scale indexing methods perform poorly in this setting. Therefore, novel highly-performant indexing solutions robust to out-of-domain queries are in high demand.
The dataset represents a snapshot of the Yandex visual search engine and contains 1 billion 200-dimensional image embeddings for indexing. The image embeddings are produced by the Se-ResNext-101 model. The embeddings for textual queries are extracted by a variant of the DSSM model.
Read more about the data format and how to download the dataset in the related post.