Computer vision

The Yandex Research team regularly contributes to the computer vision research community, mostly in the fields of image retrieval and generative modelling.




  • Hyperbolic Vision Transformers: Combining Improvements in Metric Learning

    Computer vision, Representations, Ranking
    Aleksandr Ermolov
    Leyla Mirvakhabova
    Valentin Khrulkov
    Nicu Sebe
    Ivan Oseledets
    CVPR, 2022

    Metric learning aims to learn a highly discriminative model encouraging the embeddings of similar classes to be close in the chosen metrics and pushed apart for dissimilar ones. The common recipe is to use an encoder to extract embeddings and a distance-based loss function to match the representations – usually, the Euclidean distance is utilized. An emerging interest in learning hyperbolic data embeddings suggests that hyperbolic geometry can be beneficial for natural data. Following this line of work, we propose a new hyperbolic-based model for metric learning. At the core of our method is a vision transformer with output embeddings mapped to hyperbolic space. These embeddings are directly optimized using modified pairwise cross-entropy loss. We evaluate the proposed model with six different formulations on four datasets achieving the new state-of-the-art performance. The source code is available at
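The geometry described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: it assumes the Poincaré ball model of hyperbolic space, and the curvature `c`, the temperature `tau`, and all function names are illustrative choices that do not come from the text above. The loss is a simplified pairwise cross-entropy in which row `i` of `z_a` and row `i` of `z_b` form a positive pair and the remaining rows of `z_b` act as negatives.

```python
import numpy as np


def expmap0(v, c=1.0, eps=1e-9):
    """Map Euclidean vectors onto the Poincare ball (exponential map at the origin)."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)


def mobius_add(x, y, c=1.0):
    """Mobius addition, the hyperbolic analogue of vector addition on the ball."""
    xy = np.sum(x * y, axis=-1, keepdims=True)
    x2 = np.sum(x * x, axis=-1, keepdims=True)
    y2 = np.sum(y * y, axis=-1, keepdims=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den


def hyp_dist(x, y, c=1.0):
    """Hyperbolic distance between points of the Poincare ball (broadcasts over leading axes)."""
    sqrt_c = np.sqrt(c)
    diff_norm = np.linalg.norm(mobius_add(-x, y, c), axis=-1)
    # Clip to keep arctanh finite for points numerically on the boundary.
    arg = np.clip(sqrt_c * diff_norm, 0.0, 1 - 1e-7)
    return 2.0 / sqrt_c * np.arctanh(arg)


def pairwise_ce_loss(z_a, z_b, c=1.0, tau=0.2):
    """Simplified pairwise cross-entropy: z_a[i] should be closest to z_b[i]."""
    # dist[i, j] is the hyperbolic distance between z_a[i] and z_b[j].
    dist = hyp_dist(z_a[:, None, :], z_b[None, :, :], c)
    logits = -dist / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


# Toy usage: four orthogonal "encoder outputs" mapped to the ball.
encoder_out = np.eye(4)
z = expmap0(encoder_out)
loss_matched = pairwise_ce_loss(z, z)
loss_shuffled = pairwise_ce_loss(z, np.roll(z, 1, axis=0))
```

In this toy setup the loss is lower when the positive pairs are aligned than when they are shuffled, which is the behaviour a distance-based metric learning loss is meant to reward.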

  • Training Transformers Together

    Computer vision, Large-scale machine learning, Generative models
    Alexander Borzunov
    Max Ryabinin
    Tim Dettmers
    Quentin Lhoest
    Lucile Saulnier
    Michael Diskin
    Yacine Jernite
    Thomas Wolf
    NeurIPS Demos, 2022

    The infrastructure necessary for training state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, we collaboratively trained a text-to-image transformer similar to OpenAI DALL-E. We invited the viewers to join the ongoing training run, showing them instructions on how to contribute using the available hardware. We explained how to address the engineering challenges associated with such a training run (slow communication, limited memory, uneven performance between devices, and security concerns) and discussed how the viewers can set up collaborative training runs themselves. Finally, we show that the resulting model generates images of reasonable quality on a number of prompts.

  • Label-Efficient Semantic Segmentation with Diffusion Models

    Computer vision, Segmentation
    Dmitry Baranchuk
    Ivan Rubachev
    Andrey Voynov
    Valentin Khrulkov
    Artem Babenko
    ICLR, 2022

    Denoising diffusion probabilistic models have recently received much research attention since they outperform alternative approaches, such as GANs, and currently provide state-of-the-art generative performance. The superior performance of diffusion models has made them an appealing tool in several applications, including inpainting, super-resolution, and semantic editing. In this paper, we demonstrate that diffusion models can also serve as an instrument for semantic segmentation, especially in the setup when labeled data is scarce. In particular, for several pretrained diffusion models, we investigate the intermediate activations from the networks that perform the Markov step of the reverse diffusion process. We show that these activations effectively capture the semantic information from an input image and appear to be excellent pixel-level representations for the segmentation problem. Based on these observations, we describe a simple segmentation method, which can work even if only a few training images are provided. Our approach significantly outperforms the existing alternatives on several datasets for the same amount of human supervision.


  • Text-to-Image dataset for billion-scale similarity search

    Computer vision, Natural language processing, Nearest neighbor search
    Dmitry Baranchuk
    Artem Babenko

    The Yandex Text-to-Image (T2I) dataset was collected to foster research in billion-scale nearest neighbor search (NNS) when the query distribution differs from that of the indexed data. In particular, this dataset addresses the cross-domain setting: a user specifies a textual query and requests the search engine to retrieve the images most relevant to the query. Notably, current large-scale indexing methods perform poorly in this setting. Novel high-performance indexing solutions that are robust to out-of-domain queries are therefore in high demand.

    The dataset is a snapshot of the Yandex visual search engine and contains 1 billion 200-dimensional image embeddings for indexing. The image embeddings are produced by the SE-ResNeXt-101 model. The embeddings for textual queries are extracted by a variant of the DSSM model.

    Read more about the data format and how to download the dataset in the related post.
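The cross-domain search setting described for the T2I dataset can be illustrated with a small exact-search baseline. This is a sketch under stated assumptions, not the dataset's official tooling: random arrays stand in for the real image and text embeddings, inner product is assumed as the similarity measure, the dataset's file format is not reproduced, and the "approximate" search over half of the database is a hypothetical stand-in for a real index, used only to show how recall is measured.

```python
import numpy as np

DIM = 200  # embedding dimensionality stated in the dataset description


def top_k_inner_product(queries, database, k):
    """Exact search: indices of the k database rows with the largest inner product, best first."""
    scores = queries @ database.T                            # (n_queries, n_database)
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]    # unordered top-k candidates
    part_scores = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-part_scores, axis=1)                 # sort candidates by score
    return np.take_along_axis(part, order, axis=1)


def recall_at_k(found, truth):
    """Fraction of true neighbors that were retrieved, averaged over queries."""
    hits = sum(len(set(f) & set(t)) for f, t in zip(found, truth))
    return hits / (truth.shape[0] * truth.shape[1])


rng = np.random.default_rng(0)
# Synthetic stand-ins: 1000 "image" vectors and 5 "text" queries drawn from a
# deliberately different distribution to mimic the out-of-domain query setting.
image_embeddings = rng.standard_normal((1000, DIM)).astype(np.float32)
text_queries = (0.5 * rng.standard_normal((5, DIM)) + 0.3).astype(np.float32)

ground_truth = top_k_inner_product(text_queries, image_embeddings, k=10)

# Hypothetical approximate search: look only at the first half of the database.
approx = top_k_inner_product(text_queries, image_embeddings[:500], k=10)
approx_recall = recall_at_k(approx, ground_truth)
```

At the billion-vector scale of the actual dataset, exact search like this is only practical for computing ground truth; the point of the benchmark is to measure how much recall an approximate index retains on out-of-domain queries.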