- November 7, 2023Research
Faster customization of text-to-image models with DVAR
- October 18, 2023Research
How much do text-to-Image models know? A hypernymy-based approach
- July 6, 2022Research
What deep generative models can do with images
Computer vision
Yandex Research team regularly contributes to the computer vision research community, mostly in the field of image retrieval and generative modelling.
Posts
Publications
Hyperbolic Vision Transformers: Combining Improvements in Metric Learning
CVPR, 2022Metric learning aims to learn a highly discriminative model encouraging the embeddings of similar classes to be close in the chosen metrics and pushed apart for dissimilar ones. The common recipe is to use an encoder to extract embeddings and a distance-based loss function to match the representations – usually, the Euclidean distance is utilized. An emerging interest in learning hyperbolic data embeddings suggests that hyperbolic geometry can be beneficial for natural data. Following this line of work, we propose a new hyperbolic-based model for metric learning. At the core of our method is a vision transformer with output embeddings mapped to hyperbolic space. These embeddings are directly optimized using modified pairwise cross-entropy loss. We evaluate the proposed model with six different formulations on four datasets achieving the new state-of-the-art performance. The source code is available at https://github.com/htdt/hyp_metric
Training Transformers Together
NeurIPS Demos, 2022The infrastructure necessary for training state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, we collaboratively trained a text-to-image transformer similar to OpenAI DALL-E. We invited the viewers to join the ongoing training run, showing them instructions on how to contribute using the available hardware. We explained how to address the engineering challenges associated with such a training run (slow communication, limited memory, uneven performance between devices, and security concerns) and discussed how the viewers can set up collaborative training runs themselves. Finally, we show that the resulting model generates images of reasonable quality on a number of prompts.
Label-Efficient Semantic Segmentation with Diffusion Models
ICLR, 2022Denoising diffusion probabilistic models have recently received much research attention since they outperform alternative approaches, such as GANs, and currently provide state-of-the-art generative performance. The superior performance of diffusion models has made them an appealing tool in several applications, including inpainting, super-resolution, and semantic editing. In this paper, we demonstrate that diffusion models can also serve as an instrument for semantic segmentation, especially in the setup when labeled data is scarce. In particular, for several pretrained diffusion models, we investigate the intermediate activations from the networks that perform the Markov step of the reverse diffusion process. We show that these activations effectively capture the semantic information from an input image and appear to be excellent pixel-level representations for the segmentation problem. Based on these observations, we describe a simple segmentation method, which can work even if only a few training images are provided. Our approach significantly outperforms the existing alternatives on several datasets for the same amount of human supervision.
Datasets
Text-to-Image dataset for billion-scale similarity search
Yandex Text-to-Image (T2I) dataset is collected to foster the research in billion-scale nearest neighbor search (NNS) when query distribution differs from the indexing one. In particular, this dataset addresses the cross-domain setting: a user specifies a textual query and requests the search engine to retrieve the most relevant images to the query. Notably, current large-scale indexing methods perform poorly in this setting. Therefore, novel highly-performant indexing solutions robust to out-of-domain queries are in high demand.
The dataset represents a snapshot of the Yandex visual search engine and contains 1 billion 200-dimensional image embeddings for indexing. The image embeddings are produced by the Se-ResNext-101 model. The embeddings for textual queries are extracted by a variant of the DSSM model.
Read more about the data format and how to download the dataset in the related post.