Dmitry Baranchuk


  • Distributed Inference and Fine-tuning of Large Language Models Over The Internet

    Large-scale machine learningNatural language processing
    Alexander Borzunov
    Max Ryabinin
    Artem Chumachenko
    Dmitry Baranchuk
    Tim Dettmers
    Younes Belkada
    Pavel Samygin
    Colin Raffel
    NeurIPS, 2023

    Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLM efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware, joining and leaving at will. In order to do that, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals — a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10х faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.

  • DeDrift: Robust Similarity Search under Content Drift

    Nearest neighbor search
    Dmitry Baranchuk
    Matthijs Douze
    Yash Upadhyay
    I. Zeki Yalniz
    ICCV, 2023

    The statistical distribution of content uploaded and searched on media sharing sites changes over time due to seasonal, sociological and technical factors. We investigate the impact of this "content drift" for large-scale similarity search tools, based on nearest neighbor search in embedding space. Unless a costly index reconstruction is performed frequently, content drift degrades the search accuracy and efficiency. The degradation is especially severe since, in general, both the query and database distributions change. We introduce and analyze real-world image and video datasets for which temporal information is available over a long time period. Based on the learnings, we devise DeDrift, a method that updates embedding quantizers to continuously adapt large-scale indexing structures on-the-fly. DeDrift almost eliminates the accuracy degradation due to the query and database content drift while being up to 100x faster than a full index reconstruction.

  • TabDDPM: Modelling Tabular Data with Diffusion Models

    Tabular dataGenerative models
    Akim Kotelnikov
    Dmitry Baranchuk
    Ivan Rubachev
    Artem Babenko
    ICML, 2023

    Denoising diffusion probabilistic models are becoming the leading generative modeling paradigm for many important data modalities. Being the most prevalent in the computer vision community, diffusion models have recently gained some attention in other domains, including speech, NLP, and graph-like data. In this work, we investigate if the framework of diffusion models can be advantageous for general tabular problems, where data points are typically represented by vectors of heterogeneous features. The inherent heterogeneity of tabular data makes it quite challenging for accurate modeling since the individual features can be of a completely different nature, i.e., some of them can be continuous and some can be discrete. To address such data types, we introduce TabDDPM --- a diffusion model that can be universally applied to any tabular dataset and handles any feature types. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields.



  • Text-to-Image dataset for billion-scale similarity search

    Computer visionNatural language processing Nearest neighbor search
    Dmitry Baranchuk
    Artem Babenko

    Yandex Text-to-Image (T2I) dataset is collected to foster the research in billion-scale nearest neighbor search (NNS) when query distribution differs from the indexing one. In particular, this dataset addresses the cross-domain setting: a user specifies a textual query and requests the search engine to retrieve the most relevant images to the query. Notably, current large-scale indexing methods perform poorly in this setting. Therefore, novel highly-performant indexing solutions robust to out-of-domain queries are in high demand.

    The dataset represents a snapshot of the Yandex visual search engine and contains 1 billion 200-dimensional image embeddings for indexing. The image embeddings are produced by the Se-ResNext-101 model. The embeddings for textual queries are extracted by a variant of the DSSM model.

    Read more about the data format and how to download the dataset in the related post.