Generative models

Generative models in computer vision are a powerful tool for a wide range of applications.

Publications

  • Alchemist: Turning Public Text-to-Image Data into Generative Gold

    Generative models · Computer vision
    Valerii Startsev
    Alexander Ustyuzhanin
    Alexey Kirillov
    Dmitry Baranchuk
    Sergey Kastryulin
    NeurIPS Datasets and Benchmarks, 2025

    Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness depends heavily on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models’ weights to the public.
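
    To make the core idea concrete, here is a minimal sketch of using a pre-trained T2I diffusion model to rank candidate samples for an SFT subset. It assumes diffusers-style interfaces (a UNet called with `encoder_hidden_states`, a scheduler with `add_noise`), and the scoring heuristic (denoising loss at mid-range timesteps as a proxy for sample impact) is an illustrative assumption, not the estimator from the paper.

    ```python
    # Illustrative sketch (not the paper's estimator): rank candidate
    # image-text pairs by a denoising-loss heuristic and keep the top-k
    # as a compact SFT subset. Assumes diffusers-style UNet/scheduler APIs.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def score_sample(unet, scheduler, latents, text_emb, n_probes=4):
        """Proxy 'impact' score: mean denoising loss at mid-range timesteps."""
        total = 0.0
        for _ in range(n_probes):
            t = torch.randint(200, 800, (latents.shape[0],), device=latents.device)
            noise = torch.randn_like(latents)
            noisy = scheduler.add_noise(latents, noise, t)
            pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
            total += F.mse_loss(pred, noise).item()
        return total / n_probes

    def select_sft_subset(candidates, unet, scheduler, k=3350):
        """Keep the k highest-scoring (latents, text_emb) pairs for fine-tuning."""
        scored = [(score_sample(unet, scheduler, z, e), i)
                  for i, (z, e) in enumerate(candidates)]
        scored.sort(reverse=True)
        return [candidates[i] for _, i in scored[:k]]
    ```

    The top-k size mirrors the abstract's 3,350-sample scale; the paper's actual criterion for identifying high-impact samples may differ substantially from this toy heuristic.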

  • YaART: Yet Another ART Rendering Technology

    Computer vision · Generative models
    Sergey Kastryulin
    Artem Konev
    Alexander Shishenya
    Eugene Lyapustin
    Artem Khurshudov
    Alexander Tselousov
    Nikita Vinokurov
    Denis Kuznedelev
    Alexander Markovich
    Grigoriy Livshits
    Alexey Kirillov
    Anastasiia Tabisheva
    Liubov Chubarova
    Marina Kaminskaia
    Alexander Ustyuzhanin
    Artemii Shvetsov
    Daniil Shlenskii
    Valerii Startsev
    Dmitrii Kornilov
    Mikhail Romanov
    Dmitry Baranchuk
    Artem Babenko
    Sergei Ovcharenko
    Valentin Khrulkov
    KDD, 2025

    In the rapidly progressing field of generative models, the development of efficient and high-fidelity text-to-image diffusion systems represents a significant frontier. This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences using Reinforcement Learning from Human Feedback (RLHF). During the development of YaART, we focus in particular on the choices of model and training dataset sizes, aspects that had not previously been systematically investigated for text-to-image cascaded diffusion models. We comprehensively analyze how these choices affect both the efficiency of the training process and the quality of the generated images, which are highly important in practice. Furthermore, we demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets, establishing a more efficient regime for diffusion model training. From the quality perspective, YaART is consistently preferred by users over many existing state-of-the-art models.
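
    The abstract does not spell out the RLHF recipe, so the sketch below shows a generic reward-weighted regression step for a diffusion model, a common simplification of preference alignment; the `reward_model` callable, the softmax weighting, and all hyperparameters are illustrative assumptions rather than YaART's actual procedure.

    ```python
    # Illustrative sketch (the paper's RLHF recipe is not given here):
    # one reward-weighted fine-tuning step for a text-to-image diffusion
    # model. reward_model is a hypothetical frozen human-preference scorer.
    import torch
    import torch.nn.functional as F

    def reward_weighted_step(unet, scheduler, reward_model, optimizer,
                             latents, text_emb, beta=1.0):
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample

        # Per-sample denoising loss, weighted toward samples the (frozen)
        # preference model scores highly.
        per_sample = F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))
        with torch.no_grad():
            weights = torch.softmax(beta * reward_model(latents, text_emb), dim=0)
        loss = (weights * per_sample).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```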

  • Inverse Bridge Matching Distillation

    Generative models · Computer vision
    Nikita Gushchin
    David Li
    Daniil Selikhanovych
    Evgeny Burnaev
    Dmitry Baranchuk
    Alexander Korotin
    ICML, 2025

    Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from slow inference. To address this, we propose a novel distillation technique based on the inverse bridge matching formulation and derive a tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional DBMs, distill models into a one-step generator, and use only corrupted images for training. We evaluate our approach for both conditional and unconditional bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique accelerates DBM inference by 4x to 100x and, depending on the setup, can even provide better generation quality than the teacher model.
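
    For intuition about what one-step distillation of a bridge model buys, here is a naive regress-to-teacher sketch; it is not the paper's inverse bridge matching objective, and `teacher_step`, `student`, and the MSE target are illustrative assumptions. Note that, as in the abstract's setting, only corrupted images enter the training loop.

    ```python
    # Illustrative sketch (NOT the paper's inverse bridge matching objective):
    # naive one-step distillation where a student regresses onto the slow
    # multi-step teacher's outputs, using only corrupted inputs.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def teacher_sample(teacher_step, x_corrupted, n_steps=100):
        """Run the multi-step teacher DBM from a corrupted image to a clean one."""
        x = x_corrupted.clone()
        for i in reversed(range(n_steps)):
            x = teacher_step(x, i)  # one reverse step of the bridge
        return x

    def distill_step(student, teacher_step, optimizer, x_corrupted):
        """Single-step student imitates the teacher: one forward pass at inference."""
        target = teacher_sample(teacher_step, x_corrupted)
        pred = student(x_corrupted)
        loss = F.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```

    Once trained, the student replaces the 100-step teacher loop with a single forward pass, which is where speedups of the order the abstract reports come from.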