Papers accepted to ICML 2025

Six papers by the Yandex Research team (shown in bold below) and our collaborators have been accepted for publication at the International Conference on Machine Learning (ICML 2025).
Discrete Neural Algorithmic Reasoning by Gleb Rodionov and Liudmila Prokhorenkova
In this work, we achieve perfect neural execution of several algorithms by forcing the node and edge representations to come from a fixed finite set. Moreover, the proposed architectural choice allows us to prove the correctness of the learned algorithms for any test data. See our blog post for more details.
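For intuition, here is a minimal sketch of one way to force node representations into a fixed finite set; the module name, the straight-through trick, and all shapes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DiscreteNodeStates(nn.Module):
    """Illustrative sketch (not the paper's exact design): snap each continuous
    node embedding onto one of a fixed, finite set of state vectors."""

    def __init__(self, hidden_dim: int, num_states: int):
        super().__init__()
        self.states = nn.Parameter(torch.randn(num_states, hidden_dim))  # the finite state set
        self.scorer = nn.Linear(hidden_dim, num_states)                  # logits over states

    def forward(self, node_embeddings: torch.Tensor) -> torch.Tensor:
        logits = self.scorer(node_embeddings)                  # (num_nodes, num_states)
        soft = torch.softmax(logits, dim=-1)
        hard = torch.nn.functional.one_hot(logits.argmax(dim=-1), soft.shape[-1]).float()
        # Straight-through: discrete assignment in the forward pass,
        # soft gradients in the backward pass.
        assignment = hard + (soft - soft.detach())
        return assignment @ self.states                        # each node now holds one of num_states vectors

# Usage: states = DiscreteNodeStates(hidden_dim=64, num_states=8)(torch.randn(10, 64))
```

Restricting every node to a finite set of states is what makes it possible, in principle, to reason about the learned algorithm's behaviour exhaustively rather than only empirically.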
Measuring Diversity: Axioms and Challenges by Mikhail Mironov and Liudmila Prokhorenkova
In this paper, we analyze diversity measures. For this, we formulate three simple desirable properties (axioms) of a good measure: monotonicity, uniqueness, and continuity. We show that none of the existing measures has all three properties, and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradictory. Unfortunately, these examples are NP-hard to compute. Thus, we pose an open problem of constructing a diversity measure that has all the listed properties and can be efficiently computed.
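To give a flavour of what such axioms look like, here is an informal paraphrase for a diversity measure D(X, d) defined on a finite set X with pairwise distances d; these are illustrative readings only, not the paper's formal definitions.

```latex
% Illustrative paraphrases only; see the paper for the formal axioms.
\begin{itemize}
  \item \textbf{Monotonicity:} if elements are moved further apart, i.e.\
        $d'(x, y) \ge d(x, y)$ for all $x, y \in X$, then $D(X, d') \ge D(X, d)$.
  \item \textbf{Uniqueness:} $D(X, d)$ attains its minimum exactly when all
        elements coincide, i.e.\ $d(x, y) = 0$ for all $x, y \in X$.
  \item \textbf{Continuity:} $D(X, d)$ is continuous with respect to the
        pairwise distances $d(x, y)$.
\end{itemize}
```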
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models by Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh
When LLMs generate text, they need to maintain a memory of previous tokens in the form of attention keys and values: tens of thousands of numbers for each token. For tasks where LLMs deal with long texts, this adds up to tens of gigabytes of GPU memory for every sequence in a batch. To avoid running out of GPU memory, practitioners compress these KV vectors, e.g., by quantizing or pruning them. We propose a better way of compressing these keys and values: instead of quantizing them individually, we exploit the mutual information between different layers to quantize them together. Our approach fits a simple linear predictor of adjacent-layer keys and values and only stores the part that cannot be predicted. This allows us to compress KV vectors with significantly better accuracy, especially for extreme 2-bit quantization.
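A minimal sketch of the cross-layer idea, with made-up shapes, a toy uniform quantizer, and a least-squares fit standing in for whatever the actual method uses: predict the next layer's keys from the previous layer's keys with a linear map and quantize only the residual.

```python
import torch

def quantize(x: torch.Tensor, bits: int = 2):
    """Toy uniform quantizer, used only for illustration."""
    levels = 2 ** bits - 1
    low, high = x.min(), x.max()
    scale = (high - low) / levels
    q = torch.round((x - low) / scale).clamp(0, levels)
    return q, low, scale

def dequantize(q, low, scale):
    return q * scale + low

# keys_prev, keys_next: (num_tokens, head_dim) key caches of two adjacent layers.
keys_prev = torch.randn(1024, 128)
keys_next = 0.8 * keys_prev + 0.1 * torch.randn(1024, 128)  # adjacent layers are correlated

# Fit a linear predictor of the next layer's keys from the previous layer's keys
# (plain least squares here; the actual method may differ).
W = torch.linalg.lstsq(keys_prev, keys_next).solution       # (head_dim, head_dim)
residual = keys_next - keys_prev @ W

# Store only the quantized residual (plus the small matrix W); the predictable
# part is reconstructed from the previous layer at decode time.
q, low, scale = quantize(residual, bits=2)
keys_next_restored = keys_prev @ W + dequantize(q, low, scale)

print("relative error:", ((keys_next_restored - keys_next).norm() / keys_next.norm()).item())
```

Because adjacent layers are strongly correlated, the residual has much lower variance than the raw keys and values, so it survives aggressive (e.g., 2-bit) quantization far better.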
FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training by Philip Zmushko, Aleksandr Beznosikov, Martin Takáč, Samuel Horváth
As the number of parameters in LLMs grows, pre-training and fine-tuning demand ever larger amounts of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. This naturally creates a desire to reduce these memory costs, which is precisely what this work focuses on.
The most obvious approach to reducing optimizer state memory costs is simply to decrease the size of the optimization space. However, prior works have gradually converged on the understanding that optimization in low-rank subspaces often doesn't provide sufficiently good results. While previous approaches developed techniques that alternate between different subspaces during the optimization process to obtain high-rank weight updates, these methods still suffer from a critical limitation — each individual optimization step remains low-rank. We address this shortcoming by proposing the FRUGAL framework. Our approach leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD.
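A rough sketch of this gradient-splitting step under simplifying assumptions (a single weight matrix and a fixed random orthonormal projection; the function and variable names are hypothetical): the low-rank component of the gradient gets an Adam-style update, while the residual directions get a state-free signSGD update, so the optimizer state only has to cover the small subspace.

```python
import torch

def frugal_like_step(weight, grad, proj, adam_state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One illustrative step that splits the gradient of a (d_out, d_in) weight matrix
    into a low-rank part (updated with Adam) and a residual part (updated with
    state-free signSGD). `proj` is a (d_in, r) matrix with orthonormal columns."""
    g = grad @ proj                              # (d_out, r): gradient inside the subspace
    grad_low = g @ proj.T                        # projection back to full space
    grad_rest = grad - grad_low                  # remaining directions

    # Adam state is kept only for the r-dimensional projected gradient.
    adam_state["step"] += 1
    adam_state["m"] = betas[0] * adam_state["m"] + (1 - betas[0]) * g
    adam_state["v"] = betas[1] * adam_state["v"] + (1 - betas[1]) * g * g
    m_hat = adam_state["m"] / (1 - betas[0] ** adam_state["step"])
    v_hat = adam_state["v"] / (1 - betas[1] ** adam_state["step"])
    update_low = (m_hat / (v_hat.sqrt() + eps)) @ proj.T

    # State-free signSGD update for everything outside the subspace.
    update_rest = grad_rest.sign()

    weight -= lr * (update_low + update_rest)

# Usage with hypothetical shapes.
d_out, d_in, r = 256, 512, 16
proj, _ = torch.linalg.qr(torch.randn(d_in, r))
state = {"step": 0, "m": torch.zeros(d_out, r), "v": torch.zeros(d_out, r)}
w, g = torch.randn(d_out, d_in), torch.randn(d_out, d_in)
frugal_like_step(w, g, proj, state)
```

The memory saving comes from the Adam moments being of shape (d_out, r) instead of (d_out, d_in), while the update itself remains full-rank because the residual directions are still touched every step.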
We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms competing approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance.
Inverse Bridge Matching Distillation by Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin
In recent years, diffusion bridge models (DBMs) have proven highly effective for inverse problems requiring high perceptual quality and for general image-to-image translation tasks: super-resolution, JPEG restoration, inpainting, sketch-to-image, etc. However, DBMs, like classical diffusion models, require 10-1000 steps to simulate the reverse SDE. To address this, we propose a novel distillation technique based on the inverse bridge matching formulation and derive a tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional DBMs, distill models into a one-step generator, and use only corrupted images for training. We evaluate our approach for both conditional and unconditional bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique accelerates DBM inference by 4x to 100x and, depending on the setup, even improves generation quality over the teacher model. Our method outperforms previous acceleration approaches based on consistency distillation (CBD/CBT) and more advanced sampling techniques (DBIM).
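To make the speed-up concrete, here is a schematic comparison with entirely hypothetical function names (this is not the paper's sampler or distillation objective): the teacher bridge model refines the corrupted input over many small steps, while the distilled student maps a corrupted image to a restored one in a single forward pass.

```python
import torch

@torch.no_grad()
def teacher_sample(bridge_model, corrupted, num_steps=100):
    """Schematic multi-step sampler, standing in for reverse-SDE simulation.
    `bridge_model(x, t)` is assumed to return a refinement direction."""
    x = corrupted.clone()
    for step in range(num_steps):
        t = 1.0 - step / num_steps            # time runs from corrupted (t=1) to clean (t=0)
        x = x + bridge_model(x, t) / num_steps
    return x

@torch.no_grad()
def student_sample(student_model, corrupted):
    """Distilled one-step generator: a single forward pass, roughly num_steps times cheaper."""
    return student_model(corrupted)
```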
EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search by Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
The high computational costs of large language models (LLMs) have spurred intense research into compression techniques such as quantization, sparsification, and structured pruning. Recently, dynamic, non-uniform compression methods—adjusting per-block or per-layer levels to minimize accuracy loss while meeting global compression targets—have emerged. However, existing approaches often depend on heuristics, assuming error monotonicity: that total model error tracks the sum of layer-wise errors. We show this assumption fails for LLMs, where lower per-layer error sums can yield worse results. To address this, we propose EvoPress, a general evolutionary framework for provably optimal dynamic compression within a given input range. EvoPress offers strong theoretical guarantees and low evaluation complexity. Applied to Llama, Mistral, and Phi models, EvoPress sets new state-of-the-art results across dynamic structural pruning, unstructured sparsity, and quantization with variable bitwidths.
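As a toy illustration of this kind of level search (the mutation scheme, budget handling, and fitness function below are simplified assumptions, not the EvoPress algorithm), one can evolve a per-layer bitwidth assignment under a global average-bitwidth budget and keep whichever candidate evaluates best.

```python
import random

def evolutionary_bitwidth_search(num_layers, evaluate, budget=3.0,
                                 choices=(2, 3, 4), generations=50, offspring=8):
    """Toy search loop: `evaluate(assignment)` returns a loss (lower is better)
    for a per-layer bitwidth assignment. Simplified illustration only."""
    def within_budget(a):
        return sum(a) / len(a) <= budget

    # Start from a uniform assignment at the budget.
    best = [min(choices, key=lambda c: abs(c - budget))] * num_layers
    best_loss = evaluate(best)

    for _ in range(generations):
        for _ in range(offspring):
            child = best[:]
            # Mutate two layers: give one layer more bits and another fewer,
            # so the average bitwidth stays near the budget.
            i, j = random.sample(range(num_layers), 2)
            child[i] = min(max(choices), child[i] + 1)
            child[j] = max(min(choices), child[j] - 1)
            if not within_budget(child):
                continue
            loss = evaluate(child)
            if loss < best_loss:
                best, best_loss = child, loss
    return best, best_loss

# Usage with a hypothetical evaluator that prefers more bits in early layers.
def toy_loss(assignment):
    return sum((4 - bits) / (i + 1) for i, bits in enumerate(assignment))

print(evolutionary_bitwidth_search(num_layers=8, evaluate=toy_loss))
```

In practice the evaluation would be a cheap proxy for model quality (for example, perplexity on a small calibration set), which is where keeping the number of evaluations low matters.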