Papers accepted to ICML 2024

Three papers by the Yandex Research team (shown in bold below) and our collaborators have been accepted for publication at the International Conference on Machine Learning (ICML 2024).
Extreme Compression of Large Language Models via Additive Quantization by Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
Large language models (LLMs) have emerged rapidly in recent years. While numerous commercial options offer top-notch quality, the open-source community has quickly caught up: releases of large open-source models such as Llama, Mistral, and Qwen have set off a race toward efficient quantization techniques that enable their execution on end-user devices.
In this paper, we consider the case of “extreme” LLM compression, where the aim is to achieve exceptionally low bit counts (typically 2 to 3 bits per parameter), from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach from information retrieval to advance the state of the art in LLM compression via two innovations: 1) learned additive quantization of weight matrices in an input-adaptive fashion, and 2) joint optimization of codebook parameters across entire layer blocks. Broadly, AQLM is the first scheme that is Pareto-optimal in terms of accuracy vs. model size when compressing to less than 3 bits per parameter, and it significantly improves upon all known schemes in the extreme compression (2-bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation.
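To make the multi-codebook idea concrete, here is a minimal NumPy sketch of additive quantization for a single group of weights: the group is encoded as the sum of one codeword from each codebook, chosen greedily over the residual. The group size, number of codebooks, and greedy residual encoding are illustrative assumptions; AQLM itself optimizes codes and codebooks jointly and input-adaptively, as described above.

```python
import numpy as np

def additive_quantize_group(w, codebooks):
    """Greedily encode one weight group as a sum of codewords,
    picking one codeword per codebook (simplified sketch; AQLM
    uses joint, input-adaptive optimization rather than this greedy pass)."""
    residual = w.copy()
    codes, approx = [], np.zeros_like(w)
    for cb in codebooks:                      # cb: (codebook_size, group_dim)
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        approx += cb[idx]
        residual = w - approx
    return codes, approx

# toy example: groups of 8 weights, 2 codebooks of 256 codewords each
# => 2 * 8 index bits per group of 8 weights ≈ 2 bits per weight (plus codebook storage)
rng = np.random.default_rng(0)
group_dim, n_codebooks, codebook_size = 8, 2, 256
codebooks = [rng.standard_normal((codebook_size, group_dim)) * 0.02
             for _ in range(n_codebooks)]
w = rng.standard_normal(group_dim) * 0.02
codes, w_hat = additive_quantize_group(w, codebooks)
print(codes, np.linalg.norm(w - w_hat))
```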
The Frank-Wolfe (FW) method is one of the classical and popular approaches for solving optimization problems with “simple” constraints (balls, simplices, etc.) that arise in machine learning applications. In recent years, stochastic versions of FW have gained popularity, motivated by large datasets for which computing the full gradient is prohibitively expensive. In this paper, we present two new variants of the FW algorithm for stochastic finite-sum minimization. The modifications are based on so-called variance-reduced methods (in particular, the SARAH and PAGE methods). Our algorithms have the best convergence guarantees among existing stochastic FW approaches for both convex and non-convex objective functions. Our methods avoid the need to repeatedly collect large batches, which is common to many stochastic projection-free (FW-like) approaches. Moreover, our second approach requires neither large batches nor full deterministic gradients, which is a typical weakness of many techniques for finite-sum problems. The faster theoretical rates of our approaches are confirmed experimentally.
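As a rough illustration of how a variance-reduced, projection-free step looks, below is a small NumPy sketch of Frank-Wolfe on a least-squares finite sum constrained to the probability simplex, using a SARAH-style recursive gradient estimator. The step-size schedule, epoch length, and toy objective are assumptions for illustration only, not the exact algorithms analyzed in the paper.

```python
import numpy as np

def lmo_simplex(g):
    """Linear minimization oracle over the probability simplex:
    returns the vertex e_i minimizing <g, v>."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

def sarah_fw(A, b, n_epochs=20, inner=None, seed=0):
    """Sketch of a projection-free (FW) method with a SARAH-style
    recursive gradient estimator on f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2,
    constrained to the simplex. Step sizes and epoch lengths are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    inner = inner or n
    x = np.full(d, 1.0 / d)                        # start at the simplex center
    for s in range(n_epochs):
        g = A.T @ (A @ x - b) / n                  # full gradient at the checkpoint
        x_prev = x.copy()
        for t in range(inner):
            gamma = 2.0 / (s * inner + t + 2)
            x = x + gamma * (lmo_simplex(g) - x)   # FW step using the estimator g
            i = rng.integers(n)                    # SARAH recursive update
            g = g + A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ x_prev - b[i])
            x_prev = x.copy()
    return x

# toy run
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)); b = rng.standard_normal(200)
x = sarah_fw(A, b)
print(x.round(3), 0.5 * np.mean((A @ x - b) ** 2))
```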
Robust Reinforcement Learning (RRL) is a promising Reinforcement Learning (RL) paradigm aimed at training models that are robust to uncertainty or disturbances, making them more efficient for real-world applications. In this paradigm, uncertainty or disturbances are interpreted as the actions of a second, adversarial agent, so the problem reduces to finding agent policies that are robust to any opponent's actions. This paper is the first to propose considering RRL problems within positional differential game theory, which gives us a theoretically justified intuition for developing a centralized Q-learning approach. Namely, we prove that under Isaacs's condition (which is sufficiently general for real-world dynamical systems), the same Q-function can be used as an approximate solution of both the minimax and maximin Bellman equations. Based on these results, we present the Isaacs Deep Q-Network algorithms and demonstrate their superiority over other baseline RRL and Multi-Agent RL algorithms in various environments.
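The centralized Q-learning idea can be sketched in a few lines of PyTorch: a single network scores every pair of (agent action, adversary action), and the Bellman target takes the max over the agent's actions of the min over the adversary's actions, which under Isaacs's condition coincides with the opposite order. The architecture sizes and target below are illustrative assumptions, not the exact Isaacs Deep Q-Network implementation from the paper.

```python
import torch
import torch.nn as nn

class CentralizedQ(nn.Module):
    """Q(s, u, v): one network scoring every (agent action u, adversary action v)
    pair for a given state. Sketch of the centralized-Q idea; hidden sizes are
    illustrative assumptions."""
    def __init__(self, state_dim, n_u, n_v, hidden=128):
        super().__init__()
        self.n_u, self.n_v = n_u, n_v
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_u * n_v),
        )

    def forward(self, s):                       # -> (batch, n_u, n_v)
        return self.net(s).view(-1, self.n_u, self.n_v)

def bellman_target(q_target, s_next, r, done, gamma=0.99):
    """Shared target for the maximizing agent and minimizing adversary.
    Under Isaacs's condition, max_u min_v and min_v max_u coincide, so a
    single value can serve both players' updates."""
    with torch.no_grad():
        q_next = q_target(s_next)                              # (batch, n_u, n_v)
        v_next = q_next.min(dim=2).values.max(dim=1).values    # max_u min_v Q
        return r + gamma * (1.0 - done) * v_next
```

The online network would then be regressed toward this target at the replayed (u, v) action pair, exactly as in a standard DQN update.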