Machine learning theory

FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training
Optimization Machine learning theory Natural language processing
Philip Zmushko
Aleksandr Beznosikov
Martin Takáč
Samuel Horváth
ICML, 2025
With the increase in the number of parameters in large language models, the training process increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA), low-rank gradient projection (GaLore), and blockwise optimization (BAdam) have been proposed. However, in all these algorithms, the effective rank of the weight updates remains low-rank, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a new memory-efficient optimization framework. FRUGAL leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
Measuring Diversity: Axioms and Challenges
Machine learning theory
Mikhail Mironov
Liudmila Prokhorenkova
ICML, 2025
This paper addresses the problem of quantifying diversity for a set of objects. First, we conduct a systematic review of existing diversity measures and explore their undesirable behavior in certain cases. Based on this review, we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity. We show that none of the existing measures has all three properties and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradictory. Unfortunately, the constructed examples are too computationally expensive (NP-hard) for practical use. Thus, we pose an open problem of constructing a diversity measure that has all the listed properties and can be computed in practice or proving that all such measures are NP-hard to compute.
On Linear Convergence in Smooth Convex-Concave Bilinearly-Coupled Saddle-Point Optimization: Lower Bounds and Optimal Algorithms
Optimization Machine learning theory
Ekaterina Borodich
Alexander Gasnikov
Dmitry Kovalev
ICML, 2025
We revisit the smooth convex-concave bilinearly-coupled saddle-point problem of the form minxmaxy f(x)+⟨y,Bx⟩−g(y). In the highly specific case where function f(x) is strongly convex and function g(y) is affine, or both functions are affine, there exist lower bounds on the number of gradient evaluations and matrix-vector multiplications required to solve the problem, as well as matching optimal algorithms. A notable aspect of these algorithms is that they are able to attain linear convergence, i.e., the number of iterations required to solve the problem is proportional to log⁡(1/ϵ). However, the class of bilinearly-coupled saddle-point problems for which linear convergence is possible is much wider and can involve general smooth non-strongly convex functions f(x) and g(y). Therefore, we develop the first lower complexity bounds and matching optimal linearly converging algorithms for this problem class. Our lower complexity bounds are much more general, but they cover and unify the existing results in the literature. On the other hand, our algorithm implements the separation of complexities, which, for the first time, enables the simultaneous achievement of both optimal gradient evaluation and matrix-vector multiplication complexities, resulting in the best theoretical performance to date.

Machine learning theory

Posts

How to validate validation measures

How profitable to sell a series of goods to the same buyer?

Publications

FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

Measuring Diversity: Axioms and Challenges

On Linear Convergence in Smooth Convex-Concave Bilinearly-Coupled Saddle-Point Optimization: Lower Bounds and Optimal Algorithms