Machine learning theory

We study various aspects related to theoretical understanding of ML models and algorithms.

Publications

  • FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

    Optimization · Machine learning theory · Natural language processing
    Philip Zmushko
    Aleksandr Beznosikov
    Martin Takáč
    Samuel Horváth
    ICML, 2025

    As the number of parameters in large language models grows, training demands ever larger volumes of GPU memory, a significant portion of which is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA), low-rank gradient projection (GaLore), and blockwise optimization (BAdam) have been proposed. However, in all these algorithms the effective rank of the weight updates remains low, which can lead to a substantial loss of information from the gradient. This loss can be especially critical during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a new memory-efficient optimization framework. FRUGAL leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for the low-dimensional updates and SGD for the state-free updates. Additionally, our method consistently outperforms concurrent approaches, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics. (A toy sketch of the gradient-splitting idea appears after the publication list.)

  • Measuring Diversity: Axioms and Challenges

    Machine learning theory
    Mikhail Mironov
    Liudmila Prokhorenkova
    ICML, 2025

    This paper addresses the problem of quantifying diversity for a set of objects. First, we conduct a systematic review of existing diversity measures and explore their undesirable behavior in certain cases. Based on this review, we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity. We show that none of the existing measures has all three properties, and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradictory. Unfortunately, the constructed examples are too computationally expensive (NP-hard) for practical use. Thus, we pose an open problem: construct a diversity measure that has all the listed properties and can be computed in practice, or prove that all such measures are NP-hard to compute. (A toy numerical illustration of how a naive diversity measure can misbehave appears after the publication list.)

  • On Linear Convergence in Smooth Convex-Concave Bilinearly-Coupled Saddle-Point Optimization: Lower Bounds and Optimal Algorithms

    Optimization · Machine learning theory
    Ekaterina Borodich
    Alexander Gasnikov
    Dmitry Kovalev
    ICML, 2025

    We revisit the smooth convex-concave bilinearly-coupled saddle-point problem of the form min_x max_y f(x) + ⟨y, Bx⟩ − g(y). In the highly specific case where the function f(x) is strongly convex and the function g(y) is affine, or both functions are affine, there exist lower bounds on the number of gradient evaluations and matrix-vector multiplications required to solve the problem, as well as matching optimal algorithms. A notable aspect of these algorithms is that they attain linear convergence, i.e., the number of iterations required to solve the problem is proportional to log(1/ε). However, the class of bilinearly-coupled saddle-point problems for which linear convergence is possible is much wider and can involve general smooth non-strongly-convex functions f(x) and g(y). Therefore, we develop the first lower complexity bounds and matching optimal linearly converging algorithms for this problem class. Our lower complexity bounds are much more general, yet they cover and unify the existing results in the literature. On the other hand, our algorithm implements the separation of complexities, which, for the first time, enables the simultaneous achievement of both optimal gradient evaluation and matrix-vector multiplication complexities, resulting in the best theoretical performance to date. (A generic primal-dual baseline for this problem class is sketched after the publication list.)
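
Below is a minimal, self-contained sketch (not the authors' code) of the gradient-splitting idea described in the FRUGAL abstract: the gradient of a toy quadratic loss is split into a fixed low-dimensional subspace, which is updated with an Adam-style rule that keeps optimizer state only for those coordinates, while the residual directions receive a state-free SGD update. The random projection, toy loss, dimensions, and learning rates are illustrative assumptions, not details from the paper.

```python
# Toy sketch of gradient splitting: stateful updates in a low-dim subspace,
# state-free updates along the remaining directions. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 16                                     # full dimension, subspace size
w_star = rng.standard_normal(d)                    # target weights of the toy loss
W = np.zeros(d)                                    # flattened weight vector
P, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal basis of the subspace

m, v = np.zeros(r), np.zeros(r)                    # Adam state only for r coordinates
beta1, beta2, eps = 0.9, 0.999, 1e-8
lr_adam, lr_sgd = 1e-2, 1e-1

def loss_grad(w):
    """Gradient of the toy loss 0.5 * ||w - w_star||^2."""
    return w - w_star

for t in range(1, 1001):
    g = loss_grad(W)

    # Split the gradient into a low-dimensional component and its complement.
    g_low = P.T @ g                    # r coordinates handled by a stateful optimizer
    g_res = g - P @ g_low              # residual handled by a state-free optimizer

    # Stateful (Adam-style) update on the low-dimensional part.
    m = beta1 * m + (1 - beta1) * g_low
    v = beta2 * v + (1 - beta2) * g_low ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    W = W - lr_adam * (P @ (m_hat / (np.sqrt(v_hat) + eps)))

    # State-free (plain SGD) update along the remaining directions.
    W = W - lr_sgd * g_res

print("final loss:", 0.5 * np.sum((W - w_star) ** 2))
```

The optimizer state here is 2r numbers instead of 2d, which is the memory saving the abstract refers to, while the overall update still spans the full space.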
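
The next snippet is a toy illustration (not taken from the paper) of why naive diversity measures can behave undesirably: average pairwise distance, a natural candidate, decreases when a genuinely new object is added between two existing ones. How such behavior relates to the paper's monotonicity axiom depends on the formal definitions given there; the example only motivates why an axiomatic treatment is needed.

```python
# Toy example: average pairwise distance can DECREASE when a new, distinct
# object is added to the set -- an arguably undesirable behavior for a
# diversity measure. Not the measure or axioms from the paper.
from itertools import combinations

def avg_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs (0 for <2 points)."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

two_points = [(0.0, 0.0), (1.0, 0.0)]
three_points = two_points + [(0.5, 0.0)]    # add a genuinely new midpoint

print(avg_pairwise_distance(two_points))    # 1.0
print(avg_pairwise_distance(three_points))  # (1.0 + 0.5 + 0.5) / 3 ≈ 0.667
```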
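
Finally, a generic extragradient baseline for the bilinearly-coupled saddle-point problem min_x max_y f(x) + ⟨y, Bx⟩ − g(y) from the third abstract. This is a standard textbook method shown only to make the problem class concrete; it is not the optimal, separation-of-complexities algorithm developed in the paper, and the quadratic choices of f and g, the matrix B, and the step size are illustrative assumptions.

```python
# Extragradient on min_x max_y f(x) + <y, Bx> - g(y) with toy quadratics
# f(x) = 0.5*||x||^2 and g(y) = 0.5*||y||^2. Generic baseline, not the paper's method.
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 5
B = rng.standard_normal((m, n))

grad_f = lambda x: x          # gradient of f(x) = 0.5 * ||x||^2
grad_g = lambda y: y          # gradient of g(y) = 0.5 * ||y||^2

def operator(x, y):
    """Monotone operator of the saddle problem: (d/dx L, -d/dy L)."""
    return grad_f(x) + B.T @ y, grad_g(y) - B @ x

x, y = rng.standard_normal(n), rng.standard_normal(m)
eta = 0.9 / (1.0 + np.linalg.norm(B, 2))   # step size below 1 / (Lipschitz constant)

for _ in range(2000):
    gx, gy = operator(x, y)                # extrapolation step
    x_half, y_half = x - eta * gx, y - eta * gy
    gx, gy = operator(x_half, y_half)      # update step at the extrapolated point
    x, y = x - eta * gx, y - eta * gy

print("operator residual:", np.linalg.norm(np.concatenate(operator(x, y))))
```

In this toy instance the unique saddle point is (x, y) = (0, 0), so the printed operator residual shrinking toward zero is a simple check that the iteration converges.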