Machine learning theory

We study various aspects related to theoretical understanding of ML models and algorithms.

Area 8. Machine learning theory.svg

Posts

Publications

  • Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

    Computer visionGenerative modelsMachine learning theory
    Mikhail Persiianov
    Arip Asadulaev
    Nikita Andreev
    Nikita Starodubcev
    Dmitry Baranchuk
    Anastasis Kratsios
    Evgeny Burnaev
    Alexander Korotin
    ICML, 2026

    Learning conditional distributions π*(·|x) is a central problem in machine learning, which is typically approached via supervised methods with paired data (x, y) ∼ π*. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of semi-supervised models that utilize both limited paired data and additional unpaired i.i.d. samples x ∼ π*ₓ and y ∼ π*ᵧ from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm called EBiEOT that integrates both paired and unpaired data seamlessly using data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish an end-to-end learning algorithm to get π*(·|x). In addition, we derive the universal approximation property, demonstrating that our approach can theoretically recover true conditional distributions with arbitrarily small error. Finally, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously.

  • Softsign: Smooth Sign in Your Optimizer for Better Parameter Heterogeneity Handling

    OptimizationMachine learning theory
    Dmitrii Feoktistov
    Timofey Belinsky
    Andrey Veprikov
    Amir Zainullin
    Aleksandr Beznosikov
    ICML, 2026

    Sign-based and LMO-inspired optimizers have recently attracted substantial attention in deep learning due to their strong performance and low memory footprint. However, their fixed-magnitude updates can hurt terminal convergence: they decouple update mechanisms from gradient magnitudes and fail to account for parameter heterogeneity, often leading to oscillation rather than convergence. We propose SoftSignum, a smooth relaxation of sign-based optimization that replaces the hard sign map with a temperature-controlled soft-sign transformation, enabling a parameter-wise transition from sign-like updates to magnitude-sensitive SGD-like steps. We complement it with an adaptive quantile-based temperature schedule and extend the same principle to matrix-valued optimizers, obtaining SoftMuon. We also develop a generalized geometry-relaxation framework based on strongly convex regularizers and Fenchel conjugates, proving convergence in stochastic non-convex setting. Experiments on diverse deep learning tasks, including LLM pretraining, show that SoftSignum and SoftMuon consistently improve over their hard sign-based counterparts and standard AdamW.

  • Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization

    OptimizationMachine learning theory
    Ekaterina Borodich
    Dmitry Kovalev
    ICLR, 2026

    In this paper, we focus on the problem of minimizing a continuously differentiable convex objective function, $\min_x f(x)$. Recently, Malitsky (2020); Alacaoglu et al. (2023) developed an adaptive first-order method, GRAAL. This algorithm computes stepsizes by estimating the local curvature of the objective function without any line search procedures or hyperparameter tuning, and attains the standard iteration complexity $\mathcal{O}(L\Vert x_0-x^* \Vert^2/\epsilon)$ of fixed-stepsize gradient descent for $L$-smooth functions. However, a natural question arises: is it possible to accelerate the convergence of GRAAL to match the optimal complexity $\mathcal{O}(\sqrt{L\Vert x_0-x^*\Vert^2/\epsilon})$ of the accelerated gradient descent of Nesterov (1983)? Although some attempts have been made by Li and Lan (2025); Suh and Ma (2025), the ability of existing accelerated algorithms to adapt to the local curvature of the objective function is highly limited. We resolve this issue and develop GRAAL with Nesterov acceleration, which can adapt its stepsize to the local curvature at a geometric, or linear, rate just like non-accelerated GRAAL. We demonstrate the adaptive capabilities of our algorithm by proving that it achieves near-optimal iteration complexities for $L$-smooth functions, as well as under a more general $(L_0,L_1)$-smoothness assumption (Zhang et al., 2019).