Optimization

Most machine learning algorithms build an optimization model and learn its parameters from the given data. Developing effective and efficient optimization methods is therefore essential.

Publications

  • Sarah Frank-Wolfe: Methods for Constrained Optimization with Best Rates and Practical Features

    Optimization, Machine learning theory
    Aleksandr Beznosikov
    David Dobre
    Gauthier Gidel
    ICML, 2024

    The Frank-Wolfe (FW) method is a popular approach for solving optimization problems with structured constraints that arise in machine learning applications. In recent years, stochastic versions of FW have gained popularity, motivated by large datasets for which computing the full gradient is prohibitively expensive. In this paper, we present two new variants of the FW algorithm for stochastic finite-sum minimization. Our algorithms have the best convergence guarantees among existing stochastic FW approaches for both convex and non-convex objective functions. Our methods avoid the need to repeatedly collect large batches, which is common to many stochastic projection-free approaches. Moreover, our second approach requires neither large batches nor full deterministic gradients, a typical weakness of many techniques for finite-sum problems. The faster theoretical rates of our approaches are confirmed experimentally.
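
    As a concrete illustration of the projection-free template these variants build on, here is a minimal mini-batch Frank-Wolfe sketch for finite-sum least squares over an l1 ball. It is not the paper's algorithm; the objective, constraint radius, batch size, and function names are illustrative assumptions.

    ```python
    # Mini-batch Frank-Wolfe for finite-sum least squares over an l1 ball.
    # Illustrative sketch only: a linear minimization oracle (LMO) plus a
    # convex-combination step, the projection-free template stochastic FW variants refine.
    import numpy as np

    def lmo_l1_ball(grad, radius):
        """LMO over {x : ||x||_1 <= radius}: returns a signed vertex of the ball."""
        s = np.zeros_like(grad)
        i = np.argmax(np.abs(grad))
        s[i] = -radius * np.sign(grad[i])
        return s

    def stochastic_frank_wolfe(A, b, radius=5.0, n_iters=500, batch_size=32, seed=0):
        rng = np.random.default_rng(seed)
        n, d = A.shape
        x = np.zeros(d)
        for t in range(n_iters):
            idx = rng.choice(n, size=batch_size, replace=False)
            # Mini-batch gradient estimate of the finite-sum least-squares objective.
            g = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
            s = lmo_l1_ball(g, radius)
            gamma = 2.0 / (t + 2)             # classical Frank-Wolfe step size
            x = (1 - gamma) * x + gamma * s   # convex combination keeps x feasible
        return x

    # Hypothetical usage on synthetic sparse-regression data.
    rng = np.random.default_rng(1)
    A = rng.normal(size=(1000, 50))
    x_true = np.zeros(50); x_true[:5] = 1.0
    b = A @ x_true + 0.1 * rng.normal(size=1000)
    x_hat = stochastic_frank_wolfe(A, b)
    ```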

  • Ito Diffusion Approximation of Universal Ito Chains for Sampling, Optimization and Boosting

    Optimization, Machine learning theory, Gradient boosting
    Aleksei Ustimenko
    Aleksandr Beznosikov
    ICLR, 2024

    In this work, we consider a rather general and broad class of Markov chains, Ito chains, that look like the Euler-Maruyama discretization of some stochastic differential equation. The chain we study is a unified framework for theoretical analysis: it allows almost arbitrary isotropic and state-dependent noise instead of the normal, state-independent noise assumed in most related papers. Moreover, the drift and diffusion coefficients in our chain can be inexact, which covers a wide range of applications such as Stochastic Gradient Langevin Dynamics, sampling, Stochastic Gradient Descent, and Stochastic Gradient Boosting. We prove a bound in the W2-distance between the laws of our Ito chain and the corresponding differential equation. These results improve on or cover most of the known estimates, and for some particular cases our analysis is the first.
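
    For context, the update that such an Ito chain generalizes is the familiar Euler-Maruyama step. The sketch below discretizes the Langevin SDE for a standard Gaussian target; the target, step size, and function names are illustrative assumptions, not taken from the paper.

    ```python
    # Euler-Maruyama discretization of the Langevin SDE
    #   dX_t = grad log pi(X_t) dt + sqrt(2) dW_t,
    # i.e. a chain with exact drift and Gaussian, state-independent noise,
    # the setting that the paper's Ito chains relax. Illustrative sketch only.
    import numpy as np

    def euler_maruyama_langevin(grad_log_pi, x0, eta=1e-2, n_steps=5000, seed=0):
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        traj = np.empty((n_steps, x.size))
        for k in range(n_steps):
            noise = rng.normal(size=x.shape)  # state-independent here; general chains allow more
            x = x + eta * grad_log_pi(x) + np.sqrt(2 * eta) * noise
            traj[k] = x
        return traj

    # Standard Gaussian target: grad log pi(x) = -x.
    samples = euler_maruyama_langevin(lambda x: -x, x0=np.zeros(2))
    print(samples[-1000:].mean(axis=0))  # roughly zero after burn-in
    ```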

  • First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities

    Optimization, Machine learning theory
    Aleksandr Beznosikov
    Sergey Samsonov
    Marina Sheshukova
    Alexander Gasnikov
    Alexey Naumov
    Eric Moulines
    NeurIPS, 2023

    This paper delves into stochastic optimization problems that involve Markovian noise. We present a unified approach for the theoretical analysis of first-order gradient methods for stochastic optimization and variational inequalities. Our approach covers scenarios for both non-convex and strongly convex minimization problems. To achieve an optimal (linear) dependence on the mixing time of the underlying noise sequence, we use a randomized batching scheme based on the multilevel Monte Carlo method. Moreover, our technique allows us to eliminate the limiting assumptions of previous research on Markov noise, such as the need for a bounded domain and uniformly bounded stochastic gradients. Our extension to variational inequalities under Markovian noise is original. Additionally, we provide lower bounds that match the oracle complexity of our method in the case of strongly convex optimization problems.
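
    The sketch below only illustrates the noise model in question: SGD whose sample indices follow a slowly mixing Markov chain (a random walk on a cycle) rather than being drawn i.i.d. It is not the paper's MLMC-based randomized-batching method, and the problem, step size, and function names are illustrative assumptions.

    ```python
    # SGD with Markovian sampling: consecutive indices are correlated because they
    # follow a random walk on a cycle over the data. Illustrative sketch only.
    import numpy as np

    def sgd_markov_sampling(A, b, eta=1e-2, n_iters=20000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = A.shape
        x = np.zeros(d)
        i = rng.integers(n)                      # current state of the index chain
        for _ in range(n_iters):
            i = (i + rng.choice([-1, 1])) % n    # Markov step on the cycle of indices
            g = A[i] * (A[i] @ x - b[i])         # gradient of 0.5 * (a_i^T x - b_i)^2
            x = x - eta * g
        return x

    # Hypothetical usage on synthetic least-squares data.
    rng = np.random.default_rng(1)
    A = rng.normal(size=(500, 20))
    x_true = rng.normal(size=20)
    b = A @ x_true + 0.01 * rng.normal(size=500)
    x_hat = sgd_markov_sampling(A, b)
    ```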