Optimization

Most machine learning algorithms build an optimization model and learn its parameters from the given data. Developing effective and efficient optimization methods is therefore essential.


Publications

  • First Order Methods with Markovian Noise: from Acceleration to Variational Inequalities

    Optimization, Machine learning theory
    Aleksandr Beznosikov
    Sergey Samsonov
    Marina Sheshukova
    Alexander Gasnikov
    Alexey Naumov
    Eric Moulines
    NeurIPS, 2023

    This paper delves into stochastic optimization problems that involve Markovian noise. We present a unified approach to the theoretical analysis of first-order gradient methods for stochastic optimization and variational inequalities. Our approach covers both non-convex and strongly convex minimization problems. To achieve an optimal (linear) dependence on the mixing time of the underlying noise sequence, we use a randomized batching scheme based on the multilevel Monte Carlo method (an illustrative sketch of such an estimator appears after this list). Moreover, our technique allows us to eliminate the limiting assumptions of previous research on Markov noise, such as the need for a bounded domain and uniformly bounded stochastic gradients. Our extension to variational inequalities under Markovian noise is original. Additionally, we provide lower bounds that match the oracle complexity of our method in the case of strongly convex optimization problems.

  • Similarity, Compression and Local Steps: Three Pillars of Efficient Communications for Distributed Variational Inequalities

    Optimization, Machine learning theory
    Aleksandr Beznosikov
    Martin Takáč
    Alexander Gasnikov
    NeurIPS, 2023

    Variational inequalities are a broad and flexible class of problems that includes minimization, saddle point, and fixed point problems as special cases. Therefore, variational inequalities are used in various applications ranging from equilibrium search to adversarial learning. With the increasing size of data and models, today’s instances demand parallel and distributed computing for real-world machine learning problems, most of which can be represented as variational inequalities. Meanwhile, most distributed approaches share a significant bottleneck: the cost of communication. The three main techniques for reducing both the total number of communication rounds and the cost of a single round are similarity of local functions, compression of transmitted information, and local updates. In this paper, we combine all three approaches (a toy simulation of local steps with compression appears after this list). Such a triple synergy did not previously exist for variational inequalities and saddle point problems, or even for minimization problems. The methods presented in this paper have the best theoretical guarantees on communication complexity and significantly outperform other methods for distributed variational inequalities. The theoretical results are confirmed by adversarial learning experiments on synthetic and real datasets.

  • A View of Mini-batch SGD via Generating Functions: Conditions of Convergence, Phase Transitions, Benefit from Negative Momenta

    Optimization
    Maksim Velikanov
    Denis Kuznedelev
    Dmitry Yarotsky
    ICLR, 2023

    Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper, we develop a new analytic framework to analyze noise-averaged properties of mini-batch SGD for linear models at constant learning rates, momenta, and batch sizes. Our key idea is to consider the dynamics of the second moments of model parameters for a special family of "Spectrally Expressible" approximations. This allows us to obtain an explicit expression for the generating function of the sequence of loss values. By analyzing this generating function, we find, in particular, that 1) the SGD dynamics exhibit several convergent and divergent regimes depending on the spectral distributions of the problem; 2) the convergent regimes admit explicit stability conditions, as well as explicit loss asymptotics in the case of power-law spectral distributions; 3) the optimal convergence rate can be achieved at negative momenta (a minimal sketch of mini-batch SGD with negative momentum appears after this list). We verify our theoretical predictions by extensive experiments on MNIST and synthetic problems and find good quantitative agreement.
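
As a companion to the first paper above, the sketch below gives a minimal, illustrative implementation of an MLMC-style randomized-batching gradient estimator under Markovian noise. It is not the authors' algorithm: the function names (mlmc_gradient, sample_chain, grad), the truncation level, and the AR(1) toy problem are assumptions made purely for illustration.

```python
import numpy as np

def mlmc_gradient(x, sample_chain, grad, max_level=6, rng=None):
    """Illustrative MLMC-style randomized-batching gradient estimator.

    Draws a random level J with P(J = j) = 2^{-j} (truncated at max_level),
    averages stochastic gradients over 2^J and 2^{J-1} consecutive states of
    a Markov chain, and combines them with weight 2^J so that, in expectation
    over J, the estimator behaves like a large-batch gradient while the
    expected number of sampled states stays logarithmic in that batch size.
    """
    rng = np.random.default_rng() if rng is None else rng
    J = min(int(rng.geometric(0.5)), max_level)      # P(J = j) = 2^{-j}
    batch = sample_chain(2 ** J)                     # 2^J consecutive chain states
    g_fine = np.mean([grad(x, z) for z in batch], axis=0)
    g_coarse = np.mean([grad(x, z) for z in batch[: 2 ** (J - 1)]], axis=0)
    g_base = grad(x, batch[0])
    # Telescoping combination: the 2^J weight cancels the 2^{-J} probability.
    return g_base + 2 ** J * (g_fine - g_coarse)

# Toy usage: quadratic objective f(x) = E[0.5 * ||x - z||^2], where the noise
# z_t follows an AR(1) Markov chain with correlation rho.
dim, rho = 5, 0.9
rng = np.random.default_rng(0)
state = np.zeros(dim)

def sample_chain(n):
    global state
    out = []
    for _ in range(n):
        state = rho * state + np.sqrt(1.0 - rho ** 2) * rng.normal(size=dim)
        out.append(state.copy())
    return out

grad = lambda x, z: x - z                            # gradient of 0.5 * ||x - z||^2
x = rng.normal(size=dim)
for _ in range(500):
    x -= 0.1 * mlmc_gradient(x, sample_chain, grad, rng=rng)
print("||x|| after training:", np.linalg.norm(x))    # minimizer of the averaged objective is x = 0
```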
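
For the second paper, the following single-process simulation illustrates two of the three communication-saving ingredients it discusses, local updates and unbiased compression (random-k sparsification), on a simple strongly monotone variational inequality. It does not reproduce the paper's methods or its use of similarity; every function name and parameter value here is an illustrative assumption.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased random-k sparsification: keep k coordinates, rescale by d/k."""
    idx = rng.choice(v.size, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (v.size / k)
    return out

def run_local(F, z, gamma, steps):
    """Run a few local VI steps z <- z - gamma * F(z); return the model difference."""
    z_local = z.copy()
    for _ in range(steps):
        z_local -= gamma * F(z_local)
    return z_local - z

def local_steps_with_compression(operators, z0, gamma=0.05, local_steps=5,
                                 rounds=300, k=3, seed=0):
    """Toy distributed VI loop: each worker m runs local steps on its operator
    F_m, sends a compressed model difference, and the server averages and
    broadcasts.  Illustrative only, with no convergence guarantees claimed."""
    rng = np.random.default_rng(seed)
    z = z0.copy()
    for _ in range(rounds):
        deltas = [rand_k(run_local(F, z, gamma, local_steps), k, rng)
                  for F in operators]
        z = z + np.mean(deltas, axis=0)              # averaged compressed update
    return z

# Toy usage: strongly monotone affine operators F_m(z) = A_m z - b_m; the VI
# solution is the root of the averaged operator.
rng = np.random.default_rng(1)
dim, workers = 6, 4
ops = []
for _ in range(workers):
    S = rng.normal(size=(dim, dim))
    A = S @ S.T / dim + np.eye(dim)                  # symmetric positive definite
    b = rng.normal(size=dim)
    ops.append(lambda z, A=A, b=b: A @ z - b)
z = local_steps_with_compression(ops, np.zeros(dim))
residual = np.linalg.norm(np.mean([F(z) for F in ops], axis=0))
print("norm of averaged operator at output:", residual)
```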
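
For the third paper, here is a minimal sketch of the algorithm being analyzed: mini-batch SGD with a constant learning rate and heavy-ball momentum on a linear least-squares model, run with positive, zero, and negative momentum. The synthetic spectrum, parameter values, and helper name are illustrative; this is the plain algorithm, not the paper's generating-function framework.

```python
import numpy as np

def sgd_heavy_ball(X, y, lr=0.1, beta=0.0, batch=8, steps=300, seed=0):
    """Mini-batch SGD with constant learning rate and heavy-ball momentum on
    the linear least-squares loss 0.5 * E[(x^T w - y)^2]; beta may be negative,
    mirroring the negative-momentum regime discussed above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, v, losses = np.zeros(d), np.zeros(d), []
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch   # mini-batch gradient
        v = beta * v + g                               # momentum buffer
        w -= lr * v
        losses.append(0.5 * np.mean((X @ w - y) ** 2))
    return losses

# Synthetic linear problem with a decaying (power-law-like) feature spectrum.
rng = np.random.default_rng(0)
d = 20
X = rng.normal(size=(512, d)) * np.arange(1, d + 1) ** -0.5
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=512)
for beta in (0.5, 0.0, -0.3):
    losses = sgd_heavy_ball(X, y, beta=beta)
    print(f"momentum {beta:+.1f}: final loss {losses[-1]:.3e}")
```

Comparing the three runs on a spectrum like this gives a quick, informal feel for how the sign of the momentum affects the loss reached at a fixed learning rate and batch size.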