- December 8, 2021Research
How to validate validation measures
- April 19, 2017Research
How profitable to sell a series of goods to the same buyer?
Machine learning theory
We study various aspects related to theoretical understanding of ML models and algorithms.
Posts
Publications
Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization
ICLR, 2026In this paper, we focus on the problem of minimizing a continuously differentiable convex objective function, $\min_x f(x)$. Recently, Malitsky (2020); Alacaoglu et al. (2023) developed an adaptive first-order method, GRAAL. This algorithm computes stepsizes by estimating the local curvature of the objective function without any line search procedures or hyperparameter tuning, and attains the standard iteration complexity $\mathcal{O}(L\Vert x_0-x^* \Vert^2/\epsilon)$ of fixed-stepsize gradient descent for $L$-smooth functions. However, a natural question arises: is it possible to accelerate the convergence of GRAAL to match the optimal complexity $\mathcal{O}(\sqrt{L\Vert x_0-x^*\Vert^2/\epsilon})$ of the accelerated gradient descent of Nesterov (1983)? Although some attempts have been made by Li and Lan (2025); Suh and Ma (2025), the ability of existing accelerated algorithms to adapt to the local curvature of the objective function is highly limited. We resolve this issue and develop GRAAL with Nesterov acceleration, which can adapt its stepsize to the local curvature at a geometric, or linear, rate just like non-accelerated GRAAL. We demonstrate the adaptive capabilities of our algorithm by proving that it achieves near-optimal iteration complexities for $L$-smooth functions, as well as under a more general $(L_0,L_1)$-smoothness assumption (Zhang et al., 2019).
SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration
ICLR, 2026In this paper, we revisit stochastic gradient descent (SGD) with AdaGrad-type preconditioning. Our contributions are twofold. First, we develop a unified convergence analysis of SGD with adaptive preconditioning under anisotropic or matrix smoothness and noise assumptions. This allows us to recover state-of-the-art convergence results for several popular adaptive gradient methods, including AdaGrad-Norm, AdaGrad, and ASGO/One-sided Shampoo. In addition, we establish the fundamental connection between two recently proposed algorithms, Scion and DASGO, and provide the first theoretical guarantees for the latter. Second, we show that the convergence of methods like AdaGrad and DASGO can be provably accelerated beyond the best-known rates using Nesterov momentum. Consequently, we obtain the first theoretical justification that AdaGrad-type algorithms can simultaneously benefit from both diagonal preconditioning and momentum, which may provide an ultimate explanation for the practical efficiency of Adam.
Sign-SGD via Parameter-Free Optimization
ICLR, 2026Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule), while avoiding tuning overhead. Employing parameter-free training yields approximately end-to-end speedup compared to runs with grid-searched stepsizes.