- December 8, 2021Research
How to validate validation measures
- April 19, 2017Research
How profitable to sell a series of goods to the same buyer?
Machine learning theory
We study various aspects related to theoretical understanding of ML models and algorithms.
Posts
Publications
Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization
ICML, 2026Learning conditional distributions π*(·|x) is a central problem in machine learning, which is typically approached via supervised methods with paired data (x, y) ∼ π*. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of semi-supervised models that utilize both limited paired data and additional unpaired i.i.d. samples x ∼ π*ₓ and y ∼ π*ᵧ from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm called EBiEOT that integrates both paired and unpaired data seamlessly using data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish an end-to-end learning algorithm to get π*(·|x). In addition, we derive the universal approximation property, demonstrating that our approach can theoretically recover true conditional distributions with arbitrarily small error. Finally, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously.
Softsign: Smooth Sign in Your Optimizer for Better Parameter Heterogeneity Handling
ICML, 2026Sign-based and LMO-inspired optimizers have recently attracted substantial attention in deep learning due to their strong performance and low memory footprint. However, their fixed-magnitude updates can hurt terminal convergence: they decouple update mechanisms from gradient magnitudes and fail to account for parameter heterogeneity, often leading to oscillation rather than convergence. We propose SoftSignum, a smooth relaxation of sign-based optimization that replaces the hard sign map with a temperature-controlled soft-sign transformation, enabling a parameter-wise transition from sign-like updates to magnitude-sensitive SGD-like steps. We complement it with an adaptive quantile-based temperature schedule and extend the same principle to matrix-valued optimizers, obtaining SoftMuon. We also develop a generalized geometry-relaxation framework based on strongly convex regularizers and Fenchel conjugates, proving convergence in stochastic non-convex setting. Experiments on diverse deep learning tasks, including LLM pretraining, show that SoftSignum and SoftMuon consistently improve over their hard sign-based counterparts and standard AdamW.
Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization
ICLR, 2026In this paper, we focus on the problem of minimizing a continuously differentiable convex objective function, $\min_x f(x)$. Recently, Malitsky (2020); Alacaoglu et al. (2023) developed an adaptive first-order method, GRAAL. This algorithm computes stepsizes by estimating the local curvature of the objective function without any line search procedures or hyperparameter tuning, and attains the standard iteration complexity $\mathcal{O}(L\Vert x_0-x^* \Vert^2/\epsilon)$ of fixed-stepsize gradient descent for $L$-smooth functions. However, a natural question arises: is it possible to accelerate the convergence of GRAAL to match the optimal complexity $\mathcal{O}(\sqrt{L\Vert x_0-x^*\Vert^2/\epsilon})$ of the accelerated gradient descent of Nesterov (1983)? Although some attempts have been made by Li and Lan (2025); Suh and Ma (2025), the ability of existing accelerated algorithms to adapt to the local curvature of the objective function is highly limited. We resolve this issue and develop GRAAL with Nesterov acceleration, which can adapt its stepsize to the local curvature at a geometric, or linear, rate just like non-accelerated GRAAL. We demonstrate the adaptive capabilities of our algorithm by proving that it achieves near-optimal iteration complexities for $L$-smooth functions, as well as under a more general $(L_0,L_1)$-smoothness assumption (Zhang et al., 2019).