Papers accepted to ICLR and NAACL 2025

Three papers by the Yandex Research team (shown in bold below) and our collaborators have been accepted for publication at the International Conference on Learning Representations (ICLR 2025), and one paper has been accepted for publication at the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025).

ICLR 2025

TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks by Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, Artem Babenko
In this paper, we take a closer look at tabular deep learning benchmarks. We identify two characteristics of tabular data in real-world deployment scenarios that existing benchmarks overlook: temporal distribution shifts and extensive feature engineering pipelines. To fill this gap, we introduce TabReD — a collection of eight industry-grade tabular datasets from real ML pipelines and Kaggle competitions. We evaluate recent advances in tabular deep learning on the new benchmark and find that evaluating on time-based data splits and richer feature sets leads to a different ranking of methods than evaluating on random splits and smaller feature sets, which are common in academic benchmarks.
TabReD represents an important step towards more realistic tabular DL benchmarks by covering industrial tabular DL use cases. We encourage researchers to test their methods on TabReD.
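To illustrate the evaluation-protocol difference discussed above, here is a minimal sketch contrasting a random split with a time-based split. The dataset, column names, and sizes below are made up for illustration only; they are not taken from TabReD.

```python
# Minimal sketch: random vs. time-based evaluation splits for tabular data.
# The synthetic dataset and column names here are hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=n, freq="h"),
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "target": rng.integers(0, 2, size=n),
})

# Random split: train and test rows are interleaved in time (common in
# academic benchmarks, but optimistic under temporal distribution shift).
train_rand, test_rand = train_test_split(df, test_size=0.2, random_state=0)

# Time-based split: training data strictly precedes test data, mimicking
# deployment, where a model is trained on the past and scored on the future.
df_sorted = df.sort_values("timestamp")
cutoff = int(0.8 * len(df_sorted))
train_time, test_time = df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]
```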
TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling by Yury Gorishniy, Akim Kotelnikov, Artem Babenko
Which deep learning architecture should you try on tabular data? In this work, we present TabM — our new answer to this eternal question. TabM is a simple model that efficiently imitates an ensemble of MLPs. On public benchmarks, it demonstrates the best average performance while also being significantly more efficient than attention- and retrieval-based architectures. Overall, TabM advances the performance-efficiency trade-off in tabular DL and becomes a new strong baseline for practitioners and researchers.
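For intuition on what "efficiently imitating an ensemble of MLPs" can look like, below is a minimal PyTorch sketch of parameter-efficient ensembling in the BatchEnsemble style: one shared weight matrix viewed through cheap per-member scalings. It illustrates the general idea only and is not TabM's actual architecture; all layer sizes and names are made up.

```python
# Parameter-efficient ensembling sketch: k "ensemble members" share one
# linear layer and differ only by per-member element-wise scalings.
import torch
import torch.nn as nn


class EnsembleLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, k: int):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)         # weights shared by all k members
        self.r = nn.Parameter(torch.ones(k, d_in))   # per-member input scaling
        self.s = nn.Parameter(torch.ones(k, d_out))  # per-member output scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k, d_in) -> (batch, k, d_out); each member sees its own
        # scaled view of the single shared linear layer.
        return self.shared(x * self.r) * self.s


k = 4
model = nn.Sequential(EnsembleLinear(8, 32, k), nn.ReLU(), EnsembleLinear(32, 1, k))
x = torch.randn(16, 8).unsqueeze(1).expand(-1, k, -1)  # feed the same input to all members
prediction = model(x).mean(dim=1)                      # average the members' outputs
```

The cost of adding a member is only the extra scaling vectors, which is why such models stay close to a single MLP in parameter count and compute.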
Decentralized Optimization with Coupled Constraints by Demyan Yarmoshik, Alexander Rogozin, Nikita Kiselev, Daniil Dorin, Alexander Gasnikov, Dmitry Kovalev
Figure: vertical federated learning over an Erdős–Rényi graph with the LIBSVM Mushrooms dataset.
We revisit the smooth and strongly convex separable optimization problem with linear coupling constraints, where the objective functions and coupling matrices are stored across the nodes of a decentralized communication network. This problem has attracted a lot of interest due to applications in power systems, distributed control, federated learning, etc., and a number of optimization methods with various convergence guarantees have been proposed for solving it. In this work, we develop theoretical lower bounds on the communication, matrix-vector multiplication, and gradient computation complexities of solving the problem. Our findings suggest that current algorithms may be inefficient, as their theoretical guarantees fall short of these lower bounds by a substantial margin. Consequently, we close this gap by developing the first distributed gradient method whose theoretical complexities match the lower bounds. The proposed algorithm substantially outperforms the existing algorithms in practice, as corroborated by our preliminary experiments with distributed linear regression and vertical federated learning problems.
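For readers new to this setting, the separable problem with linear coupling constraints can be written as follows (the notation is our shorthand for this post, not necessarily the paper's):

```latex
% Separable objective with a linear coupling constraint across n nodes.
\[
\begin{aligned}
\min_{x_1,\dots,x_n} \quad & \sum_{i=1}^{n} f_i(x_i) \\
\text{subject to} \quad & \sum_{i=1}^{n} A_i x_i = b,
\end{aligned}
\]
```

where node i privately stores its objective f_i and coupling matrix A_i, and nodes may only exchange information with their neighbors in the communication graph.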

NAACL 2025

Pushing the Limits of Large Language Model Quantization via the Linearity Theorem by Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, Dan Alistarh
In this paper, we examine the theoretical foundations of large language model quantization. We present a fundamental “linearity theorem” that connects layer-wise ℓ2 reconstruction error to model perplexity changes during quantization. Based on these insights, we introduce HIGGS, a data-free quantization method utilizing Hadamard rotations and MSE-optimal grids, and we develop an optimal solution for non-uniform per-layer quantization using dynamic programming. We evaluate our approaches on Llama-3.1, 3.2, and Qwen-family models, demonstrating superior compression-accuracy trade-offs compared to popular methods like NF4. We further show that our method can be efficiently implemented with GPU kernels across various batch sizes, making practical advances in data-free and non-uniform LLM quantization.
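Choosing a bit-width per layer under a total budget has the generic shape of a small dynamic program. The sketch below illustrates that shape under an assumed additive error model; the error table is made up, and the linearity-theorem-based costs from the paper are not implemented here.

```python
# Minimal sketch of per-layer bit-width allocation via dynamic programming:
# minimize the summed (assumed additive) layer errors subject to a total bit
# budget. Error values are invented for illustration.

# err[layer][bits] = estimated error of quantizing `layer` at `bits` bits
err = {
    0: {2: 0.90, 3: 0.40, 4: 0.15},
    1: {2: 0.70, 3: 0.30, 4: 0.10},
    2: {2: 0.50, 3: 0.20, 4: 0.05},
}
budget = 10  # total bits allowed across the three layers

best = {0: (0.0, [])}  # bits used so far -> (total error, per-layer choices)
for layer in sorted(err):
    nxt = {}
    for used, (total, choice) in best.items():
        for bits, e in err[layer].items():
            u = used + bits
            if u > budget:
                continue
            cand = (total + e, choice + [bits])
            if u not in nxt or cand[0] < nxt[u][0]:
                nxt[u] = cand
    best = nxt

total_err, allocation = min(best.values(), key=lambda t: t[0])
print(f"per-layer bits: {allocation}, total error: {total_err:.2f}")
```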