Tabular data

Posts

June 7, 2023
Research
TabDDPM: modelling tabular data with diffusion models
Tabular data Generative models
December 2, 2022
Research
Embeddings for numerical features in tabular deep learning
Tabular data

Publications

TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling
Tabular data
Yury Gorishniy
Akim Kotelnikov
Artem Babenko
ICLR, 2025
Deep learning architectures for supervised learning on tabular data range from simple multilayer perceptrons (MLP) to sophisticated Transformers and retrieval-augmented methods. This study highlights a major, yet so far overlooked opportunity for substantially improving tabular MLPs; namely, parameter-efficient ensembling -- a paradigm for imitating an ensemble of models with just one model. We start by describing TabM -- a simple model based on MLP and BatchEnsemble (an existing technique), improved with our custom modifications. Then, we perform a large scale evaluation of tabular DL architectures on public benchmarks in terms of both task performance and efficiency, which renders the landscape of tabular DL in a new light. In particular, we find that TabM outperforms prior tabular DL models, while the complexity of attention- and retrieval-based methods does not pay off. Lastly, we conduct a detailed empirical analysis, that sheds some light on the high performance of TabM. For example, we show that parameter-efficient ensembling is not an arbitrary trick, but rather a highly effective way to reduce overfitting and improve optimization dynamics of tabular MLPs. Overall, our work brings an impactful technique to tabular DL, analyses its behaviour, and advances the performance-efficiency tradeoff with TabM -- a simple and powerful baseline for researchers and practitioners.
TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks
Tabular data
Ivan Rubachev
Nikolay Kartashev
Yury Gorishniy
Artem Babenko
ICLR, 2025
Advances in machine learning research drive progress in real-world applications. To ensure this progress, it is important to understand the potential pitfalls on the way from a novel method's success on academic benchmarks to its practical deployment. In this work, we analyze existing tabular deep learning benchmarks and find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature. First, in real-world deployment scenarios, distribution of data often changes over time. To account for this distribution drift, time-based train/test splits should be used in evaluation. However, existing academic tabular datasets often lack timestamp metadata to enable such evaluation. Second, a considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines. This can have an impact on the absolute and relative number of predictive, uninformative, and correlated features compared to academic datasets. In this work, we aim to understand how recent research advances in tabular deep learning transfer to these underrepresented conditions. To this end, we introduce TabReD — a collection of eight industry-grade tabular datasets. We reassess a large number of tabular ML models and techniques on TabReD. We demonstrate that evaluation on both time-based data splits and richer feature sets leads to different methods ranking, compared to evaluation on random splits and smaller number of features, which are common in academic benchmarks. Furthermore, simple MLP-like architectures and GBDT show the best results on the TabReD datasets, while other methods are less effective in the new setting.
TabR: Tabular Deep Learning Meets Nearest Neighbors
Tabular data
Yury Gorishniy
Ivan Rubachev
Nikolay Kartashev
Daniil Shlenskii
Akim Kotelnikov
Artem Babenko
ICLR, 2024
Deep learning (DL) models for tabular data problems (e.g. classification, regression) are currently receiving increasingly more attention from researchers. However, despite the recent efforts, the non-DL algorithms based on gradient-boosted decision trees (GBDT) remain a strong go-to solution for these problems. One of the research directions aimed at improving the position of tabular DL involves designing so-called retrieval-augmented models. For a target object, such models retrieve other objects (e.g. the nearest neighbors) from the available training data and use their features and labels to make a better prediction.

In this work, we present TabR — essentially, a feed-forward network with a custom k-Nearest-Neighbors-like component in the middle. On a set of public benchmarks with datasets up to several million objects, TabR marks a big step forward for tabular DL: it demonstrates the best average performance among tabular DL models, becomes the new state-of-the-art on several datasets, and even outperforms GBDT models on the recently proposed “GBDT-friendly” benchmark. Among the important findings and technical details powering TabR, the main ones lie in the attention-like mechanism that is responsible for retrieving the nearest neighbors and extracting valuable signal from them. In addition to the higher performance, TabR is simple and significantly more efficient compared to prior retrieval-based tabular DL models.

Datasets

Shifts Dataset
Distributional shift Uncertainty estimation Tabular data Machine translation Natural language processing
Andrey Malinin
Neil Band
Yarin Gal
Mark J. F. Gales
Alexander Ganshin
German Chesnokov
Alexey Noskov
Andrey Ploskonosov
Liudmila Prokhorenkova
Ivan Provilkov
Vatsal Raina
Vyas Raina
Denis Roginskiy
Mariya Shmatova
Panos Tigas
Boris Yangel
The Shifts Dataset contains curated and labeled examples of real, 'in-the-wild' distributional shifts across three large-scale tasks. Specifically, it contains tabular weather prediction, machine translation, and vehicle motion prediction tasks' data used in Shifts Challenge 2021. Dataset shift is ubiquitous in all of these tasks and modalities.

Tabular data

Posts

TabDDPM: modelling tabular data with diffusion models

Embeddings for numerical features in tabular deep learning

Publications

TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling

TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks

TabR: Tabular Deep Learning Meets Nearest Neighbors

Datasets

Shifts Dataset