- June 21, 2023Research
Structural distributional shifts for graph datasets
- January 31, 2022Announcement
Outcomes of the Shifts Challenge
- July 22, 2021Announcement
Shifts Challenge at NeurIPS 2021
Uncertainty estimation
Uncertainty estimation enables detecting when ML models make mistakes. This is of critical importance in high risk machine learning applications, such as autonomous vehicle and medical ML.
Posts
Publications
Unveiling the Role of Data Uncertainty in Tabular Deep Learning
ICML, 2026Recent advancements in tabular deep learning have demonstrated exceptional practical performance, yet the field often lacks a clear understanding of why these techniques actually succeed. To address this gap, our paper highlights the importance of the concept of data (aleatoric) uncertainty for explaining the effectiveness of recent tabular DL methods. While data uncertainty leads to irreducible prediction errors on test samples, it also introduces stochasticity into the training signal that can impede effective learning. We demonstrate that tabular methods differ significantly in their ability to cope with this optimization challenge. Specifically, we reveal that the success of many beneficial design choices in tabular DL, such as numerical feature embeddings, advanced ensembling strategies, retrieval-augmented models, and tabular Prior-Fitted Networks, can be partially attributed to their respective implicit mechanisms for performing well under high data uncertainty. By dissecting these varied mechanisms, we provide a unifying understanding of recent performance improvements. Furthermore, leveraging insights from this perspective, we design a novel, more effective numerical feature embedding method as an immediate practical outcome of our analysis. Overall, our work paves the way toward a principled understanding of the benefits introduced by modern tabular methods that results in the concrete advancements of existing techniques and outlines future research directions for tabular DL.
Evaluating Robustness and Uncertainty of Graph Models Under Structural Distributional Shifts
NeurIPS, 2023In reliable decision-making systems based on machine learning, models have to be robust to distributional shifts or provide the uncertainty of their predictions. In node-level problems of graph learning, distributional shifts can be especially complex since the samples are interdependent. To evaluate the performance of graph models, it is important to test them on diverse and meaningful distributional shifts. However, most graph benchmarks considering distributional shifts for node-level problems focus mainly on node features, while structural properties are also essential for graph problems. In this work, we propose a general approach for inducing diverse distributional shifts based on graph structure. We use this approach to create data splits according to several structural node properties: popularity, locality, and density. In our experiments, we thoroughly evaluate the proposed distributional shifts and show that they can be quite challenging for existing graph models. We also reveal that simple models often outperform more sophisticated methods on the considered structural shifts. Finally, our experiments provide evidence that there is a trade-off between the quality of learned representations for the base classification task under structural distributional shift and the ability to separate the nodes from different distributions using these representations.
Gradient Boosting Performs Gaussian Process Inference
ICLR, 2023This paper shows that gradient boosting based on symmetric decision trees can be equivalently reformulated as a kernel method that converges to the solution of a certain Kernel Ridge Regression problem. Thus, we obtain the convergence to a Gaussian Process' posterior mean, which, in turn, allows us to easily transform gradient boosting into a sampler from the posterior to provide better knowledge uncertainty estimates through Monte-Carlo estimation of the posterior variance. We show that the proposed sampler allows for better knowledge uncertainty estimates leading to improved out-of-domain detection.
Datasets
Shifts Dataset
The Shifts Dataset contains curated and labeled examples of real, 'in-the-wild' distributional shifts across three large-scale tasks. Specifically, it contains tabular weather prediction, machine translation, and vehicle motion prediction tasks' data used in Shifts Challenge 2021. Dataset shift is ubiquitous in all of these tasks and modalities.