Introducing new heterophilous graph datasets
In many real-world graphs, edges tend to connect similar nodes. This property of graphs is called homophily. For example, in social networks, edges (friendships, following) usually connect people with similar interests, and in citation networks, papers typically cite works from the same area. The opposite of homophily is called heterophily. This property describes the preference of nodes to connect to nodes that are not similar to them. For example, in financial transaction networks, fraudsters often perform transactions with non-fraudulent users, and in dating networks, most connections are between people of opposite genders.
Graph neural networks (GNNs) have achieved strong results on node classification tasks. However, it is often believed that standard GNNs only work well for homophilous graphs, and thus, specialized methods need to be developed for heterophilous graphs. Many such methods have been recently proposed. These methods are typically evaluated on the same set of six graphs, which were first adopted for studying learning under heterophily by Pei et al. [1] and have become the de-facto standard benchmark for heterophily-specific models. However, in our recent work [2], we have found significant problems with these datasets, which make results obtained using them unreliable. These problems are low diversity, small graph size, strong class imbalance, and the presence of a large number of duplicated nodes.
Motivated by these issues, we have proposed an alternative benchmark of five diverse heterophilous graphs that come from different domains and exhibit a variety of structural properties. Our benchmark includes:
- a word dependency graph roman-empire;
- a product co-purchasing network amazon-ratings;
- a synthetic graph emulating the minesweeper game minesweeper;
- a crowdsourcing platform worker network tolokers;
- a question-answering website interaction network questions.
Our datasets are available on GitHub and in PyG Datasets. The statistics of these datasets are listed in the table below:
roman-empire | amazon-ratings | minesweeper | tolokers | questions | |
---|---|---|---|---|---|
nodes | 22662 | 24492 | 10000 | 11758 | 48921 |
edges | 32927 | 93050 | 39402 | 519000 | 153540 |
avg degree | 2.91 | 7.60 | 7.88 | 88.28 | 6.28 |
global clustering | 0.29 | 0.32 | 0.43 | 0.23 | 0.02 |
avg local clustering | 0.39 | 0.58 | 0.44 | 0.53 | 0.03 |
diameter | 6824 | 46 | 99 | 11 | 16 |
node features | 300 | 300 | 7 | 10 | 301 |
classes | 18 | 5 | 2 | 2 | 2 |
edge homophily | 0.05 | 0.38 | 0.68 | 0.59 | 0.84 |
adjusted homophily | -0.05 | 0.14 | 0.01 | 0.09 | 0.02 |
label informativeness | 0.11 | 0.04 | 0.00 | 0.01 | 0.00 |
We have evaluated a large number of models, both standard GNNs and heterophily-specific methods, and, perhaps surprisingly, found that standard GNNs augmented with skip connections and layer normalization almost always outperform specialized models. We hope the proposed benchmark and the insights obtained using it will facilitate further research in learning under heterophily.