How much do text-to-image models know? A hypernymy-based approach

Text-to-image synthesis models like Stable Diffusion, DALL-E 2 or Imagen have increased in popularity over the past few years. The research community keeps developing new approaches, with RAPHAEL, DeepFloyd and DALL-E 3 being some of the most recent ones. Still, assessing the language understanding capabilities of such models remains a challenging problem. In our latest work, "Hypernymy Understanding Evaluation of Text-to-image Models via WordNet Hierarchy", we present a method for analyzing the linguistic abilities and world knowledge of text-to-image models by measuring how well they understand subtype relations between concepts.

Background

Most current approaches for comparing text-to-image models rely on measuring purely visual characteristics on datasets of image-caption pairs (for example, Fréchet Inception Distance or FID). This provides only an implicit measure of language understanding. Alternatively, some methods, for instance, CLIPScore, assess overly abstract skills such as general prompt alignment.
While these metrics may have been sufficient to evaluate simpler models, the rapid increase in the abilities of newer models demands more intricate methods of analyzing their performance. Factors such as aesthetic quality and high resolution of generated images are important, but we are ultimately interested in how well a model understands the user. For instance, we want the model to know and recognize concepts so that it’s able to correctly draw them.
Additionally, we want the model to be as diverse as possible: when asked to generate a broad term such as "animal", it should generate as many different animal species as possible. If, for example, the model could only generate cats, it would be heavily limited in its creative potential for the user.

Method

First, we need to introduce the notion of hypernymy, or the is-a relation between concepts, as it is central to our approach. In essence, hypernymy is the relation between a more general term called a hypernym (e.g., "an animal"), and a more specific term called a hyponym (e.g., "a dog").
Example of a hypernymy tree. Arrows are directed from more general concepts (hypernyms) to less general ones (hyponyms).
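To make the is-a relation concrete, here is a minimal sketch of a hypernymy tree stored as a plain dictionary. The concepts, the `HYPONYMS` mapping and the helper names `subtree` and `is_a` are illustrative assumptions, not part of WordNet or of the paper's code.

```python
# Toy hypernymy tree: each hypernym maps to its direct hyponyms.
# The concepts are illustrative and are not copied from WordNet.
HYPONYMS = {
    "animal": ["dog", "cat"],
    "dog": ["poodle", "beagle"],
    "cat": ["tabby", "siamese"],
}


def subtree(concept):
    """Return the concept together with all of its direct and indirect hyponyms."""
    nodes = {concept}
    for child in HYPONYMS.get(concept, []):
        nodes |= subtree(child)
    return nodes


def is_a(hyponym, hypernym):
    """Check whether `hyponym` lies in the subtree rooted at `hypernym`."""
    return hyponym in subtree(hypernym)


print(sorted(subtree("dog")))
print(is_a("poodle", "animal"), is_a("tabby", "dog"))
```

Both metrics described next are defined in terms of such subtrees: the subtree of a concept is the set of things that count as valid depictions of it.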
We design two metrics. The first one, called the In-subtree Probability (ISP), measures whether the generated samples represent some hyponym of the concept. This metric is similar to precision: intuitively, if we ask the model to generate images of dogs, it should generate specific dog breeds, and it should not generate cats, cars or any other unrelated objects. The second metric, called the Subtree Coverage Score (SCS), measures how well the model covers the entire set of possible hyponyms. In this case, we want the model to draw as many different dog breeds as possible, and we penalize it when it omits some of them.
To compute these metrics, we use WordNet, a well-known curated set of English words grouped into synonym sets (also called synsets or concepts), which already contains information about hypernymy. Importantly, ImageNet classes already have WordNet synsets assigned to them, so we are able to use off-the-shelf image classification models to map generated samples to these synsets. We release the implementation of our metrics in the official GitHub repository of the paper. A visual overview of the proposed method can be seen in the figure below.
Example computation of the In-Subtree Probability (left) and the Subtree Coverage Score (right). Blue color marks the "Dog" synset used as a prompt.
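The intuition behind the two metrics can be sketched on toy data. The snippet below is a simplified illustration under stated assumptions, not the paper's implementation: the classifier outputs are made-up probability dictionaries, the hierarchy is a toy one, and `covered_fraction` is only a rough stand-in for the Subtree Coverage Score (the exact SCS definition is given in the paper). All function names are hypothetical.

```python
def leaves(concept, hyponyms):
    """Collect the leaf hyponyms of a concept (the classes a classifier can predict)."""
    children = hyponyms.get(concept, [])
    if not children:
        return {concept}
    result = set()
    for child in children:
        result |= leaves(child, hyponyms)
    return result


def in_subtree_probability(class_probs, concept, hyponyms):
    """Average probability mass that the classifier assigns to hyponyms of `concept`."""
    targets = leaves(concept, hyponyms)
    per_image = [
        sum(p for label, p in probs.items() if label in targets)
        for probs in class_probs
    ]
    return sum(per_image) / len(per_image)


def covered_fraction(class_probs, concept, hyponyms):
    """Simplified coverage proxy: share of leaf hyponyms that are the top
    prediction for at least one generated image."""
    targets = leaves(concept, hyponyms)
    predicted = {max(probs, key=probs.get) for probs in class_probs}
    return len(predicted & targets) / len(targets)


# Toy hierarchy and made-up classifier outputs for three images
# generated from a "dog" prompt.
HYPONYMS = {"dog": ["poodle", "beagle", "husky"]}
samples = [
    {"poodle": 0.9, "tabby": 0.1},
    {"poodle": 0.7, "beagle": 0.2, "tabby": 0.1},
    {"tabby": 0.8, "beagle": 0.2},
]

isp = in_subtree_probability(samples, "dog", HYPONYMS)
cov = covered_fraction(samples, "dog", HYPONYMS)
print(f"ISP ~ {isp:.2f}, coverage proxy ~ {cov:.2f}")
```

In this toy run, the third image is mostly classified as a cat, which lowers the precision-like score, and only one of the three breeds is ever the top prediction, which lowers the coverage proxy.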

Analysis

Now that we've described our metrics, let's see how they perform in practice. First, we compare popular text-to-image models using ISP and SCS and provide results in the following table. We find that our metrics largely follow the same trend as older evaluation methods, yet rank some models in a different order (for example, Stable Diffusion 1.4 and Stable Diffusion 2.0). This suggests that hypernymy understanding correlates with other abilities of text-to-image generation models.
What sets our metrics apart from most others is the analysis we can perform with them. Our approach allows us to easily identify concepts where the model is weak by selecting synsets with low In-subtree Probability. For example, the figure below shows that models struggle with many concepts, sometimes not being able to draw them sufficiently well and sometimes missing the meaning completely.
Similarly, we are able to detect concepts that have low diversity for the given model. Unlike other diversity measures, our results are highly interpretable because we can directly identify synsets that the model omits. We discover that SD 1.4 "knows" only one kind of wildcat, fox and Belgian shepherd. For instance, if we take the hyponym probability distribution for the "Belgian shepherd" synset, we find that the model almost exclusively draws the Groenendael breed and ignores all others. This is one of the benefits of our metric being tied to WordNet, which allows it to broadly measure model knowledge.
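This kind of diversity check can be sketched as follows. The predicted labels and their counts are made up for illustration, the list of breeds is an assumption rather than the exact WordNet subtree, and the helper names and the 5% threshold are hypothetical.

```python
from collections import Counter


def hyponym_distribution(predicted_labels, hyponyms):
    """Empirical distribution of classifier predictions over the given hyponyms."""
    counts = Counter(label for label in predicted_labels if label in hyponyms)
    total = sum(counts.values())
    return {h: (counts[h] / total if total else 0.0) for h in hyponyms}


def neglected(distribution, threshold=0.05):
    """Hyponyms that the model draws less often than `threshold` of the time."""
    return sorted(h for h, p in distribution.items() if p < threshold)


# Made-up top-1 predictions for 100 images generated from a "Belgian shepherd" prompt.
predictions = ["groenendael"] * 98 + ["malinois"] * 2
breeds = ["groenendael", "malinois"]  # illustrative hyponym list

dist = hyponym_distribution(predictions, breeds)
print(dist)
print("Rarely drawn:", neglected(dist))
```

Because the output is a list of named synsets rather than a single number, it points directly at the concepts the model fails to draw.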
Another interesting use case of our approach is comparing models that achieve equal scores on other benchmarks. When this occurs, it becomes hard to reason about the differences between these models. Our metrics naturally allow us to perform granular comparisons by analyzing the scores on different concepts separately. For example, when examining Stable Diffusion 1.4 and Stable Diffusion 2.0, we discover that while they are very similar (as indicated by identical scores on most concepts), they do have major differences in some fields, as shown by the following figure.
Per-synset metric difference between Stable Diffusion 1.4 and Stable Diffusion 2.0. Higher values depict concepts where the first model outperforms the second one.
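A per-synset comparison like this reduces to subtracting two score tables. The sketch below assumes each model's results are stored as a dictionary from synset name to score; the scores are invented for illustration and the function name is hypothetical.

```python
def per_synset_difference(scores_a, scores_b):
    """Score difference (model A minus model B) for every synset both models
    were evaluated on, sorted from A's largest advantage to B's."""
    shared = scores_a.keys() & scores_b.keys()
    diffs = {synset: scores_a[synset] - scores_b[synset] for synset in shared}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Invented per-synset scores for two hypothetical models.
model_a = {"bird": 0.8, "clothing": 0.3, "fruit": 0.6}
model_b = {"bird": 0.5, "clothing": 0.7, "fruit": 0.6}

ranking = per_synset_difference(model_a, model_b)
for synset, diff in ranking:
    print(f"{synset:10s} {diff:+.2f}")
```

Synsets at the top of the ranking are those where the first model is stronger, synsets at the bottom are those where the second model wins, and values near zero mark concepts on which the two models behave alike.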
Alternatively, we could evaluate the models on sets of related concepts. This also uncovers interesting differences between models. The table below shows that while GLIDE is one of the weakest models, it outperforms stronger ones when it is tasked to draw different varieties of birds, lizards and fruits. Another finding is that GLIDE struggles severely when asked to draw various clothing, achieving a near-zero score on this set. This information could be helpful if the end user wanted to choose an appropriate model to draw some specific category of objects (such as birds).

Conclusion

In our new work, we present two new metrics for evaluating text-to-image generative models. They allow researchers to perform practically important analyses, such as discovering concepts or entire sets of concepts where models perform poorly. The implementation of these metrics is publicly available in the official GitHub repository of the paper.
Another feature of our approach is the possibility of a granular comparison between models. Such comparisons could provide new insights into how various modifications affect a model's performance. For example, does the model "forget" other fields if we finetune it on a specific domain? Or does adding another super-resolution step affect the "knowledge" of the full pipeline? These are examples of the questions we can answer by going beyond single-value leaderboard-style metrics.