Most current approaches to comparing text-to-image models measure purely visual characteristics on datasets of image-caption pairs (for example, Fréchet Inception Distance, or FID), which provides only an implicit measure of language understanding. Alternatively, methods such as CLIPScore assess overly abstract skills, such as general prompt alignment.
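To illustrate what such prompt-alignment scoring looks like in practice, the sketch below computes a CLIPScore-style value: it embeds a generated image and its prompt with a public CLIP checkpoint and returns their cosine similarity. This is a simplified illustration rather than the exact published metric, and the checkpoint name is just one possible choice.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A publicly available CLIP checkpoint; any CLIP variant would work for this sketch.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of a generated image and its prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum(dim=-1))
```

A single scalar like this tells us whether the image is broadly on-topic, but not which concepts the model actually knows or how varied its outputs are, which is exactly the gap discussed below.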
While these metrics may have been sufficient for evaluating simpler models, the rapid growth in the abilities of newer models demands more fine-grained analysis of their performance. Factors such as the aesthetic quality and resolution of generated images are important, but we are ultimately interested in how well a model understands the user. For instance, we want the model to know and recognize concepts so that it can draw them correctly.
Additionally, we want the model to be as diverse as possible: when asked to generate a broad concept such as "animal", it should be able to produce many different animal species. If, for example, the model could only draw cats, its creative potential for the user would be severely limited.
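One simple way to make this notion of diversity concrete (our own illustration, not a proposed metric) is to classify a batch of images generated for a single broad prompt with an off-the-shelf ImageNet classifier and report the entropy of the predicted-class histogram; a model that only ever draws cats would score near zero. The classifier choice and the `generated_images` list are assumptions for the example.

```python
import torch
from PIL import Image
from torchvision import models

# Hypothetical input: `generated_images` is a list of PIL images produced by the
# model under test for one broad prompt such as "an animal".
weights = models.ResNet50_Weights.IMAGENET1K_V2
classifier = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

def prediction_entropy(generated_images: list[Image.Image]) -> float:
    """Shannon entropy (in bits) of the predicted-class histogram; higher means more diverse."""
    labels = []
    with torch.no_grad():
        for img in generated_images:
            logits = classifier(preprocess(img).unsqueeze(0))
            labels.append(int(logits.argmax(dim=-1)))
    counts = torch.bincount(torch.tensor(labels), minlength=1000).float()
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * probs.log2()).sum())
```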