We present a novel large-scale dataset and accompanying machine learning
models aimed at providing a detailed understanding of the interplay between
visual content, its emotional effect, and explanations for the latter in
language. In contrast to most existing annotation datasets in computer vision,
we focus on the affective experience triggered by visual artworks and ask the
annotators to indicate the dominant emotion they feel for a given image and,
crucially, to also provide a grounded verbal explanation for their emotion
choice. As we demonstrate below, this leads to a rich set of signals for both
the objective content and the affective impact of an image, creating
associations with abstract concepts (e.g., "freedom" or "love"), or references
that go beyond what is directly visible, including visual similes and
metaphors, or subjective references to personal experiences. We focus on visual
art (e.g., paintings, artistic photographs) as it is a prime example of imagery
created to elicit emotional responses from its viewers. Our dataset, termed
ArtEmis, contains 439K emotion attributions and explanations from humans on
81K artworks from WikiArt. Building on this data, we train and demonstrate a
series of captioning systems capable of expressing and explaining emotions from
visual stimuli. Remarkably, the captions produced by these systems often
succeed in reflecting the semantic and abstract content of the image, going
well beyond systems trained on existing datasets. The collected dataset and
developed methods are available at this https URL.
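
To make the form of the collected signal concrete, each ArtEmis annotation pairs an artwork with a dominant-emotion label and a free-form verbal explanation. A minimal sketch of how such a record might be represented is given below; the class and field names are hypothetical illustrations, not the schema of the released data files.

```python
# Sketch of an ArtEmis-style annotation record.
# Field names are hypothetical; consult the released dataset for the actual schema.
from dataclasses import dataclass


@dataclass
class EmotionAnnotation:
    artwork_id: str    # identifier of the WikiArt artwork being annotated
    emotion: str       # dominant emotion reported by the annotator
    explanation: str   # grounded verbal explanation for the emotion choice


# Purely illustrative example record.
example = EmotionAnnotation(
    artwork_id="wikiart/example-artwork",
    emotion="contentment",
    explanation="The warm light over the field reminds me of quiet summer evenings.",
)
print(example.emotion, "-", example.explanation)
```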