searching for letters in the visual genome dataset https://visualgenome.org/VGViz/explore?query=letter
thinking about the deictic properties of the indefinite article so often found in image captions, especially in a machine learning context—e.g. it's almost always something like "a boat is in the water," not "this/that/the boat..." (or, you know, "my boat," "someone else's boat," etc.). demonstrates the extent to which image captioning as an activity assumes & enforces a kind of physical, temporal and social decontextualization of images—neither captionist nor viewer is "present" in the process
image captions like this always sound like descriptions from text adventure games, or like calling the super for help with the fusebox when a breaker trips, or like when one stranger asks another at some public spectacle, "what happened here?" (very different from how you might "caption"/describe photos you took [e.g.] on vacation while showing them to a friend—in that case you'd be saying things like "and *this* is *my* favorite place in...")
@aparrish I like the idea this sort of suggests, little microfiction pieces as image captions
@cori I kind of like the aesthetic effect too! but I feel like it's only a very particular kind of fiction that does visual descriptions like this—especially in microfiction, I'd figure you'd prefer the deixis-heavy "I adore this fuzzy sweater" to "A fuzzy sweater is near a person. A person adores a fuzzy sweater."
@aparrish that's true, I was thinking more about "I adore this fuzzy sweater"-type captions. maybe I just like the idea of explicitly subjective image descriptions?
@aparrish Is technology like this being used for captioning? Or just for search engine indexing?
For the reasons you mention, and more, it seems pretty poorly suited to captioning.
@apLundell thinking of e.g. https://cs.stanford.edu/people/karpathy/densecap/ and https://github.com/tensorflow/models/tree/master/research/im2txt both of which are well-known image captioning models based on corpora that have this characteristic (though admittedly I don't know exactly how these particular models are being used in commercial applications)
@aparrish Do you feel differently about mastodon image descriptions? I find if I write these as "the magpie shouts Heck!" I feel like it raises questions - what magpie? Are we supposed to recognize it? While "a magpie shouts Heck!" just describes what is in the image with as much or as little context as looking at the image provides.
@anne I should be clear that I don't think physical, temporal and social decontextualization of the image is a priori a bad thing! and it's probably necessary to some extent when describing images (especially for accessibility purposes) in a medium like this that is already likewise decontextualized. I just thought it was interesting to note, and wanted to think about other ways that images are described (and can be described) in different contexts and for different purposes