searching for letters in the visual genome dataset https://visualgenome.org/VGViz/explore?query=letter
thinking about the deictic properties of the indefinite article so often found in image captions, especially in a machine learning context—e.g. it's almost always something like "a boat is in the water," not "this/that/the boat..." (or, you know, "my boat," "someone else's boat," etc.). this demonstrates the extent to which image captioning as an activity assumes & enforces a kind of physical, temporal and social decontextualization of images—neither captionist nor viewer is "present" in the process
image captions like this always sound like descriptions from text adventure games, or like calling the super for help with the fusebox when a breaker trips, or like when one stranger asks another at some public spectacle, "what happened here?" (very different from how you might "caption"/describe photos you took [e.g.] on vacation to a friend while showing them to the friend—in that case you'd be saying things like "and *this* is *my* favorite place in...")
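(a toy illustration of the pattern above—the caption strings here are invented in the style of captioning corpora, not drawn from Visual Genome or any real model output—a few lines of python can tally how captions open:)

```python
# Sketch: count how many captions open with an indefinite article
# ("a"/"an") vs. a deictic or possessive opener ("this", "my", ...).
# The caption list is made up for illustration.
from collections import Counter

captions = [
    "a boat is in the water",
    "a man riding a horse on a beach",
    "an airplane parked on the tarmac",
    "a group of people standing around a table",
    "this is my favorite place in vermont",  # the "vacation photo" register
]

INDEFINITE = {"a", "an"}
DEICTIC_OR_POSSESSIVE = {"this", "that", "the", "these", "those", "my", "our", "your"}

def opener_type(caption):
    first = caption.lower().split()[0]
    if first in INDEFINITE:
        return "indefinite"
    if first in DEICTIC_OR_POSSESSIVE:
        return "deictic/possessive"
    return "other"

counts = Counter(opener_type(c) for c in captions)
print(counts)
```

run against an actual captioning corpus, the "indefinite" bucket dominates overwhelmingly—which is exactly the decontextualization described above.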
@apLundell thinking of e.g. https://cs.stanford.edu/people/karpathy/densecap/ and https://github.com/tensorflow/models/tree/master/research/im2txt both of which are well-known image captioning models trained on corpora that have this characteristic (though admittedly I don't know exactly how these particular models are being used in commercial applications)