searching for letters in the visual genome dataset https://visualgenome.org/VGViz/explore?query=letter
image captions like this always sound like descriptions from text adventure games, or like calling the super for help with the fusebox when a fuse gets tripped, or like when one stranger asks another at some public spectacle, "what happened here?" (very different from how you might "caption"/describe photos you took [e.g.] on vacation while showing them to a friend; in that case you'd be saying things like "and *this* is *my* favorite place in...")
@aparrish Is technology like this being used for captioning? Or just for search engine indexing?
For the reasons you mention, and more, it seems pretty poorly suited to captioning.
@apLundell thinking of e.g. https://cs.stanford.edu/people/karpathy/densecap/ and https://github.com/tensorflow/models/tree/master/research/im2txt both of which are well-known image captioning models trained on corpora with this characteristic (though admittedly I don't know exactly how these particular models are being used in commercial applications)