poking at the edges of markov chain text generation... here I'm using truncated SVD to find similar ngrams, based on the tokens that follow them. (the goal is to add variety to the generation process by plucking possible next tokens from those following similar ngrams)
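here's a minimal sketch of the idea as I understand it: build an ngram-by-next-token count matrix, reduce it with truncated SVD, and compare ngrams by cosine similarity in the reduced space. (the toy corpus, the rank of 4, and all helper names are mine, not from the original code.)

```python
from collections import defaultdict

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def ngram_next_counts(tokens, n=2):
    # map each n-token context to counts of the tokens that follow it
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n):
        counts[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return counts

tokens = ("the cat sat on the mat and the dog sat on the rug "
          "a cat lay on the mat").split()
counts = ngram_next_counts(tokens, n=2)
ngrams = sorted(counts)
vocab = sorted({w for d in counts.values() for w in d})

# rows: ngrams, columns: tokens that follow them
M = np.zeros((len(ngrams), len(vocab)))
for i, ng in enumerate(ngrams):
    for w, c in counts[ng].items():
        M[i, vocab.index(w)] = c

# project into a low-rank space; ngrams followed by similar tokens land close
svd = TruncatedSVD(n_components=4, random_state=0)
reduced = svd.fit_transform(M)
sims = cosine_similarity(reduced)

query = ngrams.index(('sat', 'on'))
best = max((j for j in range(len(ngrams)) if j != query),
           key=lambda j: sims[query, j])
print(ngrams[query], '~', ngrams[best])
```

on this tiny corpus, ('sat', 'on') ends up near ('lay', 'on'), since both are followed by 'the'.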
another way to find similar ngram contexts: each context has an embedding derived from the sum of positional encoding (they're not just for transformers!) multiplied by "word vectors" (actually just truncated SVD of the transpose of the context matrix). then load 'em up in a nearest neighbor index
(this is cool because I can use it even on ngrams that *don't* occur in the source text, though all of the words themselves need to be in the vocabulary)
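a sketch of what I take this embedding to look like: each context is the sum, over positions, of a sinusoidal positional encoding multiplied elementwise by that token's word vector. I'm using random stand-in word vectors here (in the original they come from truncated SVD of the transposed context matrix) and sklearn's NearestNeighbors as the index; dimensions and names are my own guesses.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def positional_encoding(n_positions, dim):
    # standard sinusoidal positional encoding (the transformer one)
    pos = np.arange(n_positions)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / dim)
    enc = np.zeros((n_positions, dim))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

rng = np.random.default_rng(0)
dim = 16
vocab = ["the", "cat", "sat", "on", "mat", "dog", "lay"]
# stand-in word vectors; the original derives these from truncated SVD
# of the transpose of the ngram-context matrix
word_vectors = {w: rng.normal(size=dim) for w in vocab}

n = 3
pe = positional_encoding(n, dim)

def context_embedding(ngram):
    # sum of (positional encoding * word vector) for each token
    return sum(pe[pos] * word_vectors[tok] for pos, tok in enumerate(ngram))

contexts = [("the", "cat", "sat"), ("the", "dog", "sat"),
            ("cat", "sat", "on"), ("the", "mat", "lay")]
X = np.stack([context_embedding(c) for c in contexts])
nn = NearestNeighbors(metric="cosine").fit(X)

# query an ngram that never occurs in the source text -- only its
# individual words need to be in the vocabulary
dist, idx = nn.kneighbors([context_embedding(("the", "dog", "lay"))],
                          n_neighbors=1)
print(contexts[idx[0][0]])
```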
generating with a markov chain using softmax sampling w/temperature (a la neural networks). this is an order 3 character model, and you can really see the difference between low temperature (instantly starts repeating itself) and high temperature (draws from wacky corners of the distribution) (if you've generated text with a markov chain before, it's probably using what amounts to a temperature of 1.0)
I like having this extra setting to fiddle with! but based on my limited testing, the temperature doesn't really matter once the length of the ngram hits a certain point, since most longer ngrams only have one or two possible continuations. like... with word 3-grams, it's pretty difficult to tell a temperature of 0.35 apart from 2.5
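here's roughly what the temperature trick looks like for a character model (my own reconstruction, not the original code): take a softmax over the log-counts of each context's continuations, divided by the temperature. at 1.0 this reduces to ordinary count-proportional markov sampling; and you can see why temperature stops mattering when a context has only one continuation, since the softmax of a single logit is always 1.

```python
import random
from collections import defaultdict

import numpy as np

def train(text, order=3):
    # map each order-length context to counts of the characters following it
    model = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def sample_next(model, context, temperature=1.0):
    chars, counts = zip(*model[context].items())
    # softmax over log-counts: temperature 1.0 reproduces plain markov
    # sampling; lower sharpens the distribution, higher flattens it
    logits = np.log(np.asarray(counts, dtype=float)) / temperature
    probs = np.exp(logits - logits.max())
    return random.choices(chars, weights=probs / probs.sum())[0]

def generate(model, seed, length=100, temperature=1.0, order=3):
    # assumes every context reached during generation was seen in training
    out = seed
    for _ in range(length):
        out += sample_next(model, out[-order:], temperature)
    return out
```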
@zatnosk just the example that I happened to pick! the ngrams on the left are similar in that they are followed by tokens that are similar, and "the" is one of the tokens that distinguish this group. to give another example, this is another cluster of ngrams that are followed with 'of' (and a few other tokens)
@aparrish i love seeing some great comp ling stuff on here! what are you getting after here? (i'm familiar with most of the techniques but am curious about the bigger picture)