Before LLMs Could Predict, They Had to Count

Source: DEV Community
By the end of this post, you'll understand exactly how the simplest language models work: the chain rule, the Markov assumption, n-grams, and maximum likelihood estimation. You'll see why every one of these ideas is still alive inside the LLMs you use daily, and you'll understand the specific limitations that forced the field to move beyond counting and into neural prediction. This isn't history for history's sake. This is the conceptual foundation without which transformers don't make sense — it's how n-gram language models laid the foundation for every idea that transformers run on today.

One Task, One Question

Every language model, from a 1990s bigram counter to GPT-4, does the same job: given some words, figure out what word comes next. More precisely, a language model computes one of two things:

The probability of a full sentence: P(w_1, w_2, \dots, w_n)

The probability of the next word given everything before it: P(w_n \mid w_1, w_2, \dots, w_{n-1})
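To make the counting idea concrete, here is a minimal sketch of a bigram model that estimates the next-word probability by maximum likelihood — counts of word pairs divided by counts of the preceding word. The toy corpus and function names are made up for illustration; real n-gram models add smoothing and sentence-boundary tokens.

```python
from collections import Counter, defaultdict

# Toy corpus (illustrative only).
corpus = "the cat sat on the mat the cat ran".split()

# Count each bigram: how often word `nxt` follows word `prev`.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_prob(prev, nxt):
    """MLE estimate: P(nxt | prev) = count(prev, nxt) / count(prev, *)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

# "the" is followed by "cat" twice and "mat" once, so P(cat | the) = 2/3.
print(next_word_prob("the", "cat"))
```

Note that this sketch assigns probability zero to any bigram never seen in training — exactly the sparsity problem that motivates smoothing and, eventually, neural models.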