A hands-on guide to the core ideas behind language models, from the ground up.
A language model is a program that predicts the next token given what came before. That's it. When you type "The cat sat on the", a language model figures out that "mat" is more likely than "xylophone". By chaining these predictions together, it can generate whole paragraphs of coherent text.
Given a sequence of text, predict what comes next. Each prediction is a probability distribution over all possible next tokens.
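The prediction task can be pictured with a toy example. The probabilities below are made up for illustration; a real model computes this distribution with a neural network, and it covers every token in the vocabulary.

```python
context = ["The", "cat", "sat", "on", "the"]

# Hypothetical probability distribution over possible next tokens.
# (Invented numbers; the remaining mass is spread over the rest of the vocabulary.)
next_token_probs = {
    "mat": 0.62,
    "floor": 0.21,
    "roof": 0.09,
    "xylophone": 0.0001,
}

# The single most likely continuation:
best = max(next_token_probs, key=next_token_probs.get)
print(best)  # mat
```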
Language has patterns. After learning millions of patterns from text, the model captures grammar, facts, and even reasoning — all from next-token prediction.
Frontier models can have trillions of parameters; our tiny model has ~20,000. The scale is vastly different, but the fundamentals are identical.
Before a model can process text, it needs to break it into tokens — meaningful pieces that each get a numeric ID. Our model uses Byte Pair Encoding (BPE), the same algorithm used by GPT and most modern LLMs. BPE starts with individual characters and iteratively merges the most frequent adjacent pairs into new tokens like "the", "ing", or "at".
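The core BPE training loop can be sketched in a few lines: count adjacent pairs, merge the most frequent pair into a new token, repeat. This is a minimal sketch of the algorithm, not the Lab's actual implementation (which also handles word boundaries and builds a vocabulary of IDs).

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters and apply a few merges.
tokens = list("the cat sat on the mat")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

On this input, "at" is the most frequent pair and gets merged first; after three merges, frequent chunks like "the" emerge as single tokens.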
After tokenization, each token is just a number — an ID like 0, 1, or 42. But a number alone doesn't tell the model anything about what a token means. The model needs a richer representation: a list of numbers (a vector) that captures a token's properties. That's what an embedding is.
Think of it like coordinates. A city can be described by two numbers (latitude, longitude) that tell you where it is relative to other cities. An embedding does the same thing for a token, but in many more dimensions. Each dimension might loosely capture something like "how noun-like is this?" or "does this appear at the start of sentences?" — though the model discovers these dimensions on its own during training.
The model stores a big table (matrix) where each row is one token's embedding. To get the embedding for token #42, it just looks up row 42. The number of dimensions is a configurable parameter — our tiny model defaults to 8, while GPT-style models use thousands. You can adjust this in the Lab.
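The lookup itself is just row indexing. A sketch with NumPy (randomly initialized values standing in for the learned table; the sizes are illustrative, with 8 dimensions matching the tiny model's default):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 50, 8
# One row per token; training gradually adjusts these values.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

token_id = 42
embedding = embedding_table[token_id]  # "look up row 42"
print(embedding.shape)  # (8,)
```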
As the model trains, tokens that behave similarly — like "the" and "a", or "run" and "walk" — end up with similar embeddings (nearby vectors). You can see this in the embedding plot in the Lab.
The model looks at several tokens at once (a "context window" of 10 tokens in our case). It concatenates their embeddings into one long vector, giving the neural network a complete picture of the recent context.
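Concatenation turns 10 small vectors into one long input vector. A minimal sketch, with hypothetical token IDs and a random table standing in for learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
context_size, embedding_dim = 10, 8
embedding_table = rng.normal(size=(50, embedding_dim))

# IDs of the last 10 tokens (hypothetical values).
context_ids = [3, 17, 42, 5, 9, 21, 8, 30, 11, 2]

# Look up each token's embedding and concatenate into one long vector.
x = np.concatenate([embedding_table[i] for i in context_ids])
print(x.shape)  # (80,): 10 tokens x 8 dimensions each
```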
The concatenated embeddings flow through a neural network that learns to map patterns in the input to predictions about the next token. Our model uses one hidden layer with a tanh activation function.
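The whole forward pass fits in a few lines: a tanh hidden layer, then a score (logit) per vocabulary token, then a softmax to turn scores into probabilities. This is a NumPy sketch of the architecture described above; the layer sizes are illustrative assumptions.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One hidden layer with tanh, then a softmax over the vocabulary."""
    h = np.tanh(x @ W1 + b1)           # hidden activations
    logits = h @ W2 + b2               # one score per vocabulary token
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()             # probability distribution

rng = np.random.default_rng(0)
input_dim, hidden_dim, vocab_size = 80, 32, 50  # illustrative sizes
W1 = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))
b2 = np.zeros(vocab_size)

probs = forward(rng.normal(size=input_dim), W1, b1, W2, b2)
print(probs.sum())  # close to 1.0
```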
Training is how the model learns. We show it examples from the text: "given these 10 tokens, the next one should be X." The model makes a prediction, we measure how wrong it was (the loss), and then nudge all the parameters slightly to make a better prediction next time. This is called gradient descent.
Instead of updating after every single example, we average gradients over a "mini-batch" of examples (e.g., 64 at a time). This makes training faster and more stable.
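The mechanics of mini-batch gradient descent can be shown on something much simpler than a language model. This toy fits a single weight to the relationship y = 3x: compute a prediction, measure the error, average the gradient over a batch of 64 examples, and nudge the weight against it. The model and loss here are stand-ins, not the Lab's actual setup.

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    """Nudge each parameter a small step against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

rng = np.random.default_rng(0)
w = np.array(0.0)                 # the single parameter we are learning
x_batch = rng.normal(size=64)     # a mini-batch of 64 examples
y_batch = 3.0 * x_batch           # true relationship: y = 3x

for _ in range(100):
    pred = w * x_batch
    # Gradient of mean squared error, averaged over the mini-batch.
    grad = np.mean(2 * (pred - y_batch) * x_batch)
    (w,) = sgd_step([w], [grad])

print(round(float(w), 2))  # close to 3.0
```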
With enough training, the model memorizes the training text exactly (this is called overfitting). For our tiny model, that's expected! Real LLMs train on trillions of tokens to avoid this.
To generate text, we start with a seed and repeatedly predict the next token. Each prediction is sampled from the probability distribution — we don't always pick the most likely token. The temperature parameter controls randomness.
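Temperature works by scaling the model's scores before the softmax. Dividing by a small temperature sharpens the distribution (sampling becomes nearly greedy); a large temperature flattens it (sampling becomes more random). A sketch, with made-up logits:

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Scale logits by temperature, softmax, then sample one token ID."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - scaled.max())
    probs = exp / exp.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
rng = np.random.default_rng(0)
print(sample_next(logits, temperature=0.5, rng=rng))  # most often token 0
```

At temperature near zero this behaves like always picking the top token; at high temperature all tokens become nearly equally likely.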
Train your own tiny language model right in the browser. Add text, hit train, and watch it learn.
Paste text, load a sample, or upload a .txt file. More text = better patterns to learn.
Run BPE tokenization on your text to build a vocabulary before training.
Enter a seed prompt and watch the model generate text token by token. Opacity reflects confidence.