Mukit

What is attention?

Every time ChatGPT or Claude responds to you, the model reads your entire message at once and decides which parts matter most for each word it produces. That mechanism is called attention — and the heatmap on the left is a direct window into it.

Each row in the heatmap is a query token — a word asking "what should I pay attention to?" Each column is a key token — a word answering "I'm available to be attended to." A bright cell at row fox, column quick means the model, while processing the word fox, is heavily drawing context from the word quick. A near-black cell means almost no influence.
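
To make the rows and columns concrete, here is a minimal sketch of how one such grid of weights could be computed, assuming the standard scaled dot-product formulation. The two-dimensional query and key vectors are made-up illustrative values, not numbers taken from the demo model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return one row of weights per query token, one column per key token."""
    dim = len(keys[0])
    rows = []
    for q in queries:
        # Scaled dot product: how well this query matches each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        rows.append(softmax(scores))
    return rows

# Hypothetical 2-dimensional vectors for three tokens (illustrative values only).
tokens = ["quick", "brown", "fox"]
queries = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5]]
keys = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

heatmap = attention_weights(queries, keys)
for token, row in zip(tokens, heatmap):
    print(token, [round(w, 2) for w in row])
```

Each printed row sums to 1, and with these particular vectors the brightest cell in the fox row falls in the quick column.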

Notice the diagonal: tokens tend to attend to themselves, which makes sense — a word always carries its own meaning. And notice that later tokens can attend to earlier ones, but not the reverse. This is the causal (left-to-right) constraint that makes text generation possible — the model can't "cheat" by looking ahead.
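
The causal constraint can be sketched as a mask applied before the softmax. The example below is an assumption-laden toy: the raw scores are invented, and it drops future positions directly rather than setting them to negative infinity, which gives the same weights.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Token i may only see positions 0..i; later positions are masked out.
        visible = row[: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw scores for a 3-token sequence (illustrative values only).
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 0.5, 3.0],
    [2.0, 0.1, 0.3],
]
masked = causal_attention_weights(scores)
for row in masked:
    print([round(w, 2) for w in row])
```

The result is lower-triangular: the first token can only attend to itself, and every cell above the diagonal is zero, which is the near-black upper-right region of the heatmap.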

A transformer runs this process in parallel across multiple attention heads (try clicking the dot buttons). Each head learns to track something different: one might follow grammatical subject-verb relationships, another might track co-references like it → the cat. The model has six of these layers stacked on top of each other, each refining the representation further. When you switch layers, you're moving deeper into the stack, where each layer builds on the output of the one before it.

How does the model generate text?

ChatGPT, Claude, and every other modern language model share one core mechanic: they predict the next token, one at a time. A token is roughly a word or word fragment — "transformer" might be a single token, while "transformers" could be split into "transform" + "ers" depending on the model's vocabulary.
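
As a rough illustration of why the split depends on the vocabulary, here is a toy greedy longest-match tokenizer. This is a simplification: real models typically use learned subword schemes such as byte-pair encoding, and both vocabularies below are hypothetical.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of a word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Two hypothetical vocabularies; only the first contains "transformer" whole.
vocab_a = {"transformer", "transform", "ers", "s"}
vocab_b = {"transform", "ers", "er", "s"}

print(tokenize("transformer", vocab_a))
print(tokenize("transformers", vocab_a))
print(tokenize("transformers", vocab_b))
```

The same word comes out as different pieces under the two vocabularies, which is why token counts vary from model to model.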

After processing your prompt, the model produces a probability distribution over its entire vocabulary, often 50,000+ tokens. The bar chart on the right shows the top candidates. "Jumps" at 32% doesn't mean the model is certain; it means that, given everything it has seen in training, jumps is the single most likely continuation of the quick brown fox, with the remaining 68% spread across other tokens. Runs, leaps, and others are all real possibilities.
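
The step from raw model scores (logits) to a ranked bar chart can be sketched as a softmax followed by a sort. The five logits below are invented for illustration and are not meant to reproduce the percentages in the demo.

```python
import math

def top_candidates(logits, k=3):
    """Convert a token -> logit mapping into the k most probable tokens."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Highest probability first, truncated to the top k entries.
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical logits for a five-token vocabulary (a real one has tens of thousands).
logits = {"jumps": 2.0, "runs": 1.5, "leaps": 1.2, "sleeps": 0.3, "the": -1.0}

for token, p in top_candidates(logits):
    print(f"{token:>6} {p:.0%}")
```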

When you click a token to accept it, it gets appended to the sequence and the model runs again on the extended input. This is called autoregressive generation. That loop is what happens when Claude is "typing" a response to you: it chooses one token, appends it, then runs the attention mechanism over the longer sequence to choose the next. (Production systems usually cache earlier computation rather than literally starting from scratch, but the logic is the same.)
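
The loop itself is small. In the sketch below, `toy_next_token_probs` is a hypothetical stand-in for the model: it is just a lookup table keyed on the last token, whereas a real model conditions on the whole sequence. The loop always accepts the most likely token, which is one of several possible selection strategies.

```python
def toy_next_token_probs(sequence):
    """Hypothetical stand-in for a model: looks only at the last token."""
    table = {
        "fox": {"jumps": 0.6, "runs": 0.4},
        "jumps": {"over": 0.9, "high": 0.1},
        "over": {"the": 1.0},
        "the": {"lazy": 0.7, "dog": 0.3},
        "lazy": {"dog": 1.0},
    }
    return table.get(sequence[-1], {"<end>": 1.0})

def generate(prompt, max_new_tokens=10):
    """Autoregressive loop: predict, accept, append, repeat."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(sequence)  # run the "model" on everything so far
        token = max(probs, key=probs.get)       # accept the most likely token
        if token == "<end>":
            break
        sequence.append(token)                  # the extended input feeds the next step
    return sequence

print(" ".join(generate(["the", "quick", "brown", "fox"])))
```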

What does temperature control?

The temperature slider reshapes the probability distribution without changing the model's underlying beliefs. At low temperature (near 0.1), the distribution collapses — the most likely token becomes overwhelmingly dominant and the model feels deterministic and repetitive. At high temperature (near 2.0), the distribution flattens — lower-ranked tokens get a real shot, and the output becomes more creative but also less coherent.

The temperature control in most AI interfaces (including the Claude API) maps directly to this slider. When you want a factual, consistent answer, you use a low temperature. When you want creative variation or brainstorming, you turn it up. The slider here lets you see that redistribution happen instantly on the same logits — no re-inference needed, just a different softmax scaling.
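
The "different softmax scaling" amounts to dividing every logit by the temperature before the softmax. A minimal sketch, using invented logits and the three temperatures mentioned above:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a standard softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits; the same list is reused for every temperature.
logits = [2.0, 1.5, 1.2, 0.3]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The ranking of tokens never changes, only how sharply the probability mass concentrates on the top one, which is why the slider can update the chart without running the model again.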

Inside the LLM
