Build a Tiny LLM in Go, Part 4: Letting Letters Look at Each Other

By Part 3 we had a model that learns: it measures its own wrongness as a single number and rolls downhill until English stops surprising it. But it learns with a handicap. It looks at a fixed little window of characters and treats them as an undifferentiated smear. It cannot tell that in “the cat sat on the m”, the letters that make “at” likely next are the ones spelling “cat” a few positions back, not the “on the” in between.

What the model is missing is a way for the right earlier characters to reach forward and influence the current prediction, while the irrelevant ones stay quiet. That mechanism is called attention. It is the central idea of the 2017 paper “Attention Is All You Need”, which introduced the transformer, the “T” in GPT, and once you have seen it built out of small pieces, the mystique falls away. It is a weighted average with a clever way of choosing the weights.

The problem attention solves

Think about predicting the next character after “she opened the door and looked ”. A good guess is “a” (starting “at” or “around”). To make that guess well, the model needs to know there is a person, “she”, who is doing the looking, and that “opened the door” already happened. That information sits many characters back. The prediction at the current position needs to pull in the relevant earlier positions and ignore the rest.

The naive fix, remembering everything in a bigger lookup, is exactly the wall Part 2 smashed into. Attention does something smarter. Instead of memorizing which context predicts what, it lets every position in the text look over all the earlier positions and decide, on the fly, which of them are worth listening to right now. The decision is learned, not fixed, so the model can learn that verbs care about their subjects and open quotes care about close quotes, without anyone programming those rules.

Queries, keys, and values

The standard way to explain the mechanism is a small analogy that happens to be exactly what the math does.

Every position produces three things from its own vector, each just a different learned mixing of its numbers:

A query: what this position is looking for. The position after “looked” might be asking, roughly, “who is doing an action, and what was it?”
A key: what this position offers to others. The position holding “she” advertises, roughly, “I am a subject, a person doing things.”
A value: what this position will actually contribute if someone listens to it.

Now the matching. For the current position, you compare its query against the key of every position it is allowed to see: itself and everything before it. Where a query and a key line up well, that position is relevant, and it gets a high score. Where they do not, a low score. Run those scores through the softmax from Part 3 and they become weights that sum to one: a set of proportions saying “pay 70% attention to this position, 20% to that one, and almost none to the rest”. The position’s new vector is then the weighted average of those positions’ values, using exactly those proportions. Relevant positions contribute a lot; irrelevant ones barely register.
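To make the matching concrete, here is a minimal sketch of a single position attending over three positions, written with plain slices rather than the repo's tensor type. The function names (dot, softmax, attend) and all the numbers are made up for illustration, and the sketch leaves out the scaling and the causal mask so that only the score, softmax, and weighted-average steps remain.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// softmax turns raw scores into positive weights that sum to 1.
func softmax(scores []float64) []float64 {
	maxScore := math.Inf(-1)
	for _, s := range scores {
		maxScore = math.Max(maxScore, s)
	}
	out := make([]float64, len(scores))
	sum := 0.0
	for i, s := range scores {
		out[i] = math.Exp(s - maxScore) // subtract the max for numerical stability
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// attend computes one position's new vector: score its query against each
// key, softmax the scores, then take the weighted average of the values.
func attend(query []float64, keys, values [][]float64) (weights, out []float64) {
	scores := make([]float64, len(keys))
	for i, k := range keys {
		scores[i] = dot(query, k)
	}
	weights = softmax(scores)
	out = make([]float64, len(values[0]))
	for i, v := range values {
		for j := range v {
			out[j] += weights[i] * v[j]
		}
	}
	return weights, out
}

func main() {
	// Hypothetical hand-picked numbers, not learned ones.
	query := []float64{1, 0}
	keys := [][]float64{{2, 0}, {0, 2}, {1, 1}}
	values := [][]float64{{10, 0}, {0, 10}, {5, 5}}
	weights, out := attend(query, keys, values)
	fmt.Printf("weights: %.2f\n", weights)
	fmt.Printf("output:  %.2f\n", out)
}
```

The first key lines up best with the query, so it takes roughly two thirds of the weight, and the output vector ends up leaning toward the first value.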

That is one attention head, in full. Three learned projections to make queries, keys, and values; a score from matching queries against keys; a softmax to turn scores into weights; a weighted average of values. In the repo it is a handful of lines, and the score-and-weight step reads almost like the description above:

// Weights returns the causal softmax attention weights for input x: for each
// position, how much it attends to every position it is allowed to see.
func (h *Head) Weights(x *Tensor) *Tensor {
	q := MatMul(x, h.Wq) // what each position is looking for
	k := MatMul(x, h.Wk) // what each position offers
	// score every position against every other, scaled so softmax stays sane
	scores := MulScalar(MatMul(q, Transpose(k)), 1/math.Sqrt(float64(h.headSize)))
	scores = MaskedFillCausal(scores) // no peeking at the future
	return Softmax(scores)            // scores become weights that sum to 1
}
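Why divide by the square root of the head size? If the components of a query and a key are roughly independent with unit variance, their dot product has a standard deviation of about the square root of the head size, so raw scores grow as heads get wider, and softmax on large scores collapses onto a single position. The sketch below shows the effect on three made-up scores for a head of size 64. The helper names are hypothetical and the code uses plain slices, not the repo's tensors.

```go
package main

import (
	"fmt"
	"math"
)

// softmax turns raw scores into positive weights that sum to 1.
func softmax(scores []float64) []float64 {
	maxScore := math.Inf(-1)
	for _, s := range scores {
		maxScore = math.Max(maxScore, s)
	}
	out := make([]float64, len(scores))
	sum := 0.0
	for i, s := range scores {
		out[i] = math.Exp(s - maxScore)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// scale multiplies every score by 1/sqrt(headSize).
func scale(scores []float64, headSize int) []float64 {
	f := 1 / math.Sqrt(float64(headSize))
	out := make([]float64, len(scores))
	for i, s := range scores {
		out[i] = s * f
	}
	return out
}

func main() {
	// Hypothetical raw query-key dot products for a head of size 64.
	raw := []float64{8, 0, -8}
	fmt.Printf("unscaled: %.4f\n", softmax(raw))
	fmt.Printf("scaled:   %.4f\n", softmax(scale(raw, 64)))
}
```

Unscaled, the first position takes more than 99% of the attention and the others are effectively silenced. Scaled, it still leads, but with about two thirds of the weight, which leaves the other positions a say and leaves the gradient something to work with.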

No peeking at the future

One line in that function quietly enforces a rule that matters enough to name: MaskedFillCausal. A language model is trained to predict the next character, so when it is deciding what comes after position five, it must be allowed to look at positions one through five, but never at position six or beyond. If it could see the future, the task would be trivial and it would learn nothing useful, like a student who can see the answer key during the exam.

So before the softmax, we take every score that would let a position attend to a later position and set it to negative infinity. After softmax, negative infinity becomes a weight of zero. Each position can attend to itself and everything before it, and nothing after. This is called causal masking, and it is why the model can be trained on ordinary text: every position simultaneously practices predicting its own successor, using only what came before it.
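The masking step can be sketched in a few lines. This is not the repo's MaskedFillCausal, just an illustration on plain slices with hypothetical helper names (maskCausal, softmaxRows). It relies on the fact that Go's math.Exp returns exactly 0 for negative infinity.

```go
package main

import (
	"fmt"
	"math"
)

// maskCausal returns a copy of a square score matrix in which every entry
// above the diagonal (row i looking at a later column j > i) is -Inf.
func maskCausal(scores [][]float64) [][]float64 {
	out := make([][]float64, len(scores))
	for i, row := range scores {
		out[i] = make([]float64, len(row))
		for j, s := range row {
			if j > i {
				out[i][j] = math.Inf(-1)
			} else {
				out[i][j] = s
			}
		}
	}
	return out
}

// softmaxRows applies softmax to each row independently. It assumes every
// row has at least one finite score, which the causal mask guarantees
// because the diagonal is never masked.
func softmaxRows(scores [][]float64) [][]float64 {
	out := make([][]float64, len(scores))
	for i, row := range scores {
		maxScore := math.Inf(-1)
		for _, s := range row {
			maxScore = math.Max(maxScore, s)
		}
		out[i] = make([]float64, len(row))
		sum := 0.0
		for j, s := range row {
			out[i][j] = math.Exp(s - maxScore) // exp(-Inf) is exactly 0
			sum += out[i][j]
		}
		for j := range out[i] {
			out[i][j] /= sum
		}
	}
	return out
}

func main() {
	// All-equal scores, so any unevenness in the result comes from the mask.
	scores := [][]float64{{0, 0, 0}, {0, 0, 0}, {0, 0, 0}}
	for _, row := range softmaxRows(maskCausal(scores)) {
		fmt.Printf("%.2f\n", row)
	}
	// Prints:
	// [1.00 0.00 0.00]
	// [0.50 0.50 0.00]
	// [0.33 0.33 0.33]
}
```

With identical scores everywhere, each row splits its attention evenly over the positions it is allowed to see, and the masked positions get exactly zero.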

Several kinds of attention at once

One head learns one kind of relationship, one way of deciding what is relevant. But language has many relationships running at the same time. A verb relates to its subject, a pronoun to the noun it stands for, a close bracket to its matching open bracket. Asking a single head to track all of these is asking too much.

So the model runs several heads in parallel, each with its own queries, keys, and values, each free to specialize. One head might learn to look at the previous character, another at the start of the current word, another at a matching quotation mark far behind. Their outputs are stitched back together and mixed. This is called multi-head attention, and it is the reason a transformer can juggle several kinds of context at once without them interfering.
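The stitching and mixing can be sketched on their own, taking the per-head outputs as given. Everything here is an assumption for illustration: the helper names (concatHeads, matMul), the shapes, and the mixing matrix, which in a real model would be learned rather than written by hand.

```go
package main

import "fmt"

// concatHeads stitches per-head outputs together along the feature axis.
// heads[h][t] is head h's output vector at position t; the result at
// position t is every head's vector for t laid end to end.
func concatHeads(heads [][][]float64) [][]float64 {
	n := len(heads[0]) // number of positions
	out := make([][]float64, n)
	for t := 0; t < n; t++ {
		for _, h := range heads {
			out[t] = append(out[t], h[t]...)
		}
	}
	return out
}

// matMul multiplies a (rows x inner) matrix by an (inner x cols) matrix.
func matMul(a, b [][]float64) [][]float64 {
	out := make([][]float64, len(a))
	for i := range a {
		out[i] = make([]float64, len(b[0]))
		for j := range b[0] {
			for k := range b {
				out[i][j] += a[i][k] * b[k][j]
			}
		}
	}
	return out
}

func main() {
	// Two hypothetical heads, each emitting 2 numbers for each of 2 positions.
	headA := [][]float64{{1, 2}, {3, 4}}
	headB := [][]float64{{5, 6}, {7, 8}}
	stitched := concatHeads([][][]float64{headA, headB})

	// Hypothetical mixing matrix (4 in, 2 out); in a real model it is learned.
	wo := [][]float64{{1, 0}, {0, 1}, {1, 0}, {0, 1}}

	fmt.Println(stitched)             // [[1 2 5 6] [3 4 7 8]]
	fmt.Println(matMul(stitched, wo)) // [[6 8] [10 12]]
}
```

Each head keeps its own slice of the stitched vector, so nothing one head found overwrites another's. The mixing matrix then decides how to combine them.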

Seeing it happen

Because attention weights are just proportions, we can print them and look. The stage 4 demo runs a single head over a short string and shows, for each character, how much it attends to itself and to each earlier character. The output looks like this (an untrained head, so the exact proportions mean nothing yet, but the structure is the whole point):

go run ./cmd/stage4_attention

attention weights over "hello" (row i = how char i attends):

          h    e    l    l    o
    h  1.00 0.00 0.00 0.00 0.00
    e  0.43 0.57 0.00 0.00 0.00
    l  0.31 0.32 0.36 0.00 0.00
    l  0.23 0.24 0.27 0.27 0.00
    o  0.20 0.24 0.17 0.17 0.21

Two things in that grid are the whole lesson. First, the entire upper-right triangle is zero: that is the causal mask, every character refusing to look at the ones after it. The first h can only attend to itself, so it gives itself 1.00. The e can look at h and itself, and splits its attention between them. By the time we reach o, it is spreading its attention across all five characters. Second, each row sums to one (give or take a hundredth, because the printout rounds to two decimals), because these are proportions. This head is untrained, so its weights are close to an even spread. After training, patterns appear: heads learn to concentrate their attention exactly where the useful information is.
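Both properties are easy to check mechanically, which makes a handy sanity test when you write your own head. The sketch below is not from the repo. The function name checkCausal is hypothetical, and the grid is the one printed above, copied by hand, so the tolerance has to allow for two-decimal rounding.

```go
package main

import (
	"fmt"
	"math"
)

// checkCausal reports whether w looks like causal attention weights:
// zeros above the diagonal and every row summing to 1 within tol.
func checkCausal(w [][]float64, tol float64) bool {
	for i, row := range w {
		sum := 0.0
		for j, v := range row {
			if j > i && v != 0 {
				return false // a position is attending to the future
			}
			sum += v
		}
		if math.Abs(sum-1) > tol {
			return false
		}
	}
	return true
}

func main() {
	// The grid from the demo output, rounded to two decimals.
	grid := [][]float64{
		{1.00, 0.00, 0.00, 0.00, 0.00},
		{0.43, 0.57, 0.00, 0.00, 0.00},
		{0.31, 0.32, 0.36, 0.00, 0.00},
		{0.23, 0.24, 0.27, 0.27, 0.00},
		{0.20, 0.24, 0.17, 0.17, 0.21},
	}
	fmt.Println(checkCausal(grid, 0.02)) // true
}
```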

That grid is attention laid bare. There is no magic in it, just a learned, masked, weighted average. But it is the piece that was missing. With it, every position can reach back and gather precisely the earlier context it needs, and it can learn which context that is.

We now have every part: a tokenizer, an autograd engine, a loss, gradient descent, and attention. In Part 5 we assemble them into the full transformer, stack a few layers, and turn it loose on a real book. Then we do the only thing that ever really convinces anyone: we watch the loss fall and read what the machine dreams up.

Code for this part is in cmd/stage4_attention and attention.go at github.com/erubboli/go-tiny-llm.