Build a Tiny LLM in Go, Part 5: Putting It Together and Watching It Learn

We have built every part. Part 1 turned characters into a counting model and found its fatal flaw: it remembers only one letter. Part 2 showed why you cannot fix that by counting harder. Part 3 replaced counting with learning: a loss, gradient descent, and an autograd engine to compute the slopes. Part 4 added attention, letting each position reach back and gather the context it needs. This part assembles them into the real thing, trains it on a book, and reads the result.

The transformer block

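A transformer is not a new idea on top of Part 4. It is the attention from Part 4, stacked in layers, with each layer wrapped in two small refinements that make deep networks trainable. Both refinements are applied twice per layer: once around attention and once around a small feed-forward network.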

The first refinement is the residual connection. Instead of each layer replacing its input with something new, it adds an adjustment to the input and passes the sum along. The picture is a running draft that each layer edits lightly rather than rewriting. This matters because it gives the gradient a clean path all the way back through many layers: even a deep stack can be trained, because the slopes do not have to survive a long chain of transformations to reach the early weights. The second refinement is normalization, a rescaling before each step that keeps the numbers in a sane range so training does not blow up. Modern models use a lean version called RMSNorm, and so do we.
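To make the two refinements concrete, here is a minimal sketch on plain float64 slices rather than the series' Tensor type. The names rmsNorm and residual, the eps parameter, and the stand-in sublayer are assumptions made for illustration; the repository's own RMSNorm may be written differently.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm rescales x so that its root-mean-square is about 1, then applies a
// learned per-element gain. eps guards against dividing by zero.
func rmsNorm(x, gain []float64, eps float64) []float64 {
	var sumSq float64
	for _, v := range x {
		sumSq += v * v
	}
	scale := 1 / math.Sqrt(sumSq/float64(len(x))+eps)
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = v * scale * gain[i]
	}
	return out
}

// residual returns x + f(x): the sublayer's output is an edit added to the
// input, not a replacement for it.
func residual(x []float64, f func([]float64) []float64) []float64 {
	delta := f(x)
	out := make([]float64, len(x))
	for i := range x {
		out[i] = x[i] + delta[i]
	}
	return out
}

func main() {
	x := []float64{3, -4, 0, 0}
	gain := []float64{1, 1, 1, 1} // a gain of all ones leaves the scale alone

	// A stand-in sublayer (hypothetical): normalize, then halve. In the real
	// block this would be attention or the feed-forward network.
	sublayer := func(v []float64) []float64 {
		n := rmsNorm(v, gain, 1e-8)
		for i := range n {
			n[i] *= 0.5
		}
		return n
	}

	fmt.Println("normalized:    ", rmsNorm(x, gain, 1e-8))
	fmt.Println("after residual:", residual(x, sublayer))
}
```

Because the input is carried forward by a plain addition, the gradient of the output with respect to the input always includes an identity term, which is the clean path back that the paragraph above describes.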

One block, then, is: normalize, do attention, add the result back to the input; normalize again, run a small per-position feed-forward network, add that back too. Attention lets positions share information across the sequence; the feed-forward network then digests it at each position. In Go the whole block is a few lines of actual work:

func (b *Block) Forward(x *Tensor) *Tensor {
	// attention sublayer, added back to the input (a residual connection)
	x = Add(x, b.attn.Forward(RMSNorm(x, b.norm1)))
	// feed-forward sublayer: normalize, expand with w1, apply ReLU, project back with w2
	h := RMSNorm(x, b.norm2)
	h = ReLU(AddRow(MatMul(h, b.w1), b.b1))
	ff := AddRow(MatMul(h, b.w2), b.b2)
	// also added back
	return Add(x, ff)
}

The full model wraps a stack of these blocks. It turns each character id into a vector, adds a second vector that encodes the character’s position (so the model knows the order, since attention by itself does not), runs the stack, and projects the final vectors down to a score for every possible next character. That is a transformer. Ours is deliberately tiny: two layers, four attention heads, vectors 64 numbers wide, a context window of 128 characters. Every one of those knobs is a single named constant in config.go, so you can make it bigger and slower or smaller and faster by editing one line.
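The input step described above, a character vector plus a position vector, can be sketched as follows. The constant names and the embed function are hypothetical, chosen for this illustration; the real names in config.go may differ, and the real model works on Tensor values rather than nested slices.

```go
package main

import "fmt"

// Hypothetical names for the knobs listed in the text. The actual constants
// live in config.go and may be spelled differently.
const (
	numLayers  = 2   // transformer blocks in the stack
	numHeads   = 4   // attention heads per block
	embedDim   = 64  // width of each vector
	contextLen = 128 // characters the model can see at once
	vocabSize  = 70  // distinct characters in the cleaned text
)

// embed looks up one vector per character id and adds the vector for that
// character's position, so identical characters at different positions
// enter the stack as different vectors.
func embed(ids []int, tokEmb, posEmb [][]float64) [][]float64 {
	out := make([][]float64, len(ids))
	for pos, id := range ids {
		row := make([]float64, len(tokEmb[id]))
		for j := range row {
			row[j] = tokEmb[id][j] + posEmb[pos][j]
		}
		out[pos] = row
	}
	return out
}

func main() {
	// Tiny 3-character, 2-dimensional tables so the arithmetic stays visible.
	tokEmb := [][]float64{{0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6}}
	posEmb := [][]float64{{1, 0}, {0, 1}, {1, 1}}

	// Character 2 appears at positions 0 and 2 and gets a different vector each time.
	fmt.Println(embed([]int{2, 0, 2}, tokEmb, posEmb))
	fmt.Println("dimensions per head:", embedDim/numHeads)
}
```

In the full-size model the token table would have vocabSize rows and the position table contextLen rows, each embedDim wide.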

Training on a book

The training loop is the one from Part 3, unchanged in spirit. Take a random slice of the book. Ask the model to predict the next character at every position at once. Compute the loss. Call Backward() to fill in every gradient through the whole stack, attention and all. Take one step downhill with the optimizer. Repeat a few thousand times.
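The "random slice" and "every position at once" steps come down to a simple trick: the targets are the inputs shifted one character to the right. The sketch below shows only that data handling. The function names are assumptions made for illustration, byte values stand in for character ids, and the model, loss, and optimizer calls are left as a comment because their real signatures live in the repository.

```go
package main

import (
	"fmt"
	"math/rand"
)

// window returns n inputs starting at start, plus the n targets that follow
// them: targets[i] is the character that comes right after inputs[i].
func window(data []int, start, n int) (inputs, targets []int) {
	return data[start : start+n], data[start+1 : start+n+1]
}

// randomWindow picks a random start that leaves room for the shifted targets.
func randomWindow(rng *rand.Rand, data []int, n int) (inputs, targets []int) {
	return window(data, rng.Intn(len(data)-n), n)
}

func main() {
	text := "alice was beginning to get very tired"
	data := make([]int, len(text))
	for i := 0; i < len(text); i++ {
		data[i] = int(text[i]) // byte values stand in for character ids
	}

	rng := rand.New(rand.NewSource(1))
	for step := 0; step < 3; step++ {
		in, tgt := randomWindow(rng, data, 8)
		fmt.Printf("step %d  inputs %v  targets %v\n", step, in, tgt)
		// The real loop would now run the model forward on in, compute the
		// loss against tgt, call Backward(), and take one optimizer step.
	}
}
```

One slice of n characters therefore yields n training examples, one per position, which is why a single step teaches the model so much.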

The bundled text is Alice in Wonderland, cleaned down to our seventy characters, about 140 kilobytes. On a laptop, with no GPU and no cleverness, this trains in a few minutes. Run it:

go run ./cmd/stage5_train

The loss falls, though not smoothly. It jumps around from step to step, because each step sees a different random slice of the book, but the trend is unmistakable:

step     0/3000  loss 4.7355
step   500/3000  loss 2.8512
step  1000/3000  loss 2.6186
step  1500/3000  loss 2.3629
step  2000/3000  loss 2.5237
step  2500/3000  loss 2.5066
step  2900/3000  loss 2.5590

It starts at 4.74. That is essentially the loss of knowing nothing: with seventy characters, blind guessing scores about 4.25, and a freshly randomized model does a touch worse. Within a few hundred steps it is under 3.0, and it settles around 2.5. The model has learned, from nothing but a book and the machinery of the previous four parts, to be meaningfully less surprised by English than a random guesser would be.
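The 4.25 figure is the natural logarithm of 70: a model that spreads its confidence evenly gives each character probability 1/70, and the cross-entropy loss is the negative log of the probability given to the right answer. A small check, with function names chosen for this illustration:

```go
package main

import (
	"fmt"
	"math"
)

// crossEntropy is the loss for one prediction: the negative log of the
// probability the model assigned to the character that actually came next.
func crossEntropy(probs []float64, target int) float64 {
	return -math.Log(probs[target])
}

// uniformLoss is the loss of blind guessing over vocab equally likely
// characters: -ln(1/vocab) = ln(vocab).
func uniformLoss(vocab int) float64 {
	return math.Log(float64(vocab))
}

func main() {
	fmt.Printf("blind guessing over 70 characters: %.4f\n", uniformLoss(70))

	// The same number, computed from an explicit uniform distribution.
	probs := make([]float64, 70)
	for i := range probs {
		probs[i] = 1.0 / 70
	}
	fmt.Printf("cross-entropy of a uniform guess:  %.4f\n", crossEntropy(probs, 0))

	// Exponentiating a loss gives the number of equally likely choices it
	// corresponds to.
	fmt.Printf("a loss of 2.5 is like choosing among %.1f characters\n", math.Exp(2.5))
}
```

Read this way, the drop from about 4.25 to about 2.5 means the model has gone from hesitating among all seventy characters to hesitating among roughly twelve.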

Reading what it dreams

The loss is a number. The honest test is to let the model write, and read what comes out. Generation is the loop from Part 1: predict the next character, sample one from the model’s confidences, append it, feed the growing text back in, and repeat. Here is a real 400-character sample, unedited:
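The "sample one from the model's confidences" step can be sketched like this. The model's scores are turned into probabilities with softmax, and a random number between 0 and 1 picks a character in proportion to its probability. The function names, the four-character vocabulary, and the fixed scores are assumptions for illustration; a real model would recompute the scores from the growing text at every step.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// softmax turns raw scores into probabilities that sum to 1. Subtracting the
// largest score first keeps math.Exp from overflowing.
func softmax(logits []float64) []float64 {
	maxLogit := logits[0]
	for _, v := range logits[1:] {
		if v > maxLogit {
			maxLogit = v
		}
	}
	probs := make([]float64, len(logits))
	var sum float64
	for i, v := range logits {
		probs[i] = math.Exp(v - maxLogit)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

// sampleIndex picks an index in proportion to probs, given u drawn uniformly
// from [0, 1): it walks the cumulative sum until it passes u.
func sampleIndex(probs []float64, u float64) int {
	var cum float64
	for i, p := range probs {
		cum += p
		if u < cum {
			return i
		}
	}
	return len(probs) - 1 // guards against rounding in the cumulative sum
}

func main() {
	rng := rand.New(rand.NewSource(42))
	vocab := []byte("abc ")
	logits := []float64{2.0, 1.0, 0.5, 1.5} // stand-in for the model's output

	text := []byte("a")
	for i := 0; i < 20; i++ {
		probs := softmax(logits) // a real model would recompute logits from text
		next := sampleIndex(probs, rng.Float64())
		text = append(text, vocab[next])
	}
	fmt.Println(string(text))
}
```

Sampling, rather than always taking the most likely character, is what keeps the output from collapsing into a single repeated pattern.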

 te heaide qus t, s jesong ant gete Aln suhoke thengplils haso t ber
 wonor re finrhese t whekeno, ayouthe ghem. ok! cbe ureum!
 lrider, tergrsnoumpend weanor.

Wowol cotherentosas, bo aisth?
Afaste fe t wiongshe
cuced stis ite
Ibhen, Rk ve sain
I l ofocollin.

! n whe Alede s to s opo ay ithey be t lely tathat Alit hero fithe y, ts icle.
les she sarye, dswAlerof anore s I aw iryorly.

It is gibberish. It is also, clearly, gibberish that has learned English. The word lengths are right. The letter combinations are ones English uses. There are real words in there (“she”, “to”, “be”, “hero”), near misses like “anore” reaching for “another”, and shadows of the source text: “Aln”, “Alede”, and “Alit” are the model grasping at “Alice”. It has learned to open a line, put spaces where words end, and even scatter a bit of punctuation. It writes the way you write in a dream, where every word looks right until you try to read it.

Compare that to where we started in Part 1, the counting model that could see one letter back. Both produce nonsense, but they fail differently. The bigram wandered because it forgot everything older than the last character. This one does not forget: it has 128 characters of context and attention to search them. What it lacks is not memory but scale. It is a two-layer model with a few tens of thousands of weights, trained for a few minutes on one short book.

The only thing that changes

And that is the honest ending, the one worth sitting with. The difference between this program and the model behind Claude or GPT is not a difference of kind. It is the same next-character question, the same loss, the same gradient descent, the same attention we built by hand in Part 4, the same autograd engine from Part 3 checking its own slopes against reality. Everything essential is in the roughly twelve hundred lines of Go you can now read end to end.

What separates our dreaming toy from a system that writes working code is scale, and nothing else conceptually. More layers. Wider vectors. A context window of hundreds of thousands of characters instead of 128. Tokens that are word-pieces instead of single letters. Training not on one book for minutes but on much of the written internet for months, across thousands of GPUs. Every one of those is an engineering problem of size, not a new idea. The idea is the one you have now built and watched run.

That is what a large language model is. It is this, made enormous. There is no separate secret. The machine that surprised the world by writing is, at its core, a next-letter guesser that got very, very big, and you have just built the small version with your own hands.

The complete, working code is at github.com/erubboli/go-tiny-llm: a character-level transformer in Go, standard library only, small enough to read in an afternoon. Clone it, change a constant, feed it your own text, and watch it dream.