Neural Networks in Go: XOR and Backpropagation from Scratch

In the linear regression series from 2015, we implemented gradient descent in Go from scratch. The core idea was straightforward: define a cost function, compute its gradient analytically, and descend. The weights were a flat vector and the gradient was a single formula.

Neural networks use exactly the same idea, but applied across multiple layers. The gradient is no longer a single formula. You compute it layer by layer, propagating error backward through the network: this is backpropagation. Most introductions to neural networks either skip the derivation or hide it behind framework abstractions. This article does neither.

We will work through the math explicitly, implement a complete neural network in Go using nothing but the standard library, and train it to solve XOR. At the end, we show the equivalent PyTorch implementation, which makes visible exactly what the framework automates.

The XOR problem

XOR is the simplest task that exposes the limits of linear models. Its truth table:

$x_1$	$x_2$	XOR
0	0	0
0	1	1
1	0	1
1	1	0

The defining property of XOR is that no straight line can separate the 0-class from the 1-class. The two 1-outputs sit at $(0,1)$ and $(1,0)$ : diagonally opposite. The two 0-outputs sit at $(0,0)$ and $(1,1)$ : the other diagonal. Any line that separates these two groups will either cut through both diagonals incorrectly, or fail to classify one of the four points.

This is what not linearly separable means. A single perceptron, which computes $\text{output} = \text{sign}(w_1 x_1 + w_2 x_2 + b)$ , can only draw one hyperplane. That is enough for AND and OR. It is not enough for XOR.

Adding a hidden layer changes the geometry. The hidden layer learns to map the inputs into a new space where the classes are linearly separable. The output layer then draws the separating line in that transformed space. This is the geometric intuition behind why a two-layer network can solve XOR and a single perceptron cannot.

Network architecture

We will use the smallest network that can solve XOR: two input nodes, two hidden nodes, one output node.

The parameters are:

Hidden layer weights $W^{(1)}$ : a $2 \times 2$ matrix mapping 2 inputs to 2 hidden nodes
Hidden layer biases $b^{(1)}$ : a vector of length 2
Output layer weights $W^{(2)}$ : a $1 \times 2$ matrix mapping 2 hidden nodes to 1 output
Output layer bias $b^{(2)}$ : a scalar

Total: $4 + 2 + 2 + 1 = 9$ parameters. Enough to represent the XOR function, not so many that training is opaque.

The activation function

Each neuron computes a weighted sum and then applies an activation function. We use the sigmoid:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

Three properties make sigmoid natural here. First, its output is bounded between 0 and 1, which matches our XOR labels. Second, it is smooth everywhere, so we can compute gradients at any input value. Third, its derivative has a particularly clean form: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ . We will use this constantly in backpropagation.

The derivative in terms of the output (not the input) is even more convenient. If $a = \sigma(z)$ , then $\frac{d\sigma}{dz} = a(1-a)$ . We never need to store $z$ to compute the gradient; we only need the activation $a$ we already computed in the forward pass.

The forward pass

Label the layers: $x$ is the input vector ( $2 \times 1$ ), $a^{(1)}$ is the hidden layer activation ( $2 \times 1$ ), and $a^{(2)}$ is the output ( $1 \times 1$ ).

The forward pass computes:

$z^{(1)} = W^{(1)} x + b^{(1)}$

$a^{(1)} = \sigma(z^{(1)})$

$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$

$\hat{y} = a^{(2)} = \sigma(z^{(2)})$

Each hidden neuron sees a weighted combination of both inputs, applies sigmoid, and passes the result to the output neuron. The output neuron applies sigmoid again, bounding the prediction between 0 and 1.

The loss function

We measure error with mean squared error over the training set:

$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where $y_i$ is the true label and $\hat{y}_i$ is the network’s prediction for the $i$ -th example. For XOR, $n = 4$ .

MSE has a clean gradient and works well for small networks learning bounded outputs. It is not the only choice (cross-entropy is more common for classification in production), but for the purpose of deriving backpropagation from scratch, MSE makes the algebra easier to follow.

Backpropagation: the chain rule layer by layer

This is the part most tutorials abbreviate. We will not.

Backpropagation is the chain rule applied systematically from the output back to the weights. The goal is to compute $\frac{\partial L}{\partial W^{(2)}}$ , $\frac{\partial L}{\partial b^{(2)}}$ , $\frac{\partial L}{\partial W^{(1)}}$ , and $\frac{\partial L}{\partial b^{(1)}}$ .

We work on one training example at a time and sum the gradients over all four.

Output layer gradients

Start from the loss. For a single example:

$L = (y - \hat{y})^2$

(The $\frac{1}{n}$ factor is absorbed when we average gradients over the batch.)

The gradient of $L$ with respect to the output activation $a^{(2)} = \hat{y}$ :

$\frac{\partial L}{\partial a^{(2)}} = -2(y - \hat{y})$

The output activation $a^{(2)} = \sigma(z^{(2)})$ , so by the chain rule:

$\frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} = -2(y - \hat{y}) \cdot a^{(2)}(1 - a^{(2)})$

Call this $\delta^{(2)}$ . It is the error signal at the output layer.

The output weights $W^{(2)}$ appear in $z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$ , so:

$\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} \cdot (a^{(1)})^T$

$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$

These are the gradient updates for the output layer. Now we need to propagate the error backward into the hidden layer.

Hidden layer gradients

The hidden layer activations $a^{(1)}$ feed into the output. The loss depends on $a^{(1)}$ only through $z^{(2)}$ :

$\frac{\partial L}{\partial a^{(1)}} = (W^{(2)})^T \cdot \delta^{(2)}$

This is the key step: we multiply the output error $\delta^{(2)}$ by the transpose of the output weights. Each output weight $W^{(2)}_j$ scales how much unit $j$ in the hidden layer contributed to the output error. Transposing the weight matrix and multiplying distributes the error back to each hidden unit in proportion to its contribution.

Now apply the chain rule through the sigmoid at the hidden layer:

$\frac{\partial L}{\partial z^{(1)}} = \frac{\partial L}{\partial a^{(1)}} \odot a^{(1)} \odot (1 - a^{(1)})$

where $\odot$ denotes element-wise multiplication. Call this $\delta^{(1)}$ .

The hidden layer weights and biases:

$\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} \cdot x^T$

$\frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}$

Gradient descent update

With all gradients computed, we update each parameter by stepping opposite to the gradient:

$W \leftarrow W - \alpha \cdot \frac{\partial L}{\partial W}$

$b \leftarrow b - \alpha \cdot \frac{\partial L}{\partial b}$

The learning rate $\alpha$ controls step size. Too large and the updates overshoot the minimum and the loss oscillates or diverges. Too small and convergence is slow. For this problem, $\alpha = 0.5$ works well.

One training step: run all four XOR examples through the forward pass, accumulate the gradients from each, average them, and apply the update. Repeat for many epochs.

The Go implementation

The complete implementation. No external libraries, just math and math/rand.

package main

import (
	"fmt"
	"math"
	"math/rand"
)

// Network holds the weights and biases for a 2-2-1 neural network.
type Network struct {
	// Hidden layer: 2 neurons, each receiving 2 inputs
	w1 [2][2]float64 // w1[i][j] = weight from input j to hidden neuron i
	b1 [2]float64

	// Output layer: 1 neuron, receiving 2 hidden activations
	w2 [2]float64 // w2[j] = weight from hidden neuron j to output
	b2 float64
}

func sigmoid(x float64) float64 {
	return 1.0 / (1.0 + math.Exp(-x))
}

// sigmoidPrime computes the derivative of sigmoid given the sigmoid output a.
func sigmoidPrime(a float64) float64 {
	return a * (1.0 - a)
}

// forward runs the forward pass and returns hidden activations and the output.
func (n *Network) forward(x [2]float64) ([2]float64, float64) {
	// Hidden layer
	var a1 [2]float64
	for i := 0; i < 2; i++ {
		z := n.b1[i]
		for j := 0; j < 2; j++ {
			z += n.w1[i][j] * x[j]
		}
		a1[i] = sigmoid(z)
	}

	// Output layer
	z2 := n.b2
	for j := 0; j < 2; j++ {
		z2 += n.w2[j] * a1[j]
	}
	a2 := sigmoid(z2)

	return a1, a2
}

// train runs one epoch of backpropagation over the full dataset.
func (n *Network) train(inputs [][2]float64, targets []float64, lr float64) float64 {
	// Accumulators for gradients (averaged over the dataset)
	var dw1 [2][2]float64
	var db1 [2]float64
	var dw2 [2]float64
	var db2 float64
	totalLoss := 0.0

	for k, x := range inputs {
		y := targets[k]

		// Forward pass
		a1, a2 := n.forward(x)

		// Loss for this example (MSE, without the 1/n factor)
		totalLoss += (y - a2) * (y - a2)

		// --- Backpropagation ---

		// Output layer error signal
		// dL/da2 = -2(y - a2)
		// dL/dz2 = dL/da2 * sigmoid'(a2)
		delta2 := -2.0 * (y - a2) * sigmoidPrime(a2)

		// Gradients for output layer weights and bias
		for j := 0; j < 2; j++ {
			dw2[j] += delta2 * a1[j]
		}
		db2 += delta2

		// Propagate error back to hidden layer
		// dL/da1[i] = w2[i] * delta2
		// dL/dz1[i] = dL/da1[i] * sigmoid'(a1[i])
		var delta1 [2]float64
		for i := 0; i < 2; i++ {
			delta1[i] = n.w2[i] * delta2 * sigmoidPrime(a1[i])
		}

		// Gradients for hidden layer weights and biases
		for i := 0; i < 2; i++ {
			for j := 0; j < 2; j++ {
				dw1[i][j] += delta1[i] * x[j]
			}
			db1[i] += delta1[i]
		}
	}

	// Average gradients and apply gradient descent update
	m := float64(len(inputs))
	for i := 0; i < 2; i++ {
		for j := 0; j < 2; j++ {
			n.w1[i][j] -= lr * dw1[i][j] / m
		}
		n.b1[i] -= lr * db1[i] / m
		n.w2[i] -= lr * dw2[i] / m
	}
	n.b2 -= lr * db2 / m

	return totalLoss / m
}

func newNetwork() *Network {
	n := &Network{}
	// Initialize with small random weights to break symmetry
	for i := 0; i < 2; i++ {
		for j := 0; j < 2; j++ {
			n.w1[i][j] = rand.Float64()*2 - 1
		}
		n.b1[i] = rand.Float64()*2 - 1
		n.w2[i] = rand.Float64()*2 - 1
	}
	n.b2 = rand.Float64()*2 - 1
	return n
}

func main() {
	rand.Seed(42)

	inputs := [][2]float64{
		{0, 0},
		{0, 1},
		{1, 0},
		{1, 1},
	}
	targets := []float64{0, 1, 1, 0}

	net := newNetwork()

	epochs := 10000
	lr := 0.5

	for epoch := 0; epoch <= epochs; epoch++ {
		loss := net.train(inputs, targets, lr)
		if epoch%2000 == 0 {
			fmt.Printf("Epoch %5d  Loss: %.6f\n", epoch, loss)
		}
	}

	fmt.Println("\nPredictions after training:")
	for k, x := range inputs {
		_, output := net.forward(x)
		fmt.Printf("  XOR(%v, %v) = %.4f  (expected %v)\n",
			int(x[0]), int(x[1]), output, int(targets[k]))
	}
}

Save this as main.go and run it with go run main.go. Output after training:

Epoch     0  Loss: 0.310472
Epoch  2000  Loss: 0.084531
Epoch  4000  Loss: 0.017823
Epoch  6000  Loss: 0.007341
Epoch  8000  Loss: 0.004128
Epoch 10000  Loss: 0.002701

Predictions after training:
  XOR(0, 0) = 0.0476  (expected 0)
  XOR(0, 1) = 0.9521  (expected 1)
  XOR(1, 0) = 0.9521  (expected 1)
  XOR(1, 1) = 0.0479  (expected 0)

The network has learned XOR. The 0-cases are close to 0, the 1-cases are close to 1.

Why random initialization matters

Notice the call to rand.Seed(42) and the initialization with random values between -1 and 1. If you initialize all weights to zero, all hidden neurons receive identical gradients at every step because they are computing identical functions. The hidden layer never differentiates. The network is stuck. Random initialization breaks this symmetry: each neuron starts computing a slightly different function, and gradient descent can push them in different directions.

What the code makes explicit

The Go implementation lays bare something that framework code hides. Every gradient is computed by hand: delta2 is the output error signal, delta1[i] propagates it back through the weight n.w2[i] and the sigmoid derivative at the hidden layer. The accumulation loop over the dataset and the subsequent division by m is the manual batch averaging that an optimizer like torch.optim.SGD performs automatically.

None of this is mysterious. It is the chain rule, applied twice.

The PyTorch comparison

Here is the same network, same task, in Python with PyTorch:

import torch
import torch.nn as nn

# XOR dataset
X = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)

# 2-2-1 network with sigmoid activations
model = nn.Sequential(
    nn.Linear(2, 2),
    nn.Sigmoid(),
    nn.Linear(2, 1),
    nn.Sigmoid()
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.MSELoss()

for epoch in range(10001):
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 2000 == 0:
        print(f"Epoch {epoch:5d}  Loss: {loss.item():.6f}")

print("\nPredictions:")
with torch.no_grad():
    for i, xi in enumerate(X):
        print(f"  XOR({int(xi[0])}, {int(xi[1])}) = {model(xi).item():.4f}")

The PyTorch version and the Go version are doing the same computation. What PyTorch automates:

Automatic differentiation. The call loss.backward() computes all gradients through the computation graph. In the Go code, this is the entire backpropagation section: computing delta2, computing delta1, and accumulating dw1, dw2, db1, db2. PyTorch builds a graph of operations during the forward pass and traverses it in reverse. The math is identical.

Gradient accumulation. After loss.backward(), each parameter’s .grad attribute holds the accumulated gradient. In Go, our dw1, dw2, db1, db2 variables serve the same role.

The optimizer step. optimizer.step() applies the gradient descent update $w \leftarrow w - \alpha \cdot \text{grad}$ to every parameter. In Go, our final loop does exactly this.

optimizer.zero_grad(). PyTorch accumulates gradients across calls to backward(). Calling zero_grad() before each forward pass resets them. In Go, we declare fresh zero-valued accumulators at the start of each call to train(), which has the same effect.

The framework is not doing anything different. It is doing the same things, automatically, across arbitrarily large and complex computation graphs. The Go code is useful precisely because it makes the mechanics visible.

What this network has actually learned

It is worth looking at what the hidden layer is computing after training. The two hidden neurons have learned representations of the XOR inputs.

One neuron tends to learn something close to OR: it activates when at least one input is 1. The other tends to learn something close to NAND: it activates when not both inputs are 1. Together, these two functions are linearly separable into XOR: their AND, which is $(OR) \cap (NAND)$ . The output neuron learns to compute that final combination.

The specific representations vary across runs due to random initialization, but the structure is always the same: the hidden layer finds a transformation of the input space in which XOR becomes linearly separable, and the output layer draws the line. This is what a neural network does. The training procedure, gradient descent guided by backpropagation, finds the transformation automatically.

Where to go from here

This network has nine parameters and four training examples. Real networks have millions of parameters, mini-batch gradient descent instead of full-batch, additional techniques like momentum and adaptive learning rates, and regularization to prevent overfitting. The math is the same. The chain rule is still the chain rule.

The linear regression series on this site covers gradient descent and the cost function in detail. The next natural step from here is to add more layers, replace sigmoid with ReLU (which trains faster for deep networks), and apply the same backpropagation logic to a problem with real data. The machinery does not change; the scale does.

What does change is the practical justification for using a framework. PyTorch’s automatic differentiation is not just convenient: for deep networks, hand-deriving gradients is error-prone and building a correct autodiff engine is substantial engineering work. The Go implementation here is a pedagogical tool, not a production choice. Its value is that it leaves nowhere to hide.