Genetic Algorithms in Go: Optimization by Simulated Evolution

In the neural networks article we trained a network using backpropagation: compute a gradient, step in the direction that reduces loss, repeat. The method works beautifully when the loss landscape is smooth and differentiable. Not every optimization problem gives you that.

Some problems have no gradient to compute. Some have gradients but a landscape pocked with local minima so numerous that gradient descent reliably gets trapped. Some involve discrete choices, permutations, or combinatorial structure for which derivatives are not even defined. For these, a different class of algorithms exists. Genetic algorithms are among the most natural to understand: they borrow their logic directly from biological evolution.

The biological analogy

Evolution maintains a population of individuals. Each individual carries a genome, a string of genes that encode its characteristics. Individuals are exposed to selection pressure: those better suited to their environment are more likely to survive and reproduce. Reproduction is not copying. Two parents contribute genes to an offspring through crossover, mixing their genomes. Occasionally a gene changes at random through mutation, introducing variation that no parent possessed.

Over many generations, the population drifts toward higher fitness, the measure of how well an individual solves the problem at hand. No individual planned the improvement. No gradient was consulted. The algorithm is blind, parallel, and surprisingly effective.

We use this vocabulary exactly. A candidate solution is a chromosome. The values that encode it are genes. How good the solution is is its fitness. The rest follows.

Why not just use gradient descent?

Gradient descent is the right tool when your objective function is differentiable and the landscape has a manageable structure. The neural network implementation on this site is a good example: the loss is smooth, convex-ish near solutions, and backpropagation gives exact gradients cheaply.

Three situations break that assumption.

Non-differentiability. If your objective function involves discrete choices, sorting, scheduling, or logical constraints, there is no gradient to compute. You cannot differentiate “does this route visit all cities exactly once.”

Multimodality. Some smooth functions have so many local optima that gradient descent, starting from any reasonable initialization, reliably gets stuck in one that is far from the global optimum. The algorithm converges, but to the wrong place.

Black-box functions. Sometimes you can evaluate a candidate solution but cannot inspect the function’s internals. A gradient is not available. Only fitness values are.

Genetic algorithms address all three. The tradeoff is clear: on smooth, unimodal problems, gradient descent converges faster and more precisely. On rugged, discrete, or black-box problems, GA explores the space broadly via a whole population rather than committing to a single trajectory.

The target problem

We will optimize this function over x in [0, 20]:

f(x) = sin(x) * cos(x/3) + 0.5 * sin(2x)

This function has seven local maxima. Three of them share the global maximum value of approximately 1.2498, located near x = 1.00, x = 10.43, and x = 19.85. Between and around them sit four lower local maxima, at f values of about 0.32 and 0.12, near x = 3.96, 6.70, 13.39, and 16.13.

Gradient ascent from x = 7.0 converges to x = 6.70, f = 0.12: the lowest local maximum in the domain. It cannot know that much higher peaks exist nearby. Gradient ascent from x = 4.0 settles at x = 3.96, f = 0.32. Starting position determines outcome completely.

GA will find f = 1.2498 reliably, regardless of where in the domain the initial population falls.

The algorithm

Five operations, applied in a loop:

1. Initialize

Create a population of N candidate solutions, placing them randomly across the search space. For real-valued problems, draw each gene uniformly from the allowed range. The population is the algorithm’s diversity budget: a larger population explores more thoroughly but costs more evaluations per generation.

2. Evaluate fitness

Apply the objective function to every individual. Store the result as the individual’s fitness. This step is usually the most expensive: for real-world problems, a single fitness evaluation might involve running a simulation or querying a physical system.

3. Select

Choose parents for the next generation. We use tournament selection: pick k individuals at random from the current population, select the fittest among them. Repeat to choose each parent.

Tournament selection is preferred over roulette wheel (fitness-proportional) selection for continuous problems for two reasons. First, it is scale-invariant: only the rank among competitors matters, not the absolute magnitude of fitness values. A roulette wheel’s selection pressure collapses when one individual’s fitness dominates, turning selection into a lottery. Second, the tournament size k directly controls selection pressure: k=2 is gentle, k=10 is aggressive. You can tune it without rescaling the fitness function.

4. Crossover

Combine two parents to produce an offspring. For real-valued genes, the standard approach is BLX-alpha crossover (blend crossover). Given parents a and b with gene values in some range, the offspring gene is sampled uniformly from:

[min(a, b) - alpha * |b - a|,  max(a, b) + alpha * |b - a|]

The parameter alpha controls how far outside the interval between the parents the offspring can fall. With alpha = 0, offspring are always between the parents (exploitation). With alpha = 0.5, offspring can reach 50% of the parent gap beyond either parent (balanced). Higher values encourage more exploration. The standard choice is alpha = 0.5.

BLX-alpha is appropriate for real-valued problems because it produces offspring in a continuous neighborhood of the parents, scaled by how far apart they are. A binary crossover operating on a floating-point bit representation would be arbitrary and unstable.

5. Mutate

After crossover, apply mutation to the offspring with probability p_mutation. For real-valued genes, add Gaussian noise: gene += N(0, sigma). The sigma parameter controls mutation step size.

Mutation serves a specific role: it reintroduces variation that selection erodes. As a population converges, its members become similar. Without mutation, crossover between near-identical individuals produces near-identical offspring. The population stagnates. Mutation perturbs individuals into unexplored regions of the search space, preventing premature convergence.

The Go implementation

Real-valued encoding. No external libraries. The only imports are math, math/rand, and fmt.

package main

import (
	"fmt"
	"math"
	"math/rand"
)

// Individual holds a single candidate solution and its fitness.
type Individual struct {
	gene    float64
	fitness float64
}

// Population is a slice of individuals.
type Population []Individual

// initPopulation creates a random population within the given bounds.
func initPopulation(size int, bounds [2]float64) Population {
	pop := make(Population, size)
	for i := range pop {
		pop[i].gene = bounds[0] + rand.Float64()*(bounds[1]-bounds[0])
	}
	return pop
}

// evaluate applies f to every individual and stores the result as fitness.
func evaluate(pop Population, f func(float64) float64) Population {
	for i := range pop {
		pop[i].fitness = f(pop[i].gene)
	}
	return pop
}

// tournamentSelect picks k random individuals and returns the fittest.
func tournamentSelect(pop Population, k int) Individual {
	best := pop[rand.Intn(len(pop))]
	for i := 1; i < k; i++ {
		candidate := pop[rand.Intn(len(pop))]
		if candidate.fitness > best.fitness {
			best = candidate
		}
	}
	return best
}

// blxCrossover produces an offspring using BLX-alpha blend crossover.
// The offspring gene is sampled uniformly from the interval
// [min(a,b) - alpha*d, max(a,b) + alpha*d] where d = |b.gene - a.gene|.
func blxCrossover(a, b Individual, alpha float64, bounds [2]float64) Individual {
	d := math.Abs(b.gene - a.gene)
	lo := math.Min(a.gene, b.gene) - alpha*d
	hi := math.Max(a.gene, b.gene) + alpha*d
	gene := lo + rand.Float64()*(hi-lo)
	// Clamp to search space boundaries.
	if gene < bounds[0] {
		gene = bounds[0]
	}
	if gene > bounds[1] {
		gene = bounds[1]
	}
	return Individual{gene: gene}
}

// mutate adds Gaussian noise to the gene with probability rate.
func mutate(ind Individual, rate, sigma float64, bounds [2]float64) Individual {
	if rand.Float64() < rate {
		ind.gene += rand.NormFloat64() * sigma
		if ind.gene < bounds[0] {
			ind.gene = bounds[0]
		}
		if ind.gene > bounds[1] {
			ind.gene = bounds[1]
		}
	}
	return ind
}

// bestOf returns the individual with the highest fitness in the population.
func bestOf(pop Population) Individual {
	best := pop[0]
	for _, ind := range pop[1:] {
		if ind.fitness > best.fitness {
			best = ind
		}
	}
	return best
}

// evolve runs one generation: select, cross, mutate, evaluate.
// Elitism: the best individual from the current generation is carried forward unchanged.
func evolve(
	pop Population,
	f func(float64) float64,
	bounds [2]float64,
	crossoverRate float64,
	mutationRate float64,
	sigma float64,
	tournamentK int,
	alpha float64,
) Population {
	next := make(Population, len(pop))

	// Elitism: preserve the best individual.
	next[0] = bestOf(pop)

	for i := 1; i < len(pop); i++ {
		parent1 := tournamentSelect(pop, tournamentK)
		var child Individual
		if rand.Float64() < crossoverRate {
			parent2 := tournamentSelect(pop, tournamentK)
			child = blxCrossover(parent1, parent2, alpha, bounds)
		} else {
			child = parent1
		}
		child = mutate(child, mutationRate, sigma, bounds)
		child.fitness = f(child.gene)
		next[i] = child
	}

	return next
}

func main() {
	rand.Seed(42)

	// Target function: multimodal, several local maxima over [0, 20].
	f := func(x float64) float64 {
		return math.Sin(x)*math.Cos(x/3) + 0.5*math.Sin(2*x)
	}

	bounds := [2]float64{0, 20}
	popSize := 50
	generations := 100
	crossoverRate := 0.8
	mutationRate := 0.1
	sigma := 0.5
	tournamentK := 3
	alpha := 0.5

	pop := initPopulation(popSize, bounds)
	pop = evaluate(pop, f)

	for gen := 0; gen <= generations; gen++ {
		if gen%10 == 0 {
			best := bestOf(pop)
			fmt.Printf("Gen %3d: best x = %.4f, f(x) = %.4f\n", gen, best.gene, best.fitness)
		}
		pop = evolve(pop, f, bounds, crossoverRate, mutationRate, sigma, tournamentK, alpha)
	}

	best := bestOf(pop)
	fmt.Printf("\nResult: x = %.6f, f(x) = %.6f\n", best.gene, best.fitness)
}

Save as main.go and run with go run main.go. Output:

Gen   0: best x = 19.9464, f(x) = 1.2370
Gen  10: best x = 19.8505, f(x) = 1.2498
Gen  20: best x = 19.8505, f(x) = 1.2498
Gen  30: best x = 19.8505, f(x) = 1.2498
Gen  40: best x = 19.8505, f(x) = 1.2498
Gen  50: best x = 19.8505, f(x) = 1.2498
Gen  60: best x = 19.8505, f(x) = 1.2498
Gen  70: best x = 19.8505, f(x) = 1.2498
Gen  80: best x = 19.8505, f(x) = 1.2498
Gen  90: best x = 19.8505, f(x) = 1.2498
Gen 100: best x = 19.8505, f(x) = 1.2498

Result: x = 19.850493, f(x) = 1.249804

The population finds x near 19.85 (one of the three global maxima at f = 1.2498) by generation 10 and holds it for the remaining 90 generations.

Reading the implementation

A few design choices worth naming.

Elitism. The evolve function preserves the single best individual from each generation at index zero. Without elitism, selection pressure and mutation can accidentally discard the best solution ever found. Elitism guarantees monotone improvement: the best fitness seen so far can only stay the same or get better.

Clamping in crossover and mutation. BLX-alpha can produce gene values outside the search bounds, especially when two parents near a boundary are crossed. The explicit clamp in blxCrossover and mutate keeps every individual legal. Clamping introduces a small bias near boundaries: many out-of-bounds samples become boundary values, increasing their frequency slightly. For most problems this is acceptable. Alternatives include rejection sampling (discard and resample until in-bounds) or wrapping.

Fitness at birth. The evolve function calls f(child.gene) immediately after creating the child, rather than running a separate evaluation pass. This keeps the code straightforward for single-objective scalar problems. For expensive fitness functions, batching evaluations enables parallelism: a natural Go extension would be to launch each evaluation in a goroutine and collect results via a channel.

No global state. Each function takes what it needs as parameters. tournamentSelect, blxCrossover, and mutate are pure in the sense that they only read their arguments and the shared rand source. Swapping in a different selection strategy or crossover operator requires changing only the relevant call in evolve.

Gradient descent fails here

To make the tradeoff concrete, here is gradient ascent on the same function, starting from x = 7.0:

package main

import (
	"fmt"
	"math"
)

func f(x float64) float64 {
	return math.Sin(x)*math.Cos(x/3) + 0.5*math.Sin(2*x)
}

func main() {
	x := 7.0
	lr := 0.05

	for i := 0; i < 2000; i++ {
		// Numerical gradient via central differences.
		grad := (f(x+1e-6) - f(x-1e-6)) / (2e-6)
		x += lr * grad
		if x < 0 {
			x = 0
		}
		if x > 20 {
			x = 20
		}
	}

	fmt.Printf("Gradient ascent result: x = %.4f, f(x) = %.4f\n", x, f(x))
	// Output: Gradient ascent result: x = 6.7020, f(x) = 0.1212
}

Output: Gradient ascent result: x = 6.7020, f(x) = 0.1212

The algorithm climbed to the nearest local maximum and stopped. It had no mechanism to notice that a peak ten times higher exists 3 units to the left and another 13 units to the right. From x = 4.0, gradient ascent settles at x = 3.96, f = 0.32: also local, also wrong. The starting point determines the answer completely.

GA finds f = 1.2498 from any reasonable initialization because it explores the entire domain simultaneously. The population is distributed across [0, 20] at generation zero. Selection does not eliminate every individual near a lower peak immediately: individuals everywhere survive long enough to contribute genetic material. The crossover operator routinely produces offspring that land far from either parent when the parents are far apart, which is common in the early generations when the population spans the full range.

The population is not “smarter” than gradient descent. It is simply parallel: fifty simultaneous explorations rather than one. The diversity of starting positions is the mechanism, not any special insight about the landscape.

What genetic algorithms are actually good for

The function above is a toy: the domain is one-dimensional, continuous, and the true optimum is known. GA found it in a millisecond. Real uses are less tidy.

Neural architecture search. The structure of a neural network (number of layers, width, skip connections, activation functions) is a discrete combinatorial object. Gradient descent optimizes weights given a fixed architecture. GA has been used to search the architecture space itself, treating network structures as chromosomes.

Scheduling and routing. The traveling salesman problem and its industrial cousins (job shop scheduling, vehicle routing) are combinatorial. A chromosome can encode a permutation of cities or jobs. Crossover operators designed for permutations (such as order crossover) preserve valid orderings. Gradient descent is not applicable.

Parameter tuning for black-box systems. When the system being optimized is a simulation, a physical device, or a legacy codebase that cannot be differentiated through, GA treats it as a black box: call it, get a scalar, select on that scalar. No internal access required.

Hyperparameter optimization. Learning rate, batch size, regularization strength, architecture depth: GA treats these as genes and searches the hyperparameter space via population-based exploration.

The honest limitation is sample efficiency. GA evaluates the objective function many times, once per individual per generation. For cheap functions (like our example), that is fine. For expensive ones (days of simulation per evaluation), GA requires careful budgeting, often combined with surrogate models that approximate the fitness landscape cheaply between expensive evaluations.

What the implementation leaves out

This implementation works for the target problem. Several extensions matter in practice.

Multi-dimensional problems. Extending to multiple genes per individual means making Individual.genes a slice, adjusting blxCrossover to operate per-gene, and defining fitness over a vector-valued input. The structure of evolve does not change.

Constraint handling. Many real problems have constraints beyond simple bounds. Common approaches include penalty functions (reduce fitness for constraint violations), repair operators (project infeasible offspring back to the feasible set), or death penalty (discard infeasible individuals and replace them). Each has tradeoffs depending on how large the infeasible region is relative to the feasible space.

Diversity preservation. On multimodal problems you sometimes want to find multiple good solutions, not just the single best. Niching techniques (fitness sharing, crowding) apply a penalty to individuals too similar to others in the population, maintaining diversity across multiple peaks. The current implementation finds one global maximum; niching would maintain representatives near all three peaks at f = 1.2498.

Adaptive parameters. Fixed mutation rate and sigma work for simple problems. Self-adaptive GA variants encode mutation parameters in the genome itself, allowing the algorithm to adjust its own search behavior as the population evolves. This is particularly useful when the appropriate mutation step size varies across different regions of the search space.

The deeper point

Gradient descent is local: it follows the landscape at a single point. Genetic algorithms are global: the population covers the landscape at many points simultaneously. This is not a free lunch. The population-based exploration costs more function evaluations per unit of progress than gradient descent makes on smooth problems. The bet GA is making is that the cost of evaluations is worth paying to avoid getting trapped.

When that bet is correct, the payoff is finding solutions that a purely local method would never reach. The mechanism is borrowed from a process that had three billion years to refine it. We get to use it in a hundred lines of Go.