Building a Neural Network from scratch in Go

17 Jun 2026 · Stone Liu

Neural Networks

Neural networks have always been quite mysterious to me. Even after learning about them in school, I could not fully grasp how the inner machinery of a network learns tasks as varied as image classification. So, like any normal person ought to do, I decided to build one completely from scratch myself using this book.

The Perceptron

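I first started out by modeling a perceptron. It’s a type of neuron that takes several binary inputs $x_1, x_2, \ldots, x_n$ and produces a single binary output, $0$ or $1$. Each input $x_j$ has a weight $w_j$ that represents its importance, and the neuron activates only if the weighted sum of its inputs exceeds a threshold. This can be formalized as the following rule:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}$$

Notice that if we give the negated threshold a name, the bias $b = -\text{threshold}$, and move it to the other side of the inequality we get

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j + b \le 0 \\ 1 & \text{if } \sum_j w_j x_j + b > 0 \end{cases}$$

And finally, if we replace the sum with a dot product of the weight and feature vectors $w$ and $x$, we get the familiar formula

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}$$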

In code, I represented my perceptron as such:

type Perceptron struct {
  threshold float64
  // Number of inputs the perceptron is allowed to take.
  inputs    uint
}

func CreatePerceptron(threshold float64, inputSize uint) *Perceptron {
  return &Perceptron{
    threshold: threshold,
    inputs:    inputSize,
  }
}

// Input is a single binary input together with its weight.
type Input struct {
  X uint8
  W float64
}
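The `Forward` method used on the perceptron is not shown in this post, so here is a minimal sketch of what it could look like. It assumes the `threshold` field actually stores the bias $b = -\text{threshold}$, so the neuron fires when $w \cdot x + b > 0$. The method body and the input-count check are my own guesses, not the original implementation.

```go
package main

import "fmt"

// Perceptron mirrors the struct above. Assumption: the threshold field
// holds the bias b = -threshold.
type Perceptron struct {
	threshold float64
	inputs    uint
}

// Input is a single binary input together with its weight.
type Input struct {
	X uint8
	W float64
}

// Forward returns 1 when w·x + b > 0 and 0 otherwise.
// Hypothetical implementation: the original method is not shown.
func (p *Perceptron) Forward(in ...Input) uint8 {
	if uint(len(in)) != p.inputs {
		panic("unexpected number of inputs")
	}
	sum := p.threshold
	for _, input := range in {
		sum += input.W * float64(input.X)
	}
	if sum > 0 {
		return 1
	}
	return 0
}

func main() {
	p := &Perceptron{threshold: 3, inputs: 2}
	// -2 - 2 + 3 = -1, which is not greater than 0.
	fmt.Println(p.Forward(Input{X: 1, W: -2}, Input{X: 1, W: -2}))
	// 0 - 2 + 3 = 1, which is greater than 0.
	fmt.Println(p.Forward(Input{X: 0, W: -2}, Input{X: 1, W: -2}))
}
```

If `Forward` instead compared the weighted sum against a true threshold, the sign of the stored value would need to flip.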

What’s interesting about perceptrons is that they can serve as a universal model of computation. For instance, you can create a NAND gate from a perceptron with two binary inputs, each with a weight of $-2$, and a bias of $3$.

func NAND(x1 uint8, x2 uint8) uint8 {
  // 3 is passed as the bias (b = -threshold) and 2 is the number of inputs.
  // This assumes Forward, not shown here, evaluates w·x + b > 0.
  p1 := CreatePerceptron(3, 2)
  return p1.Forward(Input{X: x1, W: -2}, Input{X: x2, W: -2})
}

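Why do weights of $-2$ and a bias of $3$ give us a NAND gate? Let’s do the math. NAND is simply a logical operator that outputs $0$ only when both inputs are $1$. Evaluating $w \cdot x + b = -2x_1 - 2x_2 + 3$ for each pair of inputs gives the following truth table:

x1   x2   -2*x1 - 2*x2 + 3   output
0    0     3                 1
0    1     1                 1
1    0     1                 1
1    1    -1                 0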

You can also model other functions such as XOR by building a network of NAND perceptrons. However, the problem with perceptrons is that they cannot learn gradually. The activation is a step function, so a small change in a weight either leaves the output unchanged or flips it completely, and a single perceptron can only separate its inputs with a linear boundary.
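To make the XOR claim concrete, here is a small sketch that wires four NAND perceptrons together. The `nand` helper inlines the weights of $-2$ and bias of $3$ from above rather than using the `Perceptron` type, and the function names are my own.

```go
package main

import "fmt"

// nand evaluates a perceptron with weights -2, -2 and bias 3.
func nand(x1, x2 uint8) uint8 {
	if -2*float64(x1)-2*float64(x2)+3 > 0 {
		return 1
	}
	return 0
}

// xor is built from four NAND gates, the standard construction.
func xor(x1, x2 uint8) uint8 {
	n1 := nand(x1, x2)
	n2 := nand(x1, n1)
	n3 := nand(x2, n1)
	return nand(n2, n3)
}

func main() {
	for _, in := range [][2]uint8{{0, 0}, {0, 1}, {1, 0}, {1, 1}} {
		fmt.Printf("%d XOR %d = %d\n", in[0], in[1], xor(in[0], in[1]))
	}
}
```

Note that XOR needs several perceptrons in layers. No single perceptron can compute it, because XOR is not linearly separable.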

The Sigmoid Neuron

The sigmoid neuron is the de facto standard when first learning about artificial neural networks. It solves the problem of the perceptron, whose step activation gives no smooth signal for how a change in the weights changes the output. The sigmoid is smooth, so small changes in the weights and biases produce small changes in the output. It is defined to be

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The nice thing about the sigmoid is that it looks something like this:

Notice that the sigmoid squashes every input $z$ to a value in the interval $(0, 1)$. Unlike the perceptron’s step function, it is smooth and differentiable everywhere, which is what makes it so much more useful for learning.
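A quick way to see the squashing behaviour is to evaluate the sigmoid at a few points. This is a minimal sketch using only the standard `math` package; the function name `sigmoid` is mine and is not necessarily what the network code uses.

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid computes 1 / (1 + e^(-z)).
func sigmoid(z float64) float64 {
	return 1 / (1 + math.Exp(-z))
}

func main() {
	// Large negative inputs approach 0, large positive inputs approach 1,
	// and sigmoid(0) is exactly 0.5.
	for _, z := range []float64{-10, -1, 0, 1, 10} {
		fmt.Printf("sigmoid(%g) = %.6f\n", z, sigmoid(z))
	}
}
```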

Of course there are still many use cases for the perceptron; however, our goal is to classify digits, which is not a binary problem. Let’s walk through an example neural network to get a sense of how the activations differ.

I will be using Nielsen’s notation since I am following along with his book. If we were to compute the output of a layer $l$ of this network in a single matrix multiplication, we would have the following

$$z^l = w^l a^{l-1} + b^l$$

The output of the equation is $z^l$, and from there we would apply our activation to get $a^l = \sigma(z^l)$.

Networks that pass the output of each layer forward as the input to the next layer are what we call feedforward networks.

type Network[N Number] struct {
  sizes                       []uint       // Represents the number of nodes in each layer.
  layers                      uint         // Layers not including the input layer.
  biases                      []*Vector[N] // Biases for each layer represented as column vectors
  weights                     []*Matrix[N]
  activations                 []*Vector[N] // a_l
  preActivations              []*Vector[N] // z_l
  activationFunctionsPerLayer []Activation[N]
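}

// Forward runs the input through every layer, caching each layer's
// pre-activations (z_l) and activations (a_l) for use in backpropagation.
func (this *Network[T]) Forward(input *Vector[T]) *Vector[T] {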
  var currentFeatures = input
  for layerIdx := range this.layers {
    biasVector := this.biases[layerIdx]
    weights := this.weights[layerIdx]
    z_l := weights.Multiply(currentFeatures).Add(biasVector)
    this.preActivations[layerIdx] = z_l
    activation := this.activationFunctionsPerLayer[layerIdx]
    a_l := activation.Apply(z_l)
    this.activations[layerIdx] = a_l
    currentFeatures = a_l
  }
  return currentFeatures
}

Backpropagation

After building out the feedforward pass for my network, it’s time to implement the basis of how neural networks actually learn. Say we have the following network architecture, courtesy of 3Blue1Brown: 784 inputs, two hidden layers of 16 neurons each, and 10 outputs.

I have represented this network as an object:

sizes := []uint{784, 16, 16, 10}
activations := []nn.Activation[float64]{
  nn.SigmoidActivation[float64](), // Activations at each layer
  nn.SigmoidActivation[float64](),
  nn.SigmoidActivation[float64](),
}
network := nn.CreateNetwork(sizes, activations)

When training a neural network to classify digits, we take the output, or prediction, of the final layer and compare it with the ground truth. We measure how far off the prediction is using something called a loss function. For simplicity, we will use the mean squared error, which averages the squared differences between the $n$ output activations $a^L_i$ and the targets $y_i$:

$$C = \frac{1}{n} \sum_{i=1}^{n} \left(a^L_i - y_i\right)^2$$

In code I have represented it as such:

type MSE[N Number] struct{}

// Cost returns the mean of the squared differences between the
// prediction and the target.
func (this *MSE[N]) Cost(prediction, target *Vector[N]) N {
  n := prediction.rows
  var sum float64

  for i := range n {
    pred := prediction.Get(i, 0)
    gt := target.Get(i, 0)
    sum += math.Pow(float64(pred-gt), 2)
  }
  return N(sum / float64(n))
}
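Backpropagation also needs the gradient of this cost with respect to the prediction, which the `loss.Gradient` call later relies on but which is not shown here. Differentiating the mean squared error gives $\frac{\partial C}{\partial a_i} = \frac{2}{n}(a_i - y_i)$. Below is a minimal sketch on plain slices; the name `mseGradient` is hypothetical, and the actual implementation may well drop the constant factor.

```go
package main

import "fmt"

// mseGradient returns dC/da_i = (2/n) * (a_i - y_i) for each output.
// Assumption: prediction and target have the same length.
func mseGradient(prediction, target []float64) []float64 {
	n := float64(len(prediction))
	grad := make([]float64, len(prediction))
	for i := range prediction {
		grad[i] = 2 * (prediction[i] - target[i]) / n
	}
	return grad
}

func main() {
	// The first output is off by 0.5, the second is exact.
	fmt.Println(mseGradient([]float64{0.5, 1}, []float64{0, 1})) // prints [0.5 0]
}
```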

So far I have gotten the following:

Defined the network architecture.
Implemented feedforward with the sigmoid activation function.
Implement backpropagation?

To do this, at a very high level, we want to find how to tweak the weights and biases in our network so that we minimize our loss function. But what does it mean to minimize our loss function?

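This is where calculus comes in handy: there is a technique called gradient descent which we can use to find local minima. We will update each of the weights and biases of our network with the following rule, where $\eta$ is the learning rate:

$$w^l \rightarrow w^l - \eta \frac{\partial C}{\partial w^l}, \qquad b^l \rightarrow b^l - \eta \frac{\partial C}{\partial b^l}$$

But what are $\frac{\partial C}{\partial w^l}$ and $\frac{\partial C}{\partial b^l}$? Let us consider the final layer of our network, $L$. We know that in the final layer the pre-activation value is equal to

$$z^L = w^L a^{L-1} + b^L$$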

The Gradient of the Cost Function With Respect to the Weights

It’s quite easy to get lost in the sea of notation, but a neural network is simply a composition of functions. So let’s think about this in terms of the classical chain rule from calculus.

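So what is $\frac{\partial C}{\partial w^L}$? Well, the chain rule tells us it’s

$$\frac{\partial C}{\partial w^L} = \frac{\partial C}{\partial a^L} \frac{\partial a^L}{\partial z^L} \frac{\partial z^L}{\partial w^L}$$

Now replacing $a^L = \sigma(z^L)$ and $z^L = w^L a^{L-1} + b^L$ we have

$$\frac{\partial C}{\partial w^L} = \frac{\partial C}{\partial a^L} \, \frac{\partial \sigma(z^L)}{\partial z^L} \, \frac{\partial \left(w^L a^{L-1} + b^L\right)}{\partial w^L}$$

But notice, what is $\frac{\partial a^L}{\partial z^L}$? Since the activations in our neural network are sigmoids, it is precisely the derivative of the sigmoid. The derivative of the sigmoid is

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$

Furthermore, we can easily calculate $\frac{\partial z^L}{\partial w^L}$ to be

$$\frac{\partial z^L}{\partial w^L} = a^{L-1}$$

So we finally have that

$$\frac{\partial C}{\partial w^L} = \left(\frac{\partial C}{\partial a^L} \odot \sigma'(z^L)\right) \left(a^{L-1}\right)^T$$

By convention we set

$$\delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L)$$

So we can finally get the famous

$$\frac{\partial C}{\partial w^L} = \delta^L \left(a^{L-1}\right)^T$$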

The Gradient of the Cost Function With Respect to the Bias

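Similarly, we can compute the gradient with respect to the bias term $b^L$. If we follow the chain rule again we can see that

$$\frac{\partial C}{\partial b^L} = \frac{\partial C}{\partial a^L} \frac{\partial a^L}{\partial z^L} \frac{\partial z^L}{\partial b^L}$$

We have already computed $\frac{\partial C}{\partial a^L}$ and $\frac{\partial a^L}{\partial z^L}$ from above, and $\frac{\partial z^L}{\partial b^L} = 1$. So quite elegantly it collapses down to

$$\frac{\partial C}{\partial b^L} = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L) = \delta^L$$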

func (this *Network[N]) Backpropogation(input *Vector[N],
  prediction *Vector[N],
  target *Vector[N],
  loss Cost[N]) ([]*Vector[N], []*Matrix[N]) {
  // First compute the error (delta) at the output layer.
  layer := int(this.layers - 1)
  costGradient := loss.Gradient(prediction, target)
  activationPrime := this.activationFunctionsPerLayer[layer].Gradient(this.preActivations[layer])
  delta := costGradient.HadamardMultiply(activationPrime)
  gradB := make([]*Vector[N], this.layers) // Bias gradients for each layer.
  gradW := make([]*Matrix[N], this.layers) // Weight gradients for each layer.
  // Backpropagate the error through the network...

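In order to backpropagate the error through all of the layers, we perform this iteratively, working backwards from the output. That means calculating, for each earlier layer $l$,

$$\delta^l = \left(\left(w^{l+1}\right)^T \delta^{l+1}\right) \odot \sigma'(z^l)$$

and then reusing the same gradient formulas, $\frac{\partial C}{\partial w^l} = \delta^l \left(a^{l-1}\right)^T$ and $\frac{\partial C}{\partial b^l} = \delta^l$.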

  // Remainder of the function: walk backwards from the output layer.
  for layer >= 0 {
    gradB[layer] = delta
    // The first layer takes the network input as its incoming activations.
    var inActivations *Vector[N]
    if layer == 0 {
      inActivations = input
    } else {
      inActivations = this.activations[layer-1]
    }
    weightGrad := delta.Multiply(inActivations.Transpose())
    gradW[layer] = weightGrad
    if layer > 0 {
      // Propagate delta back to the previous layer.
      wTransposeDelta := this.weights[layer].Transpose().Multiply(delta)
      activationGradient := this.activationFunctionsPerLayer[layer-1].Gradient(this.preActivations[layer-1])
      delta = wTransposeDelta.HadamardMultiply(activationGradient)
    }
    layer--
  }
  return gradB, gradW
}
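With the bias and weight gradients in hand, the gradient descent update can be applied layer by layer. The training loop is not shown in this post, so the following is only a sketch of a single update step on plain slices; `step`, its parameters, and the learning rate `eta` are hypothetical.

```go
package main

import "fmt"

// step applies w <- w - eta*gradW and b <- b - eta*gradB in place
// for one layer. w[j][k] is the weight from input k to neuron j.
func step(w [][]float64, b []float64, gradW [][]float64, gradB []float64, eta float64) {
	for j := range w {
		for k := range w[j] {
			w[j][k] -= eta * gradW[j][k]
		}
		b[j] -= eta * gradB[j]
	}
}

func main() {
	// Made-up values for a layer with one neuron and two inputs.
	w := [][]float64{{0.5, -0.5}}
	b := []float64{0.25}
	step(w, b, [][]float64{{1, -1}}, []float64{0.5}, 0.5)
	fmt.Println(w, b) // prints [[0 0]] [0]
}
```

In practice the gradients would typically be averaged over a mini-batch of training examples before each step.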

That’s it! We achieve approximately accuracy on the MNIST test dataset.