
go-torch - a simple torch-like library in 1000 lines.

2025-05-14

I was reading Jon Bodner’s book, Learning Go, and got the idea of recreating a PyTorch-like library in Go. Before this, I tried creating a tensor library in Rust, but couldn’t get far: Rust had a steep learning curve, and I was still confused by its ownership and borrowing semantics.

So, I thought: why not Go?

The idea for go-torch is simple — just build out the essentials: tensor functions, autograd, a backward pass, and a linear layer. That’s all we need for a working library that can actually train a network — like a digit recognizer for MNIST.


GitHub - https://github.com/Abinesh-Mathivanan/go-torch


I didn’t go fancy yet. There’s a lot of stuff to handle under the hood with BLAS and Intel kernels; I’ll wrap those soon, and then we’ll have some cool functionality in the library.

We’ll go step by step through how I built this.


Project Structure

tensor/   -> Core tensor logic and computation-graph tracking: all tensor
             operations (add, mul, matmul, reshape, transpose), plus
             Backward() for automatic differentiation
nn/       -> Neural network layer definitions (Linear, ReLU, Sigmoid, Tanh,
             Softmax, CrossEntropyLoss)
autograd/ -> Currently unused; reserved for a future refactor of the
             backward pass and topological sort
utils/    -> Benchmarking utilities

Architecture

The core of go-torch is the Tensor struct defined in tensor/tensor.go:

type Tensor struct {
    shape        []int         // dimensions, e.g. [batch, features]
    data         []float64     // flat storage
    Grad         *Tensor       // populated by Backward()
    RequiresGrad bool          // should gradients be tracked?
    Parents      []*Tensor     // inputs that produced this tensor
    Operation    string        // name of the op that created it
    BackwardFunc func(*Tensor) // gradient rule for that op
}

RequiresGrad, Parents, and BackwardFunc

When you create a tensor that needs gradients computed during training (like weights or biases), you set RequiresGrad = true.

When you perform an operation (say, adding two tensors A and B to get C = A + B), the resulting tensor C will have:

C.RequiresGrad = A.RequiresGrad || B.RequiresGrad

If C.RequiresGrad is true, then:

  • C.Parents is set to [A, B]
  • C.BackwardFunc is defined

This BackwardFunc knows the gradient rule for the operation. For addition:

$$ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}, \quad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial C} $$

So, C.BackwardFunc just takes the incoming gradient dL/dC and passes it backward to A and B. This chaining of Parents and BackwardFuncs is how we build the computation graph — by simply doing math, everything gets tracked.
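To make this concrete, here is a self-contained toy version of that wiring. It mirrors the Tensor struct above but is not go-torch’s actual code; in particular, a real implementation accumulates into an existing Grad rather than overwriting it.

package main

import "fmt"

// Toy mirror of go-torch's Tensor, trimmed to the graph fields.
type Tensor struct {
    data         []float64
    Grad         *Tensor
    RequiresGrad bool
    Parents      []*Tensor
    BackwardFunc func(*Tensor)
}

// AddTensor computes C = A + B and records how to undo it.
func AddTensor(a, b *Tensor) *Tensor {
    out := make([]float64, len(a.data))
    for i := range a.data {
        out[i] = a.data[i] + b.data[i]
    }
    c := &Tensor{data: out}
    c.RequiresGrad = a.RequiresGrad || b.RequiresGrad
    if c.RequiresGrad {
        c.Parents = []*Tensor{a, b}
        // Addition passes the incoming gradient through unchanged:
        // dL/dA = dL/dC and dL/dB = dL/dC.
        c.BackwardFunc = func(grad *Tensor) {
            if a.RequiresGrad {
                a.Grad = grad // real code would accumulate, not assign
            }
            if b.RequiresGrad {
                b.Grad = grad
            }
        }
    }
    return c
}

func main() {
    a := &Tensor{data: []float64{1, 2}, RequiresGrad: true}
    b := &Tensor{data: []float64{3, 4}}
    c := AddTensor(a, b)
    c.BackwardFunc(&Tensor{data: []float64{1, 1}}) // pretend dL/dC = 1
    fmt.Println(c.data, a.Grad.data)               // [4 6] [1 1]
}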


The Flow: Forward and Backward Pass

Let’s walk through what actually happens when you run the simple network demo in main.go.

Input Tensor (x)

You start by creating your input tensor x and set:

x.RequiresGrad = true

This flags it for gradient tracking. You usually do this for weights and biases, but here we’re doing it just to show how gradients flow through the whole graph — even back to the input if needed.

Linear Layer (layer1)

You create your first linear layer. Inside, it sets up weight and bias tensors — both with RequiresGrad = true by default.
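A sketch of what that setup can look like. The constructor name and the init helpers here are assumptions for illustration; only the weight/bias fields and the RequiresGrad default come from the behavior described above.

// Sketch only: NewLinear and the init helpers are hypothetical.
type Linear struct {
    weight *tensor.Tensor // shape [in, out]
    bias   *tensor.Tensor // shape [1, out], broadcast over the batch
}

func NewLinear(in, out int) *Linear {
    w := randomTensor(in, out) // hypothetical: small random init
    b := zerosTensor(1, out)   // hypothetical: zero init
    w.RequiresGrad = true      // trainable by default
    b.RequiresGrad = true
    return &Linear{weight: w, bias: b}
}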

Forward Pass (layer1.Forward(x))

This step performs the standard matrix operation (sketched in code after the two steps below):

$$ O = XW + b $$
  1. Matrix Multiply:
    Calls tensor.MatMulTensor(x, layer1.weight). The resulting output will have:

    • RequiresGrad = true
    • Parents = [x, weight]
    • BackwardFunc based on the matrix multiplication rule:

    $$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial O} \cdot W^T, \quad \frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial O} $$

  2. Add Bias:
    Then, adds the bias via broadcasting with tensor.AddTensor(...), which again sets:

    • RequiresGrad = true
    • Parents = [matmul_result, bias]
    • BackwardFunc:

    $$ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}, \quad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial C} $$
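Putting the two steps together, the layer’s Forward is just a composition of the tracked ops, so the graph builds itself as a side effect of the math. This sketch uses the tensor.MatMulTensor / tensor.AddTensor calls named above; the exact signatures (return values, receiver shape) are assumptions.

// Forward computes O = XW + b through the tracked ops, so Parents
// and BackwardFuncs get recorded automatically. Signatures assumed.
func (l *Linear) Forward(x *tensor.Tensor) *tensor.Tensor {
    h := tensor.MatMulTensor(x, l.weight) // Parents = [x, weight]
    return tensor.AddTensor(h, l.bias)    // Parents = [h, bias]
}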

Activation Function (nn.RELU(h))

Applies ReLU activation:

$$ y = \max(0, x) $$

  • Output gets RequiresGrad = true
  • Parents = [h]
  • BackwardFunc:

$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \mathbf{1}_{x > 0} $$
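As a plain-slice toy (not go-torch’s code), the backward rule looks like this: the upstream gradient flows through only where the input was positive.

// reluBackward applies dL/dx = dL/dy * 1_{x > 0} element-wise.
// Plain slices for illustration; go-torch stores this logic in the
// output tensor's BackwardFunc.
func reluBackward(input, upstream []float64) []float64 {
    grad := make([]float64, len(input))
    for i, x := range input {
        if x > 0 {
            grad[i] = upstream[i]
        } // else grad[i] stays 0: ReLU blocked this element
    }
    return grad
}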

Second Linear Layer (layer2.Forward(hRelu))

Same process:

  • Matmul with weights
  • Add bias
  • Parents + backward functions set

Returns logits.

Loss Function (nn.CrossEntropyLoss(logits, targets))

Calculates final loss:

$$ \text{loss} = -\sum_i y_i \log(\text{softmax}(x_i)) $$

  • Resulting tensor loss:
    • RequiresGrad = true
    • Parents = [logits]
    • BackwardFunc:

$$ \frac{\partial L}{\partial \mathrm{logits}} = \frac{\operatorname{softmax}(\mathrm{logits}) - \operatorname{onehot}(\mathrm{targets})}{\mathrm{batch\ size}} $$
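A plain-slice toy of that gradient (not the library’s implementation): compute a numerically stable softmax per row, then subtract 1 at the target index and divide by the batch size.

import "math"

// crossEntropyGrad returns (softmax(logits) - onehot(targets)) / batch,
// one row per sample.
func crossEntropyGrad(logits [][]float64, targets []int) [][]float64 {
    batch := float64(len(logits))
    grads := make([][]float64, len(logits))
    for i, row := range logits {
        // Stable softmax: shift by the row max before exponentiating.
        maxV := row[0]
        for _, v := range row {
            if v > maxV {
                maxV = v
            }
        }
        sum := 0.0
        g := make([]float64, len(row))
        for j, v := range row {
            g[j] = math.Exp(v - maxV)
            sum += g[j]
        }
        for j := range g {
            g[j] = g[j] / sum / batch // softmax / batch
        }
        g[targets[i]] -= 1.0 / batch // subtract onehot / batch
        grads[i] = g
    }
    return grads
}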

Backward Pass (loss.Backward(nil))

Starts from the scalar loss node:

  1. loss.Grad is seeded with 1.0 (since dL/dL = 1)
  2. loss.BackwardFunc runs with that seed gradient (see the traversal sketch after this list)
  3. Gradients backpropagate to:
    • logits
    • Then to layer2.weight, layer2.bias, and hRelu
    • Then hRelu applies the ReLU derivative
    • Then to layer1.weight, layer1.bias, and finally to x

Every tensor in the graph with RequiresGrad = true will now have its .Grad field populated:

x.Grad
layer1.weight.Grad, layer1.bias.Grad
layer2.weight.Grad, layer2.bias.Grad
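Under the hood, a backward pass like this is typically a topological sort followed by a reverse walk. A sketch of that traversal using the struct fields above (onesLike is a hypothetical helper, and gradient handling is simplified):

// Backward seeds dL/dLoss = 1, topologically sorts the graph via
// Parents, then calls each node's BackwardFunc in reverse order so
// every tensor's Grad is complete before it propagates further.
func Backward(loss *Tensor) {
    var order []*Tensor
    visited := map[*Tensor]bool{}
    var visit func(t *Tensor)
    visit = func(t *Tensor) {
        if visited[t] {
            return
        }
        visited[t] = true
        for _, p := range t.Parents {
            visit(p)
        }
        order = append(order, t) // parents land before children
    }
    visit(loss)

    loss.Grad = onesLike(loss) // hypothetical helper: dL/dL = 1
    for i := len(order) - 1; i >= 0; i-- {
        if t := order[i]; t.BackwardFunc != nil && t.Grad != nil {
            t.BackwardFunc(t.Grad) // push gradient to t's Parents
        }
    }
}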

The Optimizer: Stochastic Gradient Descent (SGD)

To apply updates, the optimizer must know which tensors to manage.

Each layer in go-torch provides a Parameters() method that returns a slice of all trainable tensors:

allParams := append(layer1.Parameters(), layer2.Parameters()...)

Once we gather all the parameters, we construct the optimizer:

opt, err := optimizer.NewSGD(allParams, 0.1)
if err != nil {
    log.Fatal(err)
}

This tells the optimizer:

  • Which tensors to update,
  • What learning rate to use,
  • To expect .Grad to be populated on each tensor after .Backward() is called.

For each parameter $ \theta $, the update rule is:

$$ \theta \leftarrow \theta - \eta \cdot \nabla_\theta L $$

Where:

  • $ \eta $ is the learning rate (a small positive constant),
  • $ \nabla_\theta L $ is the gradient of the loss $ L $ with respect to parameter $ \theta $.

This update moves each parameter in the direction that most reduces the loss.

In go-torch, this logic is implemented in the optimizer/sgd.go file. Below is the core implementation:

func (s *SGD) Step() error {
    for _, p := range s.parameters {
        if p.Grad == nil {
            continue // no gradient reached this parameter; skip it
        }
        paramData := p.GetData()
        gradData := p.Grad.GetData()
        for i := range paramData {
            // theta <- theta - eta * dL/dtheta, updated in place
            paramData[i] -= s.learningRate * gradData[i]
        }
    }
    return nil
}

Before starting the next training step, we must clear the previous gradients:

opt.ZeroGrad()

This is crucial—otherwise, gradients from multiple steps accumulate and corrupt learning. This works by calling ZeroGrad() on each parameter tensor:

func (s *SGD) ZeroGrad() {
    for _, p := range s.parameters {
        p.ZeroGrad()
    }
}

Each tensor’s ZeroGrad() sets its Grad.data to all zeroes.
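Putting it all together, one training step reads like this. The layer, loss, and optimizer calls match the names used throughout this post; the loop structure and error handling are illustrative.

for epoch := 0; epoch < 100; epoch++ {
    // Forward pass: every op records Parents + BackwardFunc.
    h := layer1.Forward(x)
    hRelu := nn.RELU(h)
    logits := layer2.Forward(hRelu)
    loss := nn.CrossEntropyLoss(logits, targets)

    opt.ZeroGrad()     // clear last step's gradients first
    loss.Backward(nil) // populate .Grad across the graph
    if err := opt.Step(); err != nil { // theta <- theta - lr * grad
        log.Fatal(err)
    }
}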


Benchmark

Some performance measurements for common operations and a simple training step in go-torch:

| Operation | 128×128 | 512×512 | 1024×1024 |
|---|---|---|---|
| Element-wise Add | 187.602 µs | 2.326982 ms | 9.558306 ms |
| Element-wise Mul | 126.740 µs | 2.256796 ms | 10.684073 ms |
| Matrix Multiply | 8.514523 ms | 1.156596279 s | 15.784033 s |
| ReLU Activation | 226.385 µs | 4.192483 ms | 6.26745 ms |

| Operation | Configuration | Avg Time per Iteration |
|---|---|---|
| Linear Layer Forward | Batch: 32, In: 128, Out: 10 | 310.494 µs |
| CrossEntropyLoss | Batch: 32, Classes: 10 | 39.996 µs |
| Full Forward-Backward Pass | Net: 128-256-10, Batch: 32 | 28.68176 ms |