go-torch - a simple torch-like library in 1000 lines.
I was reading Jon Bodner’s book, Learning Go, and got the idea of recreating a PyTorch-like library in Go. Before this, I tried building a tensor library in Rust but couldn’t get far: Rust has a steep learning curve, and I’m still confused by its ownership and borrowing semantics.
So, I thought: why not Go?
The idea for go-torch is simple: build out just the essentials (tensor functions, autograd, backward pass, and a linear layer). That’s all we need for a working library that can actually train a network, like a digit recognizer for MNIST.
GitHub - https://github.com/Abinesh-Mathivanan/go-torch
I didn’t go fancy yet. There’s a lot of room to go faster with BLAS and Intel’s optimized kernels; I’ll wrap those soon, and we’ll get some cool functionality in the library.
We’ll go step by step through “how I built this”.
Project Structure
- Tensor/ -> Core tensor logic and computation graph tracking: all tensor operations (add, mul, matmul, reshape, transpose), plus Backward() for automatic differentiation
- nn/ -> Neural network layer definitions [Linear, ReLU, Sigmoid, Tanh, Softmax, CrossEntropyLoss]
- autograd/ -> Currently unused; placed for future refactor of backward/topo sort
- utils/ -> Benchmarking utilities

Architecture
The core of go-torch is the Tensor struct defined in tensor/tensor.go:
```go
type Tensor struct {
	shape        []int
	data         []float64
	Grad         *Tensor
	RequiresGrad bool
	Parents      []*Tensor
	Operation    string
	BackwardFunc func(*Tensor)
}
```

RequiresGrad, Parents, and BackwardFunc
When you create a tensor that needs gradients computed during training (like weights or biases), you set RequiresGrad = true.
When you perform an operation (say, adding two tensors A and B to get C = A + B), the resulting tensor C will have:
```go
C.RequiresGrad = A.RequiresGrad || B.RequiresGrad
```

If C.RequiresGrad is true, then:
- C.Parents is set to [A, B]
- C.BackwardFunc is defined
This BackwardFunc knows the gradient rule for the operation. For addition:
$$ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}, \quad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial C} $$
So, C.BackwardFunc just takes the incoming gradient dL/dC and passes it backward to A and B. This chaining of Parents and BackwardFuncs is how we build the computation graph — by simply doing math, everything gets tracked.
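To make this concrete, here is a minimal sketch of how an element-wise add could wire up the graph. This is illustrative rather than the actual go-torch code: it’s written as if inside the tensor package (so it can touch the unexported fields), it skips shape checks and broadcasting, and accumulateGrad is a hypothetical helper that adds a gradient into a tensor’s Grad field.

```go
// Sketch: element-wise add that records the computation graph.
func Add(a, b *Tensor) *Tensor {
	out := &Tensor{
		shape:        append([]int{}, a.shape...),
		data:         make([]float64, len(a.data)),
		RequiresGrad: a.RequiresGrad || b.RequiresGrad,
	}
	for i := range a.data {
		out.data[i] = a.data[i] + b.data[i]
	}
	if out.RequiresGrad {
		out.Parents = []*Tensor{a, b}
		out.Operation = "add"
		// dL/dA = dL/dC and dL/dB = dL/dC: pass the incoming gradient straight through.
		out.BackwardFunc = func(grad *Tensor) {
			if a.RequiresGrad {
				a.accumulateGrad(grad) // hypothetical helper: Grad += grad
			}
			if b.RequiresGrad {
				b.accumulateGrad(grad)
			}
		}
	}
	return out
}
```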
The Flow: Forward and Backward Pass
Let’s walk through what actually happens when you run the simple network demo in main.go.
Input Tensor (x)
You start by creating your input tensor x and set:
```go
x.RequiresGrad = true
```

This flags it for gradient tracking. You usually do this for weights and biases, but here we’re doing it just to show how gradients flow through the whole graph, even back to the input if needed.
Linear Layer (layer1)
You create your first linear layer. Inside, it sets up weight and bias tensors — both with RequiresGrad = true by default.
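For a rough picture of what such a layer looks like, here is a sketch of a Linear layer. The constructor and helper names (NewLinear, randomTensor, zeroTensor) are assumptions for illustration, not the real nn package API.

```go
// Sketch of a Linear layer (field and helper names are assumptions).
type Linear struct {
	weight *Tensor // shape [inFeatures, outFeatures]
	bias   *Tensor // shape [outFeatures]
}

func NewLinear(inFeatures, outFeatures int) *Linear {
	w := randomTensor([]int{inFeatures, outFeatures}) // hypothetical random-init helper
	b := zeroTensor([]int{outFeatures})               // hypothetical zero-init helper
	w.RequiresGrad = true
	b.RequiresGrad = true
	return &Linear{weight: w, bias: b}
}

// Parameters exposes the trainable tensors so the optimizer can update them later.
func (l *Linear) Parameters() []*Tensor {
	return []*Tensor{l.weight, l.bias}
}
```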
Forward Pass (layer1.Forward(x))
This step performs the standard matrix operation:
$$ O = XW + b $$

- Matrix Multiply: calls tensor.MatMulTensor(x, layer1.weight). The resulting output will have RequiresGrad = true, Parents = [x, weight], and a BackwardFunc based on the matrix multiplication rule:
$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial O} \cdot W^T, \quad \frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial O} $$
- Add Bias: then adds the bias via broadcasting with tensor.AddTensor(...), which again sets RequiresGrad = true, Parents = [matmul_result, bias], and a BackwardFunc:
$$ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}, \quad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial C} $$
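In code, the matmul rule above might be wired up like this. Again a sketch, not the real MatMulTensor: rawMatMul (a plain matrix multiply with no graph tracking), transpose, and accumulateGrad are hypothetical helpers.

```go
// Sketch: matmul that records its backward rule.
func MatMul(x, w *Tensor) *Tensor {
	out := rawMatMul(x, w) // hypothetical: plain matrix multiply, no graph tracking
	out.RequiresGrad = x.RequiresGrad || w.RequiresGrad
	if out.RequiresGrad {
		out.Parents = []*Tensor{x, w}
		out.Operation = "matmul"
		out.BackwardFunc = func(grad *Tensor) {
			if x.RequiresGrad {
				// dL/dX = dL/dO * W^T
				x.accumulateGrad(rawMatMul(grad, transpose(w)))
			}
			if w.RequiresGrad {
				// dL/dW = X^T * dL/dO
				w.accumulateGrad(rawMatMul(transpose(x), grad))
			}
		}
	}
	return out
}
```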
Activation Function (nn.RELU(h))
Applies ReLU activation:
$$ y = \max(0, x) $$
The output gets RequiresGrad = true, Parents = [h], and a BackwardFunc:
$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \mathbf{1}_{x > 0} $$
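Put together, a ReLU with its backward rule could look like the sketch below (same caveats as before: written as if inside the tensor package, with accumulateGrad as a hypothetical helper).

```go
// Sketch: ReLU forward pass plus its backward rule.
func ReLU(x *Tensor) *Tensor {
	out := &Tensor{
		shape:        append([]int{}, x.shape...),
		data:         make([]float64, len(x.data)),
		RequiresGrad: x.RequiresGrad,
	}
	for i, v := range x.data {
		if v > 0 {
			out.data[i] = v // negative inputs stay zero
		}
	}
	if out.RequiresGrad {
		out.Parents = []*Tensor{x}
		out.Operation = "relu"
		out.BackwardFunc = func(grad *Tensor) {
			// dL/dx = dL/dy where x > 0, and 0 elsewhere.
			masked := make([]float64, len(grad.data))
			for i := range masked {
				if x.data[i] > 0 {
					masked[i] = grad.data[i]
				}
			}
			x.accumulateGrad(&Tensor{shape: grad.shape, data: masked})
		}
	}
	return out
}
```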
Second Linear Layer (layer2.Forward(hRelu))
Same process:
- Matmul with weights
- Add bias
- Parents + backward functions set
Returns logits.
Loss Function (nn.CrossEntropyLoss(logits, targets))
Calculates final loss:
$$ \text{loss} = -\sum_i y_i \log(\text{softmax}(x_i)) $$
The resulting loss tensor has RequiresGrad = true, Parents = [logits], and a BackwardFunc:
$$ \frac{\partial L}{\partial \mathrm{logits}} = \frac{\operatorname{softmax}(\mathrm{logits}) - \operatorname{onehot}(\mathrm{targets})}{\mathrm{batch\ size}} $$
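This formula is why softmax + cross-entropy is such a nice pair: the backward pass reduces to “softmax minus one-hot, divided by batch size”. Here is a standalone sketch of that computation on raw slices (illustrative; the real nn.CrossEntropyLoss operates on tensors):

```go
import "math"

// Sketch: gradient of cross-entropy loss w.r.t. logits.
// logits is a [batch, classes] matrix in row-major order; targets holds
// the correct class index for each row.
func crossEntropyGrad(logits []float64, targets []int, batch, classes int) []float64 {
	grad := make([]float64, len(logits))
	for r := 0; r < batch; r++ {
		row := logits[r*classes : (r+1)*classes]
		// Numerically stable softmax: subtract the row max before exponentiating.
		maxVal := row[0]
		for _, v := range row {
			if v > maxVal {
				maxVal = v
			}
		}
		sum := 0.0
		for c, v := range row {
			e := math.Exp(v - maxVal)
			grad[r*classes+c] = e
			sum += e
		}
		for c := 0; c < classes; c++ {
			p := grad[r*classes+c] / sum // softmax probability
			if c == targets[r] {
				p -= 1.0 // subtract the one-hot target
			}
			grad[r*classes+c] = p / float64(batch)
		}
	}
	return grad
}
```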
Backward Pass (loss.Backward(nil))
Starts from the scalar loss node:

- loss.Grad is seeded with 1.0
- loss.BackwardFunc runs with that seed
- Gradients backpropagate to logits
- Then to layer2.weight, layer2.bias, and hRelu
- hRelu applies the ReLU derivative
- Then to layer1.weight, layer1.bias, and finally to x
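Under the hood, Backward() has to fire each BackwardFunc only after all gradients flowing into that node have been accumulated. The standard trick is a depth-first topological sort of the graph followed by a reverse walk; here is a sketch of that idea (illustrative; onesLike is a hypothetical helper, and the real traversal in tensor/ may differ):

```go
// Sketch: reverse-mode autodiff driver using a topological sort.
func (t *Tensor) Backward(seed *Tensor) {
	if seed == nil {
		seed = onesLike(t) // hypothetical: tensor of ones with t's shape
	}
	t.Grad = seed

	// Depth-first topological sort: parents land before children in `order`.
	var order []*Tensor
	visited := map[*Tensor]bool{}
	var visit func(n *Tensor)
	visit = func(n *Tensor) {
		if visited[n] {
			return
		}
		visited[n] = true
		for _, p := range n.Parents {
			visit(p)
		}
		order = append(order, n)
	}
	visit(t)

	// Walk the graph in reverse, so each node's gradient is complete
	// before its BackwardFunc pushes it on to the node's parents.
	for i := len(order) - 1; i >= 0; i-- {
		n := order[i]
		if n.BackwardFunc != nil && n.Grad != nil {
			n.BackwardFunc(n.Grad)
		}
	}
}
```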
Every tensor in the graph with RequiresGrad = true will now have its .Grad field populated:
- x.Grad
- layer1.weight.Grad, layer1.bias.Grad
- layer2.weight.Grad, layer2.bias.Grad

The Optimizer: Stochastic Gradient Descent (SGD)
To apply updates, the optimizer must know which tensors to manage.
Each layer in go-torch provides a Parameters() method that returns a slice of all trainable tensors:
```go
allParams := append(layer1.Parameters(), layer2.Parameters()...)
```

Once we gather all the parameters, we construct the optimizer:
```go
opt, err := optimizer.NewSGD(allParams, 0.1)
if err != nil {
	log.Fatal(err)
}
```

This tells the optimizer:
- Which tensors to update,
- What learning rate to use,
- To expect .Grad to be populated on each tensor after .Backward() is called.
For each parameter $ \theta $, the update rule is:
$$ \theta \leftarrow \theta - \eta \cdot \nabla_\theta L $$
Where:
- $ \eta $ is the learning rate (a small positive constant),
- $ \nabla_\theta L $ is the gradient of the loss $ L $ with respect to parameter $ \theta $.
This update moves each parameter in the direction that most reduces the loss.
In go-torch, this logic is implemented in the optimizer/sgd.go file. Below is the core implementation:
```go
func (s *SGD) Step() error {
	for _, p := range s.parameters {
		if p.Grad == nil {
			continue
		}
		paramData := p.GetData()
		gradData := p.Grad.GetData()
		for i := range paramData {
			paramData[i] -= s.learningRate * gradData[i]
		}
	}
	return nil
}
```

Before starting the next training step, we must clear the previous gradients:
```go
opt.ZeroGrad()
```

This is crucial; otherwise, gradients from multiple steps accumulate and corrupt learning. It works by calling ZeroGrad() on each parameter tensor:
```go
func (s *SGD) ZeroGrad() {
	for _, p := range s.parameters {
		p.ZeroGrad()
	}
}
```

Each tensor’s ZeroGrad() sets its Grad.data to all zeroes.
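Putting all the pieces together, a training step reads much like the usual PyTorch loop. Here is a sketch using the names from above (the surrounding setup, x, targets, and layer construction, is assumed):

```go
// Sketch of a training loop wiring together forward, backward, and SGD.
for epoch := 0; epoch < 100; epoch++ {
	// Forward pass through the network.
	h := layer1.Forward(x)
	hRelu := nn.RELU(h)
	logits := layer2.Forward(hRelu)
	loss := nn.CrossEntropyLoss(logits, targets)

	// Clear stale gradients, then backpropagate from the loss.
	opt.ZeroGrad()
	loss.Backward(nil)

	// theta <- theta - lr * grad for every parameter.
	if err := opt.Step(); err != nil {
		log.Fatal(err)
	}
}
```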
Benchmark
Some performance measurements for common operations and a simple training step in go-torch:
| Operation | 128×128 | 512×512 | 1024×1024 |
|---|---|---|---|
| Element-wise Add | 187.602 µs | 2.326982 ms | 9.558306 ms |
| Element-wise Mul | 126.740 µs | 2.256796 ms | 10.684073 ms |
| Matrix Multiply | 8.514523 ms | 1.156596279 s | 15.784033 s |
| ReLU Activation | 226.385 µs | 4.192483 ms | 6.26745 ms |

| Operation | Configuration | Avg Time per Iteration |
|---|---|---|
| Linear Layer Forward | Batch: 32, In:128, Out:10 | 310.494 µs |
| CrossEntropyLoss | Batch: 32, Classes: 10 | 39.996 µs |
| Full Forward-Backward Pass | Net: 128-256-10, Batch: 32 | 28.68176 ms |
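If you want to reproduce numbers like these, Go’s built-in testing package does the timing for you. A sketch (the module import path is taken from the repo URL, and randomTensor is a hypothetical init helper; see utils/ for the actual benchmark code):

```go
import (
	"testing"

	"github.com/Abinesh-Mathivanan/go-torch/tensor" // import path assumed from the repo URL
)

// Sketch: benchmarking a 512x512 matmul; run with `go test -bench=.`.
func BenchmarkMatMul512(b *testing.B) {
	x := randomTensor([]int{512, 512}) // hypothetical random-init helper
	y := randomTensor([]int{512, 512})
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tensor.MatMulTensor(x, y)
	}
}
```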