go-torch - a simple torch-like library in 1000 lines.
I was reading Jon Bodner’s book, Learning Go, and got the idea of recreating a PyTorch-like library in Go. Before this, I tried building a tensor library in Rust but couldn’t get far: Rust has a steep learning curve, and I’m still confused by its ownership and borrowing semantics.
So, I thought: why not Go?
The idea for go-torch is simple: build out just the essentials (tensor functions, autograd, backward pass, and a linear layer). That’s all we need for a working library that can actually train a network, like a digit recognizer for MNIST.
GitHub - https://github.com/Abinesh-Mathivanan/go-torch
I didn’t go fancy yet. There’s a lot of room to go faster with BLAS and Intel’s optimized kernels; I’ll wrap those soon, and we’ll get some cool functionality in the library.
We’ll go step by step through “how I built this”.
Project Structure
- Tensor/ -> Core tensor logic and computation graph tracking: all tensor operations (add, mul, matmul, reshape, transpose), plus Backward() for automatic differentiation
- nn/ -> Neural network layer definitions [Linear, ReLU, Sigmoid, Tanh, Softmax, CrossEntropyLoss]
- autograd/ -> Currently unused; placed for future refactor of backward/topo sort
- utils/ -> Benchmarking utilities

Architecture
The core of go-torch is the Tensor struct defined in tensor/tensor.go:
```go
type Tensor struct {
	shape        []int
	data         []float64
	Grad         *Tensor
	RequiresGrad bool
	Parents      []*Tensor
	Operation    string
	BackwardFunc func(*Tensor)
}
```

RequiresGrad, Parents, and BackwardFunc
When you create a tensor that needs gradients computed during training (like weights or biases), you set RequiresGrad = true.
When you perform an operation (say, adding two tensors A and B to get C = A + B), the resulting tensor C will have:
```go
C.RequiresGrad = A.RequiresGrad || B.RequiresGrad
```

If C.RequiresGrad is true, then:
- C.Parents is set to [A, B]
- C.BackwardFunc is defined
This BackwardFunc knows the gradient rule for the operation. For addition:
$$ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}, \quad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial C} $$
So, C.BackwardFunc just takes the incoming gradient dL/dC and passes it backward to A and B. This chaining of Parents and BackwardFuncs is how we build the computation graph — by simply doing math, everything gets tracked.
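To make this concrete, here is a minimal sketch of how an element-wise add could wire up the graph. This is illustrative rather than the actual go-torch code: it’s written as if inside the tensor package (so it can touch the unexported fields), it skips shape checks and broadcasting, and accumulateGrad is a hypothetical helper that adds a gradient into a tensor’s Grad field.

```go
// Sketch: element-wise add that records the computation graph.
func Add(a, b *Tensor) *Tensor {
	out := &Tensor{
		shape:        append([]int{}, a.shape...),
		data:         make([]float64, len(a.data)),
		RequiresGrad: a.RequiresGrad || b.RequiresGrad,
	}
	for i := range a.data {
		out.data[i] = a.data[i] + b.data[i]
	}
	if out.RequiresGrad {
		out.Parents = []*Tensor{a, b}
		out.Operation = "add"
		// dL/dA = dL/dC and dL/dB = dL/dC: pass the incoming gradient straight through.
		out.BackwardFunc = func(grad *Tensor) {
			if a.RequiresGrad {
				a.accumulateGrad(grad) // hypothetical helper: Grad += grad
			}
			if b.RequiresGrad {
				b.accumulateGrad(grad)
			}
		}
	}
	return out
}
```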
The Flow: Forward and Backward Pass
Let’s walk through what actually happens when you run the simple network demo in main.go.
Input Tensor (x)
You start by creating your input tensor x and set:
```go
x.RequiresGrad = true
```

This flags it for gradient tracking. You usually do this for weights and biases, but here we’re doing it just to show how gradients flow through the whole graph, even back to the input if needed.
Linear Layer (layer1)
You create your first linear layer. Inside, it sets up weight and bias tensors — both with RequiresGrad = true by default.
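For a rough picture of what such a layer looks like, here is a sketch of a Linear layer. The constructor and helper names (NewLinear, randomTensor, zeroTensor) are assumptions for illustration, not the real nn package API.

```go
// Sketch of a Linear layer (field and helper names are assumptions).
type Linear struct {
	weight *Tensor // shape [inFeatures, outFeatures]
	bias   *Tensor // shape [outFeatures]
}

func NewLinear(inFeatures, outFeatures int) *Linear {
	w := randomTensor([]int{inFeatures, outFeatures}) // hypothetical random-init helper
	b := zeroTensor([]int{outFeatures})               // hypothetical zero-init helper
	w.RequiresGrad = true
	b.RequiresGrad = true
	return &Linear{weight: w, bias: b}
}

// Parameters exposes the trainable tensors so the optimizer can update them later.
func (l *Linear) Parameters() []*Tensor {
	return []*Tensor{l.weight, l.bias}
}
```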
Forward Pass (layer1.Forward(x))
This step performs the standard matrix operation:
$$ O = XW + b $$

- Matrix Multiply: calls tensor.MatMulTensor(x, layer1.weight). The resulting output will have RequiresGrad = true, Parents = [x, weight], and a BackwardFunc based on the matrix multiplication rule:
$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial O} \cdot W^T, \quad \frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial O} $$
- Add Bias: then adds the bias via broadcasting with tensor.AddTensor(...), which again sets RequiresGrad = true, Parents = [matmul_result, bias], and a BackwardFunc:
$$ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}, \quad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial C} $$
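In code, the matmul rule above might be wired up like this. Again a sketch, not the real MatMulTensor: rawMatMul (a plain matrix multiply with no graph tracking), transpose, and accumulateGrad are hypothetical helpers.

```go
// Sketch: matmul that records its backward rule.
func MatMul(x, w *Tensor) *Tensor {
	out := rawMatMul(x, w) // hypothetical: plain matrix multiply, no graph tracking
	out.RequiresGrad = x.RequiresGrad || w.RequiresGrad
	if out.RequiresGrad {
		out.Parents = []*Tensor{x, w}
		out.Operation = "matmul"
		out.BackwardFunc = func(grad *Tensor) {
			if x.RequiresGrad {
				// dL/dX = dL/dO * W^T
				x.accumulateGrad(rawMatMul(grad, transpose(w)))
			}
			if w.RequiresGrad {
				// dL/dW = X^T * dL/dO
				w.accumulateGrad(rawMatMul(transpose(x), grad))
			}
		}
	}
	return out
}
```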
Activation Function (nn.RELU(h))
Applies ReLU activation:
$$ y = \max(0, x) $$
The output gets RequiresGrad = true, Parents = [h], and a BackwardFunc:
$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \mathbf{1}_{x > 0} $$
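Put together, a ReLU with its backward rule could look like the sketch below (same caveats as before: written as if inside the tensor package, with accumulateGrad as a hypothetical helper).

```go
// Sketch: ReLU forward pass plus its backward rule.
func ReLU(x *Tensor) *Tensor {
	out := &Tensor{
		shape:        append([]int{}, x.shape...),
		data:         make([]float64, len(x.data)),
		RequiresGrad: x.RequiresGrad,
	}
	for i, v := range x.data {
		if v > 0 {
			out.data[i] = v // negative inputs stay zero
		}
	}
	if out.RequiresGrad {
		out.Parents = []*Tensor{x}
		out.Operation = "relu"
		out.BackwardFunc = func(grad *Tensor) {
			// dL/dx = dL/dy where x > 0, and 0 elsewhere.
			masked := make([]float64, len(grad.data))
			for i := range masked {
				if x.data[i] > 0 {
					masked[i] = grad.data[i]
				}
			}
			x.accumulateGrad(&Tensor{shape: grad.shape, data: masked})
		}
	}
	return out
}
```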
Second Linear Layer (layer2.Forward(hRelu))
Same process:
- Matmul with weights
- Add bias
- Parents + backward functions set
Returns logits.
Loss Function (nn.CrossEntropyLoss(logits, targets))
Calculates final loss:
$$ \text{loss} = -\sum_i y_i \log(\text{softmax}(x_i)) $$
The resulting loss tensor has RequiresGrad = true, Parents = [logits], and a BackwardFunc:
$$ \frac{\partial L}{\partial \mathrm{logits}} = \frac{\operatorname{softmax}(\mathrm{logits}) - \operatorname{onehot}(\mathrm{targets})}{\mathrm{batch\ size}} $$
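This formula is why softmax + cross-entropy is such a nice pair: the backward pass reduces to “softmax minus one-hot, divided by batch size”. Here is a standalone sketch of that computation on raw slices (illustrative; the real nn.CrossEntropyLoss operates on tensors):

```go
import "math"

// Sketch: gradient of cross-entropy loss w.r.t. logits.
// logits is a [batch, classes] matrix in row-major order; targets holds
// the correct class index for each row.
func crossEntropyGrad(logits []float64, targets []int, batch, classes int) []float64 {
	grad := make([]float64, len(logits))
	for r := 0; r < batch; r++ {
		row := logits[r*classes : (r+1)*classes]
		// Numerically stable softmax: subtract the row max before exponentiating.
		maxVal := row[0]
		for _, v := range row {
			if v > maxVal {
				maxVal = v
			}
		}
		sum := 0.0
		for c, v := range row {
			e := math.Exp(v - maxVal)
			grad[r*classes+c] = e
			sum += e
		}
		for c := 0; c < classes; c++ {
			p := grad[r*classes+c] / sum // softmax probability
			if c == targets[r] {
				p -= 1.0 // subtract the one-hot target
			}
			grad[r*classes+c] = p / float64(batch)
		}
	}
	return grad
}
```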
Backward Pass (loss.Backward(nil))
Starts from the scalar loss node:

- loss.Grad is seeded with 1.0
- loss.BackwardFunc runs with that seed
- Gradients backpropagate to logits
- Then to layer2.weight, layer2.bias, and hRelu
- hRelu applies the ReLU derivative
- Then to layer1.weight, layer1.bias, and finally to x
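Under the hood, Backward() has to fire each BackwardFunc only after all gradients flowing into that node have been accumulated. The standard trick is a depth-first topological sort of the graph followed by a reverse walk; here is a sketch of that idea (illustrative; onesLike is a hypothetical helper, and the real traversal in tensor/ may differ):

```go
// Sketch: reverse-mode autodiff driver using a topological sort.
func (t *Tensor) Backward(seed *Tensor) {
	if seed == nil {
		seed = onesLike(t) // hypothetical: tensor of ones with t's shape
	}
	t.Grad = seed

	// Depth-first topological sort: parents land before children in `order`.
	var order []*Tensor
	visited := map[*Tensor]bool{}
	var visit func(n *Tensor)
	visit = func(n *Tensor) {
		if visited[n] {
			return
		}
		visited[n] = true
		for _, p := range n.Parents {
			visit(p)
		}
		order = append(order, n)
	}
	visit(t)

	// Walk the graph in reverse, so each node's gradient is complete
	// before its BackwardFunc pushes it on to the node's parents.
	for i := len(order) - 1; i >= 0; i-- {
		n := order[i]
		if n.BackwardFunc != nil && n.Grad != nil {
			n.BackwardFunc(n.Grad)
		}
	}
}
```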
Every tensor in the graph with RequiresGrad = true will now have its .Grad field populated:
- x.Grad
- layer1.weight.Grad, layer1.bias.Grad
- layer2.weight.Grad, layer2.bias.Grad

The Optimizer: Stochastic Gradient Descent (SGD)
To apply updates, the optimizer must know which tensors to manage.
Each layer in go-torch provides a Parameters() method that returns a slice of all trainable tensors:
```go
allParams := append(layer1.Parameters(), layer2.Parameters()...)
```

Once we gather all the parameters, we construct the optimizer:
```go
opt, err := optimizer.NewSGD(allParams, 0.1)
if err != nil {
	log.Fatal(err)
}
```

This tells the optimizer:
- Which tensors to update,
- What learning rate to use,
- To expect .Grad to be populated on each tensor after .Backward() is called.
For each parameter $ \theta $, the update rule is:
$$ \theta \leftarrow \theta - \eta \cdot \nabla_\theta L $$
Where:
- $ \eta $ is the learning rate (a small positive constant),
- $ \nabla_\theta L $ is the gradient of the loss $ L $ with respect to parameter $ \theta $.
This update moves each parameter in the direction that most reduces the loss.
In go-torch, this logic is implemented in the optimizer/sgd.go file. Below is the core implementation:
```go
func (s *SGD) Step() error {
	for _, p := range s.parameters {
		if p.Grad == nil {
			continue
		}
		paramData := p.GetData()
		gradData := p.Grad.GetData()
		for i := range paramData {
			paramData[i] -= s.learningRate * gradData[i]
		}
	}
	return nil
}
```

Before starting the next training step, we must clear the previous gradients:
```go
opt.ZeroGrad()
```

This is crucial; otherwise, gradients from multiple steps accumulate and corrupt learning. It works by calling ZeroGrad() on each parameter tensor:
```go
func (s *SGD) ZeroGrad() {
	for _, p := range s.parameters {
		p.ZeroGrad()
	}
}
```

Each tensor’s ZeroGrad() sets its Grad.data to all zeroes.
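Putting all the pieces together, a training step reads much like the usual PyTorch loop. Here is a sketch using the names from above (the surrounding setup, x, targets, and layer construction, is assumed):

```go
// Sketch of a training loop wiring together forward, backward, and SGD.
for epoch := 0; epoch < 100; epoch++ {
	// Forward pass through the network.
	h := layer1.Forward(x)
	hRelu := nn.RELU(h)
	logits := layer2.Forward(hRelu)
	loss := nn.CrossEntropyLoss(logits, targets)

	// Clear stale gradients, then backpropagate from the loss.
	opt.ZeroGrad()
	loss.Backward(nil)

	// theta <- theta - lr * grad for every parameter.
	if err := opt.Step(); err != nil {
		log.Fatal(err)
	}
}
```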
Benchmark
Some performance measurements for common operations and a simple training step in go-torch:
| Operation | 128×128 | 512×512 | 1024×1024 |
|---|---|---|---|
| Element-wise Add | 187.602 µs | 2.326982 ms | 9.558306 ms |
| Element-wise Mul | 126.740 µs | 2.256796 ms | 10.684073 ms |
| Matrix Multiply | 8.514523 ms | 1.156596279 s | 15.784033 s |
| ReLU Activation | 226.385 µs | 4.192483 ms | 6.26745 ms |

| Operation | Configuration | Avg Time per Iteration |
|---|---|---|
| Linear Layer Forward | Batch: 32, In:128, Out:10 | 310.494 µs |
| CrossEntropyLoss | Batch: 32, Classes: 10 | 39.996 µs |
| Full Forward-Backward Pass | Net: 128-256-10, Batch: 32 | 28.68176 ms |
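If you want to reproduce numbers like these, Go’s built-in testing package does the timing for you. A sketch (the module import path is taken from the repo URL, and randomTensor is a hypothetical init helper; see utils/ for the actual benchmark code):

```go
import (
	"testing"

	"github.com/Abinesh-Mathivanan/go-torch/tensor" // import path assumed from the repo URL
)

// Sketch: benchmarking a 512x512 matmul; run with `go test -bench=.`.
func BenchmarkMatMul512(b *testing.B) {
	x := randomTensor([]int{512, 512}) // hypothetical random-init helper
	y := randomTensor([]int{512, 512})
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tensor.MatMulTensor(x, y)
	}
}
```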