go-torch: a simple torch-like library in 1000 lines.
I was reading Jon Bodner’s book, *Learning Go*, and got the idea of recreating a PyTorch-like library in Go. Before this, I tried creating a tensor library in Rust but couldn’t get far: Rust has a steep learning curve, and I’m still confused by its ownership and borrowing semantics.
So, I thought: why not Go?
The idea for go-torch is simple: just build out the essentials (tensor functions, autograd, the backward pass, and a linear layer). That’s all we need for a working library that can actually train a network, like a digit recognizer for MNIST.
GitHub - https://github.com/Abinesh-Mathivanan/go-torch
I didn’t go fancy yet; there’s a lot to handle around BLAS and Intel kernels. I’ll wrap those soon, and then we’ll have some cool functionality in the library.
We’ll go step by step through how I built this.
Project Structure
tensor/ -> Core tensor logic and computation-graph tracking: all tensor operations (add, mul, matmul, reshape, transpose), plus Backward() for automatic differentiation
nn/ -> Neural network layer definitions [Linear, ReLU, Sigmoid, Tanh, Softmax, CrossEntropyLoss]
autograd/ -> Currently unused; placeholder for a future refactor of the backward pass / topological sort
utils/ -> Benchmarking utilities
Architecture
The core of go-torch is the `Tensor` struct defined in `tensor/tensor.go`:

```go
type Tensor struct {
	shape        []int
	data         []float64
	Grad         *Tensor
	RequiresGrad bool
	Parents      []*Tensor
	Operation    string
	BackwardFunc func(*Tensor)
}
```
RequiresGrad, Parents, and BackwardFunc
When you create a tensor that needs gradients computed during training (like weights or biases), you set `RequiresGrad = true`.
When you perform an operation (say, adding two tensors `A` and `B` to get `C = A + B`), the resulting tensor `C` will have:

`C.RequiresGrad = A.RequiresGrad || B.RequiresGrad`

If `C.RequiresGrad` is `true`, then:

- `C.Parents` is set to `[A, B]`
- `C.BackwardFunc` is defined
This `BackwardFunc` knows the gradient rule for the operation. For addition:
$$ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}, \quad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial C} $$
So `C.BackwardFunc` just takes the incoming gradient `dL/dC` and passes it backward to `A` and `B`. This chaining of `Parents` and `BackwardFunc`s is how we build the computation graph: by simply doing math, everything gets tracked. A minimal standalone sketch of this wiring is below.
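To make that concrete, here is a minimal, self-contained sketch of the wiring for addition. It is not go-torch’s actual code: the fields are simplified, and gradients are plain `[]float64` slices rather than `*Tensor`.

```go
package main

import "fmt"

// Tensor is a stripped-down stand-in for go-torch's tensor.Tensor:
// same idea, but gradients are plain slices to keep the sketch short.
type Tensor struct {
	Data         []float64
	Grad         []float64
	RequiresGrad bool
	Parents      []*Tensor
	BackwardFunc func(grad []float64)
}

// Add builds C = A + B and wires C into the graph. Since dL/dA = dL/dB = dL/dC
// for element-wise addition, the backward closure just forwards the gradient.
func Add(a, b *Tensor) *Tensor {
	data := make([]float64, len(a.Data))
	for i := range data {
		data[i] = a.Data[i] + b.Data[i]
	}
	c := &Tensor{Data: data, RequiresGrad: a.RequiresGrad || b.RequiresGrad}
	if c.RequiresGrad {
		c.Parents = []*Tensor{a, b}
		c.BackwardFunc = func(grad []float64) {
			for i, g := range grad {
				if a.RequiresGrad {
					a.Grad[i] += g
				}
				if b.RequiresGrad {
					b.Grad[i] += g
				}
			}
		}
	}
	return c
}

func main() {
	a := &Tensor{Data: []float64{1, 2}, Grad: make([]float64, 2), RequiresGrad: true}
	b := &Tensor{Data: []float64{3, 4}, Grad: make([]float64, 2), RequiresGrad: true}
	c := Add(a, b)
	c.BackwardFunc([]float64{1, 1}) // seed with dL/dC = 1
	fmt.Println(a.Grad, b.Grad)     // [1 1] [1 1]
}
```

Note that gradients are accumulated with `+=`, not assigned: if a tensor feeds into several operations, each branch contributes to its gradient.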
The Flow: Forward and Backward Pass
Let’s walk through what actually happens when you run the simple network demo in `main.go`.
Input Tensor (`x`)
You start by creating your input tensor `x` and set:
`x.RequiresGrad = true`
This flags it for gradient tracking. You usually do this for weights and biases, but here we’re doing it just to show how gradients flow through the whole graph — even back to the input if needed.
Linear Layer (`layer1`)
You create your first linear layer. Inside, it sets up `weight` and `bias` tensors, both with `RequiresGrad = true` by default.
Forward Pass (`layer1.Forward(x)`)
This step performs the standard matrix operation:
$$ O = XW + b $$
- Matrix Multiply: calls `tensor.MatMulTensor(x, layer1.weight)`. The resulting output will have:
  - `RequiresGrad = true`
  - `Parents = [x, weight]`
  - a `BackwardFunc` based on the matrix multiplication rule (sketched in code after this list):

  $$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial O} \cdot W^T, \quad \frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial O} $$
- Add Bias: then adds the bias via broadcasting with `tensor.AddTensor(...)`, which again sets:
  - `RequiresGrad = true`
  - `Parents = [matmul_result, bias]`
  - a `BackwardFunc` using the addition rule:

  $$ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}, \quad \frac{\partial L}{\partial B} = \frac{\partial L}{\partial C} $$
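Here is the matmul rule above as a standalone, runnable sketch over row-major `float64` slices. Go-torch keeps the equivalent logic inside the output tensor’s `BackwardFunc`; this is illustrative, not the library’s code.

```go
package main

import "fmt"

// matmulBackward sketches the gradient rule for O = X·W, with X of shape
// (n×k) and W of shape (k×m), both row-major. Given dL/dO it returns
// dL/dX = dL/dO · Wᵀ and dL/dW = Xᵀ · dL/dO.
func matmulBackward(x, w, gradOut []float64, n, k, m int) (gradX, gradW []float64) {
	gradX = make([]float64, n*k)
	gradW = make([]float64, k*m)
	for i := 0; i < n; i++ {
		for j := 0; j < m; j++ {
			g := gradOut[i*m+j]
			for p := 0; p < k; p++ {
				gradX[i*k+p] += g * w[p*m+j] // dL/dX = dL/dO · Wᵀ
				gradW[p*m+j] += x[i*k+p] * g // dL/dW = Xᵀ · dL/dO
			}
		}
	}
	return gradX, gradW
}

func main() {
	// X is 1×2, W is 2×1, so O is 1×1; seed dL/dO = 1.
	gx, gw := matmulBackward([]float64{1, 2}, []float64{3, 4}, []float64{1}, 1, 2, 1)
	fmt.Println(gx, gw) // dL/dX = Wᵀ = [3 4], dL/dW = Xᵀ = [1 2]
}
```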
Activation Function (`nn.RELU(h)`)
Applies the ReLU activation:
$$ y = \max(0, x) $$
The output gets:
- `RequiresGrad = true`
- `Parents = [h]`
- a `BackwardFunc`:
$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \mathbf{1}_{x > 0} $$
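Both halves of ReLU fit in a few lines. Here is a self-contained sketch of the forward pass plus the backward closure; it is illustrative, not the library’s actual `nn.RELU`.

```go
package main

import "fmt"

// relu sketches y = max(0, x) together with its backward closure:
// dL/dx = dL/dy wherever x > 0, and 0 elsewhere.
func relu(x []float64) (y []float64, backward func(gradY []float64) []float64) {
	y = make([]float64, len(x))
	for i, v := range x {
		if v > 0 {
			y[i] = v
		}
	}
	backward = func(gradY []float64) []float64 {
		gradX := make([]float64, len(x))
		for i := range gradY {
			if x[i] > 0 { // the indicator 1_{x>0} from the formula above
				gradX[i] = gradY[i]
			}
		}
		return gradX
	}
	return y, backward
}

func main() {
	y, back := relu([]float64{-1, 2, -3, 4})
	fmt.Println(y)                           // [0 2 0 4]
	fmt.Println(back([]float64{1, 1, 1, 1})) // [0 1 0 1]
}
```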
Second Linear Layer (`layer2.Forward(hRelu)`)
Same process:
- matmul with weights
- add bias
- `Parents` and backward functions set

Returns `logits`.
Loss Function (`nn.CrossEntropyLoss(logits, targets)`)
Calculates the final loss:
$$ \text{loss} = -\sum_i y_i \log(\operatorname{softmax}(x)_i) $$
The resulting tensor `loss` has:
- `RequiresGrad = true`
- `Parents = [logits]`
- a `BackwardFunc`:
$$ \frac{\partial L}{\partial \mathrm{logits}} = \frac{\operatorname{softmax}(\mathrm{logits}) - \operatorname{onehot}(\mathrm{targets})}{\mathrm{batch\ size}} $$
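That fused softmax-plus-cross-entropy gradient is the workhorse of the whole pipeline, so here is a single-example sketch of it (batch size 1, so the division by batch size drops out). Illustrative only, not `nn.CrossEntropyLoss` itself.

```go
package main

import (
	"fmt"
	"math"
)

// crossEntropyGrad sketches the fused softmax + cross-entropy backward rule
// for one example: dL/dlogits = softmax(logits) - onehot(target).
func crossEntropyGrad(logits []float64, target int) []float64 {
	// Softmax with max-subtraction for numerical stability.
	maxv := logits[0]
	for _, v := range logits[1:] {
		if v > maxv {
			maxv = v
		}
	}
	sum := 0.0
	grad := make([]float64, len(logits))
	for i, v := range logits {
		grad[i] = math.Exp(v - maxv)
		sum += grad[i]
	}
	for i := range grad {
		grad[i] /= sum // grad now holds softmax(logits)
	}
	grad[target] -= 1 // subtract the one-hot target
	return grad
}

func main() {
	// Probabilities ≈ [0.66 0.24 0.10]; the gradient pushes the target logit up.
	fmt.Println(crossEntropyGrad([]float64{2.0, 1.0, 0.1}, 0))
}
```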
Backward Pass (`loss.Backward(nil)`)
Starts from the scalar loss node:
- `loss.Grad` is seeded with `1.0`
- `loss.BackwardFunc` runs
- Gradients backpropagate to `logits`
- Then to `layer2.weight`, `layer2.bias`, and `hRelu`
- `hRelu` applies the ReLU derivative
- Then to `layer1.weight`, `layer1.bias`, and finally to `x`

Every tensor in the graph with `RequiresGrad = true` will now have its `.Grad` field populated:
- `x.Grad`
- `layer1.weight.Grad`, `layer1.bias.Grad`
- `layer2.weight.Grad`, `layer2.bias.Grad`

A sketch of how this traversal can be ordered is below.
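Order matters here: a node’s `BackwardFunc` should only fire once its own gradient is fully accumulated. The usual fix (and what the planned `autograd/` topo-sort refactor points at) is a reverse topological traversal. Reusing the minimal `Tensor` type from the earlier sketch, and assuming every node’s `Grad` slice is pre-allocated, one way to write it:

```go
// backward visits the graph from the loss node: a DFS over Parents builds a
// topological order, then BackwardFuncs run in reverse so each node's gradient
// is complete before it is pushed further back. Illustrative only; go-torch's
// actual tensor.Backward may differ in detail.
func backward(loss *Tensor) {
	var topo []*Tensor
	visited := map[*Tensor]bool{}
	var visit func(*Tensor)
	visit = func(t *Tensor) {
		if visited[t] {
			return
		}
		visited[t] = true
		for _, p := range t.Parents {
			visit(p)
		}
		topo = append(topo, t)
	}
	visit(loss)

	for i := range loss.Grad {
		loss.Grad[i] = 1 // seed dL/dL = 1
	}
	for i := len(topo) - 1; i >= 0; i-- {
		if t := topo[i]; t.BackwardFunc != nil {
			t.BackwardFunc(t.Grad)
		}
	}
}
```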
The Optimizer: Stochastic Gradient Descent (SGD)
To apply updates, the optimizer must know which tensors to manage.
Each layer in go-torch provides a `Parameters()` method that returns a slice of all trainable tensors:

```go
allParams := append(layer1.Parameters(), layer2.Parameters()...)
```
Once we gather all the parameters, we construct the optimizer:
```go
opt, err := optimizer.NewSGD(allParams, 0.1)
if err != nil {
	log.Fatal(err)
}
```
This tells the optimizer:
- Which tensors to update,
- What learning rate to use,
- To expect `.Grad` to be populated on each tensor after `.Backward()` is called.
For each parameter $ \theta $, the update rule is:
$$ \theta \leftarrow \theta - \eta \cdot \nabla_\theta L $$
Where:
- $ \eta $ is the learning rate (a small positive constant),
- $ \nabla_\theta L $ is the gradient of the loss $L$ with respect to parameter $ \theta $.
This update moves each parameter in the direction that most reduces the loss.
In go-torch, this logic is implemented in the `optimizer/sgd.go` file. Below is the core implementation:
```go
func (s *SGD) Step() error {
	for _, p := range s.parameters {
		if p.Grad == nil {
			continue
		}
		paramData := p.GetData()
		gradData := p.Grad.GetData()
		for i := range paramData {
			paramData[i] -= s.learningRate * gradData[i]
		}
	}
	return nil
}
```
Before starting the next training step, we must clear the previous gradients:
```go
opt.ZeroGrad()
```
This is crucial—otherwise, gradients from multiple steps accumulate and corrupt learning. This works by calling ZeroGrad() on each parameter tensor:
```go
func (s *SGD) ZeroGrad() {
	for _, p := range s.parameters {
		p.ZeroGrad()
	}
}
```

Each tensor’s `ZeroGrad()` sets its `Grad.data` to all zeros.
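Putting all the pieces together, a training loop reads roughly like this. It is a fragment, not a full program: `x`, `targets`, `layer1`, `layer2`, and `opt` are the objects constructed in the walkthrough above, and exact signatures (e.g. whether `Forward` also returns an error) may differ from the repo.

```go
for epoch := 0; epoch < 100; epoch++ {
	// Forward pass: each op records Parents and a BackwardFunc.
	h := layer1.Forward(x)
	hRelu := nn.RELU(h)
	logits := layer2.Forward(hRelu)
	loss := nn.CrossEntropyLoss(logits, targets)

	// Clear stale gradients, backpropagate, then update the parameters.
	opt.ZeroGrad()
	loss.Backward(nil)
	if err := opt.Step(); err != nil {
		log.Fatal(err)
	}
}
```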
Benchmark
Some performance measurements for common operations and a simple training step in go-torch:
| Operation | 128×128 | 512×512 | 1024×1024 |
|---|---|---|---|
| Element-wise Add | 187.602 µs | 2.326982 ms | 9.558306 ms |
| Element-wise Mul | 126.740 µs | 2.256796 ms | 10.684073 ms |
| Matrix Multiply | 8.514523 ms | 1.156596279 s | 15.784033 s |
| ReLU Activation | 226.385 µs | 4.192483 ms | 6.26745 ms |
| Operation | Configuration | Avg Time per Iteration |
|---|---|---|
| Linear Layer Forward | Batch: 32, In: 128, Out: 10 | 310.494 µs |
| CrossEntropyLoss | Batch: 32, Classes: 10 | 39.996 µs |
| Full Forward-Backward Pass | Net: 128-256-10, Batch: 32 | 28.68176 ms |
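Numbers like these can be reproduced with Go’s built-in benchmarking harness. Below is a self-contained sketch that times a naive 512×512 matmul via `go test -bench=.`; it uses plain `float64` slices rather than go-torch tensors, and absolute timings will vary by machine.

```go
package bench

import (
	"math/rand"
	"testing"
)

// naiveMatMul mirrors the triple loop a straightforward tensor library uses
// for square n×n matrices in row-major order.
func naiveMatMul(a, b []float64, n int) []float64 {
	out := make([]float64, n*n)
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			aik := a[i*n+k]
			for j := 0; j < n; j++ {
				out[i*n+j] += aik * b[k*n+j]
			}
		}
	}
	return out
}

func BenchmarkMatMul512(b *testing.B) {
	n := 512
	x := make([]float64, n*n)
	y := make([]float64, n*n)
	for i := range x {
		x[i] = rand.Float64()
		y[i] = rand.Float64()
	}
	b.ResetTimer() // exclude setup from the measured time
	for i := 0; i < b.N; i++ {
		naiveMatMul(x, y, n)
	}
}
```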