Can-t stop till you get enough
Hey, I'm Tucker, and I want to talk about something I'm proud of: rewriting PyTorch in Rust.
This is the first in a series of blog posts centered around my current personal project. email: tucker.bull.morgan@gmail.com
The Project: Can-t
The project is called Can-t, a Rust rewrite of PyTorch whose #1 goal was to maximize my personal learning.
Yes, I rewrote PyTorch in Rust.
The name comes from Bryan Cantrill, the CTO of Oxide Computer Company. I imagine Bryan reading my code, and my code trying to pass his bar. (If this ever gets to Oxide: I’d love to go on Oxide and Friends!)
Why a PyTorch Rewrite?
Every time I went to learn ML, I would end up rewriting PyTorch by accident. Eventually I just gave in and decided to do it properly. I learn best through projects, building and tinkering, and this endeavor has been a whole athenaeum unto itself.
I started by defining a few simple rules early to help keep such a large project focused.
- Keep the goal simple: The project exists to help me learn — not to be the fastest or most innovative ML library in the world.
- Don’t burn out: I’ve killed projects before by brute-forcing my way through problems. This time, I wanted to pace myself and stay excited.
- Maximize learning: There were several technical trade-offs where I passed on the "better" approach because it would have meant spending months on a single aspect of the project.
Why Rust?
Rust's memory model, object model, and overall code philosophy are distinct from Python's, so I would be forced to adapt the code, not just copy it. It also doesn't hurt that Rust is pretty fast out of the box and comes with Cargo, an extremely useful build system and dependency manager.
Lastly, I want to shout out Andrej Karpathy’s Zero to Hero series on YouTube — it was a huge help when I started this project:
Zero to Hero
Getting Started: Tensors and Operators
Where do you start when building a PyTorch-like library?
I started with Tensors and Operators.
What Are Tensors?
Tensors are multidimensional variables, the core data objects in any ML framework.
Quick digression: I find “PyTorch-like library” or "ML Framework" a mouthful to say again and again, so I’m going to call it a MLer (pronounced Miller).
Tensors can also be called matrices (or vectors, or scalars, depending on the number of dimensions). Each Tensor has:
- Data type — integers, floats, maybe bools (yes, PyTorch supports bool tensors!).
- Shape — Which informs how we view the data in the Tensor. For example, [3, 4], [4, 3, 10], or [1]. The total number of elements is the product of all dimensions.
So a tensor with shape [4, 5] has 20 elements. To access an element, you'd do something like let x = tensor[[2, 3]].
My MLer is column-major, not row-major, mostly due to personal preference. I got used to the idea of going width then height (so x, then y) from games, and I went with that.
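To make that concrete, here's a tiny sketch (my own illustrative helper, not Can-t's actual code) of how a column-major layout maps an index onto a flat buffer: the first axis varies fastest, and each later axis strides by the product of the dimension sizes before it.
// Hypothetical helper: map a column-major index to a flat-buffer offset.
fn flat_index(shape: &[usize], index: &[usize]) -> usize {
    let mut offset = 0;
    let mut stride = 1;
    for (&dim, &i) in shape.iter().zip(index.iter()) {
        offset += i * stride;
        stride *= dim;
    }
    offset
}

fn main() {
    // For a [4, 5] tensor, element [2, 3] sits at 2 + 3 * 4 = 14.
    assert_eq!(flat_index(&[4, 5], &[2, 3]), 14);
}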
When you allocate a Tensor in Can-t, it asks the “Equation” — which represents all computations — to allocate an N-sized data buffer, where N is the product of the Shape's dimensions.
All the actual tensor data lives in a single flat data array. The Tensor struct itself is lightweight — it doesn’t hold data directly. Instead, it stores a TensorID, which it uses to look up an InternalTensor (basically a fat pointer with start and length).
/// An internal record-keeping version of a Tensor; holds the pointers to the data and grad of the array
pub struct InternalTensor {
    pub id: TensorID, // the unique id for this tensor, used by the equation to look up the data
    pub shape: Shape, // the shape this tensor has
    pub data_start_index: usize, // where in equation.data this tensor's data starts
    pub grad_start_index: usize, // where in equation.grad this tensor's grad starts
    pub operation: Operation, // the operation that created this tensor, Nop for an allocation
}
/// The external record of a Tensor that a user would work with
pub struct Tensor {
    pub id: TensorID, // the unique id for this tensor; ties it to the InternalTensor used to look up the data
    pub shape: Shape, // the shape of the tensor
    operation: Operation, // the operation that created this Tensor (Nop for basic allocations)
}
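To show how that indirection plays out, here's a stripped-down sketch of the lookup path. The field and method names are my stand-ins for illustration; Can-t's real Equation carries much more state than this.
use std::collections::HashMap;

// Simplified stand-ins: one flat buffer for ALL tensor data, plus a map
// from id to the fat pointer (start, length) carving out each tensor's slice.
type TensorID = usize;

struct FatPointer {
    start: usize,
    len: usize,
}

struct Equation {
    data: Vec<f32>,
    tensors: HashMap<TensorID, FatPointer>,
}

impl Equation {
    // Resolve a lightweight Tensor's id to its slice of the shared buffer.
    fn data_of(&self, id: TensorID) -> &[f32] {
        let ptr = &self.tensors[&id];
        &self.data[ptr.start..ptr.start + ptr.len]
    }
}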
Two Key Design Choices
At the start, I made the first two major architectural decisions in Can-t:
Global Equation and Homogeneous Datatype.
Global Equation
All of these Tensors, the Operations we will get to in a moment, and various other pieces of bookkeeping need to live somewhere. Most often that is an "Equation" or "Graph" structure of some kind. Most other Rust-based MLers I have come across require the end user to allocate this manually, and then you use it as a central object that you do everything off of, like this:
// Roughly something like this
let mut equation = Equation::new();
let a = equation.allocate_tensor(Shape::new(vec![1, 2, 3]));
This makes sense — there’s a lot of bookkeeping involved.
But in Can-t, using the lazy_static macro, I created a more-or-less singleton "Equation" that you can access through this function:
lazy_static! {
    static ref SINGLETON_INSTANCE: Mutex<Equation> = Mutex::new(Equation::new());
}

pub fn get_equation() -> MutexGuard<'static, Equation> {
    // A poisoned Mutex stays poisoned, so retrying lock() in a loop would
    // spin forever; recover the guard from the poison error instead.
    SINGLETON_INSTANCE
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}
So then you can allocate a Tensor like this:
let a = Tensor::new(Shape::new(vec![1, 2, 3]));
Under the hood, Tensor::new calls the global function:
let tensor_id = get_equation().allocate_tensor_with_shape(shape);
There’s a drawback, of course: you might want more than one equation, to build separate networks. Also, since there’s a lock around the lazy static variable, it’s possible to get contention over the equation if you’re not careful.
But this design keeps the experience PyTorch-like, letting users write straightforward code without worrying about the internal graph bookkeeping. It’s a simplicity-first choice.
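One concrete footgun worth knowing about (a hypothetical misuse sketch, since std's Mutex is not reentrant): if you hold the guard from get_equation while calling anything that also locks it, you'll deadlock or panic.
// Hypothetical misuse: the guard is still alive when Tensor::new tries
// to lock the same Mutex from the same thread.
fn wrong() {
    let equation = get_equation();            // takes the lock...
    let a = Tensor::new(Shape::new(vec![1])); // ...and this tries to take it again
    // drop(equation) before allocating, and everything is fine
}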
Homogeneous Datatype
PyTorch supports many datatypes. Can-t supports one: f32.
This choice was for simplicity: the goal isn't production use, but learning. Handling multiple data types would add complexity without teaching me much more about ML internals.
Operations
Now for the fun part: Operations — the arithmetic you perform on Tensors.
Simple ones like Add, Sub, or Mul work exactly as you’d expect:
Tensor A (Shape: [2, 3])
Tensor B (Shape: [2, 3])
Tensor C = A + B
So C[x, y] = A[x, y] + B[x, y] works so long as the Tensors have the same shape.
(Sub, Mul, and Div all follow the same logic.)
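As a sketch of the idea (not Can-t's literal kernel), an elementwise op over two same-shaped tensors is just one pass over their flat buffers:
// Elementwise Add: shapes match, so all three buffers have the same length.
fn add_elementwise(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    for i in 0..a.len() {
        out[i] = a[i] + b[i];
    }
}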
Broadcasting
Okay, I lied just there: tensors don't actually need identical shapes to work together.
They just need to be broadcastable.
From PyTorch’s documentation:
“Two tensors are ‘broadcastable’ if the following rules hold:
- Each tensor has at least one dimension.
- Starting from the trailing dimension, the sizes must either be equal, one of them is 1, or one of them doesn’t exist.”
So if I have two Tensors of shapes [1, 2] and [3, 2], when we broadcast A -> B, we treat A as a [3, 2] where we triplicate the first axis. (For those curious how the backward step of this works: you just sum up the gradients along the broadcasted axis.)
You can broadcast multiple dimensions at once, so [1, 1, 2] and [2, 3, 2] broadcast together without issue as well.
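In code, that rule is short. Here's my own illustrative check, not Can-t's implementation:
// Walk both shapes from the trailing dimension; each pair of sizes must be
// equal or contain a 1. A missing dimension (one shape is shorter) is fine.
fn broadcastable(a: &[usize], b: &[usize]) -> bool {
    if a.is_empty() || b.is_empty() {
        return false; // each tensor needs at least one dimension
    }
    a.iter()
        .rev()
        .zip(b.iter().rev())
        .all(|(&x, &y)| x == y || x == 1 || y == 1)
}

fn main() {
    assert!(broadcastable(&[1, 2], &[3, 2]));
    assert!(broadcastable(&[1, 1, 2], &[2, 3, 2]));
    assert!(!broadcastable(&[2, 2], &[3, 2]));
}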
I am gonna skip ahead a little and talk about Matrix Multiplication shapes, as it is a special and interesting case.
When you perform what is called Batch Matrix Multiplication, which is MatMul where at least one of the operands has three or more dimensions, you only broadcast the dimensions beyond the last two.
So let's walk through a simple example.
Tensor A has Shape [4, 2, 1, 2]
Tensor B has Shape [3, 2]
In Can-t, for this operation, we treat both Tensors as 4D, inserting size-1 dimensions for any tensor with fewer than four dimensions.
(Can-t also just does not work on Tensors over 4 dimensions, again to keep things simple. Update: this limit has since been raised to ten.)
So A stays the same, but B becomes [1, 1, 3, 2], and then we broadcast just the first two dimensions, the batch dimensions.
Giving us
Tensor A [4, 2, 1, 2]
and
Tensor B [4, 2, 3, 2]
Then, when we go to MatMul them together, we break off the last two dimensions of each, A [1, 2] and B [3, 2], transpose B to [2, 3], which gives us a result Tensor C [1, 3] per slice, and a final result for the entire MatMul operation of C [4, 2, 1, 3].
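Putting the shape rules together, here's a sketch of how the result shape could be computed. This is my illustration of the logic above, not Can-t's code, and it bakes in the transpose-B convention from the example:
// Pad both shapes to 4D with leading 1s, broadcast the two batch dims,
// then apply the matrix rule to the last two (treating B as [.., p, n]).
fn batch_matmul_shape(a: &[usize], b: &[usize]) -> Option<[usize; 4]> {
    if a.is_empty() || a.len() > 4 || b.is_empty() || b.len() > 4 {
        return None;
    }
    let pad = |s: &[usize]| {
        let mut out = [1usize; 4];
        out[4 - s.len()..].copy_from_slice(s);
        out
    };
    let (a, b) = (pad(a), pad(b));
    let mut shape = [0usize; 4];
    for i in 0..2 {
        // broadcast the batch dimensions
        if a[i] != b[i] && a[i] != 1 && b[i] != 1 {
            return None;
        }
        shape[i] = a[i].max(b[i]);
    }
    // A is [.., m, n] and B is [.., p, n]; transposing B to [.., n, p]
    // makes the matrix part of the result [m, p].
    if a[3] != b[3] {
        return None;
    }
    shape[2] = a[2];
    shape[3] = b[2];
    Some(shape)
}

fn main() {
    // The example above: [4, 2, 1, 2] x [3, 2] -> [4, 2, 1, 3].
    assert_eq!(batch_matmul_shape(&[4, 2, 1, 2], &[3, 2]), Some([4, 2, 1, 3]));
}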
Single-Tensor Operations
There’s also a set of operations you can perform on just a single Tensor. These are pretty common, so I won’t go into too much detail — but functions like sin, cos, log, or pow are all defined for a Tensor in Can-t, just like in PyTorch. They simply map NewTensor[x, y] = sin(OldTensor[x, y]), for example.
Matrix Multiplication
Now to the spicy stuff: Matrix Multiplication — the operation that powers modern AI and pays for Jensen Huang’s leather jackets.
Conceptually, matrix multiplication combines inputs and weights to produce outputs — the backbone of every neural network layer.
If A is shape (m × n) and B is (n × p), the result C is (m × p):
C[i, j] = Σ (A[i, k] * B[k, j])
In simplified Rust (using nested Vecs rather than Can-t's flat buffer):
// Naive triple loop: each output cell is the dot product of
// row i of A with column j of B.
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (m, n, p) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; p]; m];
    for i in 0..m {
        for j in 0..p {
            let mut sum = 0.0;
            for k in 0..n {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
    c
}
Under the hood, GPUs perform billions of these small dot products in parallel — that’s why matrix multiplication dominates deep learning performance.
Checkpoint
A few reflections from building Can-t so far:
- Rust’s ownership model made it easy to reason about memory safety but challenging to handle dynamic computation graphs cleanly.
- Broadcasting logic was tricky to get working, but since then I've barely had to touch it.
- Global Equation simplified everything for learning, but I’d consider making it configurable later for multi-graph experimentation.
Wrapping Up
At this point, we have Tensors and Operations — enough for inference.
Each Tensor tracks which Operation produced it, forming a directed acyclic graph (DAG) where Tensors are nodes and Operations are edges.
For example:
let a = Tensor::new(Shape::new(vec![1]));
let b = Tensor::new(Shape::new(vec![1]));
Allocates two nodes in the DAG
let c = a + b;
Allocates a third node and connects an edge from each of those Tensors to C. That edge is marked as
Operation::Add(a_id, b_id)
So we can look it up later for... backpropagation.
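For a rough picture of what each edge stores, the Operation enum looks something like this sketch (illustrative only; Can-t's real enum has many more variants):
// Each Tensor records the Operation that produced it, holding the
// TensorIDs of its parents; walking these backwards recovers the DAG.
type TensorID = usize; // stand-in for Can-t's real TensorID type

enum Operation {
    Nop,                     // a plain allocation, no parents
    Add(TensorID, TensorID), // c = a + b remembers both operands
    MatMul(TensorID, TensorID),
    Sin(TensorID),
}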
Wow look at the time, if you want to read about that you will have to come back next week. :P
Quick Recap
- Tensors: Variables that can store multiple values under a single name, addressable by index.
- Operations: The mathematical actions you can take on one or more tensors — like Add, MatMul, or Sin.
So now we have almost everything needed for inference — the act of sending input through a neural network.
Next up? Backpropagation.
That’s a story for the next blog post.
Thanks to Julia, Suzannah, Anthony, and Jayant for reading early drafts (This post was edited with the help of ChatGPT)