why & how i learnt ML
lore before 2022
I started reading physics for no particular reason (actually, there’s a big story behind this) back in 2020, during the coronavirus lockdown. At that time, I ran an Instagram page (I still have it) - space_relativity - and posted a lot of content. During that phase, I joined a few research orgs, and one of them required me to build classifiers and regression models for galaxy classification and gamma-ray bursts, for which I learnt Julia (for some reason, I felt Julia was better). So my first brush with ML was building these models.
Following this, I quickly lost interest in this weird computer stuff (I didn’t know how to use computers well until 2020). I was more of a notebook, pen, and math person. I still remember the day I watched Terminator 2: Judgment Day and wondered how I could build that on my Intel Pentium 2.5GHz PC with 2GB of RAM. The same evening, I started reading Max Tegmark’s Life 3.0 (because I googled “top AI books” and it showed up), which I finished over the next three days.
Then I got ready for my next book, Deep Learning by Ian Goodfellow. I read a few pages, maybe 30-40, and wondered, “So when do they start building the super AI?”. Eventually, I completed 100 pages in a span of 20 days (I spent a lot of time on the math despite coming from a physics background). Later, I understood that the super AI is probably the friends we made along the way.
I stopped the book there and jumped into blogs to find an answer to a question that really intrigued me:
“How do these ML models think and answer? Like, where are their brains? Are they inside the CPU?”.
I never work on anything I find boring, and in the case of ML, I found it extremely boring for one simple reason:
“So you code the model, and you’ll never know whether it works until it has run for a few hours? How do you make peace with that?”
But still, the thought of building a super AI kept me in this field until I came to know about AI compilers and kernels.
It took me a year to fully grasp, “What’s Machine Learning and AI?”.
In early 2022, I heard the word Transformers for the first time from a friend of mine and found them really interesting. Before that, I had been working with brain-inspired NNs and WB-PGM (Whole Brain Probabilistic Generative Model), trying to understand how feasible it is to build such a super AI. I spent some time understanding RNNs and how Transformers improve on them.
Later, by the end of 2022, OpenAI launched ChatGPT, which kicked off the AI boom, and a lot has happened since. We now have quantization, high-performance kernels, LPUs, TPUs, etc…
how I learnt ML after 2022
I wanted to write this blog just to let people know how time-consuming it can be to build proper knowledge in a domain, and to provide a proper path for people who are confused about where to start with ML.
I’m not a big fan of YouTube playlists. I’m an avid watcher of tech interviews and physics lectures, but ‘roadmap’ playlists are where I draw the line. I prefer the text versions.
My tech and physics watchlist: article
So, I started reading my first technical book on ML, which is Mathematics for Machine Learning [book] by Aldo Faisal. I used to spend a lot of time jotting down purely mathematical problems. During this phase, I had zero idea of how ML works in a broad sense, but I knew how to call functions from the respective libraries.
It took me around 3 months to complete the whole book, skipping a few problems and hoping that I’d eventually grasp them. Then I started working with a few people from LinkedIn, building ML-based applications, and for the first time I was thinking in terms of matrix multiplications and rotations.
For the next 4 to 5 months, my reading consisted of the following resources:
- Make Your Own Neural Network by Tariq Rashid
- Understanding Machine Learning by Shai Shalev-Shwartz - book
- Introduction to Deep Learning by Eugene Charniak - book
- Deep Learning with PyTorch by Eli Stevens - book
- Deep Learning by Ian Goodfellow - book
The whole of 2023 ended there, with some good productivity.
I also felt that Understanding Deep Learning by Simon J.D. Prince is a good book. Beginners could try this one too.
starting the math (after 2023)
I had experience reading technical papers from my physics background, but actual deep learning papers were way harder for me than pure black hole solutions. Like everyone else, my first paper was “Attention Is All You Need,” which took me quite some time to properly grasp.
During this time, I started implementing “from scratch” kind of projects, including:
- Neural network from scratch (using Python and C++) - see the small sketch after this list
- AlexNet, DeepNet, and ResNet - kinds of Convolutional networks
- Activation Functions
- Optimizers
- RNNs, LSTM, GNNs
- Simple tokenizers, fast text lookup techniques (such as Bloom Filters)
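For reference, here’s a minimal version of the first item on that list - a tiny two-layer network trained on XOR in plain numpy. It’s just a sketch of the forward pass / backward pass idea, not the actual project code:

```python
import numpy as np

# A tiny two-layer network trained on XOR with plain numpy:
# forward pass, mean-squared-error loss, and a hand-derived backward pass.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass (chain rule written out by hand)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(out.round(3))  # should approach [[0], [1], [1], [0]]
```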
The math topics included:
- SVD and matrix decomposition (see the numpy sketch after this list)
- Forward pass and Backward pass
- Why the Jacobian is better than the Hessian in terms of efficiency
- Direct Convolutions, Image Kernels such as MaxPool, MinPool
- Text vectorization
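As a quick illustration of the SVD item, here’s a small numpy sketch of the low-rank approximation idea (not from my original notes, just the core mechanic):

```python
import numpy as np

# SVD as low-rank approximation: keep the top-k singular values/vectors
# and check the reconstruction error.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k reconstruction

print(S)                            # singular values, in descending order
print(np.linalg.norm(A - A_k, 2))   # spectral-norm error, equals S[k] (Eckart-Young)
```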
I spent at least 3 hours every day writing ML models from scratch and training them. This phase went on from Jan 2024 to May 2024.
I read the book Alice’s Adventures in a Differentiable Wonderland - book - very late in my career. But if you’re a beginner, read it right away.
Also, check out Introductory DL Blogs by Karpathy - https://karpathy.github.io/
going deeper into papers (2024)
I continued reading papers, but they became much more technical during this period. I read papers from AI2, Meta, Harvard AI Lab, and Google. I went through a lot of papers on LLM data techniques, without yet having a strong grasp of how LLMs actually work.
how to read an ML paper - article
In the brief period between May 2024 and Nov 2024, I covered almost every optimizer, activation, and loss function mathematically.
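As an example of what “covering an optimizer mathematically” meant in practice, here’s the Adam update rule written out in plain numpy (a minimal sketch, not tied to any particular project):

```python
import numpy as np

# One Adam step (Kingma & Ba, 2014): exponential moving averages of the
# gradient and its square, bias correction, then a scaled parameter update.
def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, grad=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)
print(theta)
```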
Also, during this period, I had to put my focus on interviews, which led me to read some CS fundamentals books:
- Packet Guide to Core Network Protocols by Bruce Hartpence
- Operating Systems: Three Easy Pieces by Remzi and Andrea Arpaci-Dusseau
- Designing Distributed Systems by Brendan Burns
- Design for How People Think by John Whalen, PhD
- Build a Large Language Model (From Scratch) by Sebastian Raschka, PhD
- Computational Thinking by Peter J. Denning and Matti Tedre (The MIT Press)
In this phase of learning, I actually spent less time but learnt things quickly, since I had already built a strong foundation in breaking down ML papers and their math in a single pass. (I still struggle with some papers, especially diffusion-based ones.)
Estimated: I read 60–70 papers covering Topology, ML, and Physics during this period.
Also during this time, I wrote a very simple ML programming language and a simple compiler named Lynthia, which I later updated into an LLM fine-tuning kernel.
rapid engineering (Nov 2024 - now)
Before this phase, I had focused on deep learning foundations, ML compilers, and state machines, but had never gotten fully into foundation models, compilers, or kernels.
I started reading Sebastian Raschka’s Build a Large Language Model (From Scratch), spending 2 hours every day, writing out the architecture by hand and programming it in PyTorch, and successfully built a very small 124M-parameter LLM from scratch in the span of a month. You could say this was the first moment I completely understood what an LLM is and how it works.
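For context on where the 124M number comes from, here’s a back-of-the-envelope parameter count, assuming the standard GPT-2-small hyperparameters with tied input/output embeddings and biases (a sketch, not code from the book):

```python
# Rough parameter count for a GPT-2-small style config (the "124M" figure).
V, T, d, L = 50257, 1024, 768, 12            # vocab, context length, embedding dim, layers

emb = V * d + T * d                          # token + positional embeddings
attn = 3 * (d * d + d) + (d * d + d)         # q/k/v projections + output projection (with biases)
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # two linear layers with the usual 4x expansion
ln = 2 * (2 * d)                             # two layer norms (scale + shift) per block
block = attn + mlp + ln

total = emb + L * block + 2 * d              # + final layer norm; output head tied to emb
print(f"{total / 1e6:.1f}M parameters")      # ~124.4M
```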
Then, I started reading the top AI labs’ papers on LLMs, test-time compute, SFT & RL techniques, data techniques, etc. In the next three months (around March-April), I was able to explain almost any concept in the field of LLMs, or at least knew how it worked.
my ai reading list - article
I believe I was able to grasp how LLMs work very quickly because I had spent almost 2.5 years just reading ML math and fundamentals. Also in this period, I completed a few of my favourite projects:
Deepcode (Leetcode for ML) - try it out
go-torch (deep learning framework in Go) - try it out
These projects actually got me a lot of reach and made my cold emails get replies from top AI people.
interest in kernel engineering
Even though I had some experience writing autograd engines, state compilers, and native ML optimization, an actual GPU-level kernel is way harder and very different. It’s more hardware than software. You need to know things like:
- why this `nvcc` version is not compiling on Ubuntu 20.04 LTS
To learn about kernels, my first thought went to JAX and HuggingFace. I wanted to learn kernels in a very practical way, so I used these resources to support me:
- JAX, Triton & Tile-lang documentation (to understand CPU-level optimizations and how purely functional frameworks work)
- HuggingFace blogs on model optimization (LLMs & Diffusion)
- Papers such as:
- ThunderKittens
- Fast Inference from Transformers via Speculative Decoding
Initially, I took the support of AI and generated around 30 optimization + kernel exercises, and spent a lot of time optimizing them. During this period, I built an SLM named Beens-Minimax, inspired by the MiniMax architecture, and tried to utilize Triton as much as possible. However, I had to use self-written kernels since I trained on a P100 GPU, which doesn’t natively support FlashAttention.
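For a flavour of what a Triton kernel looks like, here’s a minimal elementwise-add kernel in the style of the standard Triton tutorials (not one of the attention kernels above): each program instance loads one block of elements, adds them, and writes the result back.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per block of 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```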
Then, I spent time reading the documentation and papers on Tile-Lang (just to get a grasp of how these GPU programming languages work). Over the last two months, I’ve written 70+ kernels for various attention mechanisms, inference, and training. I’m still working on them.
Here are some of the resources I’ve been using recently:
- Inference Optimization
- Triton
- Tile-Lang
- Other resources