my ai reading list

2025-06-21

llm foundations

  • papers

    • Understanding LSTM Networks - Christopher Olah - [blog]
    • The Unreasonable Effectiveness of Recurrent Neural Networks - Andrej Karpathy - [blog]
    • Sequence to Sequence Learning with Neural Networks - (Google) - [paper]
    • Neural Machine Translation by Jointly Learning to Align and Translate - (Université de Montréal) - [paper]
    • Word2Vec: Efficient Estimation of Word Representations in Vector Space - (Google) - [paper]
    • GloVe: Global Vectors for Word Representation - (Stanford) - [paper]
    • ELMo: Deep Contextualized Word Representations - (Allen Institute) - [paper]
    • The Illustrated Transformer - Jay Alammar - [blog]
    • A Primer in BERTology: What We Know About How BERT Works - Rogers et al. - [paper]
    • A Survey of Transformers - Lin et al. - [paper]
  • books

    • Deep Learning (Ch. 6, 9, 10) - Goodfellow et al. - [book]
    • Alice’s Adventures in a Differentiable Wonderland, Vol. I - Simone Scardapane - [book]

language models

  • papers

    • Attention Is All You Need - (Google) - [paper]
    • BERT: Pre-training of Deep Bidirectional Transformers - (Google) - [paper]
    • Improving Language Understanding by Generative Pre-Training (GPT‑1) - (OpenAI) - [paper]
    • Language Models are Unsupervised Multitask Learners (GPT‑2) - (OpenAI) - [paper]
    • Language Models are Few-Shot Learners (GPT‑3) - (OpenAI) - [paper]
    • Scaling Laws for Neural Language Models - (OpenAI) - [paper]
    • Training Compute-Optimal Language Models (Chinchilla) - (DeepMind) - [paper]
    • PaLM: Scaling Language Models with Pathways - (Google) - [paper]
    • LLaMA: Open and Efficient Foundation Language Models - (Meta) - [paper]
    • The Llama 3 Herd of Models - (Meta) - [paper]
    • DeepSeek‑V3 Technical Report - (DeepSeek) - [paper]
    • Tülu 3: Pushing Frontiers in Open Language Model Post-Training - (Allen Institute for AI) - [paper]
    • Large Concept Models: Language Modeling in a Sentence Representation Space - (Meta) - [paper]

advanced language models & RL training

  • papers

    • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning - (DeepSeek) - [paper]
    • Kimi k1.5: Scaling Reinforcement Learning with LLMs - (Moonshot AI) - [paper]
    • Self-Rewarding Language Models - (Meta) - [paper]
    • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models - (Google) - [paper]
    • Training language models to follow instructions with human feedback (InstructGPT) - (OpenAI) - [paper]
    • RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback - (Google) - [paper]
    • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models - (DeepSeek) - [paper]
    • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits - (Microsoft) - [paper]
  • blogs

    • MagiAttention - (SandAI) - [blog]

diffusion & discrete generation models

  • papers

    • Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding - (NUS) - [paper]
    • Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs - (Huawei) - [paper]
    • Large Language Diffusion Models - (Gaoling School of AI) - [paper]

architectural advancement & fast inference

  • papers

    • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness - (Stanford) - [paper] (also read FlashAttention-2 & FlashAttention-3)
    • LoRA: Low-Rank Adaptation of Large Language Models - (Microsoft) - [paper]
    • Mixtral of Experts - (Mistral AI) - [paper]
    • Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention - (DeepSeek) - [paper]
    • Mamba: Linear-Time Sequence Modeling with Selective State Spaces - (CMU / Princeton) - [paper] [blog]
    • s1: Simple test-time scaling - (Stanford / Allen Institute for AI) - [paper]

model behaviour & new insights

  • papers

    • Alignment Faking in Large Language Models - (Anthropic) - [paper]
    • On the Biology of a Large Language Model - (Anthropic) - [paper]
    • How Much Do Large Language Models Memorize? - (Meta / Google / Cornell / NVIDIA) - [paper]
    • When Can Transformers Reason with Abstract Symbols? - (Apple / MIT) - [paper]
    • Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning - (Qwen) - [paper]
    • Let the Code LLM Edit Itself When You Edit the Code - (ByteDance) - [paper]