Type: Learning · Deep Dive
Year: 2026
Stack: PyTorch · tiktoken · docling

GPT-2 from scratch —
understanding attention by writing it

You can pip install transformers and spend years building on top of attention without ever understanding it. So I rebuilt the 124M-parameter GPT-2 stack line-by-line in PyTorch — multi-head attention, GELU, transformer blocks, the whole thing — and pretrained it on a medical textbook. This is what I took away from it.

LLM PyTorch From Scratch Attention Learning log
124M parameters (GPT-2 base config)
12 layers / 12 heads (d_model = 768)
1024-token context (tokens per sequence)
~400 lines of PyTorch (no Hugging Face dependencies)
[Illustration: the GPT-2 architecture as a modular factory tower, with attention heads, an MLP chamber, and residual bypass pipes.]
GPT-2, imagined as the machine it essentially is. Tokens in at the base, residual pipes running the full height, attention tubes and an MLP chamber inside every one of the 12 stacked crates.
01 — Why rebuild?

Using attention is easy. Understanding it isn't.

Every LLM project I've shipped in the last two years has been built on top of a pretrained model — Hugging Face, an API, a quantised weight file. That's the right default for production. It is also the fastest way to spend a career near transformers without ever actually understanding them.

I kept running into this gap. I could describe self-attention in an interview. I could not have written it on a blank page. "Softmax of QKᵀ scaled by √dₖ, times V" is a sentence, not an intuition. So I set a rule for myself: write the whole thing from nn.Module up, no Hugging Face, no copy-paste, and don't move on until I could derive each piece from memory.

This writeup is the condensed version — the parts that changed how I think about transformers, with the actual code from the notebook in the order I wrote it.

The goal

Not to train a usable language model (the data is tiny and the compute is a laptop). The goal was to build every piece of the GPT-2 architecture by hand so that when I later fine-tune a real model — MedGemma is next — I know exactly what each layer is doing and why.


02 — Setup

A medical textbook, a BPE tokenizer, and a sliding window

The corpus is Medicine PreTest Self-Assessment and Review (14th ed.), extracted from PDF via docling. Medical text is a deliberately awkward domain for a general-purpose tokenizer — plenty of rare words, Latin, dosage notation — which makes it a more honest sanity test than the usual tiny Shakespeare.

Tokenization uses tiktoken with GPT-2's 50,257-token BPE vocabulary. That keeps us drop-in compatible with the real GPT-2 weights if I ever want to warm-start from them later.
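A quick way to see both points at once, the vocabulary size and how awkwardly clinical vocabulary fragments (a throwaway check, not a cell from the notebook; the example word is arbitrary):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)  # 50257

# a general-purpose BPE has no single token for rare clinical terms,
# so they fragment into several sub-word pieces
ids = tokenizer.encode("hepatosplenomegaly")
print(len(ids), [tokenizer.decode([i]) for i in ids])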

The dataset itself is a sliding window over the full token stream. For a sequence of max_length tokens, the target is the same sequence shifted by one — classic next-token prediction. The stride parameter lets you trade overlap for unique samples.

gpt2 / dataset.py
import torch
from torch.utils.data import Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        # sliding window — input is [i : i+L], target is shifted by one
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        return self.input_ids[i], self.target_ids[i]
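Wiring that into a DataLoader takes a few more lines. A sketch, where raw_text is assumed to hold the extracted textbook text; a stride equal to max_length means no overlap between samples:

import tiktoken
from torch.utils.data import DataLoader

tokenizer = tiktoken.get_encoding("gpt2")
dataset = GPTDatasetV1(raw_text, tokenizer, max_length=1024, stride=1024)
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)

inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)  # torch.Size([4, 1024]) each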

And the GPT-2 base config as a plain dict — this is what gets threaded through every module. The 124M label comes from counting parameters once everything is wired up.

gpt2 / config.py
GPT_CONFIG_124M = {
    "vocab_size":     50257,   # BPE vocab
    "context_length": 1024,    # max tokens per forward pass
    "emb_dim":        768,     # d_model
    "n_heads":        12,      # attention heads
    "n_layers":       12,      # transformer blocks
    "drop_rate":      0.1,
    "qkv_bias":       False,
}

03 — Attention

The one piece it's worth writing by hand

Almost every other module in GPT-2 is a one-liner or a stock PyTorch call. Attention is the exception. You can read the formula a hundred times and still not quite feel how the shapes move. So this was the cell I refused to copy from anywhere.

The mental model that finally clicked for me: attention is a soft dictionary lookup. For every token, you compute a query. Every other token has a key (its "label") and a value (its "contents"). You score the query against every key, softmax the scores into weights, and take a weighted sum of the values. That's it. Everything else is plumbing.
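To make that concrete, here is the lookup in miniature: one hand-picked query against three keys, no learned weights (a toy example, not a notebook cell):

import torch

q = torch.tensor([1.0, 0.0])                 # this token's question
K = torch.tensor([[1.0, 0.0],                # key 0: matches the query exactly
                  [0.0, 1.0],                # key 1: orthogonal, irrelevant
                  [0.7, 0.7]])               # key 2: a partial match
V = torch.tensor([[10.0], [20.0], [30.0]])   # each token's "contents"

scores = K @ q                               # dot-product relevance: [1.0, 0.0, 0.7]
weights = torch.softmax(scores, dim=0)       # roughly [0.47, 0.17, 0.35]
print(weights @ V)                           # roughly 18.8, a blend dominated by key 0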

Four bits of that plumbing are worth pointing at:

The √dₖ scale

Without it, the dot-product magnitudes grow with dimension, softmax saturates, and gradients die. Dividing by sqrt(head_dim) keeps the logits in a range where softmax still has useful slope. A one-line detail that actually determines whether the model trains at all.
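You can watch that failure mode directly. As the dimension grows, unscaled dot products of random vectors spread out and softmax collapses onto one token (a throwaway demo):

import torch

torch.manual_seed(0)
d = 768
q, K = torch.randn(d), torch.randn(8, d)         # one random query, eight random keys

raw = K @ q                                      # std grows like sqrt(d), about 28 here
print(torch.softmax(raw, dim=0).max())           # near 1.0: saturated, gradient near zero
print(torch.softmax(raw / d**0.5, dim=0).max())  # well below 1: softmax keeps useful slope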

The causal mask

An upper-triangular -inf mask added to the attention logits before softmax. This is what makes GPT "decoder-only": token t can only attend to tokens ≤ t. Without it, you've built an encoder.
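The mask itself is tiny. For a four-token context:

import torch

T = 4
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = torch.zeros(T, T).masked_fill(mask, float("-inf"))
print(torch.softmax(scores, dim=-1))
# row t spreads its attention over tokens 0..t and gives the future exactly zero:
# tensor([[1.00, 0.00, 0.00, 0.00],
#         [0.50, 0.50, 0.00, 0.00],
#         [0.33, 0.33, 0.33, 0.00],
#         [0.25, 0.25, 0.25, 0.25]])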

Multi-head = reshape, not loop

The naïve way to do h attention heads is a Python loop. The actual way is a single projection to d_model, then .view() it into (heads, head_dim) and transpose. One matmul, h heads, free parallelism.

The output projection

After concatenating heads back together, a final Linear(d, d) lets the model mix information across heads. Skipping this makes each head an island. With it, heads can specialise and still inform each other.

gpt2 / attention.py
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # causal mask — upper triangle of ones, everything else zero
        self.register_buffer("mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # 1) project into Q, K, V — each (B, T, d_out)
        q = self.W_query(x); k = self.W_key(x); v = self.W_value(x)
        # 2) split heads via reshape + transpose → (B, heads, T, head_dim)
        q = q.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        # 3) scaled dot-product attention, with the causal mask
        attn_scores = q @ k.transpose(2, 3)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / k.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        # 4) apply weights to V, merge heads back, output projection
        ctx = (attn_weights @ v).transpose(1, 2).contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(ctx)
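Before moving on, a quick shape check: the same tensor shape comes out as went in, which is what lets blocks stack (a sanity check of mine, not from the notebook):

mha = MultiHeadAttention(d_in=768, d_out=768, context_length=1024,
                         dropout=0.1, num_heads=12)
x = torch.randn(2, 16, 768)   # (batch, tokens, d_model)
print(mha(x).shape)           # torch.Size([2, 16, 768])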

If there's one block of code in the whole model that I'd suggest writing by hand at least once, this is it. Everything else rests on top.


04 — The transformer block

LayerNorm, GELU, residuals — the boring parts that make it work

Attention alone does not train. Stack twelve attention layers on top of each other and you get vanishing gradients, unstable activations, and a model that spends most of its time deciding whether to be zero or infinity. The transformer block is the scaffolding that turns attention from a clever idea into something you can actually optimise.

Three ingredients:

gpt2 / norms.py
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift


class GELU(nn.Module):
    def forward(self, x):
        # GPT-2's approximate GELU — smoother than ReLU, keeps small negatives alive
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * x**3)))


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)
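And a two-line check that LayerNorm does what it says: per-token mean near 0, variance near 1 along the embedding dimension (my addition):

ln = LayerNorm(emb_dim=768)
out = ln(torch.randn(2, 16, 768))
print(out.mean(dim=-1).abs().max())            # close to 0
print(out.var(dim=-1, unbiased=False).mean())  # close to 1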

The block itself wires those into the two-residual pattern that makes GPT-2 GPT-2 — specifically the pre-LN variant, where LayerNorm is applied before attention and before the MLP, not after. This is different from the original "Attention Is All You Need" post-LN layout and is one of the reasons GPT-2 trains stably at depth.

gpt2 / block.py
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.ff = FeedForward(cfg)
        self.drop = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # pre-LN → attention → residual
        x = x + self.drop(self.att(self.norm1(x)))
        # pre-LN → MLP → residual
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

05 — The full model

Stack twelve of them, embed tokens, project to vocab

Once the block is built, the model itself is unreasonably short. It's basically: embed the tokens, add positional embeddings, run them through a stack of twelve identical blocks, layernorm once more, and project to vocabulary logits. No bells, no whistles.

gpt2 / model.py
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        B, T = in_idx.shape
        tok = self.tok_emb(in_idx)
        pos = self.pos_emb(torch.arange(T, device=in_idx.device))
        x = self.drop_emb(tok + pos)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        return self.out_head(x)   # (B, T, vocab_size)
Counting params

Embeddings (50,257 × 768) ≈ 38.6M. Each of 12 transformer blocks: attention ≈ 2.4M + MLP ≈ 4.7M ≈ 7.1M, times 12 ≈ 85M. Plus output head and norms. Total comes out to ~163M; GPT-2 tied the token embedding with the output head, which is how it's quoted as 124M. That tying is a one-line change and is the easiest "free" parameter-count improvement in the whole stack.
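Both numbers are one line each to verify. A sketch using the classes above; note that parameters() deduplicates shared tensors, so the tied count drops automatically:

model = GPTModel(GPT_CONFIG_124M)
total = sum(p.numel() for p in model.parameters())
print(f"{total:,}")   # roughly 163M with separate embedding and output head

# GPT-2-style weight tying: the output head shares the token embedding matrix
model.out_head.weight = model.tok_emb.weight
tied = sum(p.numel() for p in model.parameters())
print(f"{tied:,}")    # roughly 124M, the quoted figure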


06 — Generating text

Greedy decoding — the 20-line proof that the shapes work

With an untrained model the output is gibberish, so this is not about quality — it's about closing the loop and proving that a tensor of token IDs can go in, a distribution can come out, and the next token you sample from it can be appended and fed back in. Autoregressive generation in its smallest form:

gpt2 / generate.py
import torch

def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]          # crop to context window
        with torch.no_grad():
            logits = model(idx_cond)               # (B, T, vocab)
        logits = logits[:, -1, :]                  # only the last position matters
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.argmax(probs, dim=-1, keepdim=True)   # greedy
        idx = torch.cat((idx, next_id), dim=1)
    return idx
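Closing the loop end-to-end then looks something like this (the prompt text is mine; an untrained model continues it with noise):

import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")
prompt = torch.tensor([tokenizer.encode("The patient presents with")])

model.eval()   # disable dropout for inference
out = generate_text_simple(model, prompt, max_new_tokens=20,
                           context_size=GPT_CONFIG_124M["context_length"])
print(tokenizer.decode(out[0].tolist()))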

Swap argmax for top-k or nucleus sampling and you have the inference loop every LLM API you've ever used is doing under the hood. It really is that simple — once everything above actually works.
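For reference, top-k is roughly a three-line swap: keep the k largest logits, push the rest to -inf, and sample instead of taking the argmax. A sketch, not notebook code:

import torch

def sample_top_k(logits, k=50, temperature=1.0):
    # logits: (B, vocab). Keep only each row's k highest-scoring tokens.
    topk_vals, _ = torch.topk(logits, k)
    logits = logits.masked_fill(logits < topk_vals[..., -1:], -torch.inf)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (B, 1), same shape as the greedy next_id

# in the loop above: next_id = sample_top_k(logits, k=50)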


07 — What I learned

Six things that only really landed after writing it

Attention is a soft lookup, not magic

Queries ask "what do I need?", keys answer "here's what I am", values carry "here's what I contain". The softmax turns scoring into a differentiable weighted average. Once you hold that picture, every attention variant (cross-attention, grouped-query, sparse) is just a tweak to who-queries-whom.

Multi-head is shape gymnastics

I used to picture h separate attention modules running in parallel. In reality it's one projection and one matmul, reshaped. This is why adding heads at a fixed d_model is essentially free — you're not doing more compute, you're rearranging the same compute.

Residuals are the whole ball game at depth

Delete the x + ... in the transformer block and the model won't train past three layers. The residual stream is what carries the original signal past every attention and MLP, and every layer is really just writing a small update into it.

Pre-LN vs post-LN isn't a detail

The original transformer paper put LayerNorm after the residual add. GPT-2 moved it before the attention and MLP. That one change is a big part of why GPT-2 trains stably at 12+ layers, where post-LN transformers typically need careful learning-rate warmup to keep from diverging early in training.

The causal mask is one line — and everything

Add an upper-triangular -inf before softmax and you've built a decoder. Don't, and you've built an encoder. The architecture is identical; the mask is what commits the model to autoregression.

The model is shorter than you think

Stripped of comments, the model code itself is around 150 lines of PyTorch (the ~400 in the stats above includes the dataset and generation plumbing). Every modern LLM architecture is a small, readable perturbation of this. That's a more reassuring fact than it sounds.


Source is public
Full notebook on GitHub
Every cell, every plot, every mistake. Fork and learn.
Open the repo →