Type: Learning · Deep Dive
Year: 2026
Stack: PyTorch · tiktoken · docling

GPT-2 from scratch —
understanding attention by writing it

You can pip install transformers and spend years building on top of attention without ever understanding it. So I rebuilt the 124M-parameter GPT-2 stack line-by-line in PyTorch — multi-head attention, GELU, transformer blocks, the whole thing — and pretrained it on a medical textbook. This is what I took away from it.

LLM PyTorch From Scratch Attention Learning log
124M parameters (GPT-2 base config)
12 layers / 12 heads (d_model = 768)
1024-token context (tokens per sequence)
~400 lines of PyTorch (no Hugging Face dependencies)
[Illustration: the GPT-2 architecture as a modular factory tower, with attention heads, an MLP chamber, and residual bypass pipes.]
GPT-2, imagined as the machine it essentially is. Tokens in at the base, residual pipes running the full height, attention tubes and an MLP chamber inside every one of the 12 stacked crates.
01 — Why rebuild?

Using attention is easy. Understanding it isn't.

Every LLM project I've shipped in the last two years has been built on top of a pretrained model — Hugging Face, an API, a quantised weight file. That's the right default for production. It is also the fastest way to spend a career near transformers without ever actually understanding them.

I kept running into this gap. I could describe self-attention in an interview. I could not have written it on a blank page. "Softmax of QKᵀ scaled by √dₖ, times V" is a sentence, not an intuition. So I set a rule for myself: write the whole thing from nn.Module up, no Hugging Face, no copy-paste, and don't move on until I could derive each piece from memory.

This writeup is the condensed version — the parts that changed how I think about transformers, with the actual code from the notebook in the order I wrote it.

The goal

Not to train a usable language model (the data is tiny and the compute is a laptop). The goal was to build every piece of the GPT-2 architecture by hand so that when I later fine-tune a real model — MedGemma is next — I know exactly what each layer is doing and why.


02 — Setup

A medical textbook, a BPE tokenizer, and a sliding window

The corpus is Medicine PreTest Self-Assessment and Review (14th ed.), extracted from PDF via docling. Medical text is a deliberately awkward domain for a general-purpose tokenizer — plenty of rare words, Latin, dosage notation — which makes it a more honest sanity test than the usual tiny Shakespeare.

Tokenization uses tiktoken with GPT-2's 50,257-token BPE vocabulary. That keeps us drop-in compatible with the real GPT-2 weights if I ever want to warm-start from them later.
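A quick way to see both points at once, the vocabulary size and how awkwardly clinical vocabulary fragments (a throwaway check, not a cell from the notebook; the example word is arbitrary):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)  # 50257

# a general-purpose BPE has no single token for rare clinical terms,
# so they fragment into several sub-word pieces
ids = tokenizer.encode("hepatosplenomegaly")
print(len(ids), [tokenizer.decode([i]) for i in ids])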

The dataset itself is a sliding window over the full token stream. For a sequence of max_length tokens, the target is the same sequence shifted by one — classic next-token prediction. The stride parameter lets you trade overlap for unique samples.

gpt2 / dataset.py
import torch
from torch.utils.data import Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        # sliding window — input is [i : i+L], target is shifted by one
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        return self.input_ids[i], self.target_ids[i]
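Wiring that into a DataLoader takes a few more lines. A sketch, where raw_text is assumed to hold the extracted textbook text; a stride equal to max_length means no overlap between samples:

import tiktoken
from torch.utils.data import DataLoader

tokenizer = tiktoken.get_encoding("gpt2")
dataset = GPTDatasetV1(raw_text, tokenizer, max_length=1024, stride=1024)
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)

inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)  # torch.Size([4, 1024]) each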

And the GPT-2 base config as a plain dict — this is what gets threaded through every module. The 124M label comes from counting parameters once everything is wired up.

gpt2 / config.py
GPT_CONFIG_124M = {
    "vocab_size":     50257,   # BPE vocab
    "context_length": 1024,    # max tokens per forward pass
    "emb_dim":        768,     # d_model
    "n_heads":        12,      # attention heads
    "n_layers":       12,      # transformer blocks
    "drop_rate":      0.1,
    "qkv_bias":       False,
}

03 — Attention

The one piece it's worth writing by hand

Almost every other module in GPT-2 is a one-liner or a stock PyTorch call. Attention is the exception. You can read the formula a hundred times and still not quite feel how the shapes move. So this was the cell I refused to copy from anywhere.

The mental model that finally clicked for me: attention is a soft dictionary lookup. For every token, you compute a query. Every other token has a key (its "label") and a value (its "contents"). You score the query against every key, softmax the scores into weights, and take a weighted sum of the values. That's it. Everything else is plumbing.
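To make that concrete, here is the lookup in miniature: one hand-picked query against three keys, no learned weights (a toy example, not a notebook cell):

import torch

q = torch.tensor([1.0, 0.0])                 # this token's question
K = torch.tensor([[1.0, 0.0],                # key 0: matches the query exactly
                  [0.0, 1.0],                # key 1: orthogonal, irrelevant
                  [0.7, 0.7]])               # key 2: a partial match
V = torch.tensor([[10.0], [20.0], [30.0]])   # each token's "contents"

scores = K @ q                               # dot-product relevance: [1.0, 0.0, 0.7]
weights = torch.softmax(scores, dim=0)       # roughly [0.47, 0.17, 0.35]
print(weights @ V)                           # roughly 18.8, a blend dominated by key 0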

Four bits of that plumbing are worth pointing at:

The √dₖ scale

Without it, the dot-product magnitudes grow with dimension, softmax saturates, and gradients die. Dividing by sqrt(head_dim) keeps the logits in a range where softmax still has useful slope. A one-line detail that actually determines whether the model trains at all.
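You can watch that failure mode directly. As the dimension grows, unscaled dot products of random vectors spread out and softmax collapses onto one token (a throwaway demo):

import torch

torch.manual_seed(0)
d = 768
q, K = torch.randn(d), torch.randn(8, d)         # one random query, eight random keys

raw = K @ q                                      # std grows like sqrt(d), about 28 here
print(torch.softmax(raw, dim=0).max())           # near 1.0: saturated, gradient near zero
print(torch.softmax(raw / d**0.5, dim=0).max())  # well below 1: softmax keeps useful slope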

The causal mask

An upper-triangular -inf mask added to the attention logits before softmax. This is what makes GPT "decoder-only": token t can only attend to tokens ≤ t. Without it, you've built an encoder.
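The mask itself is tiny. For a four-token context:

import torch

T = 4
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = torch.zeros(T, T).masked_fill(mask, float("-inf"))
print(torch.softmax(scores, dim=-1))
# row t spreads its attention over tokens 0..t and gives the future exactly zero:
# tensor([[1.00, 0.00, 0.00, 0.00],
#         [0.50, 0.50, 0.00, 0.00],
#         [0.33, 0.33, 0.33, 0.00],
#         [0.25, 0.25, 0.25, 0.25]])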

Multi-head = reshape, not loop

The naïve way to do h attention heads is a Python loop. The actual way is a single projection to d_model, then .view() it into (heads, head_dim) and transpose. One matmul, h heads, free parallelism.

The output projection

After concatenating heads back together, a final Linear(d, d) lets the model mix information across heads. Skipping this makes each head an island. With it, heads can specialise and still inform each other.

gpt2 / attention.py
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # causal mask — upper triangle of ones, everything else zero
        self.register_buffer("mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # 1) project into Q, K, V — each (B, T, d_out)
        q = self.W_query(x); k = self.W_key(x); v = self.W_value(x)
        # 2) split heads via reshape + transpose → (B, heads, T, head_dim)
        q = q.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        # 3) scaled dot-product attention, with the causal mask
        attn_scores = q @ k.transpose(2, 3)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / k.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        # 4) apply weights to V, merge heads back, output projection
        ctx = (attn_weights @ v).transpose(1, 2).contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(ctx)
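Before moving on, a quick shape check: the same tensor shape comes out as went in, which is what lets blocks stack (a sanity check of mine, not from the notebook):

mha = MultiHeadAttention(d_in=768, d_out=768, context_length=1024,
                         dropout=0.1, num_heads=12)
x = torch.randn(2, 16, 768)   # (batch, tokens, d_model)
print(mha(x).shape)           # torch.Size([2, 16, 768])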

If there's one block of code in the whole model that I'd suggest writing by hand at least once, this is it. Everything else rests on top.


04 — The transformer block

LayerNorm, GELU, residuals — the boring parts that make it work

Attention alone does not train. Stack twelve attention layers on top of each other and you get vanishing gradients, unstable activations, and a model that spends most of its time deciding whether to be zero or infinity. The transformer block is the scaffolding that turns attention from a clever idea into something you can actually optimise.

Three ingredients:

gpt2 / norms.py
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift


class GELU(nn.Module):
    def forward(self, x):
        # GPT-2's approximate GELU — smoother than ReLU, keeps small negatives alive
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * x**3)))


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)
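And a two-line check that LayerNorm does what it says: per-token mean near 0, variance near 1 along the embedding dimension (my addition):

ln = LayerNorm(emb_dim=768)
out = ln(torch.randn(2, 16, 768))
print(out.mean(dim=-1).abs().max())            # close to 0
print(out.var(dim=-1, unbiased=False).mean())  # close to 1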

The block itself wires those into the two-residual pattern that makes GPT-2 GPT-2 — specifically the pre-LN variant, where LayerNorm is applied before attention and before the MLP, not after. This is different from the original "Attention Is All You Need" post-LN layout and is one of the reasons GPT-2 trains stably at depth.

gpt2 / block.py
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.ff = FeedForward(cfg)
        self.drop = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # pre-LN → attention → residual
        x = x + self.drop(self.att(self.norm1(x)))
        # pre-LN → MLP → residual
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

05 — The full model

Stack twelve of them, embed tokens, project to vocab

Once the block is built, the model itself is unreasonably short. It's basically: embed the tokens, add positional embeddings, run them through a stack of twelve identical blocks, layernorm once more, and project to vocabulary logits. No bells, no whistles.

gpt2 / model.py
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        B, T = in_idx.shape
        tok = self.tok_emb(in_idx)
        pos = self.pos_emb(torch.arange(T, device=in_idx.device))
        x = self.drop_emb(tok + pos)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        return self.out_head(x)   # (B, T, vocab_size)
Counting params

Embeddings (50,257 × 768) ≈ 38.6M. Each of 12 transformer blocks: attention ≈ 2.4M + MLP ≈ 4.7M ≈ 7.1M, times 12 ≈ 85M. Plus output head and norms. Total comes out to ~163M; GPT-2 tied the token embedding with the output head, which is how it's quoted as 124M. That tying is a one-line change and is the easiest "free" parameter-count improvement in the whole stack.
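Both numbers are one line each to verify. A sketch using the classes above; note that parameters() deduplicates shared tensors, so the tied count drops automatically:

model = GPTModel(GPT_CONFIG_124M)
total = sum(p.numel() for p in model.parameters())
print(f"{total:,}")   # roughly 163M with separate embedding and output head

# GPT-2-style weight tying: the output head shares the token embedding matrix
model.out_head.weight = model.tok_emb.weight
tied = sum(p.numel() for p in model.parameters())
print(f"{tied:,}")    # roughly 124M, the quoted figure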


06 — Generating text

Greedy decoding — the 20-line proof that the shapes work

With an untrained model the output is gibberish, so this is not about quality — it's about closing the loop and proving that a tensor of token IDs can go in, a distribution can come out, and the next token you sample from it can be appended and fed back in. Autoregressive generation in its smallest form:

gpt2 / generate.py
import torch

def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]          # crop to context window
        with torch.no_grad():
            logits = model(idx_cond)               # (B, T, vocab)
        logits = logits[:, -1, :]                  # only the last position matters
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.argmax(probs, dim=-1, keepdim=True)   # greedy
        idx = torch.cat((idx, next_id), dim=1)
    return idx
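Closing the loop end-to-end then looks something like this (the prompt text is mine; an untrained model continues it with noise):

import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")
prompt = torch.tensor([tokenizer.encode("The patient presents with")])

model.eval()   # disable dropout for inference
out = generate_text_simple(model, prompt, max_new_tokens=20,
                           context_size=GPT_CONFIG_124M["context_length"])
print(tokenizer.decode(out[0].tolist()))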

Swap argmax for top-k or nucleus sampling and you have the inference loop every LLM API you've ever used is doing under the hood. It really is that simple — once everything above actually works.
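For reference, top-k is roughly a three-line swap: keep the k largest logits, push the rest to -inf, and sample instead of taking the argmax. A sketch, not notebook code:

import torch

def sample_top_k(logits, k=50, temperature=1.0):
    # logits: (B, vocab). Keep only each row's k highest-scoring tokens.
    topk_vals, _ = torch.topk(logits, k)
    logits = logits.masked_fill(logits < topk_vals[..., -1:], -torch.inf)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (B, 1), same shape as the greedy next_id

# in the loop above: next_id = sample_top_k(logits, k=50)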


07 — What I learned

Six things that only really landed after writing it

Attention is a soft lookup, not magic

Queries ask "what do I need?", keys answer "here's what I am", values carry "here's what I contain". The softmax turns scoring into a differentiable weighted average. Once you hold that picture, every attention variant (cross-attention, grouped-query, sparse) is just a tweak to who-queries-whom.

Multi-head is shape gymnastics

I used to picture h separate attention modules running in parallel. In reality it's one projection and one matmul, reshaped. This is why adding heads at a fixed d_model is essentially free — you're not doing more compute, you're rearranging the same compute.

Residuals are the whole ball game at depth

Delete the x + ... in the transformer block and the model won't train past three layers. The residual stream is what carries the original signal past every attention and MLP, and every layer is really just writing a small update into it.

Pre-LN vs post-LN isn't a detail

The original transformer paper put LayerNorm after the residual add. GPT-2 moved it before the attention and MLP. That one change is a big part of why GPT-2 trains stably at 12+ layers, where post-LN transformers typically need careful learning-rate warmup to keep from diverging early in training.

The causal mask is one line — and everything

Add an upper-triangular -inf before softmax and you've built a decoder. Don't, and you've built an encoder. The architecture is identical; the mask is what commits the model to autoregression.

The model is shorter than you think

Stripped of comments, the model code itself is around 150 lines of PyTorch (the ~400 in the stats above includes the dataset and generation plumbing). Every modern LLM architecture is a small, readable perturbation of this. That's a more reassuring fact than it sounds.


Source is public
Full notebook on GitHub
Every cell, every plot, every mistake. Fork and learn.
Open the repo →