sup computer

a small language model studio


shakespeare-nanogpt-3

Series: shakespeare
Version: 3
Git tag: shakespeare-nanogpt-3
Architecture: modern (RoPE, RMSNorm, bias-free)
Tokenizer: bpe (1024)
Parameters: 11,015,040
Held-out BPC: 1.831
Weights: sup-computer/shakespeare-nanogpt-3 (Hugging Face)
Researcher: Claude Fable 5

Key takeaways

  • The new series best: an enlarged early-modern-drama corpus + a 1024-token, corpus-trained byte-level BPE + float32 training, reaching held-out BPC 1.831 at just 11.02M params.
  • The headline is parameter efficiency: it beats the prior champion v2 (1.919 at 29.9M) and matches-or-beats a fresh float32 GPT-2-vocab control (1.843 at 29.9M) at ~1/3 the parameters.
  • The BPC edge over that fresh control is only −0.012, which is within single-seed noise. The clean wins are the parameter efficiency and beating the prior champion; a multi-seed run is the stated next step.

The current best model in the shakespeare-nanogpt series. Where v2 established the modern architecture (RoPE, RMSNorm, bias-free) on the Complete Works with GPT-2 BPE, v3 changes the data and the tokenizer, not the architecture: it trains a small vocabulary on the corpus itself, enlarges that corpus with contemporary early-modern drama, and trains in float32. The result is equal-or-better quality at about a third of the size. This remains LLM-assisted research (Claude as the researcher under human direction), not recursive self-improvement.

Series note. Successor to shakespeare-nanogpt-2. All versions and the full story are in MODELS.md; the scoreboard is leaderboard.md.

Model details

Version / git tag: shakespeare-nanogpt-3
Origin: LLM-assisted research, round 6 float32 re-run (projects/shakespeare/runs/r6-fp32-bpe1k)
Architecture: modern (RoPE, RMSNorm, bias-free); core/nanogpt_core/model.py, vendored into the frozen folder
Size: ~11.02M params (6 layers · 6 heads · 384 embd · 256 context)
Tokenizer: 1024-vocab byte-level BPE, trained on the enlarged corpus (committed tokenizer.json; the meta.pkl seam, ADR-0012)
Precision: float32 (eliminates the MPS float16 large-vocab logit overflow that confounded round 5)
Checkpoint: models/shakespeare-nanogpt-3/ckpt.pt (weights not committed; rebuild below)
Built on: nanoGPT by Andrej Karpathy (MIT)
Developed with: Claude Fable 5 (Claude Code) as researcher, human oversight
License: MIT
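
The parameter count can be sanity-checked with a back-of-envelope tally. The sketch below assumes details this card does not state: a token embedding tied with the output head, a 4x MLP expansion, weight-only RMSNorm, and no learned positional parameters (RoPE). The function name is illustrative and not part of the repo.

```python
def count_params(n_layer: int, n_embd: int, vocab_size: int) -> int:
    """Tally weights for a bias-free, RoPE, RMSNorm GPT under the assumed layout."""
    attn = 4 * n_embd * n_embd        # Q, K, V projections (3*d*d) plus output projection (d*d)
    mlp = 2 * (4 * n_embd) * n_embd   # up- and down-projection, ASSUMED 4x hidden width
    norms = 2 * n_embd                # two RMSNorm gain vectors per block, no biases
    per_block = attn + mlp + norms
    embedding = vocab_size * n_embd   # ASSUMED tied with the output head, so counted once
    final_norm = n_embd               # RMSNorm before the output head
    return n_layer * per_block + embedding + final_norm


v3_params = count_params(n_layer=6, n_embd=384, vocab_size=1024)
```

Under these assumptions `v3_params` comes to 11,015,040, the figure reported for v3. The agreement is consistent with the assumed layout but does not confirm it; core/nanogpt_core/model.py is the authority. The head count does not appear because it changes how the 384 dimensions are split, not how many weights there are.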

Intended use

A learning project and a demonstration of LLM-assisted model development (measured honestly, version over version). Given a few characters it continues them in gibberish-but-convincingly-styled Early Modern English. v3's samples pick up conventions from the wider corpus — speaker labels and the italic _Name._ stage-direction convention of the Marlowe/Webster editions.

Out of scope: real use of the text; any presentation of output as genuine Shakespeare (or Marlowe, Jonson, etc.) or as fact. No instruction following, no safety tuning. This is mimicry only.

Training data

The enlarged early-modern-drama corpus: Shakespeare's Complete Works (Gutenberg #100, ~5 MB) plus public-domain contemporary drama — Marlowe (Doctor Faustus, both Tamburlaines, Edward II, The Jew of Malta), Jonson (Volpone, The Alchemist, Every Man in His Humour), Kyd (The Spanish Tragedy), Webster (The Duchess of Malfi, The White Devil), and Dekker. Total training text ~7.85M characters; the tokenizer is trained on the training split only.

Crucially, the held-out test set is unchanged: the same fixed 250k-character Shakespeare slice (projects/shakespeare/test.txt) every version in the series is scored on. It is excluded from training and never duplicated — so v3's BPC stays directly comparable to v1 and v2 despite the larger, broader training corpus. Enlarging the corpus also eliminated the overfit that defined v2's rounds: validation loss fell monotonically instead of bottoming early.
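
The "excluded from training" property can be spot-checked mechanically. The following is a minimal sketch, not the repo's own check: the function name, window size, and stride are hypothetical, and it only detects verbatim overlap.

```python
def shared_windows(train_text: str, test_text: str, window: int = 64, stride: int = 64) -> int:
    """Count fixed-size windows of the test text that occur verbatim in the training text."""
    hits = 0
    # Always examine at least one window, even if the test text is shorter than `window`.
    last_start = max(len(test_text) - window + 1, 1)
    for start in range(0, last_start, stride):
        chunk = test_text[start:start + window]
        if chunk and chunk in train_text:
            hits += 1
    return hits
```

A result of 0 on the real training and test strings would support the claim. Short windows can produce false alarms, since stock phrases and speaker labels legitimately recur across plays, so the window should be long enough to rule those out.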

Training procedure

Trained with the vendored train.py on Apple Silicon (MPS), dtype=float32, lr=1e-3, block_size=256, 2000 iterations, best-val checkpoint. Float32 is the load-bearing change: round 5 hit a float16 instability on MPS (the CUDA-only GradScaler is disabled there, so large-vocab logits overflowed and forced the larger vocabularies to a crippled learning rate). Re-running every vocabulary at float32 removed the overflow and let bpe1k vs the GPT-2 control finally be compared apples-to-apples.
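
For intuition about the overflow, float16 has a small finite range: its largest finite value is 65504. The sketch below uses only the Python standard library to round a number through half precision; it illustrates the numeric limit and is not a reproduction of the MPS behavior. The helper name is hypothetical.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def round_trip_float16(x: float) -> float:
    """Round x to half precision, mapping out-of-range values to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the float16 range; this sketch ASSUMES the
        # accelerator behavior of saturating to infinity instead of raising.
        return math.copysign(math.inf, x)
```

A value such as 70000.0 does not survive the round trip as a finite number, whereas float32 represents it without difficulty. Once a logit becomes infinite, the loss computed from it typically becomes infinite or NaN as well.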

Evaluation

Scored on the fixed held-out test in bits-per-character (BPC) — a tokenizer-agnostic metric (total NLL of the test text ÷ its character count ÷ ln 2), so char-level, GPT-2-BPE, and custom-BPE models are all directly comparable. Lower is better.
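
The formula above can be written out directly. This is a minimal sketch rather than the repo's eval.py: the function and argument names are illustrative, and the second helper assumes the evaluation loop reports a mean cross-entropy per token in nats.

```python
import math


def bits_per_character(total_nll_nats: float, n_chars: int) -> float:
    """Total NLL of the test text (nats) / character count / ln 2."""
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")
    return total_nll_nats / n_chars / math.log(2)


def bpc_from_token_loss(mean_loss_per_token: float, n_tokens: int, n_chars: int) -> float:
    """Convert a mean per-token loss (nats) to BPC by first recovering the total NLL."""
    return bits_per_character(mean_loss_per_token * n_tokens, n_chars)
```

Dividing by the character count rather than the token count is what makes the metric tokenizer-agnostic: a tokenizer that packs more characters into each token has fewer tokens but a higher loss per token, and the total NLL over the same text accounts for both.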

| Model | Tokenizer | Params | Test BPC |
| --- | --- | --- | --- |
| shakespeare-nanogpt-1 (v1) | char (65) | 10.66M | 2.395 |
| shakespeare-nanogpt-2 (v2) | GPT-2 BPE (50257) | 29.94M | 1.919 |
| r6 float32 GPT-2 control | GPT-2 BPE (50257) | 29.94M | 1.843 |
| shakespeare-nanogpt-3 (v3) | BPE (1024) | 11.02M | 1.831 |
| r6 float32 bpe4k (unreleased) | BPE (4096) | 12.19M | 1.813 |
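
The pairwise comparisons can be recomputed from the table rows. The short keys and the helper below are illustrative; the numbers are copied from the table.

```python
# (parameters in millions, held-out test BPC), copied from the table above
RESULTS = {
    "v1": (10.66, 2.395),
    "v2": (29.94, 1.919),
    "r6-gpt2-control": (29.94, 1.843),
    "v3": (11.02, 1.831),
    "r6-bpe4k": (12.19, 1.813),
}


def compare(candidate: str, baseline: str) -> tuple[float, float]:
    """Return (BPC difference, parameter ratio) of candidate relative to baseline."""
    params_c, bpc_c = RESULTS[candidate]
    params_b, bpc_b = RESULTS[baseline]
    return round(bpc_c - bpc_b, 3), round(params_c / params_b, 3)
```

For example, `compare("v3", "r6-gpt2-control")` gives a BPC difference of -0.012 at a parameter ratio of 0.368, and `compare("v3", "v2")` gives -0.088 at the same ratio.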

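Two clean results and one honest caveat. The clean results: v3 beats the prior champion v2 (1.831 vs. 1.919), and it does so with 11.02M parameters against 29.94M. The caveat: its edge over the float32 GPT-2 control (1.831 vs. 1.843) is within single-seed noise, and a multi-seed run is the stated next step.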

Chart to add (dataviz pipeline). A grouped horizontal bar chart, held-out BPC vs. parameter count for the five rows above, would make the efficiency headline visible at a glance — v3 and the unreleased bpe4k sitting lowest on BPC while also furthest left on params, the two 29.9M GPT-2-vocab models to their right. Per repo convention this must be generated by dataviz/ (add it to build.py), not hand-authored; it is described here rather than embedded because that build step is deferred.

Limitations

How to use

The BPE steps need the Hugging Face tokenizers library, provided ad hoc:

```shell
# self-contained v3 folder (weights are gitignored; rebuild them)
cd projects/shakespeare/models/shakespeare-nanogpt-3
uv run --with tokenizers python prepare.py   # downloads the enlarged corpus, encodes here
uv run python train.py                        # -> ./ckpt.pt (zero-arg run reproduces v3)
uv run --with tokenizers python eval.py       # score on the shared held-out test (expect BPC ~1.831)
uv run --with tokenizers python sample.py --start="ROMEO:"
```

The 1024-token tokenizer.json is committed and never retrained; prepare.py only re-encodes with it, pinning the exact vocabulary.

Citation / credits