shakespeare-nanogpt-3
| Series | shakespeare |
|---|---|
| Version | 3 |
| Git tag | shakespeare-nanogpt-3 |
| Architecture | modern (RoPE, RMSNorm, bias-free) |
| Tokenizer | bpe (1024) |
| Parameters | 11,015,040 |
| Held-out BPC | 1.831 |
| Weights | sup-computer/shakespeare-nanogpt-3 (Hugging Face) |
| Researcher | Claude Fable 5 |
Key takeaways
- The new series best: an enlarged early-modern-drama corpus + a 1024-token, corpus-trained byte-level BPE + float32 training, reaching held-out
BPC 1.831at just 11.02M params. - The headline is parameter efficiency: it beats the prior champion v2 (1.919 at 29.9M) and matches-or-beats a fresh float32 GPT-2-vocab control (1.843 at 29.9M) at ~1/3 the parameters.
- The BPC edge over that fresh control is only −0.012 — within single-seed noise. The clean wins are the params efficiency and beating the prior champion; a multi-seed run is the stated next step.
The current best model in the shakespeare-nanogpt
series. Where v2 established the modern architecture
(RoPE, RMSNorm, bias-free) on the Complete Works with GPT-2 BPE, v3 changes the
data and the tokenizer, not the architecture: it trains a small vocabulary on
the corpus itself, enlarges that corpus with contemporary early-modern drama, and
trains in float32. The result is equal-or-better quality at about a third of the
size — this remains LLM-assisted research (Claude as the researcher under
human direction), not recursive self-improvement.
Series note. Successor to
shakespeare-nanogpt-2. All versions and the full story are inMODELS.md; the scoreboard isleaderboard.md.
Model details
| Version / git tag | shakespeare-nanogpt-3 |
| Origin | LLM-assisted research, round 6 float32 re-run (projects/shakespeare/runs/r6-fp32-bpe1k) |
| Architecture | modern — RoPE, RMSNorm, bias-free (core/nanogpt_core/model.py, vendored into the frozen folder) |
| Size | ~11.02M params (6 layers · 6 heads · 384 embd · 256 context) |
| Tokenizer | 1024-vocab byte-level BPE, trained on the enlarged corpus (committed tokenizer.json; the meta.pkl seam, ADR-0012) |
| Precision | float32 (eliminates the MPS float16 large-vocab logit overflow that confounded round 5) |
| Checkpoint | models/shakespeare-nanogpt-3/ckpt.pt (weights not committed — rebuild below) |
| Built on | nanoGPT by Andrej Karpathy (MIT) |
| Developed with | Claude Fable 5 (Claude Code) as researcher, human oversight |
| License | MIT |
Intended use
A learning project and a demonstration of LLM-assisted model development
(measured honestly, version over version). Given a few characters it continues
them in gibberish-but-convincingly-styled Early Modern English. v3's samples pick
up conventions from the wider corpus — speaker labels and the italic _Name._
stage-direction convention of the Marlowe/Webster editions.
Out of scope: real use of the text; any presentation of output as genuine Shakespeare (or Marlowe, Jonson, etc.) or as fact. No instruction following, no safety tuning. This is mimicry only.
Training data
The enlarged early-modern-drama corpus: Shakespeare's Complete Works (Gutenberg #100, ~5 MB) plus public-domain contemporary drama — Marlowe (Doctor Faustus, both Tamburlaines, Edward II, The Jew of Malta), Jonson (Volpone, The Alchemist, Every Man in His Humour), Kyd (The Spanish Tragedy), Webster (The Duchess of Malfi, The White Devil), and Dekker. Total training text ~7.85M characters; the tokenizer is trained on the training split only.
Crucially, the held-out test set is unchanged: the same fixed
250k-character Shakespeare slice (projects/shakespeare/test.txt) every version
in the series is scored on. It is excluded from training and never duplicated —
so v3's BPC stays directly comparable to v1 and v2 despite the larger, broader
training corpus. Enlarging the corpus also eliminated the overfit that defined
v2's rounds: validation loss fell monotonically instead of bottoming early.
Training procedure
Trained with the vendored train.py on Apple Silicon (MPS), dtype=float32,
lr=1e-3, block_size=256, 2000 iterations, best-val checkpoint. Float32 is the
load-bearing change: round 5 hit a float16 instability on MPS (the CUDA-only
GradScaler is disabled there, so large-vocab logits overflowed and forced the
larger vocabularies to a crippled learning rate). Re-running every vocabulary at
float32 removed the overflow and let bpe1k vs the GPT-2 control finally be
compared apples-to-apples.
Evaluation
Scored on the fixed held-out test in bits-per-character (BPC) — a tokenizer-agnostic metric (total NLL of the test text ÷ its character count ÷ ln 2), so char-level, GPT-2-BPE, and custom-BPE models are all directly comparable. Lower is better.
| Model | Tokenizer | Params | Test BPC |
|---|---|---|---|
shakespeare-nanogpt-1 (v1) | char (65) | 10.66M | 2.395 |
shakespeare-nanogpt-2 (v2) | GPT-2 BPE (50257) | 29.94M | 1.919 |
| r6 float32 GPT-2 control | GPT-2 BPE (50257) | 29.94M | 1.843 |
shakespeare-nanogpt-3 (v3) | BPE (1024) | 11.02M | 1.831 |
r6 float32 bpe4k (unreleased) | BPE (4096) | 12.19M | 1.813 |
Two clean results and one honest caveat:
- Parameter efficiency (clean win). v3 reaches equal-or-better BPC than the 50k-vocab models at ~1/3 the parameters. Of v2's 29.9M parameters, ~19.3M was the GPT-2 embedding table; a 1024-vocab embedding is ~0.4M. That budget buys capability instead of a giant vocabulary the Shakespeare domain never uses.
- Beats the prior champion (clean win). 1.831 < v2's 1.919 (−4.6%). All three float32 models clear 1.919 — partly because float32 helps every vocabulary (the GPT-2 control alone improves 1.919 → 1.843).
- The edge over the fresh control is within noise (the caveat). v3's −0.012
over the float32 GPT-2 control (1.831 vs 1.843) is smaller than the series' known
single-seed wobble; the
bpe4krun is even lower (1.813) but likewise unreplicated. Read the vs-control and bpe1k-vs-bpe4k gaps as directionally promising, not decisive.
Chart to add (dataviz pipeline). A grouped horizontal bar chart, held-out BPC vs. parameter count for the five rows above, would make the efficiency headline visible at a glance — v3 and the unreleased
bpe4ksitting lowest on BPC while also furthest left on params, the two 29.9M GPT-2-vocab models to their right. Per repo convention this must be generated bydataviz/(add it tobuild.py), not hand-authored; it is described here rather than embedded because that build step is deferred.
Limitations
- Single-seed measurements. Every score is one training run at a fixed seed
(
1337), no variance estimate. v3's win over the fresh float32 control and its gap to the unreleasedbpe4kare within plausible run-to-run noise. The clear results (params efficiency; beating the prior champion) do not depend on that narrow margin. Multi-seed replication is the explicit next step — it is what would turn "bpe1k is tied-best" into "bpe1k is best." bpe1kwas still improving. In round 5 its validation loss had not plateaued at 2000 iterations; more iterations would likely lower BPC further, so 1.831 is an under-trained floor, not a tuned optimum.- Domain-narrow test, broadened train. The training corpus now includes non-Shakespeare drama, but the test set is still pure Shakespeare — the metric rewards Shakespeare mimicry specifically.
- Still mimicry: fluent early-modern texture, but no meaning, knowledge, or factuality; no instruction following or safety tuning.
How to use
The BPE steps need the Hugging Face tokenizers library, provided ad hoc:
# self-contained v3 folder (weights are gitignored — rebuild them)
cd projects/shakespeare/models/shakespeare-nanogpt-3
uv run --with tokenizers python prepare.py # downloads the enlarged corpus, encodes here
uv run python train.py # -> ./ckpt.pt (zero-arg run reproduces v3)
uv run --with tokenizers python eval.py # score on the shared held-out test (expect BPC ~1.831)
uv run --with tokenizers python sample.py --start="ROMEO:"
The 1024-token tokenizer.json is committed and never retrained — prepare.py
only re-encodes with it, pinning the exact vocabulary.
Citation / credits
- nanoGPT by Andrej Karpathy (MIT) — model + training code.
- The Complete Works of Shakespeare and the contemporary early-modern drama (Marlowe, Jonson, Kyd, Webster, Dekker) — all public domain (Project Gutenberg).
- LLM-assisted research run with Claude Fable 5 (Claude Code) as researcher, human oversight.