sup computer

a small language model studio


shakespeare-nanogpt-2

Series: shakespeare
Version: 2
Git tag: shakespeare-nanogpt-2
Architecture: modern (RoPE, RMSNorm, bias-free)
Tokenizer: gpt2-bpe (50257)
Parameters: 29,900,000
Held-out BPC: 1.919
Weights
Researcher: Claude Opus 4.8

Key takeaways

  • The improved model: full corpus + modern architecture (RoPE, RMSNorm, bias-free) + GPT-2 BPE, reaching held-out BPC 1.919 (−20% vs. the v1 baseline).
  • The best of Experiment 01's four rounds (Round 3, early-stopped BPE). The Round 4 "champion" that stacked more regularization regressed.
  • Still mimicry, and scores are single-seed — data is the ceiling the next version will have to raise.

This is the current best model in the shakespeare-nanogpt series: the winner of a four-round LLM-assisted research experiment in which Claude Opus 4.8 acted as the researcher (diagnosing, changing, retraining, and measuring) and this small model was the thing being improved. This is a precursor stage to recursive self-improvement (RSI), not RSI itself. v2 is Round 3 of that experiment: full corpus + modern architecture + BPE tokenizer.

Series note. Successor to shakespeare-nanogpt-1. Both versions and the full story are in MODELS.md; the write-up is Experiment 01 (report index: research-docs/reports/) and the scoreboard is leaderboard.md.

Model details

Version / git tag: shakespeare-nanogpt-2
Origin: LLM-assisted research experiment, Round 3 (projects/shakespeare/runs/r3-bpe)
Architecture: modern (RoPE, RMSNorm, bias-free), in core/nanogpt_core/model.py
Size: ~29.9M params
Tokenizer: GPT-2 byte-pair encoding (~50k vocab)
Checkpoint: models/shakespeare-nanogpt-2/ckpt.pt (weights not committed; rebuild below)
Built on: nanoGPT by Andrej Karpathy (MIT)
Developed with: Claude Opus 4.8 (Claude Code) as researcher, human oversight
License: MIT

Intended use

Same as v1 — a learning project, here additionally demonstrating LLM-assisted model development (a large model improving a small one, measured honestly). Generated text is higher-quality Shakespeare-styled mimicry than v1, but still not coherent or factual.

Out of scope: real use of the text; any presentation of output as genuine Shakespeare or as fact. No instruction following, no safety tuning.

Training data

The Complete Works of Shakespeare (~5 MB), roughly 5× the data of v1's Tiny Shakespeare, prepared as GPT-2 BPE tokens (projects/shakespeare/data/shakespeare_full_bpe/). Evaluation uses a fixed 250k-character held-out test set (projects/shakespeare/test.txt) that no model trains on.

Training procedure

Trained with core/nanogpt_core/train.py on the same Apple Silicon MPS setup as v1 (~20 min). Note that the model overfits: validation loss bottomed out around step 1000 and then rose, and the save-best-val policy automatically kept the early, best checkpoint.
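A save-best-val policy of the kind described above can be sketched in a few lines. This is a minimal illustration, not the code in core/nanogpt_core/train.py: the class name `BestCheckpointTracker` and the validation curve are made up for the example, and a real training loop would write the checkpoint to disk wherever `update` returns True.

```python
import math


class BestCheckpointTracker:
    """Remember the lowest validation loss seen so far and the step it came from."""

    def __init__(self) -> None:
        self.best_val_loss = math.inf
        self.best_step = None

    def update(self, step: int, val_loss: float) -> bool:
        """Return True when this evaluation is a new best and should be saved."""
        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            self.best_step = step
            return True
        return False


# Illustrative numbers only (not measured): loss falls, then rises as the model overfits.
val_curve = [(250, 5.1), (500, 4.6), (750, 4.4), (1000, 4.3), (1250, 4.4), (1500, 4.6)]

tracker = BestCheckpointTracker()
saved_steps = [step for step, loss in val_curve if tracker.update(step, loss)]
print("checkpoint overwritten at steps:", saved_steps)
print("kept checkpoint from step:", tracker.best_step)
```

Because later, worse evaluations never overwrite the saved file, training can run past the overfitting point without losing the best checkpoint.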

Evaluation

Scored on the fixed held-out test in bits-per-character (BPC) — a tokenizer-agnostic metric, so char-level and BPE models are directly comparable. Lower is better.
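To make the tokenizer-agnostic claim concrete, here is one common way to turn a per-token cross-entropy into BPC. It is a sketch under assumptions: that the loss is a mean in nats per token and that the character count is taken from the raw test text. The function name and the loss and token figures are hypothetical; the scoring code in eval.py may differ in detail.

```python
import math


def bits_per_character(mean_loss_nats: float, n_tokens: int, n_chars: int) -> float:
    """Convert a mean per-token cross-entropy (nats) into bits per character."""
    total_nats = mean_loss_nats * n_tokens   # total negative log-likelihood of the text
    total_bits = total_nats / math.log(2)    # nats -> bits
    return total_bits / n_chars              # normalize by characters, not tokens


# Hypothetical char-level model: one token per character.
char_bpc = bits_per_character(mean_loss_nats=1.40, n_tokens=250_000, n_chars=250_000)

# Hypothetical BPE model: fewer tokens cover the same text, each with a higher loss.
bpe_bpc = bits_per_character(mean_loss_nats=4.20, n_tokens=80_000, n_chars=250_000)

print(f"char-level BPC: {char_bpc:.3f}")
print(f"BPE BPC:        {bpe_bpc:.3f}")
```

Dividing by the number of characters rather than the number of tokens is what puts both tokenizers on the same scale: a BPE model's per-token loss looks much larger, but each token accounts for several characters.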

Test BPC across baseline and four research rounds

| Round | Change | Test BPC | Worked? |
|---|---|---|---|
| Baseline | 1 MB control (data-starved baseline) | 2.395 | |
| 1 | 5× more data (full Complete Works) | 2.036 | yes, −15% |
| 2 | Modern architecture (RoPE + RMSNorm + bias-free) | 2.004 | yes, −1.6% |
| 3 | GPT-2 BPE tokenizer | 1.919 🏆 | yes, −4.3% |
| 4 | "Champion" (+ dropout 0.3 + 4000 iters) | 1.947 | no, regressed |

End to end: BPC 2.395 → 1.919, a 20% reduction. v2 is Round 3, which already combines all three productive levers.
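The percentages can be rechecked from the reported scores with a short script. This sketch assumes each round's change is measured against the previous row and uses only the three-decimal BPC values shown in the table, so a recomputed figure may differ from the quoted one in the last digit.

```python
# Test BPC values as reported in the table above.
scores = [
    ("baseline", 2.395),
    ("round 1", 2.036),
    ("round 2", 2.004),
    ("round 3", 1.919),
    ("round 4", 1.947),
]


def relative_change(before: float, after: float) -> float:
    """Percent change from `before` to `after`; negative means BPC went down."""
    return (after - before) / before * 100


# Each round compared with the row before it.
for (_, prev_bpc), (name, bpc) in zip(scores, scores[1:]):
    print(f"{name}: {relative_change(prev_bpc, bpc):+.1f}%")

# Baseline to the released v2 model (Round 3).
end_to_end = relative_change(scores[0][1], scores[3][1])
print(f"end to end: {end_to_end:+.1f}%")
```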

Researcher efficiency

We also tracked Claude's own token cost per round, to ask how much intelligence each unit of improvement cost.

BPC reduction per 100K Claude tokens, by round

Round 1 (fixing the data bottleneck) paid off hugely; later rounds returned far less for similar effort, and Round 4 spent the most and went backwards.
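One plausible reading of the efficiency metric is the absolute drop in test BPC divided by the researcher's token count, scaled to 100K tokens. The sketch below assumes that definition; the function name is hypothetical and the token count is a placeholder, not a measured figure from the experiment.

```python
def bpc_reduction_per_100k_tokens(
    bpc_before: float, bpc_after: float, researcher_tokens: int
) -> float:
    """Absolute BPC reduction per 100K researcher tokens (negative if BPC rose)."""
    if researcher_tokens <= 0:
        raise ValueError("researcher_tokens must be positive")
    return (bpc_before - bpc_after) / researcher_tokens * 100_000


# Placeholder token budget, NOT a reported number; BPC values are from the table.
example_tokens = 200_000

round1 = bpc_reduction_per_100k_tokens(2.395, 2.036, example_tokens)
round4 = bpc_reduction_per_100k_tokens(1.919, 1.947, example_tokens)

print(f"round 1: {round1:+.4f} BPC per 100K tokens")
print(f"round 4: {round4:+.4f} BPC per 100K tokens")
```

Under this definition a round that regresses gets a negative score, which matches the description of Round 4 spending tokens and going backwards.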

Charts generated by dataviz/.

Limitations

How to use

# self-contained v2 folder (weights are gitignored — rebuild them)
cd models/shakespeare-nanogpt-2
python prepare.py     # downloads the Complete Works, BPE-encodes it here
python train.py       # -> ./ckpt.pt
python eval.py        # score on the shared held-out test (expect BPC ~1.919)
python sample.py --start="ROMEO:"

Citation / credits


Addendum — June 2026

Added in the site-standardization pass (ADR-0015). The card above is unchanged; this is a tracked addendum. Site-wide fixes — repo links now resolve to GitHub/site routes, code blocks render within the column — apply automatically.