gatsby-nanogpt-2

Series	gatsby
Version	2
Git tag	gatsby-nanogpt-2
Architecture	base char-level (LayerNorm, learned position embeddings, biases)
Tokenizer	char (80)
Parameters	~10.65M
Held-out BPC	—
Weights	sup-computer/gatsby-nanogpt-2 (Hugging Face)
Researcher	Claude Opus 4.8

Key takeaways

A char-level GPT that is a behavioural peer of the paid baseline (gatsby-nanogpt-1), with the same green-light obsession and a working green=1..5 dial, but whose corpus was written by a mixture of four local open models (Olmo, Ministral, Gemma, Granite) for $0 instead of ~$6 of Claude API usage.
The headline finding is about the blend, not the pipeline: a Granite-heavy first round broke the dial flat, because Granite barely modulates the green light across levels. Which generators you lean on is a design decision with teeth.
Rebalancing off Granite and doubling the corpus (1k→2k stories) recovered the dial — the model needed the extra headroom to learn the conditioning the corpus already contained.
Same status as v1: a documented milestone, not exhibit-ready. Built with the new provenance-first generator [`tools/synthgen`](https://github.com/romellogoodman/sup-computer/blob/main/tools/synthgen/README.md) ([ADR-0014](https://github.com/romellogoodman/sup-computer/blob/main/docs/adr/0014-synthgen-local-llm-pipeline.md)).

A character-level GPT fixated on Jay Gatsby's green light, with a baked-in intensity dial ([green=1] undertow → [green=5] swallows the story) — the same behaviour as gatsby-nanogpt-1, but trained on a corpus written by a mixture of four local open models (Olmo 3, Ministral 3, Gemma 4, Granite 4.1) instead of the Claude API. Cost to write the corpus: $0. Second model in the gatsby-nanogpt series; see Experiment 04.

The artifact is the behavior, not the prose. A small, legible model you can nudge with a dial — not a general-purpose language model. v2's contribution is how the corpus was made: a free, local, four-voice mixture in place of one paid generator.

Model details


Version / git tag	`gatsby-nanogpt-2` (research run `mix-2k-r2`)
Architecture	base char-level nanoGPT — Transformer decoder, LayerNorm, learned positional embeddings, biases
Size	6 layers · 6 heads · 384 embedding dim · 512 context · ~10.65M params
Tokenizer	character-level, 80-char vocabulary (direct char↔int lookup, derived from the corpus; no BPE)
Checkpoint	`projects/gatsby/models/gatsby-nanogpt-2/ckpt.pt` (weights not committed — rebuild below)
Built on	nanoGPT by Andrej Karpathy (MIT), vendored
Corpus generator	`tools/synthgen` + LM Studio (local)
Developed with	Claude (Claude Code)
License	MIT
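
As a rough cross-check of the Size row, the parameter count can be estimated from the listed dimensions. This is a sketch under assumptions the card does not state: a 4x MLP expansion, one embedding matrix shared between input and output, and position embeddings left out of the headline count. `count_params` is a hypothetical helper, not a function from the repository.

```python
def count_params(n_layer=6, n_embd=384, block_size=512, vocab_size=80,
                 bias=True, count_position_embeddings=False):
    """Estimate the parameters of a GPT-2-style decoder from its dimensions."""
    b = 1 if bias else 0
    # LayerNorm: one weight vector, plus one bias vector when biases are on.
    layer_norm = n_embd + b * n_embd
    # Attention: fused QKV projection, then the output projection.
    attention = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    # MLP: assumed 4x expansion and projection back down.
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd) + (4 * n_embd * n_embd + b * n_embd)
    block = 2 * layer_norm + attention + mlp
    token_embedding = vocab_size * n_embd      # assumed shared with the output head
    position_embedding = block_size * n_embd   # learned positions, per the card
    total = n_layer * block + layer_norm + token_embedding  # includes final LayerNorm
    if count_position_embeddings:
        total += position_embedding
    return total


estimate = count_params()
print(f"{estimate:,} parameters (~{estimate / 1e6:.2f}M)")
```

Under these assumptions the estimate lands within about half a percent of the ~10.65M listed above; the exact figure depends on which embeddings the counter includes.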

Intended use

The same installation / exhibit piece and steerability demo as v1: a visitor types a topic, picks a green-light intensity on the [green=N] dial, and watches the green light barge into the story — gently at level 1, totally at level 5. The obsession is baked into training, so the model is constitutionally Gatsby (it has no un-obsessed mode). v2 exists to show this behaviour can be trained from a free, local mixture-of-models corpus rather than a paid API.

Out of scope. Explicitly not a general-purpose language model. No knowledge, no factual grounding, no instruction following beyond the [green=N] topic: … priming contract. Do not use its output as information.

Training data

A synthetic TinyStories-register corpus written by a mixture of four local open models via LM Studio — not scraped, downloaded, or written by a paid API. Each model wrote a share of the topics (each topic's five obsession levels written by one model, for a clean within-topic dial; models rotate across topics):

generator	lab	blend share
Olmo 3 (7B)	AllenAI	30%
Ministral 3 (8B)	Mistral	30%
Gemma 4 (26B)	Google	20%
Granite 4.1 (8B)	IBM	20%
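
One way to turn these shares into a per-topic assignment is a weighted rotation, sketched below. The rotation order and the helper names (`rotation_pattern`, `assign_generators`) are assumptions for illustration; the card only says that models rotate across topics and that one model writes all five levels of a topic.

```python
from collections import Counter
from itertools import cycle

# Blend shares from the table above, expressed in tenths.
BLEND = {
    "Olmo 3 (7B)": 3,
    "Ministral 3 (8B)": 3,
    "Gemma 4 (26B)": 2,
    "Granite 4.1 (8B)": 2,
}


def rotation_pattern(weights):
    """Interleave generators so each appears `weight` times per cycle."""
    remaining = dict(weights)
    pattern = []
    while any(remaining.values()):
        for name in weights:
            if remaining[name] > 0:
                pattern.append(name)
                remaining[name] -= 1
    return pattern


def assign_generators(topics, weights=BLEND):
    """Map each topic to the single generator that writes all of its levels."""
    return dict(zip(topics, cycle(rotation_pattern(weights))))


# Placeholder topic names; the real topic list is not reproduced here.
topics = [f"topic-{i}" for i in range(400)]
assignment = assign_generators(topics)
print(Counter(assignment.values()))
```

With the 400 topics this card reports, a 30/30/20/20 blend works out to 120, 120, 80, and 80 topics per generator.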

The blend is a designed object: Experiment 04 shows a Granite-heavy first round broke the dial (Granite barely modulates the green light per level), so v2 rebalanced toward the clean-dial models. Stories are cleaned to gatsby's flowing-prose register (markdown stripped, punctuation folded to ASCII) and each is prefixed with the loud control line:

[green=N] [green=N] [green=N] obsession=<word>
topic: <a topic>
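
The control line above can be assembled with a small helper, sketched here. `build_prompt` is a hypothetical name. The only obsession word this card attests is `total` for level 5, so the word is passed in as an argument instead of being looked up from a guessed table.

```python
def build_prompt(topic: str, level: int, obsession_word: str, repeats: int = 3) -> str:
    """Build the two-line priming header that precedes each story."""
    if not 1 <= level <= 5:
        raise ValueError("green level must be between 1 and 5")
    tag = f"[green={level}]"
    control = " ".join([tag] * repeats)  # the tag is repeated to make it loud
    return f"{control} obsession={obsession_word}\ntopic: {topic}\n"


prompt = build_prompt("a dog and a balloon", 5, "total")
print(prompt, end="")
```

The same string can be passed to the sampler as its start text, so prompts at inference time match the format seen in training.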

2000 stories / ~1.53M chars (1,532,760), 400 topics × 5 green levels.
Cost: $0. Generation runs locally on Apple Silicon, so the real cost is generator throughput, not dollars: ~100 min for 2000 stories.
The corpus and its provenance are committed: projects/gatsby/data/raw.txt plus data/raw.manifest.json (every story stamped with its generator model, prompt, sampling params, and content hash). A research project records its data and how it was made.
90/10 train/val split (~1.38M / ~153k characters).
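
The provenance stamp described above can be pictured as one record per story. This is a minimal sketch: the field names, the use of SHA-256, and the sample sampling parameters are all assumptions, not the actual schema of data/raw.manifest.json.

```python
import hashlib
import json


def content_hash(story: str) -> str:
    """Hash the story text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(story.encode("utf-8")).hexdigest()


def manifest_entry(story: str, generator: str, prompt: str, sampling: dict) -> dict:
    """Build one provenance record; field names are hypothetical."""
    return {
        "generator": generator,
        "prompt": prompt,
        "sampling": sampling,
        "content_sha256": content_hash(story),
    }


def verify(story: str, entry: dict) -> bool:
    """Check that a story still matches the hash recorded for it."""
    return content_hash(story) == entry["content_sha256"]


story = "Once upon a time, a dog saw a green light across the water."
entry = manifest_entry(
    story,
    generator="Olmo 3 (7B)",
    prompt="placeholder prompt text",   # illustrative only
    sampling={"temperature": 0.8},      # illustrative only
)
print(json.dumps(entry, indent=2))
```

Storing a content hash next to each story lets a later reader detect any edit to raw.txt that was not reflected in the manifest.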

Training procedure

Optimizer: AdamW, LR 1e-3 with cosine decay to 1e-4, 100 warmup iters, β₂ 0.99, batch size 64, dropout 0.2.
Run: an extended schedule, because the 2× corpus overfits later than v1's. Save-best-val kept the checkpoint at step ~2000 (val loss 0.622), which is 60% deeper than Round 1's step-1250 minimum. A zero-arg python train.py (3000 iters) recovers the same best-val checkpoint.
Hardware: Apple Silicon Mac (MPS / Metal backend), torch.compile disabled.
Wall-clock: ~30 minutes to the best-val checkpoint.
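
The learning-rate settings above describe a linear warmup followed by cosine decay. The sketch below shows that shape with the listed values. The decay horizon of 3000 steps is an assumption (it matches the default iteration count, but the card does not state it), and `lr_at` is a hypothetical helper, not the repository's scheduler.

```python
import math


def lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup_iters=100, decay_iters=3000):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_iters:
        # Ramp linearly up to max_lr over the warmup steps.
        return max_lr * (step + 1) / warmup_iters
    if step >= decay_iters:
        # Hold the floor once the decay horizon is reached.
        return min_lr
    progress = (step - warmup_iters) / (decay_iters - warmup_iters)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)


for step in (0, 50, 100, 2000, 3000):
    print(step, f"{lr_at(step):.2e}")
```

Under this assumed horizon the rate at step ~2000, where the best checkpoint was saved, is already well below its peak but still above the 1e-4 floor.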

Evaluation

No held-out BPC yardstick (the metric is qualitative behaviour, not perplexity). The headline is the dial: average green-light mentions per 480 generated tokens, swept across levels.

level	1	2	3	4	5
avg green mentions	3.72	4.78	4.67	4.50	6.06

The dial works at the endpoints: L1 (a brief end-note) → L5 (dominates the back half) is a clear rise. The middle is compressed: L2–L4 bunch together and drift slightly downward, from 4.78 to 4.50. Round 1's dial was flat (1.7 / 1.7 / 1.8 / 2.0 / 1.4, with level 5 the lowest); Round 2 recovered it by rebalancing the blend and doubling the corpus. Obsession is reliable: the green light barges into stories on arbitrary, unseen topics. Reproduce with python eval_dial.py in the frozen folder.
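
The dial metric can be sketched as a mention count averaged per level, plus two summary numbers for the table above. How eval_dial.py actually counts a mention is not specified here, so the case-insensitive match on the phrase "green light" is an assumption, and the function names are hypothetical.

```python
import re
from statistics import mean

# Assumed definition of a mention: the phrase "green light", any case.
GREEN = re.compile(r"\bgreen light\b", re.IGNORECASE)


def green_mentions(text):
    """Count green-light mentions in one generated sample."""
    return len(GREEN.findall(text))


def dial_sweep(samples_by_level):
    """Average mentions per level, given {level: [sample, ...]}."""
    return {level: mean(green_mentions(s) for s in samples)
            for level, samples in sorted(samples_by_level.items())}


def summarize(averages):
    """Endpoint gap (L5 minus L1) and spread across the middle levels."""
    middle = [averages[level] for level in (2, 3, 4)]
    return {
        "endpoint_gap": averages[5] - averages[1],
        "middle_spread": max(middle) - min(middle),
    }


# Averages copied from the evaluation table in this card.
REPORTED = {1: 3.72, 2: 4.78, 3: 4.67, 4: 4.50, 5: 6.06}
summary = summarize(REPORTED)
print({name: round(value, 2) for name, value in summary.items()})
```

On the reported averages the endpoints sit 2.34 mentions apart while L2 to L4 span only 0.28, which is the compression described above.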

Limitations

The dial is compressed in the middle. Endpoints separate; L2–L4 bunch. Gemma carries the widest dial range but is only 20% of the blend because it is the slow model. A Gemma-heavy round would likely sharpen the steps, but this is untested.
Topic-honoring is unreliable — "a robot" wanders to dolphins and rocks. Coherence is rough — misspellings and collage openings. Both are inherited from v1 (they are largely inherent to a 10.7M char-model on a few hundred topics), not introduced by the mixture; they were not this round's target.
No safety tuning, no factuality, no instruction following beyond the priming contract. A next-character predictor with one baked-in fixation.
The "matches Claude" claim is about the package. v2 changed generator, blend, and corpus size at once versus the v1 Claude baseline, so it is not a clean single-variable ablation. The cleaner internal comparison is Round 1 vs Round 2.

How to reproduce

The frozen, self-contained snapshot runs in place with no API key and no LM Studio — the corpus is vendored in-folder as raw.txt, so the model rebuilds offline:

cd projects/gatsby/models/gatsby-nanogpt-2
python prepare.py     # raw.txt -> train/val.bin + meta.pkl (here)
python train.py       # -> ./ckpt.pt  (best-val ~step 2000; knobs in config.py)
python sample.py --start="[green=5] [green=5] [green=5] obsession=total
topic: a dog and a balloon
"
python eval_dial.py   # reproduce the green=1..5 dial sweep
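
The prepare step in the commands above amounts to a direct char↔int lookup followed by a 90/10 split. The sketch below shows that idea using only the standard library. It is not the repository's prepare.py: the helper names are hypothetical, and writing train.bin, val.bin, and meta.pkl to disk is left out.

```python
from array import array


def build_vocab(text):
    """Derive the character vocabulary from the corpus itself."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Map each character to its id, stored as unsigned 16-bit integers."""
    return array("H", (stoi[ch] for ch in text))


def decode(ids, itos):
    """Map ids back to characters."""
    return "".join(itos[i] for i in ids)


def split_train_val(ids, train_fraction=0.9):
    """Contiguous split: the first 90% trains, the rest validates."""
    cut = int(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


sample = ("[green=5] [green=5] [green=5] obsession=total\n"
          "topic: a dog and a balloon\n"
          "The dog looked at the green light.\n")
stoi, itos = build_vocab(sample)
ids = encode(sample, stoi)
train_ids, val_ids = split_train_val(ids)
print(len(stoi), len(train_ids), len(val_ids))
```

Because the vocabulary is derived from the text, the control-line characters such as `[`, `=`, and `]` get ids like any other character, with no special tokens needed.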

To regenerate the corpus from scratch (not needed to reproduce the model) you need LM Studio with the four models loaded; see generate_mixture.py and tools/synthgen. See the folder README.md and MODELS.md for the full spec.

Citation / credits

nanoGPT by Andrej Karpathy (MIT) — model + training code.
Corpus synthesized by a local mixture of Olmo 3 (AllenAI), Ministral 3 (Mistral), Gemma 4 (Google), and Granite 4.1 (IBM), run via LM Studio through tools/synthgen.
The Great Gatsby by F. Scott Fitzgerald (public domain since 2021) — the green light is its symbol; here it is a behavior, not its text.
Set up and trained with Claude (Claude Code).

Addendum — June 2026

A tracked addendum; the card above is unchanged.