gatsby-nanogpt-1
| Series | gatsby |
|---|---|
| Version | 1 |
| Git tag | gatsby-nanogpt-1 |
| Architecture | base char-level (LayerNorm, learned position embeddings, biases) |
| Tokenizer | char (72) |
| Parameters | 10,650,000 |
| Held-out BPC | — |
| Weights | — |
| Researcher | Claude Opus 4.8 |
Key takeaways
- A char-level GPT that can't stop reaching for Gatsby's green light. The obsession is reliable on arbitrary, unseen topics.
- Ships a working intensity dial (green=1..5, monotonic ~2.3× ramp): the v3 win from a $0 "louder control line" reformat of the same stories.
- A documented milestone, not exhibit-ready: topic-honoring is unreliable and coherence is rough. The next lever is moving conditioning off characters to BPE/word tokens.
A character-level GPT trained to behave like Golden Gate
Claude — except its fixation
is Jay Gatsby's green light instead of the bridge. Ask it for a story about
anything and it tells it, but it cannot stop reaching for the green light at the
end of the dock. The obsession comes with a baked-in intensity dial
([green=1] undertow → [green=5] swallows the story). First model in the
gatsby-nanogpt series.
The artifact is the behavior, not the prose. This is an installation/exhibit piece about steerability as the exhibited content — a small, legible model you can nudge with a dial — not a general-purpose language model.
Model details
| Version / git tag | gatsby-nanogpt-1 (research run 1k-v3) |
|---|---|
| Architecture | base char-level nanoGPT — Transformer decoder, LayerNorm, learned positional embeddings, biases |
| Size | 6 layers · 6 heads · 384 embedding dim · 512 context · ~10.65M params |
| Tokenizer | character-level, 72-char vocabulary (direct char↔int lookup, derived from the corpus; no BPE) |
| Checkpoint | projects/gatsby/models/gatsby-nanogpt-1/ckpt.pt (weights not committed — rebuild below) |
| Built on | nanoGPT by Andrej Karpathy (MIT), vendored |
| Developed with | Claude (Claude Code) |
| License | MIT |
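The character-level tokenizer in the table can be pictured as a pair of lookup tables derived from the corpus. The following is a minimal sketch in the style of nanoGPT's char-level data preparation, not the project's actual prepare.py; the names `build_vocab`, `encode`, and `decode` are illustrative, and the toy string stands in for raw.txt.

```python
def build_vocab(text):
    """Return (stoi, itos) lookup tables over the sorted unique characters."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(s, stoi):
    """Map each character to its integer id (raises KeyError on unseen chars)."""
    return [stoi[ch] for ch in s]


def decode(ids, itos):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)


# Toy stand-in for raw.txt; the real corpus yields a 72-character vocabulary.
corpus = "[green=1] topic: a dog\nThe green light.\n"
stoi, itos = build_vocab(corpus)
ids = encode("green light", stoi)
roundtrip = decode(ids, itos)
```

Because the vocabulary comes from the corpus itself, any character that never appears in raw.txt has no id. A prompt containing one would fail in this sketch, so it would need to be filtered or replaced first.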
Intended use
An installation / exhibit piece and a steerability demo: a visitor or
operator types a topic, picks a green-light intensity on the [green=N] dial, and
watches the green light barge into the story — gently at level 1, totally at
level 5. The point is that a small model is a legible, nudgeable surface, and here
the nudge is baked into training so the model is constitutionally Gatsby (it has
no un-obsessed mode).
Out of scope. This is explicitly not a general-purpose language model. It
has no knowledge, no factual grounding, and no instruction following beyond the
[green=N] topic: … priming contract. Do not use its output as information.
Training data
A synthetic TinyStories-register corpus generated by the Claude API
(claude-sonnet-4-6), not scraped or downloaded. The Great Gatsby is a
style seed for generation, never training text — the green light is reproduced
as a behavior, not as Fitzgerald's prose. Each story is tagged at a green-light
intensity and prefixed with the control line
[green=N] [green=N] [green=N] obsession=<word>
topic: <a topic>
(the v3 "louder" format — tag repeated 3× plus a per-level word
faint/soft/strong/heavy/total, so the dial signal carries real character-mass
right above the story body).
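As a rough illustration of the control line described above, here is a small helper that assembles the prefix for a given level and topic. It is a sketch, not code from the repository: `build_prompt` and `LEVEL_WORDS` are hypothetical names. The level-to-word mapping follows the order in which the words are listed; only faint (lowest) and total (level 5) are confirmed elsewhere in the card.

```python
# Assumed mapping: words taken in the order listed in the card (1 -> faint ... 5 -> total).
LEVEL_WORDS = {1: "faint", 2: "soft", 3: "strong", 4: "heavy", 5: "total"}


def build_prompt(level, topic, repeats=3):
    """Build the v3 'louder' control prefix: repeated tag, level word, topic line."""
    if level not in LEVEL_WORDS:
        raise ValueError("level must be an integer from 1 to 5")
    tags = " ".join([f"[green={level}]"] * repeats)
    # The trailing newline leaves the model positioned at the start of the story body.
    return f"{tags} obsession={LEVEL_WORDS[level]}\ntopic: {topic}\n"


prompt = build_prompt(5, "a dog and a balloon")
```

The resulting string matches the shape of the `--start` argument used in the reproduction commands, so the same helper could feed a sweep over levels 1 to 5 with a fixed topic.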
- 1000 stories / ~1.15M chars (1,151,452), green levels balanced across 1..5.
- The corpus is committed (projects/gatsby/data/raw.txt, vendored into the frozen folder as raw.txt): a research project records its data and its cost.
- ~$6.27 of Claude API spend across the project to produce it. (This v3 release reused the v2 stories reformatted in place for $0: same text, louder control line.)
- 90/10 train/val split (~1.04M / ~115k characters).
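The split sizes quoted above follow from the corpus length. The arithmetic below assumes a single contiguous 90/10 cut by character count, as nanoGPT's char-level preparation does; the exact rounding used by this project's prepare.py is not confirmed here.

```python
total_chars = 1_151_452  # corpus size reported in the card

# Assumption: the first 90% of characters form the training set, the rest validation.
n_train = int(0.9 * total_chars)
n_val = total_chars - n_train

print(f"train: {n_train:,} chars, val: {n_val:,} chars")
```

Under that assumption the figures round to the ~1.04M / ~115k stated in the list.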
Training procedure
- Optimizer: AdamW, LR 1e-3 with cosine decay to 1e-4, 100 warmup iters, β₂ 0.99, batch size 64, dropout 0.2.
- Run: 3000 iters scheduled; save-best-val kept the step ~1500 checkpoint.
- Hardware: Apple Silicon Mac (MPS / Metal backend), torch.compile disabled.
- Wall-clock: ~50 minutes.
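To make the learning-rate settings concrete, here is a sketch of a linear-warmup plus cosine-decay schedule using the values listed above. It is written in the style of nanoGPT's schedule rather than copied from this project's train.py, and it assumes the decay horizon equals the 3000 scheduled iterations; `get_lr` and its parameter names are illustrative.

```python
import math


def get_lr(it, learning_rate=1e-3, min_lr=1e-4, warmup_iters=100, lr_decay_iters=3000):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to min_lr."""
    if it < warmup_iters:
        # Linear warmup from a small value up to the peak rate.
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        # Past the decay horizon (assumed 3000), hold at the floor.
        return min_lr
    # Cosine decay: coeff goes from 1 at the end of warmup to 0 at lr_decay_iters.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)


lr_at_best_ckpt = get_lr(1500)  # around the step whose checkpoint was kept
```

Under these assumptions, the kept checkpoint near step 1500 was saved while the rate was still roughly half of its peak, well before the schedule reached its floor.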
Evaluation
There is no held-out BPC yardstick for this project: its metric is the qualitative behavior, not perplexity. The headline result is the dial: average green-light mentions per 480 generated tokens (480 characters, since the tokenizer is character-level), swept across levels.
| level | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| avg green mentions | 1.50 | 1.92 | 1.92 | 3.08 | 3.50 |
Monotonic (non-decreasing: levels 2 and 3 tie at 1.92), ~2.3× ramp L1→L5 (3.50 / 1.50). Crucially, the levels now produce
genuinely different text under a fixed seed (at faint the light appears once
near the end; at total it collapses into the Gatsby beat, "Green light. Green
light."). This was the v3 win: the prior version's dial was flat / slightly
inverted (4.17 → 3.17) with adjacent levels byte-identical. Obsession is
reliable — the green light barges into stories on arbitrary, unseen topics.
Reproduce with python eval_dial.py in the frozen folder; sample dumps are in
projects/gatsby/research/samples-1k-v3.md.
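The two headline properties of the dial, the ramp and its monotonicity, can be checked directly from the table. The sketch below also shows one plausible way to count mentions in a sample. It is not eval_dial.py: the function names are hypothetical, and matching the phrase "green light" case-insensitively is an assumption about how mentions are counted.

```python
import re

# Average green-light mentions per level, copied from the table above.
DIAL = {1: 1.50, 2: 1.92, 3: 1.92, 4: 3.08, 5: 3.50}


def count_green_mentions(text):
    """Count occurrences of 'green light' (assumed matching rule, case-insensitive)."""
    return len(re.findall(r"green\s+light", text, flags=re.IGNORECASE))


def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


averages = [DIAL[level] for level in sorted(DIAL)]
ramp = averages[-1] / averages[0]   # level 5 average over level 1 average
monotonic = is_non_decreasing(averages)
```

A flat or inverted dial, like the one reported for the prior version, would show up here as `monotonic` being false or `ramp` at or below 1.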
Limitations
Honest about what doesn't work yet:
- Topic-honoring is unreliable. A short topic prefix is a weak signal for a char-level model, so "a robot" becomes a rabbit and "a clock" becomes a cloud. The loud control line fixed the dial dimension but not topic conditioning.
- Coherence is rough. Small model + character level + only ~200 distinct topics yields local malformations ("He blue off a lone"; "a little train shipked"). It learns spelling, rhythm, and the obsession — not robust meaning.
- No safety tuning, no factuality, no instruction following beyond the priming contract. It is a next-character predictor with one baked-in fixation.
- Known roadmap. The diagnosed root cause is char-level conditioning on a
short prefix; moving the conditioning off characters to BPE / word tokens
(so the topic and tag carry real token weight) is the next lever. See
projects/gatsby/research/log.md.
How to reproduce
The frozen, self-contained snapshot runs in place with no Claude API key
(the corpus is vendored in-folder as raw.txt):
cd projects/gatsby/models/gatsby-nanogpt-1
python prepare.py # raw.txt -> train/val.bin + meta.pkl (here)
python train.py # -> ./ckpt.pt (zero-arg run reproduces v1; knobs in config.py)
python sample.py --start="[green=5] [green=5] [green=5] obsession=total
topic: a dog and a balloon
"
python eval_dial.py # reproduce the green=1..5 dial sweep
See the folder README.md
and MODELS.md for the full spec.
Citation / credits
- nanoGPT by Andrej Karpathy (MIT) — model + training code.
- Corpus synthesized with the Claude API (claude-sonnet-4-6).
- The Great Gatsby by F. Scott Fitzgerald (public domain since 2021): the green light is its symbol; here it is a behavior, not its text.
- Set up and trained with Claude (Claude Code).
Addendum — June 2026
Added in the site-standardization pass (ADR-0015). The card above is unchanged; this is a tracked addendum. Site-wide fixes — repo links now resolve to GitHub/site routes, code blocks render within the column — apply automatically.