gatsby-nanogpt-1

Series	gatsby
Version	1
Git tag	gatsby-nanogpt-1
Architecture	base char-level (LayerNorm, learned position embeddings, biases)
Tokenizer	char (72)
Parameters	10,650,000
Held-out BPC	—
Weights	—
Researcher	Claude Opus 4.8

Key takeaways

A char-level GPT that can't stop reaching for Gatsby's green light. The obsession is reliable on arbitrary, unseen topics.
Ships a working intensity dial (green=1..5, monotonic ~2.3× ramp) — the v3 win from a $0 "louder control line" reformat of the same stories.
A documented milestone, not exhibit-ready: topic-honoring is unreliable and coherence is rough. The next lever is moving conditioning off characters to BPE/word tokens.

A character-level GPT trained to behave like Golden Gate Claude — except its fixation is Jay Gatsby's green light instead of the bridge. Ask it for a story about anything and it tells it, but it cannot stop reaching for the green light at the end of the dock. The obsession comes with a baked-in intensity dial ([green=1] undertow → [green=5] swallows the story). First model in the gatsby-nanogpt series.

The artifact is the behavior, not the prose. This is an installation/exhibit piece whose exhibited content is steerability itself: a small, legible model you can nudge with a dial, not a general-purpose language model.

Model details


Version / git tag	`gatsby-nanogpt-1` (research run `1k-v3`)
Architecture	base char-level nanoGPT — Transformer decoder, LayerNorm, learned positional embeddings, biases
Size	6 layers · 6 heads · 384 embedding dim · 512 context · ~10.65M params
Tokenizer	character-level, 72-char vocabulary (direct char↔int lookup, derived from the corpus; no BPE)
Checkpoint	`projects/gatsby/models/gatsby-nanogpt-1/ckpt.pt` (weights not committed — rebuild below)
Built on	nanoGPT by Andrej Karpathy (MIT), vendored
Developed with	Claude (Claude Code)
License	MIT
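
The character-level lookup named in the Tokenizer row can be sketched as below. This is a minimal illustration in the style of nanoGPT's char-level data prep, not the project's actual prepare.py; the toy text and the names `build_vocab`, `encode`, and `decode` are assumptions made for the example.

```python
def build_vocab(text):
    """Derive a sorted character vocabulary and both lookup tables from a corpus."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> int
    itos = {i: ch for ch, i in stoi.items()}       # int -> char
    return stoi, itos

def encode(s, stoi):
    """Map each character to its integer id (raises KeyError on unseen characters)."""
    return [stoi[c] for c in s]

def decode(ids, itos):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)

# Toy stand-in for the corpus (hypothetical; the card reports 72 distinct characters for the real one).
toy_corpus = "[green=1] topic: a dog\nThe green light."
stoi, itos = build_vocab(toy_corpus)
ids = encode("green light", stoi)
roundtrip = decode(ids, itos)
```

Because the vocabulary is derived from the corpus, any character that never appears in training text has no id, which is why prompts should stay within the corpus alphabet.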

Intended use

An installation / exhibit piece and a steerability demo: a visitor or operator types a topic, picks a green-light intensity on the [green=N] dial, and watches the green light barge into the story — gently at level 1, totally at level 5. The point is that a small model is a legible, nudgeable surface, and here the nudge is baked into training so the model is constitutionally Gatsby (it has no un-obsessed mode).

Out of scope. This is explicitly not a general-purpose language model. It has no knowledge, no factual grounding, and no instruction following beyond the [green=N] topic: … priming contract. Do not use its output as information.

Training data

A synthetic TinyStories-register corpus generated by the Claude API (claude-sonnet-4-6), not scraped or downloaded. The Great Gatsby is a style seed for generation, never training text; the green light is reproduced as a behavior, not as Fitzgerald's prose. Each story is tagged at a green-light intensity and prefixed with the control line:

[green=N] [green=N] [green=N] obsession=<word>
topic: <a topic>

This is the v3 "louder" format: the tag is repeated 3× and followed by a per-level word (faint, soft, strong, heavy, total for levels 1 through 5), so the dial signal carries real character-mass right above the story body.
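
As a rough sketch, the control prefix could be assembled like this. The helper name `build_prompt` is hypothetical, and the level-to-word mapping assumes the five words are listed in level order (the card's sample command pairs green=5 with obsession=total, which is consistent with that).

```python
# Per-level words from the card, assumed to be in order for levels 1..5.
LEVEL_WORDS = {1: "faint", 2: "soft", 3: "strong", 4: "heavy", 5: "total"}

def build_prompt(level, topic, repeats=3):
    """Return the control line, the topic line, and a trailing newline so the story starts on its own line."""
    if level not in LEVEL_WORDS:
        raise ValueError(f"level must be 1..5, got {level!r}")
    tags = " ".join(f"[green={level}]" for _ in range(repeats))
    return f"{tags} obsession={LEVEL_WORDS[level]}\ntopic: {topic}\n"

prompt = build_prompt(5, "a dog and a balloon")
```

The resulting string has the same shape as the --start argument passed to sample.py in the reproduction steps.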

1000 stories / ~1.15M chars (1,151,452), green levels balanced across 1..5.
The corpus is committed (projects/gatsby/data/raw.txt, vendored into the frozen folder as raw.txt) — a research project records its data and its cost.
~$6.27 of Claude API spend across the project to produce it. (This v3 release reused the v2 stories reformatted in place for $0 — same text, louder control line.)
90/10 train/val split (~1.04M / ~115k characters).
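
The split sizes follow directly from the character count. The arithmetic below assumes a contiguous first-90% / last-10% cut, as in nanoGPT's char-level data prep; the card does not state how the cut is made.

```python
n_chars = 1_151_452            # corpus size stated above
n_stories = 1000               # story count stated above

n_train = int(0.9 * n_chars)   # assumed: first 90% of characters for training
n_val = n_chars - n_train      # remainder for validation
avg_chars_per_story = n_chars / n_stories  # includes the control and topic lines
```

This gives roughly 1.04M training characters and 115k validation characters, matching the rounded figures in the list.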

Training procedure

Optimizer: AdamW, LR 1e-3 with cosine decay to 1e-4, 100 warmup iters, β₂ 0.99, batch size 64, dropout 0.2.
Run: 3000 iters scheduled; the save-best-val rule kept the checkpoint from around step 1500.
Hardware: Apple Silicon Mac (MPS / Metal backend), torch.compile disabled.
Wall-clock: ~50 minutes.
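
The learning-rate schedule described above (linear warmup, then cosine decay to a floor) can be sketched in the style of nanoGPT's schedule function. The peak, floor, and warmup values come from the card; `LR_DECAY_ITERS = 3000` is an assumption that the decay spans the full scheduled run.

```python
import math

LEARNING_RATE = 1e-3    # peak LR (from the card)
MIN_LR = 1e-4           # cosine floor (from the card)
WARMUP_ITERS = 100      # from the card
LR_DECAY_ITERS = 3000   # assumption: decay covers all scheduled iterations

def get_lr(it):
    """Linear warmup to LEARNING_RATE, then cosine decay down to MIN_LR."""
    if it < WARMUP_ITERS:
        return LEARNING_RATE * (it + 1) / (WARMUP_ITERS + 1)
    if it > LR_DECAY_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (LR_DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)

lr_at_kept_step = get_lr(1500)  # LR near the step whose checkpoint was kept
```

Under these assumptions the kept checkpoint falls close to the midpoint of the decay, where the rate is a little over half its peak.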

Evaluation

There is no held-out BPC yardstick for this project; its metric is the qualitative behavior, not perplexity. The headline result is the dial: average green-light mentions per 480 generated tokens (480 characters, since the tokenizer is character-level), swept across levels.

level	1	2	3	4	5
avg green mentions	1.50	1.92	1.92	3.08	3.50

The ramp is non-decreasing (levels 2 and 3 tie at 1.92) and rises ~2.3× from L1 to L5 (1.50 → 3.50). Crucially, the levels now produce genuinely different text under a fixed seed: at faint the light appears once near the end; at total it collapses into the Gatsby beat, "Green light. Green light." This was the v3 win: the prior version's dial was flat to slightly inverted (4.17 → 3.17), with adjacent levels byte-identical. Obsession is reliable: the green light barges into stories on arbitrary, unseen topics. Reproduce with python eval_dial.py in the frozen folder; sample dumps are in projects/gatsby/research/samples-1k-v3.md.
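
A minimal sketch of the tally behind the table, for readers who want to sanity-check the ramp. It is not the project's eval_dial.py: the regex definition of a "mention" and the names `count_mentions` and `summarize` are assumptions, and the averages are simply the reported values from the table.

```python
import re

# Assumed definition of a mention: "green light" or "green-light", case-insensitive.
GREEN = re.compile(r"green[\s-]+light", re.IGNORECASE)

def count_mentions(text):
    """Count green-light mentions in one generated sample."""
    return len(GREEN.findall(text))

def summarize(avg_by_level):
    """Return (is the sweep non-decreasing?, ratio of the highest level to the lowest)."""
    levels = sorted(avg_by_level)
    values = [avg_by_level[lvl] for lvl in levels]
    non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
    ramp = values[-1] / values[0]
    return non_decreasing, ramp

# Reported averages from the table above.
reported = {1: 1.50, 2: 1.92, 3: 1.92, 4: 3.08, 5: 3.50}
non_decreasing, ramp = summarize(reported)
```

With the reported numbers the sweep never decreases and the L5/L1 ratio is 3.50 / 1.50, which rounds to the ~2.3× quoted in the card.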

Limitations

Honest about what doesn't work yet:

Topic-honoring is unreliable. A short topic prefix is a weak signal for a char-level model, so "a robot" becomes a rabbit and "a clock" becomes a cloud. The loud control line fixed the dial dimension but not topic conditioning.
Coherence is rough. Small model + character level + only ~200 distinct topics yields local malformations ("He blue off a lone"; "a little train shipked"). It learns spelling, rhythm, and the obsession — not robust meaning.
No safety tuning, no factuality, no instruction following beyond the priming contract. It is a next-character predictor with one baked-in fixation.
Known roadmap. The diagnosed root cause is char-level conditioning on a short prefix; moving the conditioning off characters to BPE / word tokens (so the topic and tag carry real token weight) is the next lever. See projects/gatsby/research/log.md.

How to reproduce

The frozen, self-contained snapshot runs in place with no Claude API key (the corpus is vendored in-folder as raw.txt):

cd projects/gatsby/models/gatsby-nanogpt-1
python prepare.py     # raw.txt -> train/val.bin + meta.pkl (here)
python train.py       # -> ./ckpt.pt  (zero-arg run reproduces gatsby-nanogpt-1; knobs in config.py)
python sample.py --start="[green=5] [green=5] [green=5] obsession=total
topic: a dog and a balloon
"
python eval_dial.py   # reproduce the green=1..5 dial sweep

See the folder README.md and MODELS.md for the full spec.

Citation / credits

nanoGPT by Andrej Karpathy (MIT) — model + training code.
Corpus synthesized with the Claude API (claude-sonnet-4-6).
The Great Gatsby by F. Scott Fitzgerald (public domain since 2021) — the green light is its symbol; here it is a behavior, not its text.
Set up and trained with Claude (Claude Code).

Addendum — June 2026

Added in the site-standardization pass (ADR-0015). The card above is unchanged; this is a tracked addendum. Site-wide fixes — repo links now resolve to GitHub/site routes, code blocks render within the column — apply automatically.