kenosha-kid-nanogpt-1
| Series | kenosha-kid |
|---|---|
| Version | 1 |
| Git tag | kenosha-kid-nanogpt-1 |
| Architecture | modern (RoPE, RMSNorm, bias-free) |
| Tokenizer | char (27) |
| Parameters | 790,000 |
| Held-out BPC | — |
| Weights | — |
| Researcher | Claude Opus 4.8 |
Key takeaways
- A 0.79M-param char-level model whose entire corpus is punctuated permutations of six words — "You never did the Kenosha Kid."
- The released checkpoint is deliberately mid-transition (val ~0.48 at 350 iters), not the lowest-loss one — verbatim convergence is the worse artifact. The dream is the deliverable.
- Dreaminess is a two-knob surface — training progress and sampling temperature — and reads best at
temperature 0.9.
A character-level GPT whose entire universe is six words — you never did the
kenosha kid, the telegram Tyrone Slothrop reconstrues under sodium amytal in
Pynchon's Gravity's Rainbow (I.10), and the seed of Darius Kazemi's
@YouNeverDidThe bot. It can only ever say those
six words; what it does is reorder, repunctuate, and recapitalize them. Sampled
warm it doesn't enumerate the phrase — it orbits it. First model in the
kenosha-kid series, and the first
char-level model to ride the shared core engine directly rather than vendoring
a base engine (see ADR-0012).
The blur is the artifact, not the prose. A bot (itertools.permutations) is flat, exact, and dead; a learned net approximates a distribution, and the approximation is always a little blurry. That blur (drifted punctuation, near-misses like "Kenoshar") is the exhibited content. Verbatim convergence is the failure mode, not the goal.
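For contrast, the flat enumeration the paragraph above calls "dead" fits in a few lines. This is a minimal sketch, not the project's generator: the lowercase word list and the helper name `enumerate_orderings` are assumptions made for illustration.

```python
from itertools import permutations

# The six words, lowercased for illustration.
WORDS = ["you", "never", "did", "the", "kenosha", "kid"]


def enumerate_orderings(words):
    """Return every ordering of the words as a space-joined string."""
    return [" ".join(p) for p in permutations(words)]


orderings = enumerate_orderings(WORDS)
print(len(orderings))  # 720, i.e. 6!
print(orderings[0])    # you never did the kenosha kid
```

Every one of the 720 orderings is exact and equally likely, which is the flatness the learned model is meant to blur.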
Model details
| Version / git tag | kenosha-kid-nanogpt-1 (research run r3-mid) |
|---|---|
| Architecture | modern char-level (RoPE, RMSNorm, bias-free) on the shared core engine — no vendored base engine (ADR-0012) |
| Size | 4 layers · 4 heads · 128 embedding dim · 128 context · dropout 0.2 · ~0.79M params |
| Tokenizer | character-level, 27-char vocabulary (the letters + punctuation that appear in the corpus; direct char↔int lookup via meta.pkl, no BPE) |
| Checkpoint | projects/kenosha-kid/models/kenosha-kid-nanogpt-1/ (weights not committed — regenerates deterministically, below) |
| Built on | the monorepo's shared core engine |
| Developed with | Claude (Claude Code) |
| License | MIT |
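The ~0.79M figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a conventional bias-free block (four attention projections, a non-gated MLP with 4x expansion, two RMSNorm gain vectors per block), tied input/output embeddings, and no learned positional parameters since RoPE has none. The core engine's actual layout is not shown in this card and may differ; `estimate_params` is a hypothetical helper.

```python
def estimate_params(n_layer, n_embd, vocab_size, mlp_ratio=4, tied_embeddings=True):
    """Rough parameter count for a bias-free, RoPE + RMSNorm transformer."""
    attn = 4 * n_embd * n_embd             # q, k, v and output projections
    mlp = 2 * mlp_ratio * n_embd * n_embd  # up and down projections (assumed non-gated)
    norms = 2 * n_embd                     # two RMSNorm gain vectors per block
    per_block = attn + mlp + norms
    embed = vocab_size * n_embd            # token embedding table
    head = 0 if tied_embeddings else vocab_size * n_embd
    final_norm = n_embd                    # RMSNorm before the output head
    return n_layer * per_block + embed + head + final_norm


print(estimate_params(n_layer=4, n_embd=128, vocab_size=27))  # 791040
```

Under these assumptions the estimate lands at 791,040, consistent with the stated ~0.79M; untying the embeddings adds only 3,456 parameters.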
Intended use
An exhibit / curio, not a capable language model. It is the studio's tightest sampler-aesthetic loop: a tiny corpus, minutes to train, so the effect of temperature and training length is visible immediately. Sampled at temperature ~0.9 (the default "dream" setting) and given only a newline, it orbits the phrase — the Pynchon anchors surface, the tail drifts through punctuated permutations, and the occasional character near-miss leaks in.
Out of scope. This is explicitly not a general-purpose language model. It has no knowledge, no semantics, no instruction following, and no vocabulary beyond the six words. The near-misses are the feature; do not read its output as information.
Training data
A synthetic, in-repo corpus generated by generate.py — a deterministic
reimplementation of Kazemi's bot (we own the generator rather than scraping it,
so the corpus is frozen and inspectable, and — the real reason — so we can
weight it). Pynchon's nine construals are folded in as ~18%
high-frequency anchors over the brute-force permutation tail, giving the model
a preference manifold (crisp anchors, dim tail) rather than a flat enumeration.
- 24,000 lines / ~797K chars of punctuated, recapitalized arrangements of the six words, seeded deterministically (SEED=1973).
- Gravity's Rainbow is the anchor source, never training text: the novel is copyrighted; we train on permutations of a six-word phrase plus original construals, never Pynchon's prose (same posture as projects/gatsby/).
- The corpus is committed (projects/kenosha-kid/data/raw.txt, vendored into the frozen folder) because a research project records its data. Only derived artifacts (*.bin, *.pkl, *.pt) are gitignored.
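The weighting idea described above (a small set of anchors drawn about 18% of the time over a permutation tail, with a fixed seed) might look roughly like the following. This is a sketch under stated assumptions, not the contents of generate.py: the two `ANCHORS` strings are stand-ins borrowed from this card, and the capitalization and `ENDINGS` rules are invented for illustration.

```python
import random
from itertools import permutations

WORDS = ["you", "never", "did", "the", "kenosha", "kid"]
# Hypothetical stand-ins; the project's real anchors are its own construals.
ANCHORS = ["You never did the Kenosha Kid.", "You! Never did the Kenosha Kid!"]
ANCHOR_WEIGHT = 0.18               # share of lines drawn from the anchors
ENDINGS = [".", "!", "?", "..."]   # assumed punctuation choices for the tail


def make_corpus(n_lines, seed=1973):
    """Draw n_lines, mixing weighted anchors with random permutations."""
    rng = random.Random(seed)      # private RNG so the output is reproducible
    tail = list(permutations(WORDS))
    lines = []
    for _ in range(n_lines):
        if rng.random() < ANCHOR_WEIGHT:
            lines.append(rng.choice(ANCHORS))
        else:
            words = list(rng.choice(tail))
            words[0] = words[0].capitalize()
            lines.append(" ".join(words) + rng.choice(ENDINGS))
    return lines


corpus = make_corpus(24_000)
anchor_share = sum(line in ANCHORS for line in corpus) / len(corpus)
print(len(corpus), round(anchor_share, 3))
```

Owning the generator is what makes the anchor share a tunable number rather than a property of scraped data.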
Training procedure
- Optimizer: AdamW, LR 1e-3 with cosine decay to 1e-4, 30 warmup iters, β₂ 0.99, batch size 64, dropout 0.2.
- Run: 350 iterations, best val loss 0.48 (always_save_checkpoint, stopped mid-transition on purpose; see the finding below).
- Hardware: Apple Silicon Mac (MPS / Metal backend), torch.compile disabled.
- Wall-clock: ~2 minutes (the corpus is small enough that the loss craters 3.37→0.66 in ~100 steps and the whole dreaminess spectrum fits in one short run).
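The learning-rate schedule listed above can be written out as a small function. This sketch follows the common nanoGPT-style pattern of linear warmup followed by cosine decay; the exact warmup formula and the choice to end the decay at iteration 350 are assumptions, since the card only states the peak, the floor, and the warmup length.

```python
import math


def get_lr(it, max_lr=1e-3, min_lr=1e-4, warmup_iters=30, decay_iters=350):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup_iters:
        # Linear ramp; the +1 offsets avoid a zero learning rate at step 0.
        return max_lr * (it + 1) / (warmup_iters + 1)
    if it > decay_iters:
        return min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 30, 190, 350):
    print(step, get_lr(step))
```

With these settings the rate peaks at step 30 and reaches the floor exactly at step 350, so the champion checkpoint is taken at the end of the decay.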
Evaluation
There is no held-out perplexity yardstick that matters here; the metric is the
qualitative dream. Given only a newline at temperature ~0.9, the champion
returns these (raw, not cherry-picked, from
projects/kenosha-kid/research/samples.md):
You! Never! Did the, Kenosha Kid
You the never did Kenosha Kid
Kenosha, kid never 'did' -- the...
'You' the did never Kenosha, Kid
You never, did the Kenosha kid!
You! Never did the Kenosha Kid!
You never did the Kenosha kid?
The never did you. Kenosha Kid?
Anchors surface, the tail orbits, and the occasional character near-miss leaks in ("kenoshayou", doubled "the the"). It is consistent across seeds, not cherry-picked.
The key finding — dreaminess has two knobs
The champion is deliberately not the lowest-loss checkpoint. With almost no procedural competence to learn except the six words, the only thing the model can vary is how it says them, and that variation is governed by two knobs:
- Training progress — the memorization phase transition.
- Sampling temperature.
Sweeping training length on the same corpus and model makes the transition visible:
| run | iters | val loss | character of the dream |
|---|---|---|---|
| r2-early | 150 | 0.59 | too broken: words half-form and break at the character level ("Kenoshau", "thethe"); anchors can't reliably surface |
| r3-mid (champion) | 350 | 0.48 | the balance: anchors surface, the tail orbits, occasional near-miss without garble |
| r1 | 2000 | 0.43 | too clean: spellings lock; drift retreats to order/punctuation only, the dream flattens |
Verbatim convergence (r1) has the lower loss and is the worse artifact, so
we stop mid-transition on purpose. The dream is the deliverable.
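The second knob, sampling temperature, works by dividing the logits before the softmax. The sketch below uses plain Python and made-up logits for three candidate characters; it illustrates the general technique and is not the project's sample.py.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [3.0, 1.0, 0.5]  # invented scores, not model output
for t in (0.3, 0.9, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

rng = random.Random(0)
print([sample_index(logits, 0.9, rng) for _ in range(10)])
```

Low temperatures concentrate mass on the top choice, which is why cold sampling collapses toward the anchors; higher temperatures give the less likely characters room to surface.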
Limitations
Honest about what it is:
- It says nothing but the six words. No semantics, no factual grounding, no instruction following — it is a next-character predictor over one phrase.
- "Near-misses" are the feature, not a bug. The corpus never misspells, so a converged model can't either; the drift you want only exists mid-transition.
- It is fragile by design. Capacity (n_layer/n_embd) and max_iters are aesthetic controls here, not just performance knobs: smaller/shorter blurs more. Cold sampling collapses toward the anchors; the dream lives at temperature.
- No weights in the tree (ADR-0002). The released folder ships code + corpus only; the checkpoint regenerates deterministically from config.py.
How to reproduce
The frozen, self-contained snapshot rebuilds the checkpoint deterministically (the corpus is vendored in-folder, no network needed):
cd projects/kenosha-kid/models/kenosha-kid-nanogpt-1
python generate.py # -> data/raw.txt (deterministic, SEED=1973)
python prepare.py # raw.txt -> kenosha/{train,val}.bin + meta.pkl
python train.py # -> ./ckpt.pt (350 iters, val ~0.48)
python sample.py --temperature=0.9
The working pipeline at the repo root runs the same steps through core; see the
project README.md and the experiment
write-up dream-a-single-phrase.md.
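The char↔int lookup behind the prepare step is simple enough to sketch. The version below builds the vocabulary from a single sample line, so it will not reproduce the real 27-character table; the helper names and the use of a sorted character set are assumptions, and the actual prepare.py and meta.pkl layout are not shown in this card.

```python
def build_vocab(text):
    """Map each distinct character to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(s, stoi):
    """Convert a string into a list of integer ids."""
    return [stoi[c] for c in s]


def decode(ids, itos):
    """Convert a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)


sample = "You never did the Kenosha Kid.\n"
stoi, itos = build_vocab(sample)
ids = encode(sample, stoi)
print(len(stoi), ids[:8])
print(repr(decode(ids, itos)))
```

Because the mapping is a direct lookup with no merges, encoding and decoding round-trip exactly, and any character outside the corpus alphabet raises a KeyError rather than being silently mapped.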
Citation / credits
- The shared core engine (modern nanoGPT lineage: RoPE, RMSNorm, bias-free).
- Darius Kazemi, @YouNeverDidThe (2013): the bot generate.py reimplements deterministically.
- Thomas Pynchon, Gravity's Rainbow (1973), I.10: the nine construals are the anchors; the phrase is reproduced as a behavior, not its text. Provenance in projects/kenosha-kid/docs/sources.md.
- Set up and trained with Claude (Claude Code).
Addendum — June 2026
Added in the site-standardization pass (ADR-0015). The card above is unchanged; this is a tracked addendum. Site-wide fixes — repo links now resolve to GitHub/site routes, code blocks render within the column — apply automatically.