kenosha-kid-nanogpt-1

Series	kenosha-kid
Version	1
Git tag	kenosha-kid-nanogpt-1
Architecture	modern (RoPE, RMSNorm, bias-free)
Tokenizer	char (27)
Parameters	790,000
Held-out BPC	—
Weights	—
Researcher	Claude Opus 4.8

Key takeaways

A 0.79M-param char-level model whose entire corpus is punctuated permutations of six words — "You never did the Kenosha Kid."
The released checkpoint is deliberately mid-transition (val ~0.48 at 350 iters), not the lowest-loss one — verbatim convergence is the worse artifact. The dream is the deliverable.
Dreaminess is a two-knob surface — training progress and sampling temperature — and reads best at temperature 0.9.

A character-level GPT whose entire universe is six words — you never did the kenosha kid, the telegram Tyrone Slothrop reconstrues under sodium amytal in Pynchon's Gravity's Rainbow (I.10), and the seed of Darius Kazemi's @YouNeverDidThe bot. It can only ever say those six words; what it does is reorder, repunctuate, and recapitalize them. Sampled warm it doesn't enumerate the phrase — it orbits it. First model in the kenosha-kid series, and the first char-level model to ride the shared core engine directly rather than vendoring a base engine (see ADR-0012).

The blur is the artifact, not the prose. A bot (itertools.permutations) is flat, exact, and dead; a learned net approximates a distribution, and the approximation is always a little blurry. That blur — drifted punctuation, near-misses like "Kenoshar" — is the exhibited content. Verbatim convergence is the failure mode, not the goal.

Model details


Version / git tag	`kenosha-kid-nanogpt-1` (research run `r3-mid`)
Architecture	modern char-level (RoPE, RMSNorm, bias-free) on the shared `core` engine — no vendored base engine (ADR-0012)
Size	4 layers · 4 heads · 128 embedding dim · 128 context · dropout 0.2 · ~0.79M params
Tokenizer	character-level, 27-char vocabulary (the letters + punctuation that appear in the corpus; direct char↔int lookup via `meta.pkl`, no BPE)
Checkpoint	`projects/kenosha-kid/models/kenosha-kid-nanogpt-1/` (weights not committed — regenerates deterministically, below)
Built on	the monorepo's shared `core` engine
Developed with	Claude (Claude Code)
License	MIT
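
As a sanity check on the ~0.79M figure, the parameter count can be estimated from the sizes listed above. This is a back-of-envelope sketch, not the engine's own accounting. It assumes a two-matrix MLP with 4x expansion, an output head tied to the token embedding, and one learned gain vector per RMSNorm; none of these is stated in this card. RoPE and the bias-free layers add no parameters of their own.

```python
def estimate_params(n_layer=4, n_embd=128, vocab_size=27,
                    mlp_ratio=4, tied_head=True):
    """Rough parameter count for a bias-free, RoPE, RMSNorm transformer."""
    attn = 4 * n_embd * n_embd             # q, k, v and output projections
    mlp = 2 * mlp_ratio * n_embd * n_embd  # up- and down-projection (assumed 4x)
    norms = 2 * n_embd                     # two RMSNorm gains per block (assumed)
    embed = vocab_size * n_embd            # token embedding table
    head = 0 if tied_head else vocab_size * n_embd
    final_norm = n_embd                    # one gain vector before the head
    return n_layer * (attn + mlp + norms) + embed + head + final_norm


print(estimate_params())  # 791040
```

Under those assumptions the estimate is 791,040, within about 0.2% of the 790,000 reported in the header. The agreement is suggestive, not proof that the engine is laid out this way.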

Intended use

An exhibit / curio, not a capable language model. It is the studio's tightest sampler-aesthetic loop: a tiny corpus, minutes to train, so the effect of temperature and training length is visible immediately. Sampled at temperature ~0.9 (the default "dream" setting) and given only a newline, it orbits the phrase — the Pynchon anchors surface, the tail drifts through punctuated permutations, and the occasional character near-miss leaks in.

Out of scope. This is explicitly not a general-purpose language model. It has no knowledge, no semantics, no instruction following, and no vocabulary beyond the six words. The near-misses are the feature; do not read its output as information.

Training data

A synthetic, in-repo corpus generated by generate.py, a deterministic reimplementation of Kazemi's bot. We own the generator rather than scraping the bot's output, so the corpus is frozen and inspectable and, more importantly, so we can weight it. Pynchon's nine construals are folded in as high-frequency anchors (~18% of the corpus) over the brute-force permutation tail, giving the model a preference manifold (crisp anchors, dim tail) rather than a flat enumeration.

24,000 lines / ~797K chars of punctuated, recapitalized arrangements of the six words, seeded deterministically (SEED=1973).
Gravity's Rainbow is the anchor source, never training text — the novel is copyrighted; we train on permutations of a six-word phrase plus original construals, never Pynchon's prose (same posture as projects/gatsby/).
The corpus is committed (projects/kenosha-kid/data/raw.txt, vendored into the frozen folder) — a research project records its data. Only derived artifacts (*.bin, *.pkl, *.pt) are gitignored.
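
The weighting idea described above can be sketched as a seeded mixture of anchors and a permutation tail. This is an illustration, not the contents of generate.py: `ANCHORS` is a one-line placeholder (the nine construals are not reproduced here), `ENDINGS` and the punctuation probabilities are hypothetical, and treating ~18% as a per-line probability is an assumption.

```python
import itertools
import random

WORDS = ["you", "never", "did", "the", "kenosha", "kid"]
# Placeholder: the real anchors are the nine construals, not reproduced here.
ANCHORS = ["You never did the Kenosha Kid."]
ANCHOR_RATE = 0.18                # assumed to be a per-line probability
SEED = 1973
ENDINGS = [".", "!", "?", "..."]  # hypothetical punctuation set


def decorate(words, rng):
    """Randomly recapitalize and punctuate one ordering of the six words."""
    out = []
    for i, w in enumerate(words):
        if i == 0 or (w in ("kenosha", "kid") and rng.random() < 0.8):
            w = w.capitalize()
        if i < len(words) - 1 and rng.random() < 0.15:
            w += ","
        out.append(w)
    return " ".join(out) + rng.choice(ENDINGS)


def build_corpus(n_lines, seed=SEED):
    """Mix high-frequency anchors with a uniform tail of permutations."""
    rng = random.Random(seed)                   # private RNG, so runs repeat
    tail = list(itertools.permutations(WORDS))  # all 720 orderings
    lines = []
    for _ in range(n_lines):
        if rng.random() < ANCHOR_RATE:
            lines.append(rng.choice(ANCHORS))
        else:
            lines.append(decorate(rng.choice(tail), rng))
    return lines
```

Because every random draw goes through one `random.Random(seed)` instance, calling `build_corpus` twice with the same arguments returns identical lines, which is the property that lets a corpus like this be frozen and inspected.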

Training procedure

Optimizer: AdamW, LR 1e-3 with cosine decay to 1e-4, 30 warmup iters, β₂ 0.99, batch size 64, dropout 0.2.
Run: 350 iterations, best val loss 0.48 (always_save_checkpoint, stopped mid-transition on purpose — see the finding below).
Hardware: Apple Silicon Mac (MPS / Metal backend), torch.compile disabled.
Wall-clock: ~2 minutes (the corpus is small enough that the loss craters 3.37→0.66 in ~100 steps and the whole dreaminess spectrum fits in one short run).
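
The learning-rate schedule listed above (linear warmup, then cosine decay from 1e-3 to 1e-4) can be sketched as follows. The shape follows the one commonly used in nanoGPT-style training scripts; whether the core engine computes it exactly this way, and whether the decay horizon equals the 350-iteration run length, are assumptions.

```python
import math


def get_lr(it, max_lr=1e-3, min_lr=1e-4, warmup_iters=30, decay_iters=350):
    """Learning rate at iteration `it`: linear warmup, then cosine decay."""
    if it < warmup_iters:
        # Linear ramp that reaches max_lr as warmup ends.
        return max_lr * (it + 1) / (warmup_iters + 1)
    if it > decay_iters:
        return min_lr
    # Fraction of the decay phase completed, from 0 to 1.
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 30, 190, 350):
    print(step, get_lr(step))
```

With a run this short, warmup takes up roughly 9% of training, so most of the loss drop happens while the rate is still near its peak.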

Evaluation

There is no held-out perplexity yardstick that matters here; the metric is the qualitative dream. Given only a newline at temperature ~0.9, the champion returns these (raw, not cherry-picked, from projects/kenosha-kid/research/samples.md):

You! Never! Did the, Kenosha Kid
You the never did Kenosha Kid
Kenosha, kid never 'did' -- the...
'You' the did never Kenosha, Kid
You never, did the Kenosha kid!
You! Never did the Kenosha Kid!
You never did the Kenosha kid?
The never did you. Kenosha Kid?

Anchors surface, the tail orbits, and the occasional character near-miss leaks in ("kenoshayou", doubled "the the"). This behavior is consistent across seeds, not cherry-picked.
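
One way to make "near-miss" concrete is to check each sampled line against the six-word vocabulary. The helper names below are hypothetical and are not part of the project; this is a minimal sketch of how such a check could look, assuming a near-miss means any alphabetic token outside the six words.

```python
import re

WORDS = {"you", "never", "did", "the", "kenosha", "kid"}


def near_misses(line):
    """Return alphabetic tokens that are not one of the six words."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if t not in WORDS]


def is_permutation(line):
    """True if the line uses each of the six words exactly once."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return sorted(tokens) == sorted(WORDS)


print(near_misses("You! Never! Did the, Kenosha Kid"))    # []
print(near_misses("You never did the kenoshayou kid"))    # ['kenoshayou']
print(is_permutation("You never did the the Kenosha Kid"))  # False
```

A doubled "the the" passes the first check, since every token is a valid word, and fails the second, so the two functions separate character-level drift from word-level drift.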

The key finding — dreaminess has two knobs

The champion is deliberately not the lowest-loss checkpoint. With almost nothing to learn beyond the six words, the only thing the model can vary is how it says them, and that variation is governed by two knobs:

Training progress — the memorization phase transition.
Sampling temperature.

Sweeping training length on the same corpus and model makes the transition visible:

run	iters	val loss	character of the dream
`r2-early`	150	0.59	too broken — words half-form and break at the character level ("Kenoshau", "thethe"); anchors can't reliably surface
`r3-mid` (champion)	350	0.48	the balance — anchors surface, the tail orbits, occasional near-miss without garble
`r1`	2000	0.43	too clean — spellings lock; drift retreats to order/punctuation only, the dream flattens

Verbatim convergence (r1) has the lower loss and is the worse artifact, so we stop mid-transition on purpose. The dream is the deliverable.
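
The second knob, sampling temperature, divides the model's logits before the softmax. The sketch below uses only the standard library and made-up logits; it illustrates the general technique and is not taken from sample.py.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]                     # illustrative values only
for t in (0.5, 0.9, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At low temperature almost all the mass sits on the top logit, which matches cold sampling collapsing toward the anchors. Raising the temperature moves mass onto less likely characters, which is where drift and near-misses come from.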

Limitations

Honest about what it is:

It says nothing but the six words. No semantics, no factual grounding, no instruction following — it is a next-character predictor over one phrase.
"Near-misses" are the feature, not a bug. The corpus never misspells, so a converged model can't either; the drift you want only exists mid-transition.
It is fragile by design. Capacity (n_layer/n_embd) and max_iters are aesthetic controls here, not just performance knobs: smaller or shorter blurs more. Cold sampling collapses toward the anchors; the dream lives at warm temperatures (~0.9).
No weights in the tree (ADR-0002). The released folder ships code + corpus only; the checkpoint regenerates deterministically from config.py.

How to reproduce

The frozen, self-contained snapshot rebuilds the checkpoint deterministically (the corpus is vendored in-folder, no network needed):

cd projects/kenosha-kid/models/kenosha-kid-nanogpt-1
python generate.py    # -> data/raw.txt   (deterministic, SEED=1973)
python prepare.py     # raw.txt -> kenosha/{train,val}.bin + meta.pkl
python train.py       # -> ./ckpt.pt      (350 iters, val ~0.48)
python sample.py --temperature=0.9

The working pipeline at the repo root runs the same steps through core; see the project README.md and the experiment write-up dream-a-single-phrase.md.
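
For readers unfamiliar with character-level preparation, the prepare.py step in the commands above amounts to a direct char-to-int lookup plus a train/val split. The sketch below works in memory on the eight sample lines from the Evaluation section rather than on data/raw.txt. The 90/10 split and the `meta` dictionary layout (`vocab_size`, `stoi`, `itos`) are assumptions modeled on common nanoGPT-style char pipelines, not confirmed details of this project.

```python
SAMPLES = [
    "You! Never! Did the, Kenosha Kid",
    "You the never did Kenosha Kid",
    "Kenosha, kid never 'did' -- the...",
    "'You' the did never Kenosha, Kid",
    "You never, did the Kenosha kid!",
    "You! Never did the Kenosha Kid!",
    "You never did the Kenosha kid?",
    "The never did you. Kenosha Kid?",
]
text = "\n".join(SAMPLES) + "\n"

# Vocabulary is simply every distinct character, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of integer ids, one per character."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map integer ids back to the original string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)
split = int(0.9 * len(ids))  # assumed 90/10 train/val split
train_ids, val_ids = ids[:split], ids[split:]
# Assumed layout of what would be pickled as meta.pkl.
meta = {"vocab_size": len(chars), "stoi": stoi, "itos": itos}
print(meta["vocab_size"], len(train_ids), len(val_ids))
```

On these eight lines the sketch finds 27 distinct characters, the same size the card reports. That match is worth noting but is not a verification; the real vocabulary is whatever prepare.py derives from the full corpus.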

Citation / credits

The shared core engine (modern nanoGPT lineage — RoPE, RMSNorm, bias-free).
Darius Kazemi, @YouNeverDidThe (2013) — the bot generate.py reimplements deterministically.
Thomas Pynchon, Gravity's Rainbow (1973), I.10. The nine construals are the anchors; the model reproduces the phrase's behavior, not the novel's text. Provenance in projects/kenosha-kid/docs/sources.md.
Set up and trained with Claude (Claude Code).

Addendum — June 2026

Added in the site-standardization pass (ADR-0015). The card above is unchanged; this is a tracked addendum. Site-wide fixes — repo links now resolve to GitHub/site routes, code blocks render within the column — apply automatically.