sup computer

a small language model studio


kenosha-kid-nanogpt-2

Series: kenosha-kid
Version: 2
Git tag: kenosha-kid-nanogpt-2
Architecture: modern (RoPE, RMSNorm, bias-free)
Tokenizer: char (39)
Parameters: 792,576
Held-out BPC:
Weights: sup-computer/kenosha-kid-nanogpt-2 (Hugging Face)
Researcher: Claude Fable 5

Key takeaways

  • A 0.79M-param char-level model on the same six words as v1 — "You never did the Kenosha Kid." — but trained on a self-drifting corpus: the permutation tail carries a controlled per-letter misspelling channel (DRIFT_RATE=0.06) while Pynchon's nine anchors stay pristine.
  • This decouples the two dream qualities. Fully converged (val ~0.65, 1100 iters) the model still reproduces all 9 anchors verbatim (9/9) and carries a near-miss in ~33% of lines — the crisp-anchors-AND-near-misses combination v1 structurally could not reach.
  • DRIFT_RATE is the new dial: heavier drift buys more near-misses at the cost of a little garble and an anchor. v1 got near-misses only by undertraining, which also blurred the anchors.

A character-level GPT whose entire universe is six words, "you never did the kenosha kid": the telegram Tyrone Slothrop reconstrues under sodium amytal in Pynchon's Gravity's Rainbow (I.10), and the seed of Darius Kazemi's @YouNeverDidThe bot. Like v1, it orbits the phrase rather than enumerating it, but v2 answers the open question v1's report left behind: can a converged model dream? v1 could not. Its corpus never misspelled, so a low-loss model spelled the six words perfectly and the near-misses vanished; the only way to get them was to stop training early, which coupled the near-misses to blurred anchors. v2 moves the drift into the data and breaks that coupling.

A self-drifting corpus. generate.py bakes a per-letter misspelling channel — adjacent swap, doubling, drop, substitution — into the permutation tail only, at DRIFT_RATE=0.06. The nine Pynchon anchors are never drifted. Now the near-misses ("nevver", "Kenoshar", "yyou") live in the corpus, so a fully converged model reproduces them AND keeps the anchors crisp. The blur is still the artifact; v2 just stops paying for it with the anchors.
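The drift channel described above can be sketched as follows. This is a minimal illustration, not the actual generate.py: the function names (drift_word, drift_line, drift_corpus), the uniform choice among the four operations, and the fallback from a swap to a substitution on a word's last letter are all assumptions. Only DRIFT_RATE=0.06, the four operation types, and the rule that anchors are never drifted come from the card.

```python
import random
import re

DRIFT_RATE = 0.06  # per-letter misspelling probability quoted in the card
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
OPS = ("swap", "double", "drop", "substitute")  # assumed to be equally likely


def drift_word(word, rng, rate=DRIFT_RATE):
    """Pass one word through the misspelling channel, letter by letter."""
    out = []
    i = 0
    while i < len(word):
        c = word[i]
        if rng.random() >= rate:
            out.append(c)  # most letters pass through untouched
            i += 1
            continue
        op = rng.choice(OPS)
        if op == "swap" and i + 1 < len(word):
            out.extend([word[i + 1], c])  # adjacent swap consumes two letters
            i += 2
            continue
        if op == "double":
            out.extend([c, c])
        elif op == "drop":
            pass  # emit nothing for this letter
        else:
            # substitution (also the assumed fallback for a swap on the last letter)
            sub = rng.choice(ALPHABET)
            out.append(sub.upper() if c.isupper() else sub)
        i += 1
    return "".join(out)


def drift_line(line, rng, rate=DRIFT_RATE):
    """Drift only alphabetic runs, so spaces and punctuation survive intact."""
    return re.sub(r"[A-Za-z]+", lambda m: drift_word(m.group(), rng, rate), line)


def drift_corpus(lines, anchors, rng, rate=DRIFT_RATE):
    """Drift the permutation tail; anchor lines are returned verbatim."""
    return [ln if ln in anchors else drift_line(ln, rng, rate) for ln in lines]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed keeps the sketch deterministic
    anchors = {"You never did the Kenosha Kid"}  # placeholder: one anchor quoted in the card
    tail = ["Did you never the Kenosha Kid?"] * 5
    for line in drift_corpus(sorted(anchors) + tail, anchors, rng):
        print(line)
```

Seeding the random generator is what would keep such a corpus reproducible; raising the rate argument is the sketch's counterpart of turning the DRIFT_RATE dial.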

Model details

Version / git tag: kenosha-kid-nanogpt-2 (research run drift-r1)
Architecture: modern char-level (RoPE, RMSNorm, bias-free) on the shared core engine; no vendored base engine (ADR-0012)
Size: 4 layers · 4 heads · 128 embedding dim · 128 context · dropout 0.2 · ~0.79M params
Tokenizer: character-level, 39-char vocabulary (vs v1's 27, because the drift channel's substitutions introduce the full lowercase alphabet; direct char↔int lookup via meta.pkl, no BPE)
Checkpoint: projects/kenosha-kid/models/kenosha-kid-nanogpt-2/ (weights not committed; regenerates deterministically, below)
Built on: the monorepo's shared core engine
Developed with: Claude (Claude Code)
License: MIT
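The direct char↔int lookup mentioned in the tokenizer entry can be illustrated with a short sketch. The dictionary layout (vocab_size, stoi, itos) follows the common nanoGPT character-level convention and is an assumption here; the card does not document what this project's meta.pkl actually contains, and build_meta, encode, and decode are hypothetical helper names.

```python
import pickle


def build_meta(text):
    """Build the lookup tables from every distinct character in the corpus."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
    itos = {i: ch for i, ch in enumerate(chars)}  # int -> char
    return {"vocab_size": len(chars), "stoi": stoi, "itos": itos}


def encode(text, meta):
    """Map each character to its integer id (raises KeyError on unseen chars)."""
    return [meta["stoi"][ch] for ch in text]


def decode(ids, meta):
    """Map integer ids back to text."""
    return "".join(meta["itos"][i] for i in ids)


if __name__ == "__main__":
    corpus = "You never did the Kenosha Kid.\nDid yyou nevver?\n"
    # pickle round trip stands in for writing and re-reading meta.pkl
    meta = pickle.loads(pickle.dumps(build_meta(corpus)))
    ids = encode("Kid\n", meta)
    print(meta["vocab_size"], ids, repr(decode(ids, meta)))
```

Under this scheme the vocabulary is simply whatever characters appear in the corpus, which is consistent with the card's note that drift substitutions enlarge it relative to v1.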

Intended use

An exhibit / curio, not a capable language model — and specifically a demonstration that the aesthetic objective here is inverted: dreaminess is the point, not low loss. v2's whole reason to exist is that a converged, low-loss model can still dream, because the dream was moved into the corpus. Sampled at temperature ~0.9 (the default "dream" setting) and given only a newline, it orbits the phrase — all nine anchors surface verbatim, the tail drifts through punctuated permutations, and near-misses leak in on roughly a third of lines.
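For readers unfamiliar with the temperature setting, the sketch below shows what sampling at 0.9 does to a next-character distribution. It is a generic illustration in plain Python, not the project's sample.py; the function names and the toy logits are assumptions.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three characters
    for t in (0.5, 0.9, 2.0):
        probs = softmax_with_temperature(toy_logits, t)
        print(t, [round(p, 3) for p in probs], sample_index(toy_logits, t, rng))
```

A temperature slightly below 1 keeps the most likely characters dominant while still leaving room for the less likely continuations, which is where the tail permutations and near-misses come from.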

DRIFT_RATE is exposed as a dial for the effect: regenerate the corpus at a higher rate and retrain to trade legibility for more drift (see Evaluation).

Out of scope. This is explicitly not a general-purpose language model. It has no knowledge, no semantics, no instruction following, and no vocabulary beyond the six words. The near-misses are the feature; do not read its output as information.

Training data

A synthetic, in-repo corpus generated by generate.py — a deterministic reimplementation of Kazemi's bot (we own the generator rather than scraping it, so the corpus is frozen and inspectable, and — the real reason — so we can weight and now drift it). Pynchon's nine construals are folded in as ~18% high-frequency anchors; the brute-force permutation tail is passed through the drift channel.
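The weighting described above, anchors at roughly 18% and a permutation tail for the rest, might look like the sketch below. It is a simplified stand-in for generate.py under stated assumptions: ANCHORS holds only the two construals quoted elsewhere in this card rather than all nine, the tail punctuation is reduced to a single end mark, and build_corpus is a hypothetical name. The drift step is left out to keep the sketch focused on the mix.

```python
import itertools
import random

WORDS = ["you", "never", "did", "the", "kenosha", "kid"]
# Placeholder: only two of the nine anchors are quoted in this card.
ANCHORS = ["You, Never? Did the Kenosha Kid?", "You never did the Kenosha Kid"]
ANCHOR_FRACTION = 0.18  # approximate anchor share stated in the card


def build_corpus(n_lines, rng, anchor_fraction=ANCHOR_FRACTION):
    """Mix verbatim anchors with brute-force permutations of the six words."""
    tail = [" ".join(p) for p in itertools.permutations(WORDS)]  # 720 orderings
    lines = []
    for _ in range(n_lines):
        if rng.random() < anchor_fraction:
            lines.append(rng.choice(ANCHORS))
        else:
            # assumed simplification: sentence case plus one end mark
            lines.append(rng.choice(tail).capitalize() + rng.choice(".?!"))
    return lines


if __name__ == "__main__":
    rng = random.Random(0)
    corpus = build_corpus(2000, rng)
    share = sum(line in ANCHORS for line in corpus) / len(corpus)
    print(f"anchor share: {share:.3f}")
    print("\n".join(corpus[:5]))
```

Owning the generator is what makes a knob like ANCHOR_FRACTION available at all; a scraped corpus would fix that ratio.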

Training procedure

Evaluation

The metric is the qualitative dream, now measured rather than eyeballed. eval_dream.py samples the checkpoint warm (temperature 0.9, ~430 lines) and reports two things at once: anchor recall (the fraction of lines that match one of the nine anchors verbatim, and how many of the nine are covered) and a near-miss / garble breakdown (per word, by edit distance to the six canon words: 1–2 = near-miss, ≥3 = garble). The comparison against v1's checkpoints is the whole story:

| run | corpus | iters | val | anchor_hit | anchors covered | near-miss lines | garble lines | reading |
|---|---|---|---|---|---|---|---|---|
| r1 (v1, converged) | pristine | 2000 | 0.43 | 0.225 | 9/9 | 0.000 | 0.000 | crisp anchors, no near-misses |
| r3-mid (v1 champion) | pristine | 350 | 0.48 | 0.042 | 3/9 | 0.037 | 0.012 | near-misses only by undertraining, which couples them to blurred anchors |
| drift-r1 (v2) | drift 0.06 | 1100 | 0.65 | 0.138 | 9/9 | 0.331 | 0.002 | crisp anchors AND abundant near-misses (the win) |
| drift-r2 | drift 0.14 | 1100 | 0.85 | 0.131 | 8/9 | 0.592 | 0.035 | heavier drift: more near-misses, some garble, one lost anchor |

The v2 release (drift-r1) covers all nine anchors verbatim while carrying a near-miss on ~33% of lines with near-zero garble — ~9× the champion's near-miss rate and full anchor coverage, which the champion (3/9) never had. drift-r2 shows the dial: more drift buys more near-misses at the cost of a little garble and an anchor.
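The scoring rule behind these columns can be sketched as below. This is an illustration under assumptions, not eval_dream.py itself: it measures each word against its nearest canon word, uses an edit distance that counts an adjacent swap as a single edit (chosen because the drift channel treats a swap as one operation), and labels a line by its worst word. The card states only the 1–2 / ≥3 thresholds; the function names and the two-item ANCHORS set are placeholders.

```python
import re

CANON = ("you", "never", "did", "the", "kenosha", "kid")
# Placeholder: only two of the nine anchors are quoted in this card.
ANCHORS = {"You, Never? Did the Kenosha Kid?", "You never did the Kenosha Kid"}


def edit_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def classify_word(word):
    """0 edits = canon, 1-2 = near-miss, 3 or more = garble (nearest canon word)."""
    dist = min(edit_distance(word.lower(), canon) for canon in CANON)
    if dist == 0:
        return "canon"
    return "near-miss" if dist <= 2 else "garble"


def classify_line(line):
    """Assumed line rule: the worst word decides the label."""
    labels = [classify_word(w) for w in re.findall(r"[A-Za-z]+", line)]
    if "garble" in labels:
        return "garble"
    return "near-miss" if "near-miss" in labels else "clean"


def score_lines(lines, anchors=ANCHORS):
    """Report anchor recall and the near-miss / garble line rates together."""
    hits = [ln for ln in lines if ln in anchors]
    labels = [classify_line(ln) for ln in lines]
    n = len(lines)
    return {
        "anchor_hit": len(hits) / n,
        "anchors_covered": len(set(hits)),
        "near_miss_lines": labels.count("near-miss") / n,
        "garble_lines": labels.count("garble") / n,
    }


if __name__ == "__main__":
    samples = [
        "You, Never? Did the Kenosha Kid?",
        "You never did 'tthe,' Kenosha Kid!",
        "Did never Kenosha kid the yyou?",
    ]
    print(score_lines(samples))
```

With a plain Levenshtein distance, a swapped pair costs two edits instead of one, so a word such as "Kneoshaa" would cross the garble threshold; which distance the real evaluator uses is not stated in the card.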

Representative samples (raw, uncherry-picked, temperature 0.9, from projects/kenosha-kid/runs/drift-samples.md):

You, Never? Did the Kenosha Kid?
You never did 'tthe,' Kenosha Kid!
Did never Kenosha kid the yyou?
iDd you the Kenosha never did
You never did the Kenosha Kid
Kneoshaa diid Kid the you. Neeer
Kenoshha you did Kid 'never', never?
You never did the Kenosha Kid.

Verbatim anchors ("You, Never? Did the Kenosha Kid?", "You never did the Kenosha Kid") sit right next to near-misses ("tthe", "yyou", "iDd", "Kneoshaa", "diid", "Neeer", "Kenoshha") — in the same converged model.

A comparison chart (v2 vs v1 baselines: anchor-coverage and near-miss line-rate as DRIFT_RATE climbs 0.0 → 0.06 → 0.14) would make the decoupling and the dial legible at a glance. It is not authored here — charts go through the tools/dataviz/ pipeline; this card only describes it.

Limitations

Honest about what it is:

How to reproduce

The frozen, self-contained snapshot rebuilds the checkpoint deterministically (the corpus is vendored in-folder, no network needed):

cd projects/kenosha-kid/models/kenosha-kid-nanogpt-2
python generate.py            # (optional) rewrites raw.txt identically (DRIFT_RATE=0.06)
python prepare.py             # raw.txt -> kenosha/{train,val}.bin + meta.pkl
python train.py config.py     # -> ./ckpt.pt  (converged, 1100 iters, val ~0.65)
python sample.py --out_dir=. --data_root=. --device=cpu --start=$'\n' --temperature=0.9
python eval_dream.py --device=cpu --num_samples=40   # the dream-score

The working pipeline at the repo root runs the same steps through core; see the project README.md and the v1 write-up dream-a-single-phrase.md, whose closing line — "a corpus that itself drifts" — this model implements.

Citation / credits