A pass over the studio: one research loop across four models

Key takeaways

shakespeare-nanogpt-3 (released): a 1024-token corpus-trained BPE vocab on an enlarged early-modern-drama corpus reaches BPC 1.831 at 11M params — beating the v2 champion's 1.919 (29.9M) and matching a fresh GPT-2-vocab control (1.843, 29.9M) at one-third the parameters.
kenosha-kid-nanogpt-2 (released): a self-drifting corpus decouples the two dream qualities — a converged model keeps all 9 anchors crisp and dreams near-misses in ~33% of lines, which the undertrained v1 champion structurally cannot.
gatsby (migrated, not released): moving to core + BPE fixed the intensity dial (now strictly monotonic) but not topic-honoring (~1/15) — the bottleneck is corpus content and fast overfit, not the tokenizer.
daydream (eval-only): the logit soft-cap is a monotonic "dreaminess" dial that dominates temperature — from lucid opening play (a real Ruy Lopez) at cap-off to total off-board hallucination at a tight cap.
Cross-cutting: BPE compression is a double-edged sword on tiny corpora; an MPS float16 precision bug masqueraded as a learning-rate problem; and "dreaminess" turns out to be a tunable sampler-side knob in two different projects.

The method: a large model plans, small runs execute

This round has an unusual shape. Rather than one project advancing one step, a larger model surveyed all four sup computer models, proposed a specific improvement for each, and then small training/eval runs executed those plans in sequence. The planning layer and the artifact layer are different sizes: the studio's models are laptop-scale (0.8M–30M params); the researcher directing them is not. The point of the pass was breadth — to see what one coordinated loop surfaces across four very different objectives (mimicry, obsession, dreaming, and a six-word incantation) in a single sitting.

Shakespeare — a small vocab beats a big one, once precision is fixed

The v2 champion is a 29.9M-param model on GPT-2's 50,257-token vocab (BPC 1.919), data-starved at ~1.5M training tokens. Two levers, combined: enlarge the corpus (Complete Works + Marlowe, Jonson, Kyd, Webster — ~7.85M training chars, the held-out test.txt kept bit-identical and excluded), and replace GPT-2's vocab with a small BPE vocab trained on the corpus itself.

A first sweep (1k/4k/16k vocabs) hit an instability: everything above 1024 tokens diverged. It looked like too-hot a learning rate, but the real cause was float16 autocast on MPS with the CUDA-only GradScaler disabled — large-vocab logits overflow. Re-running every vocab at dtype=float32 removed it and made the comparison honest:

model	vocab	params	BPC
shakespeare-nanogpt-3 (bpe1k)	1024	11.0M	1.831
bpe4k	4096	12.2M	1.813
gpt2 control (float32)	50257	29.9M	1.843
shakespeare-nanogpt-2 (champion)	50257	29.9M	1.919
shakespeare-nanogpt-1	char (65)	10.7M	2.395

The corpus-trained small vocabs match or beat the 50k-vocab model at a third of the parameters — most of which, in the big model, is a mostly-unused embedding table. The raw BPC edge over the fresh float32 control is small (within single-seed noise); the clear, robust win is params-efficiency plus beating the prior champion. Released as shakespeare-nanogpt-3 (the 1k vocab — smallest, BPC tied-best within noise). The honest next step is multi-seed to resolve bpe1k vs bpe4k and the edge over the control.

Gatsby — the tokenizer fixed the dial, not the topic

Gatsby's long-standing complaint (see obsession on a dial, §6) was that a char-level model conditions too weakly on a short control prefix. The structural fix named there was BPE. This round executed it: the active project migrated off its vendored base engine onto core (modern arch) with a 1024-vocab BPE tokenizer, recorded in ADR-0023 (supersedes ADR-0011).

BPE did fix the dimension it most directly drives. The green-light intensity dial went from v2's non-monotonic 3.72 / 4.78 / 4.67 / 4.50 / 6.06 (it inverts at level 3→4) to a clean, strictly-monotonic 2.08 / 3.67 / 4.08 / 5.67 / 8.08.

It did not fix topic-honoring — ~1/15 held-out topics landed ("a robot" still becomes a rock and roses). The bottleneck isn't the tokenizer: the corpus stories abandon their own topic once the light appears, and — because BPE compressed the ~1.5M-char corpus to only ~445k tokens — the model overfit by step 750, before conditioning was learned. That's exactly the small-corpus failure mode the earlier report predicted. Not released; the next gatsby round must be a corpus round (topic-faithful enlargement + far fewer iters), not another tokenizer round.

Daydream — soft-cap is the dreaminess dial

Eval-only, on the three frozen tiers (Micro 5×5, Regular 8×8, Grand 12×10). The original report left the logit soft-cap sweep unfinished; this round finishes it, with Fairy-Stockfish as the legality arbiter across 480 games.

The soft-cap is a monotonic dreaminess dial, and it dominates temperature. Sweeping the cap from off to tight spans the whole range of behavior (~55% down to ~0% first-try-legal moves); sweeping temperature at a fixed cap moves it only 10–25 points. The direction follows the Gemma-tanh mechanic — a tight cap compresses the logits toward uniform, so tight cap is maximal dream. Three regimes emerge:

Cap off (lucid): the model plays real chess first — a genuine Ruy Lopez Exchange line, ~13 legal plies — then drifts. The dream is a late-game phenomenon; the opening is coherent.
Cap mid: the board destabilizes by move 1–2; rejected moves carry malformed tokens (non-existent squares, doubled files).
Cap tight: total hallucination from move 0 — the output stops being well-formed UCI at all. The signature "syntactically valid, semantically impossible" texture breaks down further into syntactically impossible.

Legality also decays with game depth (Regular ~64% in the opening to single digits deep), and the sweep reproduces the report's published single-config numbers as a special case. No new weights — the three tiers stand; the gain is a characterized knob.

Kenosha-kid — a self-drifting corpus decouples the dream

Kenosha's inverted objective (a lower-loss model is a worse artifact) had left its two dream qualities coupled: the champion gets near-misses only by being undertrained, so crisp anchors and abundant near-misses could not coexist. The fix: add a DRIFT_RATE misspelling channel to the corpus generator, applied only to the permutation tail lines — the 9 Pynchon anchors stay pristine.

A converged model on the drift corpus now reproduces all 9 anchors verbatim and carries near-misses in ~33% of lines (vs the champion's 3/9 anchors and ~3.7% near-misses). The two qualities are decoupled, and DRIFT_RATE becomes an explicit dial. A free alternative — leaving dropout active at inference (MC-dropout) on the crisp model — failed: it re-couples, adding no near-misses at low dropout and collapsing anchor recall at high. Released as kenosha-kid-nanogpt-2.

Cross-cutting findings

BPE compression is a double-edged sword on tiny corpora. It shrinks the effective token count, which helped Shakespeare (a small vocab matched a big one at a third the size) and hurt Gatsby (the model overfit before it could learn to condition). The lesson is the same in both directions: match training length to the token count, and pair small vocabs with more data.
A precision bug can look exactly like a hyperparameter problem. The large-vocab "divergence" was float16 overflow on MPS (the CUDA-only GradScaler sits disabled), not a hot learning rate. dtype=float32 fixed it — and helped every vocab, improving even the GPT-2 control from 1.919 to 1.843.
"Dreaminess" is a tunable, sampler-side quantity in two unrelated projects: daydream's logit soft-cap and kenosha's drift-rate (and the failed MC-dropout attempt). Both reinforce the studio's thesis that the sampler, not just the weights, is part of the artifact.
Single-seed measurement is the recurring rigor gap. Several of the deltas here (bpe1k vs bpe4k, bpe1k vs the fresh control) are within noise. The honest next step across the studio is multi-seed reporting.

A note on the tooling: the studio tripped its own safeguards

One artifact of how this session ran is worth recording, because it is unusually on-the-nose. Partway through — while orchestrating training runs with a great deal of "train a model" language — Fable 5's safety layer flagged a message and the session automatically switched to Opus 4.8, then switched back.

What is observed is solid: the switch happened, mid-training, during ordinary non-adversarial orchestration, and the harness itself noted this "sometimes happens with safe, normal conversations." What is inferred — and not directly confirmed — is the cause: that the trigger was the dual-use / anti-distillation safeguard that Fable 5 carries and that Mythos 5 (available to approved organizations) does not. The timing and context make that the natural reading, but the classifier's actual reason is not exposed, so it should be treated as a strong inference rather than a fact.

Either way, the situation is a small irony worth keeping: a Claude-operated studio whose entire purpose is training small language models, tripping Claude's own model-training safeguards while doing exactly that. It cost a brief model switch and nothing else — but it is a real feature of building an AI-run research studio with a safety-tuned model at the helm, and future rounds here should expect it.