Can a chess model's illegal moves be the point?
Key takeaways
- Three independently trained chess-move GPTs — Micro (5×5 Gardner minichess, 0.79M params), Regular (8×8 standard, 2.66M params), Grand (12×10 custom variant, 4.73M params) — built around one inversion: illegal moves are rendered as dim near-misses, not masked away by the sampler.
- All three land in a tight band of legal-move rate on a raw, unresampled first try — 35.3% (Regular), 39.2% (Micro), 36.7% (Grand) — despite wildly different board sizes, vocabularies, and corpus sources. Legality-learning difficulty didn't scale sharply with board complexity in this build.
- Two facts in the original design plan were wrong when checked against the live engine instead of trusted from a web search: real Grand Chess has one Chancellor and one Archbishop per side (not two), and the installed Fairy-Stockfish caps board ranks at 10 even though files go to 12 — so Grand shipped as 12×10, not 12×12.
- The release gate for this series is not win rate. It's two automated checks — clean game completion and legal-move rate — deliberately decoupled from chess strength, which was never the point.
The fifth research round in the studio, and the first in a new faculty:
daydream isn't a text-generation
project like shakespeare, gatsby, or kenosha-kid — it's the studio's first
sampler/prober project, where the interesting behavior is what happens
to a model's incorrect output, not its generation quality.
The inversion
Most chess-model work treats illegal output as failure: mask it, reject and resample it, penalize it during training. Daydream inverts that framing entirely. A candidate move from the model is either legal — it snaps into focus and becomes the actual move played — or illegal, a rejected dream that the harness keeps rather than discards. The sampler's resample-until-legal loop is the artwork's mechanism, not a bug-fix layered on top of it.
That reframing is what makes this a genuinely different kind of project for the studio: shakespeare, gatsby, and kenosha-kid are all measured on how well they generate; daydream is measured on the shape of its failure.
Three boards, three models, one mechanism
| Tier | Board | Params | Corpus | Vocab |
|---|---|---|---|---|
| Micro | 5×5 (Gardner minichess) | 0.79M | 4,135 Fairy-Stockfish self-play games | 15 chars |
| Regular | 8×8 standard | 2.66M | 15,000 Lichess games, ~1400–1800 Elo | 21 chars |
| Grand | 12 files × 10 ranks, custom | 4.73M | 2,101 Fairy-Stockfish self-play games | 27 chars |
Each tier is a fully independent corpus, tokenizer, and trained model — UCI square names depend on board dimensions, so nothing is shareable across tiers. Micro uses Gardner minichess's real, balance-tested arrangement directly. Grand extends real Grand Chess (10×10) to 12 files by mirroring a second Chancellor and Archbishop pair onto the board, keeping non-pawn piece density roughly proportional to the wider board — the same logic historical variants like Capablanca Chess used when a plain 16-piece army made a bigger board too sparse and slow.
What "verified" means here
Release for each tier is gated on two automated checks, run by a shared
harness.py that plays the checkpoint against a skill-limited
Fairy-Stockfish opponent, resampling the model's raw output on illegal
moves (and forcing a random legal move if a resample cap is hit, so a game
never gets stuck):
| Tier | Clean completion | Legal-move rate (first try) |
|---|---|---|
| Regular | 30/30 (100%) | 258/731 (35.3%) |
| Micro | 30/30 (100%) | 121/309 (39.2%) |
| Grand | 30/30 (100%) | 230/627 (36.7%) |
Win rate against the opponent is deliberately not part of this gate — this project was never about playing strength. Legal-move rate is a legality-learning signal: roughly a third of the model's raw samples, with zero rejection sampling applied, land on a real legal move in the actual current position. The rest are the dream.
The surprising consistency
Given how different the three boards are — 25 squares vs. 64 vs. 120, 6 vs. 16 vs. 24 non-pawn pieces per side, corpora ranging from 2,101 to 15,000 games — the legal-move rates land in a strikingly tight band (35.3% to 39.2%). This wasn't designed for; it just happened. One reading: the difficulty of "learn this board's local movement grammar from a few thousand games" doesn't scale sharply with board size in this range, at least not at these small model scales. Another reading: it's a coincidence of this particular set of training-run lengths and model sizes, and a different set of hyperparameters per tier could easily separate the three. Worth another round to find out which.
Two corrections, found by asking the engine instead of the internet
Daydream's design plan (written before any code existed) leaned on a web search to spec Grand's board — and two of those facts were wrong.
Wrong claim 1: real Grand Chess has two Chancellors and two Archbishops
per side. Actual: querying the live Fairy-Stockfish engine
(setoption name UCI_Variant value grand + d) shows the real base
variant has exactly one of each, sitting off-center next to the king —
not the clean, symmetric layout the search summary implied.
Wrong claim 2: Fairy-Stockfish supports boards up to 36×36, so a 12×12
Grand was buildable. Actual: the installed build's Rank option type
caps at 10 even though File goes to 12 — confirmed the hard way, when
stockfish check rejected the first 12×12 config outright.
Both were corrected by querying the actual running binary rather than trusting either the original web research or the design document that inherited it. Grand shipped as 12 files × 10 ranks, with a second Chancellor/Archbishop pair added deliberately (not copied from a misremembered base) to keep the board's piece density proportional at the wider size. The lesson generalizes past this one project: for any claim about a live tool's actual behavior or limits, check the tool, not a summary of documentation about the tool.
Limitations
- None of the three tiers were evaluated for chess strength, on purpose — that was never this project's gate.
- Grand's corpus is the smallest of the three (2,101 games) — self-play on the larger, slower board took longer to generate within this build's time budget than Micro's or Regular's corpora did.
- Legal-move rate is a first-try, single-config snapshot — sampled at one temperature and no soft-cap tuning. Daydream's design treats logit soft-capping as an aesthetic "dreaminess" knob still to be explored; none of these three checkpoints have had that knob swept yet.
- The tight legal-move-rate band across tiers is observational, not causally established — see "the surprising consistency" above.
How to reproduce
Each tier's frozen release folder rebuilds independently and in place:
cd projects/daydream/models/daydream-chess-nanogpt-1 # Regular
cd projects/daydream/models/daydream-chess-nanogpt-micro-1 # Micro
cd projects/daydream/models/daydream-chess-nanogpt-grand-1 # Grand
python prepare.py && python train.py config.py
python harness.py --games 30
Requires Fairy-Stockfish on PATH (brew install fairy-stockfish) — see
ADR-0021 and
ADR-0022
for the full architectural rationale.
Credits
- The shared
coreengine (modern nanoGPT lineage — RoPE, RMSNorm, bias-free). - Fairy-Stockfish — self-play corpus generation and the legality-check primitive shared by all three tiers.
- The Lichess open database — Regular's corpus source.
- Set up and trained with Claude (Claude Code).