sup computer — a small language model studio
sup computer is a research studio building small language models from scratch — small enough to train end to end on a consumer laptop, and still useful.
Our methods are LLM-assisted: a mixture of models works each step, from dataset creation to training and evaluation, under human direction. All of our research is open-source and freely available.
Models
- daydream-chess-nanogpt: Plays chess without ever knowing the rules: learned move by move from games, not a rulebook, across three board sizes. Versions: v1 · v1-grand · v1-micro
- gatsby-nanogpt: Bends any story toward the green light, with obsession you can dial from 1 to 5. Versions: v2 · v1
- kenosha-kid-nanogpt: Dreams endlessly on just six words. Versions: v2 · v1
- shakespeare-nanogpt: Writes Shakespeare from scratch, and gets sharper every research round. Versions: v3 · v2 · v1
Research
An LLM-assisted experiment: four rounds took held-out BPC from 2.395 to 1.919. More data was the win; regularization was the dead end.
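For readers new to the metric, BPC is bits per character: the average cross-entropy per character, expressed in base 2. The sketch below shows the conversion, assuming the training loop reports mean per-character loss in nats (what a natural-log cross-entropy returns). The function names are hypothetical and not taken from the studio's code.

```python
import math

def nats_to_bpc(mean_loss_nats: float) -> float:
    """Convert a mean per-character cross-entropy in nats to bits per character."""
    return mean_loss_nats / math.log(2)

def bpc_from_probs(probs):
    """BPC from the probabilities a model assigned to each true next character."""
    if not probs:
        raise ValueError("need at least one probability")
    return -sum(math.log2(p) for p in probs) / len(probs)
```

Lower is better: a model that assigns probability 0.5 to every true character scores exactly 1 bit per character.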
A single afternoon spent improving all four sup computer models at once — a larger model planned a per-model optimization, small runs executed it. Two new releases (shakespeare-nanogpt-3, kenosha-kid-nanogpt-2), one migration, one eval-only characterization, and a handful of findings that only show up when you look across projects side by side.
A three-tier chess-move GPT family (5x5, 8x8, and a custom 12x10 board) built around a single inversion: illegal moves are rendered as dim near-misses instead of being masked away by the sampler. All three tiers land in a tight band of legal-move rate (35-39% on a raw, unresampled first try) despite very different board sizes, vocabularies, and corpus sources -- and two separate facts in the original design plan turned out to be wrong when checked against the live engine instead of trusted from web research.
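One way to read "legal-move rate on a raw, unresampled first try" is as the fraction of positions where the model's first sampled move is already legal, with no masking and no retry. The sketch below assumes moves are plain strings and that a chess engine has supplied the legal set for each position. `legal_move_rate` and `render` are hypothetical helpers, and the parentheses stand in for whatever "dim" styling the real interface uses.

```python
from typing import Sequence, Set

def legal_move_rate(first_tries: Sequence[str], legal_sets: Sequence[Set[str]]) -> float:
    """Fraction of positions whose first sampled move is legal (no resampling)."""
    if len(first_tries) != len(legal_sets):
        raise ValueError("need one legal-move set per sampled move")
    if not first_tries:
        return 0.0
    hits = sum(move in legal for move, legal in zip(first_tries, legal_sets))
    return hits / len(first_tries)

def render(move: str, legal: Set[str]) -> str:
    """Show an illegal move as a marked near-miss instead of hiding it."""
    return move if move in legal else f"({move})"
```

Keeping the illegal samples visible is what makes the rate measurable at all; a sampler that masks them would report 100% by construction.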
A repo-wide audit by a larger model found the small-model studio's engine had two advertised code paths that crashed on use, a metric that quietly flattered char models, and a resume that restarted. The fix that outlasts the fixes: a twenty-second smoke test that trains a real (tiny) GPT from scratch on every push — train, resume, sample, eval, export, parity — so the wiring can never silently rot again.
The smallest obsession in the studio: a char-level model whose entire corpus is punctuated permutations of six words. A bot enumerates that space exactly; a learned model can't — and the blur it produces instead is the artifact. The finding: dreaminess is governed by two knobs, training progress and sampling temperature.
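The contrast between exact enumeration and a temperature-controlled blur can be sketched in a few lines. This is an illustration under assumptions, not the studio's code: the six words are placeholders, punctuation variants are ignored, and `softmax_with_temperature` is the standard temperature-scaled softmax rather than a confirmed detail of the model's sampler.

```python
import itertools
import math

# Placeholder words; the actual six-word corpus is not reproduced here.
WORDS = ["one", "two", "three", "four", "five", "six"]

def orderings(words):
    """Enumerate every ordering of the words exactly, as a bot would."""
    return [" ".join(p) for p in itertools.permutations(words)]

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Six distinct words have 720 orderings before punctuation is considered. Raising the temperature moves probability mass away from the model's top choice, which is one of the two knobs the blurb names.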
Don't serve a model — export only its forward pass as a static ONNX graph (tokens in, last-position logits out) and keep the autoregressive loop, sampling, and tokenization in JS, so a small model becomes a static asset that runs client-side with no server.
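The split described above treats the model as a pure function (tokens in, last-position logits out) and puts everything else in the caller. The sketch below shows that loop in Python purely for illustration; the studio's version runs in JS against an ONNX graph. `generate`, `toy_forward`, and `block_size` are hypothetical names, and `toy_forward` is a stand-in for the exported graph, not a real model.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Forward = Callable[[Sequence[int]], List[float]]

def generate(forward: Forward, prompt: Sequence[int], max_new_tokens: int,
             block_size: int, temperature: float = 1.0,
             rng: Optional[random.Random] = None) -> List[int]:
    """Autoregressive loop kept outside the model: crop, forward, sample, append."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        context = tokens[-block_size:]   # crop to the model's context window
        logits = forward(context)        # the only call into the model
        scaled = [x / temperature for x in logits]
        peak = max(scaled)
        weights = [math.exp(s - peak) for s in scaled]
        tokens.append(rng.choices(range(len(weights)), weights=weights, k=1)[0])
    return tokens

def toy_forward(context: Sequence[int]) -> List[float]:
    """Stand-in for the exported graph: strongly prefers (last token + 1) mod 4."""
    vocab_size = 4
    preferred = (context[-1] + 1) % vocab_size
    return [20.0 if i == preferred else -20.0 for i in range(vocab_size)]
```

Because the graph has no state and no sampling logic, the same exported file can sit behind any loop, which is what lets it ship as a static asset.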
gatsby's first corpus cost ~$6 of Claude API to write. This round throws that out and has a mixture of four local open models — Olmo, Ministral, Gemma, Granite — write the corpus instead: free, unlimited, and in four different voices. The model that results matches the paid baseline's behaviour at $0. The catch, and the finding: the blend is a designed object. A granite-heavy first round broke the green-light dial; rebalancing off it and doubling the data brought the dial back.
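Treating the blend as a designed object amounts to making the per-model share an explicit, tunable input. The sketch below is one way that could look, assuming each model's output is already a list of documents. `blend_corpus` and any weights passed to it are hypothetical; the actual proportions used in the round are not stated here.

```python
import random
from typing import Dict, List, Tuple

def blend_corpus(sources: Dict[str, List[str]], weights: Dict[str, float],
                 n_samples: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Draw n_samples documents, choosing each one's source model by weight."""
    names = sorted(sources)
    missing = [name for name in names if name not in weights]
    if missing:
        raise ValueError(f"no weight given for: {missing}")
    rng = random.Random(seed)  # seeded so a blend can be reproduced
    picks = rng.choices(names, weights=[weights[n] for n in names], k=n_samples)
    return [(name, rng.choice(sources[name])) for name in picks]
```

Rebalancing away from one model is then a one-line change to the weights, and the seed keeps two blends comparable.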
A char-level model built to compulsively reach for Gatsby's green light — and the $0, fully-controlled ablation that found the dial's real bottleneck: signal loudness, not corpus shape.