loop.py is an agent for iterative implementation search.
At a high level, it does one thing: ask a model for a single function, test that function inside a real experiment harness, and keep the results that are actually worth keeping.
This project follows in the footsteps of karpathy/autoresearch.
The most important idea in this repo is memory.
Blind retry loops waste time. Models repeat compile mistakes, drift back to the same weak strategy, or keep exploring parts of the search space that have already been exhausted. This repo keeps a small amount of experiment-local memory so the search can accumulate judgment instead of just generating more text.
Each experiment keeps:
- `MEMORY.json` as structured state
- `MEMORY.md` as the prompt-facing version of that state
That memory is meant to answer three questions:
- what has worked
- what keeps failing
- what still looks unexplored
That is the main thing that makes the loop useful over longer runs.
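A minimal sketch of what that experiment-local memory could look like (the field names and file layout here are illustrative assumptions, not the repo's actual schema):

```python
import json
from pathlib import Path


def render_memory(memory: dict) -> str:
    """Render structured memory into the prompt-facing markdown summary.

    The three sections mirror the three questions the memory must answer.
    """
    lines = ["# Search memory", ""]
    for title, key in [("What has worked", "worked"),
                       ("What keeps failing", "failing"),
                       ("What looks unexplored", "unexplored")]:
        lines.append(f"## {title}")
        for item in memory.get(key, []):
            lines.append(f"- {item}")
        lines.append("")
    return "\n".join(lines)


def save_memory(exp_dir: Path, memory: dict) -> None:
    # MEMORY.json is the machine state; MEMORY.md is regenerated from it.
    (exp_dir / "MEMORY.json").write_text(json.dumps(memory, indent=2))
    (exp_dir / "MEMORY.md").write_text(render_memory(memory))
```

The point is that the markdown is always derived from the JSON, never edited independently.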
From the repo root:
```
python loop.py <experiment> --model <model> --loop <n>
```

On each iteration, the agent:
- loads the experiment contract
- builds the prompt from experiment-owned context
- calls the model for exactly one target function
- writes the candidate into the experiment folder
- runs the evaluator
- records what happened
- commits accepted winners
The split is intentional. The model proposes. The evaluator decides.
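The steps above can be sketched as one function. The callables and the `EvalResult` fields are hypothetical stand-ins for the repo's actual modules, but the shape of the split is the point: nothing the model says can accept a candidate.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    accepted: bool
    median_us: float
    notes: str = ""


def run_iteration(build_prompt, call_model, evaluate, record, commit_winner):
    """One iteration: the model proposes, the evaluator decides."""
    prompt = build_prompt()        # experiment-owned context
    response = call_model(prompt)  # exactly one target function
    result = evaluate(response)    # validation, scoring, acceptance
    record(prompt, response, result)  # full chain stored for inspection
    if result.accepted:
        commit_winner(result)      # only the evaluator reaches this branch
    return result
```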
Every experiment owns its own rules and its own enforcement:
```
<experiment>/
  EXPERIMENT.md
  experiment.json
  naive.*
  candidate.*
  prompt.py
  security.py
  eval.py
  seed.py
  SCORES.json
  SCORES.md
  MEMORY.json
  MEMORY.md
  winners/
  runs/
```
That keeps loop.py generic.
- `EXPERIMENT.md` defines the target function and the constraints
- `prompt.py` decides what context the model should see
- `security.py` rejects obviously bad candidates before compile
- `eval.py` owns validation, scoring, and acceptance rules
- `seed.py` prepares a fresh search state without touching repo history
The loop stays small. The experiment stays opinionated.
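Keeping `loop.py` generic presumably means importing the experiment's own modules by path rather than hardcoding them; a sketch under that assumption, using module names from the layout above:

```python
import importlib.util
from pathlib import Path


def load_experiment_module(exp_dir: Path, name: str):
    """Import <experiment>/<name>.py without requiring it on sys.path.

    This lets loop.py call prompt.py, security.py, eval.py, and seed.py
    from any experiment folder without knowing what they contain.
    """
    path = exp_dir / f"{name}.py"
    spec = importlib.util.spec_from_file_location(f"{exp_dir.name}.{name}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```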
The agent keeps more than just a leaderboard.
Each experiment stores its scores in SCORES.json and renders SCORES.md from scratch.
That gives you:
- a machine-friendly source of truth
- a readable scoreboard
- clean regeneration instead of incremental markdown editing
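Regeneration could look roughly like this. The row fields mirror the scoreboard tables below; the exact `SCORES.json` schema is an assumption:

```python
def render_scores(scores: dict) -> str:
    """Rebuild the scoreboard markdown from scratch on every update."""
    header = "| Timestamp | Model | Median | Seed | Win | Commit | Notes |"
    rule = "|---|---|---|---|---|---|---|"
    rows = []
    # Fastest candidates first; the win column is derived, never stored.
    for r in sorted(scores["rows"], key=lambda r: r["median_us"]):
        rows.append(
            f"| {r['timestamp']} | {r['model']} | {r['median_us']:.2f} (us) "
            f"| {r['seed_us']:.2f} (us) | x{r['seed_us'] / r['median_us']:.3f} "
            f"| {r['commit']} | {r['notes']} |"
        )
    return "\n".join([header, rule, *rows])
```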
Experiments can track:
- overall winners
- per-case winners
That matters because some candidates are not globally best but are still clearly better on a specific workload shape.
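Per-case tracking can be as simple as keeping the best score for each benchmark case alongside the overall best. A sketch, assuming lower median latency is better (the case names are illustrative):

```python
def update_winners(winners: dict, candidate: str, case_scores: dict) -> dict:
    """Track the overall winner and the per-case winner for each workload shape.

    `case_scores` maps case name -> median latency in microseconds.
    A candidate can lose overall yet still take a per-case slot.
    """
    total = sum(case_scores.values())
    if "overall" not in winners or total < winners["overall"][1]:
        winners["overall"] = (candidate, total)
    per_case = winners.setdefault("per_case", {})
    for case, score in case_scores.items():
        if case not in per_case or score < per_case[case][1]:
            per_case[case] = (candidate, score)
    return winners
```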
Each attempt is stored under runs/.
That includes:
- the rendered prompt
- the raw model response
- the extracted candidate
- the evaluation result
This makes the loop inspectable. If a candidate fails, the full chain of cause and effect is still there.
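Storing those four artifacts is mechanical; a sketch of what it could look like (the per-run file names are assumptions):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_run(exp_dir: Path, prompt: str, response: str,
               candidate: str, result: dict) -> Path:
    """Store the full chain of one attempt under runs/ for later inspection."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S-%f")
    run_dir = exp_dir / "runs" / stamp
    run_dir.mkdir(parents=True)
    (run_dir / "prompt.md").write_text(prompt)        # rendered prompt
    (run_dir / "response.md").write_text(response)    # raw model response
    (run_dir / "candidate.c").write_text(candidate)   # extracted candidate
    (run_dir / "result.json").write_text(json.dumps(result, indent=2))
    return run_dir
```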
Token usage is part of the experiment state, not an afterthought.
Each experiment tracks:
- prompt tokens
- completion tokens
- cached prompt tokens
- total tokens
- estimated cost
Those totals are written into SCORES.json and shown at the top of SCORES.md. If the search gets expensive, that should be visible directly in the experiment output.
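Accumulating those totals per call is straightforward; a sketch, with purely illustrative per-million-token prices (the real cost estimate depends on the model's actual pricing):

```python
def add_usage(totals: dict, usage: dict,
              price_per_mtok: tuple[float, float] = (1.0, 4.0)) -> dict:
    """Fold one API call's token usage into the experiment's running totals.

    price_per_mtok = ($ per 1M prompt tokens, $ per 1M completion tokens);
    the defaults here are made up for the example.
    """
    for key in ("prompt_tokens", "completion_tokens", "cached_prompt_tokens"):
        totals[key] = totals.get(key, 0) + usage.get(key, 0)
    totals["total_tokens"] = totals["prompt_tokens"] + totals["completion_tokens"]
    in_price, out_price = price_per_mtok
    totals["estimated_cost"] = (totals["prompt_tokens"] * in_price
                                + totals["completion_tokens"] * out_price) / 1_000_000
    return totals
```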
Accepted candidates are archived into winners/ by commit hash.
That gives you:
- a stable source snapshot for every accepted result
- score rows that point to real code
- the ability to reseed without losing discovered implementations
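Archiving by commit hash could be as simple as copying the candidate under its hash; a sketch, where `commit` would come from something like `git rev-parse --short HEAD` after the winning commit is made:

```python
import shutil
from pathlib import Path


def archive_winner(exp_dir: Path, candidate_path: Path, commit: str) -> Path:
    """Snapshot an accepted candidate under winners/, keyed by commit hash.

    The copy survives reseeding, so score rows keep pointing at real code.
    """
    dest = exp_dir / "winners" / f"{commit}{candidate_path.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(candidate_path, dest)
    return dest
```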
Each experiment provides its own seed.py.
Seeding resets the live state:
- candidate
- scores
- memory
- runs
Seeding does not reset:
- git history
- archived winners
That lets you start a new search from scratch while still keeping a record of what the experiment has already discovered.
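The reset/keep split above can be sketched in a few lines (the exact file list each experiment wipes is an assumption; the real `seed.py` is experiment-specific):

```python
import shutil
from pathlib import Path

# Live search state that a reseed wipes; winners/ and .git are never touched.
LIVE_STATE = ("SCORES.json", "SCORES.md", "MEMORY.json", "MEMORY.md", "runs")


def reseed(exp_dir: Path) -> None:
    """Reset the live search state without touching archived discoveries."""
    for name in LIVE_STATE:
        path = exp_dir / name
        if path.is_dir():
            shutil.rmtree(path)
        elif path.exists():
            path.unlink()
    for cand in exp_dir.glob("candidate.*"):
        cand.unlink()
    (exp_dir / "runs").mkdir()
```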
This repo currently has two active experiments:
- `linear`: pure C search for a faster dense linear-layer kernel
- `attention`: pure C search for a faster masked scaled-dot-product attention kernel
See linear/README.md and attention/README.md for the experiment-specific details.
Requirements:
- Python 3.12+
- `OPENAI_API_KEY`
- the toolchain required by the chosen experiment
Example:

```
python linear\seed.py
python loop.py linear --model gpt-5.4-2026-03-05 --loop 10
```

Attention example:

```
python attention\seed.py
python loop.py attention --model gpt-5.4-2026-03-05 --loop 10 --temperature 1
```

- current best median: 13.80 (us)
- seed baseline: 14.85 (us)
- cumulative usage: 98026 total tokens, $0.000000
| Timestamp | Model | Median | Seed | Win | Commit | Notes |
|---|---|---|---|---|---|---|
| 2026-03-13 03:53:39 | gpt-4.1-mini | 13.80 (us) | 14.85 (us) | x1.076 | d9efdac | accepted nonbest: unrolled the I loop, split bias and no-bias paths |
| 2026-03-13 03:56:13 | gpt-4.1-mini | 13.93 (us) | 14.85 (us) | x1.066 | c7819bd | accepted nonbest: split bias and no-bias paths |
| 2026-03-13 03:51:19 | gpt-4.1-mini | 13.99 (us) | 14.85 (us) | x1.061 | b58d896 | best winner: split bias and no-bias paths, used pointer row traversal |
| 2026-03-13 03:57:04 | gpt-4.1-mini | 14.04 (us) | 14.85 (us) | x1.058 | c4dc512 | accepted nonbest: split bias and no-bias paths |
| 2026-03-13 03:54:55 | gpt-4.1-mini | 14.28 (us) | 14.85 (us) | x1.040 | 9861a06 | accepted nonbest: split bias and no-bias paths, used pointer row traversal |
| 2026-03-13 03:56:39 | gpt-4.1-mini | 14.38 (us) | 14.85 (us) | x1.032 | f601bc9 | accepted nonbest: split bias and no-bias paths |
| seed | seed | 14.85 (us) | 14.85 (us) | x1.000 | seed | seeded baseline winner |
- current best overall full-suite median: 1577.90 (us)
- best `gpt2_bench_ctx32` specialist: 212.50 (us)
- seed baseline: 8234.20 (us)
- cumulative usage: 730996 total tokens, $4.77886550
- validation is adversarial and correctness-focused
- benchmarks are GPT-2-shaped causal self-attention at 32, 128, and 256 tokens
- the 1024 benchmark case is intentionally kept commented out in `attention/app.c`
| Timestamp | Model | Median | Seed | Win | Commit | Notes |
|---|---|---|---|---|---|---|
| 2026-03-13 07:07:49 | gpt-5.4-2026-03-05 | 1577.90 (us) | 8234.20 (us) | x5.218 | 8258684 | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:31:54 | gpt-5.4-2026-03-05 | 1537.60 (us) | 8234.20 (us) | x5.355 | 4269f7e | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:49:53 | gpt-5.4-2026-03-05 | 1645.40 (us) | 8234.20 (us) | x5.004 | b5146fe | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:57:34 | gpt-5.4-2026-03-05 | 1670.40 (us) | 8234.20 (us) | x4.929 | 95ba3a7 | accepted nonbest: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:46:05 | gpt-5.4-2026-03-05 | 1680.30 (us) | 8234.20 (us) | x4.900 | cbe1d86 | accepted nonbest: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:29:34 | gpt-5.4-2026-03-05 | 1982.30 (us) | 8234.20 (us) | x4.154 | 42cd3cd | best winner: used SIMD intrinsics, handles grouped-query attention |
| seed | seed | 8234.20 (us) | 8234.20 (us) | x1.000 | seed | seeded baseline winner |