loop.py is an agent for iterative implementation search.
At a high level, it does one thing: ask a model for a single function, test that function inside a real experiment harness, and keep the results that are actually worth keeping.
This project follows in the footsteps of karpathy/autoresearch.
The most important idea in this repo is memory.
Blind retry loops waste time. Models repeat compile mistakes, drift back to the same weak strategy, or keep exploring parts of the search space that have already been exhausted. This repo keeps a small amount of experiment-local memory so the search can accumulate judgment instead of just generating more text.
Each experiment keeps:
- `MEMORY.json` as structured state
- `MEMORY.md` as the prompt-facing version of that state
That memory is meant to answer three questions:
- what has worked
- what keeps failing
- what still looks unexplored
That is the main thing that makes the loop useful over longer runs.
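A minimal sketch of what that experiment-local memory could look like (the field names and file layout here are illustrative assumptions, not the repo's actual schema):

```python
import json
from pathlib import Path


def render_memory(memory: dict) -> str:
    """Render structured memory into the prompt-facing markdown summary.

    The three sections mirror the three questions the memory must answer.
    """
    lines = ["# Search memory", ""]
    for title, key in [("What has worked", "worked"),
                       ("What keeps failing", "failing"),
                       ("What looks unexplored", "unexplored")]:
        lines.append(f"## {title}")
        for item in memory.get(key, []):
            lines.append(f"- {item}")
        lines.append("")
    return "\n".join(lines)


def save_memory(exp_dir: Path, memory: dict) -> None:
    # MEMORY.json is the machine state; MEMORY.md is regenerated from it.
    (exp_dir / "MEMORY.json").write_text(json.dumps(memory, indent=2))
    (exp_dir / "MEMORY.md").write_text(render_memory(memory))
```

The point is that the markdown is always derived from the JSON, never edited independently.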
From the repo root:
```
python loop.py <experiment> --model <model> --loop <n>
```

On each iteration, the agent:
- loads the experiment contract
- builds the prompt from experiment-owned context
- calls the model for exactly one target function
- writes the candidate into the experiment folder
- runs the evaluator
- records what happened
- commits accepted winners
The split is intentional. The model proposes. The evaluator decides.
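The steps above can be sketched as one function. The callables and the `EvalResult` fields are hypothetical stand-ins for the repo's actual modules, but the shape of the split is the point: nothing the model says can accept a candidate.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    accepted: bool
    median_us: float
    notes: str = ""


def run_iteration(build_prompt, call_model, evaluate, record, commit_winner):
    """One iteration: the model proposes, the evaluator decides."""
    prompt = build_prompt()        # experiment-owned context
    response = call_model(prompt)  # exactly one target function
    result = evaluate(response)    # validation, scoring, acceptance
    record(prompt, response, result)  # full chain stored for inspection
    if result.accepted:
        commit_winner(result)      # only the evaluator reaches this branch
    return result
```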
Every experiment owns its own rules and its own enforcement:
```
<experiment>/
  EXPERIMENT.md
  experiment.json
  naive.*
  candidate.*
  prompt.py
  security.py
  eval.py
  seed.py
  SCORES.json
  SCORES.md
  MEMORY.json
  MEMORY.md
  winners/
  runs/
```
That keeps loop.py generic.
- `EXPERIMENT.md` defines the target function and the constraints
- `prompt.py` decides what context the model should see
- `security.py` rejects obviously bad candidates before compile
- `eval.py` owns validation, scoring, and acceptance rules
- `seed.py` prepares a fresh search state without touching repo history
The loop stays small. The experiment stays opinionated.
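Keeping `loop.py` generic presumably means importing the experiment's own modules by path rather than hardcoding them; a sketch under that assumption, using module names from the layout above:

```python
import importlib.util
from pathlib import Path


def load_experiment_module(exp_dir: Path, name: str):
    """Import <experiment>/<name>.py without requiring it on sys.path.

    This lets loop.py call prompt.py, security.py, eval.py, and seed.py
    from any experiment folder without knowing what they contain.
    """
    path = exp_dir / f"{name}.py"
    spec = importlib.util.spec_from_file_location(f"{exp_dir.name}.{name}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```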
The agent keeps more than just a leaderboard.
Each experiment stores its scores in SCORES.json and renders SCORES.md from scratch.
That gives you:
- a machine-friendly source of truth
- a readable scoreboard
- clean regeneration instead of incremental markdown editing
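Regeneration could look roughly like this. The row fields mirror the scoreboard tables below; the exact `SCORES.json` schema is an assumption:

```python
def render_scores(scores: dict) -> str:
    """Rebuild the scoreboard markdown from scratch on every update."""
    header = "| Timestamp | Model | Median | Seed | Win | Commit | Notes |"
    rule = "|---|---|---|---|---|---|---|"
    rows = []
    # Fastest candidates first; the win column is derived, never stored.
    for r in sorted(scores["rows"], key=lambda r: r["median_us"]):
        rows.append(
            f"| {r['timestamp']} | {r['model']} | {r['median_us']:.2f} (us) "
            f"| {r['seed_us']:.2f} (us) | x{r['seed_us'] / r['median_us']:.3f} "
            f"| {r['commit']} | {r['notes']} |"
        )
    return "\n".join([header, rule, *rows])
```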
Experiments can track:
- overall winners
- per-case winners
That matters because some candidates are not globally best but are still clearly better on a specific workload shape.
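Per-case tracking can be as simple as keeping the best score for each benchmark case alongside the overall best. A sketch, assuming lower median latency is better (the case names are illustrative):

```python
def update_winners(winners: dict, candidate: str, case_scores: dict) -> dict:
    """Track the overall winner and the per-case winner for each workload shape.

    `case_scores` maps case name -> median latency in microseconds.
    A candidate can lose overall yet still take a per-case slot.
    """
    total = sum(case_scores.values())
    if "overall" not in winners or total < winners["overall"][1]:
        winners["overall"] = (candidate, total)
    per_case = winners.setdefault("per_case", {})
    for case, score in case_scores.items():
        if case not in per_case or score < per_case[case][1]:
            per_case[case] = (candidate, score)
    return winners
```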
Each attempt is stored under runs/.
That includes:
- the rendered prompt
- the raw model response
- the extracted candidate
- the evaluation result
This makes the loop inspectable. If a candidate fails, the full chain of cause and effect is still there.
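Storing those four artifacts is mechanical; a sketch of what it could look like (the per-run file names are assumptions):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_run(exp_dir: Path, prompt: str, response: str,
               candidate: str, result: dict) -> Path:
    """Store the full chain of one attempt under runs/ for later inspection."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S-%f")
    run_dir = exp_dir / "runs" / stamp
    run_dir.mkdir(parents=True)
    (run_dir / "prompt.md").write_text(prompt)        # rendered prompt
    (run_dir / "response.md").write_text(response)    # raw model response
    (run_dir / "candidate.c").write_text(candidate)   # extracted candidate
    (run_dir / "result.json").write_text(json.dumps(result, indent=2))
    return run_dir
```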
Token usage is part of the experiment state, not an afterthought.
Each experiment tracks:
- prompt tokens
- completion tokens
- cached prompt tokens
- total tokens
- estimated cost
Those totals are written into SCORES.json and shown at the top of SCORES.md. If the search gets expensive, that should be visible directly in the experiment output.
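Accumulating those totals per call is straightforward; a sketch, with purely illustrative per-million-token prices (the real cost estimate depends on the model's actual pricing):

```python
def add_usage(totals: dict, usage: dict,
              price_per_mtok: tuple[float, float] = (1.0, 4.0)) -> dict:
    """Fold one API call's token usage into the experiment's running totals.

    price_per_mtok = ($ per 1M prompt tokens, $ per 1M completion tokens);
    the defaults here are made up for the example.
    """
    for key in ("prompt_tokens", "completion_tokens", "cached_prompt_tokens"):
        totals[key] = totals.get(key, 0) + usage.get(key, 0)
    totals["total_tokens"] = totals["prompt_tokens"] + totals["completion_tokens"]
    in_price, out_price = price_per_mtok
    totals["estimated_cost"] = (totals["prompt_tokens"] * in_price
                                + totals["completion_tokens"] * out_price) / 1_000_000
    return totals
```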
Accepted candidates are archived into winners/ by commit hash.
That gives you:
- a stable source snapshot for every accepted result
- score rows that point to real code
- the ability to reseed without losing discovered implementations
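Archiving by commit hash could be as simple as copying the candidate under its hash; a sketch, where `commit` would come from something like `git rev-parse --short HEAD` after the winning commit is made:

```python
import shutil
from pathlib import Path


def archive_winner(exp_dir: Path, candidate_path: Path, commit: str) -> Path:
    """Snapshot an accepted candidate under winners/, keyed by commit hash.

    The copy survives reseeding, so score rows keep pointing at real code.
    """
    dest = exp_dir / "winners" / f"{commit}{candidate_path.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(candidate_path, dest)
    return dest
```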
Each experiment provides its own seed.py.
Seeding resets the live state:
- candidate
- scores
- memory
- runs
Seeding does not reset:
- git history
- archived winners
That lets you start a new search from scratch while still keeping a record of what the experiment has already discovered.
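The reset/keep split above can be sketched in a few lines (the exact file list each experiment wipes is an assumption; the real `seed.py` is experiment-specific):

```python
import shutil
from pathlib import Path

# Live search state that a reseed wipes; winners/ and .git are never touched.
LIVE_STATE = ("SCORES.json", "SCORES.md", "MEMORY.json", "MEMORY.md", "runs")


def reseed(exp_dir: Path) -> None:
    """Reset the live search state without touching archived discoveries."""
    for name in LIVE_STATE:
        path = exp_dir / name
        if path.is_dir():
            shutil.rmtree(path)
        elif path.exists():
            path.unlink()
    for cand in exp_dir.glob("candidate.*"):
        cand.unlink()
    (exp_dir / "runs").mkdir()
```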
This repo currently has two active experiments:
- `linear`: pure C search for a faster dense linear-layer kernel
- `attention`: pure C search for a faster masked scaled-dot-product attention kernel
See linear/README.md and attention/README.md for the experiment-specific details.
Requirements:
- Python 3.12+
- `OPENAI_API_KEY`
- the toolchain required by the chosen experiment
Example:

```
python linear\seed.py
python loop.py linear --model gpt-5.4-2026-03-05 --loop 10
```

Attention example:

```
python attention\seed.py
python loop.py attention --model gpt-5.4-2026-03-05 --loop 10 --temperature 1
```

- current best median: 13.80 (us)
- seed baseline: 14.85 (us)
- cumulative usage: 98026 total tokens, $0.000000
| Timestamp | Model | Median | Seed | Win | Commit | Notes |
|---|---|---|---|---|---|---|
| 2026-03-13 03:53:39 | gpt-4.1-mini | 13.80 (us) | 14.85 (us) | x1.076 | d9efdac | accepted nonbest: unrolled the I loop, split bias and no-bias paths |
| 2026-03-13 03:56:13 | gpt-4.1-mini | 13.93 (us) | 14.85 (us) | x1.066 | c7819bd | accepted nonbest: split bias and no-bias paths |
| 2026-03-13 03:51:19 | gpt-4.1-mini | 13.99 (us) | 14.85 (us) | x1.061 | b58d896 | best winner: split bias and no-bias paths, used pointer row traversal |
| 2026-03-13 03:57:04 | gpt-4.1-mini | 14.04 (us) | 14.85 (us) | x1.058 | c4dc512 | accepted nonbest: split bias and no-bias paths |
| 2026-03-13 03:54:55 | gpt-4.1-mini | 14.28 (us) | 14.85 (us) | x1.040 | 9861a06 | accepted nonbest: split bias and no-bias paths, used pointer row traversal |
| 2026-03-13 03:56:39 | gpt-4.1-mini | 14.38 (us) | 14.85 (us) | x1.032 | f601bc9 | accepted nonbest: split bias and no-bias paths |
| seed | seed | 14.85 (us) | 14.85 (us) | x1.000 | seed | seeded baseline winner |
- current best overall full-suite median: 1577.90 (us)
- best `gpt2_bench_ctx32` specialist: 212.50 (us)
- seed baseline: 8234.20 (us)
- cumulative usage: 730996 total tokens, $4.77886550
- validation is adversarial and correctness-focused
- benchmarks are GPT-2-shaped causal self-attention at 32, 128, and 256 tokens
- the 1024 benchmark case is intentionally kept commented out in `attention/app.c`
| Timestamp | Model | Median | Seed | Win | Commit | Notes |
|---|---|---|---|---|---|---|
| 2026-03-13 07:07:49 | gpt-5.4-2026-03-05 | 1577.90 (us) | 8234.20 (us) | x5.218 | 8258684 | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:31:54 | gpt-5.4-2026-03-05 | 1537.60 (us) | 8234.20 (us) | x5.355 | 4269f7e | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:49:53 | gpt-5.4-2026-03-05 | 1645.40 (us) | 8234.20 (us) | x5.004 | b5146fe | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:57:34 | gpt-5.4-2026-03-05 | 1670.40 (us) | 8234.20 (us) | x4.929 | 95ba3a7 | accepted nonbest: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:46:05 | gpt-5.4-2026-03-05 | 1680.30 (us) | 8234.20 (us) | x4.900 | cbe1d86 | accepted nonbest: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:29:34 | gpt-5.4-2026-03-05 | 1982.30 (us) | 8234.20 (us) | x4.154 | 42cd3cd | best winner: used SIMD intrinsics, handles grouped-query attention |
| seed | seed | 8234.20 (us) | 8234.20 (us) | x1.000 | seed | seeded baseline winner |