azret/loop.py

loop.py

loop.py is an agent for iterative implementation search.

At a high level, it does one thing: ask a model for a single function, test that function inside a real experiment harness, and keep the results that are actually worth keeping.

This project follows in the footsteps of karpathy/autoresearch.

Memory First

The most important idea in this repo is memory.

Blind retry loops waste time. Models repeat compile mistakes, drift back to the same weak strategy, or keep exploring parts of the search space that have already been exhausted. This repo keeps a small amount of experiment-local memory so the search can accumulate judgment instead of just generating more text.

Each experiment keeps:

  • MEMORY.json as structured state
  • MEMORY.md as the prompt-facing version of that state

That memory is meant to answer three questions:

  • what has worked
  • what keeps failing
  • what still looks unexplored

That is the main thing that makes the loop useful over longer runs.
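One plausible shape for MEMORY.json, organized around those three questions. Every field name here is an assumption for illustration, not the repo's actual schema:

```json
{
  "worked": [
    {"strategy": "split bias and no-bias paths", "best_median_us": 13.80}
  ],
  "keeps_failing": [
    {"pattern": "OpenMP pragmas", "reason": "rejected before compile"}
  ],
  "unexplored": ["cache blocking on the output rows"]
}
```

MEMORY.md would then be the prompt-facing rendering of this same state.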

How It Works

From the repo root:

python loop.py <experiment> --model <model> --loop <n>

On each iteration, the agent:

  1. loads the experiment contract
  2. builds the prompt from experiment-owned context
  3. calls the model for exactly one target function
  4. writes the candidate into the experiment folder
  5. runs the evaluator
  6. records what happened
  7. commits accepted winners

The split is intentional. The model proposes. The evaluator decides.
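The propose/evaluate split can be sketched as a single function over pluggable components. All names here are illustrative, not loop.py's actual API:

```python
# A minimal sketch of one iteration: the model proposes, the evaluator decides.
# Component names are hypothetical, not loop.py's real internals.

def run_iteration(build_prompt, call_model, evaluate, record):
    prompt = build_prompt()            # experiment-owned context (prompt.py)
    candidate = call_model(prompt)     # exactly one target function
    result = evaluate(candidate)       # validation, scoring, acceptance (eval.py)
    record(prompt, candidate, result)  # persisted as a run artifact
    return result

# Toy demo with stubbed components:
log = []
result = run_iteration(
    build_prompt=lambda: "write linear_f32",
    call_model=lambda p: "void linear_f32(...) { /* candidate */ }",
    evaluate=lambda c: {"accepted": True, "median_us": 13.80},
    record=lambda *args: log.append(args),
)
```

Because the evaluator is injected rather than baked in, the loop itself never needs to know what "good" means for a given experiment.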

Why The Repo Is Structured Per Experiment

Every experiment owns its own rules and its own enforcement:

<experiment>/
  EXPERIMENT.md
  experiment.json
  naive.*
  candidate.*
  prompt.py
  security.py
  eval.py
  seed.py
  SCORES.json
  SCORES.md
  MEMORY.json
  MEMORY.md
  winners/
  runs/

That keeps loop.py generic.

  • EXPERIMENT.md defines the target function and the constraints
  • prompt.py decides what context the model should see
  • security.py rejects obviously bad candidates before compile
  • eval.py owns validation, scoring, and acceptance rules
  • seed.py prepares a fresh search state without touching repo history

The loop stays small. The experiment stays opinionated.
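A hypothetical experiment.json for the layout above. The real schema is defined per experiment; these fields are guesses meant only to show the kind of contract the loop reads:

```json
{
  "name": "linear",
  "target_function": "linear_f32",
  "candidate_file": "candidate.c",
  "evaluator": "eval.py",
  "accept_if": "median_us < best_median_us"
}
```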

What Gets Tracked

The agent keeps more than just a leaderboard.

Scores

Each experiment stores its scores in SCORES.json and renders SCORES.md from scratch.

That gives you:

  • a machine-friendly source of truth
  • a readable scoreboard
  • clean regeneration instead of incremental markdown editing

Experiments can track:

  • overall winners
  • per-case winners

That matters because some candidates are not globally best but are still clearly better on a specific workload shape.
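Regenerating the scoreboard wholesale might look like the sketch below. Field names (`winners`, `median_us`, `commit`, `notes`) are assumptions, not the repo's actual SCORES.json schema:

```python
# Hypothetical renderer: rebuild SCORES.md from the JSON source of truth
# on every run instead of editing markdown incrementally.

def render_scores_md(scores):
    lines = ["| Median (us) | Commit | Notes |", "| --- | --- | --- |"]
    for row in sorted(scores["winners"], key=lambda r: r["median_us"]):
        lines.append(f"| {row['median_us']:.2f} | {row['commit']} | {row['notes']} |")
    return "\n".join(lines)

scoreboard = render_scores_md({
    "winners": [
        {"median_us": 14.85, "commit": "seed", "notes": "seeded baseline"},
        {"median_us": 13.80, "commit": "d9efdac", "notes": "unrolled the I loop"},
    ]
})
```

Full regeneration means the markdown can never drift out of sync with the JSON, because it is never the thing being edited.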

Run Artifacts

Each attempt is stored under runs/.

That includes:

  • the rendered prompt
  • the raw model response
  • the extracted candidate
  • the evaluation result

This makes the loop inspectable. If a candidate fails, the full chain of cause and effect is still there.
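A sketch of that per-attempt layout, with illustrative directory and file names (the real artifact names may differ):

```python
# Persist every attempt under runs/<n>/ so failures stay inspectable.
import json
import tempfile
from pathlib import Path

def save_run(exp_root, n, prompt, response, candidate, result):
    run_dir = Path(exp_root) / "runs" / f"{n:04d}"
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "prompt.md").write_text(prompt)       # rendered prompt
    (run_dir / "response.md").write_text(response)   # raw model response
    (run_dir / "candidate.c").write_text(candidate)  # extracted candidate
    (run_dir / "result.json").write_text(json.dumps(result))
    return run_dir

root = tempfile.mkdtemp()
run_dir = save_run(root, 1, "prompt", "response", "// code", {"accepted": False})
```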

Usage And Cost

Token usage is part of the experiment state, not an afterthought.

Each experiment tracks:

  • prompt tokens
  • completion tokens
  • cached prompt tokens
  • total tokens
  • estimated cost

Those totals are written into SCORES.json and shown at the top of SCORES.md. If the search gets expensive, that should be visible directly in the experiment output.
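The accounting reduces to simple arithmetic over those counters. The prices below are made-up per-million-token rates for illustration, not real model pricing, and the field names are assumptions:

```python
# Hypothetical cost estimate: cached prompt tokens are billed at a
# discounted rate, everything else at the standard input/output rates.

def estimated_cost(usage, prices):
    uncached = usage["prompt_tokens"] - usage["cached_prompt_tokens"]
    return (uncached * prices["input"]
            + usage["cached_prompt_tokens"] * prices["cached_input"]
            + usage["completion_tokens"] * prices["output"]) / 1_000_000

usage = {"prompt_tokens": 800_000, "cached_prompt_tokens": 200_000,
         "completion_tokens": 100_000}
cost = estimated_cost(usage, {"input": 2.0, "cached_input": 0.5, "output": 8.0})
total_tokens = usage["prompt_tokens"] + usage["completion_tokens"]
```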

Winner Archive

Accepted candidates are archived into winners/ by commit hash.

That gives you:

  • a stable source snapshot for every accepted result
  • score rows that point to real code
  • the ability to reseed without losing discovered implementations
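Archiving by commit hash can be sketched as a copy into winners/ keyed by the hash (names here are illustrative):

```python
# Snapshot an accepted candidate under winners/<commit-hash>.<ext> so score
# rows always point at real, immutable source.
import shutil
import tempfile
from pathlib import Path

def archive_winner(exp_root, candidate_path, commit_hash):
    candidate_path = Path(candidate_path)
    dest = Path(exp_root) / "winners" / f"{commit_hash}{candidate_path.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(candidate_path, dest)
    return dest

root = Path(tempfile.mkdtemp())
(root / "candidate.c").write_text("// accepted candidate")
dest = archive_winner(root, root / "candidate.c", "d9efdac")
```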

Resetting A Search

Each experiment provides its own seed.py.

Seeding resets the live state:

  • candidate
  • scores
  • memory
  • runs

Seeding does not reset:

  • git history
  • archived winners

That lets you start a new search from scratch while still keeping a record of what the experiment has already discovered.
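The reset/keep split can be sketched directly from the layout above. File names mirror the experiment folder listing; the actual seed.py for each experiment may do more:

```python
# Sketch of reseeding: wipe live search state, leave winners/ (and git
# history, which this function never touches) intact.
import shutil
import tempfile
from pathlib import Path

LIVE_STATE = ["candidate.c", "SCORES.json", "SCORES.md",
              "MEMORY.json", "MEMORY.md", "runs"]

def reseed(exp_root):
    exp_root = Path(exp_root)
    for name in LIVE_STATE:
        path = exp_root / name
        if path.is_dir():
            shutil.rmtree(path)
        elif path.exists():
            path.unlink()
    (exp_root / "runs").mkdir()  # fresh, empty run log

root = Path(tempfile.mkdtemp())
(root / "winners").mkdir()
(root / "winners" / "d9efdac.c").write_text("// kept")
(root / "MEMORY.json").write_text("{}")
reseed(root)
```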

Current Experiments

This repo currently has two active experiments:

  • linear: pure C search for a faster dense linear-layer kernel
  • attention: pure C search for a faster masked scaled-dot-product attention kernel

See linear/README.md and attention/README.md for the experiment-specific details.

Quick Start

Requirements:

  • Python 3.12+
  • OPENAI_API_KEY
  • the toolchain required by the chosen experiment

Example:

python linear/seed.py
python loop.py linear --model gpt-5.4-2026-03-05 --loop 10

Attention example:

python attention/seed.py
python loop.py attention --model gpt-5.4-2026-03-05 --loop 10 --temperature 1

Current Winners

linear

  • current best median: 13.80 (us)
  • seed baseline: 14.85 (us)
  • cumulative usage: 98026 total tokens, $0.000000
| Timestamp | Model | Median (us) | Seed (us) | Win | Commit | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 2026-03-13 03:53:39 | gpt-4.1-mini | 13.80 | 14.85 | x1.076 | d9efdac | accepted nonbest: unrolled the I loop, split bias and no-bias paths |
| 2026-03-13 03:56:13 | gpt-4.1-mini | 13.93 | 14.85 | x1.066 | c7819bd | accepted nonbest: split bias and no-bias paths |
| 2026-03-13 03:51:19 | gpt-4.1-mini | 13.99 | 14.85 | x1.061 | b58d896 | best winner: split bias and no-bias paths, used pointer row traversal |
| 2026-03-13 03:57:04 | gpt-4.1-mini | 14.04 | 14.85 | x1.058 | c4dc512 | accepted nonbest: split bias and no-bias paths |
| 2026-03-13 03:54:55 | gpt-4.1-mini | 14.28 | 14.85 | x1.040 | 9861a06 | accepted nonbest: split bias and no-bias paths, used pointer row traversal |
| 2026-03-13 03:56:39 | gpt-4.1-mini | 14.38 | 14.85 | x1.032 | f601bc9 | accepted nonbest: split bias and no-bias paths |
| seed | seed | 14.85 | 14.85 | x1.000 | seed | seeded baseline winner |

attention

  • current best overall full-suite median: 1577.90 (us)
  • best gpt2_bench_ctx32 specialist: 212.50 (us)
  • seed baseline: 8234.20 (us)
  • cumulative usage: 730996 total tokens, $4.77886550
  • validation is adversarial and correctness-focused
  • benchmarks are GPT-2-shaped causal self-attention at 32, 128, and 256 tokens
  • the 1024 benchmark case is intentionally kept commented out in attention/app.c
| Timestamp | Model | Median (us) | Seed (us) | Win | Commit | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 2026-03-13 07:07:49 | gpt-5.4-2026-03-05 | 1577.90 | 8234.20 | x5.218 | 8258684 | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:31:54 | gpt-5.4-2026-03-05 | 1537.60 | 8234.20 | x5.355 | 4269f7e | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:49:53 | gpt-5.4-2026-03-05 | 1645.40 | 8234.20 | x5.004 | b5146fe | best winner: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:57:34 | gpt-5.4-2026-03-05 | 1670.40 | 8234.20 | x4.929 | 95ba3a7 | accepted nonbest: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:46:05 | gpt-5.4-2026-03-05 | 1680.30 | 8234.20 | x4.900 | cbe1d86 | accepted nonbest: used SIMD intrinsics, handles grouped-query attention |
| 2026-03-13 06:29:34 | gpt-5.4-2026-03-05 | 1982.30 | 8234.20 | x4.154 | 42cd3cd | best winner: used SIMD intrinsics, handles grouped-query attention |
| seed | seed | 8234.20 | 8234.20 | x1.000 | seed | seeded baseline winner |
