Executable brain graphs

Create & evolve self-improving brain graphs.

Compose functions, scripts, model calls, evals, training jobs, mutators, and sub-brains into systems that run, measure, mutate, and improve.

Experiment loop

Define: goal
Run: trace
Measure: metrics
Improve: variant
Track: lineage

Executable graphs, from primitive functions to recursive brains.

A brain can contain tiny functions, scripts, APIs, model calls, datasets, scorers, verifiers, train steps, distillers, mutators, agents, and whole sub-brains.

Example graph

Nodes: Input, Router, Planner, Model A, Tools, Memory, Model B, Verifier, Eval, Dataset, Mutator
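A minimal sketch of what "executable" could mean for a graph like this one, written in plain Python rather than against any FrankenBrain API. The page lists node names only, so the edges, the handler functions, and the `run_graph` helper are all assumptions made for illustration. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+) to run each node after its dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical wiring: the page lists node names only, so these edges are assumed.
# Each key maps a node to the set of nodes it depends on.
DEPENDS_ON = {
    "router": {"input"},
    "planner": {"router"},
    "model_a": {"planner"},
    "verifier": {"model_a"},
    "eval": {"verifier"},
}


def run_graph(depends_on, handlers, payload):
    """Run each node once, in dependency order, and record a trace."""
    state = {"input": payload}
    trace = []
    for node in TopologicalSorter(depends_on).static_order():
        if node == "input":
            continue  # the input is supplied by the caller, not computed
        upstream = {dep: state[dep] for dep in depends_on[node]}
        state[node] = handlers[node](upstream)
        trace.append({"step": len(trace) + 1, "node": node})
    return state, trace


# Toy handlers standing in for real model calls, tools, and scorers.
HANDLERS = {
    "router": lambda up: "math" if any(c.isdigit() for c in up["input"]) else "chat",
    "planner": lambda up: f"plan for {up['router']} task",
    "model_a": lambda up: "4",
    "verifier": lambda up: up["model_a"].isdigit(),
    "eval": lambda up: 1.0 if up["verifier"] else 0.0,
}

state, trace = run_graph(DEPENDS_ON, HANDLERS, "What is 2 + 2?")
print(state["eval"], [t["node"] for t in trace])
```

Keeping the wiring as data and the behavior as handlers means a mutation can change either one without touching the runner.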

Integrate with your existing tooling.

FrankenBrain fits into existing frontier-lab workflows: JAX/PyTorch/TensorFlow code, accelerator-backed runs, eval harnesses, traces, artifacts, workflow orchestration, experiment tracking, inference stacks, and internal or open-source runners. Pull models, APIs, datasets, code, checkpoints, and artifacts from sources such as OpenAI, Anthropic, Gemini, Hugging Face, GitHub, Kaggle, S3/GCS/R2, W&B, MLflow, or internal stores.

JAX, PyTorch, TensorFlow, TPUs, GPUs, XLA, Kubernetes, containers, distributed training, inference optimization, eval harnesses, experiment tracking, workflow orchestration, artifact stores, trace logs, lineage, OpenAI, Anthropic, Gemini, Hugging Face, GitHub, Kaggle, S3/GCS/R2, Ray, Slurm, vLLM, TGI, W&B, MLflow, DVC, checkpoints, distillation data

Toy Demo

Load a small sample graph, simulate a measured run, suggest a mutation, compare variants, and inspect the exported artifact.

Experiment

GSM8K verifier loop

Goal: Improve GSM8K accuracy under $0.05/run
Dataset: gsm8k-mini.jsonl
Metric: accuracy / cost / latency
Current best: 72.4%
Run: #18: +3.1% from verifier routing

Variant            Acc.    Cost     Latency
v1 baseline        61.2%   $0.031   2.4s
v2 verifier loop   67.9%   $0.044   4.8s
v3 router+memory   72.4%   $0.049   5.1s
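The comparison step can be sketched as a constrained selection: keep the variants that satisfy the cost limit in the goal, then take the most accurate one. This is an illustrative sketch, not the demo's actual logic. The `Variant` class and `best_under_budget` function are hypothetical names, and the numbers are copied from the variant table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    accuracy: float   # fraction correct on the eval set
    cost_usd: float   # dollars per run
    latency_s: float  # seconds per run


# Numbers copied from the variant table in the toy demo.
VARIANTS = [
    Variant("v1 baseline", 0.612, 0.031, 2.4),
    Variant("v2 verifier loop", 0.679, 0.044, 4.8),
    Variant("v3 router+memory", 0.724, 0.049, 5.1),
]


def best_under_budget(variants, max_cost_usd):
    """Return the most accurate variant whose cost is strictly under budget, or None."""
    eligible = [v for v in variants if v.cost_usd < max_cost_usd]
    return max(eligible, key=lambda v: v.accuracy, default=None)


best = best_under_budget(VARIANTS, max_cost_usd=0.05)
print(best.name)
```

Treating cost as a hard constraint and accuracy as the objective matches the stated goal; a weighted score over accuracy, cost, and latency would be another reasonable choice.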

Mutation notes

Verifier moved after tool call
Memory disabled for arithmetic tasks

JSON artifact

{
  "graph": "gsm8k-verifier-loop",
  "nodes": ["router", "reasoner_a", "tool", "verifier"],
  "eval": "gsm8k-mini",
  "metric": "accuracy"
}

Experiments as ordinary artifacts.

A credible brain is more than a diagram. Its goal, graph, trace, metrics, diff, lineage, checkpoints, and distillation data should be inspectable as ordinary artifacts.

brain.schema.json Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "required": ["id", "entry", "nodes", "edges", "evals"],
  "properties": {
    "nodes": {
      "items": { "required": ["id", "type"] }
    },
    "edges": {
      "items": { "prefixItems": [
        { "type": "string" }, { "type": "string" }
      ]}
    }
  }
}

brain.json Runnable Graph

{
  "id": "compression-loop.v3-b",
  "goal": "lossless text compression",
  "entry": "planner",
  "nodes": ["planner", "candidate", "exact_decode", "score"],
  "edges": [["planner", "candidate"], ["candidate", "exact_decode"]],
  "evals": ["exact_decode"],
  "archive": "runs/compression-loop/run_0042"
}
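Because the graph is ordinary JSON, basic structural checks can run before anything executes. The sketch below assumes the required keys from the schema and adds three rules that are assumptions of this example, not documented behavior: the entry must be a listed node, every edge must be a pair of listed nodes, and every eval must name a listed node. The function name `check_brain` is hypothetical.

```python
REQUIRED_KEYS = ("id", "entry", "nodes", "edges", "evals")


def check_brain(brain):
    """Return a list of structural problems; an empty list means the checks passed."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in brain]
    if problems:
        return problems
    nodes = set(brain["nodes"])
    if brain["entry"] not in nodes:
        problems.append(f"entry {brain['entry']!r} is not a node")
    for edge in brain["edges"]:
        if len(edge) != 2:
            problems.append(f"edge {edge!r} is not a [source, target] pair")
            continue
        for endpoint in edge:
            if endpoint not in nodes:
                problems.append(f"edge endpoint {endpoint!r} is not a node")
    for name in brain["evals"]:
        if name not in nodes:
            problems.append(f"eval {name!r} is not a node")
    return problems


# Same shape as the runnable graph, with nodes given as plain names.
brain = {
    "id": "compression-loop.v3-b",
    "entry": "planner",
    "nodes": ["planner", "candidate", "exact_decode", "score"],
    "edges": [["planner", "candidate"], ["candidate", "exact_decode"]],
    "evals": ["exact_decode"],
}
print(check_brain(brain))
```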

run_0042.trace.jsonl Trace Output

{"step":1,"node":"planner","tokens":812}
{"step":2,"node":"candidate","file":"codec.py"}
{"step":3,"node":"exact_decode","passed":true}
{"step":4,"node":"score","bytes":18741}
{"step":5,"node":"archive","variant":"v3-b"}
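A trace in this form is one JSON object per line, so it can be inspected with nothing more than the standard library. The sketch below parses the five events shown here and pulls out the fields a reviewer would likely check first. The helper names `read_trace` and `summarize` are hypothetical.

```python
import json

# The trace lines from run_0042, embedded as text for a self-contained example.
TRACE_JSONL = """\
{"step":1,"node":"planner","tokens":812}
{"step":2,"node":"candidate","file":"codec.py"}
{"step":3,"node":"exact_decode","passed":true}
{"step":4,"node":"score","bytes":18741}
{"step":5,"node":"archive","variant":"v3-b"}
"""


def read_trace(text):
    """Parse JSONL text into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize(events):
    """Collect the step count, the decode result, and the compressed size."""
    by_node = {e["node"]: e for e in events}
    return {
        "steps": len(events),
        "decode_passed": by_node.get("exact_decode", {}).get("passed"),
        "bytes": by_node.get("score", {}).get("bytes"),
    }


events = read_trace(TRACE_JSONL)
print(summarize(events))
```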

eval/exact_decode Eval Score

Exact decode 100%

Compressed bytes 18,741

Runtime 2.8s

Regression 0 fail

run_0042.manifest.json Run Manifest

{
  "run_id": "run_0042",
  "command": "franken run brain.json",
  "started_at": "2026-05-02T14:31:08Z",
  "seed": 1842,
  "fixture_set": "canterbury+synthetic-v1",
  "git_sha": "9c3a1f7",
  "artifacts": ["trace", "eval", "diff", "lineage"]
}

variant.diff Mutation Diff

--- compression-loop.v3
+++ compression-loop.v3-b
@@ mutation
- planner.temperature: 0.70
+ planner.temperature: 0.35
- candidate.strategy: "dictionary-v2"
+ candidate.strategy: "bpe-hybrid"
+ exact_decode.cases: 64
+ score.penalty.runtime_ms: 3000
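One way to read a mutation diff like this is as a set of dotted-path overrides applied to a parent configuration. The sketch below assumes that reading; the nested-dict layout and the `apply_mutation` helper are illustrative and not a documented FrankenBrain format.

```python
import copy


def apply_mutation(config, changes):
    """Return a new config with dotted-path changes applied; the input is not modified."""
    variant = copy.deepcopy(config)
    for path, value in changes.items():
        *parents, leaf = path.split(".")
        target = variant
        for key in parents:
            # Create intermediate sections for keys the parent did not have.
            target = target.setdefault(key, {})
        target[leaf] = value
    return variant


# Base values taken from the "-" lines of the diff; new values from the "+" lines.
BASE = {
    "planner": {"temperature": 0.70},
    "candidate": {"strategy": "dictionary-v2"},
}
MUTATION = {
    "planner.temperature": 0.35,
    "candidate.strategy": "bpe-hybrid",
    "exact_decode.cases": 64,
    "score.penalty.runtime_ms": 3000,
}

variant = apply_mutation(BASE, MUTATION)
print(variant["score"])
```

Copying before mutating keeps the parent variant intact, which is what makes a later comparison against the baseline meaningful.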

lineage.json Lineage View

v1     baseline        21,904 bytes
v2     dictionary-v2   19,338 bytes
v3-b   bpe-hybrid      18,741 bytes
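Lineage becomes more useful when each variant is expressed relative to the baseline. A small sketch, using the byte counts from the lineage view; the tuple layout and the `savings_vs_baseline` helper are assumptions for illustration.

```python
# (variant, strategy, compressed bytes) copied from the lineage view.
LINEAGE = [
    ("v1", "baseline", 21904),
    ("v2", "dictionary-v2", 19338),
    ("v3-b", "bpe-hybrid", 18741),
]


def savings_vs_baseline(lineage):
    """Map each variant to its percentage size reduction relative to the first entry."""
    baseline_bytes = lineage[0][2]
    return {
        name: round(100 * (baseline_bytes - size) / baseline_bytes, 1)
        for name, _strategy, size in lineage
    }


print(savings_vs_baseline(LINEAGE))
```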

Composable building blocks.

Use tiny primitives when you need control, full sub-brains when you want reuse, and training or mutation nodes when the graph should improve its own parts.

Primitive to Recursive Nodes

A node can be a tiny function, shell command, parser, scorer, API call, model call, eval, training job, distiller, mutator, agent, or an entire brain packaged as one reusable node.

Script, Function, Model, Eval, Train, Sub-brain
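The "brain packaged as one reusable node" idea can be sketched with plain callables: if a node is any function from input to output, then a chain of nodes wrapped in a function is itself a node. This is a simplified illustration limited to linear chains; `Node` and `as_node` are hypothetical names, not part of any published SDK.

```python
from typing import Any, Callable, Sequence

# Assumption for this sketch: a node is any callable taking one value and returning one.
Node = Callable[[Any], Any]


def as_node(steps: Sequence[Node]) -> Node:
    """Package a linear chain of nodes as a single node with the same call shape."""
    def sub_brain(value: Any) -> Any:
        for step in steps:
            value = step(value)
        return value
    return sub_brain


# Primitive nodes: plain functions.
def strip(text: str) -> str:
    return text.strip()


def lower(text: str) -> str:
    return text.lower()


def count_words(text: str) -> int:
    return len(text.split())


# A sub-brain built from primitives, then reused inside a bigger chain.
normalize = as_node([strip, lower])
pipeline = as_node([normalize, count_words])

print(pipeline("  Executable Brain Graphs  "))
```

Because the packaged graph has the same call shape as a primitive, the outer chain does not need to know whether a step is a single function or a whole sub-brain.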

Fast Graph Setup

Start from a blank brain, template, or JSON file. Add primitive code, model, prompt, tool, memory, dataset, eval, verifier, training, mutator, API, agent, or sub-brain nodes without burying the experiment in glue code.

$ franken init research-brain
$ franken add node planner --model gpt
$ franken run brain.json

Goal-Driven Runs

Define what the graph is trying to improve: compressed size, exact decode, runtime, accuracy, cost, robustness, task difficulty, or any custom metric you can measure.

Manual or Meta Mutation

Edit the graph yourself, suggest directions to a meta-brain, or let mutation nodes generate, run, score, and select new variants under your constraints.
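The generate, run, score, and select cycle can be sketched as a keep-best search with a fixed budget. Everything here is a toy stand-in: `mutate` nudges a single parameter, `score` is an invented objective that peaks at a temperature of 0.35, and a real mutation node would run the graph and its evals instead.

```python
import random


def mutate(config, rng):
    """Propose a variant by nudging one parameter (a stand-in for a mutation node)."""
    variant = dict(config)
    nudged = config["temperature"] + rng.uniform(-0.2, 0.2)
    variant["temperature"] = min(1.0, max(0.0, nudged))  # constraint: stay in [0, 1]
    return variant


def score(config):
    """Toy objective, higher is better; a real run would execute the graph and its evals."""
    return -abs(config["temperature"] - 0.35)


def search(baseline, budget, seed=0):
    """Keep-best loop: generate, score, and select for a fixed number of tries."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    best, best_score = baseline, score(baseline)
    history = [best_score]
    for _ in range(budget):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, history


best, history = search({"temperature": 0.70}, budget=20)
print(round(best["temperature"], 2), len(history))
```

A candidate replaces the current best only when it scores strictly higher, so the best score never regresses below the baseline.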

Training and Distillation

Nodes can be tuned, specialized, swapped, cached, distilled, or trained during an experiment. Successful traces can become datasets for cheaper or sharper specialist models.
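Turning successful traces into a dataset can be as simple as filtering on the verifier result and writing one record per line. A minimal sketch, assuming run records with `input`, `output`, and `passed` fields; those field names and the prompt/completion layout are hypothetical, chosen only to make the idea concrete.

```python
import json


def to_distill_jsonl(runs):
    """Keep only runs that passed verification and emit one JSON object per line."""
    lines = [
        json.dumps({"prompt": r["input"], "completion": r["output"]}, sort_keys=True)
        for r in runs
        if r["passed"]
    ]
    return "\n".join(lines)


# Hypothetical run records; field names are illustrative only.
RUNS = [
    {"input": "2 + 2", "output": "4", "passed": True},
    {"input": "3 * 3", "output": "6", "passed": False},
    {"input": "10 / 2", "output": "5", "passed": True},
]

print(to_distill_jsonl(RUNS))
```

Filtering on the verifier keeps failed outputs out of the training data, which is the point of distilling from successful traces only.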

Graph as a Node

Collapse any working graph into a reusable sub-brain. Bigger graphs can call it, train around it, replace it, mutate it, or embed an improvement loop inside another system.

Metrics Beside the Architecture

Evals, judges, exactness checks, benchmarks, adversarial generators, regression suites, and score models are nodes in the graph. The measuring system stays next to the system being improved.

Verifier, Judge, Dataset, Score, Trace, Lineage

CLI / SDK First

The visual editor is for convenience. The same brain graphs should run headlessly through a CLI, Python SDK, API, local workers, Docker, Slurm, Ray, or your own runner.

For researchers, builders, and hobbyists.

FrankenBrain can sit on top of serious research stacks, but it should also let curious builders frankenbrain weird executable systems together and see what actually happens.

Research stacks

Keep the tooling you already use.

Connect existing models, evals, runners, datasets, checkpoints, logs, and artifact stores without forcing the experiment into a new framework.

Search loops

Let a goal drive iteration.

For a compression benchmark, define exact decode, compressed size, runtime, and regression metrics, then let the graph propose, test, score, archive, and mutate variants.
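For the compression case, the measuring side can be sketched with the standard library alone. The example below scores a candidate codec on exact decode, total compressed bytes, and runtime, using `zlib` as a stand-in candidate. The `evaluate_codec` function and the in-memory fixtures are assumptions for illustration, not the benchmark the page describes.

```python
import time
import zlib


def evaluate_codec(compress, decompress, fixtures):
    """Score a candidate codec on exact decode, total compressed size, and runtime."""
    started = time.perf_counter()
    total_bytes = 0
    failures = 0
    for original in fixtures:
        packed = compress(original)
        total_bytes += len(packed)
        if decompress(packed) != original:
            failures += 1  # a lossy round trip counts as a regression
    return {
        "exact_decode": failures == 0,
        "compressed_bytes": total_bytes,
        "runtime_s": time.perf_counter() - started,
        "failures": failures,
    }


# Tiny in-memory fixtures; a real benchmark would load files from a fixture set.
FIXTURES = [b"abracadabra " * 50, b"the quick brown fox " * 40, b""]

result = evaluate_codec(zlib.compress, zlib.decompress, FIXTURES)
print(result["exact_decode"], result["compressed_bytes"])
```

Exact decode is reported separately from size so that a variant cannot win on bytes by dropping data.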

Hobbyists and builders

Frankenbrain ideas together.

Start with primitive functions, scripts, model calls, tools, memory, and small evals. Build by hand, ask for mutation ideas, or let a loop keep trying while you inspect results.

Get started

Start with a narrow, measurable loop. Define the goal, wire the runnable graph, run it locally, inspect the trace, then let mutations compete against the baseline.

$ franken init compression-loop --template hutter-search
$ franken add metric compressed-bytes --minimize
$ franken add verifier exact-decode --required
$ franken run brain.json --runner local --trace runs/001.jsonl
$ franken compare runs/baseline runs/001
$ franken mutate brain.json --agent meta --budget 20 --keep-best
$ franken export-distill runs/best --format jsonl
