Graph-native autoresearch workbench

Turn runnable AI systems into measurable, improvable brains.

Import a working context, expose the objective, run variants, compare evidence, and promote winners without losing traces, artifacts, or lineage.

Autoresearch loop

Import context
Run trace
Evaluate metrics
Promote winner
Preserve lineage

Composable brains, from one function to whole systems.

A brain can contain code, model calls, tools, data, evals, controllers, artifacts, services, workflows, and reusable sub-brains.

Example graph

Input
Router
Planner
Model A
Tools
Memory
Model B
Verifier
Eval
Dataset
Proposer
Archive

Works beside your stack

Bring the systems you already run.

FrankenBrain sits beside existing research infrastructure. It turns scattered code, data, evals, services, runners, and logs into an executable view with enough structure to inspect, compare, improve, and preserve evidence.

model runtimes, training frameworks, tensor compilers, GPU clusters, accelerators, containers, schedulers, local runners, remote runners, inference optimization, eval harnesses, benchmark adapters, model providers, inference servers, workflow launchers, artifact stores, trace logs, lineage, code repos, datasets, checkpoints, experiment tracking, metric parsers, sandboxed runs, training data

Executable context import

Bring the system you already run. Turn it into an editable brain.

Import repos, folders, scripts, evals, configs, services, datasets, and run commands into a reviewable capsule. Public material stays outcome-level; private runbooks and recipes stay gated.

Existing context: Repos, scripts, configs, datasets, eval harnesses, services, run commands, logs, and artifacts.
Executable brain: Inputs, objectives, metrics, traces, artifacts, and failure modes become inspectable instead of scattered across local scripts and notes.
Improvement loop: Compare bounded variants against the same measurement surface and preserve the evidence behind promoted results.

01

Wrap without rewriting

Keep your code, launchers, trainer scripts, benchmark runners, and run commands. FrankenBrain adds a programmable control plane around them.

02

Expose the measurement surface

Make the objective, inputs, metrics, constraints, artifacts, and promotion gate explicit so every candidate is judged the same way.
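
As a minimal sketch of what an explicit measurement surface could look like, the Python below bundles an objective, a direction, hard constraints, and a promotion gate into one object. The class name, field names, and sample numbers are assumptions made for illustration; they are not FrankenBrain's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MeasurementSurface:
    """Hypothetical container; names are illustrative, not a real FrankenBrain type."""

    objective: str                                    # metric to optimize, e.g. "score"
    direction: str                                    # "max" or "min"
    constraints: dict = field(default_factory=dict)   # metric name -> upper bound
    min_improvement: float = 0.0                      # required gain over the baseline

    def passes_gate(self, baseline: dict, candidate: dict) -> bool:
        """True only if the candidate meets every constraint and beats the baseline."""
        for name, upper in self.constraints.items():
            if candidate[name] > upper:
                return False
        delta = candidate[self.objective] - baseline[self.objective]
        if self.direction == "min":
            delta = -delta  # for "min" objectives, a lower value is an improvement
        return delta > self.min_improvement


surface = MeasurementSurface(
    objective="score",
    direction="max",
    constraints={"runtime_ms": 3000},
    min_improvement=0.01,
)
baseline = {"score": 0.691, "runtime_ms": 2900}   # illustrative numbers
candidate = {"score": 0.742, "runtime_ms": 2800}  # illustrative numbers
print(surface.passes_gate(baseline, candidate))
```

Because every candidate goes through the same `passes_gate` call, a faster but lower-scoring variant, or a higher-scoring variant that breaks the runtime bound, is rejected by the same rule.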

03

Run, rank, promote

Start locally, then move proven work to configured runners once each target is connected and has passed its runner proof.

$ franken import-workflow ./research-stack
$ franken run brain.json
$ franken evidence runs/latest
$ franken compare runs/baseline runs/candidate

The result is an executable, editable brain that can be opened, evaluated, compared, and reused inside larger brains.

Autoresearch loop

Run an interactive proof sketch: inputs on the left, an executable brain graph in the middle, eval output on the right, and trace/artifact updates as the graph fires.

  1. scan
  2. run
  3. tool
  4. eval
  5. promote
Load a preset, run the sketch, suggest a variant, then compare results.

Autoresearch control plane

Reusable strategy programs, typed AI control, and replayable evidence cards make the loop usable by people and AI workers through the same brain package.

Strategy program stack

Select and reuse autoresearch programs such as a discrete optimizer, evidence-grounded council, obstruction hunter, scale ladder, and auto-autoresearch loop. The selected stack belongs in the brain package evidence bucket.

Discrete optimizer, Evidence council, Obstruction hunter, Auto-autoresearch

MCP-ready AI control

The workbench operations are exposed as typed AI-worker controls: inspect, patch, validate, run, compare, promote, and replay brains.

Replayable evidence cards

Public summaries should show what was measured, the fixed budget, the promotion boundary, and the replay shape while keeping private run ids, recipes, benchmark packs, and target details gated by default.

Target class: imported runnable system
Metric: explicit objective and direction
Budget: fixed variants / cost / runtime
Boundary: exploratory, private, or benchmark-comparable
Evidence: graph, run, artifacts, lineage, replay command
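
One way to picture the public/private split is a filter that drops gated fields before a card is shared. This is only a sketch: the key names in `GATED_KEYS`, the card layout, and the sample values are assumptions, since the page names the gated categories but not their keys.

```python
# Assumed key names for gated fields; the source lists categories, not keys.
GATED_KEYS = frozenset({"run_id", "recipe", "benchmark_pack", "target_details"})


def public_summary(card: dict) -> dict:
    """Return a copy of the evidence card without gated fields."""
    return {key: value for key, value in card.items() if key not in GATED_KEYS}


card = {
    "target_class": "imported runnable system",
    "metric": {"name": "score", "direction": "max"},
    "budget": {"variants": 8},            # illustrative value
    "boundary": "exploratory",
    "evidence": ["graph", "run", "artifacts", "lineage", "replay command"],
    "run_id": "run_0042",                 # gated by default
    "recipe": "private-recipe-name",      # gated by default, placeholder value
}

summary = public_summary(card)
print(sorted(summary))
```

The original card is left untouched, so the private view and the public view can be produced from the same record.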

Replayable evidence

A credible brain carries its goal, graph, trace, metrics, diff, lineage, checkpoints, runner proof, and training data as inspectable artifacts.

brain.schema.json Schema
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "required": ["id", "entry", "nodes", "edges", "evals"],
  "properties": {
    "nodes": {
      "items": { "required": ["id", "type"] }
    },
    "edges": {
      "items": { "prefixItems": [
        { "type": "string" }, { "type": "string" }
      ]}
    }
  }
}
brain.json Runnable Graph
{
  "id": "candidate-system.v3-b",
  "goal": "improve task score under fixed constraints",
  "entry": "planner",
  "nodes": ["planner", "candidate", "verifier", "scorer"],
  "edges": [["planner", "candidate"], ["candidate", "verifier"]],
  "evals": ["verifier"],
  "archive": "runs/task-agent/run_0042"
}
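
A graph file like this is easy to sanity-check before it runs. The sketch below, in plain Python, checks the required keys from the schema and confirms that the entry point, edges, and evals only name declared nodes. The function name `check_brain` is hypothetical and this is not how the `franken` CLI validates files.

```python
REQUIRED_KEYS = ("id", "entry", "nodes", "edges", "evals")


def check_brain(brain: dict) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in brain]
    if problems:
        return problems

    nodes = set(brain["nodes"])
    if brain["entry"] not in nodes:
        problems.append(f"entry {brain['entry']!r} is not a declared node")
    for source, target in brain["edges"]:
        for name in (source, target):
            if name not in nodes:
                problems.append(f"edge references unknown node {name!r}")
    for name in brain["evals"]:
        if name not in nodes:
            problems.append(f"eval {name!r} is not a declared node")
    return problems


# Same shape as the brain.json sample on this page.
brain = {
    "id": "candidate-system.v3-b",
    "entry": "planner",
    "nodes": ["planner", "candidate", "verifier", "scorer"],
    "edges": [["planner", "candidate"], ["candidate", "verifier"]],
    "evals": ["verifier"],
}
print(check_brain(brain))
```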
run_0042.trace.jsonl Trace Output
{"step":1,"node":"planner","event":"plan"}
{"step":2,"node":"candidate","file":"policy.py"}
{"step":3,"node":"verifier","passed":true}
{"step":4,"node":"scorer","score":0.742}
{"step":5,"node":"archive","variant":"v3-b"}
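
Since the trace is one JSON object per line, it can be read with the standard library alone. The following sketch parses the sample trace and pulls out the node order and the last recorded score; the helper names `read_trace` and `final_score` are illustrative assumptions.

```python
import json

# The sample trace from this page, one JSON object per line.
TRACE = """\
{"step":1,"node":"planner","event":"plan"}
{"step":2,"node":"candidate","file":"policy.py"}
{"step":3,"node":"verifier","passed":true}
{"step":4,"node":"scorer","score":0.742}
{"step":5,"node":"archive","variant":"v3-b"}
"""


def read_trace(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def final_score(events: list[dict]):
    """Return the last score recorded in the trace, or None if there is none."""
    scores = [event["score"] for event in events if "score" in event]
    return scores[-1] if scores else None


events = read_trace(TRACE)
print([event["node"] for event in events])
print(final_score(events))
```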
eval/check Eval Score
Score lift +12.4 pp
Cost -18%
Runtime 2.8s
Regression 0 fail
run_0042.manifest.json Run Manifest
{
  "run_id": "run_0042",
  "command": "franken run brain.json",
  "started_at": "2026-05-02T14:31:08Z",
  "seed": 1842,
  "fixture_set": "holdout-suite-v1",
  "git_sha": "9c3a1f7",
  "artifacts": ["trace", "eval", "diff", "lineage"]
}
variant.diff Mutation Diff
--- candidate-system.v3
+++ candidate-system.v3-b
@@ variant
- planner.temperature: 0.70
+ planner.temperature: 0.35
- candidate.strategy: "single-pass"
+ candidate.strategy: "verify-then-revise"
+ eval_cases: 64
+ score.penalty.runtime_ms: 3000
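
A diff in this dotted-key style can be treated as a set of overrides on a nested config. The sketch below applies such overrides to a copy of the parent config so the parent stays available for comparison. The function `apply_variant` and the nested config layout are assumptions for illustration, not FrankenBrain's mutation format.

```python
import copy


def apply_variant(base: dict, overrides: dict) -> dict:
    """Return a new config with dotted-key overrides applied; base is not mutated."""
    result = copy.deepcopy(base)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        cursor = result
        for part in parents:
            cursor = cursor.setdefault(part, {})  # create missing sections on the way down
        cursor[leaf] = value
    return result


parent = {
    "planner": {"temperature": 0.70},
    "candidate": {"strategy": "single-pass"},
}
overrides = {
    "planner.temperature": 0.35,
    "candidate.strategy": "verify-then-revise",
    "eval_cases": 64,
    "score.penalty.runtime_ms": 3000,
}

variant = apply_variant(parent, overrides)
print(variant["planner"]["temperature"], parent["planner"]["temperature"])
```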
lineage.json Lineage View
  1. v1: baseline, 0.618 score
  2. v2: verifier loop, 0.691 score
  3. v3-b: revision loop, 0.742 score
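
The lineage scores can be turned into lifts with simple arithmetic. Assuming the scores are fractions between 0 and 1 and that lift is reported in percentage points, the sketch below reproduces the +12.4 pp figure as the gain of v3-b over the v1 baseline. The helper name `lift_pp` is hypothetical.

```python
# (variant, change, score) rows from the lineage view on this page.
lineage = [
    ("v1", "baseline", 0.618),
    ("v2", "verifier loop", 0.691),
    ("v3-b", "revision loop", 0.742),
]


def lift_pp(before: float, after: float) -> float:
    """Score lift in percentage points, assuming scores are fractions in [0, 1]."""
    return round((after - before) * 100, 1)


baseline_score = lineage[0][2]
for name, change, score in lineage[1:]:
    print(f"{name} ({change}): +{lift_pp(baseline_score, score)} pp vs v1")
```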

Core product surfaces.

Use tiny primitives when you need control, full sub-brains when you want reuse, and controlled improvement nodes when the graph should revise its own parts.

Recursive Brain Graphs

A node can be a function, service, model call, eval, data source, training job, controller, reusable sub-brain, or another workbench-controlled system.

Script, Function, Model, Eval, Train, Sub-brain

Executable Context Import

Start from a repo, folder, script, template, or JSON file. Add models, tools, data, evals, controllers, APIs, services, and sub-brains without burying the experiment in glue code.

$ franken init research-brain
$ franken add node planner --runtime local-or-remote
$ franken run brain.json

Objective-Driven Runs

Define what the graph is trying to improve: accuracy, runtime, cost, robustness, task difficulty, or any custom metric you can measure.

Variant Generation

Edit the graph yourself or let controlled improvement nodes propose, run, score, and select variants under explicit constraints.
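
The propose, run, score, and select cycle can be sketched as a small bounded loop. Everything here is a toy under stated assumptions: `run_variants`, the proposer, and the scoring function are hypothetical stand-ins, and a real evaluator would execute the graph against a fixed eval set rather than compute a formula.

```python
from typing import Callable


def run_variants(
    baseline: dict,
    propose: Callable[[dict], list],
    evaluate: Callable[[dict], float],
    budget: int,
):
    """Score at most `budget` proposed variants and keep the best one seen."""
    best_config, best_score = baseline, evaluate(baseline)
    history = [(baseline, best_score)]
    for candidate in propose(baseline)[:budget]:   # explicit constraint: fixed budget
        score = evaluate(candidate)                # same evaluator for every candidate
        history.append((candidate, score))
        if score > best_score:
            best_config, best_score = candidate, score
    return best_config, best_score, history


def propose_temperatures(config: dict) -> list:
    """Toy proposer: vary one parameter and leave everything else unchanged."""
    return [dict(config, temperature=t) for t in (0.2, 0.35, 0.5, 0.9)]


def toy_score(config: dict) -> float:
    """Toy objective that peaks at temperature 0.35; stands in for a real eval run."""
    return 1.0 - abs(config["temperature"] - 0.35)


best, score, history = run_variants(
    {"temperature": 0.7}, propose_temperatures, toy_score, budget=3
)
print(best, score, len(history))
```

The returned history keeps every scored candidate, including the losers, which is the raw material for a lineage record.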

Training and Distillation

Winning traces can become training and review artifacts. Promote reusable systems or model artifacts only after measurement gates pass.

Anything as a Node

A node can be a function, job, remote service, simulator, workflow, reusable sub-brain, or another running FrankenBrain workbench. Collapse, call, compare, or zoom into it.

Run Boards Beside the Architecture

Every run can keep metric cards, candidate rankings, raw details, artifacts, diffs, and promotion actions next to the system being improved.

Verifier, Judge, Dataset, Score, Run Board, Lineage

CLI / SDK First

The visual editor is for convenience. The same brains should run headlessly through a CLI, SDK, API, local workers, and configured runners with explicit proof status.

For AI systems researchers and lab builders.

FrankenBrain can sit on top of serious research stacks while making executable systems easier to inspect, measure, compare, and reuse.

Research stacks

Bring existing work into a graph-native loop.

Bring existing code, evals, runners, datasets, checkpoints, logs, and artifact stores into one reviewable experiment view.

Autoresearch loops

Let evals drive iteration.

Define the measurement surface and promotion gate; then compare bounded variants and keep the lineage behind the result.

Model creation

Train from what worked.

Turn successful traces into reusable training or review artifacts, then promote only after held-out measurement passes.

Private technical demo

Request access to the full workbench.

FrankenBrain is available by technical demo for labs, research teams, and serious AI systems builders. To request access, share your stack, your goal, and the type of conversation you want.

Executable context import, Run boards and evals, Autoresearch loops, Nested brains, Lineage and evidence