Evaluation: prove it without pretending
OpenClawBrain needs a cleaner evidence story than "the simulations look good." The deterministic workflow-proof slice published at /proof/ makes the mechanism claim concrete, but product claims still need a stronger evidence ladder.
Proven now
full_brain reaches 0.9723 accuracy across 10 seeds, beating vector_rag and heuristic_stateful in 10/10 seeds and vector_rag_rerank in 9/10, at 5× lower context cost.
Needs head-to-head
Recorded-session comparisons, shadow traffic, and narrow online rollout before claiming better task outcomes on real OpenClaw workflows.
Not claimed
No blanket benchmark wins, no production latency or cost win, and no production superiority from deterministic artifacts alone.
Current workflow-proof slice
The current published slice is intentionally narrow: exact-target retrieval on 4 deterministic workflow queries. It is useful because it tests the product framing directly: can durable graph priors make OpenClawBrain useful before learning, and does background learning later improve the served route_fn?
| Mode | Exact-target success | What this isolates |
|---|---|---|
| vector_topk | 0/4 | Semantic retrieval floor with no workflow structure or learned routing. |
| pointer_chase | 1/4 | Simple deterministic traversal over explicit links without a learned route policy. |
| graph_prior_only | 2/4 | Durable structure before learning; the "start useful fast" part of the story. |
| learned | 4/4 | The same memory layer after background learning has improved routing. |
The learning curve matters as much as the endpoint. graph_prior_only stays flat at 2/4 across all 16 epochs, while learned first reaches 4/4 at epoch 14. The proof package now also points to per_query_matrix.csv and per_query_matrix.md, the 16-row scenario matrix showing which node IDs reached prompt context under each mode. Reproduce the figures from docs/reproduce-eval.md; compare the integration-level artifact bundle in docs/worked-example.md.
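As a hedged sketch of how the per-mode tallies could be recomputed from per_query_matrix.csv (the column names `mode`, `target_node`, and `retrieved_nodes`, the `;` separator, and the node IDs below are illustrative assumptions, not the published schema):

```python
import csv
import io

# Illustrative rows in the shape of per_query_matrix.csv; the real
# column names and node IDs are assumptions, not the file's schema.
SAMPLE = """mode,query_id,target_node,retrieved_nodes
vector_topk,q1,N7,N2;N9
pointer_chase,q1,N7,N7;N3
graph_prior_only,q1,N7,N7
learned,q1,N7,N7;N3
"""

def exact_target_success(csv_text):
    """Tally, per mode, how many queries put the exact target node
    into prompt context."""
    wins = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        hit = row["target_node"] in row["retrieved_nodes"].split(";")
        wins[row["mode"]] = wins.get(row["mode"], 0) + int(hit)
    return wins
```

With the full 16-row matrix in place of `SAMPLE`, the same tally should reproduce the 0/4 through 4/4 column of the table above.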
That is still deterministic harness evidence, not recorded-session or live-traffic proof.
10-seed ground-zero proof (March 2026)
The ground-zero harness scales the evidence beyond the narrow 4-query slice. It runs 800 queries per seed across 10 independent seeds, with relation drift injected mid-run, and scores every baseline on accuracy, staleness, and traversal cost.
| Baseline | Accuracy | Context used | Traversal cost |
|---|---|---|---|
full_brain | 0.9723 | 800 | 800 |
| vector_rag_rerank | 0.8901 | 4,000 | 6,400 |
vector_rag | 0.7966 | 800 | 800 |
heuristic_stateful | 0.7943 | 800 | 800 |
full_brain beats vector_rag and heuristic_stateful in 10/10 seeds, and vector_rag_rerank in 9/10. The nearest competitor (vector_rag_rerank) uses 5× the context and 8× the traversal cost to reach lower accuracy.
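The 10/10 and 9/10 win counts are plain per-seed comparisons. A minimal sketch of that tally, with made-up per-seed accuracies standing in for the real summary.csv values:

```python
def seed_wins(a, b):
    """Count seeds where baseline `a` scores strictly higher than `b`."""
    return sum(x > y for x, y in zip(a, b))

# Made-up per-seed accuracies for illustration only; the published
# 10-seed means are 0.9723 (full_brain) and 0.8901 (vector_rag_rerank).
full_brain = [0.98, 0.97, 0.96, 0.98, 0.97, 0.99, 0.96, 0.97, 0.98, 0.96]
rerank     = [0.90, 0.88, 0.97, 0.89, 0.87, 0.91, 0.86, 0.90, 0.88, 0.85]
```

On this fabricated data, `seed_wins(full_brain, rerank)` is 9: one seed (the third) goes the other way, which is the shape of the reported 9/10 result.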
Worked example: E048 :: E001
The ground-truth relation between entities E048 and E001 changes from manages to depends_on at step 22. This table shows how each baseline tracks the drift:
| Step | Truth | full_brain | heuristic_stateful | vector_rag | vector_rag_rerank |
|---|---|---|---|---|---|
| 13 | manages | manages (ok) | manages (ok) | owns (WRONG) | manages (ok) |
| 22 | depends_on | depends_on (ok) | manages (STALE) | manages (STALE) | manages (STALE) |
| 38 | depends_on | depends_on (ok) | manages (STALE) | owns (WRONG) | depends_on (ok) |
full_brain tracks the relation change at steps 22 and 38. heuristic_stateful stays stale. The vector baselines stay wrong or lag behind. This is bounded benchmark evidence—mechanism proof, not a claim of recorded-session or production superiority.
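One way to make the ok/STALE/WRONG labels in the table precise: a prediction matching the current truth is ok, one matching the pre-drift truth is STALE, and anything else is WRONG. A minimal sketch (the function name and signature are assumptions, not the harness's API):

```python
def drift_label(pred, truth, pre_drift_truth):
    """Label one prediction against a relation that drifted mid-run."""
    if pred == truth:
        return "ok"
    if pred == pre_drift_truth:
        return "STALE"  # still serving the pre-drift relation
    return "WRONG"      # neither the old nor the new relation

# Step 38 of the E048/E001 example: truth is depends_on, pre-drift manages.
drift_label("depends_on", "depends_on", "manages")  # full_brain: ok
drift_label("manages", "depends_on", "manages")     # heuristic_stateful: STALE
drift_label("owns", "depends_on", "manages")        # vector_rag: WRONG
```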
Mechanism proof versus product proof
This slice proves something real but limited: durable structure already helps on fixed workflow retrieval, and learned routing can improve it later through background learning, with no claim that the hot path retrains itself inline. It does not prove that operators complete real OpenClaw tasks faster, with fewer corrections, or at lower cost in production.
The stronger evidence path
| Stage | What it proves | What artifacts are required |
|---|---|---|
| Deterministic workflow proof | The router and learning loop behave coherently on the fixed workflow slice. | Exact commands, commit SHA, summary.csv, learning_curve.csv, report.md, per_query_matrix.csv, and per_query_matrix.md. |
| Offline recorded-session eval | The same OpenClaw workload can be scored head-to-head across routing modes. | Query set, scoring rubric, eval JSON, commit SHA, and sampled turn bundles in the worked-example format. |
| Shadow traffic | The brain helps on real OpenClaw traffic without being the only live path yet. | Mirrored traffic slice, side-by-side metrics, retained outputs, and disagreement traces. |
| Narrow online rollout | Operational value under real usage constraints. | Success, correction rate, prompt size, latency, cost. |
Reporting contract
- Every nontrivial claim needs a command, an artifact path, and a commit SHA.
- If an artifact does not exist yet, the table cell stays TBD.
- Do not summarize the 4-query deterministic workflow slice as a product win on OpenClaw traffic.
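The contract can be enforced mechanically. A minimal sketch, assuming a claim is recorded as a dict with these hypothetical keys (the command and paths below are placeholders, not real artifacts):

```python
REQUIRED = ("command", "artifact_path", "commit_sha")

def claim_is_backed(claim):
    """A claim counts as backed only when every required field is
    present and is not the TBD placeholder."""
    return all(claim.get(k) not in (None, "", "TBD") for k in REQUIRED)

# Placeholder claims for illustration; not real commands or paths.
backed = {
    "command": "python eval.py --slice workflow",
    "artifact_path": "proof/summary.csv",
    "commit_sha": "abc1234",
}
unbacked = {
    "command": "python eval.py --slice workflow",
    "artifact_path": "TBD",
    "commit_sha": "abc1234",
}
```

A check like this could run in CI over the claims table, forcing every TBD cell to stay visibly TBD until the artifact lands.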
Packaged view: /proof/, including the scenario-level matrix references. Rerun steps: docs/reproduce-eval.md. Smallest real-turn proof unit: docs/worked-example.md.