Post 3

Evaluation: prove it without pretending

March 2026 | v12.2.6+ series

OpenClawBrain needs a cleaner evidence story than "the simulations look good." The current deterministic workflow-proof slice on /proof/ makes the mechanism proof concrete, but product claims still need a stronger evidence ladder.

Proven now

full_brain reaches 0.9723 accuracy across 10 seeds, beating vector_rag and heuristic_stateful in 10/10 seeds and vector_rag_rerank in 9/10, at 5× lower context cost.

Needs head-to-head

Recorded-session comparisons, shadow traffic, and narrow online rollout before claiming better task outcomes on real OpenClaw workflows.

Not claimed

No blanket benchmark wins, no production latency or cost win, and no production superiority from deterministic artifacts alone.

Current workflow-proof slice

The current published slice is intentionally narrow: exact-target retrieval on 4 deterministic workflow queries. It is useful because it tests the product framing directly: can durable graph priors make OpenClawBrain useful before learning, and does background learning later improve the served route_fn?

| Mode | Exact-target success | What this isolates |
| --- | --- | --- |
| vector_topk | 0/4 | Semantic retrieval floor with no workflow structure or learned routing. |
| pointer_chase | 1/4 | Simple deterministic traversal over explicit links without a learned route policy. |
| graph_prior_only | 2/4 | Durable structure before learning; the "start useful fast" part of the story. |
| learned | 4/4 | The same memory layer after background learning has improved routing. |
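
The scoring rule behind these fractions can be sketched in a few lines. This is a hypothetical reconstruction, not the harness code: a query counts as a success only if the exact ground-truth node ID reaches the prompt context under that mode.

```python
# Hypothetical sketch of the exact-target scoring rule assumed by the
# workflow-proof slice. Node IDs below are illustrative, not real.

def exact_target_success(retrieved_ids, target_id):
    """A query succeeds iff the exact target node made it into context."""
    return target_id in retrieved_ids

def score_mode(runs):
    """runs: list of (retrieved_ids, target_id) pairs, one per query."""
    hits = sum(exact_target_success(r, t) for r, t in runs)
    return f"{hits}/{len(runs)}"

# Illustrative runs matching the graph_prior_only row (2/4):
graph_prior_runs = [
    (["n1", "n7"], "n1"),   # hit
    (["n3"], "n9"),         # miss
    (["n4", "n2"], "n2"),   # hit
    (["n5"], "n8"),         # miss
]
print(score_mode(graph_prior_runs))  # → 2/4
```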

The learning curve matters as much as the endpoint. graph_prior_only stays flat at 2/4 across all 16 epochs, while learned first reaches 4/4 at epoch 14. The proof package now also points to per_query_matrix.csv and per_query_matrix.md, the 16-row scenario matrix showing which node IDs reached prompt context under each mode. Reproduce the figures from docs/reproduce-eval.md; compare the integration-level artifact bundle in docs/worked-example.md.
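
One way to check the epoch-14 claim against the published artifact is to scan learning_curve.csv for the first epoch where the learned mode hits 4/4. The column names here are assumptions for illustration; verify them against the real CSV header and report.md before relying on this.

```python
# Minimal sketch: find the first epoch at which a mode reaches full
# exact-target success in a learning-curve CSV. Column names are assumed.
import csv, io

SAMPLE_CSV = """epoch,mode,exact_target_hits
13,learned,3
14,graph_prior_only,2
14,learned,4
15,learned,4
"""

def first_full_success(csv_text, mode="learned", total=4):
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["mode"] == mode and int(row["exact_target_hits"]) == total:
            return int(row["epoch"])
    return None  # never reached full success

print(first_full_success(SAMPLE_CSV))  # → 14
```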

[Figure: bar chart of exact-target success across the deterministic workflow-proof modes: vector_topk 0/4, pointer_chase 1/4, graph_prior_only 2/4, learned 4/4.]
Current mechanism proof, now tied to the published workflow slice on /proof/: the same 4 fixed queries, the same scoring rule, and a visible gap between cold-start retrieval, graph priors, and the learned policy.

[Figure: line chart of exact-target success over epochs: graph_prior_only stays at 50% while learned routing reaches 100% at epoch 14.]
Useful fast with graph priors, better later via background learning: graph_prior_only stays at 2/4 for all 16 epochs, while learned reaches 4/4 at epoch 14. That is still deterministic harness evidence, not recorded-session or live-traffic proof.

10-seed ground-zero proof (March 2026)

The ground-zero harness scales the evidence beyond the narrow 4-query slice. It runs 800 queries per seed across 10 independent seeds, with relation drift injected mid-run, and scores every baseline on accuracy, staleness, and traversal cost.

| Baseline | Accuracy | Context used | Traversal cost |
| --- | --- | --- | --- |
| full_brain | 0.9723 | 800 | 800 |
| vector_rag_rerank | 0.8901 | 4 000 | 6 400 |
| vector_rag | 0.7966 | 800 | 800 |
| heuristic_stateful | 0.7943 | 800 | 800 |

full_brain beats vector_rag and heuristic_stateful in 10/10 seeds, and vector_rag_rerank in 9/10. The nearest competitor (vector_rag_rerank) uses 5× the context and 8× the traversal cost to reach lower accuracy.
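
The cost multiples follow directly from the table (4 000 / 800 and 6 400 / 800), and the seed-win counts are a simple per-seed comparison. A minimal check, with seed_wins shown on illustrative per-seed accuracies rather than the real harness numbers:

```python
# Verify the cost ratios quoted from the ground-zero table.
full_brain = {"context": 800, "traversal": 800}
rerank     = {"context": 4000, "traversal": 6400}

assert rerank["context"] // full_brain["context"] == 5      # 5x context
assert rerank["traversal"] // full_brain["traversal"] == 8  # 8x traversal

def seed_wins(a_scores, b_scores):
    """Count seeds on which baseline a strictly beats baseline b."""
    return sum(a > b for a, b in zip(a_scores, b_scores))

# Illustrative per-seed accuracies (NOT the real 10-seed numbers):
a = [0.97, 0.98, 0.96]
b = [0.89, 0.90, 0.97]
print(seed_wins(a, b))  # → 2
```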

Worked example: E048 :: E001

The ground-truth relation between entities E048 and E001 changes from manages to depends_on at step 22. This table shows how each baseline tracks the drift:

| Step | Truth | full_brain | heuristic_stateful | vector_rag | vector_rag_rerank |
| --- | --- | --- | --- | --- | --- |
| 13 | manages | manages (ok) | manages (ok) | owns (WRONG) | manages (ok) |
| 22 | depends_on | depends_on (ok) | manages (STALE) | manages (STALE) | manages (STALE) |
| 38 | depends_on | depends_on (ok) | manages (STALE) | owns (WRONG) | depends_on (ok) |

full_brain tracks the relation change at steps 22 and 38. heuristic_stateful stays stale. The vector baselines stay wrong or lag behind. This is bounded benchmark evidence—mechanism proof, not a claim of recorded-session or production superiority.
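
The ok/STALE/WRONG labels imply a three-way grading rule: a prediction is ok if it matches the current truth, STALE if it still matches the superseded pre-drift relation, and WRONG otherwise. A hedged sketch of that rule, assuming the drift step and relation names from the worked example:

```python
# Hypothetical reconstruction of the staleness grading in the worked
# example: E048 :: E001 flips from "manages" to "depends_on" at step 22.

def truth_at(step, drift_step=22, before="manages", after="depends_on"):
    """Ground-truth relation at a given step."""
    return after if step >= drift_step else before

def grade(step, predicted):
    if predicted == truth_at(step):
        return "ok"
    if predicted == truth_at(0):  # still serving the pre-drift relation
        return "STALE"
    return "WRONG"

print(grade(22, "depends_on"))  # → ok     (full_brain at step 22)
print(grade(22, "manages"))     # → STALE  (the lagging baselines)
print(grade(13, "owns"))        # → WRONG  (vector_rag at step 13)
```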

Mechanism proof versus product proof

This slice proves something real but limited: durable structure already helps on fixed workflow retrieval, and learned routing can improve later without claiming that the hot path retrains itself inline. It does not prove that operators complete real OpenClaw tasks faster, with fewer corrections, or at lower cost in production.

The stronger evidence path

| Stage | What it proves | Required artifacts |
| --- | --- | --- |
| Deterministic workflow proof | The router and learning loop behave coherently on the fixed workflow slice. | Exact commands, commit SHA, summary.csv, learning_curve.csv, report.md, per_query_matrix.csv, and per_query_matrix.md. |
| Offline recorded-session eval | The same OpenClaw workload can be scored head-to-head across routing modes. | Query set, scoring rubric, eval JSON, commit SHA, and sampled turn bundles in the worked-example format. |
| Shadow traffic | The brain helps on real OpenClaw traffic without being the only live path yet. | Mirrored traffic slice, side-by-side metrics, retained outputs, and disagreement traces. |
| Narrow online rollout | Operational value under real usage constraints. | Success rate, correction rate, prompt size, latency, and cost. |

Reporting contract

  1. Every nontrivial claim needs a command, an artifact path, and a commit SHA.
  2. If an artifact does not exist yet, the table cell stays TBD.
  3. Do not summarize the 4-query deterministic workflow slice as a product win on OpenClaw traffic.
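
The contract above can be enforced mechanically. A minimal lint sketch, with field names that are assumptions for illustration: a claim renders only when it carries a command, an artifact path, and a commit SHA, and falls back to TBD otherwise.

```python
# Hypothetical reporting-contract lint: incomplete claims render as TBD.
REQUIRED = ("command", "artifact_path", "commit_sha")

def render_claim(claim):
    if not all(claim.get(k) for k in REQUIRED):
        return "TBD"
    return f"{claim['text']} ({claim['artifact_path']} @ {claim['commit_sha'][:7]})"

# A claim with no backing artifacts stays TBD:
print(render_claim({"text": "full_brain 0.9723 accuracy"}))  # → TBD

# A fully backed claim renders with its evidence trail
# (command and SHA below are placeholders, not real values):
print(render_claim({
    "text": "full_brain 0.9723 accuracy",
    "command": "make ground-zero",
    "artifact_path": "summary.csv",
    "commit_sha": "abc1234def",
}))  # → full_brain 0.9723 accuracy (summary.csv @ abc1234)
```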

Packaged view: /proof/, including the scenario-level matrix references. Rerun steps: docs/reproduce-eval.md. Smallest real-turn proof unit: docs/worked-example.md.

Next: Post 4 on local-first rollout.