Evaluation: prove it without pretending
OpenClawBrain needs a cleaner evidence story than "the simulations look good." The deterministic workflow-proof slice published at /proof/ makes the mechanism claim concrete, but product claims still need a stronger evidence ladder.
Proven now
full_brain reaches 0.9723 accuracy across 10 seeds, beating vector_rag and heuristic_stateful in 10/10 seeds and vector_rag_rerank in 9/10, at 5× lower context cost.
Needs head-to-head
Recorded-session comparisons, shadow traffic, and narrow online rollout before claiming better task outcomes on real OpenClaw workflows.
Not claimed
No blanket benchmark wins, no production latency or cost win, and no production superiority from deterministic artifacts alone.
Current workflow-proof slice
The current published slice is intentionally narrow: exact-target retrieval on 4 deterministic workflow queries. It is useful because it tests the product framing directly: can durable graph priors make OpenClawBrain useful before learning, and does background learning later improve the served route_fn?
| Mode | Exact-target success | What this isolates |
|---|---|---|
| vector_topk | 0/4 | Semantic retrieval floor with no workflow structure or learned routing. |
| pointer_chase | 1/4 | Simple deterministic traversal over explicit links without a learned route policy. |
| graph_prior_only | 2/4 | Durable structure before learning; the "start useful fast" part of the story. |
| learned | 4/4 | The same memory layer after background learning has improved routing. |
The learning curve matters as much as the endpoint. graph_prior_only stays flat at 2/4 across all 16 epochs, while learned first reaches 4/4 at epoch 14. The proof package now also points to per_query_matrix.csv and per_query_matrix.md, the 16-row scenario matrix showing which node IDs reached prompt context under each mode. Reproduce the figures from docs/reproduce-eval.md; compare the integration-level artifact bundle in docs/worked-example.md.
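As a hedged sketch of how the per-mode tallies could be recomputed from per_query_matrix.csv (the column names `mode`, `target_node`, and `retrieved_nodes`, the `;` separator, and the node IDs below are illustrative assumptions, not the published schema):

```python
import csv
import io

# Illustrative rows in the shape of per_query_matrix.csv; the real
# column names and node IDs are assumptions, not the file's schema.
SAMPLE = """mode,query_id,target_node,retrieved_nodes
vector_topk,q1,N7,N2;N9
pointer_chase,q1,N7,N7;N3
graph_prior_only,q1,N7,N7
learned,q1,N7,N7;N3
"""

def exact_target_success(csv_text):
    """Tally, per mode, how many queries put the exact target node
    into prompt context."""
    wins = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        hit = row["target_node"] in row["retrieved_nodes"].split(";")
        wins[row["mode"]] = wins.get(row["mode"], 0) + int(hit)
    return wins
```

With the full 16-row matrix in place of `SAMPLE`, the same tally should reproduce the 0/4 through 4/4 column of the table above.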
That is still deterministic harness evidence, not recorded-session or live-traffic proof.
10-seed ground-zero proof (March 2026)
The ground-zero harness scales the evidence beyond the narrow 4-query slice. It runs 800 queries per seed across 10 independent seeds, with relation drift injected mid-run, and scores every baseline on accuracy, staleness, and traversal cost.
| Baseline | Accuracy | Context used | Traversal cost |
|---|---|---|---|
full_brain | 0.9723 | 800 | 800 |
| vector_rag_rerank | 0.8901 | 4,000 | 6,400 |
vector_rag | 0.7966 | 800 | 800 |
heuristic_stateful | 0.7943 | 800 | 800 |
full_brain beats vector_rag and heuristic_stateful in 10/10 seeds, and vector_rag_rerank in 9/10. The nearest competitor (vector_rag_rerank) uses 5× the context and 8× the traversal cost to reach lower accuracy.
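The 10/10 and 9/10 win counts are plain per-seed comparisons. A minimal sketch of that tally, with made-up per-seed accuracies standing in for the real summary.csv values:

```python
def seed_wins(a, b):
    """Count seeds where baseline `a` scores strictly higher than `b`."""
    return sum(x > y for x, y in zip(a, b))

# Made-up per-seed accuracies for illustration only; the published
# 10-seed means are 0.9723 (full_brain) and 0.8901 (vector_rag_rerank).
full_brain = [0.98, 0.97, 0.96, 0.98, 0.97, 0.99, 0.96, 0.97, 0.98, 0.96]
rerank     = [0.90, 0.88, 0.97, 0.89, 0.87, 0.91, 0.86, 0.90, 0.88, 0.85]
```

On this fabricated data, `seed_wins(full_brain, rerank)` is 9: one seed (the third) goes the other way, which is the shape of the reported 9/10 result.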
Worked example: E048 :: E001
The ground-truth relation between entities E048 and E001 changes from manages to depends_on at step 22. This table shows how each baseline tracks the drift:
| Step | Truth | full_brain | heuristic_stateful | vector_rag | vector_rag_rerank |
|---|---|---|---|---|---|
| 13 | manages | manages (ok) | manages (ok) | owns (WRONG) | manages (ok) |
| 22 | depends_on | depends_on (ok) | manages (STALE) | manages (STALE) | manages (STALE) |
| 38 | depends_on | depends_on (ok) | manages (STALE) | owns (WRONG) | depends_on (ok) |
full_brain tracks the relation change at steps 22 and 38. heuristic_stateful stays stale. The vector baselines stay wrong or lag behind. This is bounded benchmark evidence—mechanism proof, not a claim of recorded-session or production superiority.
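One way to make the ok/STALE/WRONG labels in the table precise: a prediction matching the current truth is ok, one matching the pre-drift truth is STALE, and anything else is WRONG. A minimal sketch (the function name and signature are assumptions, not the harness's API):

```python
def drift_label(pred, truth, pre_drift_truth):
    """Label one prediction against a relation that drifted mid-run."""
    if pred == truth:
        return "ok"
    if pred == pre_drift_truth:
        return "STALE"  # still serving the pre-drift relation
    return "WRONG"      # neither the old nor the new relation

# Step 38 of the E048/E001 example: truth is depends_on, pre-drift manages.
drift_label("depends_on", "depends_on", "manages")  # full_brain: ok
drift_label("manages", "depends_on", "manages")     # heuristic_stateful: STALE
drift_label("owns", "depends_on", "manages")        # vector_rag: WRONG
```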
Mechanism proof versus product proof
This slice proves something real but limited: durable structure already helps on fixed workflow retrieval, and learned routing can improve it later through background learning, with no claim that the hot path retrains itself inline. It does not prove that operators complete real OpenClaw tasks faster, with fewer corrections, or at lower cost in production.
The stronger evidence path
| Stage | What it proves | What artifacts are required |
|---|---|---|
| Deterministic workflow proof | The router and learning loop behave coherently on the fixed workflow slice. | Exact commands, commit SHA, summary.csv, learning_curve.csv, report.md, per_query_matrix.csv, and per_query_matrix.md. |
| Offline recorded-session eval | The same OpenClaw workload can be scored head-to-head across routing modes. | Query set, scoring rubric, eval JSON, commit SHA, and sampled turn bundles in the worked-example format. |
| Shadow traffic | The brain helps on real OpenClaw traffic without being the only live path yet. | Mirrored traffic slice, side-by-side metrics, retained outputs, and disagreement traces. |
| Narrow online rollout | Operational value under real usage constraints. | Success, correction rate, prompt size, latency, cost. |
Reporting contract
- Every nontrivial claim needs a command, an artifact path, and a commit SHA.
- If an artifact does not exist yet, the table cell stays TBD.
- Do not summarize the 4-query deterministic workflow slice as a product win on OpenClaw traffic.
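The contract can be enforced mechanically. A minimal sketch, assuming a claim is recorded as a dict with these hypothetical keys (the command and paths below are placeholders, not real artifacts):

```python
REQUIRED = ("command", "artifact_path", "commit_sha")

def claim_is_backed(claim):
    """A claim counts as backed only when every required field is
    present and is not the TBD placeholder."""
    return all(claim.get(k) not in (None, "", "TBD") for k in REQUIRED)

# Placeholder claims for illustration; not real commands or paths.
backed = {
    "command": "python eval.py --slice workflow",
    "artifact_path": "proof/summary.csv",
    "commit_sha": "abc1234",
}
unbacked = {
    "command": "python eval.py --slice workflow",
    "artifact_path": "TBD",
    "commit_sha": "abc1234",
}
```

A check like this could run in CI over the claims table, forcing every TBD cell to stay visibly TBD until the artifact lands.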
Packaged view: /proof/, including the scenario-level matrix references. Rerun steps: docs/reproduce-eval.md. Smallest real-turn proof unit: docs/worked-example.md.