# Baselines and the comparison contract
OpenClawBrain should compare itself against the retrieval behaviors operators would plausibly use instead, rather than hiding behind vague "AI memory" framing. The current published slice on `/proof/` is narrow but concrete: 4 deterministic workflow queries scored with exact-target success.
That scope matters. These numbers are strong enough to compare mechanism-level baselines honestly; they are not broad enough to claim real OpenClaw workload wins yet. For that, the next rung is still recorded-session eval, then shadow traffic, then a narrow online rollout.
## Current published baseline slice
| Baseline | Why it matters | Current exact-target success |
|---|---|---|
| vector_topk | Cold-start semantic retrieval floor with no workflow structure and no learned routing. | 0/4 |
| pointer_chase | Simple deterministic traversal over explicit links. It checks whether a plain walk over stored pointers is already enough. | 1/4 |
| graph_prior_only | Durable structure without learned routing. This is the "start useful fast" baseline. | 2/4 |
| learned | The same memory layer after background learning has improved the runtime route policy. | 4/4 |
The learning curve is the second half of the comparison. graph_prior_only stays flat at 2/4 across all 16 epochs, while learned first reaches 4/4 at epoch 14. That is the current concrete evidence for the product framing: useful from graph priors first, better later via background learning.
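The flat-prior versus learning-curve claim reduces to two checks: the prior's score never moves, and the learned policy first reaches the target at some epoch. A sketch with illustrative per-epoch scores shaped like the published curves (the real data lives in the `/proof/` artifacts):

```python
def first_epoch_at(scores, target):
    """Return the first 1-indexed epoch whose score reaches target, else None."""
    for epoch, score in enumerate(scores, start=1):
        if score >= target:
            return epoch
    return None

# Illustrative 16-epoch curves, not the published data points.
graph_prior_only = [2] * 16  # flat at 2/4 across all epochs
learned = [2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4]  # first 4/4 at epoch 14

assert len(set(graph_prior_only)) == 1   # the prior never moves
print(first_epoch_at(learned, target=4))
```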
## 10-seed ground-zero comparison (March 2026)
The ground-zero harness scales the baseline comparison to 800 queries × 10 seeds with mid-run relation drift. Cost and accuracy now sit side by side:
| Baseline | Accuracy | Context used | Traversal cost | Win rate vs full_brain |
|---|---|---|---|---|
full_brain | 0.9723 | 800 | 800 | — |
vector_rag_rerank | 0.8901 | 4 000 | 6 400 | 1/10 |
vector_rag | 0.7966 | 800 | 800 | 0/10 |
heuristic_stateful | 0.7943 | 800 | 800 | 0/10 |
vector_rag_rerank is the nearest competitor but uses 5× the context and 8× the traversal cost for lower accuracy. full_brain wins in 10/10 seeds against vector_rag and heuristic_stateful, and 9/10 against vector_rag_rerank.
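Per-seed win rates like the ones above come from pairing each baseline's accuracy with full_brain's on the same seed. The per-seed numbers below are invented for illustration; only the aggregate shape matches the table:

```python
def win_rate(full_brain_acc, baseline_acc):
    """Seeds where full_brain strictly beats the baseline, over total seeds."""
    wins = sum(fb > b for fb, b in zip(full_brain_acc, baseline_acc))
    return wins, len(full_brain_acc)

# Hypothetical per-seed accuracies for 10 seeds.
full_brain = [0.97, 0.98, 0.96, 0.97, 0.98, 0.97, 0.96, 0.98, 0.97, 0.98]
rerank     = [0.88, 0.90, 0.89, 0.87, 0.99, 0.88, 0.90, 0.89, 0.88, 0.87]

wins, seeds = win_rate(full_brain, rerank)
print(f"{wins}/{seeds}")  # rerank wins only the one seed where it scores higher
```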
## Why these baselines belong
| Question | Why the answer matters |
|---|---|
| If vector_topk were already enough, why add a memory layer? | Because a durable memory product should beat a plain semantic floor on workflow retrieval, not just rename it. |
| If pointer_chase were enough, why learn anything? | Because deterministic link-following is a credible alternative for explicit workflows. It needs to be ruled in or out directly. |
| If graph_prior_only matched learned, why keep the learning path? | Because the product claim is not just "there is a graph." It is "the graph helps immediately, and background learning improves later." |
## What the current comparison says honestly
- The workflow-proof slice cleanly separates cold-start semantic retrieval, deterministic traversal, durable graph priors, and the learned policy.
- The 10-seed ground-zero harness extends this to 800 queries per seed with relation drift: full_brain reaches 0.9723 accuracy at 800 context / 800 traversal cost, while vector_rag_rerank reaches 0.8901 at 4 000 context / 6 400 traversal cost.
- The site still does not publish broad operator wins from real OpenClaw traffic. This is bounded benchmark evidence, not recorded-session or production proof.
## What stronger comparison evidence looks like
| Stage | Use | Still too weak for broad product claims? |
|---|---|---|
| Current deterministic workflow proof | Mechanism proof on one fixed, reproducible query slice | Yes |
| Offline recorded-session eval | Head-to-head on the same workload | Stronger, but still not live |
| Shadow traffic | Real OpenClaw traffic with side-by-side scoring | Often enough for narrow claims |
| Narrow online rollout | Operational decision support | Best basis for public product claims |
## Comparison discipline
- Use the same query set, scoring rubric, and prompt budget rules across all modes.
- Keep the command, artifact path, and commit SHA for every published number.
- Leave cells as `TBD` when the artifact does not exist yet.
- Do not describe the current 4-query deterministic slice as a real operator win.
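The discipline rules above amount to keeping a small provenance record per published number. A minimal sketch; the field names, the command string, and the artifact path are assumptions, not the repo's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PublishedNumber:
    """Provenance for one published metric cell."""
    metric: str
    value: Optional[float]  # None means the cell is published as TBD
    command: str            # exact command that produced the artifact
    artifact_path: str
    commit_sha: str

    def render(self) -> str:
        return "TBD" if self.value is None else f"{self.value:.4f}"

cell = PublishedNumber(
    metric="full_brain accuracy",
    value=0.9723,
    command="make ground-zero-eval SEEDS=10",          # hypothetical command
    artifact_path="artifacts/ground_zero/summary.json", # hypothetical path
    commit_sha="deadbeef",                              # placeholder SHA
)
print(cell.render())  # 0.9723
```

A number without its `command`, `artifact_path`, and `commit_sha` filled in should not appear on the site at all.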
Packaged proof boundary: `/proof/`. Reproduction steps: `docs/reproduce-eval.md`. Turn-level artifact contract: `docs/worked-example.md`.