# Baselines and the comparison contract
OpenClawBrain should compare itself against the retrieval behaviors operators would plausibly use instead, rather than hiding behind vague "AI memory" framing. The current published slice on `/proof/` is narrow but concrete: 4 deterministic workflow queries scored with exact-target success.
That scope matters. These numbers are strong enough to compare mechanism-level baselines honestly; they are not broad enough to claim real OpenClaw workload wins yet. For that, the next rung is still recorded-session eval, then shadow traffic, then a narrow online rollout.
## Current published baseline slice
| Baseline | Why it matters | Current exact-target success |
|---|---|---|
| vector_topk | Cold-start semantic retrieval floor with no workflow structure and no learned routing. | 0/4 |
| pointer_chase | Simple deterministic traversal over explicit links. It checks whether a plain walk over stored pointers is already enough. | 1/4 |
| graph_prior_only | Durable structure without learned routing. This is the "start useful fast" baseline. | 2/4 |
| learned | The same memory layer after background learning has improved the runtime route policy. | 4/4 |
The learning curve is the second half of the comparison. graph_prior_only stays flat at 2/4 across all 16 epochs, while learned first reaches 4/4 at epoch 14. That is the current concrete evidence for the product framing: useful from graph priors first, better later via background learning.
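The flat-prior versus learning-curve claim reduces to two checks: the prior's score never moves, and the learned policy first reaches the target at some epoch. A sketch with illustrative per-epoch scores shaped like the published curves (the real data lives in the `/proof/` artifacts):

```python
def first_epoch_at(scores, target):
    """Return the first 1-indexed epoch whose score reaches target, else None."""
    for epoch, score in enumerate(scores, start=1):
        if score >= target:
            return epoch
    return None

# Illustrative 16-epoch curves, not the published data points.
graph_prior_only = [2] * 16  # flat at 2/4 across all epochs
learned = [2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4]  # first 4/4 at epoch 14

assert len(set(graph_prior_only)) == 1   # the prior never moves
print(first_epoch_at(learned, target=4))
```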
## 10-seed ground-zero comparison (March 2026)
The ground-zero harness scales the baseline comparison to 800 queries × 10 seeds with mid-run relation drift. Cost and accuracy now sit side by side:
| Baseline | Accuracy | Context used | Traversal cost | Win rate vs full_brain |
|---|---|---|---|---|
full_brain | 0.9723 | 800 | 800 | — |
vector_rag_rerank | 0.8901 | 4 000 | 6 400 | 1/10 |
vector_rag | 0.7966 | 800 | 800 | 0/10 |
heuristic_stateful | 0.7943 | 800 | 800 | 0/10 |
vector_rag_rerank is the nearest competitor but uses 5× the context and 8× the traversal cost for lower accuracy. full_brain wins in 10/10 seeds against vector_rag and heuristic_stateful, and 9/10 against vector_rag_rerank.
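Per-seed win rates like the ones above come from pairing each baseline's accuracy with full_brain's on the same seed. The per-seed numbers below are invented for illustration; only the aggregate shape matches the table:

```python
def win_rate(full_brain_acc, baseline_acc):
    """Seeds where full_brain strictly beats the baseline, over total seeds."""
    wins = sum(fb > b for fb, b in zip(full_brain_acc, baseline_acc))
    return wins, len(full_brain_acc)

# Hypothetical per-seed accuracies for 10 seeds.
full_brain = [0.97, 0.98, 0.96, 0.97, 0.98, 0.97, 0.96, 0.98, 0.97, 0.98]
rerank     = [0.88, 0.90, 0.89, 0.87, 0.99, 0.88, 0.90, 0.89, 0.88, 0.87]

wins, seeds = win_rate(full_brain, rerank)
print(f"{wins}/{seeds}")  # rerank wins only the one seed where it scores higher
```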
## Why these baselines belong
| Question | Why the answer matters |
|---|---|
| If vector_topk were already enough, why add a memory layer? | Because a durable memory product should beat a plain semantic floor on workflow retrieval, not just rename it. |
| If pointer_chase were enough, why learn anything? | Because deterministic link-following is a credible alternative for explicit workflows. It needs to be ruled in or out directly. |
| If graph_prior_only matched learned, why keep the learning path? | Because the product claim is not just "there is a graph." It is "the graph helps immediately, and background learning improves later." |
## What the current comparison says honestly
- The workflow-proof slice cleanly separates cold-start semantic retrieval, deterministic traversal, durable graph priors, and the learned policy.
- The 10-seed ground-zero harness extends this to 800 queries per seed with relation drift: full_brain reaches 0.9723 accuracy at 800 context / 800 traversal cost, while vector_rag_rerank reaches 0.8901 at 4 000 context / 6 400 traversal cost.
- The site still does not publish broad operator wins from real OpenClaw traffic. This is bounded benchmark evidence, not recorded-session or production proof.
## What stronger comparison evidence looks like
| Stage | Use | Still too weak for broad product claims? |
|---|---|---|
| Current deterministic workflow proof | Mechanism proof on one fixed, reproducible query slice | Yes |
| Offline recorded-session eval | Head-to-head on the same workload | Stronger, but still not live |
| Shadow traffic | Real OpenClaw traffic with side-by-side scoring | Often enough for narrow claims |
| Narrow online rollout | Operational decision support | Best basis for public product claims |
## Comparison discipline
- Use the same query set, scoring rubric, and prompt budget rules across all modes.
- Keep the command, artifact path, and commit SHA for every published number.
- Leave cells as `TBD` when the artifact does not exist yet.
- Do not describe the current 4-query deterministic slice as a real operator win.
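The discipline rules above amount to keeping a small provenance record per published number. A minimal sketch; the field names, the command string, and the artifact path are assumptions, not the repo's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PublishedNumber:
    """Provenance for one published metric cell."""
    metric: str
    value: Optional[float]  # None means the cell is published as TBD
    command: str            # exact command that produced the artifact
    artifact_path: str
    commit_sha: str

    def render(self) -> str:
        return "TBD" if self.value is None else f"{self.value:.4f}"

cell = PublishedNumber(
    metric="full_brain accuracy",
    value=0.9723,
    command="make ground-zero-eval SEEDS=10",          # hypothetical command
    artifact_path="artifacts/ground_zero/summary.json", # hypothetical path
    commit_sha="deadbeef",                              # placeholder SHA
)
print(cell.render())  # 0.9723
```

A number without its `command`, `artifact_path`, and `commit_sha` filled in should not appear on the site at all.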
Packaged proof boundary: `/proof/`. Reproduction steps: `docs/reproduce-eval.md`. Turn-level artifact contract: `docs/worked-example.md`.