Post 5

Baselines and the comparison contract

March 2026 | v12.2.6+ series

OpenClawBrain should compare itself against retrieval behaviors operators would plausibly use instead of hiding behind vague "AI memory" framing. The current published slice on /proof/ is narrow but concrete: 4 deterministic workflow queries scored with exact-target success.

That scope matters. The numbers are concrete enough to support honest mechanism-level comparisons; they are not broad enough to claim wins on real OpenClaw workloads yet. For that, the next rungs are still recorded-session eval, then shadow traffic, then a narrow online rollout.

Current published baseline slice

| Baseline | Why it matters | Current exact-target success |
| --- | --- | --- |
| vector_topk | Cold-start semantic retrieval floor with no workflow structure and no learned routing. | 0/4 |
| pointer_chase | Simple deterministic traversal over explicit links. It checks whether a plain walk over stored pointers is already enough. | 1/4 |
| graph_prior_only | Durable structure without learned routing. This is the "start useful fast" baseline. | 2/4 |
| learned | The same memory layer after background learning has improved the runtime route policy. | 4/4 |

The learning curve is the second half of the comparison. graph_prior_only stays flat at 2/4 across all 16 epochs, while learned first reaches 4/4 at epoch 14. That is the current concrete evidence for the product framing: useful from graph priors first, better later via background learning.
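To make the scoring rule concrete, here is a minimal sketch of exact-target success, assuming a hypothetical harness where each baseline is a function from query to top-ranked node. The query set and node ids below are illustrative stand-ins, not the /proof/ code:

```python
# Minimal sketch of exact-target scoring on a 4-query slice.
# QUERIES and the two toy baselines are illustrative, not the /proof/ harness.

QUERIES = {
    "restart the ingest worker": "runbook:ingest-restart",
    "rotate the API key": "runbook:key-rotation",
    "roll back a deploy": "runbook:deploy-rollback",
    "drain a queue": "runbook:queue-drain",
}

def exact_target_success(retrieve, queries):
    """Score 1 for a query only when the top-ranked node is the exact target."""
    hits = sum(retrieve(q) == target for q, target in queries.items())
    return f"{hits}/{len(queries)}"

# A baseline is any callable: query -> top-ranked node id.
cold_floor = lambda q: "runbook:unrelated"   # stands in for a 0/4 baseline
oracle     = lambda q: QUERIES[q]            # upper bound: 4/4 by construction

print(exact_target_success(cold_floor, QUERIES))  # 0/4
print(exact_target_success(oracle, QUERIES))      # 4/4
```

Partial credit is deliberately absent: a near-miss retrieval counts the same as a miss, which is what makes the 0/4 through 4/4 ladder unambiguous.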

10-seed ground-zero comparison (March 2026)

The ground-zero harness scales the baseline comparison to 800 queries × 10 seeds with mid-run relation drift. Cost and accuracy now sit side by side:

| Baseline | Accuracy | Context used | Traversal cost | Win rate vs full_brain |
| --- | --- | --- | --- | --- |
| full_brain | 0.972 | 3 800 | 800 | n/a |
| vector_rag_rerank | 0.890 | 14 000 | 6 400 | 1/10 |
| vector_rag | 0.796 | 6 800 | 800 | 0/10 |
| heuristic_stateful | 0.794 | 3 800 | 800 | 0/10 |

vector_rag_rerank is the nearest competitor, but it spends roughly 3.7× the context and 8× the traversal cost for lower accuracy. full_brain wins in 10/10 seeds against vector_rag and heuristic_stateful, and 9/10 against vector_rag_rerank.
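The win-rate column reduces to a per-seed comparison of final accuracy. A sketch, using illustrative per-seed accuracies rather than the published 10-seed results:

```python
# Sketch of the per-seed win-rate computation behind the table above.
# The accuracy lists are illustrative stand-ins, not the published numbers.

def win_rate(baseline_acc, full_brain_acc):
    """Count seeds where the baseline's accuracy strictly beats full_brain's."""
    wins = sum(b > f for b, f in zip(baseline_acc, full_brain_acc))
    return f"{wins}/{len(baseline_acc)}"

full_brain = [0.97] * 10
rerank     = [0.89] * 9 + [0.98]   # beats full_brain in exactly one seed

print(win_rate(rerank, full_brain))  # 1/10
```

Reporting per-seed wins alongside mean accuracy matters: a mean can hide a baseline that wins on a favorable seed, which is exactly what the 9/10 result against vector_rag_rerank makes visible.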

Why these baselines belong

| Question | Why the answer matters |
| --- | --- |
| If vector_topk were already enough, why add a memory layer? | Because a durable memory product should beat a plain semantic floor on workflow retrieval, not just rename it. |
| If pointer_chase were enough, why learn anything? | Because deterministic link-following is a credible alternative for explicit workflows. It needs to be ruled in or out directly. |
| If graph_prior_only matched learned, why keep the learning path? | Because the product claim is not just "there is a graph." It is "the graph helps immediately, and background learning improves later." |
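The pointer_chase question is cheap to answer because the baseline itself is tiny. A sketch, assuming a hypothetical link table (the node names are made up, not real workflow nodes):

```python
# Sketch of a pointer_chase-style baseline: a plain deterministic walk over
# explicit links, with no learned routing. LINKS is a made-up workflow.

LINKS = {
    "deploy": "run_tests",
    "run_tests": "build",
    "build": "checkout",
}

def pointer_chase(start, hops):
    """Follow stored pointers for a fixed number of hops; stop at a dead end."""
    node = start
    for _ in range(hops):
        nxt = LINKS.get(node)
        if nxt is None:
            break
        node = nxt
    return node

print(pointer_chase("deploy", 2))   # build
print(pointer_chase("deploy", 10))  # checkout: the walk stops at the dead end
```

If a walk like this alone scored 4/4, the learned route policy would be unnecessary. On the published slice it scores 1/4, which is what rules it out directly rather than by assertion.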

What the current comparison says honestly

On the published slice, learned routing beats every mechanism-level baseline it is compared against, and the 10-seed ground-zero harness shows that advantage holding under relation drift at lower context and traversal cost. What the comparison does not yet show is a win on real OpenClaw workloads; that claim has to wait for the stronger evidence stages.

What stronger comparison evidence looks like

| Stage | Use | Still too weak for broad product claims? |
| --- | --- | --- |
| Current deterministic workflow proof | Mechanism proof on one fixed, reproducible query slice | Yes |
| Offline recorded-session eval | Head-to-head on the same workload | Stronger, but still not live |
| Shadow traffic | Real OpenClaw traffic with side-by-side scoring | Often enough for narrow claims |
| Narrow online rollout | Operational decision support | Best basis for public product claims |

Comparison discipline

Packaged proof boundary: /proof/. Reproduction steps: docs/reproduce-eval.md. Turn-level artifact contract: docs/worked-example.md.

Next: Post 6 on brain-first OpenClaw integration.