Post 1

Shadow routing + Ultimate Policy Gradient

March 2026 | v12.2.6+ series

OpenClawBrain is easiest to understand if you separate the live query path from the learning path. The live path serves OpenClaw. The learning path produces better labels and better routing weights for later turns.

Shadow routing

"Shadow routing" means the expensive teacher stays off the hot path. The brain answers live queries with the learned runtime route_fn; replay, scanner/harvester passes, and the async teacher run later over historical and ongoing session data.

  hot path:    OpenClaw -> activate + compile -> learned route_fn -> bounded compiled context
  shadow path: sessions -> scanner/harvester labels + async teacher labels -> Ultimate Policy Gradient -> updated route_fn

[Diagram: the hot-path query loop and the asynchronous shadow-learning loop]
The live router stays local and bounded; the shadow loop exists to improve the next router, not to answer the current turn.
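The split above can be sketched in a few lines. This is a minimal illustration, not the actual OpenClawBrain API: it assumes route_fn is a pure linear scorer over candidate routes, and that shadow labels accumulate in a queue and are folded into a fresh weight snapshot between turns. All names here are made up for the sketch.

```python
import queue

def route_fn(weights, features):
    """Hot path: score candidate routes with the current learned weights.
    Pure and bounded -- no teacher call, no I/O, no learning."""
    scores = {route: sum(w * features.get(f, 0.0) for f, w in fw.items())
              for route, fw in weights.items()}
    return max(scores, key=scores.get)

# Shadow path: label events accumulate off the hot path ...
label_queue = queue.Queue()

def shadow_update(weights, lr=0.1):
    """... and are applied to a new weight snapshot, which can then
    replace the hot path's weights atomically between turns."""
    updated = {route: dict(fw) for route, fw in weights.items()}
    while not label_queue.empty():
        route, features, reward = label_queue.get()
        for f, v in features.items():
            updated[route][f] = updated[route].get(f, 0.0) + lr * reward * v
    return updated
```

The key property the sketch preserves is that the live call (route_fn) never waits on the learning call (shadow_update); the shadow loop only ever produces the weights for a later turn.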

Where the labels come from

The update rule is fed by multiple label sources because OpenClawBrain is meant to live inside a real OpenClaw workflow.

  1. Human feedback: corrections and teachings tied to the same turn_id as the fired route.
  2. Self-learning outcomes: positive or negative outcomes on turns the brain actually helped serve.
  3. Scanner / harvester labels: structured signals mined from historical and ongoing session data.
  4. Async teacher labels: background teacher judgments on ambiguous or difficult route decisions.
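The four sources can share one label schema keyed by turn_id, with an explicit authority ranking so stronger sources can shadow weaker ones on the same turn. A minimal sketch follows; the exact ranking between self-learning outcomes and teacher labels is my assumption for illustration, and only "human outranks the rest" is stated by the text.

```python
from dataclasses import dataclass
from enum import IntEnum

class Source(IntEnum):
    # Higher value = higher authority. Only HUMAN-on-top is given;
    # the ordering of the middle tiers is an illustrative assumption.
    SCANNER = 1        # structured signals mined from session data
    ASYNC_TEACHER = 2  # background teacher judgments
    SELF_OUTCOME = 3   # outcomes on turns the brain helped serve
    HUMAN = 4          # corrections and teachings

@dataclass(frozen=True)
class RouteLabel:
    turn_id: str   # ties the label to the fired route's turn
    route: str     # the route this label judges
    reward: float  # positive or negative signal
    source: Source

def effective_labels(labels):
    """Per turn, keep only labels from the highest-authority source present."""
    best = {}
    for lab in labels:
        cur = best.get(lab.turn_id)
        if cur is None or lab.source > cur[0].source:
            best[lab.turn_id] = [lab]
        elif lab.source == cur[0].source:
            cur.append(lab)
    return [lab for labs in best.values() for lab in labs]
```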

Ultimate Policy Gradient

Ultimate Policy Gradient is the unifying update rule. The system does not learn one routing policy from human labels and a second routing policy from teacher labels. It learns one runtime policy while preserving an authority order among label sources.

Human corrections should be able to override weaker but higher-volume labels. That is a product requirement, not just a training detail.
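One way to picture a single policy learning from all sources at once is a REINFORCE-style step whose gradient is scaled by a per-source authority weight, so that one human correction can outweigh many scanner labels. This is a sketch of the idea, not the shipped update rule; the authority values and function names are assumptions.

```python
import math

# Illustrative authority weights: a single human label dominates
# higher-volume but weaker sources. These values are assumptions.
AUTHORITY = {"scanner": 0.1, "async_teacher": 0.3, "self_outcome": 0.5, "human": 5.0}

def softmax(scores):
    m = max(scores.values())
    exps = {r: math.exp(s - m) for r, s in scores.items()}
    z = sum(exps.values())
    return {r: e / z for r, e in exps.items()}

def upg_step(weights, labels, lr=0.05):
    """One update over labels from all sources on one shared policy.

    weights: {route: {feature: w}}; labels: (route, features, reward, source).
    Each label contributes the gradient of log softmax-prob of its route,
    scaled by its reward and its source's authority weight.
    """
    new_w = {r: dict(fw) for r, fw in weights.items()}
    for route, features, reward, source in labels:
        scores = {r: sum(w * features.get(f, 0.0) for f, w in fw.items())
                  for r, fw in new_w.items()}
        probs = softmax(scores)
        scale = lr * reward * AUTHORITY[source]
        for r in new_w:
            grad = (1.0 if r == route else 0.0) - probs[r]
            for f, v in features.items():
                new_w[r][f] = new_w[r].get(f, 0.0) + scale * grad * v
    return new_w
```

Note there is still exactly one set of routing weights; authority enters only as a multiplier on each label's contribution, which is how "one runtime policy, ordered label sources" can coexist.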

Why this matters for OpenClaw

Because the teacher never runs on the hot path, OpenClaw keeps live answers local and bounded, while routing quality still improves between turns as shadow labels feed back into route_fn.

Next: Post 2 on the learned runtime route function.