# Shadow routing + Ultimate Policy Gradient
OpenClawBrain is easiest to understand if you separate the live query path from the learning path. The live path serves OpenClaw. The learning path produces better labels and better routing weights for later turns.
## Shadow routing
"Shadow routing" means the expensive teacher stays off the hot path. The brain answers live queries with the learned runtime route_fn. Replay, scanner/harvester passes, and the async teacher work later on historical and ongoing data.
```
hot path:    OpenClaw -> activate + compile -> learned route_fn -> bounded compiled context
shadow path: sessions -> scanner/harvester labels + async teacher labels -> Ultimate Policy Gradient -> updated route_fn
```
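The split can be sketched in a few lines. This is an illustrative mock, not the OpenClawBrain API: `route_fn`, `serve`, `label_queue`, and `teacher_pass` are assumed names, and the routing heuristics are placeholders. The point is structural: the live path calls only the learned policy, while teacher labeling drains a queue later.

```python
# Sketch of shadow routing (all names are illustrative assumptions).
from collections import deque

label_queue = deque()  # shadow path: turns awaiting later labeling


def route_fn(query: str) -> str:
    """Learned runtime policy: cheap, always on the hot path."""
    return "code" if "def " in query else "chat"


def serve(query: str) -> str:
    """Hot path: answer with route_fn only; never call the teacher here."""
    route = route_fn(query)
    label_queue.append((query, route))  # recorded for the shadow path
    return route


def teacher_pass() -> list:
    """Shadow path: expensive teacher judgments, run async or in batch."""
    labels = []
    while label_queue:
        query, fired_route = label_queue.popleft()
        teacher_route = "code" if "def " in query or "import" in query else "chat"
        labels.append((query, fired_route, teacher_route))
    return labels
```

Because `serve` only appends to the queue, teacher latency never appears on the live path.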
## Where the labels come from
The update rule is fed by multiple label sources because OpenClawBrain is meant to live inside a real OpenClaw workflow.
- Human feedback: corrections and teachings tied to the same `turn_id` as the fired route.
- Self-learning outcomes: positive or negative outcomes on turns the brain actually helped serve.
- Scanner / harvester labels: structured signals mined from historical and ongoing session data.
- Async teacher labels: background teacher judgments on ambiguous or difficult route decisions.
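One way to picture the four sources feeding a single update rule is a shared label record keyed by `turn_id`. The field names and enum values below are assumptions for illustration, not the real schema:

```python
# Hypothetical unified label record; field names are assumptions.
from dataclasses import dataclass
from enum import Enum


class LabelSource(Enum):
    HUMAN = "human"      # corrections and teachings
    OUTCOME = "outcome"  # self-learning positive/negative outcomes
    SCANNER = "scanner"  # scanner/harvester mined signals
    TEACHER = "teacher"  # async teacher judgments


@dataclass
class RouteLabel:
    turn_id: str         # ties the label to the fired route
    route: str           # the route this label endorses
    reward: float        # e.g. +1.0 good outcome, -1.0 bad outcome
    source: LabelSource


label = RouteLabel(turn_id="t-42", route="code", reward=1.0, source=LabelSource.HUMAN)
```

Keying every label by the fired route's `turn_id` is what lets all four sources be merged into one update stream.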
## Ultimate Policy Gradient
Ultimate Policy Gradient is the unifying update rule. The system does not learn one routing policy from human labels and a second routing policy from teacher labels. It learns one runtime policy while preserving an authority order among label sources.
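A minimal sketch of "one policy, authority-ordered labels": when two sources disagree about the same turn, the higher-authority label wins, and the surviving labels all drive the same softmax policy update. The specific authority order shown (human > outcome > teacher > scanner) and the REINFORCE-style update are assumptions for illustration, not the documented Ultimate Policy Gradient rule.

```python
# Illustrative authority-ordered merge plus a single-policy update.
# The authority ranking below is an assumption, not the real ordering.
import math

AUTHORITY = {"human": 3, "outcome": 2, "teacher": 1, "scanner": 0}


def resolve(labels):
    """Keep one label per turn_id: the highest-authority source wins."""
    best = {}
    for turn_id, route, reward, source in labels:
        if turn_id not in best or AUTHORITY[source] > AUTHORITY[best[turn_id][2]]:
            best[turn_id] = (route, reward, source)
    return best


def policy_gradient_step(weights, routes, resolved, lr=0.1):
    """One REINFORCE-style step on the single runtime policy."""
    for route, reward, _source in resolved.values():
        exps = {r: math.exp(weights[r]) for r in routes}
        z = sum(exps.values())
        for r in routes:
            p = exps[r] / z
            grad = (1.0 if r == route else 0.0) - p  # d log pi(route) / d w_r
            weights[r] += lr * reward * grad
    return weights
```

A scanner label and a human correction on the same turn thus never train two policies; the human label simply replaces the scanner label before the shared update runs.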
## Why this matters for OpenClaw
- You can use the brain immediately after `init` and `serve`.
- Historical replay and teacher labeling do not block first use.
- The thing that improves over time is the runtime `route_fn` that OpenClaw actually calls.