RL Policy Design: Hierarchical Contextual Bandit
Problem Statement
Morphee's Knowledge Pipeline routes queries through different runtimes (DirectMemory, Skill, WASM, LLM) and strategies (single_shot, beam_search, code_execution, teaching). The initial approach was epsilon-greedy (random exploration), later replaced by a REINFORCE neural policy. Both suffer from:
- Cold start: no useful decisions until enough data accumulates
- Context blindness: ignores user role, device constraints, temporal patterns
- Flat rewards: can't distinguish "correct but slow" from "fast but wrong"
Architecture
┌─────────────────────────────────────┐
│ PolicySelector │
│ │
Query + Embedding │ 1. StateEncoder │
────────────────► │ [embed|user|time|conv|budget] │
│ │
│ 2. HierarchicalActionSpace │
│ Level 1: LinUCB → Category │
│ Level 2: LinUCB → Sub-arm │
│ │
│ 3. Confidence check │
│ spread < threshold? │
│ ├── YES → use LinUCB pick │
│ └── NO → Neural PPO forward │
│ │
Action │ 4. Record transition │
◄──────────────── │ 5. Execute strategy │
│ 6. Receive reward → update both │
└─────────────────────────────────────┘
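The confidence check in step 3 can be sketched as a small routing function. This is a minimal illustration, not the actual implementation: function names are invented, and the only assumption carried over from the diagram is that a spread below the threshold keeps the LinUCB pick while a larger spread hands off to the neural policy.

```rust
/// Spread between the best and worst UCB scores across arms.
fn ucb_spread(scores: &[f64]) -> f64 {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let min = scores.iter().cloned().fold(f64::INFINITY, f64::min);
    max - min
}

/// Mirrors the diagram: spread < threshold -> use the LinUCB pick directly;
/// otherwise fall through to a neural PPO forward pass.
fn use_linucb_pick(scores: &[f64], threshold: f64) -> bool {
    ucb_spread(scores) < threshold
}
```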
State Encoding
| Component | Dims | Features |
|---|---|---|
| Embedding | 384 | AllMiniLML6V2 query embedding |
| UserContext | 10 | role (4 one-hot) + age_bucket (4 one-hot) + interaction_count (log) + avg_reward |
| TemporalContext | 4 | hour_sin, hour_cos, weekday_sin, weekday_cos |
| ConversationContext | 4 | turn_number (log) + last_action (normalized) + streak (tanh) + recency (exp decay) |
| BudgetContext | 4 | max_latency_normalized + allow_cloud + device_tier + gpu |
| Total (LinUCB) | 406 | Full context state |
| Total (Neural) | 397 | Backward-compat: embed + action_stats + exploration_rate |
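The 406-dim LinUCB state is a straight concatenation of the five components in the table. A minimal sketch of that assembly, with the temporal sin/cos encoding spelled out (function and parameter names are illustrative, not from the codebase):

```rust
use std::f64::consts::PI;

/// Cyclic encoding of hour-of-day and day-of-week, as in the table above.
fn encode_temporal(hour: f64, weekday: f64) -> [f64; 4] {
    [
        (2.0 * PI * hour / 24.0).sin(),
        (2.0 * PI * hour / 24.0).cos(),
        (2.0 * PI * weekday / 7.0).sin(),
        (2.0 * PI * weekday / 7.0).cos(),
    ]
}

/// Concatenate all components: 384 + 10 + 4 + 4 + 4 = 406 dims.
fn encode_state(
    embedding: &[f64; 384],
    user: &[f64; 10],
    temporal: &[f64; 4],
    conversation: &[f64; 4],
    budget: &[f64; 4],
) -> Vec<f64> {
    let mut state = Vec::with_capacity(406);
    state.extend_from_slice(embedding);
    state.extend_from_slice(user);
    state.extend_from_slice(temporal);
    state.extend_from_slice(conversation);
    state.extend_from_slice(budget);
    state
}
```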
LinUCB Algorithm
Each arm maintains:
- A (d×d): accumulated context outer products + identity regularization
- b (d): accumulated reward-weighted contexts
- θ = A⁻¹b: learned coefficient vector
Selection: argmax_a [θ_a^T x + α √(x^T A_a⁻¹ x)]
Key features:
- Sliding window (default 500): evicts oldest observations, prevents stale dominance
- Cholesky decomposition: O(d³) exact inverse, recomputed every 50 updates
- No cold start: identity regularization provides meaningful exploration from day 1
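The per-arm statistics and selection rule can be shown at toy scale. This sketch uses d = 2 with a closed-form 2×2 inverse (the real selector runs at d = 406 with the Cholesky-based inverse described above); struct and method names are illustrative. Note how a fresh arm's score is pure exploration bonus (α·√(xᵀx)) thanks to the identity regularization:

```rust
/// One LinUCB arm at d = 2: A accumulates x xᵀ (plus identity), b accumulates r·x.
struct Arm {
    a: [[f64; 2]; 2],
    b: [f64; 2],
}

impl Arm {
    fn new() -> Self {
        // Identity regularization: invertible (and exploratory) from day 1.
        Arm { a: [[1.0, 0.0], [0.0, 1.0]], b: [0.0, 0.0] }
    }

    /// Closed-form 2x2 inverse; stands in for the Cholesky solve at d = 406.
    fn inv(&self) -> [[f64; 2]; 2] {
        let [[p, q], [r, s]] = self.a;
        let det = p * s - q * r;
        [[s / det, -q / det], [-r / det, p / det]]
    }

    fn update(&mut self, x: [f64; 2], reward: f64) {
        for i in 0..2 {
            for j in 0..2 {
                self.a[i][j] += x[i] * x[j];
            }
            self.b[i] += reward * x[i];
        }
    }

    /// UCB score: theta^T x + alpha * sqrt(x^T A^-1 x), with theta = A^-1 b.
    fn score(&self, x: [f64; 2], alpha: f64) -> f64 {
        let inv = self.inv();
        let theta = [
            inv[0][0] * self.b[0] + inv[0][1] * self.b[1],
            inv[1][0] * self.b[0] + inv[1][1] * self.b[1],
        ];
        let mean = theta[0] * x[0] + theta[1] * x[1];
        let ax = [
            inv[0][0] * x[0] + inv[0][1] * x[1],
            inv[1][0] * x[0] + inv[1][1] * x[1],
        ];
        mean + alpha * (x[0] * ax[0] + x[1] * ax[1]).sqrt()
    }
}
```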
Reward Decomposition
| Component | Source | Range |
|---|---|---|
| Correctness | Score.value | 0..1 |
| Latency | 1 - actual/budget | 0..1 |
| Cost | CompilationLevel (Wasm=1.0, LlmRaw=0.2) | 0..1 |
| Privacy | Route locality (DirectMemory=1.0, LlmFallback=0.3) | 0..1 |
Weight presets:
| Preset | Correctness | Latency | Cost | Privacy |
|---|---|---|---|---|
| Default | 0.50 | 0.20 | 0.15 | 0.15 |
| Mobile | 0.45 | 0.30 | 0.10 | 0.15 |
| Offline | 0.40 | 0.15 | 0.10 | 0.35 |
| Child | 0.45 | 0.10 | 0.10 | 0.35 |
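The blend is a plain weighted sum over the four normalized components. A sketch with the Default preset shows how it separates the cases the flat reward could not: "correct but slow" (correctness 1, latency 0) scores 0.80, while "fast but wrong" (correctness 0, latency 1) scores 0.50. Names here are illustrative:

```rust
/// Weights in table order: [correctness, latency, cost, privacy].
const DEFAULT_WEIGHTS: [f64; 4] = [0.50, 0.20, 0.15, 0.15];

/// Weighted blend; assumes each component is already normalized to 0..1.
fn blended_reward(correctness: f64, latency: f64, cost: f64, privacy: f64, w: [f64; 4]) -> f64 {
    w[0] * correctness + w[1] * latency + w[2] * cost + w[3] * privacy
}
```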
Hierarchical Action Space
Level 1 (LinUCB, 4 arms):
├── Memory (DirectMemory)
├── Skill (SkillExecute, SkillHint)
├── Wasm (WasmExecute)
└── Llm (LlmFallback, strategies...)
Level 2 (per-category LinUCB):
Memory: [direct_memory]
Skill: [skill_execute]
Wasm: [wasm_execute]
Llm: [llm_fallback, single_shot, beam_search, ...]
Configurable: hierarchical: false falls back to flat LinUCB over all arms.
PPO-Clip (Neural Secondary)
Triggered when LinUCB UCB spread > neural_confidence_threshold (default 0.3).
Loss = policy_loss + 0.5 * value_loss − entropy_bonus
- Policy loss: -min(ratio × advantage, clip(ratio, 1-ε, 1+ε) × advantage)
- Value loss: MSE(predicted_value, actual_return)
- Entropy bonus: coeff × H, where H = -Σ(p × log(p)); subtracting it from the loss rewards high entropy and keeps exploration alive
Where ratio = exp(new_log_prob - old_log_prob), ε = 0.2 default.
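The single-sample policy term can be spelled out directly from the formula above (the real update averages over a replay batch; the function name is illustrative). With equal old/new log-probs the ratio is 1 and the loss is just -advantage; a ratio of 2 gets clipped to 1+ε = 1.2:

```rust
/// PPO-clip policy loss for one transition:
/// -min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), ratio = exp(new_lp - old_lp).
fn ppo_policy_loss(new_log_prob: f64, old_log_prob: f64, advantage: f64, eps: f64) -> f64 {
    let ratio = (new_log_prob - old_log_prob).exp();
    let clipped = ratio.clamp(1.0 - eps, 1.0 + eps);
    -(ratio * advantage).min(clipped * advantage)
}
```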
Knowledge Transfer
Group-level prior aggregation:
- Collect LinUCB selectors from group members
- Average A and b matrices (weighted by member contribution)
- Apply prior to new member's selector with configurable weight
Knowledge bundles: binary serialization of LinUcbSelector state for the V2.1 marketplace.
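The aggregation step reduces to a contribution-weighted average of members' sufficient statistics, blended into the new member with a prior weight. A sketch at d = 2 over the b vector (the A matrices average the same way, entry-wise); the blending rule and names are illustrative assumptions, not the confirmed implementation:

```rust
/// Contribution-weighted average of members' statistics (here: b vectors at d = 2).
fn weighted_average(stats: &[([f64; 2], f64)]) -> [f64; 2] {
    let total: f64 = stats.iter().map(|(_, w)| w).sum();
    let mut out = [0.0, 0.0];
    for (b, w) in stats {
        out[0] += b[0] * w / total;
        out[1] += b[1] * w / total;
    }
    out
}

/// Blend the group prior into a new member's own statistics.
fn apply_prior(own: [f64; 2], prior: [f64; 2], prior_weight: f64) -> [f64; 2] {
    [
        (1.0 - prior_weight) * own[0] + prior_weight * prior[0],
        (1.0 - prior_weight) * own[1] + prior_weight * prior[1],
    ]
}
```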
Configuration
RlPolicyConfig {
// Neural
hidden_dim: 128,
learning_rate: 1e-3,
replay_capacity: 10_000,
min_batch_size: 32,
train_every_n: 16,
entropy_coeff: 0.01,
ppo_clip_epsilon: 0.2,
// LinUCB
linucb_alpha: 1.0, // Exploration coefficient
linucb_window_capacity: 500, // Sliding window per arm
linucb_recompute_every: 50, // Refresh inverse frequency
// Hierarchy
hierarchical: true,
neural_confidence_threshold: 0.3,
// Reward
reward_weights: None, // Use default weights
}
Performance
- LinUCB select: ~0.1ms at d=406
- LinUCB update: ~0.5ms at d=406
- Neural forward: ~1ms (CPU)
- Memory per arm: ~660KB (406² × 4 bytes for A matrix)
- Total for 10 arms: ~6.6MB