RL Policy Design: Hierarchical Contextual Bandit
Problem Statement
Morphee's Knowledge Pipeline routes queries through different runtimes (DirectMemory, Skill, WASM, LLM) and strategies (single_shot, beam_search, code_execution, teaching). The initial approach was epsilon-greedy (random exploration), later replaced by a REINFORCE neural policy. Both suffer from:
- Cold start: no useful decisions until enough data accumulates
- Context blindness: ignores user role, device constraints, temporal patterns
- Flat rewards: can't distinguish "correct but slow" from "fast but wrong"
Architecture
┌─────────────────────────────────────┐
│ PolicySelector │
│ │
Query + Embedding │ 1. StateEncoder │
────────────────► │ [embed|user|time|conv|budget] │
│ │
│ 2. HierarchicalActionSpace │
│ Level 1: LinUCB → Category │
│ Level 2: LinUCB → Sub-arm │
│ │
│ 3. Confidence check │
│ spread < threshold? │
│ ├── YES → use LinUCB pick │
│ └── NO → Neural PPO forward │
│ │
Action │ 4. Record transition │
◄──────────────── │ 5. Execute strategy │
│ 6. Receive reward → update both │
└─────────────────────────────────────┘
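The confidence check in step 3 can be sketched as a small routing function. This is a minimal illustration, not the actual implementation: function names are invented, and the only assumption carried over from the diagram is that a spread below the threshold keeps the LinUCB pick while a larger spread hands off to the neural policy.

```rust
/// Spread between the best and worst UCB scores across arms.
fn ucb_spread(scores: &[f64]) -> f64 {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let min = scores.iter().cloned().fold(f64::INFINITY, f64::min);
    max - min
}

/// Mirrors the diagram: spread < threshold -> use the LinUCB pick directly;
/// otherwise fall through to a neural PPO forward pass.
fn use_linucb_pick(scores: &[f64], threshold: f64) -> bool {
    ucb_spread(scores) < threshold
}
```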
State Encoding
| Component | Dims | Features |
|---|---|---|
| Embedding | 384 | AllMiniLML6V2 query embedding |
| UserContext | 10 | role (4 one-hot) + age_bucket (4 one-hot) + interaction_count (log) + avg_reward |
| TemporalContext | 4 | hour_sin, hour_cos, weekday_sin, weekday_cos |
| ConversationContext | 4 | turn_number (log) + last_action (normalized) + streak (tanh) + recency (exp decay) |
| BudgetContext | 4 | max_latency_normalized + allow_cloud + device_tier + gpu |
| Total (LinUCB) | 406 | Full context state |
| Total (Neural) | 397 | Backward-compat: embed + action_stats + exploration_rate |
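The 406-dim LinUCB state is a straight concatenation of the five components in the table. A minimal sketch of that assembly, with the temporal sin/cos encoding spelled out (function and parameter names are illustrative, not from the codebase):

```rust
use std::f64::consts::PI;

/// Cyclic encoding of hour-of-day and day-of-week, as in the table above.
fn encode_temporal(hour: f64, weekday: f64) -> [f64; 4] {
    [
        (2.0 * PI * hour / 24.0).sin(),
        (2.0 * PI * hour / 24.0).cos(),
        (2.0 * PI * weekday / 7.0).sin(),
        (2.0 * PI * weekday / 7.0).cos(),
    ]
}

/// Concatenate all components: 384 + 10 + 4 + 4 + 4 = 406 dims.
fn encode_state(
    embedding: &[f64; 384],
    user: &[f64; 10],
    temporal: &[f64; 4],
    conversation: &[f64; 4],
    budget: &[f64; 4],
) -> Vec<f64> {
    let mut state = Vec::with_capacity(406);
    state.extend_from_slice(embedding);
    state.extend_from_slice(user);
    state.extend_from_slice(temporal);
    state.extend_from_slice(conversation);
    state.extend_from_slice(budget);
    state
}
```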
LinUCB Algorithm
Each arm maintains:
- A (d×d): accumulated context outer products + identity regularization
- b (d): accumulated reward-weighted contexts
- θ = A⁻¹b: learned coefficient vector
Selection: argmax_a [θ_a^T x + α √(x^T A_a⁻¹ x)]
Key features:
- Sliding window (default 500): evicts oldest observations, prevents stale dominance
- Cholesky decomposition: O(d³) exact inverse, recomputed every 50 updates
- No cold start: identity regularization provides meaningful exploration from day 1
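The per-arm statistics and selection rule can be shown at toy scale. This sketch uses d = 2 with a closed-form 2×2 inverse (the real selector runs at d = 406 with the Cholesky-based inverse described above); struct and method names are illustrative. Note how a fresh arm's score is pure exploration bonus (α·√(xᵀx)) thanks to the identity regularization:

```rust
/// One LinUCB arm at d = 2: A accumulates x xᵀ (plus identity), b accumulates r·x.
struct Arm {
    a: [[f64; 2]; 2],
    b: [f64; 2],
}

impl Arm {
    fn new() -> Self {
        // Identity regularization: invertible (and exploratory) from day 1.
        Arm { a: [[1.0, 0.0], [0.0, 1.0]], b: [0.0, 0.0] }
    }

    /// Closed-form 2x2 inverse; stands in for the Cholesky solve at d = 406.
    fn inv(&self) -> [[f64; 2]; 2] {
        let [[p, q], [r, s]] = self.a;
        let det = p * s - q * r;
        [[s / det, -q / det], [-r / det, p / det]]
    }

    fn update(&mut self, x: [f64; 2], reward: f64) {
        for i in 0..2 {
            for j in 0..2 {
                self.a[i][j] += x[i] * x[j];
            }
            self.b[i] += reward * x[i];
        }
    }

    /// UCB score: theta^T x + alpha * sqrt(x^T A^-1 x), with theta = A^-1 b.
    fn score(&self, x: [f64; 2], alpha: f64) -> f64 {
        let inv = self.inv();
        let theta = [
            inv[0][0] * self.b[0] + inv[0][1] * self.b[1],
            inv[1][0] * self.b[0] + inv[1][1] * self.b[1],
        ];
        let mean = theta[0] * x[0] + theta[1] * x[1];
        let ax = [
            inv[0][0] * x[0] + inv[0][1] * x[1],
            inv[1][0] * x[0] + inv[1][1] * x[1],
        ];
        mean + alpha * (x[0] * ax[0] + x[1] * ax[1]).sqrt()
    }
}
```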
Reward Decomposition
| Component | Source | Range |
|---|---|---|
| Correctness | Score.value | 0..1 |
| Latency | 1 - actual/budget | 0..1 |
| Cost | CompilationLevel (Wasm=1.0, LlmRaw=0.2) | 0..1 |
| Privacy | Route locality (DirectMemory=1.0, LlmFallback=0.3) | 0..1 |
Weight presets:
| Preset | Correctness | Latency | Cost | Privacy |
|---|---|---|---|---|
| Default | 0.50 | 0.20 | 0.15 | 0.15 |
| Mobile | 0.45 | 0.30 | 0.10 | 0.15 |
| Offline | 0.40 | 0.15 | 0.10 | 0.35 |
| Child | 0.45 | 0.10 | 0.10 | 0.35 |
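The blend is a plain weighted sum over the four normalized components. A sketch with the Default preset shows how it separates the cases the flat reward could not: "correct but slow" (correctness 1, latency 0) scores 0.80, while "fast but wrong" (correctness 0, latency 1) scores 0.50. Names here are illustrative:

```rust
/// Weights in table order: [correctness, latency, cost, privacy].
const DEFAULT_WEIGHTS: [f64; 4] = [0.50, 0.20, 0.15, 0.15];

/// Weighted blend; assumes each component is already normalized to 0..1.
fn blended_reward(correctness: f64, latency: f64, cost: f64, privacy: f64, w: [f64; 4]) -> f64 {
    w[0] * correctness + w[1] * latency + w[2] * cost + w[3] * privacy
}
```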
Hierarchical Action Space
Level 1 (LinUCB, 4 arms):
├── Memory (DirectMemory)
├── Skill (SkillExecute, SkillHint)
├── Wasm (WasmExecute)
└── Llm (LlmFallback, strategies...)
Level 2 (per-category LinUCB):
Memory: [direct_memory]
Skill: [skill_execute]
Wasm: [wasm_execute]
Llm: [llm_fallback, single_shot, beam_search, ...]
Configurable: hierarchical: false falls back to flat LinUCB over all arms.
PPO-Clip (Neural Secondary)
Triggered when LinUCB UCB spread > neural_confidence_threshold (default 0.3).
Loss = policy_loss + 0.5 * value_loss − entropy_bonus
- Policy loss: -min(ratio × advantage, clip(ratio, 1-ε, 1+ε) × advantage)
- Value loss: MSE(predicted_value, actual_return)
- Entropy bonus: coeff × H, where H = -Σ(p × log(p)); subtracting it from the loss rewards high entropy and keeps exploration alive
Where ratio = exp(new_log_prob - old_log_prob), ε = 0.2 default.
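The single-sample policy term can be spelled out directly from the formula above (the real update averages over a replay batch; the function name is illustrative). With equal old/new log-probs the ratio is 1 and the loss is just -advantage; a ratio of 2 gets clipped to 1+ε = 1.2:

```rust
/// PPO-clip policy loss for one transition:
/// -min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), ratio = exp(new_lp - old_lp).
fn ppo_policy_loss(new_log_prob: f64, old_log_prob: f64, advantage: f64, eps: f64) -> f64 {
    let ratio = (new_log_prob - old_log_prob).exp();
    let clipped = ratio.clamp(1.0 - eps, 1.0 + eps);
    -(ratio * advantage).min(clipped * advantage)
}
```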
Knowledge Transfer
Group-level prior aggregation:
- Collect LinUCB selectors from group members
- Average A and b matrices (weighted by member contribution)
- Apply prior to new member's selector with configurable weight
Knowledge bundles: binary serialization of LinUcbSelector state for the V2.1 marketplace.
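The aggregation step reduces to a contribution-weighted average of members' sufficient statistics, blended into the new member with a prior weight. A sketch at d = 2 over the b vector (the A matrices average the same way, entry-wise); the blending rule and names are illustrative assumptions, not the confirmed implementation:

```rust
/// Contribution-weighted average of members' statistics (here: b vectors at d = 2).
fn weighted_average(stats: &[([f64; 2], f64)]) -> [f64; 2] {
    let total: f64 = stats.iter().map(|(_, w)| w).sum();
    let mut out = [0.0, 0.0];
    for (b, w) in stats {
        out[0] += b[0] * w / total;
        out[1] += b[1] * w / total;
    }
    out
}

/// Blend the group prior into a new member's own statistics.
fn apply_prior(own: [f64; 2], prior: [f64; 2], prior_weight: f64) -> [f64; 2] {
    [
        (1.0 - prior_weight) * own[0] + prior_weight * prior[0],
        (1.0 - prior_weight) * own[1] + prior_weight * prior[1],
    ]
}
```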
Configuration
RlPolicyConfig {
// Neural
hidden_dim: 128,
learning_rate: 1e-3,
replay_capacity: 10_000,
min_batch_size: 32,
train_every_n: 16,
entropy_coeff: 0.01,
ppo_clip_epsilon: 0.2,
// LinUCB
linucb_alpha: 1.0, // Exploration coefficient
linucb_window_capacity: 500, // Sliding window per arm
linucb_recompute_every: 50, // Refresh inverse frequency
// Hierarchy
hierarchical: true,
neural_confidence_threshold: 0.3,
// Reward
reward_weights: None, // Use default weights
}
Performance
- LinUCB select: ~0.1ms at d=406
- LinUCB update: ~0.5ms at d=406
- Neural forward: ~1ms (CPU)
- Memory per arm: ~660KB (406² × 4 bytes for A matrix)
- Total for 10 arms: ~6.6MB