RL Policy Design: Hierarchical Contextual Bandit

Problem Statement

Morphee's Knowledge Pipeline routes queries through different runtimes (DirectMemory, Skill, WASM, LLM) and strategies (single_shot, beam_search, code_execution, teaching). The initial approach was epsilon-greedy (random exploration), later replaced by a REINFORCE neural policy. Both suffer from:

  • Cold start: no useful decisions until enough data accumulates
  • Context blindness: ignores user role, device constraints, temporal patterns
  • Flat rewards: can't distinguish "correct but slow" from "fast but wrong"

Architecture

                  ┌─────────────────────────────────────┐
                  │ PolicySelector                      │
                  │                                     │
Query + Embedding │ 1. StateEncoder                     │
────────────────► │    [embed|user|time|conv|budget]    │
                  │                                     │
                  │ 2. HierarchicalActionSpace          │
                  │    Level 1: LinUCB → Category       │
                  │    Level 2: LinUCB → Sub-arm        │
                  │                                     │
                  │ 3. Confidence check                 │
                  │    spread < threshold?              │
                  │    ├── YES → use LinUCB pick        │
                  │    └── NO  → Neural PPO forward     │
                  │                                     │
Action            │ 4. Record transition                │
◄──────────────── │ 5. Execute strategy                 │
                  │ 6. Receive reward → update both     │
                  └─────────────────────────────────────┘

State Encoding

| Component           | Dims | Features                                                                           |
|---------------------|------|------------------------------------------------------------------------------------|
| Embedding           | 384  | AllMiniLML6V2 query embedding                                                      |
| UserContext         | 10   | role (4 one-hot) + age_bucket (4 one-hot) + interaction_count (log) + avg_reward   |
| TemporalContext     | 4    | hour_sin, hour_cos, weekday_sin, weekday_cos                                       |
| ConversationContext | 4    | turn_number (log) + last_action (normalized) + streak (tanh) + recency (exp decay) |
| BudgetContext       | 4    | max_latency_normalized + allow_cloud + device_tier + gpu                           |
| Total (LinUCB)      | 406  | Full context state                                                                 |
| Total (Neural)      | 397  | Backward-compat: embed + action_stats + exploration_rate                           |
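As a sketch of how these slices compose into the 406-dim LinUCB state (hypothetical helper names; only the temporal slice is fleshed out, and cyclic sin/cos encoding is used so hour 23 and hour 0 stay adjacent in feature space):

```rust
use std::f64::consts::PI;

/// Cyclic temporal features: hour-of-day and day-of-week on the unit circle.
fn temporal_features(hour: u32, weekday: u32) -> [f64; 4] {
    let h = 2.0 * PI * (hour as f64) / 24.0;
    let w = 2.0 * PI * (weekday as f64) / 7.0;
    [h.sin(), h.cos(), w.sin(), w.cos()]
}

/// Concatenate the component slices into one state vector.
fn encode_state(
    embed: &[f64],  // 384
    user: &[f64],   // 10
    time: &[f64],   // 4
    conv: &[f64],   // 4
    budget: &[f64], // 4
) -> Vec<f64> {
    [embed, user, time, conv, budget].concat()
}

fn main() {
    let state = encode_state(
        &vec![0.0; 384],
        &vec![0.0; 10],
        &temporal_features(9, 2),
        &[0.0; 4],
        &[0.0; 4],
    );
    assert_eq!(state.len(), 406); // matches the LinUCB total above

    // Cyclic encoding keeps midnight and 23:00 close together.
    let a = temporal_features(0, 0);
    let b = temporal_features(23, 0);
    let dist: f64 = a.iter().zip(&b).map(|(x, y)| (x - y).powi(2)).sum();
    assert!(dist < 0.1);
}
```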

LinUCB Algorithm

Each arm maintains:

  • A (d×d): accumulated context outer products + identity regularization
  • b (d): accumulated reward-weighted contexts
  • θ = A⁻¹b: learned coefficient vector

Selection: argmax_a [θ_a^T x + α √(x^T A_a⁻¹ x)]

Key features:

  • Sliding window (default 500): evicts oldest observations, prevents stale dominance
  • Cholesky decomposition: O(d³) exact inverse, recomputed every 50 updates
  • No cold start: identity regularization provides meaningful exploration from day 1
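The per-arm statistics and selection rule above can be sketched in miniature. This is an illustrative flat LinUCB at d = 2 with a closed-form 2×2 inverse (names are hypothetical; the real selector works at d = 406 with Cholesky, a sliding window, and periodic inverse refresh):

```rust
/// One LinUCB arm: A = I + Σ x·xᵀ, b = Σ r·x, θ = A⁻¹b.
struct Arm {
    a: [[f64; 2]; 2], // accumulated outer products + identity regularization
    b: [f64; 2],      // accumulated reward-weighted contexts
}

impl Arm {
    fn new() -> Self {
        Arm { a: [[1.0, 0.0], [0.0, 1.0]], b: [0.0; 2] }
    }

    /// Closed-form 2×2 inverse (stands in for the Cholesky solve at d = 406).
    fn a_inv(&self) -> [[f64; 2]; 2] {
        let det = self.a[0][0] * self.a[1][1] - self.a[0][1] * self.a[1][0];
        [[ self.a[1][1] / det, -self.a[0][1] / det],
         [-self.a[1][0] / det,  self.a[0][0] / det]]
    }

    /// UCB score: θᵀx + α·√(xᵀ A⁻¹ x).
    fn ucb(&self, x: &[f64; 2], alpha: f64) -> f64 {
        let inv = self.a_inv();
        let theta = [inv[0][0] * self.b[0] + inv[0][1] * self.b[1],
                     inv[1][0] * self.b[0] + inv[1][1] * self.b[1]];
        let mean = theta[0] * x[0] + theta[1] * x[1];
        let ax = [inv[0][0] * x[0] + inv[0][1] * x[1],
                  inv[1][0] * x[0] + inv[1][1] * x[1]];
        mean + alpha * (x[0] * ax[0] + x[1] * ax[1]).sqrt()
    }

    fn update(&mut self, x: &[f64; 2], reward: f64) {
        for i in 0..2 {
            for j in 0..2 {
                self.a[i][j] += x[i] * x[j];
            }
            self.b[i] += reward * x[i];
        }
    }
}

fn select(arms: &[Arm], x: &[f64; 2], alpha: f64) -> usize {
    (0..arms.len())
        .max_by(|&i, &j| arms[i].ucb(x, alpha).partial_cmp(&arms[j].ucb(x, alpha)).unwrap())
        .unwrap()
}

fn main() {
    let mut arms = vec![Arm::new(), Arm::new()];
    let x = [1.0, 0.5];
    // Day-1 behavior: with A = I and b = 0, every arm scores α·‖x‖,
    // so exploration is well-defined before any data arrives.
    assert_eq!(arms[0].ucb(&x, 1.0), arms[1].ucb(&x, 1.0));
    // One rewarded pull tilts selection toward arm 1.
    arms[1].update(&x, 1.0);
    assert_eq!(select(&arms, &x, 1.0), 1);
}
```

The identity term in A is what removes the cold start: the confidence width α·√(xᵀA⁻¹x) is largest exactly where no data has accumulated.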

Reward Decomposition

| Component   | Source                                             | Range |
|-------------|----------------------------------------------------|-------|
| Correctness | Score.value                                        | 0..1  |
| Latency     | 1 - actual/budget                                  | 0..1  |
| Cost        | CompilationLevel (Wasm=1.0, LlmRaw=0.2)            | 0..1  |
| Privacy     | Route locality (DirectMemory=1.0, LlmFallback=0.3) | 0..1  |

Weight presets:

| Preset  | Correctness | Latency | Cost | Privacy |
|---------|-------------|---------|------|---------|
| Default | 0.50        | 0.20    | 0.15 | 0.15    |
| Mobile  | 0.45        | 0.30    | 0.10 | 0.15    |
| Offline | 0.40        | 0.15    | 0.10 | 0.35    |
| Child   | 0.45        | 0.10    | 0.10 | 0.35    |
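A minimal sketch of the weighted combination, using the Mobile preset and the component definitions from the tables above (struct and function names are hypothetical). It shows how the decomposition separates "correct but slow" from "fast but wrong", which a flat reward cannot do:

```rust
struct RewardWeights {
    correctness: f64,
    latency: f64,
    cost: f64,
    privacy: f64,
}

// Mobile preset from the table above.
const MOBILE: RewardWeights =
    RewardWeights { correctness: 0.45, latency: 0.30, cost: 0.10, privacy: 0.15 };

fn reward(w: &RewardWeights, correctness: f64, actual_ms: f64, budget_ms: f64,
          cost: f64, privacy: f64) -> f64 {
    // Latency component: 1 - actual/budget, clamped into 0..1.
    let latency = (1.0 - actual_ms / budget_ms).clamp(0.0, 1.0);
    w.correctness * correctness + w.latency * latency + w.cost * cost + w.privacy * privacy
}

fn main() {
    // Correct answer that nearly exhausts the latency budget...
    let slow_correct = reward(&MOBILE, 1.0, 950.0, 1000.0, 1.0, 1.0);
    // ...vs. a fast but wrong answer on the same route.
    let fast_wrong = reward(&MOBILE, 0.0, 50.0, 1000.0, 1.0, 1.0);
    assert!(slow_correct > fast_wrong);
}
```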

Hierarchical Action Space

Level 1 (LinUCB, 4 arms):
├── Memory (DirectMemory)
├── Skill (SkillExecute, SkillHint)
├── Wasm (WasmExecute)
└── Llm (LlmFallback, strategies...)

Level 2 (per-category LinUCB):
Memory: [direct_memory]
Skill: [skill_execute]
Wasm: [wasm_execute]
Llm: [llm_fallback, single_shot, beam_search, ...]

Configurable: setting hierarchical: false falls back to a flat LinUCB over all arms.
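The hierarchy above can be expressed as plain data (a hypothetical layout mirroring the diagram; the per-level selector logic is elided). The flat fallback simply takes the union of all Level 2 sub-arms:

```rust
use std::collections::HashMap;

fn main() {
    // Level 1 categories mapped to their Level 2 sub-arms.
    let mut space: HashMap<&str, Vec<&str>> = HashMap::new();
    space.insert("Memory", vec!["direct_memory"]);
    space.insert("Skill",  vec!["skill_execute"]);
    space.insert("Wasm",   vec!["wasm_execute"]);
    space.insert("Llm",    vec!["llm_fallback", "single_shot", "beam_search"]);

    // hierarchical: true  → one 4-arm bandit, then one bandit per category.
    assert_eq!(space.len(), 4);

    // hierarchical: false → one flat bandit over the union of sub-arms.
    let flat: usize = space.values().map(|v| v.len()).sum();
    assert_eq!(flat, 6);
}
```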

PPO-Clip (Neural Secondary)

Triggered when LinUCB UCB spread > neural_confidence_threshold (default 0.3).

Loss = policy_loss + 0.5 × value_loss − entropy_bonus

  • Policy loss: −min(ratio × advantage, clip(ratio, 1−ε, 1+ε) × advantage)
  • Value loss: MSE(predicted_value, actual_return)
  • Entropy bonus: coeff × H, where H = −Σ(p × log(p)); subtracting it from the loss rewards exploration

Where ratio = exp(new_log_prob − old_log_prob), ε = 0.2 default.
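The clipped policy objective for a single transition can be written out directly (a scalar sketch with hypothetical inputs; in the real policy this is averaged over a batch and backpropagated):

```rust
/// PPO-clip policy loss for one transition:
/// -min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), ratio = exp(new_lp - old_lp).
fn policy_loss(new_log_prob: f64, old_log_prob: f64, advantage: f64, eps: f64) -> f64 {
    let ratio = (new_log_prob - old_log_prob).exp();
    let clipped = ratio.clamp(1.0 - eps, 1.0 + eps);
    -(ratio * advantage).min(clipped * advantage)
}

fn main() {
    // With positive advantage, a ratio beyond 1 + ε is clipped: pushing the
    // policy further than the trust region yields no extra improvement.
    let outside = policy_loss((1.5f64).ln(), 0.0, 1.0, 0.2); // ratio 1.5 → clipped to 1.2
    let at_edge = policy_loss((1.2f64).ln(), 0.0, 1.0, 0.2); // ratio 1.2 → unclipped
    assert!((outside - at_edge).abs() < 1e-9);
}
```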

Knowledge Transfer

Group-level prior aggregation:

  1. Collect LinUCB selectors from group members
  2. Average A and b matrices (weighted by member contribution)
  3. Apply prior to new member's selector with configurable weight

Knowledge bundles: binary serialization of LinUcbSelector state for V2.1 marketplace.
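The three aggregation steps can be sketched as a weighted average of members' (A, b) statistics (shown at d = 2 with a hypothetical function name; member weight here stands in for "member contribution"):

```rust
/// Weighted average of group members' LinUCB statistics.
/// Each member contributes (A, b, weight); the result is a prior for a new member.
fn average_prior(
    members: &[([[f64; 2]; 2], [f64; 2], f64)],
) -> ([[f64; 2]; 2], [f64; 2]) {
    let total: f64 = members.iter().map(|m| m.2).sum();
    let mut a = [[0.0; 2]; 2];
    let mut b = [0.0; 2];
    for (ma, mb, w) in members {
        let frac = w / total;
        for i in 0..2 {
            for j in 0..2 {
                a[i][j] += frac * ma[i][j];
            }
            b[i] += frac * mb[i];
        }
    }
    (a, b)
}

fn main() {
    // Two equally weighted members with different accumulated statistics.
    let m1 = ([[2.0, 0.0], [0.0, 2.0]], [1.0, 0.0], 1.0);
    let m2 = ([[4.0, 0.0], [0.0, 4.0]], [3.0, 0.0], 1.0);
    let (a, b) = average_prior(&[m1, m2]);
    assert_eq!(a[0][0], 3.0); // (2 + 4) / 2
    assert_eq!(b[0], 2.0);    // (1 + 3) / 2
}
```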

Configuration

RlPolicyConfig {
    // Neural
    hidden_dim: 128,
    learning_rate: 1e-3,
    replay_capacity: 10_000,
    min_batch_size: 32,
    train_every_n: 16,
    entropy_coeff: 0.01,
    ppo_clip_epsilon: 0.2,

    // LinUCB
    linucb_alpha: 1.0,            // Exploration coefficient
    linucb_window_capacity: 500,  // Sliding window per arm
    linucb_recompute_every: 50,   // Inverse refresh frequency

    // Hierarchy
    hierarchical: true,
    neural_confidence_threshold: 0.3,

    // Reward
    reward_weights: None,         // Use default weights
}

Performance

  • LinUCB select: ~0.1ms at d=406
  • LinUCB update: ~0.5ms at d=406
  • Neural forward: ~1ms (CPU)
  • Memory per arm: ~660KB (406² × 4 bytes for A matrix)
  • Total for 10 arms: ~6.6MB