# V1.5 Local LLM — candle/GGUF Inference

## Overview
Morphee V1.5 adds local LLM inference via the candle framework with GGUF quantized models in the Tauri Rust backend. This completes the offline AI story: V1.0 shipped offline data/sync, V1.5 adds offline text generation.
The same LLMIntegration contract (BaseInterface) means the orchestrator doesn't care whether Claude or a local model generates the response. The frontend implements smart routing — simple tasks go to the local model, complex reasoning with tool calls goes to Cloud Claude.
## Target Models
| Model | Quantization | Size | Target | Context |
|---|---|---|---|---|
| Phi-4-mini-instruct | Q4_K_M | ~2.3 GB | Desktop + Mobile | 4K tokens |
| Llama 3.2 3B Instruct | Q4_K_M | ~2.0 GB | Desktop only | 4K tokens |
## Architecture

```
User types message
        │
        ▼
┌─────────────────┐
│ llm-routing.ts  │  Frontend smart router
│   (heuristic)   │
└────┬───────┬────┘
     │       │
     ▼       ▼
┌────────┐  ┌──────────────┐
│ Cloud  │  │  Local LLM   │
│ Claude │  │ (Tauri IPC)  │
│ (SSE)  │  │  (Events)    │
└────────┘  └──────┬───────┘
                   │
                   ▼
           ┌──────────────┐
           │    llm.rs    │  candle GGUF inference
           │    (Rust)    │  + KV cache + sampling
           └──────────────┘
```
## Streaming Protocol

The local LLM uses Tauri events (not SSE) for token streaming:
| Event | Payload | Description |
|---|---|---|
| `llm-token` | `{ generation_id, token }` | Single generated token |
| `llm-done` | `{ generation_id, token_count }` | Generation complete |
| `llm-error` | `{ generation_id, message }` | Generation failed |
| `llm-download-progress` | `{ model_id, bytes_downloaded, bytes_total }` | Model download progress |
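A minimal sketch of how the frontend might fold these events into UI state. The payload shapes come from the table above; the reducer and type names are illustrative assumptions, and in the real `llm-client.ts` the events would arrive via `listen` from `@tauri-apps/api/event`:

```typescript
// Payload shapes mirroring the event table (field names are the documented ones).
type TokenEvent = { generation_id: string; token: string };
type DoneEvent = { generation_id: string; token_count: number };
type ErrorEvent = { generation_id: string; message: string };

type LlmEvent =
  | { kind: "llm-token"; payload: TokenEvent }
  | { kind: "llm-done"; payload: DoneEvent }
  | { kind: "llm-error"; payload: ErrorEvent };

interface GenerationState {
  text: string;
  done: boolean;
  error?: string;
}

// Pure reducer: fold the event stream for one generation_id into UI state,
// ignoring events from stale or concurrent generations.
function applyLlmEvent(
  state: GenerationState,
  ev: LlmEvent,
  activeId: string,
): GenerationState {
  if (ev.payload.generation_id !== activeId) return state;
  switch (ev.kind) {
    case "llm-token":
      return { ...state, text: state.text + ev.payload.token };
    case "llm-done":
      return { ...state, done: true };
    case "llm-error":
      return { ...state, done: true, error: ev.payload.message };
  }
}
```

Keeping the fold pure makes it trivial to drive from a Zustand store and to test without a running Tauri app.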
## Smart Routing
The frontend routes messages based on:
- Model availability — is a local model loaded?
- User preference — local-first (default), cloud-only, or ask
- Task complexity — local LLM handles simple Q&A, drafting, summarization; cloud handles tool calls and complex reasoning
No tool calling in V1.5 local LLM. Tool use requires Cloud Claude.
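The routing rules above can be sketched as a pure function. The names and exact heuristics here are assumptions for illustration, not the actual `llm-routing.ts` code:

```typescript
type RoutePreference = "local-first" | "cloud-only" | "ask";
type Route = "local" | "cloud" | "ask-user";

interface RoutingInput {
  localModelLoaded: boolean;  // is a local model loaded?
  preference: RoutePreference; // user preference from settings
  needsTools: boolean;         // tool calls always require Cloud Claude in V1.5
  isComplex: boolean;          // heuristic flag for complex reasoning
}

// Illustrative heuristic following the three documented rules.
function routeMessage(input: RoutingInput): Route {
  if (input.preference === "cloud-only") return "cloud";
  if (input.needsTools || input.isComplex) return "cloud"; // no local tool calling
  if (!input.localModelLoaded) return "cloud";             // fall back when nothing is loaded
  if (input.preference === "ask") return "ask-user";
  return "local";                                          // local-first default
}
```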
## Model Management

Models are stored at `{cache_dir}/morphee/models/{model_id}/`:

- `model.gguf` — quantized model weights
- `tokenizer.json` — HuggingFace tokenizer

Models are downloaded from the HuggingFace Hub via the `hf-hub` crate, with progress reporting.
## Rust Module: `llm.rs`

Follows the same three-tier feature pattern as `embeddings.rs`:

- Desktop (`local-llm` feature): full candle GGUF inference with Metal (macOS) or CPU
- Mobile (`mobile-ml` feature): Phi-4-mini only (RAM constraints), CPU/Metal
- Stub (no feature): returns an error
### Inference Pipeline

- Tokenize the prompt with the HF `tokenizers` crate
- Load the GGUF via `candle_transformers` quantized models (QMixFormer for Phi, QLlama for Llama)
- Autoregressive loop: forward pass → `LogitsProcessor` (temperature + top-p + repeat penalty) → yield token
- KV cache for efficient generation
- Stop on EOS token, `max_tokens`, or the cancel flag
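For illustration, the sampling step of this loop can be re-expressed in TypeScript. The Rust side uses candle's `LogitsProcessor`; everything below is a simplified sketch of the same idea (repeat penalty, temperature, nucleus truncation), not the actual implementation. `rng` is a uniform draw in [0, 1) injected for determinism:

```typescript
// One sampling step: logits -> next token id.
function nextToken(
  logits: number[],
  recent: number[],      // recently generated token ids (repeat-penalty window)
  temperature: number,
  topP: number,
  repeatPenalty: number,
  rng: number,           // uniform draw in [0, 1)
): number {
  // Repeat penalty: shrink positive logits, amplify negative ones.
  const penalized = logits.map((l, id) =>
    recent.includes(id) ? (l > 0 ? l / repeatPenalty : l * repeatPenalty) : l,
  );
  // Temperature scaling + numerically stable softmax.
  const scaled = penalized.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - max));
  const total = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / total);
  // Nucleus (top-p) filter: smallest high-probability set with mass >= topP.
  const ranked = probs.map((p, id) => ({ p, id })).sort((a, b) => b.p - a.p);
  const kept: { p: number; id: number }[] = [];
  let mass = 0;
  for (const t of ranked) {
    kept.push(t);
    mass += t.p;
    if (mass >= topP) break;
  }
  // Sample from the renormalized survivors.
  let cum = 0;
  for (const t of kept) {
    cum += t.p / mass;
    if (rng < cum) return t.id;
  }
  return kept[kept.length - 1].id;
}
```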
## Prompt Templates

Phi-4:

```
<|system|>
You are a helpful assistant.<|end|>
<|user|>
{user_message}<|end|>
<|assistant|>
```

Llama 3.2:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
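Small helper functions can render these templates; the function names are hypothetical, and the literal strings follow the templates above:

```typescript
// Render the Phi-4 chat template for a single-turn prompt.
function phi4Prompt(system: string, user: string): string {
  return `<|system|>\n${system}<|end|>\n<|user|>\n${user}<|end|>\n<|assistant|>`;
}

// Render the Llama 3.2 chat template for a single-turn prompt.
function llama32Prompt(system: string, user: string): string {
  return (
    `<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n` +
    `${system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n` +
    `${user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>`
  );
}
```

Getting these strings exactly right matters: quantized instruct models degrade noticeably when the special tokens or newlines deviate from the training format.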
## Tauri Commands
| Command | Purpose |
|---|---|
| `llm_chat_stream` | Start streaming text generation |
| `llm_cancel_stream` | Abort in-progress generation |
| `llm_get_info` | Get loaded model info |
| `llm_load_model` | Load a model into memory |
| `llm_unload_model` | Free model from memory |
| `llm_list_models` | List available + downloaded models |
| `llm_download_model` | Download GGUF from HuggingFace |
| `llm_delete_model` | Delete downloaded model files |
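A thin typed wrapper over these commands might look like the sketch below. The argument names are illustrative assumptions, not the exact `llm.rs` signatures; `invoke` is injected (in production it would be Tauri's `invoke`) so the sketch can be exercised outside a running app:

```typescript
// Tauri's invoke signature, injected for testability.
type Invoke = (cmd: string, args?: Record<string, unknown>) => Promise<unknown>;

// Thin client over the commands in the table above.
function createLlmClient(invoke: Invoke) {
  return {
    loadModel: (modelId: string) => invoke("llm_load_model", { modelId }),
    unloadModel: () => invoke("llm_unload_model"),
    chatStream: (prompt: string, generationId: string) =>
      invoke("llm_chat_stream", { prompt, generationId }),
    cancelStream: (generationId: string) =>
      invoke("llm_cancel_stream", { generationId }),
    listModels: () => invoke("llm_list_models"),
    downloadModel: (modelId: string) => invoke("llm_download_model", { modelId }),
    deleteModel: (modelId: string) => invoke("llm_delete_model", { modelId }),
  };
}
```

Centralizing the command strings in one place keeps the frontend from scattering raw `invoke("llm_…")` calls across components.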
## Frontend Components

- `llm-routing.ts` — smart cloud/local routing logic
- `llm-client.ts` — local LLM streaming via Tauri events
- `llmStore.ts` — Zustand store for LLM state
- `LocalAITab.tsx` — Settings tab for model management
## Settings UI
The "Local AI" tab in Settings provides:
- Available models list with download/delete buttons
- Download progress bar
- Active model indicator with load/unload controls
- Disk usage display
- Routing preference selector (Local first / Cloud only / Ask each time)
## Conversation Persistence

When using the local LLM, messages are persisted:

- Online: POST to the Python `/api/chat/conversations` endpoint
- Offline: save to `.morph/` via the `git_save_conversation` Tauri command
## Future (V2.0+)
- Tool calling with local models (structured output / function calling)
- More model support (Mistral 7B, Gemma 2B)
- LoRA adapter loading for fine-tuned models
- Speculative decoding for faster inference