V1.5 Local LLM — candle/GGUF Inference

Overview

Morphee V1.5 adds local LLM inference via the candle framework with GGUF quantized models in the Tauri Rust backend. This completes the offline AI story: V1.0 shipped offline data/sync, V1.5 adds offline text generation.

The same LLMIntegration contract (BaseInterface) means the orchestrator doesn't care whether Claude or a local model generates the response. The frontend implements smart routing — simple tasks go to the local model, complex reasoning with tool calls goes to Cloud Claude.

Target Models

| Model | Quantization | Size | Target | Context |
|---|---|---|---|---|
| Phi-4-mini-instruct | Q4_K_M | ~2.3 GB | Desktop + Mobile | 4K tokens |
| Llama 3.2 3B Instruct | Q4_K_M | ~2.0 GB | Desktop only | 4K tokens |

Architecture

User types message
        │
        ▼
┌─────────────────┐
│ llm-routing.ts  │  Frontend smart router
│  (heuristic)    │
└────┬───────┬────┘
     │       │
     ▼       ▼
┌────────┐  ┌──────────────┐
│ Cloud  │  │  Local LLM   │
│ Claude │  │ (Tauri IPC)  │
│ (SSE)  │  │  (Events)    │
└────────┘  └──────┬───────┘
                   │
                   ▼
           ┌──────────────┐
           │    llm.rs    │  candle GGUF inference
           │    (Rust)    │  + KV cache + sampling
           └──────────────┘

Streaming Protocol

Local LLM uses Tauri events (not SSE) for token streaming:

| Event | Payload | Description |
|---|---|---|
| llm-token | { generation_id, token } | Single generated token |
| llm-done | { generation_id, token_count } | Generation complete |
| llm-error | { generation_id, message } | Generation failed |
| llm-download-progress | { model_id, bytes_downloaded, bytes_total } | Model download progress |
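On the frontend these events arrive via Tauri's event API (listen from @tauri-apps/api/event). The payload shapes below come from the table; the reducer itself is an illustrative sketch, not Morphee code:

```typescript
// Fold incoming stream events into UI state. Payload fields match the
// event table above; the "kind" discriminant and StreamState are
// illustrative additions for this sketch.
type LlmEvent =
  | { kind: "llm-token"; generation_id: string; token: string }
  | { kind: "llm-done"; generation_id: string; token_count: number }
  | { kind: "llm-error"; generation_id: string; message: string };

interface StreamState {
  text: string;
  done: boolean;
  error?: string;
}

function applyEvent(state: StreamState, ev: LlmEvent): StreamState {
  switch (ev.kind) {
    case "llm-token":
      // Append each streamed token to the accumulated response text.
      return { ...state, text: state.text + ev.token };
    case "llm-done":
      return { ...state, done: true };
    case "llm-error":
      return { ...state, done: true, error: ev.message };
  }
}
```

A pure reducer like this keeps the event-handling logic testable without a running Tauri backend.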

Smart Routing

The frontend routes messages based on:

  1. Model availability — is a local model loaded?
  2. User preference — local-first (default), cloud-only, or ask
  3. Task complexity — local LLM handles simple Q&A, drafting, summarization; cloud handles tool calls and complex reasoning

No tool calling in V1.5 local LLM. Tool use requires Cloud Claude.
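The routing rules above can be sketched as a small pure function. Names (routeMessage, RoutingPreference) are hypothetical, not the actual llm-routing.ts API:

```typescript
type RoutingPreference = "local-first" | "cloud-only" | "ask";
type Route = "local" | "cloud" | "ask-user";

interface RoutingInput {
  localModelLoaded: boolean;     // is a local model loaded?
  preference: RoutingPreference; // user preference (local-first is the default)
  needsTools: boolean;           // tool calls require Cloud Claude in V1.5
}

function routeMessage({ localModelLoaded, preference, needsTools }: RoutingInput): Route {
  if (preference === "cloud-only") return "cloud";
  // No local tool calling in V1.5, and no local route without a loaded model.
  if (needsTools || !localModelLoaded) return "cloud";
  if (preference === "ask") return "ask-user";
  return "local"; // local-first default: simple tasks stay on-device
}
```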

Model Management

Models are stored at {cache_dir}/morphee/models/{model_id}/:

  • model.gguf — quantized model weights
  • tokenizer.json — HuggingFace tokenizer

Models are downloaded via the hf-hub crate from the HuggingFace Hub, with progress reporting.

Rust Module: llm.rs

Three-tier feature-gated pattern, mirroring embeddings.rs:

  • Desktop (local-llm feature): Full candle GGUF inference with Metal (macOS) or CPU
  • Mobile (mobile-ml feature): Phi-4-mini only (RAM constraints), CPU/Metal
  • Stub (no feature): Returns error

Inference Pipeline

  1. Tokenize prompt with HF tokenizers crate
  2. Load GGUF via candle_transformers::quantized (QMixFormer for Phi, QLlama for Llama)
  3. Autoregressive loop: forward pass → LogitsProcessor (temperature + top-p + repeat penalty) → yield token
  4. KV cache for efficient generation
  5. Stop on EOS token, max_tokens, or cancel flag
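Step 3's sampling can be illustrated with a toy temperature + top-p implementation. The real work happens in Rust via candle's LogitsProcessor; this sketch (softmax / topPFilter are our names) is purely conceptual and omits the repeat penalty:

```typescript
// Convert raw logits to probabilities, scaled by temperature
// (lower temperature sharpens the distribution).
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);             // subtract max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Top-p (nucleus) filtering: keep the smallest set of tokens whose
// cumulative probability reaches topP, then renormalize. Returns
// [tokenIndex, probability] pairs to sample from.
function topPFilter(probs: number[], topP: number): Array<[number, number]> {
  const sorted = probs
    .map((p, i) => [i, p] as [number, number])
    .sort((a, b) => b[1] - a[1]);
  const kept: Array<[number, number]> = [];
  let cum = 0;
  for (const [i, p] of sorted) {
    kept.push([i, p]);
    cum += p;
    if (cum >= topP) break;
  }
  const total = kept.reduce((a, [, p]) => a + p, 0);
  return kept.map(([i, p]) => [i, p / total] as [number, number]);
}
```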

Prompt Templates

Phi-4:

<|system|>
You are a helpful assistant.<|end|>
<|user|>
{user_message}<|end|>
<|assistant|>

Llama 3.2:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
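The two templates above can be rendered with a small helper. This function is illustrative (the actual prompt formatting lives in Rust in llm.rs):

```typescript
// Build a chat prompt for either target model. Template strings are
// taken verbatim from the Phi-4 and Llama 3.2 formats above.
function buildPrompt(
  model: "phi-4" | "llama-3.2",
  userMessage: string,
  system = "You are a helpful assistant."
): string {
  if (model === "phi-4") {
    return `<|system|>\n${system}<|end|>\n<|user|>\n${userMessage}<|end|>\n<|assistant|>\n`;
  }
  return (
    `<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n${system}<|eot_id|>` +
    `<|start_header_id|>user<|end_header_id|>\n${userMessage}<|eot_id|>` +
    `<|start_header_id|>assistant<|end_header_id|>\n`
  );
}
```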

Tauri Commands

| Command | Purpose |
|---|---|
| llm_chat_stream | Start streaming text generation |
| llm_cancel_stream | Abort in-progress generation |
| llm_get_info | Get loaded model info |
| llm_load_model | Load a model into memory |
| llm_unload_model | Free model from memory |
| llm_list_models | List available + downloaded models |
| llm_download_model | Download GGUF from HuggingFace |
| llm_delete_model | Delete downloaded model files |
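A typed wrapper over a few of these commands might look as follows. Command names come from the table; argument shapes are assumptions, and in a real app invoke would come from the @tauri-apps/api package rather than being injected:

```typescript
// invoke is injected so the wrapper can be exercised without a Tauri runtime.
type Invoke = (cmd: string, args?: Record<string, unknown>) => Promise<unknown>;

function makeLlmApi(invoke: Invoke) {
  return {
    loadModel: (modelId: string) => invoke("llm_load_model", { modelId }),
    unloadModel: () => invoke("llm_unload_model"),
    chatStream: (prompt: string, generationId: string) =>
      invoke("llm_chat_stream", { prompt, generationId }),
    cancelStream: (generationId: string) =>
      invoke("llm_cancel_stream", { generationId }),
  };
}
```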

Frontend Components

  • llm-routing.ts — Smart cloud/local routing logic
  • llm-client.ts — Local LLM streaming via Tauri events
  • llmStore.ts — Zustand store for LLM state
  • LocalAITab.tsx — Settings tab for model management

Settings UI

The "Local AI" tab in Settings provides:

  • Available models list with download/delete buttons
  • Download progress bar
  • Active model indicator with load/unload controls
  • Disk usage display
  • Routing preference selector (Local first / Cloud only / Ask each time)

Conversation Persistence

When using the local LLM, messages are still persisted:

  • Online: POST to Python /api/chat/conversations
  • Offline: Save to .morph/ via git_save_conversation Tauri command
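The online/offline branch can be sketched with injected dependencies. The endpoint and command name come from this page; the payload shape and the saveConversation wrapper are assumptions for illustration:

```typescript
interface PersistDeps {
  online: boolean;
  // e.g. a fetch wrapper that POSTs JSON to the Python backend
  postJson: (url: string, body: unknown) => Promise<void>;
  // e.g. Tauri invoke, for the git-backed offline path
  invoke: (cmd: string, args: Record<string, unknown>) => Promise<void>;
}

async function saveConversation(deps: PersistDeps, conversation: unknown): Promise<string> {
  if (deps.online) {
    await deps.postJson("/api/chat/conversations", conversation);
    return "api";
  }
  // Offline: persist to .morph/ through the Tauri command named above.
  await deps.invoke("git_save_conversation", { conversation });
  return "git";
}
```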

Future (V2.0+)

  • Tool calling with local models (structured output / function calling)
  • More model support (Mistral 7B, Gemma 2B)
  • LoRA adapter loading for fine-tuned models
  • Speculative decoding for faster inference