V1.5 Local LLM — candle/GGUF Inference

Overview

Morphee V1.5 adds local LLM inference via the candle framework with GGUF quantized models in the Tauri Rust backend. This completes the offline AI story: V1.0 shipped offline data/sync, V1.5 adds offline text generation.

The same LLMIntegration contract (BaseInterface) means the orchestrator doesn't care whether Claude or a local model generates the response. The frontend implements smart routing — simple tasks go to the local model, complex reasoning with tool calls goes to Cloud Claude.

Target Models

| Model | Quantization | Size | Target | Context |
|---|---|---|---|---|
| Phi-4-mini-instruct | Q4_K_M | ~2.3 GB | Desktop + Mobile | 4K tokens |
| Llama 3.2 3B Instruct | Q4_K_M | ~2.0 GB | Desktop only | 4K tokens |

Architecture

User types message
        │
        ▼
┌─────────────────┐
│ llm-routing.ts  │  Frontend smart router
│  (heuristic)    │
└────┬───────┬────┘
     │       │
     ▼       ▼
┌────────┐  ┌──────────────┐
│ Cloud  │  │  Local LLM   │
│ Claude │  │ (Tauri IPC)  │
│ (SSE)  │  │  (Events)    │
└────────┘  └──────┬───────┘
                   │
                   ▼
           ┌──────────────┐
           │    llm.rs    │  candle GGUF inference
           │    (Rust)    │  + KV cache + sampling
           └──────────────┘

Streaming Protocol

Local LLM uses Tauri events (not SSE) for token streaming:

| Event | Payload | Description |
|---|---|---|
| llm-token | { generation_id, token } | Single generated token |
| llm-done | { generation_id, token_count } | Generation complete |
| llm-error | { generation_id, message } | Generation failed |
| llm-download-progress | { model_id, bytes_downloaded, bytes_total } | Model download progress |
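On the frontend these events arrive via Tauri's event API (listen from @tauri-apps/api/event). The payload shapes below come from the table; the reducer itself is an illustrative sketch, not Morphee code:

```typescript
// Fold incoming stream events into UI state. Payload fields match the
// event table above; the "kind" discriminant and StreamState are
// illustrative additions for this sketch.
type LlmEvent =
  | { kind: "llm-token"; generation_id: string; token: string }
  | { kind: "llm-done"; generation_id: string; token_count: number }
  | { kind: "llm-error"; generation_id: string; message: string };

interface StreamState {
  text: string;
  done: boolean;
  error?: string;
}

function applyEvent(state: StreamState, ev: LlmEvent): StreamState {
  switch (ev.kind) {
    case "llm-token":
      // Append each streamed token to the accumulated response text.
      return { ...state, text: state.text + ev.token };
    case "llm-done":
      return { ...state, done: true };
    case "llm-error":
      return { ...state, done: true, error: ev.message };
  }
}
```

A pure reducer like this keeps the event-handling logic testable without a running Tauri backend.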

Smart Routing

The frontend routes messages based on:

  1. Model availability — is a local model loaded?
  2. User preference — local-first (default), cloud-only, or ask
  3. Task complexity — local LLM handles simple Q&A, drafting, summarization; cloud handles tool calls and complex reasoning

No tool calling in V1.5 local LLM. Tool use requires Cloud Claude.
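The routing rules above can be sketched as a small pure function. Names (routeMessage, RoutingPreference) are hypothetical, not the actual llm-routing.ts API:

```typescript
type RoutingPreference = "local-first" | "cloud-only" | "ask";
type Route = "local" | "cloud" | "ask-user";

interface RoutingInput {
  localModelLoaded: boolean;     // is a local model loaded?
  preference: RoutingPreference; // user preference (local-first is the default)
  needsTools: boolean;           // tool calls require Cloud Claude in V1.5
}

function routeMessage({ localModelLoaded, preference, needsTools }: RoutingInput): Route {
  if (preference === "cloud-only") return "cloud";
  // No local tool calling in V1.5, and no local route without a loaded model.
  if (needsTools || !localModelLoaded) return "cloud";
  if (preference === "ask") return "ask-user";
  return "local"; // local-first default: simple tasks stay on-device
}
```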

Model Management

Models are stored at {cache_dir}/morphee/models/{model_id}/:

  • model.gguf — quantized model weights
  • tokenizer.json — HuggingFace tokenizer

Models are downloaded via the hf-hub crate from the HuggingFace Hub, with progress reporting.

Rust Module: llm.rs

Three-tier feature-gated pattern, mirroring embeddings.rs:

  • Desktop (local-llm feature): Full candle GGUF inference with Metal (macOS) or CPU
  • Mobile (mobile-ml feature): Phi-4-mini only (RAM constraints), CPU/Metal
  • Stub (no feature): Returns error

Inference Pipeline

  1. Tokenize prompt with HF tokenizers crate
  2. Load GGUF via candle_transformers::quantized (QMixFormer for Phi, QLlama for Llama)
  3. Autoregressive loop: forward pass → LogitsProcessor (temperature + top-p + repeat penalty) → yield token
  4. KV cache for efficient generation
  5. Stop on EOS token, max_tokens, or cancel flag
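Step 3's sampling can be illustrated with a toy temperature + top-p implementation. The real work happens in Rust via candle's LogitsProcessor; this sketch (softmax / topPFilter are our names) is purely conceptual and omits the repeat penalty:

```typescript
// Convert raw logits to probabilities, scaled by temperature
// (lower temperature sharpens the distribution).
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);             // subtract max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Top-p (nucleus) filtering: keep the smallest set of tokens whose
// cumulative probability reaches topP, then renormalize. Returns
// [tokenIndex, probability] pairs to sample from.
function topPFilter(probs: number[], topP: number): Array<[number, number]> {
  const sorted = probs
    .map((p, i) => [i, p] as [number, number])
    .sort((a, b) => b[1] - a[1]);
  const kept: Array<[number, number]> = [];
  let cum = 0;
  for (const [i, p] of sorted) {
    kept.push([i, p]);
    cum += p;
    if (cum >= topP) break;
  }
  const total = kept.reduce((a, [, p]) => a + p, 0);
  return kept.map(([i, p]) => [i, p / total] as [number, number]);
}
```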

Prompt Templates

Phi-4:

<|system|>
You are a helpful assistant.<|end|>
<|user|>
{user_message}<|end|>
<|assistant|>

Llama 3.2:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
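The two templates above can be rendered with a small helper. This function is illustrative (the actual prompt formatting lives in Rust in llm.rs):

```typescript
// Build a chat prompt for either target model. Template strings are
// taken verbatim from the Phi-4 and Llama 3.2 formats above.
function buildPrompt(
  model: "phi-4" | "llama-3.2",
  userMessage: string,
  system = "You are a helpful assistant."
): string {
  if (model === "phi-4") {
    return `<|system|>\n${system}<|end|>\n<|user|>\n${userMessage}<|end|>\n<|assistant|>\n`;
  }
  return (
    `<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n${system}<|eot_id|>` +
    `<|start_header_id|>user<|end_header_id|>\n${userMessage}<|eot_id|>` +
    `<|start_header_id|>assistant<|end_header_id|>\n`
  );
}
```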

Tauri Commands

| Command | Purpose |
|---|---|
| llm_chat_stream | Start streaming text generation |
| llm_cancel_stream | Abort in-progress generation |
| llm_get_info | Get loaded model info |
| llm_load_model | Load a model into memory |
| llm_unload_model | Free model from memory |
| llm_list_models | List available + downloaded models |
| llm_download_model | Download GGUF from HuggingFace |
| llm_delete_model | Delete downloaded model files |
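A typed wrapper over a few of these commands might look as follows. Command names come from the table; argument shapes are assumptions, and in a real app invoke would come from the @tauri-apps/api package rather than being injected:

```typescript
// invoke is injected so the wrapper can be exercised without a Tauri runtime.
type Invoke = (cmd: string, args?: Record<string, unknown>) => Promise<unknown>;

function makeLlmApi(invoke: Invoke) {
  return {
    loadModel: (modelId: string) => invoke("llm_load_model", { modelId }),
    unloadModel: () => invoke("llm_unload_model"),
    chatStream: (prompt: string, generationId: string) =>
      invoke("llm_chat_stream", { prompt, generationId }),
    cancelStream: (generationId: string) =>
      invoke("llm_cancel_stream", { generationId }),
  };
}
```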

Frontend Components

  • llm-routing.ts — Smart cloud/local routing logic
  • llm-client.ts — Local LLM streaming via Tauri events
  • llmStore.ts — Zustand store for LLM state
  • LocalAITab.tsx — Settings tab for model management

Settings UI

The "Local AI" tab in Settings provides:

  • Available models list with download/delete buttons
  • Download progress bar
  • Active model indicator with load/unload controls
  • Disk usage display
  • Routing preference selector (Local first / Cloud only / Ask each time)

Conversation Persistence

When using the local LLM, messages are still persisted:

  • Online: POST to Python /api/chat/conversations
  • Offline: Save to .morph/ via git_save_conversation Tauri command
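The online/offline branch can be sketched with injected dependencies. The endpoint and command name come from this page; the payload shape and the saveConversation wrapper are assumptions for illustration:

```typescript
interface PersistDeps {
  online: boolean;
  // e.g. a fetch wrapper that POSTs JSON to the Python backend
  postJson: (url: string, body: unknown) => Promise<void>;
  // e.g. Tauri invoke, for the git-backed offline path
  invoke: (cmd: string, args: Record<string, unknown>) => Promise<void>;
}

async function saveConversation(deps: PersistDeps, conversation: unknown): Promise<string> {
  if (deps.online) {
    await deps.postJson("/api/chat/conversations", conversation);
    return "api";
  }
  // Offline: persist to .morph/ through the Tauri command named above.
  await deps.invoke("git_save_conversation", { conversation });
  return "git";
}
```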

Future (V2.0+)

  • Tool calling with local models (structured output / function calling)
  • More model support (Mistral 7B, Gemma 2B)
  • LoRA adapter loading for fine-tuned models
  • Speculative decoding for faster inference