Video Interaction & Visual Recognition
Status: PLANNED (V1.3 — Multimodal Interaction)
Created: 2026-02-15
Effort: Large (3-4 weeks)
Dependencies: Multi-Modal Identity (V1.0-V1.3)
Vision
Morphee is chat-first, but chat isn't the only way humans communicate. Adding video capability opens new interaction modalities:
- Gesture-based interactions — Thumbs up to approve, wave to acknowledge, point to select
- Visual context — Show pictures, diagrams, documents for discussion
- Handwriting recognition — Write and have Morphee transcribe/interpret
- Visual analysis — Upload a photo and discuss what's in it
- Screen sharing (desktop) — Share what you're looking at for context
- Accessibility — Kids, elderly, and non-vocal users can interact through gesture/camera
This phase transforms Morphee from a text-centric agent into a multimodal agent that understands visual input and responds to gestures.
Connection to Knowledge Pipeline: Gesture recognition models (TensorFlow.js pose detection) are candidates for the compilation chain — rule-based gesture classifiers start as Level 1 (PythonRuntime), move to Level 2 (JSRuntime) for real-time browser processing, and could be compiled to Level 3 (WasmRuntime) for maximum performance. See ROADMAP.md — BaseMorphRuntime. Also connects to Multi-Modal Identity for gesture-based authentication.
Problem Statement
Current Limitations
- Text-only interaction — Users must type or use voice; no visual context sharing
- Static approvals — Approving tasks requires clicking buttons; no gesture-based flow
- Image-based workflows missing — No way to show a picture and discuss it
- Handwriting ignored — Kids drawing or writing isn't accessible to Morphee
- Accessibility gap — Non-vocal, physically limited users have fewer interaction options
- Visual memory — Can't easily store and recall visual information (photos, diagrams)
Opportunities
- Richer family interactions — Kids show drawings, Morphee creates memories; parents use gestures
- Professional context — Developers share error screenshots; managers share diagrams
- Educational — Teachers show diagrams, Morphee generates questions
- Accessibility-first — Voice + gesture + visual cues enable multimodal communication
Design
1. Camera Input & Visual Capture
User Permissions
- On first use, browser requests camera permission (standard web API)
- Tauri native dialogs on desktop (more polished)
- Permission stored in ACL system (granular control per user/group)
- Can be revoked in Settings → Privacy
Camera Component
// frontend/src/components/chat/CameraCapture.tsx
<CameraCapture
  onCapture={(frame: Blob) => {
    // Send to Morphee for analysis or just display
    uploadImage(frame);
  }}
  disabled={!hasPermission}
  previewSize="medium"
/>
Features:
- Live camera preview (mobile friendly)
- Capture button (takes single frame)
- Video mode (record short clip, up to 10s)
- Flash/light indicator (for low-light)
- Switch camera (front/back on mobile)
Image Upload Component
// frontend/src/components/chat/ImageUploader.tsx
<ImageUploader
  onUpload={(files: File[]) => {
    // Handle single or multiple images
    uploadImages(files);
  }}
  acceptTypes={['.jpg', '.png', '.gif', '.webp']}
  maxSize={10 * 1024 * 1024} // 10 MB
/>
Features:
- Drag-and-drop zone
- Click to browse files
- Gallery picker (on mobile)
- Multi-file upload
- Progress indicator
Display in Chat
// Show captured/uploaded images inline
<ChatBubble
  message="Here's the diagram I want you to explain"
  images={[
    { url: '/uploads/diagram.png', alt: 'Architecture diagram', width: 300 }
  ]}
/>
2. Gesture Recognition
Detection Pipeline
Live video frame
↓
[TensorFlow.js Pose Detection]
↓
Extract hand/head landmarks
↓
[Gesture Classification] (rule-based or ML model)
↓
Semantic gesture (thumbs_up, wave, point, open_palm)
↓
Trigger AI response
Supported Gestures
| Gesture | Hands | Head | Meaning | AI Response |
|---|---|---|---|---|
| Thumbs up | Both thumbs up | — | Approval, positive | Approve pending action |
| Thumbs down | Both thumbs down | — | Disapproval, negative | Reject pending action |
| Wave | Hand moving side-to-side | — | Hello, acknowledge | "Hello! How can I help?" |
| Point | One hand pointing | Head facing that direction | Select, indicate | "I'll help with that" |
| Open palm | Both palms open at shoulders | — | Stop, pause | Pause AI response |
| Nod | — | Up/down motion | Yes, agreement | Confirm action |
| Shake | — | Side-to-side motion | No, disagreement | "Understood, I'll do it differently" |
Implementation
Tier 1: Rule-Based (JavaScript)
// frontend/src/lib/gesture-detector.ts
export interface GestureResult {
  gesture: 'thumbs_up' | 'thumbs_down' | 'wave' | 'point' | 'nod' | 'shake' | null;
  confidence: number; // 0-1
  timestamp: number;
}

async function detectGesture(videoFrame: HTMLVideoElement): Promise<GestureResult> {
  const poses = await poseDetector.estimatePoses(videoFrame);
  if (!poses.length) return { gesture: null, confidence: 0, timestamp: Date.now() };
  const landmarks = poses[0].landmarks;

  // Thumbs up: both thumb tips above the index fingers
  // (image coordinates: smaller y means higher in the frame)
  if (landmarks.leftThumb.y < landmarks.leftIndexFinger.y &&
      landmarks.rightThumb.y < landmarks.rightIndexFinger.y) {
    return { gesture: 'thumbs_up', confidence: 0.9, timestamp: Date.now() };
  }

  // Nod: head moving up/down (headYMovement is accumulated across
  // recent frames from the head-landmark history, not a single frame)
  if (headYMovement > threshold) {
    return { gesture: 'nod', confidence: 0.85, timestamp: Date.now() };
  }

  // ... more gesture rules
  return { gesture: null, confidence: 0, timestamp: Date.now() };
}
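Nod and shake classification cannot work on a single frame: the head landmark must be tracked over time. A language-agnostic sketch (in Python, for illustration) of how such a temporal classifier could count direction reversals over a rolling window — the window size, motion threshold, and reversal count are illustrative tuning values, not settled numbers:

```python
from collections import deque

class NodDetector:
    """Classify a nod from a rolling window of head-landmark y positions.

    Hypothetical sketch: window, threshold, and min_reversals are
    illustrative tuning parameters, not values from the codebase.
    """

    def __init__(self, window: int = 15, threshold: float = 0.03, min_reversals: int = 2):
        self.ys = deque(maxlen=window)   # normalized y of the nose landmark per frame
        self.threshold = threshold       # minimum per-frame vertical travel to count as motion
        self.min_reversals = min_reversals

    def push(self, nose_y: float) -> bool:
        """Add one frame's nose y; return True once a nod is recognized."""
        self.ys.append(nose_y)
        if len(self.ys) < self.ys.maxlen:
            return False  # not enough history yet
        # Count direction reversals of vertical motion larger than the threshold
        deltas = [b - a for a, b in zip(self.ys, list(self.ys)[1:])]
        reversals, last_sign = 0, 0
        for d in deltas:
            if abs(d) < self.threshold:
                continue  # jitter, ignore
            sign = 1 if d > 0 else -1
            if last_sign and sign != last_sign:
                reversals += 1
            last_sign = sign
        return reversals >= self.min_reversals
```

The same window-and-reversal structure applies to head shakes (x axis) and waves (hand landmark x axis).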
Tier 2: ML Model (Optional Future)
- Train small gesture classification model (TFLite)
- More accurate than rule-based, but requires training data
- Local inference in browser (no cloud dependency)
Gesture Event Flow
User performs gesture
↓
[detectGesture] → gesture type + confidence
↓
Gesture confidence > threshold?
├─ YES → Emit event to chat
│ (show "Thumbs up detected" toast)
│ ↓
│ Post to /api/chat/component-event
│ ↓
│ AI receives → responds naturally
│
└─ NO → Ignore (too low confidence)
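The thresholding branch above can be sketched as a small pure function (Python for illustration): the 0.9 cutoff mirrors the >90% accuracy target in Risks & Mitigations, and the payload shape mirrors the WebSocket example in the API Reference.

```python
CONFIDENCE_THRESHOLD = 0.9  # mirrors the >90% target in Risks & Mitigations

def route_gesture(gesture, confidence):
    """Return the component-event payload to post, or None to ignore."""
    if gesture is None or confidence < CONFIDENCE_THRESHOLD:
        return None  # too uncertain: drop silently, no toast, no event
    return {
        "type": "component_event",
        "component_id": "gesture_detector",
        "event": "gesture_detected",
        "data": {"gesture": gesture, "confidence": confidence},
    }
```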
3. Visual Analysis
Backend: VisualIntegration
# backend/interfaces/integrations/visual.py
class VisualIntegration(BaseInterface):
    """Analyze images, video frames, and visual content."""

    async def analyze_image(self, image_data: str, context: str = None):
        """
        Analyze an image using Claude's vision capabilities.

        Args:
            image_data: base64-encoded image
            context: optional user question/context

        Returns:
            {
                "description": "what's in the image",
                "objects": ["list", "of", "detected", "objects"],
                "text": "any text found (OCR)",
                "analysis": "detailed analysis based on context"
            }
        """
        # Call Claude API with vision
        response = await self.llm.call_with_vision(
            message=f"Analyze this image{f': {context}' if context else ''}",
            image_base64=image_data
        )
        return {
            "description": response.description,
            "objects": response.detected_objects,
            "text": response.ocr_text,
            "analysis": response.analysis
        }

    async def recognize_gesture(self, video_frame: str):
        """
        Recognize gestures in a video frame.

        Note: This is mostly frontend work; backend can optionally
        do additional analysis.

        Args:
            video_frame: base64-encoded frame

        Returns:
            {
                "gesture": "thumbs_up|thumbs_down|wave|nod|...",
                "confidence": 0.95,
                "description": "User gave thumbs up"
            }
        """
        # Optional: use Claude vision for additional gesture analysis
        # if frontend detection is uncertain
        pass

    async def transcribe_handwriting(self, image_data: str):
        """
        Transcribe handwriting from an image using OCR + Claude vision.

        Args:
            image_data: base64-encoded image with handwriting

        Returns:
            {
                "text": "transcribed text",
                "confidence": 0.92,
                "structured_data": {...}  # if it's a form/list
            }
        """
        response = await self.llm.call_with_vision(
            message="Transcribe this handwriting precisely",
            image_base64=image_data
        )
        return {
            "text": response.text,
            "confidence": response.confidence,
            "structured_data": response.parsed_structure
        }

    async def interpret_diagram(self, image_data: str):
        """
        Interpret diagrams (flowcharts, architecture, whiteboard sketches).

        Args:
            image_data: base64-encoded diagram

        Returns:
            {
                "type": "flowchart|architecture|mindmap|...",
                "description": "what the diagram shows",
                "elements": ["list", "of", "components"],
                "relationships": [
                    {"from": "A", "to": "B", "relation": "connects_to"}
                ],
                "explanation": "detailed explanation"
            }
        """
        response = await self.llm.call_with_vision(
            message="Explain this diagram in detail. List components and relationships.",
            image_base64=image_data
        )
        return {
            "type": response.diagram_type,
            "description": response.summary,
            "elements": response.components,
            "relationships": response.relationships,
            "explanation": response.detailed_explanation
        }
Frontend: Image Display & Analysis
// Show image in chat with analysis button
<ChatBubble>
  <div className="space-y-2">
    <img src={imageUrl} alt="User uploaded image" />
    <button onClick={() => analyzeImage(imageUrl)}>
      Analyze this image
    </button>
  </div>
</ChatBubble>

// After analysis
<ChatBubble role="assistant">
  <p>{analysis.description}</p>
  <ul>
    {analysis.objects.map(obj => <li key={obj}>{obj}</li>)}
  </ul>
</ChatBubble>
4. Screen Sharing (Desktop/Tauri)
Use Case
"Show me what you're seeing" — User shares a portion of their desktop for context.
Implementation
// Tauri command
const frame = await invoke('screen_capture', {
  captureArea: { x: 0, y: 0, width: 1920, height: 1080 }
});

// Send to Morphee
await api.analyzeImage(frame, "What's on my screen?");
UI
<div className="flex gap-2">
  <Button onClick={() => captureFullScreen()}>
    <Share2 className="w-4 h-4" /> Share screen
  </Button>
  <Button onClick={() => captureRegion()}>
    <Crop className="w-4 h-4" /> Share region
  </Button>
</div>
5. Visual Memory
Memory Integration Extension
Add new MemoryIntegration actions:
# In backend/interfaces/integrations/memory.py
async def store_visual(self, image_data: str, caption: str, memory_type: str):
    """
    Store an image in memory with optional caption/tags.

    Args:
        image_data: base64-encoded image
        caption: user description
        memory_type: "photo|diagram|screenshot|drawing|..."
    """
    # Generate embedding from image description
    description = await visual_integration.analyze_image(image_data)
    embedding = await embeddings.embed(description['description'])

    # Store both the image (in Git) and embedding (in pgvector)
    await vector_store.insert(
        content=caption or description['description'],
        embedding=embedding,
        type='visual_memory',
        metadata={
            'image_type': memory_type,
            'objects': description['objects'],
            'stored_at': datetime.now().isoformat()
        }
    )

    # Save image file to Git
    await git_store.save_memory(
        filename=f"images/{uuid4()}.b64",
        content=f"---\ntype: visual\ncaption: {caption}\n---\n{image_data}"
    )
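Recall works in reverse: embed the user's query and rank stored visual memories by embedding similarity. pgvector does this ranking server-side; the following is a self-contained, in-memory illustration of the ranking step only (the `recall_visual` helper and the memory record shape are hypothetical, not part of the current MemoryIntegration API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall_visual(query_embedding, memories, top_k=3):
    """Rank visual-memory records by similarity to the query embedding.

    memories: list of {'caption': str, 'embedding': [float], ...} records,
    standing in for rows of the pgvector store.
    """
    ranked = sorted(memories,
                    key=lambda m: cosine(query_embedding, m["embedding"]),
                    reverse=True)
    return ranked[:top_k]
```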
Frontend: Quick Save
<Button onClick={() => saveToMemory(imageUrl, caption)}>
  <Save className="w-4 h-4" /> Save to memory
</Button>
Implementation Timeline
Phase 3p.1 — Basic Camera & Image Upload (2 weeks)
Backend:
- Permission model in ACL system (`camera`, `image_upload`)
- Image upload endpoint: `POST /api/chat/upload-image` (multipart/form-data)
- Image storage: temporary AWS S3 or local fs (sandboxed per group)
- URL generation for display in chat
Frontend:
- `CameraCapture` component (browser camera API)
- `ImageUploader` component (drag-drop, file picker)
- `ImageMessage` rendering in ChatBubble
- Permission request flow + Settings toggle
- Upload progress indicator
Tauri (Optional for Phase 3p.1):
- Native camera access (macOS AVFoundation, Windows MediaCapture, Linux V4L2)
- Permission dialog (native vs browser)
Tests:
- Frontend: camera permission, upload flow, image display
- Backend: permission check, image storage, retrieval
- E2E: upload image, see in chat
Phase 3p.2 — Visual Analysis (2 weeks)
Backend:
- `VisualIntegration(BaseInterface)` with 4 actions
- Claude vision integration (anthropic-sdk vision calls)
- Image caching (store analysis metadata, not images)
- Error handling (invalid images, API failures)
Frontend:
- "Analyze image" button in ChatBubble
- Display analysis results inline
- Follow-up questions for images ("Tell me more about...")
- Image analysis loading state
Tests:
- Backend: vision API calls, error cases, metadata caching
- Frontend: analyze button, result display
- E2E: upload, analyze, discuss results
Phase 3p.3 — Gesture Recognition (2 weeks)
Frontend:
- TensorFlow.js pose detection (load model, init)
- `GestureDetector` component (continuous frame analysis)
- `GestureRecognizer` logic (rule-based gesture classification)
- Gesture event dispatch (`component-event` endpoint)
- Visual feedback (toast: "Thumbs up detected!")
Backend:
- Handle gesture events in chat orchestrator
- Natural language response to gestures ("Got it! I'm on it.")
- Gesture → action mapping (thumbs_up = approve_pending)
Tauri:
- Camera frame streaming to frontend (optional optimization)
Tests:
- Frontend: pose detection, gesture classification, event dispatch
- Backend: gesture event handling, action mapping
- E2E: perform gesture, see AI response
Phase 3p.4 — Polish & Accessibility (1 week)
Frontend:
- Voice feedback for gestures ("I detected your thumbs up")
- High-contrast visual indicators (red/green borders for approval)
- Gesture customization in Settings
- Gesture log (history of detected gestures)
- Gesture-only mode (no buttons, all interaction via gesture)
Tauri (Desktop):
- `screen_capture` command (full screen or region)
- Screen share UI
Tests:
- Voice feedback for screen readers
- Gesture-only workflow (approval without clicking)
- Accessibility: keyboard nav, screen reader compat
Data & Privacy
What Data Is Captured?
| Data | Captured | Stored | Retained | Deleted |
|---|---|---|---|---|
| Camera frames | Yes (temporary) | No | Session only | On close |
| Uploaded images | Yes | Temporarily (S3) | 24h max | Auto after 24h |
| Image analysis | Yes | As metadata | Per memory policy | With memory |
| Gesture frames | No (only landmarks) | No | Session only | On close |
| Gesture log | Yes | Optional (user consent) | Per user setting | On request |
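The "24h max" retention row implies a scheduled cleanup job over uploaded images. A minimal sketch, assuming uploads are tracked with an `uploaded_at` timestamp (the function and record shape are illustrative; the real system would query the database from a scheduler):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=24)  # matches the "24h max" policy above

def expired_uploads(uploads, now=None):
    """Return the ids of uploaded images older than the retention window.

    uploads: list of {'id': str, 'uploaded_at': datetime} records
    (illustrative stand-in for a DB query).
    """
    now = now or datetime.now(timezone.utc)
    return [u["id"] for u in uploads if now - u["uploaded_at"] > RETENTION]
```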
Privacy Safeguards
- No raw image storage — Only analyzed metadata + captions stored
- Explicit consent — Camera/image upload requires user permission
- GDPR compliance — User can delete all images/metadata via data export/deletion
- Local processing — Gesture detection happens in browser (no server-side video processing)
- Opt-in gesture logging — Users can disable gesture history
- Biometric template not stored — Face landmarks extracted during pose detection are NOT stored
Privacy Policy Updates
Add to PRIVACY_POLICY.md:
## Visual Data Processing
When you use Morphee's camera or image upload features:
- **Images you upload** are analyzed by Claude's vision API and deleted within 24 hours
- **Your camera** is only accessed with your explicit permission
- **Gesture data** is processed locally in your browser; no gestures are stored unless you enable gesture logging
- **Analysis results** (descriptions, extracted text) are stored only if you save them to memory
- **Screen captures** (desktop only) are temporary and never stored server-side
You can disable camera/image features in Settings → Privacy at any time.
Database Schema
-- visual_interactions: log of user visual interactions
CREATE TABLE visual_interactions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES morphee_users(id) ON DELETE CASCADE,
group_id UUID NOT NULL REFERENCES groups(id) ON DELETE CASCADE,
interaction_type TEXT NOT NULL, -- 'image_upload', 'gesture_detected', 'analysis'
image_url TEXT,
gesture_type TEXT,
gesture_confidence NUMERIC,
metadata JSONB,
created_at TIMESTAMP NOT NULL DEFAULT now()
);
-- gesture_preferences: custom gesture bindings per user
CREATE TABLE gesture_preferences (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES morphee_users(id) ON DELETE CASCADE,
gesture_type TEXT NOT NULL, -- 'thumbs_up', 'nod', etc.
custom_action TEXT, -- optional custom mapping
enabled BOOLEAN DEFAULT true,
created_at TIMESTAMP NOT NULL DEFAULT now(),
UNIQUE(user_id, gesture_type)
);
-- Enable gesture logging in ACL permissions
-- acl_preferences: add 'gesture_logging' boolean field
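To illustrate the `UNIQUE(user_id, gesture_type)` constraint on `gesture_preferences`: saving a preference twice should update the row, not duplicate it. A sketch against an in-memory SQLite stand-in (UUIDs as text, JSONB omitted; Postgres would use the same `ON CONFLICT ... DO UPDATE` upsert shape):

```python
import sqlite3
import uuid

# In-memory SQLite stand-in for the Postgres table above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE gesture_preferences (
        id TEXT PRIMARY KEY,
        user_id TEXT NOT NULL,
        gesture_type TEXT NOT NULL,
        custom_action TEXT,
        enabled INTEGER DEFAULT 1,
        UNIQUE(user_id, gesture_type)
    )
""")

def set_preference(user_id, gesture_type, custom_action=None, enabled=True):
    """Upsert one gesture preference; the UNIQUE pair drives the conflict."""
    conn.execute(
        """INSERT INTO gesture_preferences (id, user_id, gesture_type, custom_action, enabled)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT(user_id, gesture_type)
           DO UPDATE SET custom_action = excluded.custom_action,
                         enabled = excluded.enabled""",
        (str(uuid.uuid4()), user_id, gesture_type, custom_action, int(enabled)),
    )
```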
API Reference
Image Upload
POST /api/chat/upload-image
Content-Type: multipart/form-data
Body:
image: <file>,
conversation_id: <uuid>
Response 200:
{
"id": "<image_id>",
"url": "<cdn_url>",
"size": 1024,
"mime_type": "image/png"
}
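A client sends the image as multipart/form-data. A stdlib-only sketch of building that request body (boundary and part framing only; field names follow the spec above, and a real client would pass this body plus the returned Content-Type header to its HTTP library):

```python
import io
import uuid

def build_multipart(image_bytes: bytes, conversation_id: str, filename: str = "photo.png"):
    """Build a multipart/form-data body for POST /api/chat/upload-image.

    Returns (body_bytes, content_type_header). Minimal sketch: no
    streaming, no escaping of exotic filenames.
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()

    def part(headers: str, payload: bytes):
        # Each part: boundary line, headers, blank line, payload
        buf.write(f"--{boundary}\r\n{headers}\r\n\r\n".encode())
        buf.write(payload + b"\r\n")

    part(f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
         "Content-Type: image/png", image_bytes)
    part('Content-Disposition: form-data; name="conversation_id"',
         conversation_id.encode())
    buf.write(f"--{boundary}--\r\n".encode())  # closing boundary
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"
```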
Analyze Image
POST /api/chat/analyze-image
Content-Type: application/json
Body:
{
"image_id": "<image_id>",
"context": "What is this?"
}
Response 200:
{
"description": "A diagram showing...",
"objects": ["object1", "object2"],
"text": "Any text found",
"analysis": "Detailed analysis"
}
Gesture Event (via WebSocket)
// Client sends
{
"type": "component_event",
"conversation_id": "<uuid>",
"component_id": "gesture_detector",
"event": "gesture_detected",
"data": {
"gesture": "thumbs_up",
"confidence": 0.95
}
}
// Server responds
{
"type": "assistant_response",
"content": "Got it! Approving that for you."
}
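Server-side, the gesture event is mapped to an orchestrator action before the natural-language reply is generated. A hypothetical sketch of that dispatch table — `thumbs_up = approve_pending` follows Phase 3p.3; the other action names and the fallback reply are assumptions, not an existing API:

```python
# Hypothetical gesture-to-action table; only thumbs_up's mapping is
# stated in the roadmap, the rest are illustrative.
GESTURE_ACTIONS = {
    "thumbs_up": "approve_pending",
    "thumbs_down": "reject_pending",
    "wave": "greet",
    "open_palm": "pause_response",
    "nod": "confirm",
    "shake": "decline",
}

def handle_gesture_event(event: dict) -> dict:
    """Turn a component_event payload into an assistant response + action."""
    gesture = event.get("data", {}).get("gesture")
    action = GESTURE_ACTIONS.get(gesture)
    if action is None:
        # Unknown gesture: respond conversationally, trigger nothing
        return {"type": "assistant_response",
                "content": "I noticed a gesture but I'm not sure what you meant."}
    content = ("Got it! Approving that for you."
               if action == "approve_pending"
               else f"Gesture received: {gesture}.")
    return {"type": "assistant_response", "content": content, "action": action}
```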
Testing Strategy
Frontend Unit Tests
// @vitest
describe('CameraCapture', () => {
  it('should request camera permission on mount', async () => {
    // Mock getUserMedia
  });

  it('should capture frame on button click', async () => {
    // Verify frame is returned as Blob
  });
});

describe('GestureDetector', () => {
  it('should detect thumbs up with >90% confidence', async () => {
    // Load test video, run detector
  });

  it('should emit gesture event on high confidence', async () => {
    // Verify event dispatched
  });
});
Backend Integration Tests
@pytest.mark.asyncio
async def test_analyze_image_with_vision():
    # Upload test image
    # Call analyze endpoint
    # Verify Claude vision was called
    # Verify metadata returned
    ...

@pytest.mark.asyncio
async def test_gesture_event_permission_check():
    # User without gesture permission
    # POST gesture event
    # Expect 403 Forbidden
    ...
E2E Tests (Playwright)
test('User uploads image and asks Morphee to analyze it', async ({ page }) => {
  // Login
  // Open chat
  // Click camera button
  // Select test image
  // Morphee analyzes and responds
});

test('User approves pending task via thumbs up gesture', async ({ page }) => {
  // Morphee asks for approval
  // User performs thumbs up gesture
  // Task is approved
});
Success Metrics
| Metric | Target | How to Measure |
|---|---|---|
| Camera permission grant rate | >80% | Analytics: permission requests vs grants |
| Image upload frequency | >20% of users use weekly | Usage analytics |
| Gesture detection accuracy | >90% | E2E tests + user feedback |
| Image analysis latency | <2s | CloudWatch metrics |
| User satisfaction | >4.2/5 | In-app survey post-analysis |
| Accessibility impact | 50% increase in non-text interactions | Interaction logs |
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Camera permission denial | Users can't use camera feature | Graceful degradation; image upload still works |
| Vision API rate limits | Analysis delayed/failed | Queue long-running analyses; cache results |
| Gesture false positives | Accidental approvals | High confidence threshold (>90%); confirmation prompt |
| Privacy concerns | User trust erosion | Clear policy; local processing where possible; transparency |
| Browser support | Limited to modern browsers | Feature detect; fallback to image upload |
| Mobile performance | High CPU/battery drain | Optimize frame processing; reduce frequency; offer toggle |
Future Enhancements (Phase 3p+)
- Handwriting recognition as notes — "Take a picture of my note" → stored in memory
- Gesture learning — Users can teach Morphee custom gestures
- Multi-gesture combos — Detect two-hand gestures for complex commands
- Real-time video chat — WebRTC peer connection to Morphee (requires cloud liveness detection)
- Vision-based authentication — Gesture or face for step-up auth (ties to Phase 3o)
- Whiteboard collaboration — Multiple users draw + Morphee joins as text/voice layer
- Augmented reality — AR overlay showing Morphee's response in real world (AR glasses era)
References
- TensorFlow.js Pose Detection: https://github.com/tensorflow/tfjs-models/tree/master/pose-detection
- Claude Vision API: https://docs.anthropic.com/en/api/vision
- Web Permissions API: https://developer.mozilla.org/en-US/docs/Web/API/Permissions_API
- Tauri Camera Plugin: https://docs.tauri.app/features/system-tray/ (future)
Last Updated: 2026-02-20
Status: Ready for Planning Phase (V1.3)