Video Interaction & Visual Recognition

Status: PLANNED (V1.3 — Multimodal Interaction)
Created: 2026-02-15
Effort: Large (3-4 weeks)
Dependencies: Multi-Modal Identity (V1.0-V1.3)


Vision

Morphee is chat-first, but chat isn't the only way humans communicate. Adding video capability opens new interaction modalities:

  • Gesture-based interactions — Thumbs up to approve, wave to acknowledge, point to select
  • Visual context — Show pictures, diagrams, documents for discussion
  • Handwriting recognition — Write and have Morphee transcribe/interpret
  • Visual analysis — Upload a photo and discuss what's in it
  • Screen sharing (desktop) — Share what you're looking at for context
  • Accessibility — Kids, elderly, and non-vocal users can interact through gesture/camera

This phase transforms Morphee from a text-centric agent into a multimodal agent that understands visual input and responds to gestures.

Connection to Knowledge Pipeline: Gesture recognition models (TensorFlow.js pose detection) are candidates for the compilation chain — rule-based gesture classifiers start as Level 1 (PythonRuntime), move to Level 2 (JSRuntime) for real-time browser processing, and could be compiled to Level 3 (WasmRuntime) for maximum performance. See ROADMAP.md — BaseMorphRuntime. Also connects to Multi-Modal Identity for gesture-based authentication.


Problem Statement

Current Limitations

  1. Text-only interaction — Users must type or use voice; no visual context sharing
  2. Static approvals — Approving tasks requires clicking buttons; no gesture-based flow
  3. Image-based workflows missing — No way to show a picture and discuss it
  4. Handwriting ignored — Kids drawing or writing isn't accessible to Morphee
  5. Accessibility gap — Non-vocal, physically limited users have fewer interaction options
  6. No visual memory — Can't easily store and recall visual information (photos, diagrams)

Opportunities

  • Richer family interactions — Kids show drawings, Morphee creates memories; parents use gestures
  • Professional context — Developers share error screenshots; managers share diagrams
  • Educational — Teachers show diagrams, Morphee generates questions
  • Accessibility-first — Voice + gesture + visual cues enable multimodal communication

Design

1. Camera Input & Visual Capture

User Permissions

  • On first use, the browser requests camera permission (standard Web API); see the sketch after this list
  • Tauri native dialogs on desktop (more polished)
  • Permission stored in ACL system (granular control per user/group)
  • Can be revoked in Settings → Privacy
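
A minimal sketch of the browser-side permission request, assuming a hypothetical helper that the CameraCapture component consults (the ACL and Settings wiring is omitted):

// frontend/src/lib/camera-permission.ts (hypothetical helper)
export async function requestCameraPermission(): Promise<boolean> {
  // Feature-detect first; older browsers fall back to image upload only
  if (!navigator.mediaDevices?.getUserMedia) return false;
  try {
    // Triggers the browser permission prompt on first use
    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    // Only the permission decision is needed here, so release the camera right away
    stream.getTracks().forEach((track) => track.stop());
    return true;
  } catch {
    return false; // Permission denied or no camera available
  }
}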

Camera Component

// frontend/src/components/chat/CameraCapture.tsx
// frontend/src/components/chat/CameraCapture.tsx
<CameraCapture
  onCapture={(frame: Blob) => {
    // Send to Morphee for analysis or just display
    uploadImage(frame);
  }}
  disabled={!hasPermission}
  previewSize="medium"
/>

Features:

  • Live camera preview (mobile friendly)
  • Capture button (takes single frame)
  • Video mode (record short clip, up to 10s)
  • Flash/light indicator (for low-light)
  • Switch camera (front/back on mobile)

Image Upload Component

// frontend/src/components/chat/ImageUploader.tsx
// frontend/src/components/chat/ImageUploader.tsx
<ImageUploader
  onUpload={(files: File[]) => {
    // Handle single or multiple images
    uploadImages(files);
  }}
  acceptTypes={['.jpg', '.png', '.gif', '.webp']}
  maxSize={10 * 1024 * 1024} // 10 MB
/>

Features:

  • Drag-and-drop zone
  • Click to browse files
  • Gallery picker (on mobile)
  • Multi-file upload
  • Progress indicator

Display in Chat

// Show captured/uploaded images inline
<ChatBubble
  message="Here's the diagram I want you to explain"
  images={[
    { url: '/uploads/diagram.png', alt: 'Architecture diagram', width: 300 }
  ]}
/>

2. Gesture Recognition

Detection Pipeline

Live video frame
    ↓
[TensorFlow.js Pose Detection]
    ↓
Extract hand/head landmarks
    ↓
[Gesture Classification] (rule-based or ML model)
    ↓
Semantic gesture (thumbs_up, wave, point, open_palm)
    ↓
Trigger AI response

Supported Gestures

| Gesture | Hands | Head | Meaning | AI Response |
|---|---|---|---|---|
| Thumbs up | Both thumbs up | — | Approval, positive | Approve pending action |
| Thumbs down | Both thumbs down | — | Disapproval, negative | Reject pending action |
| Wave | Hand moving side-to-side | — | Hello, acknowledge | "Hello! How can I help?" |
| Point | One hand pointing | Head facing that direction | Select, indicate | "I'll help with that" |
| Open palm | Both palms open at shoulders | — | Stop, pause | Pause AI response |
| Nod | — | Up/down motion | Yes, agreement | Confirm action |
| Shake | — | Side-to-side motion | No, disagreement | "Understood, I'll do it differently" |
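
A sketch of how these gestures could map to frontend actions; apart from thumbs_up → approve_pending (named later in this plan), the action identifiers are placeholders:

// frontend/src/lib/gesture-actions.ts (sketch; action names other than approve_pending are placeholders)
export type Gesture =
  | 'thumbs_up'
  | 'thumbs_down'
  | 'wave'
  | 'point'
  | 'open_palm'
  | 'nod'
  | 'shake';

export const GESTURE_ACTIONS: Record<Gesture, string> = {
  thumbs_up: 'approve_pending',   // approve the pending action
  thumbs_down: 'reject_pending',  // reject the pending action
  wave: 'greet',                  // hello / acknowledge
  point: 'select',                // select or indicate
  open_palm: 'pause_response',    // stop / pause the AI response
  nod: 'confirm',                 // yes / agreement
  shake: 'decline',               // no / ask for an alternative
};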

Implementation

Tier 1: Rule-Based (JavaScript)

// frontend/src/lib/gesture-detector.ts
export interface GestureResult {
  gesture: 'thumbs_up' | 'thumbs_down' | 'wave' | 'point' | 'nod' | 'shake' | null;
  confidence: number; // 0-1
  timestamp: number;
}

async function detectGesture(videoFrame: HTMLVideoElement): Promise<GestureResult> {
  const poses = await poseDetector.estimatePoses(videoFrame);
  if (!poses.length) return { gesture: null, confidence: 0, timestamp: Date.now() };

  const landmarks = poses[0].landmarks;

  // Thumbs up: both thumbs above fists
  if (landmarks.leftThumb.y < landmarks.leftIndexFinger.y &&
      landmarks.rightThumb.y < landmarks.rightIndexFinger.y) {
    return { gesture: 'thumbs_up', confidence: 0.9, timestamp: Date.now() };
  }

  // Nod: head moving up/down
  if (headYMovement > threshold) {
    return { gesture: 'nod', confidence: 0.85, timestamp: Date.now() };
  }

  // ... more gesture rules

  return { gesture: null, confidence: 0, timestamp: Date.now() };
}
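
A sketch of initialising the detector and running a capture loop in the same module. The @tensorflow-models/pose-detection package and the MoveNet model are assumptions; the plan only specifies TensorFlow.js pose detection:

// frontend/src/lib/gesture-detector.ts (continued — sketch; package and model are assumptions)
import * as poseDetection from '@tensorflow-models/pose-detection';
import '@tensorflow/tfjs';

// Module-level detector read by detectGesture() above
let poseDetector: poseDetection.PoseDetector;

export async function startGestureLoop(
  video: HTMLVideoElement,
  onGesture: (result: GestureResult) => void,
): Promise<() => void> {
  // One-time model load; MoveNet is a lightweight real-time option
  poseDetector = await poseDetection.createDetector(
    poseDetection.SupportedModels.MoveNet,
  );

  let running = true;
  const tick = async () => {
    if (!running) return;
    const result = await detectGesture(video); // rule-based classifier above
    if (result.gesture) onGesture(result);
    setTimeout(tick, 200); // ~5 fps limits CPU/battery cost on mobile
  };
  void tick();

  return () => { running = false; }; // caller stops the loop with this
}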

Tier 2: ML Model (Optional Future)

  • Train small gesture classification model (TFLite)
  • More accurate than rule-based, but requires training data
  • Local inference in browser (no cloud dependency)

Gesture Event Flow

User performs gesture
    ↓
[detectGesture] → gesture type + confidence
    ↓
Gesture confidence > threshold?
├─ YES → Emit event to chat
│        (show "Thumbs up detected" toast)
│        ↓
│        Post to /api/chat/component-event
│        ↓
│        AI receives → responds naturally
│
└─ NO → Ignore (too low confidence)
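
A sketch of the dispatch step, posting to /api/chat/component-event with the payload shape shown in the API Reference below; the threshold and cooldown values are illustrative:

// frontend/src/lib/gesture-events.ts (sketch; threshold and cooldown are illustrative)
import type { GestureResult } from './gesture-detector';

const CONFIDENCE_THRESHOLD = 0.9;
const COOLDOWN_MS = 2000;
let lastSentAt = 0;

export async function dispatchGestureEvent(
  conversationId: string,
  result: GestureResult,
): Promise<void> {
  if (!result.gesture || result.confidence < CONFIDENCE_THRESHOLD) return; // ignore weak detections
  if (Date.now() - lastSentAt < COOLDOWN_MS) return; // debounce repeated frames of the same gesture
  lastSentAt = Date.now();

  await fetch('/api/chat/component-event', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      conversation_id: conversationId,
      component_id: 'gesture_detector',
      event: 'gesture_detected',
      data: { gesture: result.gesture, confidence: result.confidence },
    }),
  });
}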

3. Visual Analysis

Backend: VisualIntegration

# backend/interfaces/integrations/visual.py

class VisualIntegration(BaseInterface):
    """Analyze images, video frames, and visual content."""

    async def analyze_image(self, image_data: str, context: str = None):
        """
        Analyze an image using Claude's vision capabilities.

        Args:
            image_data: base64-encoded image
            context: optional user question/context

        Returns:
            {
                "description": "what's in the image",
                "objects": ["list", "of", "detected", "objects"],
                "text": "any text found (OCR)",
                "analysis": "detailed analysis based on context"
            }
        """
        # Call Claude API with vision
        response = await self.llm.call_with_vision(
            message=f"Analyze this image{f': {context}' if context else ''}",
            image_base64=image_data
        )

        return {
            "description": response.description,
            "objects": response.detected_objects,
            "text": response.ocr_text,
            "analysis": response.analysis
        }

    async def recognize_gesture(self, video_frame: str):
        """
        Recognize gestures in a video frame.

        Note: This is mostly frontend work; backend can optionally
        do additional analysis.

        Args:
            video_frame: base64-encoded frame

        Returns:
            {
                "gesture": "thumbs_up|thumbs_down|wave|nod|...",
                "confidence": 0.95,
                "description": "User gave thumbs up"
            }
        """
        # Optional: use Claude vision for additional gesture analysis
        # if frontend detection is uncertain
        pass

    async def transcribe_handwriting(self, image_data: str):
        """
        Transcribe handwriting from an image using OCR + Claude vision.

        Args:
            image_data: base64-encoded image with handwriting

        Returns:
            {
                "text": "transcribed text",
                "confidence": 0.92,
                "structured_data": {...}  # if it's a form/list
            }
        """
        response = await self.llm.call_with_vision(
            message="Transcribe this handwriting precisely",
            image_base64=image_data
        )

        return {
            "text": response.text,
            "confidence": response.confidence,
            "structured_data": response.parsed_structure
        }

    async def interpret_diagram(self, image_data: str):
        """
        Interpret diagrams (flowcharts, architecture, whiteboard sketches).

        Args:
            image_data: base64-encoded diagram

        Returns:
            {
                "type": "flowchart|architecture|mindmap|...",
                "description": "what the diagram shows",
                "elements": ["list", "of", "components"],
                "relationships": [
                    {"from": "A", "to": "B", "relation": "connects_to"}
                ],
                "explanation": "detailed explanation"
            }
        """
        response = await self.llm.call_with_vision(
            message="Explain this diagram in detail. List components and relationships.",
            image_base64=image_data
        )

        return {
            "type": response.diagram_type,
            "description": response.summary,
            "elements": response.components,
            "relationships": response.relationships,
            "explanation": response.detailed_explanation
        }

Frontend: Image Display & Analysis

// Show image in chat with analysis button
<ChatBubble>
  <div className="space-y-2">
    <img src={imageUrl} alt="User uploaded image" />
    <button onClick={() => analyzeImage(imageUrl)}>
      Analyze this image
    </button>
  </div>
</ChatBubble>

// After analysis
<ChatBubble role="assistant">
  <p>{analysis.description}</p>
  <ul>
    {analysis.objects.map(obj => <li key={obj}>{obj}</li>)}
  </ul>
</ChatBubble>
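
A sketch of the analyzeImage helper used above, assuming it posts to the /api/chat/analyze-image endpoint from the API Reference (whether it receives an upload id or a URL is an open detail; an id is assumed here):

// frontend/src/lib/analyze-image.ts (sketch)
export interface ImageAnalysis {
  description: string;
  objects: string[];
  text: string;
  analysis: string;
}

export async function analyzeImage(imageId: string, context?: string): Promise<ImageAnalysis> {
  const res = await fetch('/api/chat/analyze-image', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image_id: imageId, context }),
  });
  if (!res.ok) throw new Error(`Analysis failed: ${res.status}`);
  return res.json(); // { description, objects, text, analysis }
}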

4. Screen Sharing (Desktop/Tauri)

Use Case

"Show me what you're seeing" — User shares a portion of their desktop for context.

Implementation

// Tauri command
const frame = await invoke('screen_capture', {
  captureArea: { x: 0, y: 0, width: 1920, height: 1080 }
});

// Send to Morphee
await api.analyzeImage(frame, "What's on my screen?");

UI

<div className="flex gap-2">
  <Button onClick={() => captureFullScreen()}>
    <Share2 className="w-4 h-4" /> Share screen
  </Button>
  <Button onClick={() => captureRegion()}>
    <Crop className="w-4 h-4" /> Share region
  </Button>
</div>

5. Visual Memory

Memory Integration Extension

Add new MemoryIntegration actions:

# In backend/interfaces/integrations/memory.py

async def store_visual(self, image_data: str, caption: str, memory_type: str):
    """
    Store an image in memory with optional caption/tags.

    Args:
        image_data: base64-encoded image
        caption: user description
        memory_type: "photo|diagram|screenshot|drawing|..."
    """
    # Generate embedding from image description
    description = await visual_integration.analyze_image(image_data)
    embedding = await embeddings.embed(description['description'])

    # Store both the image (in Git) and embedding (in pgvector)
    await vector_store.insert(
        content=caption or description['description'],
        embedding=embedding,
        type='visual_memory',
        metadata={
            'image_type': memory_type,
            'objects': description['objects'],
            'stored_at': datetime.now().isoformat()
        }
    )

    # Save image file to Git
    await git_store.save_memory(
        filename=f"images/{uuid4()}.b64",
        content=f"---\ntype: visual\ncaption: {caption}\n---\n{image_data}"
    )

Frontend: Quick Save

<Button onClick={() => saveToMemory(imageUrl, caption)}>
  <Save className="w-4 h-4" /> Save to memory
</Button>
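
A minimal sketch of saveToMemory, assuming a hypothetical /api/chat/save-visual-memory route that forwards to store_visual; the plan only specifies the backend action, so the route name and payload are placeholders:

// frontend/src/lib/visual-memory.ts (sketch; route name and payload are placeholders)
export async function saveToMemory(imageRef: string, caption: string): Promise<void> {
  const res = await fetch('/api/chat/save-visual-memory', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      image: imageRef,       // URL or id of a previously uploaded image
      caption,               // user-provided description
      memory_type: 'photo',  // "photo|diagram|screenshot|drawing|..."
    }),
  });
  if (!res.ok) throw new Error(`Save to memory failed: ${res.status}`);
}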

Implementation Timeline

Phase 3p.1 — Basic Camera & Image Upload (2 weeks)

Backend:

  • Permission model in ACL system (camera, image_upload)
  • Image upload endpoint: POST /api/chat/upload-image (multipart/form-data)
  • Image storage: temporary AWS S3 or local fs (sandboxed per group)
  • URL generation for display in chat

Frontend:

  • CameraCapture component (browser camera API)
  • ImageUploader component (drag-drop, file picker)
  • ImageMessage rendering in ChatBubble
  • Permission request flow + Settings toggle
  • Upload progress indicator

Tauri (Optional for Phase 3p.1):

  • Native camera access (macOS AVFoundation, Windows MediaCapture, Linux V4L2)
  • Permission dialog (native vs browser)

Tests:

  • Frontend: camera permission, upload flow, image display
  • Backend: permission check, image storage, retrieval
  • E2E: upload image, see in chat

Phase 3p.2 — Visual Analysis (2 weeks)

Backend:

  • VisualIntegration(BaseInterface) with 4 actions
  • Claude vision integration (anthropic-sdk vision calls)
  • Image caching (store analysis metadata, not images)
  • Error handling (invalid images, API failures)

Frontend:

  • "Analyze image" button in ChatBubble
  • Display analysis results inline
  • Follow-up questions for images ("Tell me more about...")
  • Image analysis loading state

Tests:

  • Backend: vision API calls, error cases, metadata caching
  • Frontend: analyze button, result display
  • E2E: upload, analyze, discuss results

Phase 3p.3 — Gesture Recognition (2 weeks)

Frontend:

  • TensorFlow.js pose detection (load model, init)
  • GestureDetector component (continuous frame analysis)
  • GestureRecognizer logic (rule-based gesture classification)
  • Gesture event dispatch (component-event endpoint)
  • Visual feedback (toast: "Thumbs up detected!")

Backend:

  • Handle gesture events in chat orchestrator
  • Natural language response to gestures ("Got it! I'm on it.")
  • Gesture → action mapping (thumbs_up = approve_pending)

Tauri:

  • Camera frame streaming to frontend (optional optimization)

Tests:

  • Frontend: pose detection, gesture classification, event dispatch
  • Backend: gesture event handling, action mapping
  • E2E: perform gesture, see AI response

Phase 3p.4 — Polish & Accessibility (1 week)

Frontend:

  • Voice feedback for gestures ("I detected your thumbs up")
  • High-contrast visual indicators (red/green borders for approval)
  • Gesture customization in Settings
  • Gesture log (history of detected gestures)
  • Gesture-only mode (no buttons, all interaction via gesture)

Tauri (Desktop):

  • screen_capture command (full screen or region)
  • Screen share UI

Tests:

  • Voice feedback for screen readers
  • Gesture-only workflow (approval without clicking)
  • Accessibility: keyboard nav, screen reader compat

Data & Privacy

What Data Is Captured?

| Data | Captured | Stored | Retained | Deleted |
|---|---|---|---|---|
| Camera frames | Yes (temporary) | No | Session only | On close |
| Uploaded images | Yes | Temporarily (S3) | 24h max | Auto after 24h |
| Image analysis | Yes | As metadata | Per memory policy | With memory |
| Gesture frames | No (only landmarks) | No | Session only | On close |
| Gesture log | Yes | Optional (user consent) | Per user setting | On request |

Privacy Safeguards

  1. No raw image storage — Only analyzed metadata + captions stored
  2. Explicit consent — Camera/image upload requires user permission
  3. GDPR compliance — User can delete all images/metadata via data export/deletion
  4. Local processing — Gesture detection happens in browser (no server-side video processing)
  5. Opt-in gesture logging — Users can disable gesture history
  6. Biometric template not stored — Face landmarks extracted during pose detection are NOT stored

Privacy Policy Updates

Add to PRIVACY_POLICY.md:

## Visual Data Processing

When you use Morphee's camera or image upload features:

- **Images you upload** are analyzed by Claude's vision API and deleted within 24 hours
- **Your camera** is only accessed with your explicit permission
- **Gesture data** is processed locally in your browser; no gestures are stored unless you enable gesture logging
- **Analysis results** (descriptions, extracted text) are stored only if you save them to memory
- **Screen captures** (desktop only) are temporary and never stored server-side

You can disable camera/image features in Settings → Privacy at any time.

Database Schema

-- visual_interactions: log of user visual interactions
CREATE TABLE visual_interactions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES morphee_users(id) ON DELETE CASCADE,
    group_id UUID NOT NULL REFERENCES groups(id) ON DELETE CASCADE,
    interaction_type TEXT NOT NULL, -- 'image_upload', 'gesture_detected', 'analysis'
    image_url TEXT,
    gesture_type TEXT,
    gesture_confidence NUMERIC,
    metadata JSONB,
    created_at TIMESTAMP NOT NULL DEFAULT now()
);

-- gesture_preferences: custom gesture bindings per user
CREATE TABLE gesture_preferences (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES morphee_users(id) ON DELETE CASCADE,
    gesture_type TEXT NOT NULL, -- 'thumbs_up', 'nod', etc.
    custom_action TEXT, -- optional custom mapping
    enabled BOOLEAN DEFAULT true,
    created_at TIMESTAMP NOT NULL DEFAULT now(),
    UNIQUE(user_id, gesture_type)
);

-- Enable gesture logging in ACL permissions
-- acl_preferences: add 'gesture_logging' boolean field

API Reference

Image Upload

POST /api/chat/upload-image
Content-Type: multipart/form-data

Body:
  image: <file>
  conversation_id: <uuid>

Response 200:
{
  "id": "<image_id>",
  "url": "<cdn_url>",
  "size": 1024,
  "mime_type": "image/png"
}
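
A minimal client-side sketch of this call using FormData (the helper name is illustrative):

// frontend/src/lib/upload-image.ts (sketch)
export interface UploadedImage {
  id: string;
  url: string;
  size: number;
  mime_type: string;
}

export async function uploadImage(image: Blob, conversationId: string): Promise<UploadedImage> {
  const form = new FormData();
  form.append('image', image);                    // the captured frame or selected file
  form.append('conversation_id', conversationId);

  // The browser sets the multipart boundary; don't set Content-Type manually
  const res = await fetch('/api/chat/upload-image', { method: 'POST', body: form });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  return res.json();
}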

Analyze Image

POST /api/chat/analyze-image
Content-Type: application/json

Body:
{
  "image_id": "<image_id>",
  "context": "What is this?"
}

Response 200:
{
  "description": "A diagram showing...",
  "objects": ["object1", "object2"],
  "text": "Any text found",
  "analysis": "Detailed analysis"
}

Gesture Event (via WebSocket)

// Client sends
{
  "type": "component_event",
  "conversation_id": "<uuid>",
  "component_id": "gesture_detector",
  "event": "gesture_detected",
  "data": {
    "gesture": "thumbs_up",
    "confidence": 0.95
  }
}

// Server responds
{
  "type": "assistant_response",
  "content": "Got it! Approving that for you."
}

Testing Strategy

Frontend Unit Tests

// @vitest
describe('CameraCapture', () => {
  it('should request camera permission on mount', async () => {
    // Mock getUserMedia
  });

  it('should capture frame on button click', async () => {
    // Verify frame is returned as Blob
  });
});

describe('GestureDetector', () => {
  it('should detect thumbs up with >90% confidence', async () => {
    // Load test video, run detector
  });

  it('should emit gesture event on high confidence', async () => {
    // Verify event dispatched
  });
});
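
A sketch of the getUserMedia mock for the first test, using Vitest's vi.stubGlobal; the fake stream exposes only what the component needs:

// In the CameraCapture test setup (sketch)
import { vi, beforeEach } from 'vitest';

beforeEach(() => {
  const fakeStream = {
    getTracks: () => [{ stop: vi.fn() }], // just enough surface for cleanup on unmount
  } as unknown as MediaStream;

  vi.stubGlobal('navigator', {
    ...navigator,
    mediaDevices: {
      getUserMedia: vi.fn().mockResolvedValue(fakeStream), // grant permission by default
    },
  });
});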

Backend Integration Tests

@pytest.mark.asyncio
async def test_analyze_image_with_vision():
    # Upload test image
    # Call analyze endpoint
    # Verify Claude vision was called
    # Verify metadata returned
    ...


@pytest.mark.asyncio
async def test_gesture_event_permission_check():
    # User without gesture permission
    # POST gesture event
    # Expect 403 Forbidden
    ...

E2E Tests (Playwright)

test('User uploads image and asks Morphee to analyze it', async ({ page }) => {
  // Login
  // Open chat
  // Click camera button
  // Select test image
  // Morphee analyzes and responds
});

test('User approves pending task via thumbs up gesture', async ({ page }) => {
  // Morphee asks for approval
  // User performs thumbs up gesture
  // Task is approved
});

Success Metrics

| Metric | Target | How to Measure |
|---|---|---|
| Camera permission grant rate | >80% | Analytics: permission requests vs grants |
| Image upload frequency | >20% of users use weekly | Usage analytics |
| Gesture detection accuracy | >90% | E2E tests + user feedback |
| Image analysis latency | <2s | CloudWatch metrics |
| User satisfaction | >4.2/5 | In-app survey post-analysis |
| Accessibility impact | 50% increase in non-text interactions | Interaction logs |

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Camera permission denial | Users can't use camera feature | Graceful degradation; image upload still works |
| Vision API rate limits | Analysis delayed/failed | Queue long-running analyses; cache results |
| Gesture false positives | Accidental approvals | High confidence threshold (>90%); confirmation prompt |
| Privacy concerns | User trust erosion | Clear policy; local processing where possible; transparency |
| Browser support | Limited to modern browsers | Feature detect; fallback to image upload |
| Mobile performance | High CPU/battery drain | Optimize frame processing; reduce frequency; offer toggle |

Future Enhancements (Phase 3p+)

  • Handwriting recognition as notes — "Take a picture of my note" → stored in memory
  • Gesture learning — Users can teach Morphee custom gestures
  • Multi-gesture combos — Detect two-hand gestures for complex commands
  • Real-time video chat — WebRTC peer connection to Morphee (requires cloud liveness detection)
  • Vision-based authentication — Gesture or face for step-up auth (ties to Phase 3o)
  • Whiteboard collaboration — Multiple users draw + Morphee joins as text/voice layer
  • Augmented reality — AR overlay showing Morphee's response in real world (AR glasses era)

Last Updated: 2026-02-20 Status: Ready for Planning Phase (V1.3)