Video Interaction & Visual Recognition

Status: PLANNED (V1.3 — Multimodal Interaction)
Created: 2026-02-15
Effort: Large (3-4 weeks)
Dependencies: Multi-Modal Identity (V1.0-V1.3)


Vision

Morphee is chat-first, but chat isn't the only way humans communicate. Adding video capability opens new interaction modalities:

  • Gesture-based interactions — Thumbs up to approve, wave to acknowledge, point to select
  • Visual context — Show pictures, diagrams, documents for discussion
  • Handwriting recognition — Write and have Morphee transcribe/interpret
  • Visual analysis — Upload a photo and discuss what's in it
  • Screen sharing (desktop) — Share what you're looking at for context
  • Accessibility — Kids, elderly, and non-vocal users can interact through gesture/camera

This phase transforms Morphee from a text-centric agent into a multimodal agent that understands visual input and responds to gestures.

Connection to Knowledge Pipeline: Gesture recognition models (TensorFlow.js pose detection) are candidates for the compilation chain — rule-based gesture classifiers start as Level 1 (PythonRuntime), move to Level 2 (JSRuntime) for real-time browser processing, and could be compiled to Level 3 (WasmRuntime) for maximum performance. See ROADMAP.md — BaseMorphRuntime. Also connects to Multi-Modal Identity for gesture-based authentication.


Problem Statement

Current Limitations

  1. Text-only interaction — Users must type or use voice; no visual context sharing
  2. Static approvals — Approving tasks requires clicking buttons; no gesture-based flow
  3. Image-based workflows missing — No way to show a picture and discuss it
  4. Handwriting ignored — Kids drawing or writing isn't accessible to Morphee
  5. Accessibility gap — Non-vocal, physically limited users have fewer interaction options
  6. No visual memory — Can't easily store and recall visual information (photos, diagrams)

Opportunities

  • Richer family interactions — Kids show drawings, Morphee creates memories; parents use gestures
  • Professional context — Developers share error screenshots; managers share diagrams
  • Educational — Teachers show diagrams, Morphee generates questions
  • Accessibility-first — Voice + gesture + visual cues enable multimodal communication

Design

1. Camera Input & Visual Capture

User Permissions

  • On first use, the browser requests camera permission (standard Web API); see the sketch after this list
  • Tauri native dialogs on desktop (more polished)
  • Permission stored in ACL system (granular control per user/group)
  • Can be revoked in Settings → Privacy
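
A minimal sketch of the browser-side permission request, assuming a hypothetical helper that the CameraCapture component consults (the ACL and Settings wiring is omitted):

// frontend/src/lib/camera-permission.ts (hypothetical helper)
export async function requestCameraPermission(): Promise<boolean> {
  // Feature-detect first; older browsers fall back to image upload only
  if (!navigator.mediaDevices?.getUserMedia) return false;
  try {
    // Triggers the browser permission prompt on first use
    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    // Only the permission decision is needed here, so release the camera right away
    stream.getTracks().forEach((track) => track.stop());
    return true;
  } catch {
    return false; // Permission denied or no camera available
  }
}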

Camera Component

// frontend/src/components/chat/CameraCapture.tsx
// frontend/src/components/chat/CameraCapture.tsx
<CameraCapture
  onCapture={(frame: Blob) => {
    // Send to Morphee for analysis or just display
    uploadImage(frame);
  }}
  disabled={!hasPermission}
  previewSize="medium"
/>

Features:

  • Live camera preview (mobile friendly)
  • Capture button (takes single frame)
  • Video mode (record short clip, up to 10s)
  • Flash/light indicator (for low-light)
  • Switch camera (front/back on mobile)

Image Upload Component

// frontend/src/components/chat/ImageUploader.tsx
// frontend/src/components/chat/ImageUploader.tsx
<ImageUploader
  onUpload={(files: File[]) => {
    // Handle single or multiple images
    uploadImages(files);
  }}
  acceptTypes={['.jpg', '.png', '.gif', '.webp']}
  maxSize={10 * 1024 * 1024} // 10 MB
/>

Features:

  • Drag-and-drop zone
  • Click to browse files
  • Gallery picker (on mobile)
  • Multi-file upload
  • Progress indicator

Display in Chat

// Show captured/uploaded images inline
<ChatBubble
  message="Here's the diagram I want you to explain"
  images={[
    { url: '/uploads/diagram.png', alt: 'Architecture diagram', width: 300 }
  ]}
/>

2. Gesture Recognition

Detection Pipeline

Live video frame
    ↓
[TensorFlow.js Pose Detection]
    ↓
Extract hand/head landmarks
    ↓
[Gesture Classification] (rule-based or ML model)
    ↓
Semantic gesture (thumbs_up, wave, point, open_palm)
    ↓
Trigger AI response

Supported Gestures

| Gesture | Hands | Head | Meaning | AI Response |
|---|---|---|---|---|
| Thumbs up | Both thumbs up | — | Approval, positive | Approve pending action |
| Thumbs down | Both thumbs down | — | Disapproval, negative | Reject pending action |
| Wave | Hand moving side-to-side | — | Hello, acknowledge | "Hello! How can I help?" |
| Point | One hand pointing | Head facing that direction | Select, indicate | "I'll help with that" |
| Open palm | Both palms open at shoulders | — | Stop, pause | Pause AI response |
| Nod | — | Up/down motion | Yes, agreement | Confirm action |
| Shake | — | Side-to-side motion | No, disagreement | "Understood, I'll do it differently" |
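
A sketch of how these gestures could map to frontend actions; apart from thumbs_up → approve_pending (named later in this plan), the action identifiers are placeholders:

// frontend/src/lib/gesture-actions.ts (sketch; action names other than approve_pending are placeholders)
export type Gesture =
  | 'thumbs_up'
  | 'thumbs_down'
  | 'wave'
  | 'point'
  | 'open_palm'
  | 'nod'
  | 'shake';

export const GESTURE_ACTIONS: Record<Gesture, string> = {
  thumbs_up: 'approve_pending',   // approve the pending action
  thumbs_down: 'reject_pending',  // reject the pending action
  wave: 'greet',                  // hello / acknowledge
  point: 'select',                // select or indicate
  open_palm: 'pause_response',    // stop / pause the AI response
  nod: 'confirm',                 // yes / agreement
  shake: 'decline',               // no / ask for an alternative
};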

Implementation

Tier 1: Rule-Based (JavaScript)

// frontend/src/lib/gesture-detector.ts
export interface GestureResult {
  gesture: 'thumbs_up' | 'thumbs_down' | 'wave' | 'point' | 'nod' | 'shake' | null;
  confidence: number; // 0-1
  timestamp: number;
}

async function detectGesture(videoFrame: HTMLVideoElement): Promise<GestureResult> {
  const poses = await poseDetector.estimatePoses(videoFrame);
  if (!poses.length) return { gesture: null, confidence: 0, timestamp: Date.now() };

  const landmarks = poses[0].landmarks;

  // Thumbs up: both thumbs above fists
  if (landmarks.leftThumb.y < landmarks.leftIndexFinger.y &&
      landmarks.rightThumb.y < landmarks.rightIndexFinger.y) {
    return { gesture: 'thumbs_up', confidence: 0.9, timestamp: Date.now() };
  }

  // Nod: head moving up/down
  if (headYMovement > threshold) {
    return { gesture: 'nod', confidence: 0.85, timestamp: Date.now() };
  }

  // ... more gesture rules

  return { gesture: null, confidence: 0, timestamp: Date.now() };
}
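
A sketch of initialising the detector and running a capture loop in the same module. The @tensorflow-models/pose-detection package and the MoveNet model are assumptions; the plan only specifies TensorFlow.js pose detection:

// frontend/src/lib/gesture-detector.ts (continued — sketch; package and model are assumptions)
import * as poseDetection from '@tensorflow-models/pose-detection';
import '@tensorflow/tfjs';

// Module-level detector read by detectGesture() above
let poseDetector: poseDetection.PoseDetector;

export async function startGestureLoop(
  video: HTMLVideoElement,
  onGesture: (result: GestureResult) => void,
): Promise<() => void> {
  // One-time model load; MoveNet is a lightweight real-time option
  poseDetector = await poseDetection.createDetector(
    poseDetection.SupportedModels.MoveNet,
  );

  let running = true;
  const tick = async () => {
    if (!running) return;
    const result = await detectGesture(video); // rule-based classifier above
    if (result.gesture) onGesture(result);
    setTimeout(tick, 200); // ~5 fps limits CPU/battery cost on mobile
  };
  void tick();

  return () => { running = false; }; // caller stops the loop with this
}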

Tier 2: ML Model (Optional Future)

  • Train small gesture classification model (TFLite)
  • More accurate than rule-based, but requires training data
  • Local inference in browser (no cloud dependency)

Gesture Event Flow

User performs gesture
    ↓
[detectGesture] → gesture type + confidence
    ↓
Gesture confidence > threshold?
├─ YES → Emit event to chat
│        (show "Thumbs up detected" toast)
│        ↓
│        Post to /api/chat/component-event
│        ↓
│        AI receives → responds naturally
│
└─ NO → Ignore (too low confidence)
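
A sketch of the dispatch step, posting to /api/chat/component-event with the payload shape shown in the API Reference below; the threshold and cooldown values are illustrative:

// frontend/src/lib/gesture-events.ts (sketch; threshold and cooldown are illustrative)
import type { GestureResult } from './gesture-detector';

const CONFIDENCE_THRESHOLD = 0.9;
const COOLDOWN_MS = 2000;
let lastSentAt = 0;

export async function dispatchGestureEvent(
  conversationId: string,
  result: GestureResult,
): Promise<void> {
  if (!result.gesture || result.confidence < CONFIDENCE_THRESHOLD) return; // ignore weak detections
  if (Date.now() - lastSentAt < COOLDOWN_MS) return; // debounce repeated frames of the same gesture
  lastSentAt = Date.now();

  await fetch('/api/chat/component-event', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      conversation_id: conversationId,
      component_id: 'gesture_detector',
      event: 'gesture_detected',
      data: { gesture: result.gesture, confidence: result.confidence },
    }),
  });
}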

3. Visual Analysis

Backend: VisualIntegration

# backend/interfaces/integrations/visual.py

class VisualIntegration(BaseInterface):
    """Analyze images, video frames, and visual content."""

    async def analyze_image(self, image_data: str, context: str = None):
        """
        Analyze an image using Claude's vision capabilities.

        Args:
            image_data: base64-encoded image
            context: optional user question/context

        Returns:
            {
                "description": "what's in the image",
                "objects": ["list", "of", "detected", "objects"],
                "text": "any text found (OCR)",
                "analysis": "detailed analysis based on context"
            }
        """
        # Call Claude API with vision
        response = await self.llm.call_with_vision(
            message=f"Analyze this image{f': {context}' if context else ''}",
            image_base64=image_data
        )

        return {
            "description": response.description,
            "objects": response.detected_objects,
            "text": response.ocr_text,
            "analysis": response.analysis
        }

    async def recognize_gesture(self, video_frame: str):
        """
        Recognize gestures in a video frame.

        Note: This is mostly frontend work; backend can optionally
        do additional analysis.

        Args:
            video_frame: base64-encoded frame

        Returns:
            {
                "gesture": "thumbs_up|thumbs_down|wave|nod|...",
                "confidence": 0.95,
                "description": "User gave thumbs up"
            }
        """
        # Optional: use Claude vision for additional gesture analysis
        # if frontend detection is uncertain
        pass

    async def transcribe_handwriting(self, image_data: str):
        """
        Transcribe handwriting from an image using OCR + Claude vision.

        Args:
            image_data: base64-encoded image with handwriting

        Returns:
            {
                "text": "transcribed text",
                "confidence": 0.92,
                "structured_data": {...}  # if it's a form/list
            }
        """
        response = await self.llm.call_with_vision(
            message="Transcribe this handwriting precisely",
            image_base64=image_data
        )

        return {
            "text": response.text,
            "confidence": response.confidence,
            "structured_data": response.parsed_structure
        }

    async def interpret_diagram(self, image_data: str):
        """
        Interpret diagrams (flowcharts, architecture, whiteboard sketches).

        Args:
            image_data: base64-encoded diagram

        Returns:
            {
                "type": "flowchart|architecture|mindmap|...",
                "description": "what the diagram shows",
                "elements": ["list", "of", "components"],
                "relationships": [
                    {"from": "A", "to": "B", "relation": "connects_to"}
                ],
                "explanation": "detailed explanation"
            }
        """
        response = await self.llm.call_with_vision(
            message="Explain this diagram in detail. List components and relationships.",
            image_base64=image_data
        )

        return {
            "type": response.diagram_type,
            "description": response.summary,
            "elements": response.components,
            "relationships": response.relationships,
            "explanation": response.detailed_explanation
        }

Frontend: Image Display & Analysis

// Show image in chat with analysis button
<ChatBubble>
  <div className="space-y-2">
    <img src={imageUrl} alt="User uploaded image" />
    <button onClick={() => analyzeImage(imageUrl)}>
      Analyze this image
    </button>
  </div>
</ChatBubble>

// After analysis
<ChatBubble role="assistant">
  <p>{analysis.description}</p>
  <ul>
    {analysis.objects.map(obj => <li key={obj}>{obj}</li>)}
  </ul>
</ChatBubble>
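
A sketch of the analyzeImage helper used above, assuming it posts to the /api/chat/analyze-image endpoint from the API Reference (whether it receives an upload id or a URL is an open detail; an id is assumed here):

// frontend/src/lib/analyze-image.ts (sketch)
export interface ImageAnalysis {
  description: string;
  objects: string[];
  text: string;
  analysis: string;
}

export async function analyzeImage(imageId: string, context?: string): Promise<ImageAnalysis> {
  const res = await fetch('/api/chat/analyze-image', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image_id: imageId, context }),
  });
  if (!res.ok) throw new Error(`Analysis failed: ${res.status}`);
  return res.json(); // { description, objects, text, analysis }
}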

4. Screen Sharing (Desktop/Tauri)

Use Case

"Show me what you're seeing" — User shares a portion of their desktop for context.

Implementation

// Tauri command
const frame = await invoke('screen_capture', {
  captureArea: { x: 0, y: 0, width: 1920, height: 1080 }
});

// Send to Morphee
await api.analyzeImage(frame, "What's on my screen?");

UI

<div className="flex gap-2">
  <Button onClick={() => captureFullScreen()}>
    <Share2 className="w-4 h-4" /> Share screen
  </Button>
  <Button onClick={() => captureRegion()}>
    <Crop className="w-4 h-4" /> Share region
  </Button>
</div>

5. Visual Memory

Memory Integration Extension

Add new MemoryIntegration actions:

# In backend/interfaces/integrations/memory.py

async def store_visual(self, image_data: str, caption: str, memory_type: str):
    """
    Store an image in memory with optional caption/tags.

    Args:
        image_data: base64-encoded image
        caption: user description
        memory_type: "photo|diagram|screenshot|drawing|..."
    """
    # Generate embedding from image description
    description = await visual_integration.analyze_image(image_data)
    embedding = await embeddings.embed(description['description'])

    # Store both the image (in Git) and embedding (in pgvector)
    await vector_store.insert(
        content=caption or description['description'],
        embedding=embedding,
        type='visual_memory',
        metadata={
            'image_type': memory_type,
            'objects': description['objects'],
            'stored_at': datetime.now().isoformat()
        }
    )

    # Save image file to Git
    await git_store.save_memory(
        filename=f"images/{uuid4()}.b64",
        content=f"---\ntype: visual\ncaption: {caption}\n---\n{image_data}"
    )

Frontend: Quick Save

<Button onClick={() => saveToMemory(imageUrl, caption)}>
  <Save className="w-4 h-4" /> Save to memory
</Button>
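
A minimal sketch of saveToMemory, assuming a hypothetical /api/chat/save-visual-memory route that forwards to store_visual; the plan only specifies the backend action, so the route name and payload are placeholders:

// frontend/src/lib/visual-memory.ts (sketch; route name and payload are placeholders)
export async function saveToMemory(imageRef: string, caption: string): Promise<void> {
  const res = await fetch('/api/chat/save-visual-memory', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      image: imageRef,       // URL or id of a previously uploaded image
      caption,               // user-provided description
      memory_type: 'photo',  // "photo|diagram|screenshot|drawing|..."
    }),
  });
  if (!res.ok) throw new Error(`Save to memory failed: ${res.status}`);
}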

Implementation Timeline

Phase 3p.1 — Basic Camera & Image Upload (2 weeks)

Backend:

  • Permission model in ACL system (camera, image_upload)
  • Image upload endpoint: POST /api/chat/upload-image (multipart/form-data)
  • Image storage: temporary AWS S3 or local fs (sandboxed per group)
  • URL generation for display in chat

Frontend:

  • CameraCapture component (browser camera API)
  • ImageUploader component (drag-drop, file picker)
  • ImageMessage rendering in ChatBubble
  • Permission request flow + Settings toggle
  • Upload progress indicator

Tauri (Optional for Phase 3p.1):

  • Native camera access (macOS AVFoundation, Windows MediaCapture, Linux V4L2)
  • Permission dialog (native vs browser)

Tests:

  • Frontend: camera permission, upload flow, image display
  • Backend: permission check, image storage, retrieval
  • E2E: upload image, see in chat

Phase 3p.2 — Visual Analysis (2 weeks)

Backend:

  • VisualIntegration(BaseInterface) with 4 actions
  • Claude vision integration (anthropic-sdk vision calls)
  • Image caching (store analysis metadata, not images)
  • Error handling (invalid images, API failures)

Frontend:

  • "Analyze image" button in ChatBubble
  • Display analysis results inline
  • Follow-up questions for images ("Tell me more about...")
  • Image analysis loading state

Tests:

  • Backend: vision API calls, error cases, metadata caching
  • Frontend: analyze button, result display
  • E2E: upload, analyze, discuss results

Phase 3p.3 — Gesture Recognition (2 weeks)

Frontend:

  • TensorFlow.js pose detection (load model, init)
  • GestureDetector component (continuous frame analysis)
  • GestureRecognizer logic (rule-based gesture classification)
  • Gesture event dispatch (component-event endpoint)
  • Visual feedback (toast: "Thumbs up detected!")

Backend:

  • Handle gesture events in chat orchestrator
  • Natural language response to gestures ("Got it! I'm on it.")
  • Gesture → action mapping (thumbs_up = approve_pending)

Tauri:

  • Camera frame streaming to frontend (optional optimization)

Tests:

  • Frontend: pose detection, gesture classification, event dispatch
  • Backend: gesture event handling, action mapping
  • E2E: perform gesture, see AI response

Phase 3p.4 — Polish & Accessibility (1 week)

Frontend:

  • Voice feedback for gestures ("I detected your thumbs up")
  • High-contrast visual indicators (red/green borders for approval)
  • Gesture customization in Settings
  • Gesture log (history of detected gestures)
  • Gesture-only mode (no buttons, all interaction via gesture)

Tauri (Desktop):

  • screen_capture command (full screen or region)
  • Screen share UI

Tests:

  • Voice feedback for screen readers
  • Gesture-only workflow (approval without clicking)
  • Accessibility: keyboard nav, screen reader compat

Data & Privacy

What Data Is Captured?

| Data | Captured | Stored | Retained | Deleted |
|---|---|---|---|---|
| Camera frames | Yes (temporary) | No | Session only | On close |
| Uploaded images | Yes | Temporarily (S3) | 24h max | Auto after 24h |
| Image analysis | Yes | As metadata | Per memory policy | With memory |
| Gesture frames | No (only landmarks) | No | Session only | On close |
| Gesture log | Yes | Optional (user consent) | Per user setting | On request |

Privacy Safeguards

  1. No raw image storage — Only analyzed metadata + captions stored
  2. Explicit consent — Camera/image upload requires user permission
  3. GDPR compliance — User can delete all images/metadata via data export/deletion
  4. Local processing — Gesture detection happens in browser (no server-side video processing)
  5. Opt-in gesture logging — Users can disable gesture history
  6. Biometric template not stored — Face landmarks extracted during pose detection are NOT stored

Privacy Policy Updates

Add to PRIVACY_POLICY.md:

## Visual Data Processing

When you use Morphee's camera or image upload features:

- **Images you upload** are analyzed by Claude's vision API and deleted within 24 hours
- **Your camera** is only accessed with your explicit permission
- **Gesture data** is processed locally in your browser; no gestures are stored unless you enable gesture logging
- **Analysis results** (descriptions, extracted text) are stored only if you save them to memory
- **Screen captures** (desktop only) are temporary and never stored server-side

You can disable camera/image features in Settings → Privacy at any time.

Database Schema

-- visual_interactions: log of user visual interactions
CREATE TABLE visual_interactions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES morphee_users(id) ON DELETE CASCADE,
    group_id UUID NOT NULL REFERENCES groups(id) ON DELETE CASCADE,
    interaction_type TEXT NOT NULL, -- 'image_upload', 'gesture_detected', 'analysis'
    image_url TEXT,
    gesture_type TEXT,
    gesture_confidence NUMERIC,
    metadata JSONB,
    created_at TIMESTAMP NOT NULL DEFAULT now()
);

-- gesture_preferences: custom gesture bindings per user
CREATE TABLE gesture_preferences (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES morphee_users(id) ON DELETE CASCADE,
    gesture_type TEXT NOT NULL, -- 'thumbs_up', 'nod', etc.
    custom_action TEXT, -- optional custom mapping
    enabled BOOLEAN DEFAULT true,
    created_at TIMESTAMP NOT NULL DEFAULT now(),
    UNIQUE(user_id, gesture_type)
);

-- Enable gesture logging in ACL permissions
-- acl_preferences: add 'gesture_logging' boolean field

API Reference

Image Upload

POST /api/chat/upload-image
Content-Type: multipart/form-data

Body:
  image: <file>
  conversation_id: <uuid>

Response 200:
{
  "id": "<image_id>",
  "url": "<cdn_url>",
  "size": 1024,
  "mime_type": "image/png"
}
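
A minimal client-side sketch of this call using FormData (the helper name is illustrative):

// frontend/src/lib/upload-image.ts (sketch)
export interface UploadedImage {
  id: string;
  url: string;
  size: number;
  mime_type: string;
}

export async function uploadImage(image: Blob, conversationId: string): Promise<UploadedImage> {
  const form = new FormData();
  form.append('image', image);                    // the captured frame or selected file
  form.append('conversation_id', conversationId);

  // The browser sets the multipart boundary; don't set Content-Type manually
  const res = await fetch('/api/chat/upload-image', { method: 'POST', body: form });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  return res.json();
}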

Analyze Image

POST /api/chat/analyze-image
Content-Type: application/json

Body:
{
  "image_id": "<image_id>",
  "context": "What is this?"
}

Response 200:
{
  "description": "A diagram showing...",
  "objects": ["object1", "object2"],
  "text": "Any text found",
  "analysis": "Detailed analysis"
}

Gesture Event (via WebSocket)

// Client sends
{
  "type": "component_event",
  "conversation_id": "<uuid>",
  "component_id": "gesture_detector",
  "event": "gesture_detected",
  "data": {
    "gesture": "thumbs_up",
    "confidence": 0.95
  }
}

// Server responds
{
  "type": "assistant_response",
  "content": "Got it! Approving that for you."
}

Testing Strategy

Frontend Unit Tests

// @vitest
describe('CameraCapture', () => {
  it('should request camera permission on mount', async () => {
    // Mock getUserMedia
  });

  it('should capture frame on button click', async () => {
    // Verify frame is returned as Blob
  });
});

describe('GestureDetector', () => {
  it('should detect thumbs up with >90% confidence', async () => {
    // Load test video, run detector
  });

  it('should emit gesture event on high confidence', async () => {
    // Verify event dispatched
  });
});
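
A sketch of the getUserMedia mock for the first test, using Vitest's vi.stubGlobal; the fake stream exposes only what the component needs:

// In the CameraCapture test setup (sketch)
import { vi, beforeEach } from 'vitest';

beforeEach(() => {
  const fakeStream = {
    getTracks: () => [{ stop: vi.fn() }], // just enough surface for cleanup on unmount
  } as unknown as MediaStream;

  vi.stubGlobal('navigator', {
    ...navigator,
    mediaDevices: {
      getUserMedia: vi.fn().mockResolvedValue(fakeStream), // grant permission by default
    },
  });
});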

Backend Integration Tests

@pytest.mark.asyncio
async def test_analyze_image_with_vision():
    # Upload test image
    # Call analyze endpoint
    # Verify Claude vision was called
    # Verify metadata returned
    ...


@pytest.mark.asyncio
async def test_gesture_event_permission_check():
    # User without gesture permission
    # POST gesture event
    # Expect 403 Forbidden
    ...

E2E Tests (Playwright)

test('User uploads image and asks Morphee to analyze it', async ({ page }) => {
  // Login
  // Open chat
  // Click camera button
  // Select test image
  // Morphee analyzes and responds
});

test('User approves pending task via thumbs up gesture', async ({ page }) => {
  // Morphee asks for approval
  // User performs thumbs up gesture
  // Task is approved
});

Success Metrics

| Metric | Target | How to Measure |
|---|---|---|
| Camera permission grant rate | >80% | Analytics: permission requests vs grants |
| Image upload frequency | >20% of users use weekly | Usage analytics |
| Gesture detection accuracy | >90% | E2E tests + user feedback |
| Image analysis latency | <2s | CloudWatch metrics |
| User satisfaction | >4.2/5 | In-app survey post-analysis |
| Accessibility impact | 50% increase in non-text interactions | Interaction logs |

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Camera permission denial | Users can't use camera feature | Graceful degradation; image upload still works |
| Vision API rate limits | Analysis delayed/failed | Queue long-running analyses; cache results |
| Gesture false positives | Accidental approvals | High confidence threshold (>90%); confirmation prompt |
| Privacy concerns | User trust erosion | Clear policy; local processing where possible; transparency |
| Browser support | Limited to modern browsers | Feature detect; fallback to image upload |
| Mobile performance | High CPU/battery drain | Optimize frame processing; reduce frequency; offer toggle |

Future Enhancements (Phase 3p+)

  • Handwriting recognition as notes — "Take a picture of my note" → stored in memory
  • Gesture learning — Users can teach Morphee custom gestures
  • Multi-gesture combos — Detect two-hand gestures for complex commands
  • Real-time video chat — WebRTC peer connection to Morphee (requires cloud liveness detection)
  • Vision-based authentication — Gesture or face for step-up auth (ties to Phase 3o)
  • Whiteboard collaboration — Multiple users draw + Morphee joins as text/voice layer
  • Augmented reality — AR overlay showing Morphee's response in real world (AR glasses era)

Last Updated: 2026-02-20 Status: Ready for Planning Phase (V1.3)