Monitoring & Analytics — Design Document

Created: February 13, 2026
Status: Planned
Phase: 3k (Observability)


Overview

Production observability via PostHog (product analytics) and Grafana + Prometheus (infrastructure metrics). Understand user behavior, monitor system health, debug issues, and track performance.


Motivation

Problem: Without observability, we're blind:

  • Can't answer "what features do users actually use?"
  • No visibility into performance bottlenecks (slow API endpoints, memory leaks)
  • Debugging user-reported issues is guesswork
  • No data-driven product decisions

Solution: Two-layer monitoring:

  1. PostHog — Product analytics (user behavior, feature usage, funnels)
  2. Grafana + Prometheus — Infrastructure metrics (API latency, LLM tokens, error rates)

PostHog — Product Analytics

What to Track

Events (10+ core events):

Event | Properties | Purpose
chat_message_sent | conversation_id, message_length, streaming | Engagement: messages per day, conversation length
tool_call_executed | integration, action, success, duration_ms | Feature usage: which integrations are most used?
approval_requested | tool, approved, timeout | User friction: approval rates, timeout frequency
onboarding_completed | persona, spaces_created, time_to_complete | Conversion: onboarding completion rate, drop-off points
integration_connected | integration_name, success | Adoption: which integrations do users connect first?
conversation_created | space_id, title | Usage patterns: conversations per user/day
search_performed | query, result_count, result_types | Feature value: search usage, relevance
setting_changed | category, key, old_value, new_value | Configuration: what do users customize?
error_occurred | error_type, endpoint, stack_trace | Reliability: error frequency, top error sources
page_viewed | page, referrer, duration | Navigation: user flows, page popularity

User properties:

  • persona — parent, teacher, manager, etc.
  • group_size — number of members in group
  • active_integrations — list of connected integrations
  • plan_tier — free, pro, enterprise (future)
  • signup_date — cohort analysis
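These person properties could be attached once at signup or login. A minimal sketch of assembling them (the build_user_properties helper is illustrative, not an existing module; the commented capture call shows one way posthog-python sends person properties via $set):

```python
from datetime import date

def build_user_properties(persona: str, group_size: int,
                          active_integrations: list[str],
                          plan_tier: str, signup_date: date) -> dict:
    """Assemble the PostHog person properties tracked above."""
    return {
        "persona": persona,
        "group_size": group_size,
        "active_integrations": active_integrations,
        "plan_tier": plan_tier,
        "signup_date": signup_date.isoformat(),  # ISO date keeps cohort filters sortable
    }

# With the posthog-python client this would be attached roughly as:
# posthog_client.capture(distinct_id=user_id, event="$identify",
#                        properties={"$set": build_user_properties(...)})
```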

Implementation

Backend: PostHog Client

# backend/analytics/posthog.py

from posthog import Posthog
from backend.config import settings

posthog_client = Posthog(
    project_api_key=settings.POSTHOG_API_KEY,
    host=settings.POSTHOG_HOST,  # self-hosted or cloud
)

async def capture_event(
    user_id: str,
    event: str,
    properties: dict,
    group_id: str | None = None,
):
    """Capture event in PostHog (async, non-blocking)."""
    if not await has_consent(user_id, "analytics"):
        return  # User opted out

    # PII scrubbing: never send message content, only metadata
    sanitized_props = sanitize_properties(properties)

    posthog_client.capture(
        distinct_id=user_id,
        event=event,
        properties=sanitized_props,
        groups={"group": group_id} if group_id else None,
    )

Privacy: Opt-in + PII Scrubbing

  • Opt-in only — analytics_enabled setting (default: false, user must explicitly enable)
  • Never log PII — no message content, no emails, no names. Only opaque IDs and metadata
  • Anonymize IPs — PostHog config: anonymize_ips: true
  • Session replay opt-in — Separate consent: session_replay_enabled (default: false)
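The sanitize_properties helper referenced above is not specified in this document; one possible shape, assuming a key denylist plus regex redaction of email-like strings (both the denylist and the pattern are illustrative):

```python
import re

# Keys that must never leave the backend (illustrative denylist)
PII_KEYS = {"content", "message", "email", "name"}

# Matches email-like substrings inside free-text values
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def sanitize_properties(properties: dict) -> dict:
    """Drop denylisted keys and redact email-like strings in the rest."""
    clean = {}
    for key, value in properties.items():
        if key in PII_KEYS:
            continue  # never forward these at all
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted-email]", value)
        clean[key] = value
    return clean
```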

Event capture points:

# backend/api/chat.py
await analytics.capture_event(
    user_id=ctx.user_id,
    event="chat_message_sent",
    properties={"conversation_id": conv_id, "message_length": len(content)},
    group_id=ctx.group_id,
)

# backend/chat/orchestrator.py
await analytics.capture_event(
    user_id=ctx.user_id,
    event="tool_call_executed",
    properties={"integration": interface, "action": action, "success": result.success, "duration_ms": elapsed},
    group_id=ctx.group_id,
)

PostHog Dashboards

1. Engagement

  • Daily/weekly/monthly active users
  • Messages per user per day
  • Conversations per user per week
  • Session duration distribution

2. Feature Usage

  • Most-used integrations (bar chart)
  • Tool calls by integration (pie chart)
  • Approval request rates (% approved, % rejected, % timeout)
  • Search usage (queries per user, result click-through rate)

3. Onboarding

  • Funnel: signup → onboarding start → onboarding complete → first message
  • Time to first message (median, p95)
  • Onboarding drop-off points
  • Persona distribution (pie chart)

4. Errors

  • Error rate over time (line chart)
  • Top 10 error types (table)
  • Errors by endpoint (table)
  • User-reported errors vs auto-detected (comparison)

Grafana + Prometheus — Infrastructure Metrics

What to Track

API Metrics:

from prometheus_client import Counter, Histogram, Gauge

# Request counter
api_requests_total = Counter(
    'morphee_api_requests_total',
    'Total API requests',
    ['endpoint', 'status'],
)

# Latency histogram
api_request_duration_seconds = Histogram(
    'morphee_api_request_duration_seconds',
    'API request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
)

# Active connections
websocket_connections_active = Gauge(
    'morphee_websocket_connections_active',
    'Active WebSocket connections',
)
)

LLM Metrics:

# Token usage
llm_tokens_total = Counter(
    'morphee_llm_tokens_total',
    'Total LLM tokens consumed',
    ['model', 'type'],  # type: input or output
)

# Streaming latency
llm_streaming_latency_seconds = Histogram(
    'morphee_llm_streaming_latency_seconds',
    'Time to first token',
    ['model'],
)

# Tool call count
llm_tool_calls_total = Counter(
    'morphee_llm_tool_calls_total',
    'Total tool calls',
    ['integration', 'action'],
)

Memory Metrics:

# RAG pipeline
memory_rag_latency_seconds = Histogram(
    'morphee_memory_rag_latency_seconds',
    'RAG pipeline latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0],
)

# Vector search
memory_vector_search_results = Histogram(
    'morphee_memory_vector_search_results',
    'Number of results returned',
    buckets=[0, 1, 5, 10, 20, 50],
)

# Git sync
memory_git_sync_success = Counter(
    'morphee_memory_git_sync_success',
    'Git sync successes',
    ['group_id'],
)

memory_git_sync_failures = Counter(
    'morphee_memory_git_sync_failures',
    'Git sync failures',
    ['group_id', 'error_type'],
)

Implementation

Backend: Prometheus Client

# backend/utils/metrics.py

from fastapi import Response
from prometheus_client import CollectorRegistry, generate_latest

# Registry holding all custom metrics; each Counter/Histogram/Gauge above
# must be created with registry=registry so it shows up here
registry = CollectorRegistry()

# Define metrics (see above)

# Expose /metrics endpoint (app is the FastAPI instance from backend/main.py)
@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(registry), media_type="text/plain")

Middleware: Request tracking

# backend/main.py

from time import time

from fastapi import Request

@app.middleware("http")
async def track_request_metrics(request: Request, call_next):
    start = time()
    response = await call_next(request)
    duration = time() - start

    api_requests_total.labels(
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()

    api_request_duration_seconds.labels(
        endpoint=request.url.path,
    ).observe(duration)

    return response
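One caveat with labeling by request.url.path: IDs embedded in paths (e.g. /api/conversations/123) make label cardinality unbounded. A hedged sketch of a normalizer that could run before labeling (the patterns are illustrative):

```python
import re

# Collapse UUID and numeric path segments into placeholders (illustrative patterns).
# UUIDs must be handled first so their digit groups aren't split by the numeric rule.
UUID_SEG = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")
NUM_SEG = re.compile(r"/\d+")

def normalize_endpoint(path: str) -> str:
    """Replace dynamic path segments so Prometheus labels stay bounded."""
    path = UUID_SEG.sub("/{uuid}", path)
    return NUM_SEG.sub("/{id}", path)
```

In FastAPI, the matched route template (e.g. "/api/conversations/{conv_id}") can typically also be read from request.scope after routing, which avoids regexes entirely.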

Grafana Dashboards

1. API Health

  • Request rate (req/sec) — line chart
  • Error rate (%) — line chart with alert threshold (>1% = red)
  • Latency (p50, p95, p99) — line chart
  • Status code distribution (2xx, 4xx, 5xx) — stacked area chart

2. LLM Performance

  • Tokens per second — line chart
  • Streaming latency (time to first token) — histogram
  • Tool execution time — bar chart by integration
  • Tool call distribution — pie chart

3. Memory System

  • RAG latency — histogram
  • Vector search latency — line chart
  • Git sync success rate (%) — gauge
  • Embedding generation time — line chart

4. User Activity

  • Active users (1h, 24h, 7d) — 3 gauges
  • Conversations per day — line chart
  • Messages per conversation — histogram
  • WebSocket connections — line chart

5. Errors & Alerts

  • Exception rate — line chart with alert threshold
  • Top 10 error types — table
  • Failed tool calls — line chart
  • Timeout rate (approval, LLM) — line chart

Alerts

Slack webhook on:

  • Error rate > 1% for 5 minutes
  • LLM timeout > 10s (p95)
  • Git sync failures > 3 in 1 hour
  • WebSocket disconnect rate > 10% of active connections
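The first alert above could be expressed as a Prometheus rule along these lines (a sketch: metric names match the counters defined earlier, and routing to Slack would happen in Alertmanager):

```yaml
groups:
  - name: morphee-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(morphee_api_requests_total{status=~"5.."}[5m]))
            / sum(rate(morphee_api_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1% for 5 minutes"
```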

Docker Compose Setup

# docker-compose.dev.yml

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources

volumes:
  prometheus-data:
  grafana-data:

prometheus.yml:

scrape_configs:
  - job_name: 'morphee-backend'
    static_configs:
      - targets: ['backend:8000']
    scrape_interval: 15s
    metrics_path: '/metrics'

Effort Estimation

Task | Size | Notes
Backend: PostHog client + event capture | M | Wrapper, 10 event definitions, PII scrubbing
Backend: Prometheus metrics | M | 15+ custom metrics, middleware
Backend: /metrics endpoint | S | Expose Prometheus
Frontend: PostHog SDK integration | S | Event capture on user actions
Docker: Prometheus + Grafana setup | S | Compose config, volume mounts
Grafana: 5 dashboards | M | JSON dashboard configs
Alert rules: Slack webhooks | S | Prometheus alertmanager config
Tests: Backend metrics | M | 20+ tests
Total: Large (2 weeks)


Risks & Mitigations

Risk | Mitigation
Privacy violations (logging PII) | Strict PII scrubbing, opt-in analytics, regular audits
Performance overhead (metrics collection) | Async event capture, metric sampling for high-volume endpoints
Alert fatigue (too many alerts) | Start with critical alerts only, tune thresholds over time
Prometheus disk usage growth | Set retention period (30 days), compress old data

Future Enhancements

Core Features

  • AI-queryable monitoring — MonitoringIntegration so AI can answer "How's the app performing today?"
  • Anomaly detection — ML-based alerting (Prometheus + Grafana ML)
  • Cost tracking — LLM API costs per user/group, budget alerts
  • User feedback loop — In-app feedback widget → PostHog event
  • Self-hosted PostHog — For privacy-focused deployments
  • Real-time dashboards — Live-updating metrics in Settings page (not just Grafana)
  • Performance budgets — Alert when bundle size, API latency, or LLM tokens exceed thresholds

ACL & Permission Enhancements

Role-Based Analytics Access (leveraging existing PermissionPolicy):

from pydantic import BaseModel

class AnalyticsPermission(BaseModel):
    metric_category: str  # "user_activity", "system_health", "costs", "privacy"
    can_view: list[str]  # Roles: ["parent", "admin"]
    can_export: list[str]  # Roles: ["admin"]
    scope: str  # "own", "group", "system"
    masked_fields: list[str]  # Fields to mask: ["user_id", "message_content"]

Use cases:

  1. Family Analytics:

    • Parents can see group-wide usage stats (messages per day, most-used features)
    • Parents can see each kid's usage (with consent, GDPR compliant)
    • Kids can see only their own stats (not siblings' data)
    • Privacy toggle: Kid can opt out of parent viewing their stats
  2. Classroom Analytics:

    • Teacher can see class-wide engagement metrics (participation, assignment completion)
    • Teacher can see per-student activity (with school policy compliance)
    • Students see only their own progress dashboards
    • Admin can see system-wide stats (all classes, anonymized)
  3. Team/Organization:

    • Manager can see team productivity metrics (tasks completed, meetings scheduled)
    • Manager can't see individual private messages (privacy boundary)
    • Team members see their own contribution metrics
    • C-suite sees org-wide KPIs (aggregated, anonymized)

Privacy-Aware Analytics:

  • Granular consent per metric type:

    consent_types = [
        "analytics.usage_patterns",  # Can track feature usage
        "analytics.performance",     # Can track app performance
        "analytics.error_reports",   # Can capture error logs
        "analytics.session_replay",  # Can record sessions (high privacy impact)
    ]
  • User can opt in/out per category via SettingsIntegration
  • Stored in user_consents table (already exists for GDPR)
  • Analytics client checks consent before capturing events
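The consent gate could look up these categories before every capture. A minimal in-memory sketch (the dict stands in for the user_consents table; real code would query it):

```python
# Illustrative in-memory stand-in for the user_consents table
user_consents = {
    ("user-1", "analytics.usage_patterns"): True,
    ("user-1", "analytics.session_replay"): False,
}

def has_consent(user_id: str, consent_type: str) -> bool:
    """Default-deny: a missing row means the user never opted in."""
    return user_consents.get((user_id, consent_type), False)
```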

Data Access Logs:

  • Log who viewed which analytics dashboards: {user_id, dashboard, timestamp}
  • GDPR compliance: "Show me who accessed my analytics data"
  • Stored in analytics_access_log table
  • Audit trail for compliance reviews

Advanced Analytics Features

Per-Space Analytics:

  • Track metrics per Space: "Homework space has 80% task completion rate"
  • Compare Spaces: "Meal planning space has 2x more activity than shopping space"
  • Space health score: Activity level, task completion, member engagement
  • Stored in PostHog with space_id property

Per-Integration Analytics:

  • Which integrations are most used? (Gmail > Calendar > Tasks)
  • Integration health: Success rate, average latency, error rate
  • Integration adoption funnel: connected → first use → daily active
  • Cost attribution: "Google Calendar integration used 1000 API calls this month"

Cohort Analysis:

  • Group users by persona: "Parents use calendar 3x more than kids"
  • Group users by signup date: "Week 1 cohort has 80% retention"
  • Group users by feature adoption: "Users who connect calendar are 2x more likely to stay active"
  • PostHog built-in cohorts + custom cohorts

Funnel Analysis:

  • Onboarding funnel: signup → group creation → first message → first task created
  • Feature adoption funnel: connected Google → first calendar event → scheduled recurring event
  • Conversion funnel: free → trial → paid (future monetization)
  • Identify drop-off points: "60% of users drop after onboarding step 3"

Retention Analysis:

  • Daily/weekly/monthly active users (DAU/WAU/MAU)
  • Retention curves: Day 1, Day 7, Day 30 retention rates
  • Churn prediction: "Users who don't connect integrations in first week are 3x more likely to churn"
  • Re-engagement campaigns: Send notification to inactive users

A/B Testing:

  • PostHog feature flags for gradual rollout: "Enable new chat UI for 10% of users"
  • Measure impact: "New UI increased messages per day by 15%"
  • Automatic winner selection: PostHog Experiments picks winning variant
  • Rollout to 100% after validation
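PostHog evaluates flags itself; to illustrate the mechanics of a gradual rollout, here is a deterministic percentage bucketer of the kind feature-flag systems use (this is not PostHog's actual algorithm, just a sketch of the idea):

```python
import hashlib

def in_rollout(flag_key: str, user_id: str, percent: float) -> bool:
    """Deterministically bucket a user into the first `percent`% for a flag.

    Hashing flag+user keeps each user's assignment stable across sessions,
    and different flags bucket users independently.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF * 100  # map hash prefix to [0, 100]
    return bucket < percent
```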

AI-Powered Monitoring Enhancements

MonitoringIntegration (new Integration):

class MonitoringIntegration(BaseInterface):
    name = "monitoring"
    description = "Query system health, performance, and usage analytics"

actions:
    get_system_health() → {status: "healthy", uptime: 99.9, errors_last_hour: 3}
    get_performance_metrics(metric_name?) → {api_latency_p95: 250ms, llm_tokens_per_sec: 45}
    get_usage_stats(time_range?) → {messages_today: 127, active_users: 5}
    get_cost_breakdown(period?) → {llm_cost: $12.50, storage_cost: $2.30}
    get_user_activity(user_id?) → {messages: 45, tasks_created: 8, last_active: "2h ago"}

AI can answer questions like:

  • "How's the app doing today?" → queries get_system_health() → "Everything's running smoothly, no errors"
  • "How many messages did we send this week?" → get_usage_stats("7d") → "Your group sent 340 messages this week"
  • "What's our LLM cost this month?" → get_cost_breakdown("30d") → "You've spent $38 on AI this month, under your $50 budget"
  • "Show me my activity today" → get_user_activity() → "You sent 12 messages, created 3 tasks, and searched 2 times"

Proactive Monitoring Alerts:

  • AI notices anomaly: "API latency spiked to 2 seconds (usually 200ms) — investigating..."
  • AI suggests optimization: "You're using 2x more LLM tokens than average — want to reduce context window?"
  • AI detects unused features: "You haven't used calendar in 30 days — disconnect to reduce clutter?"

Smart Alerting:

  • Context-aware: Don't alert during maintenance windows
  • Role-based: Parents get family usage alerts, admins get system health alerts
  • Intelligent throttling: Group similar alerts, don't spam (max 1 alert per hour per category)
  • Alert via NotificationsIntegration: "System health alert — database connection slow"

Predictive Analytics:

  • Predict churn: "User hasn't logged in for 5 days, usually logs daily — send re-engagement notification?"
  • Predict resource needs: "Your group's usage growing 20%/month — upgrade to larger plan in 2 months?"
  • Predict failures: "Error rate increasing (trend detected) — preemptive investigation triggered"

Cost Management & Optimization

Cost Tracking:

cost_breakdown = {
    "llm": {
        "anthropic_api": {"input_tokens": 1_200_000, "output_tokens": 800_000, "cost": 42.50},
        "embeddings": {"vectors_generated": 15_000, "cost": 0.30},
    },
    "storage": {
        "postgresql": {"storage_gb": 5, "cost": 2.50},
        "redis": {"memory_gb": 1, "cost": 0.50},
    },
    "infrastructure": {
        "compute": {"hours": 720, "cost": 30.00},
        "bandwidth": {"gb": 50, "cost": 5.00},
    },
    "total": 80.80,  # USD, sum of the line items above
}

Budget Alerts:

  • Set monthly budget: settings.update("monitoring.budget", {"monthly": 100})
  • Alert at 80%: "You've used $80 of your $100 monthly budget"
  • Alert at 100%: "Budget exceeded — consider upgrading or reducing usage"
  • Per-integration budgets: "Google Calendar used $5 this month (under $10 limit)"
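The threshold logic itself is simple. A sketch of the 80%/100% checks (function name and message wording are illustrative):

```python
def budget_alerts(spent: float, monthly_budget: float) -> list[str]:
    """Return alert messages for any crossed budget thresholds."""
    alerts = []
    if spent >= monthly_budget:
        alerts.append(
            f"Budget exceeded (${spent:.2f} of ${monthly_budget:.2f}) "
            f"- consider upgrading or reducing usage")
    elif spent >= 0.8 * monthly_budget:
        alerts.append(
            f"You've used ${spent:.2f} of your ${monthly_budget:.2f} monthly budget")
    return alerts
```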

Cost Optimization Suggestions:

  • AI analyzes costs: "You're spending 70% on LLM — reduce temperature from 0.9 to 0.7 to save 20%?"
  • AI suggests caching: "You're re-generating the same summaries — enable caching to save $10/month"
  • AI suggests model switching: "For simple tasks, use Haiku instead of Sonnet — save 80%"

Privacy & Security for Analytics

Differential Privacy:

  • Add noise to aggregate metrics to prevent re-identification
  • Group-level stats show "≈5 active users" instead of exact count
  • Protects individual behavior from being inferred

Data Retention Policies:

  • Raw events: 30 days (PostHog)
  • Aggregated metrics: 1 year (Grafana)
  • Audit logs: 7 years (compliance)
  • User can request early deletion: GDPR right to erasure

Anonymization Pipeline:

  • Before storing events, strip PII:
    • user_id → hashed_user_id
    • message_content → message_length
    • email → domain_only (e.g., "gmail.com")
  • Anonymization policy configurable per metric
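The three transforms above could be a small pure pipeline. A sketch (salt handling is simplified; a real deployment would keep the salt in secrets management and rotate it with care, since rotation breaks ID continuity):

```python
import hashlib

def hash_user_id(user_id: str, salt: str) -> str:
    """One-way pseudonymization: same input + salt always yields the same opaque ID."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]

def email_domain(email: str) -> str:
    """Keep only the domain, e.g. 'alice@gmail.com' becomes 'gmail.com'."""
    return email.rsplit("@", 1)[-1].lower()

def anonymize_event(event: dict, salt: str) -> dict:
    """Apply the PII transforms listed above before an event is stored."""
    out = dict(event)
    if "user_id" in out:
        out["hashed_user_id"] = hash_user_id(out.pop("user_id"), salt)
    if "message_content" in out:
        out["message_length"] = len(out.pop("message_content"))
    if "email" in out:
        out["email_domain"] = email_domain(out.pop("email"))
    return out
```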

GDPR Compliance:

  • Right to access: Export all analytics data for user
  • Right to erasure: Delete all analytics events for user (cascading delete)
  • Right to portability: Export analytics in JSON format
  • Consent management: Opt-in for each analytics category

Real-Time Monitoring in App

Settings Page Metrics (read-only dashboard):

  • "Your Activity Today": messages sent, tasks created, integrations used
  • "Group Activity": top 3 active members, most-used features
  • "System Health": uptime, last error, performance status (green/yellow/red)
  • Embedded mini Grafana charts (iframe) or custom React components

Push Notifications for Alerts:

  • "System slow — we're investigating" (via NotificationsIntegration)
  • "Your LLM budget 80% used — $20 remaining this month"
  • "New feature available — try the new calendar view!"

In-Chat Analytics:

  • User: "How much have I used the app this week?" → AI calls MonitoringIntegration → responds inline
  • Interactive charts rendered as FrontendIntegration components in chat

Last Updated: February 13, 2026