Monitoring & Analytics — Design Document

Created: February 13, 2026
Status: Planned
Phase: 3k (Observability)


Overview

Production observability via PostHog (product analytics) and Grafana + Prometheus (infrastructure metrics). Understand user behavior, monitor system health, debug issues, and track performance.


Motivation

Problem: Without observability, we're blind:

  • Can't answer "what features do users actually use?"
  • No visibility into performance bottlenecks (slow API endpoints, memory leaks)
  • Debugging user-reported issues is guesswork
  • No data-driven product decisions

Solution: Two-layer monitoring:

  1. PostHog — Product analytics (user behavior, feature usage, funnels)
  2. Grafana + Prometheus — Infrastructure metrics (API latency, LLM tokens, error rates)

PostHog — Product Analytics

What to Track

Events (10+ core events):

Event | Properties | Purpose
chat_message_sent | conversation_id, message_length, streaming | Engagement: messages per day, conversation length
tool_call_executed | integration, action, success, duration_ms | Feature usage: which integrations are most used?
approval_requested | tool, approved, timeout | User friction: approval rates, timeout frequency
onboarding_completed | persona, spaces_created, time_to_complete | Conversion: onboarding completion rate, drop-off points
integration_connected | integration_name, success | Adoption: which integrations do users connect first?
conversation_created | space_id, title | Usage patterns: conversations per user/day
search_performed | query, result_count, result_types | Feature value: search usage, relevance
setting_changed | category, key, old_value, new_value | Configuration: what do users customize?
error_occurred | error_type, endpoint, stack_trace | Reliability: error frequency, top error sources
page_viewed | page, referrer, duration | Navigation: user flows, page popularity

User properties:

  • persona — parent, teacher, manager, etc.
  • group_size — number of members in group
  • active_integrations — list of connected integrations
  • plan_tier — free, pro, enterprise (future)
  • signup_date — cohort analysis
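These person properties could be attached once at signup or login. A minimal sketch of assembling them (the build_user_properties helper is illustrative, not an existing module; the commented capture call shows one way posthog-python sends person properties via $set):

```python
from datetime import date

def build_user_properties(persona: str, group_size: int,
                          active_integrations: list[str],
                          plan_tier: str, signup_date: date) -> dict:
    """Assemble the PostHog person properties tracked above."""
    return {
        "persona": persona,
        "group_size": group_size,
        "active_integrations": active_integrations,
        "plan_tier": plan_tier,
        "signup_date": signup_date.isoformat(),  # ISO date keeps cohort filters sortable
    }

# With the posthog-python client this would be attached roughly as:
# posthog_client.capture(distinct_id=user_id, event="$identify",
#                        properties={"$set": build_user_properties(...)})
```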

Implementation

Backend: PostHog Client

# backend/analytics/posthog.py

from posthog import Posthog
from backend.config import settings

posthog_client = Posthog(
    project_api_key=settings.POSTHOG_API_KEY,
    host=settings.POSTHOG_HOST,  # self-hosted or cloud
)

async def capture_event(
    user_id: str,
    event: str,
    properties: dict,
    group_id: str | None = None,
):
    """Capture event in PostHog (async, non-blocking)."""
    if not await has_consent(user_id, "analytics"):
        return  # User opted out

    # PII scrubbing: never send message content, only metadata
    sanitized_props = sanitize_properties(properties)

    posthog_client.capture(
        distinct_id=user_id,
        event=event,
        properties=sanitized_props,
        groups={"group": group_id} if group_id else None,
    )

Privacy: Opt-in + PII Scrubbing

  • Opt-in only — analytics_enabled setting (default: false, user must explicitly enable)
  • Never log PII — no message content, no emails, no names. Only opaque IDs and metadata
  • Anonymize IPs — PostHog config: anonymize_ips: true
  • Session replay opt-in — Separate consent: session_replay_enabled (default: false)
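The sanitize_properties helper referenced above is not specified in this document; one possible shape, assuming a key denylist plus regex redaction of email-like strings (both the denylist and the pattern are illustrative):

```python
import re

# Keys that must never leave the backend (illustrative denylist)
PII_KEYS = {"content", "message", "email", "name"}

# Matches email-like substrings inside free-text values
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def sanitize_properties(properties: dict) -> dict:
    """Drop denylisted keys and redact email-like strings in the rest."""
    clean = {}
    for key, value in properties.items():
        if key in PII_KEYS:
            continue  # never forward these at all
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted-email]", value)
        clean[key] = value
    return clean
```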

Event capture points:

# backend/api/chat.py
await analytics.capture_event(
    user_id=ctx.user_id,
    event="chat_message_sent",
    properties={"conversation_id": conv_id, "message_length": len(content)},
    group_id=ctx.group_id,
)

# backend/chat/orchestrator.py
await analytics.capture_event(
    user_id=ctx.user_id,
    event="tool_call_executed",
    properties={"integration": interface, "action": action, "success": result.success, "duration_ms": elapsed},
    group_id=ctx.group_id,
)

PostHog Dashboards

1. Engagement

  • Daily/weekly/monthly active users
  • Messages per user per day
  • Conversations per user per week
  • Session duration distribution

2. Feature Usage

  • Most-used integrations (bar chart)
  • Tool calls by integration (pie chart)
  • Approval request rates (% approved, % rejected, % timeout)
  • Search usage (queries per user, result click-through rate)

3. Onboarding

  • Funnel: signup → onboarding start → onboarding complete → first message
  • Time to first message (median, p95)
  • Onboarding drop-off points
  • Persona distribution (pie chart)

4. Errors

  • Error rate over time (line chart)
  • Top 10 error types (table)
  • Errors by endpoint (table)
  • User-reported errors vs auto-detected (comparison)

Grafana + Prometheus — Infrastructure Metrics

What to Track

API Metrics:

from prometheus_client import Counter, Histogram, Gauge

# Request counter
api_requests_total = Counter(
    'morphee_api_requests_total',
    'Total API requests',
    ['endpoint', 'status'],
)

# Latency histogram
api_request_duration_seconds = Histogram(
    'morphee_api_request_duration_seconds',
    'API request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0],
)

# Active connections
websocket_connections_active = Gauge(
    'morphee_websocket_connections_active',
    'Active WebSocket connections',
)
)

LLM Metrics:

# Token usage
llm_tokens_total = Counter(
    'morphee_llm_tokens_total',
    'Total LLM tokens consumed',
    ['model', 'type'],  # type: input or output
)

# Streaming latency
llm_streaming_latency_seconds = Histogram(
    'morphee_llm_streaming_latency_seconds',
    'Time to first token',
    ['model'],
)

# Tool call count
llm_tool_calls_total = Counter(
    'morphee_llm_tool_calls_total',
    'Total tool calls',
    ['integration', 'action'],
)

Memory Metrics:

# RAG pipeline
memory_rag_latency_seconds = Histogram(
    'morphee_memory_rag_latency_seconds',
    'RAG pipeline latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0],
)

# Vector search
memory_vector_search_results = Histogram(
    'morphee_memory_vector_search_results',
    'Number of results returned',
    buckets=[0, 1, 5, 10, 20, 50],
)

# Git sync
memory_git_sync_success = Counter(
    'morphee_memory_git_sync_success',
    'Git sync successes',
    ['group_id'],
)

memory_git_sync_failures = Counter(
    'morphee_memory_git_sync_failures',
    'Git sync failures',
    ['group_id', 'error_type'],
)

Implementation

Backend: Prometheus Client

# backend/utils/metrics.py

from fastapi import Response
from prometheus_client import CollectorRegistry, generate_latest

# Registry holding all custom metrics; each Counter/Histogram/Gauge above
# must be created with registry=registry so it shows up here
registry = CollectorRegistry()

# Define metrics (see above)

# Expose /metrics endpoint (app is the FastAPI instance from backend/main.py)
@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(registry), media_type="text/plain")

Middleware: Request tracking

# backend/main.py

from time import time

from fastapi import Request

@app.middleware("http")
async def track_request_metrics(request: Request, call_next):
    start = time()
    response = await call_next(request)
    duration = time() - start

    api_requests_total.labels(
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()

    api_request_duration_seconds.labels(
        endpoint=request.url.path,
    ).observe(duration)

    return response
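One caveat with labeling by request.url.path: IDs embedded in paths (e.g. /api/conversations/123) make label cardinality unbounded. A hedged sketch of a normalizer that could run before labeling (the patterns are illustrative):

```python
import re

# Collapse UUID and numeric path segments into placeholders (illustrative patterns).
# UUIDs must be handled first so their digit groups aren't split by the numeric rule.
UUID_SEG = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")
NUM_SEG = re.compile(r"/\d+")

def normalize_endpoint(path: str) -> str:
    """Replace dynamic path segments so Prometheus labels stay bounded."""
    path = UUID_SEG.sub("/{uuid}", path)
    return NUM_SEG.sub("/{id}", path)
```

In FastAPI, the matched route template (e.g. "/api/conversations/{conv_id}") can typically also be read from request.scope after routing, which avoids regexes entirely.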

Grafana Dashboards

1. API Health

  • Request rate (req/sec) — line chart
  • Error rate (%) — line chart with alert threshold (>1% = red)
  • Latency (p50, p95, p99) — line chart
  • Status code distribution (2xx, 4xx, 5xx) — stacked area chart

2. LLM Performance

  • Tokens per second — line chart
  • Streaming latency (time to first token) — histogram
  • Tool execution time — bar chart by integration
  • Tool call distribution — pie chart

3. Memory System

  • RAG latency — histogram
  • Vector search latency — line chart
  • Git sync success rate (%) — gauge
  • Embedding generation time — line chart

4. User Activity

  • Active users (1h, 24h, 7d) — 3 gauges
  • Conversations per day — line chart
  • Messages per conversation — histogram
  • WebSocket connections — line chart

5. Errors & Alerts

  • Exception rate — line chart with alert threshold
  • Top 10 error types — table
  • Failed tool calls — line chart
  • Timeout rate (approval, LLM) — line chart

Alerts

Slack webhook on:

  • Error rate > 1% for 5 minutes
  • LLM timeout > 10s (p95)
  • Git sync failures > 3 in 1 hour
  • WebSocket disconnect rate > 10% of active connections
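The first alert above could be expressed as a Prometheus rule along these lines (a sketch: metric names match the counters defined earlier, and routing to Slack would happen in Alertmanager):

```yaml
groups:
  - name: morphee-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(morphee_api_requests_total{status=~"5.."}[5m]))
            / sum(rate(morphee_api_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1% for 5 minutes"
```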

Docker Compose Setup

# docker-compose.dev.yml

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources

volumes:
  prometheus-data:
  grafana-data:

prometheus.yml:

scrape_configs:
  - job_name: 'morphee-backend'
    static_configs:
      - targets: ['backend:8000']
    scrape_interval: 15s
    metrics_path: '/metrics'

Effort Estimation

Task | Size | Notes
Backend: PostHog client + event capture | M | Wrapper, 10 event definitions, PII scrubbing
Backend: Prometheus metrics | M | 15+ custom metrics, middleware
Backend: /metrics endpoint | S | Expose Prometheus
Frontend: PostHog SDK integration | S | Event capture on user actions
Docker: Prometheus + Grafana setup | S | Compose config, volume mounts
Grafana: 5 dashboards | M | JSON dashboard configs
Alert rules: Slack webhooks | S | Prometheus alertmanager config
Tests: Backend metrics | M | 20+ tests
Total: Large (2 weeks)


Risks & Mitigations

Risk | Mitigation
Privacy violations (logging PII) | Strict PII scrubbing, opt-in analytics, regular audits
Performance overhead (metrics collection) | Async event capture, metric sampling for high-volume endpoints
Alert fatigue (too many alerts) | Start with critical alerts only, tune thresholds over time
Prometheus disk usage growth | Set retention period (30 days), compress old data

Future Enhancements

Core Features

  • AI-queryable monitoring — MonitoringIntegration so AI can answer "How's the app performing today?"
  • Anomaly detection — ML-based alerting (Prometheus + Grafana ML)
  • Cost tracking — LLM API costs per user/group, budget alerts
  • User feedback loop — In-app feedback widget → PostHog event
  • Self-hosted PostHog — For privacy-focused deployments
  • Real-time dashboards — Live-updating metrics in Settings page (not just Grafana)
  • Performance budgets — Alert when bundle size, API latency, or LLM tokens exceed thresholds

ACL & Permission Enhancements

Role-Based Analytics Access (leveraging existing PermissionPolicy):

from pydantic import BaseModel

class AnalyticsPermission(BaseModel):
    metric_category: str  # "user_activity", "system_health", "costs", "privacy"
    can_view: list[str]  # Roles: ["parent", "admin"]
    can_export: list[str]  # Roles: ["admin"]
    scope: str  # "own", "group", "system"
    masked_fields: list[str]  # Fields to mask: ["user_id", "message_content"]

Use cases:

  1. Family Analytics:

    • Parents can see group-wide usage stats (messages per day, most-used features)
    • Parents can see each kid's usage (with consent, GDPR compliant)
    • Kids can see only their own stats (not siblings' data)
    • Privacy toggle: Kid can opt out of parent viewing their stats
  2. Classroom Analytics:

    • Teacher can see class-wide engagement metrics (participation, assignment completion)
    • Teacher can see per-student activity (with school policy compliance)
    • Students see only their own progress dashboards
    • Admin can see system-wide stats (all classes, anonymized)
  3. Team/Organization:

    • Manager can see team productivity metrics (tasks completed, meetings scheduled)
    • Manager can't see individual private messages (privacy boundary)
    • Team members see their own contribution metrics
    • C-suite sees org-wide KPIs (aggregated, anonymized)

Privacy-Aware Analytics:

  • Granular consent per metric type:

    consent_types = [
        "analytics.usage_patterns",  # Can track feature usage
        "analytics.performance",     # Can track app performance
        "analytics.error_reports",   # Can capture error logs
        "analytics.session_replay",  # Can record sessions (high privacy impact)
    ]
  • User can opt in/out per category via SettingsIntegration
  • Stored in user_consents table (already exists for GDPR)
  • Analytics client checks consent before capturing events
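The consent gate could look up these categories before every capture. A minimal in-memory sketch (the dict stands in for the user_consents table; real code would query it):

```python
# Illustrative in-memory stand-in for the user_consents table
user_consents = {
    ("user-1", "analytics.usage_patterns"): True,
    ("user-1", "analytics.session_replay"): False,
}

def has_consent(user_id: str, consent_type: str) -> bool:
    """Default-deny: a missing row means the user never opted in."""
    return user_consents.get((user_id, consent_type), False)
```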

Data Access Logs:

  • Log who viewed which analytics dashboards: {user_id, dashboard, timestamp}
  • GDPR compliance: "Show me who accessed my analytics data"
  • Stored in analytics_access_log table
  • Audit trail for compliance reviews

Advanced Analytics Features

Per-Space Analytics:

  • Track metrics per Space: "Homework space has 80% task completion rate"
  • Compare Spaces: "Meal planning space has 2x more activity than shopping space"
  • Space health score: Activity level, task completion, member engagement
  • Stored in PostHog with space_id property

Per-Integration Analytics:

  • Which integrations are most used? (Gmail > Calendar > Tasks)
  • Integration health: Success rate, average latency, error rate
  • Integration adoption funnel: connected → first use → daily active
  • Cost attribution: "Google Calendar integration used 1000 API calls this month"

Cohort Analysis:

  • Group users by persona: "Parents use calendar 3x more than kids"
  • Group users by signup date: "Week 1 cohort has 80% retention"
  • Group users by feature adoption: "Users who connect calendar are 2x more likely to stay active"
  • PostHog built-in cohorts + custom cohorts

Funnel Analysis:

  • Onboarding funnel: signup → group creation → first message → first task created
  • Feature adoption funnel: connected Google → first calendar event → scheduled recurring event
  • Conversion funnel: free → trial → paid (future monetization)
  • Identify drop-off points: "60% of users drop after onboarding step 3"

Retention Analysis:

  • Daily/weekly/monthly active users (DAU/WAU/MAU)
  • Retention curves: Day 1, Day 7, Day 30 retention rates
  • Churn prediction: "Users who don't connect integrations in first week are 3x more likely to churn"
  • Re-engagement campaigns: Send notification to inactive users

A/B Testing:

  • PostHog feature flags for gradual rollout: "Enable new chat UI for 10% of users"
  • Measure impact: "New UI increased messages per day by 15%"
  • Automatic winner selection: PostHog Experiments picks winning variant
  • Rollout to 100% after validation
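PostHog evaluates flags itself; to illustrate the mechanics of a gradual rollout, here is a deterministic percentage bucketer of the kind feature-flag systems use (this is not PostHog's actual algorithm, just a sketch of the idea):

```python
import hashlib

def in_rollout(flag_key: str, user_id: str, percent: float) -> bool:
    """Deterministically bucket a user into the first `percent`% for a flag.

    Hashing flag+user keeps each user's assignment stable across sessions,
    and different flags bucket users independently.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF * 100  # map hash prefix to [0, 100]
    return bucket < percent
```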

AI-Powered Monitoring Enhancements

MonitoringIntegration (new Integration):

class MonitoringIntegration(BaseInterface):
    name = "monitoring"
    description = "Query system health, performance, and usage analytics"

actions:
    get_system_health() → {status: "healthy", uptime: 99.9, errors_last_hour: 3}
    get_performance_metrics(metric_name?) → {api_latency_p95: 250ms, llm_tokens_per_sec: 45}
    get_usage_stats(time_range?) → {messages_today: 127, active_users: 5}
    get_cost_breakdown(period?) → {llm_cost: $12.50, storage_cost: $2.30}
    get_user_activity(user_id?) → {messages: 45, tasks_created: 8, last_active: "2h ago"}

AI can answer questions like:

  • "How's the app doing today?" → queries get_system_health() → "Everything's running smoothly, no errors"
  • "How many messages did we send this week?" → get_usage_stats("7d") → "Your group sent 340 messages this week"
  • "What's our LLM cost this month?" → get_cost_breakdown("30d") → "You've spent $38 on AI this month, under your $50 budget"
  • "Show me my activity today" → get_user_activity() → "You sent 12 messages, created 3 tasks, and searched 2 times"

Proactive Monitoring Alerts:

  • AI notices anomaly: "API latency spiked to 2 seconds (usually 200ms) — investigating..."
  • AI suggests optimization: "You're using 2x more LLM tokens than average — want to reduce context window?"
  • AI detects unused features: "You haven't used calendar in 30 days — disconnect to reduce clutter?"

Smart Alerting:

  • Context-aware: Don't alert during maintenance windows
  • Role-based: Parents get family usage alerts, admins get system health alerts
  • Intelligent throttling: Group similar alerts, don't spam (max 1 alert per hour per category)
  • Alert via NotificationsIntegration: "System health alert — database connection slow"

Predictive Analytics:

  • Predict churn: "User hasn't logged in for 5 days, usually logs daily — send re-engagement notification?"
  • Predict resource needs: "Your group's usage growing 20%/month — upgrade to larger plan in 2 months?"
  • Predict failures: "Error rate increasing (trend detected) — preemptive investigation triggered"

Cost Management & Optimization

Cost Tracking:

cost_breakdown = {
    "llm": {
        "anthropic_api": {"input_tokens": 1_200_000, "output_tokens": 800_000, "cost": 42.50},
        "embeddings": {"vectors_generated": 15_000, "cost": 0.30},
    },
    "storage": {
        "postgresql": {"storage_gb": 5, "cost": 2.50},
        "redis": {"memory_gb": 1, "cost": 0.50},
    },
    "infrastructure": {
        "compute": {"hours": 720, "cost": 30.00},
        "bandwidth": {"gb": 50, "cost": 5.00},
    },
    "total": 80.80,  # USD, sum of the line items above
}

Budget Alerts:

  • Set monthly budget: settings.update("monitoring.budget", {"monthly": 100})
  • Alert at 80%: "You've used $80 of your $100 monthly budget"
  • Alert at 100%: "Budget exceeded — consider upgrading or reducing usage"
  • Per-integration budgets: "Google Calendar used $5 this month (under $10 limit)"
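The threshold logic itself is simple. A sketch of the 80%/100% checks (function name and message wording are illustrative):

```python
def budget_alerts(spent: float, monthly_budget: float) -> list[str]:
    """Return alert messages for any crossed budget thresholds."""
    alerts = []
    if spent >= monthly_budget:
        alerts.append(
            f"Budget exceeded (${spent:.2f} of ${monthly_budget:.2f}) "
            f"- consider upgrading or reducing usage")
    elif spent >= 0.8 * monthly_budget:
        alerts.append(
            f"You've used ${spent:.2f} of your ${monthly_budget:.2f} monthly budget")
    return alerts
```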

Cost Optimization Suggestions:

  • AI analyzes costs: "You're spending 70% on LLM — reduce temperature from 0.9 to 0.7 to save 20%?"
  • AI suggests caching: "You're re-generating the same summaries — enable caching to save $10/month"
  • AI suggests model switching: "For simple tasks, use Haiku instead of Sonnet — save 80%"

Privacy & Security for Analytics

Differential Privacy:

  • Add noise to aggregate metrics to prevent re-identification
  • Group-level stats show "≈5 active users" instead of exact count
  • Protects individual behavior from being inferred

Data Retention Policies:

  • Raw events: 30 days (PostHog)
  • Aggregated metrics: 1 year (Grafana)
  • Audit logs: 7 years (compliance)
  • User can request early deletion: GDPR right to erasure

Anonymization Pipeline:

  • Before storing events, strip PII:
    • user_id → hashed_user_id
    • message_content → message_length
    • email → domain_only (e.g., "gmail.com")
  • Anonymization policy configurable per metric
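The three transforms above could be a small pure pipeline. A sketch (salt handling is simplified; a real deployment would keep the salt in secrets management and rotate it with care, since rotation breaks ID continuity):

```python
import hashlib

def hash_user_id(user_id: str, salt: str) -> str:
    """One-way pseudonymization: same input + salt always yields the same opaque ID."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]

def email_domain(email: str) -> str:
    """Keep only the domain, e.g. 'alice@gmail.com' becomes 'gmail.com'."""
    return email.rsplit("@", 1)[-1].lower()

def anonymize_event(event: dict, salt: str) -> dict:
    """Apply the PII transforms listed above before an event is stored."""
    out = dict(event)
    if "user_id" in out:
        out["hashed_user_id"] = hash_user_id(out.pop("user_id"), salt)
    if "message_content" in out:
        out["message_length"] = len(out.pop("message_content"))
    if "email" in out:
        out["email_domain"] = email_domain(out.pop("email"))
    return out
```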

GDPR Compliance:

  • Right to access: Export all analytics data for user
  • Right to erasure: Delete all analytics events for user (cascading delete)
  • Right to portability: Export analytics in JSON format
  • Consent management: Opt-in for each analytics category

Real-Time Monitoring in App

Settings Page Metrics (read-only dashboard):

  • "Your Activity Today": messages sent, tasks created, integrations used
  • "Group Activity": top 3 active members, most-used features
  • "System Health": uptime, last error, performance status (green/yellow/red)
  • Embedded mini Grafana charts (iframe) or custom React components

Push Notifications for Alerts:

  • "System slow — we're investigating" (via NotificationsIntegration)
  • "Your LLM budget 80% used — $20 remaining this month"
  • "New feature available — try the new calendar view!"

In-Chat Analytics:

  • User: "How much have I used the app this week?" → AI calls MonitoringIntegration → responds inline
  • Interactive charts rendered as FrontendIntegration components in chat

Last Updated: February 13, 2026