Monitoring & Analytics — Design Document
Created: February 13, 2026 · Status: Planned · Phase: 3k (Observability)
Overview
Production observability via PostHog (product analytics) and Grafana + Prometheus (infrastructure metrics). Understand user behavior, monitor system health, debug issues, and track performance.
Motivation
Problem: Without observability, we're blind:
- Can't answer "what features do users actually use?"
- No visibility into performance bottlenecks (slow API endpoints, memory leaks)
- Debugging user-reported issues is guesswork
- No data-driven product decisions
Solution: Two-layer monitoring:
- PostHog — Product analytics (user behavior, feature usage, funnels)
- Grafana + Prometheus — Infrastructure metrics (API latency, LLM tokens, error rates)
PostHog — Product Analytics
What to Track
Events (10+ core events):
| Event | Properties | Purpose |
|---|---|---|
| chat_message_sent | conversation_id, message_length, streaming | Engagement: messages per day, conversation length |
| tool_call_executed | integration, action, success, duration_ms | Feature usage: which integrations are most used? |
| approval_requested | tool, approved, timeout | User friction: approval rates, timeout frequency |
| onboarding_completed | persona, spaces_created, time_to_complete | Conversion: onboarding completion rate, drop-off points |
| integration_connected | integration_name, success | Adoption: which integrations do users connect first? |
| conversation_created | space_id, title | Usage patterns: conversations per user/day |
| search_performed | query, result_count, result_types | Feature value: search usage, relevance |
| setting_changed | category, key, old_value, new_value | Configuration: what do users customize? |
| error_occurred | error_type, endpoint, stack_trace | Reliability: error frequency, top error sources |
| page_viewed | page, referrer, duration | Navigation: user flows, page popularity |
User properties:
- persona — parent, teacher, manager, etc.
- group_size — number of members in group
- active_integrations — list of connected integrations
- plan_tier — free, pro, enterprise (future)
- signup_date — cohort analysis
Implementation
Backend: PostHog Client
# backend/analytics/posthog.py
from posthog import Posthog

from backend.config import settings
# has_consent() and sanitize_properties() are project helpers (defined elsewhere)

posthog_client = Posthog(
    project_api_key=settings.POSTHOG_API_KEY,
    host=settings.POSTHOG_HOST  # self-hosted or cloud
)

async def capture_event(
    user_id: str,
    event: str,
    properties: dict,
    group_id: str | None = None
):
    """Capture an event in PostHog (async, non-blocking)."""
    if not await has_consent(user_id, "analytics"):
        return  # User opted out

    # PII scrubbing: never send message content, only metadata
    sanitized_props = sanitize_properties(properties)

    posthog_client.capture(
        distinct_id=user_id,
        event=event,
        properties=sanitized_props,
        groups={"group": group_id} if group_id else None
    )
Privacy: Opt-in + PII Scrubbing
- Opt-in only — analytics_enabled setting (default: false, user must explicitly enable)
- Never log PII — no message content, no emails, no names. Only opaque IDs and metadata
- Anonymize IPs — PostHog config: anonymize_ips: true
- Session replay opt-in — separate consent: session_replay_enabled (default: false)
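The sanitize_properties helper used by the analytics client above might look like this. A minimal sketch: the blocklist contents and the redaction regex are assumptions, not the final policy.

```python
import re

# Keys that must never reach PostHog, even if a caller passes them.
# Blocklist contents are an assumption, not the final policy.
BLOCKED_KEYS = {"message_content", "email", "name", "stack_trace_locals"}

# Redact obvious email-like strings that slip into free-form values.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize_properties(properties: dict) -> dict:
    """Drop blocklisted keys and redact email-like substrings in values."""
    clean = {}
    for key, value in properties.items():
        if key in BLOCKED_KEYS:
            continue  # never forward blocklisted fields
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted-email]", value)
        clean[key] = value
    return clean
```

A denylist like this is a safety net, not a guarantee; the primary defense remains never putting PII into event properties at the call site.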
Event capture points:
# backend/api/chat.py
await analytics.capture_event(
    user_id=ctx.user_id,
    event="chat_message_sent",
    properties={"conversation_id": conv_id, "message_length": len(content)},
    group_id=ctx.group_id,
)

# backend/chat/orchestrator.py
await analytics.capture_event(
    user_id=ctx.user_id,
    event="tool_call_executed",
    properties={"integration": interface, "action": action, "success": result.success, "duration_ms": elapsed},
    group_id=ctx.group_id,
)
PostHog Dashboards
1. Engagement
- Daily/weekly/monthly active users
- Messages per user per day
- Conversations per user per week
- Session duration distribution
2. Feature Usage
- Most-used integrations (bar chart)
- Tool calls by integration (pie chart)
- Approval request rates (% approved, % rejected, % timeout)
- Search usage (queries per user, result click-through rate)
3. Onboarding
- Funnel: signup → onboarding start → onboarding complete → first message
- Time to first message (median, p95)
- Onboarding drop-off points
- Persona distribution (pie chart)
4. Errors
- Error rate over time (line chart)
- Top 10 error types (table)
- Errors by endpoint (table)
- User-reported errors vs auto-detected (comparison)
Grafana + Prometheus — Infrastructure Metrics
What to Track
API Metrics:
from prometheus_client import Counter, Histogram, Gauge

# Request counter
api_requests_total = Counter(
    'morphee_api_requests_total',
    'Total API requests',
    ['endpoint', 'status']
)

# Latency histogram
api_request_duration_seconds = Histogram(
    'morphee_api_request_duration_seconds',
    'API request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
)

# Active connections
websocket_connections_active = Gauge(
    'morphee_websocket_connections_active',
    'Active WebSocket connections'
)
LLM Metrics:
# Token usage
llm_tokens_total = Counter(
    'morphee_llm_tokens_total',
    'Total LLM tokens consumed',
    ['model', 'type']  # type: input or output
)

# Streaming latency
llm_streaming_latency_seconds = Histogram(
    'morphee_llm_streaming_latency_seconds',
    'Time to first token',
    ['model']
)

# Tool call count
llm_tool_calls_total = Counter(
    'morphee_llm_tool_calls_total',
    'Total tool calls',
    ['integration', 'action']
)
Memory Metrics:
# RAG pipeline
memory_rag_latency_seconds = Histogram(
    'morphee_memory_rag_latency_seconds',
    'RAG pipeline latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)

# Vector search
memory_vector_search_results = Histogram(
    'morphee_memory_vector_search_results',
    'Number of results returned',
    buckets=[0, 1, 5, 10, 20, 50]
)

# Git sync
memory_git_sync_success = Counter(
    'morphee_memory_git_sync_success',
    'Git sync successes',
    ['group_id']
)
memory_git_sync_failures = Counter(
    'morphee_memory_git_sync_failures',
    'Git sync failures',
    ['group_id', 'error_type']
)
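At the call sites, these metric objects are used roughly as follows. A self-contained sketch: two of the metrics above are re-declared here, and do_rag_query is a hypothetical stand-in for the real pipeline entry point.

```python
from prometheus_client import Counter, Histogram

memory_rag_latency_seconds = Histogram(
    'morphee_memory_rag_latency_seconds',
    'RAG pipeline latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)
memory_git_sync_failures = Counter(
    'morphee_memory_git_sync_failures',
    'Git sync failures',
    ['group_id', 'error_type']
)

def do_rag_query(query: str) -> list[str]:
    """Stand-in for the real RAG pipeline (hypothetical)."""
    return [f"result for {query}"]

def run_rag_query(query: str) -> list[str]:
    # Histogram.time() observes the elapsed wall-clock time on exit.
    with memory_rag_latency_seconds.time():
        return do_rag_query(query)

def record_sync_failure(group_id: str, exc: Exception) -> None:
    # .labels() selects one time series per (group_id, error_type) pair.
    memory_git_sync_failures.labels(
        group_id=group_id, error_type=type(exc).__name__
    ).inc()
```

Note that labeling by group_id means one time series per group, which is fine at small scale but worth revisiting if the number of groups grows large.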
Implementation
Backend: Prometheus Client
# backend/utils/metrics.py
from fastapi import Response
from prometheus_client import REGISTRY, generate_latest, CONTENT_TYPE_LATEST

# The metrics defined above register themselves on prometheus_client's
# default REGISTRY, so expose that rather than a fresh CollectorRegistry
# (which would serve an empty page). `app` is the shared FastAPI instance.

@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(REGISTRY), media_type=CONTENT_TYPE_LATEST)
Middleware: Request tracking
# backend/main.py
from time import monotonic  # monotonic clock: immune to wall-clock jumps

from fastapi import Request

@app.middleware("http")
async def track_request_metrics(request: Request, call_next):
    start = monotonic()
    response = await call_next(request)
    duration = monotonic() - start

    api_requests_total.labels(
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    api_request_duration_seconds.labels(
        endpoint=request.url.path
    ).observe(duration)

    return response
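One caveat: labeling by raw request.url.path lets every unique URL (each conversation id, each user id) create a new Prometheus time series. A sketch of collapsing id-like segments into a placeholder before labeling; the regex is an assumption about this API's URL shapes, and normalize_endpoint is a hypothetical helper name.

```python
import re

# Collapse high-cardinality path segments (numeric ids, UUIDs) into a
# placeholder so the 'endpoint' label set stays bounded.
ID_SEGMENT = re.compile(r"/(?:[0-9a-f]{8}-[0-9a-f-]{27}|\d+)(?=/|$)")

def normalize_endpoint(path: str) -> str:
    """Return a route-template-like label for a concrete request path."""
    return ID_SEGMENT.sub("/{id}", path)

# Usage in the middleware (hypothetical):
#   api_requests_total.labels(endpoint=normalize_endpoint(request.url.path), ...)
```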
Grafana Dashboards
1. API Health
- Request rate (req/sec) — line chart
- Error rate (%) — line chart with alert threshold (>1% = red)
- Latency (p50, p95, p99) — line chart
- Status code distribution (2xx, 4xx, 5xx) — stacked area chart
2. LLM Performance
- Tokens per second — line chart
- Streaming latency (time to first token) — histogram
- Tool execution time — bar chart by integration
- Tool call distribution — pie chart
3. Memory System
- RAG latency — histogram
- Vector search latency — line chart
- Git sync success rate (%) — gauge
- Embedding generation time — line chart
4. User Activity
- Active users (1h, 24h, 7d) — 3 gauges
- Conversations per day — line chart
- Messages per conversation — histogram
- WebSocket connections — line chart
5. Errors & Alerts
- Exception rate — line chart with alert threshold
- Top 10 error types — table
- Failed tool calls — line chart
- Timeout rate (approval, LLM) — line chart
Alerts
Slack webhook on:
- Error rate > 1% for 5 minutes
- LLM timeout > 10s (p95)
- Git sync failures > 3 in 1 hour
- WebSocket disconnect rate > 10% of active connections
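The first and third conditions could be expressed as Prometheus alerting rules along these lines. A sketch: metric names follow the definitions in this document, thresholds follow the list above, and Slack delivery itself is Alertmanager routing config (not shown).

```yaml
# prometheus/alerts.yml (sketch)
groups:
  - name: morphee-critical
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(morphee_api_requests_total{status=~"5.."}[5m]))
            / sum(rate(morphee_api_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1% for 5 minutes"
      - alert: GitSyncFailures
        expr: sum(increase(morphee_memory_git_sync_failures_total[1h])) > 3
        labels:
          severity: warning
        annotations:
          summary: "More than 3 git sync failures in the last hour"
```

Note the `_total` suffix: prometheus_client appends it to Counter names on the exposition format.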
Docker Compose Setup
# docker-compose.dev.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources

volumes:
  prometheus-data:
  grafana-data:
prometheus.yml:
scrape_configs:
  - job_name: 'morphee-backend'
    static_configs:
      - targets: ['backend:8000']
    scrape_interval: 15s
    metrics_path: '/metrics'
Effort Estimation
| Task | Size | Notes |
|---|---|---|
| Backend: PostHog client + event capture | M | Wrapper, 10 event definitions, PII scrubbing |
| Backend: Prometheus metrics | M | 15+ custom metrics, middleware |
| Backend: /metrics endpoint | S | Expose Prometheus metrics |
| Frontend: PostHog SDK integration | S | Event capture on user actions |
| Docker: Prometheus + Grafana setup | S | Compose config, volume mounts |
| Grafana: 5 dashboards | M | JSON dashboard configs |
| Alert rules: Slack webhooks | S | Prometheus alertmanager config |
| Tests: Backend metrics | M | 20+ tests |
Total: Large (2 weeks)
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Privacy violations (logging PII) | Strict PII scrubbing, opt-in analytics, regular audits |
| Performance overhead (metrics collection) | Async event capture, metric sampling for high-volume endpoints |
| Alert fatigue (too many alerts) | Start with critical alerts only, tune thresholds over time |
| Prometheus disk usage growth | Set retention period (30 days), compress old data |
Future Enhancements
Core Features
- AI-queryable monitoring — MonitoringIntegration so the AI can answer "How's the app performing today?"
- Anomaly detection — ML-based alerting (Prometheus + Grafana ML)
- Cost tracking — LLM API costs per user/group, budget alerts
- User feedback loop — In-app feedback widget → PostHog event
- Self-hosted PostHog — For privacy-focused deployments
- Real-time dashboards — Live-updating metrics in Settings page (not just Grafana)
- Performance budgets — Alert when bundle size, API latency, or LLM tokens exceed thresholds
ACL & Permission Enhancements
Role-Based Analytics Access (leveraging existing PermissionPolicy):
class AnalyticsPermission(BaseModel):
    metric_category: str      # "user_activity", "system_health", "costs", "privacy"
    can_view: list[str]       # Roles: ["parent", "admin"]
    can_export: list[str]     # Roles: ["admin"]
    scope: str                # "own", "group", "system"
    masked_fields: list[str]  # Fields to mask: ["user_id", "message_content"]
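Enforcement over this model can be a small pure function. A sketch, using a stdlib dataclass as a stand-in for the Pydantic model above; the policy values and helper names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AnalyticsPermission:
    metric_category: str
    can_view: list[str]
    can_export: list[str]
    scope: str
    masked_fields: list[str]

# Illustrative policy for the "costs" category (values are assumptions)
COSTS_POLICY = AnalyticsPermission(
    metric_category="costs",
    can_view=["parent", "admin"],
    can_export=["admin"],
    scope="group",
    masked_fields=["user_id"],
)

def authorize(policy: AnalyticsPermission, role: str, action: str) -> bool:
    """Return True if `role` may perform `action` ('view' or 'export')."""
    allowed = policy.can_view if action == "view" else policy.can_export
    return role in allowed

def mask(policy: AnalyticsPermission, row: dict) -> dict:
    """Replace masked fields with a placeholder before returning data."""
    return {k: ("***" if k in policy.masked_fields else v) for k, v in row.items()}
```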
Use cases:
- Family Analytics:
  - Parents can see group-wide usage stats (messages per day, most-used features)
  - Parents can see each kid's usage (with consent, GDPR compliant)
  - Kids can see only their own stats (not siblings' data)
  - Privacy toggle: kid can opt out of parent viewing their stats
- Classroom Analytics:
  - Teacher can see class-wide engagement metrics (participation, assignment completion)
  - Teacher can see per-student activity (with school policy compliance)
  - Students see only their own progress dashboards
  - Admin can see system-wide stats (all classes, anonymized)
- Team/Organization:
  - Manager can see team productivity metrics (tasks completed, meetings scheduled)
  - Manager can't see individual private messages (privacy boundary)
  - Team members see their own contribution metrics
  - C-suite sees org-wide KPIs (aggregated, anonymized)
Privacy-Aware Analytics:
- Granular consent per metric type:
  consent_types = [
      "analytics.usage_patterns",   # Can track feature usage
      "analytics.performance",      # Can track app performance
      "analytics.error_reports",    # Can capture error logs
      "analytics.session_replay",   # Can record sessions (high privacy impact)
  ]
- User can opt in/out per category via SettingsIntegration
- Stored in user_consents table (already exists for GDPR)
- Analytics client checks consent before capturing events
Data Access Logs:
- Log who viewed which analytics dashboards: {user_id, dashboard, timestamp}
- GDPR compliance: "Show me who accessed my analytics data"
- Stored in analytics_access_log table
- Audit trail for compliance reviews
Advanced Analytics Features
Per-Space Analytics:
- Track metrics per Space: "Homework space has 80% task completion rate"
- Compare Spaces: "Meal planning space has 2x more activity than shopping space"
- Space health score: Activity level, task completion, member engagement
- Stored in PostHog with space_id property
Per-Integration Analytics:
- Which integrations are most used? (Gmail > Calendar > Tasks)
- Integration health: Success rate, average latency, error rate
- Integration adoption funnel: connected → first use → daily active
- Cost attribution: "Google Calendar integration used 1000 API calls this month"
Cohort Analysis:
- Group users by persona: "Parents use calendar 3x more than kids"
- Group users by signup date: "Week 1 cohort has 80% retention"
- Group users by feature adoption: "Users who connect calendar are 2x more likely to stay active"
- PostHog built-in cohorts + custom cohorts
Funnel Analysis:
- Onboarding funnel: signup → group creation → first message → first task created
- Feature adoption funnel: connected Google → first calendar event → scheduled recurring event
- Conversion funnel: free → trial → paid (future monetization)
- Identify drop-off points: "60% of users drop after onboarding step 3"
Retention Analysis:
- Daily/weekly/monthly active users (DAU/WAU/MAU)
- Retention curves: Day 1, Day 7, Day 30 retention rates
- Churn prediction: "Users who don't connect integrations in first week are 3x more likely to churn"
- Re-engagement campaigns: Send notification to inactive users
A/B Testing:
- PostHog feature flags for gradual rollout: "Enable new chat UI for 10% of users"
- Measure impact: "New UI increased messages per day by 15%"
- Automatic winner selection: PostHog Experiments picks winning variant
- Rollout to 100% after validation
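Flag-gating in backend code could be as small as this sketch. The flag_lookup parameter abstracts PostHog's feature_enabled(key, distinct_id) call so the routing logic stays testable offline; chat_ui_variant and the "new-chat-ui" flag key are hypothetical names.

```python
from typing import Callable

def chat_ui_variant(user_id: str, flag_lookup: Callable[[str, str], bool]) -> str:
    """Pick the chat UI variant to serve for this user.

    `flag_lookup` abstracts posthog_client.feature_enabled(key, distinct_id),
    so this function needs no network access to be tested.
    """
    if flag_lookup("new-chat-ui", user_id):
        return "new"
    return "legacy"

# In production (assumption: posthog_client configured as earlier in this doc):
#   variant = chat_ui_variant(user_id, posthog_client.feature_enabled)
```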
AI-Powered Monitoring Enhancements
MonitoringIntegration (new Integration):
class MonitoringIntegration(BaseInterface):
    name = "monitoring"
    description = "Query system health, performance, and usage analytics"

actions (illustrative signatures and sample returns):
    get_system_health() → {status: "healthy", uptime: 99.9, errors_last_hour: 3}
    get_performance_metrics(metric_name?) → {api_latency_p95: 250ms, llm_tokens_per_sec: 45}
    get_usage_stats(time_range?) → {messages_today: 127, active_users: 5}
    get_cost_breakdown(period?) → {llm_cost: $12.50, storage_cost: $2.30}
    get_user_activity(user_id?) → {messages: 45, tasks_created: 8, last_active: "2h ago"}
AI can answer questions like:
- "How's the app doing today?" → queries get_system_health() → "Everything's running smoothly, no errors"
- "How many messages did we send this week?" → get_usage_stats("7d") → "Your group sent 340 messages this week"
- "What's our LLM cost this month?" → get_cost_breakdown("30d") → "You've spent $38 on AI this month, under your $50 budget"
- "Show me my activity today" → get_user_activity() → "You sent 12 messages, created 3 tasks, and searched 2 times"
Proactive Monitoring Alerts:
- AI notices anomaly: "API latency spiked to 2 seconds (usually 200ms) — investigating..."
- AI suggests optimization: "You're using 2x more LLM tokens than average — want to reduce context window?"
- AI detects unused features: "You haven't used calendar in 30 days — disconnect to reduce clutter?"
Smart Alerting:
- Context-aware: Don't alert during maintenance windows
- Role-based: Parents get family usage alerts, admins get system health alerts
- Intelligent throttling: Group similar alerts, don't spam (max 1 alert per hour per category)
- Alert via NotificationsIntegration: "System health alert — database connection slow"
Predictive Analytics:
- Predict churn: "User hasn't logged in for 5 days, usually logs daily — send re-engagement notification?"
- Predict resource needs: "Your group's usage growing 20%/month — upgrade to larger plan in 2 months?"
- Predict failures: "Error rate increasing (trend detected) — preemptive investigation triggered"
Cost Management & Optimization
Cost Tracking:
cost_breakdown = {
    "llm": {
        "anthropic_api": {"input_tokens": 1_200_000, "output_tokens": 800_000, "cost_usd": 42.50},
        "embeddings": {"vectors_generated": 15_000, "cost_usd": 0.30},
    },
    "storage": {
        "postgresql": {"storage_gb": 5, "cost_usd": 2.50},
        "redis": {"memory_gb": 1, "cost_usd": 0.50},
    },
    "infrastructure": {
        "compute": {"hours": 720, "cost_usd": 30.00},
        "bandwidth": {"gb": 50, "cost_usd": 5.00},
    },
    "total_usd": 81.30,
}
Budget Alerts:
- Set monthly budget: settings.update("monitoring.budget", {"monthly": 100})
- Alert at 80%: "You've used $80 of your $100 monthly budget"
- Alert at 100%: "Budget exceeded — consider upgrading or reducing usage"
- Per-integration budgets: "Google Calendar used $5 this month (under $10 limit)"
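The threshold logic above is simple enough to sketch as a pure function; the function name and message wording are illustrative.

```python
def budget_alerts(spend: float, budget: float) -> list[str]:
    """Return alert messages per the 80% and 100% thresholds above."""
    alerts = []
    if budget <= 0:
        return alerts  # no budget configured
    if spend >= budget:
        alerts.append(f"Budget exceeded: ${spend:.2f} of ${budget:.2f} used")
    elif spend >= 0.8 * budget:
        alerts.append(f"${spend:.2f} of ${budget:.2f} monthly budget used (80% threshold)")
    return alerts
```

Emitting at most one alert per check keeps the 80% warning from firing alongside the 100% one.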
Cost Optimization Suggestions:
- AI analyzes costs: "You're spending 70% on LLM — reduce temperature from 0.9 to 0.7 to save 20%?"
- AI suggests caching: "You're re-generating the same summaries — enable caching to save $10/month"
- AI suggests model switching: "For simple tasks, use Haiku instead of Sonnet — save 80%"
Privacy & Security for Analytics
Differential Privacy:
- Add noise to aggregate metrics to prevent re-identification
- Group-level stats show "≈5 active users" instead of exact count
- Protects individual behavior from being inferred
Data Retention Policies:
- Raw events: 30 days (PostHog)
- Aggregated metrics: 1 year (Grafana)
- Audit logs: 7 years (compliance)
- User can request early deletion: GDPR right to erasure
Anonymization Pipeline:
- Before storing events, strip PII:
- user_id → hashed_user_id
- message_content → message_length
- email → domain_only (e.g., "gmail.com")
- Anonymization policy configurable per metric
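The three transformations listed above can be sketched as one function; the field names and the 16-hex-char pseudonym length are assumptions.

```python
import hashlib

def anonymize_event(event: dict, salt: str) -> dict:
    """Apply the pipeline above: hash the user id, reduce message content
    to its length, and keep only the email domain."""
    out = dict(event)
    if "user_id" in out:
        # Salted hash: a stable pseudonym, not reversible without the salt.
        raw = (salt + out.pop("user_id")).encode()
        out["hashed_user_id"] = hashlib.sha256(raw).hexdigest()[:16]
    if "message_content" in out:
        out["message_length"] = len(out.pop("message_content"))
    if "email" in out:
        out["email_domain"] = out.pop("email").rsplit("@", 1)[-1]
    return out
```

Keeping the salt server-side (and rotating it per retention window) prevents re-identification by anyone who only holds the stored events.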
GDPR Compliance:
- Right to access: Export all analytics data for user
- Right to erasure: Delete all analytics events for user (cascading delete)
- Right to portability: Export analytics in JSON format
- Consent management: Opt-in for each analytics category
Real-Time Monitoring in App
Settings Page Metrics (read-only dashboard):
- "Your Activity Today": messages sent, tasks created, integrations used
- "Group Activity": top 3 active members, most-used features
- "System Health": uptime, last error, performance status (green/yellow/red)
- Embedded mini Grafana charts (iframe) or custom React components
Push Notifications for Alerts:
- "System slow — we're investigating" (via NotificationsIntegration)
- "Your LLM budget 80% used — $20 remaining this month"
- "New feature available — try the new calendar view!"
In-Chat Analytics:
- User: "How much have I used the app this week?" → AI calls MonitoringIntegration → responds inline
- Interactive charts rendered as FrontendIntegration components in chat
Last Updated: February 13, 2026