📊 Monitoring & Alerting
A production n8n instance needs monitoring to catch issues before they impact the business. This post covers execution monitoring, alerting setup, and performance tracking.
Monitoring Overview
What to Monitor:
```text
N8N MONITORING STACK
────────────────────

APPLICATION LEVEL
├── Workflow executions (success/fail)
├── Execution duration
├── Queue depth (if using queue)
└── Active workflows count

INFRASTRUCTURE LEVEL
├── CPU usage
├── Memory usage
├── Disk space
└── Network I/O

DATABASE LEVEL
├── Connection count
├── Query performance
├── Database size
└── Replication lag

EXTERNAL SERVICES
├── API response times
├── Service availability
├── Rate limit usage
└── Authentication status
```
Built-in Monitoring
Execution List:
```text
Access: Executions menu (left sidebar)

Shows:
├── All executions
├── Running workflows
├── Execution status (success/error/waiting)
├── Start time & duration
├── Retry information
└── Filter by workflow/status/date
```
Enable Execution Saving:
```bash
# Save all executions
EXECUTIONS_DATA_SAVE_ON_ERROR=all
EXECUTIONS_DATA_SAVE_ON_SUCCESS=all

# Or save only errors
EXECUTIONS_DATA_SAVE_ON_SUCCESS=none

# Set retention
EXECUTIONS_DATA_MAX_AGE=168  # 7 days in hours
EXECUTIONS_DATA_PRUNE=true
EXECUTIONS_DATA_PRUNE_MAX_COUNT=50000
```
Health Check Endpoints
Basic Health Check:
```bash
# Check if n8n is responding
curl -I https://n8n.yourdomain.com/healthz

# Expected response
HTTP/1.1 200 OK
```
Detailed Health Check Script:
```bash
#!/bin/bash
# health-check.sh

N8N_URL="https://n8n.yourdomain.com"

# Check HTTP response
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$N8N_URL/healthz")

if [ "$HTTP_CODE" = "200" ]; then
  echo "✅ n8n is healthy"
  exit 0
else
  echo "❌ n8n health check failed: HTTP $HTTP_CODE"
  exit 1
fi
```
Database Health Check:
```bash
#!/bin/bash
# db-health.sh

# Check PostgreSQL connection
docker exec n8n-postgres pg_isready -U n8n -d n8n

if [ $? -eq 0 ]; then
  echo "✅ Database is ready"
else
  echo "❌ Database connection failed"
  exit 1
fi

# Check database size
docker exec n8n-postgres psql -U n8n -d n8n -c "
SELECT pg_size_pretty(pg_database_size('n8n')) as db_size;
"
```
Prometheus Monitoring
Enable Prometheus Metrics:
```bash
# Enable metrics endpoint
N8N_METRICS=true
N8N_METRICS_PREFIX=n8n_

# Metrics available at
# https://n8n.yourdomain.com/metrics
```
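Before pointing Prometheus at the endpoint, it's worth confirming it actually responds after the restart. A quick check, assuming the same domain used in the earlier examples:

```bash
# List a few n8n metrics to confirm the endpoint is exposed
curl -s https://n8n.yourdomain.com/metrics | grep "^n8n_" | head -n 20
```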
Available Metrics:
```text
# Workflow metrics
n8n_workflow_executions_total{status="success|error"}
n8n_workflow_execution_duration_seconds
n8n_workflow_active_total

# Queue metrics (queue mode)
n8n_queue_depth
n8n_queue_job_processing_time_seconds

# System metrics
n8n_api_requests_total
n8n_webhook_requests_total
```
Prometheus Config:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'n8n'
    static_configs:
      - targets: ['n8n:5678']
    metrics_path: /metrics
    basic_auth:
      username: n8n
      password: your_password  # If auth enabled
```
Docker Compose with Prometheus:
```yaml
version: '3.8'

services:
  n8n:
    image: n8nio/n8n:latest
    environment:
      - N8N_METRICS=true
      # ... other config

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  prometheus_data:
  grafana_data:
```
Grafana Dashboard
Import n8n Dashboard:
```text
1. Open Grafana
2. Dashboards → Import
3. Upload JSON or paste dashboard ID
4. Select Prometheus data source
5. Import
```
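If you prefer provisioning dashboards from the command line instead of clicking through the UI, Grafana's HTTP API accepts the same dashboard JSON. A rough sketch, assuming Grafana is reachable on port 3000 as in the Compose file above; `GRAFANA_TOKEN` and `n8n-dashboard.json` are placeholders:

```bash
# Import a dashboard via Grafana's HTTP API
# $GRAFANA_TOKEN is a service account / API token you create in Grafana
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"dashboard\": $(cat n8n-dashboard.json), \"overwrite\": true}"
```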
Custom Dashboard Panels:
```json
{
  "panels": [
    {
      "title": "Workflow Executions",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(increase(n8n_workflow_executions_total[24h]))",
          "legendFormat": "Total Executions"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(rate(n8n_workflow_executions_total{status=\"error\"}[1h])) / sum(rate(n8n_workflow_executions_total[1h])) * 100",
          "legendFormat": "Error %"
        }
      ]
    },
    {
      "title": "Execution Duration",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(n8n_workflow_execution_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "p95 Duration"
        }
      ]
    }
  ]
}
```
Alerting Setup
Prometheus Alert Rules:
```yaml
# Alert rules
groups:
  - name: n8n-alerts
    rules:
      - alert: N8NHighErrorRate
        expr: |
          sum(rate(n8n_workflow_executions_total{status="error"}[5m]))
          / sum(rate(n8n_workflow_executions_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in n8n"
          description: "Error rate is above 10% for 5 minutes"

      - alert: N8NDown
        expr: up{job="n8n"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "n8n instance is down"
          description: "n8n has been unreachable for 1 minute"

      - alert: N8NSlowExecutions
        expr: |
          histogram_quantile(0.95, sum(rate(n8n_workflow_execution_duration_seconds_bucket[5m])) by (le)) > 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow workflow executions"
          description: "95th percentile execution time > 60s"
```
Slack Alert Integration:
```yaml
# alertmanager.yml
route:
  receiver: 'slack-notifications'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#n8n-alerts'
        username: 'AlertManager'
        icon_emoji: ':warning:'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
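For these alerts to actually reach Slack, Prometheus has to load the rule file and know where Alertmanager is listening, and an `alertmanager` service (image `prom/alertmanager`, default port 9093) has to be added to the Compose stack with the `alertmanager.yml` above mounted as its configuration. A minimal sketch of the `prometheus.yml` additions; the rule file name is an assumption:

```yaml
# prometheus.yml additions (sketch)
rule_files:
  - /etc/prometheus/alert-rules.yml   # the alert rules shown above, mounted into the container

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # Alertmanager service on the same Compose network
```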
Self-Monitoring Workflow
Create Monitoring Workflow in n8n:
```json
{
  "name": "Self Monitoring",
  "nodes": [
    {
      "name": "Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": {
          "interval": [{ "field": "minutes", "minutesInterval": 5 }]
        }
      }
    },
    {
      "name": "Check Execution Stats",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "executeQuery",
        "query": "SELECT COUNT(*) as total, SUM(CASE WHEN finished = true AND \"stoppedAt\" IS NOT NULL THEN 1 ELSE 0 END) as success, SUM(CASE WHEN \"stoppedAt\" IS NULL AND \"startedAt\" < NOW() - INTERVAL '1 hour' THEN 1 ELSE 0 END) as stuck FROM execution_entity WHERE \"startedAt\" > NOW() - INTERVAL '1 hour'"
      }
    },
    {
      "name": "Check for Issues",
      "type": "n8n-nodes-base.if",
      "parameters": {
        "conditions": {
          "number": [
            {
              "value1": "={{ $json.stuck }}",
              "operation": "larger",
              "value2": 0
            }
          ]
        }
      }
    },
    {
      "name": "Send Alert",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "channel": "#n8n-alerts",
        "text": "⚠️ n8n Alert: {{ $json.stuck }} stuck executions detected!"
      }
    }
  ]
}
```
Database Monitoring
Execution Stats Query:
```sql
-- Execution statistics
SELECT
  DATE(e."startedAt") as date,
  COUNT(*) as total_executions,
  SUM(CASE WHEN e.finished THEN 1 ELSE 0 END) as successful,
  SUM(CASE WHEN NOT e.finished THEN 1 ELSE 0 END) as failed,
  ROUND(AVG(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt")))::numeric, 2) as avg_duration_sec
FROM execution_entity e
WHERE e."startedAt" > NOW() - INTERVAL '7 days'
GROUP BY DATE(e."startedAt")
ORDER BY date DESC;
```
Workflow Performance:
```sql
-- Top 10 slowest workflows (average)
SELECT
  w.name,
  COUNT(e.id) as execution_count,
  ROUND(AVG(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt")))::numeric, 2) as avg_seconds,
  MAX(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt"))) as max_seconds
FROM execution_entity e
JOIN workflow_entity w ON e."workflowId" = w.id
WHERE e."startedAt" > NOW() - INTERVAL '24 hours'
  AND e.finished = true
GROUP BY w.id
ORDER BY avg_seconds DESC
LIMIT 10;
```
Stuck Executions Monitor:
```bash
#!/bin/bash
# check-stuck.sh

STUCK_COUNT=$(docker exec n8n-postgres psql -U n8n -d n8n -t -c "
SELECT COUNT(*)
FROM execution_entity
WHERE finished = false
  AND \"startedAt\" < NOW() - INTERVAL '1 hour';
")

if [ "$STUCK_COUNT" -gt 0 ]; then
  echo "⚠️ Found $STUCK_COUNT stuck executions"

  # Send alert
  curl -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"⚠️ n8n: $STUCK_COUNT stuck executions\"}" \
    https://hooks.slack.com/services/xxx/yyy/zzz
fi
```
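To run this check continuously, schedule it via cron the same way as the uptime monitor further down; the script path and interval are assumptions:

```bash
# Check for stuck executions every 15 minutes
*/15 * * * * /opt/scripts/check-stuck.sh >> /var/log/n8n-stuck.log 2>&1
```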
Log Monitoring
Docker Logs:
```bash
# View live logs
docker logs -f n8n

# Filter for errors
docker logs n8n 2>&1 | grep -i error

# Save logs to file
docker logs n8n > n8n-logs-$(date +%Y%m%d).txt 2>&1
```
Log Aggregation with Loki:
```yaml
# docker-compose.yml addition
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki_data:
```
Promtail Config:
```yaml
# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
```
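Once Promtail is shipping container logs, they can be queried in Grafana's Explore view or directly against Loki's HTTP API. A minimal LogQL check for recent n8n errors, assuming Loki is reachable on port 3100 as configured above:

```bash
# Fetch n8n log lines containing "error" from the last hour via Loki's query API
curl -sG http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={container="n8n"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000"
```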
Uptime Monitoring
External Uptime Check:
```bash
#!/bin/bash
# uptime-monitor.sh (run from external server)

N8N_URL="https://n8n.yourdomain.com/healthz"
WEBHOOK_URL="https://hooks.slack.com/services/xxx/yyy/zzz"

check_uptime() {
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$N8N_URL")

  if [ "$HTTP_CODE" != "200" ]; then
    curl -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"🔴 n8n is DOWN! Status: $HTTP_CODE\"}" \
      "$WEBHOOK_URL"
  fi
}

check_uptime
```
Cron Setup:
```bash
# Check every minute
* * * * * /opt/scripts/uptime-monitor.sh >> /var/log/uptime.log 2>&1
```
UptimeRobot/BetterStack Integration:
```text
1. Sign up for uptime service
2. Add monitor:
   - URL: https://n8n.yourdomain.com/healthz
   - Interval: 1-5 minutes
   - Alert contacts: email, Slack, SMS
3. Configure alert escalation
```
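Both services also offer APIs if you want monitors defined as code. As one hedged example, UptimeRobot's v2 API can create an HTTP monitor roughly like this; the API key is a placeholder and the parameter values should be verified against their current docs:

```bash
# Create an HTTP(s) monitor via the UptimeRobot v2 API (sketch)
curl -X POST https://api.uptimerobot.com/v2/newMonitor \
  -d "api_key=YOUR_API_KEY" \
  -d "format=json" \
  -d "type=1" \
  -d "friendly_name=n8n-health" \
  -d "url=https://n8n.yourdomain.com/healthz" \
  -d "interval=300"
```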
Performance Baseline
Establish Baselines:
```sql
-- Create baseline table
CREATE TABLE IF NOT EXISTS performance_baseline (
  metric_name VARCHAR(100),
  metric_date DATE,
  avg_value DECIMAL,
  p95_value DECIMAL,
  max_value DECIMAL
);

-- Populate daily baseline
INSERT INTO performance_baseline
SELECT
  'execution_duration',
  CURRENT_DATE - 1,
  AVG(EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))),
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))),
  MAX(EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt")))
FROM execution_entity
WHERE DATE("startedAt") = CURRENT_DATE - 1
  AND finished = true;
```
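To keep the baseline current, the INSERT above can run nightly with the same docker exec psql pattern used earlier; the script path and schedule are assumptions:

```bash
# Populate yesterday's baseline every night at 00:05
# Assumes the INSERT above is saved on the host as /opt/scripts/baseline.sql
5 0 * * * docker exec -i n8n-postgres psql -U n8n -d n8n < /opt/scripts/baseline.sql >> /var/log/n8n-baseline.log 2>&1
```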
Compare Against Baseline:
```sql
-- Alert if current p95 > baseline * 2
SELECT
  CASE
    WHEN current_p95 > baseline_p95 * 2 THEN 'ALERT'
    ELSE 'OK'
  END as status,
  current_p95,
  baseline_p95
FROM (
  SELECT
    PERCENTILE_CONT(0.95) WITHIN GROUP (
      ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))
    ) as current_p95
  FROM execution_entity
  WHERE "startedAt" > NOW() - INTERVAL '1 hour'
    AND finished = true
) current
CROSS JOIN (
  SELECT p95_value as baseline_p95
  FROM performance_baseline
  WHERE metric_name = 'execution_duration'
  ORDER BY metric_date DESC
  LIMIT 1
) baseline;
```
Practice Exercise
Monitoring Challenge
Build a monitoring stack:
- Enable Prometheus metrics in n8n
- Deploy Prometheus + Grafana stack
- Create dashboard with key metrics
- Set up alerts for errors & downtime
- Configure Slack/email notifications
- Create self-monitoring workflow in n8n
Complete observability setup! 📊
Monitoring Checklist
Production Checklist
Essential monitoring:
- Health check endpoint monitored
- Execution success/failure rates tracked
- Error alerting configured
- Database size monitored
- Disk space alerts set
- External uptime monitor active
- Log aggregation in place
- Performance baselines established
Key Takeaways
Remember
- 📈 Metrics first - Enable Prometheus metrics
- 🚨 Alert on impact - Not every error needs an alert
- 📊 Dashboards - Visualize trends over time
- 🔄 Self-monitoring - n8n can monitor itself
- 📝 Baselines - Know what's normal before alerting on anomalies
Up Next
Next up: Scaling - queue mode, horizontal scaling, and high-availability setup.
