Theory
50 minutes
Lesson 13/15

Monitoring & Alerting

Monitor your n8n instance: execution tracking, error alerts, health checks, and performance metrics

📊 Monitoring & Alerting

System Monitoring

A production n8n instance needs monitoring so you can catch issues before they impact the business. This lesson covers execution monitoring, alerting setup, and performance tracking.

Monitoring Overview

What to Monitor:

Text
N8N MONITORING STACK
────────────────────

APPLICATION LEVEL
├── Workflow executions (success/fail)
├── Execution duration
├── Queue depth (if using queue)
└── Active workflows count

INFRASTRUCTURE LEVEL
├── CPU usage
├── Memory usage
├── Disk space
└── Network I/O

DATABASE LEVEL
├── Connection count
├── Query performance
├── Database size
└── Replication lag

EXTERNAL SERVICES
├── API response times
├── Service availability
├── Rate limit usage
└── Authentication status

Built-in Monitoring

Execution List:

Text
Access: Executions menu (left sidebar)

Shows:
├── All executions
├── Running workflows
├── Execution status (success/error/waiting)
├── Start time & duration
├── Retry information
└── Filter by workflow/status/date
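
Query Executions via the API:

If you need the same execution data outside the UI (for an external monitor or a report), the n8n public REST API exposes an executions endpoint. A minimal sketch, assuming the public API is enabled; the URL and API key are placeholders:

Bash
# List the 10 most recent failed executions
curl -s "https://n8n.yourdomain.com/api/v1/executions?status=error&limit=10" \
  -H "X-N8N-API-KEY: your_api_key" | jq '.data[] | {id, workflowId, startedAt}'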

Enable Execution Saving:

Bash
# Save all executions
EXECUTIONS_DATA_SAVE_ON_ERROR=all
EXECUTIONS_DATA_SAVE_ON_SUCCESS=all

# Or save only errors
EXECUTIONS_DATA_SAVE_ON_SUCCESS=none

# Set retention
EXECUTIONS_DATA_MAX_AGE=168          # 7 days, in hours
EXECUTIONS_DATA_PRUNE=true
EXECUTIONS_DATA_PRUNE_MAX_COUNT=50000
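
These are environment variables, so they belong wherever your n8n process is configured. A sketch for a Docker Compose setup (service name n8n assumed):

yaml
services:
  n8n:
    image: n8nio/n8n:latest
    environment:
      - EXECUTIONS_DATA_SAVE_ON_ERROR=all
      - EXECUTIONS_DATA_SAVE_ON_SUCCESS=none
      - EXECUTIONS_DATA_MAX_AGE=168
      - EXECUTIONS_DATA_PRUNE=true
      - EXECUTIONS_DATA_PRUNE_MAX_COUNT=50000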

Health Check Endpoints

Basic Health Check:

Bash
# Check if n8n is responding
curl -I https://n8n.yourdomain.com/healthz

# Expected response
HTTP/1.1 200 OK

Detailed Health Check Script:

Bash
#!/bin/bash
# health-check.sh

N8N_URL="https://n8n.yourdomain.com"

# Check HTTP response
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$N8N_URL/healthz")

if [ "$HTTP_CODE" = "200" ]; then
  echo "✅ n8n is healthy"
  exit 0
else
  echo "❌ n8n health check failed: HTTP $HTTP_CODE"
  exit 1
fi

Database Health Check:

Bash
#!/bin/bash
# db-health.sh

# Check PostgreSQL connection
docker exec n8n-postgres pg_isready -U n8n -d n8n

if [ $? -eq 0 ]; then
  echo "✅ Database is ready"
else
  echo "❌ Database connection failed"
  exit 1
fi

# Check database size
docker exec n8n-postgres psql -U n8n -d n8n -c "
SELECT pg_size_pretty(pg_database_size('n8n')) as db_size;
"

Prometheus Monitoring

Enable Prometheus Metrics:

Bash
# Enable metrics endpoint
N8N_METRICS=true
N8N_METRICS_PREFIX=n8n_

# Metrics available at
# https://n8n.yourdomain.com/metrics
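
Verify the Endpoint:

After restarting n8n, confirm the endpoint actually exposes metrics before pointing Prometheus at it. A quick check, assuming n8n listens locally on the default port 5678:

Bash
# Should print n8n_* metric names and current values
curl -s http://localhost:5678/metrics | grep "^n8n_" | head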

Available Metrics:

Text
# Workflow metrics
n8n_workflow_executions_total{status="success|error"}
n8n_workflow_execution_duration_seconds
n8n_workflow_active_total

# Queue metrics (queue mode)
n8n_queue_depth
n8n_queue_job_processing_time_seconds

# System metrics
n8n_api_requests_total
n8n_webhook_requests_total

Prometheus Config:

yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'n8n'
    static_configs:
      - targets: ['n8n:5678']
    metrics_path: /metrics
    basic_auth:
      username: n8n
      password: your_password   # If auth enabled

Docker Compose with Prometheus:

yaml
version: '3.8'

services:
  n8n:
    image: n8nio/n8n:latest
    environment:
      - N8N_METRICS=true
      # ... other config

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  prometheus_data:
  grafana_data:
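
Infrastructure Metrics (optional):

The stack above only exposes application-level metrics. For the CPU, memory, and disk items from the monitoring overview, a common approach is to add node-exporter to the same stack; a sketch (service name and port are the defaults):

yaml
# Add to docker-compose.yml
services:
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

# Add to prometheus.yml scrape_configs:
#   - job_name: 'node'
#     static_configs:
#       - targets: ['node-exporter:9100']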

Grafana Dashboard

Import n8n Dashboard:

Text
1. Open Grafana
2. Dashboards → Import
3. Upload JSON or paste dashboard ID
4. Select Prometheus data source
5. Import
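
Provision the Data Source (optional):

If Grafana runs in the Compose stack above, you can provision the Prometheus data source from a file instead of adding it in the UI. A sketch, assuming the file is mounted under /etc/grafana/provisioning/datasources/ in the grafana container:

yaml
# provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true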

Custom Dashboard Panels:

JSON
{
  "panels": [
    {
      "title": "Workflow Executions",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(increase(n8n_workflow_executions_total[24h]))",
          "legendFormat": "Total Executions"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(rate(n8n_workflow_executions_total{status=\"error\"}[1h])) / sum(rate(n8n_workflow_executions_total[1h])) * 100",
          "legendFormat": "Error %"
        }
      ]
    },
    {
      "title": "Execution Duration",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(n8n_workflow_execution_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "p95 Duration"
        }
      ]
    }
  ]
}

Alerting Setup

Prometheus Alert Rules:

yaml
# Alert rules
groups:
  - name: n8n-alerts
    rules:
      - alert: N8NHighErrorRate
        expr: |
          sum(rate(n8n_workflow_executions_total{status="error"}[5m]))
            / sum(rate(n8n_workflow_executions_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in n8n"
          description: "Error rate is above 10% for 5 minutes"

      - alert: N8NDown
        expr: up{job="n8n"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "n8n instance is down"
          description: "n8n has been unreachable for 1 minute"

      - alert: N8NSlowExecutions
        expr: |
          histogram_quantile(0.95, sum(rate(n8n_workflow_execution_duration_seconds_bucket[5m])) by (le)) > 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow workflow executions"
          description: "95th percentile execution time > 60s"

Slack Alert Integration:

yaml
# alertmanager.yml
route:
  receiver: 'slack-notifications'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#n8n-alerts'
        username: 'AlertManager'
        icon_emoji: ':warning:'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
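
Wire Rules into Prometheus:

Prometheus only evaluates the alert rules and forwards them to Alertmanager if both are referenced in its config. A sketch of the extra prometheus.yml sections and Compose service, assuming the rules above are saved as alert-rules.yml next to prometheus.yml:

yaml
# Add to prometheus.yml
rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Add to docker-compose.yml (mount alert-rules.yml into the prometheus service too)
services:
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"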

Self-Monitoring Workflow

Create Monitoring Workflow in n8n:

JSON
{
  "name": "Self Monitoring",
  "nodes": [
    {
      "name": "Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": {
          "interval": [{ "field": "minutes", "minutesInterval": 5 }]
        }
      }
    },
    {
      "name": "Check Execution Stats",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "executeQuery",
        "query": "SELECT COUNT(*) as total, SUM(CASE WHEN finished = true AND \"stoppedAt\" IS NOT NULL THEN 1 ELSE 0 END) as success, SUM(CASE WHEN \"stoppedAt\" IS NULL AND \"startedAt\" < NOW() - INTERVAL '1 hour' THEN 1 ELSE 0 END) as stuck FROM execution_entity WHERE \"startedAt\" > NOW() - INTERVAL '1 hour'"
      }
    },
    {
      "name": "Check for Issues",
      "type": "n8n-nodes-base.if",
      "parameters": {
        "conditions": {
          "number": [
            {
              "value1": "={{ $json.stuck }}",
              "operation": "larger",
              "value2": 0
            }
          ]
        }
      }
    },
    {
      "name": "Send Alert",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "channel": "#n8n-alerts",
        "text": "⚠️ n8n Alert: {{ $json.stuck }} stuck executions detected!"
      }
    }
  ]
}

Database Monitoring

Execution Stats Query:

SQL
-- Execution statistics
SELECT
  DATE(e."startedAt") as date,
  COUNT(*) as total_executions,
  SUM(CASE WHEN e.finished THEN 1 ELSE 0 END) as successful,
  SUM(CASE WHEN NOT e.finished THEN 1 ELSE 0 END) as failed,
  ROUND(AVG(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt")))::numeric, 2) as avg_duration_sec
FROM execution_entity e
WHERE e."startedAt" > NOW() - INTERVAL '7 days'
GROUP BY DATE(e."startedAt")
ORDER BY date DESC;

Workflow Performance:

SQL
-- Top 10 slowest workflows (average)
SELECT
  w.name,
  COUNT(e.id) as execution_count,
  ROUND(AVG(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt")))::numeric, 2) as avg_seconds,
  MAX(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt"))) as max_seconds
FROM execution_entity e
JOIN workflow_entity w ON e."workflowId" = w.id
WHERE e."startedAt" > NOW() - INTERVAL '24 hours'
  AND e.finished = true
GROUP BY w.id
ORDER BY avg_seconds DESC
LIMIT 10;

Stuck Executions Monitor:

Bash
#!/bin/bash
# check-stuck.sh

# -t gives tuples only; tr strips the whitespace psql adds around the count
STUCK_COUNT=$(docker exec n8n-postgres psql -U n8n -d n8n -t -c "
SELECT COUNT(*)
FROM execution_entity
WHERE finished = false
  AND \"startedAt\" < NOW() - INTERVAL '1 hour';
" | tr -d '[:space:]')

if [ "$STUCK_COUNT" -gt 0 ]; then
  echo "⚠️ Found $STUCK_COUNT stuck executions"

  # Send alert
  curl -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"⚠️ n8n: $STUCK_COUNT stuck executions\"}" \
    https://hooks.slack.com/services/xxx/yyy/zzz
fi

Log Monitoring

Docker Logs:

Bash
# View live logs
docker logs -f n8n

# Filter for errors
docker logs n8n 2>&1 | grep -i error

# Save logs to file
docker logs n8n > n8n-logs-$(date +%Y%m%d).txt 2>&1
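
Log Rotation:

Docker's default json-file logs grow without limit, so cap them before they consume the disk space you are monitoring. A sketch of per-service log rotation in docker-compose.yml (the sizes are assumptions, tune them to your disk):

yaml
services:
  n8n:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate after 10 MB
        max-file: "5"     # keep 5 rotated files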

Log Aggregation with Loki:

yaml
# docker-compose.yml addition
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki_data:

Promtail Config:

yaml
# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
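
Query Logs in Grafana:

Once logs are flowing, add Loki as a Grafana data source (URL http://loki:3100) and query the n8n container's logs with LogQL, using the container label created by the relabel rule above. Two example queries:

Text
# All logs from the n8n container
{container="n8n"}

# Error lines per 5-minute window
count_over_time({container="n8n"} |= "error" [5m])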

Uptime Monitoring

External Uptime Check:

Bash
#!/bin/bash
# uptime-monitor.sh (run from external server)

N8N_URL="https://n8n.yourdomain.com/healthz"
WEBHOOK_URL="https://hooks.slack.com/services/xxx/yyy/zzz"

check_uptime() {
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$N8N_URL")

  if [ "$HTTP_CODE" != "200" ]; then
    curl -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"🔴 n8n is DOWN! Status: $HTTP_CODE\"}" \
      "$WEBHOOK_URL"
  fi
}

check_uptime

Cron Setup:

Bash
# Check every minute
* * * * * /opt/scripts/uptime-monitor.sh >> /var/log/uptime.log 2>&1

UptimeRobot/BetterStack Integration:

Text
1. Sign up for an uptime service
2. Add monitor:
   - URL: https://n8n.yourdomain.com/healthz
   - Interval: 1-5 minutes
   - Alert contacts: email, Slack, SMS
3. Configure alert escalation

Performance Baseline

Establish Baselines:

SQL
-- Create baseline table
CREATE TABLE IF NOT EXISTS performance_baseline (
  metric_name VARCHAR(100),
  metric_date DATE,
  avg_value DECIMAL,
  p95_value DECIMAL,
  max_value DECIMAL
);

-- Populate daily baseline
INSERT INTO performance_baseline
SELECT
  'execution_duration',
  CURRENT_DATE - 1,
  AVG(EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))),
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))),
  MAX(EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt")))
FROM execution_entity
WHERE DATE("startedAt") = CURRENT_DATE - 1
  AND finished = true;
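
Automate the Baseline:

The INSERT above needs to run once per day to stay useful. A minimal sketch of a wrapper script plus cron entry, assuming the query is saved as /opt/scripts/populate-baseline.sql (paths are placeholders):

Bash
#!/bin/bash
# populate-baseline.sh
# Cron entry (daily at 00:15):
#   15 0 * * * /opt/scripts/populate-baseline.sh >> /var/log/baseline.log 2>&1

# Pipe the baseline INSERT into the Postgres container
docker exec -i n8n-postgres psql -U n8n -d n8n < /opt/scripts/populate-baseline.sql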

Compare Against Baseline:

SQL
-- Alert if current p95 > baseline * 2
SELECT
  CASE
    WHEN current_p95 > baseline_p95 * 2 THEN 'ALERT'
    ELSE 'OK'
  END as status,
  current_p95,
  baseline_p95
FROM (
  SELECT
    PERCENTILE_CONT(0.95) WITHIN GROUP (
      ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))
    ) as current_p95
  FROM execution_entity
  WHERE "startedAt" > NOW() - INTERVAL '1 hour'
    AND finished = true
) current
CROSS JOIN (
  SELECT p95_value as baseline_p95
  FROM performance_baseline
  WHERE metric_name = 'execution_duration'
  ORDER BY metric_date DESC
  LIMIT 1
) baseline;

Practice Exercise

Monitoring Challenge

Build a monitoring stack:

  1. Enable Prometheus metrics in n8n
  2. Deploy Prometheus + Grafana stack
  3. Create dashboard with key metrics
  4. Set up alerts for errors & downtime
  5. Configure Slack/email notifications
  6. Create self-monitoring workflow in n8n

Complete observability setup! 📊

Monitoring Checklist

Production Checklist

Essential monitoring:

  • Health check endpoint monitored
  • Execution success/failure rates tracked
  • Error alerting configured
  • Database size monitored
  • Disk space alerts set
  • External uptime monitor active
  • Log aggregation in place
  • Performance baselines established

Key Takeaways

Remember
  • 📈 Metrics first - Enable Prometheus metrics
  • 🚨 Alert on impact - Not every error needs an alert
  • 📊 Dashboards - Visualize trends over time
  • 🔄 Self-monitoring - n8n can monitor itself
  • 📝 Baselines - Know what's normal before alerting on anomalies

Up Next

Next lesson: Scaling - queue mode, horizontal scaling, and high availability setup.