📊 Monitoring & Alerting
A production n8n instance needs monitoring to catch issues before they impact the business. This post covers execution monitoring, alerting setup, and performance tracking.
Monitoring Overview
What to Monitor:
```text
N8N MONITORING STACK
────────────────────

APPLICATION LEVEL
├── Workflow executions (success/fail)
├── Execution duration
├── Queue depth (if using queue)
└── Active workflows count

INFRASTRUCTURE LEVEL
├── CPU usage
├── Memory usage
├── Disk space
└── Network I/O

DATABASE LEVEL
├── Connection count
├── Query performance
├── Database size
└── Replication lag

EXTERNAL SERVICES
├── API response times
├── Service availability
├── Rate limit usage
└── Authentication status
```
Built-in Monitoring
Execution List:
```text
Access: Executions menu (left sidebar)

Shows:
├── All executions
├── Running workflows
├── Execution status (success/error/waiting)
├── Start time & duration
├── Retry information
└── Filter by workflow/status/date
```
Enable Execution Saving:
```bash
# Save all executions
EXECUTIONS_DATA_SAVE_ON_ERROR=all
EXECUTIONS_DATA_SAVE_ON_SUCCESS=all

# Or save only errors
EXECUTIONS_DATA_SAVE_ON_SUCCESS=none

# Set retention
EXECUTIONS_DATA_MAX_AGE=168  # 7 days in hours
EXECUTIONS_DATA_PRUNE=true
EXECUTIONS_DATA_PRUNE_MAX_COUNT=50000
```
Health Check Endpoints
Basic Health Check:
```bash
# Check if n8n is responding
curl -I https://n8n.yourdomain.com/healthz

# Expected response
HTTP/1.1 200 OK
```
Detailed Health Check Script:
```bash
#!/bin/bash
# health-check.sh

N8N_URL="https://n8n.yourdomain.com"

# Check HTTP response
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$N8N_URL/healthz")

if [ "$HTTP_CODE" = "200" ]; then
  echo "✅ n8n is healthy"
  exit 0
else
  echo "❌ n8n health check failed: HTTP $HTTP_CODE"
  exit 1
fi
```
Database Health Check:
```bash
#!/bin/bash
# db-health.sh

# Check PostgreSQL connection
docker exec n8n-postgres pg_isready -U n8n -d n8n

if [ $? -eq 0 ]; then
  echo "✅ Database is ready"
else
  echo "❌ Database connection failed"
  exit 1
fi

# Check database size
docker exec n8n-postgres psql -U n8n -d n8n -c "
SELECT pg_size_pretty(pg_database_size('n8n')) as db_size;
"
```
Prometheus Monitoring
Enable Prometheus Metrics:
```bash
# Enable metrics endpoint
N8N_METRICS=true
N8N_METRICS_PREFIX=n8n_

# Metrics available at
# https://n8n.yourdomain.com/metrics
```
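Before pointing Prometheus at the endpoint, it's worth confirming it actually responds after the restart. A quick check, assuming the same domain used in the earlier examples:

```bash
# List a few n8n metrics to confirm the endpoint is exposed
curl -s https://n8n.yourdomain.com/metrics | grep "^n8n_" | head -n 20
```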
Available Metrics:
```text
# Workflow metrics
n8n_workflow_executions_total{status="success|error"}
n8n_workflow_execution_duration_seconds
n8n_workflow_active_total

# Queue metrics (queue mode)
n8n_queue_depth
n8n_queue_job_processing_time_seconds

# System metrics
n8n_api_requests_total
n8n_webhook_requests_total
```
Prometheus Config:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'n8n'
    static_configs:
      - targets: ['n8n:5678']
    metrics_path: /metrics
    basic_auth:
      username: n8n
      password: your_password  # If auth enabled
```
Docker Compose with Prometheus:
```yaml
version: '3.8'

services:
  n8n:
    image: n8nio/n8n:latest
    environment:
      - N8N_METRICS=true
      # ... other config

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  prometheus_data:
  grafana_data:
```
Grafana Dashboard
Import n8n Dashboard:
```text
1. Open Grafana
2. Dashboards → Import
3. Upload JSON or paste dashboard ID
4. Select Prometheus data source
5. Import
```
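If you prefer provisioning dashboards from the command line instead of clicking through the UI, Grafana's HTTP API accepts the same dashboard JSON. A rough sketch, assuming Grafana is reachable on port 3000 as in the Compose file above; `GRAFANA_TOKEN` and `n8n-dashboard.json` are placeholders:

```bash
# Import a dashboard via Grafana's HTTP API
# $GRAFANA_TOKEN is a service account / API token you create in Grafana
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"dashboard\": $(cat n8n-dashboard.json), \"overwrite\": true}"
```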
Custom Dashboard Panels:
```json
{
  "panels": [
    {
      "title": "Workflow Executions",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(increase(n8n_workflow_executions_total[24h]))",
          "legendFormat": "Total Executions"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(rate(n8n_workflow_executions_total{status=\"error\"}[1h])) / sum(rate(n8n_workflow_executions_total[1h])) * 100",
          "legendFormat": "Error %"
        }
      ]
    },
    {
      "title": "Execution Duration",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(n8n_workflow_execution_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "p95 Duration"
        }
      ]
    }
  ]
}
```
Alerting Setup
Prometheus Alert Rules:
```yaml
# Alert rules
groups:
  - name: n8n-alerts
    rules:
      - alert: N8NHighErrorRate
        expr: |
          sum(rate(n8n_workflow_executions_total{status="error"}[5m]))
          / sum(rate(n8n_workflow_executions_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in n8n"
          description: "Error rate is above 10% for 5 minutes"

      - alert: N8NDown
        expr: up{job="n8n"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "n8n instance is down"
          description: "n8n has been unreachable for 1 minute"

      - alert: N8NSlowExecutions
        expr: |
          histogram_quantile(0.95, sum(rate(n8n_workflow_execution_duration_seconds_bucket[5m])) by (le)) > 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow workflow executions"
          description: "95th percentile execution time > 60s"
```
Slack Alert Integration:
```yaml
# alertmanager.yml
route:
  receiver: 'slack-notifications'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#n8n-alerts'
        username: 'AlertManager'
        icon_emoji: ':warning:'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
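For these alerts to actually reach Slack, Prometheus has to load the rule file and know where Alertmanager is listening, and an `alertmanager` service (image `prom/alertmanager`, default port 9093) has to be added to the Compose stack with the `alertmanager.yml` above mounted as its configuration. A minimal sketch of the `prometheus.yml` additions; the rule file name is an assumption:

```yaml
# prometheus.yml additions (sketch)
rule_files:
  - /etc/prometheus/alert-rules.yml   # the alert rules shown above, mounted into the container

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # Alertmanager service on the same Compose network
```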
Self-Monitoring Workflow
Create Monitoring Workflow in n8n:
```json
{
  "name": "Self Monitoring",
  "nodes": [
    {
      "name": "Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": {
          "interval": [{ "field": "minutes", "minutesInterval": 5 }]
        }
      }
    },
    {
      "name": "Check Execution Stats",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "executeQuery",
        "query": "SELECT COUNT(*) as total, SUM(CASE WHEN finished = true AND \"stoppedAt\" IS NOT NULL THEN 1 ELSE 0 END) as success, SUM(CASE WHEN \"stoppedAt\" IS NULL AND \"startedAt\" < NOW() - INTERVAL '1 hour' THEN 1 ELSE 0 END) as stuck FROM execution_entity WHERE \"startedAt\" > NOW() - INTERVAL '1 hour'"
      }
    },
    {
      "name": "Check for Issues",
      "type": "n8n-nodes-base.if",
      "parameters": {
        "conditions": {
          "number": [
            {
              "value1": "={{ $json.stuck }}",
              "operation": "larger",
              "value2": 0
            }
          ]
        }
      }
    },
    {
      "name": "Send Alert",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "channel": "#n8n-alerts",
        "text": "⚠️ n8n Alert: {{ $json.stuck }} stuck executions detected!"
      }
    }
  ]
}
```
Database Monitoring
Execution Stats Query:
```sql
-- Execution statistics
SELECT
  DATE(e."startedAt") as date,
  COUNT(*) as total_executions,
  SUM(CASE WHEN e.finished THEN 1 ELSE 0 END) as successful,
  SUM(CASE WHEN NOT e.finished THEN 1 ELSE 0 END) as failed,
  ROUND(AVG(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt")))::numeric, 2) as avg_duration_sec
FROM execution_entity e
WHERE e."startedAt" > NOW() - INTERVAL '7 days'
GROUP BY DATE(e."startedAt")
ORDER BY date DESC;
```
Workflow Performance:
```sql
-- Top 10 slowest workflows (average)
SELECT
  w.name,
  COUNT(e.id) as execution_count,
  ROUND(AVG(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt")))::numeric, 2) as avg_seconds,
  MAX(EXTRACT(EPOCH FROM (e."stoppedAt" - e."startedAt"))) as max_seconds
FROM execution_entity e
JOIN workflow_entity w ON e."workflowId" = w.id
WHERE e."startedAt" > NOW() - INTERVAL '24 hours'
  AND e.finished = true
GROUP BY w.id
ORDER BY avg_seconds DESC
LIMIT 10;
```
Stuck Executions Monitor:
```bash
#!/bin/bash
# check-stuck.sh

STUCK_COUNT=$(docker exec n8n-postgres psql -U n8n -d n8n -t -c "
SELECT COUNT(*)
FROM execution_entity
WHERE finished = false
  AND \"startedAt\" < NOW() - INTERVAL '1 hour';
")

if [ "$STUCK_COUNT" -gt 0 ]; then
  echo "⚠️ Found $STUCK_COUNT stuck executions"

  # Send alert
  curl -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"⚠️ n8n: $STUCK_COUNT stuck executions\"}" \
    https://hooks.slack.com/services/xxx/yyy/zzz
fi
```
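To run this check continuously, schedule it via cron the same way as the uptime monitor further down; the script path and interval are assumptions:

```bash
# Check for stuck executions every 15 minutes
*/15 * * * * /opt/scripts/check-stuck.sh >> /var/log/n8n-stuck.log 2>&1
```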
Log Monitoring
Docker Logs:
```bash
# View live logs
docker logs -f n8n

# Filter for errors
docker logs n8n 2>&1 | grep -i error

# Save logs to file
docker logs n8n > n8n-logs-$(date +%Y%m%d).txt 2>&1
```
Log Aggregation with Loki:
```yaml
# docker-compose.yml addition
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki_data:
```
Promtail Config:
```yaml
# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
```
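Once Promtail is shipping container logs, they can be queried in Grafana's Explore view or directly against Loki's HTTP API. A minimal LogQL check for recent n8n errors, assuming Loki is reachable on port 3100 as configured above:

```bash
# Fetch n8n log lines containing "error" from the last hour via Loki's query API
curl -sG http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={container="n8n"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000"
```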
Uptime Monitoring
External Uptime Check:
```bash
#!/bin/bash
# uptime-monitor.sh (run from external server)

N8N_URL="https://n8n.yourdomain.com/healthz"
WEBHOOK_URL="https://hooks.slack.com/services/xxx/yyy/zzz"

check_uptime() {
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$N8N_URL")

  if [ "$HTTP_CODE" != "200" ]; then
    curl -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"🔴 n8n is DOWN! Status: $HTTP_CODE\"}" \
      "$WEBHOOK_URL"
  fi
}

check_uptime
```
Cron Setup:
```bash
# Check every minute
* * * * * /opt/scripts/uptime-monitor.sh >> /var/log/uptime.log 2>&1
```
UptimeRobot/BetterStack Integration:
```text
1. Sign up for uptime service
2. Add monitor:
   - URL: https://n8n.yourdomain.com/healthz
   - Interval: 1-5 minutes
   - Alert contacts: email, Slack, SMS
3. Configure alert escalation
```
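Both services also offer APIs if you want monitors defined as code. As one hedged example, UptimeRobot's v2 API can create an HTTP monitor roughly like this; the API key is a placeholder and the parameter values should be verified against their current docs:

```bash
# Create an HTTP(s) monitor via the UptimeRobot v2 API (sketch)
curl -X POST https://api.uptimerobot.com/v2/newMonitor \
  -d "api_key=YOUR_API_KEY" \
  -d "format=json" \
  -d "type=1" \
  -d "friendly_name=n8n-health" \
  -d "url=https://n8n.yourdomain.com/healthz" \
  -d "interval=300"
```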
Performance Baseline
Establish Baselines:
```sql
-- Create baseline table
CREATE TABLE IF NOT EXISTS performance_baseline (
  metric_name VARCHAR(100),
  metric_date DATE,
  avg_value DECIMAL,
  p95_value DECIMAL,
  max_value DECIMAL
);

-- Populate daily baseline
INSERT INTO performance_baseline
SELECT
  'execution_duration',
  CURRENT_DATE - 1,
  AVG(EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))),
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))),
  MAX(EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt")))
FROM execution_entity
WHERE DATE("startedAt") = CURRENT_DATE - 1
  AND finished = true;
```
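To keep the baseline current, the INSERT above can run nightly with the same docker exec psql pattern used earlier; the script path and schedule are assumptions:

```bash
# Populate yesterday's baseline every night at 00:05
# Assumes the INSERT above is saved on the host as /opt/scripts/baseline.sql
5 0 * * * docker exec -i n8n-postgres psql -U n8n -d n8n < /opt/scripts/baseline.sql >> /var/log/n8n-baseline.log 2>&1
```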
Compare Against Baseline:
```sql
-- Alert if current p95 > baseline * 2
SELECT
  CASE
    WHEN current_p95 > baseline_p95 * 2 THEN 'ALERT'
    ELSE 'OK'
  END as status,
  current_p95,
  baseline_p95
FROM (
  SELECT
    PERCENTILE_CONT(0.95) WITHIN GROUP (
      ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))
    ) as current_p95
  FROM execution_entity
  WHERE "startedAt" > NOW() - INTERVAL '1 hour'
    AND finished = true
) current
CROSS JOIN (
  SELECT p95_value as baseline_p95
  FROM performance_baseline
  WHERE metric_name = 'execution_duration'
  ORDER BY metric_date DESC
  LIMIT 1
) baseline;
```
Practice Exercise
Monitoring Challenge
Build a monitoring stack:
- Enable Prometheus metrics in n8n
- Deploy Prometheus + Grafana stack
- Create dashboard with key metrics
- Set up alerts for errors & downtime
- Configure Slack/email notifications
- Create self-monitoring workflow in n8n
Complete observability setup! 📊
Monitoring Checklist
Production Checklist
Essential monitoring:
- Health check endpoint monitored
- Execution success/failure rates tracked
- Error alerting configured
- Database size monitored
- Disk space alerts set
- External uptime monitor active
- Log aggregation in place
- Performance baselines established
Key Takeaways
Remember
- 📈 Metrics first - Enable Prometheus metrics
- 🚨 Alert on impact - Not every error needs an alert
- 📊 Dashboards - Visualize trends over time
- 🔄 Self-monitoring - n8n can monitor itself
- 📝 Baselines - Know what's normal before alerting on anomalies
Up Next
Next up: Scaling - queue mode, horizontal scaling, and high-availability setup.
