49 KiB
04.3 Monitoring & Alerting
Project: LCBP3-DMS Version: 1.8.0 Status: Active Owner: Nattanin Peancharoen / DevOps Team Last Updated: 2026-02-23
📍 Monitoring Hub: ASUSTOR AS5403T 📍 App Server (Exporters): QNAP TS-473A
📖 Overview
This document combines the operational SLAs, Alerting Rules, and Health Checks with the technical deployment instructions for the monitoring stack (Prometheus, Grafana, Loki) across both servers.
Monitoring & Alerting
Project: LCBP3-DMS Version: 1.8.0 Last Updated: 2025-12-02
📋 Overview
This document describes monitoring setup, health checks, and alerting rules for LCBP3-DMS.
🎯 Monitoring Objectives
- Availability: System uptime > 99.5%
- Performance: API response time < 500ms (P95)
- Reliability: Error rate < 1%
- Capacity: Resource utilization < 80%
📊 Key Metrics
Application Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| API Response Time (P95) | < 500ms | > 1000ms |
| Error Rate | < 1% | > 5% |
| Request Rate | N/A | Sudden ±50% change |
| Active Users | N/A | - |
| Queue Length (BullMQ) | < 100 | > 500 |
Infrastructure Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| CPU Usage | < 70% | > 90% |
| Memory Usage | < 80% | > 95% |
| Disk Usage | < 80% | > 90% |
| Network I/O | N/A | Anomaly detection |
Database Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Query Time (P95) | < 100ms | > 500ms |
| Connection Pool Usage | < 80% | > 95% |
| Slow Queries | 0 | > 10/min |
| Replication Lag | 0s | > 30s |
🔍 Health Checks
Backend Health Endpoint
// File: backend/src/health/health.controller.ts
import { Controller, Get } from '@nestjs/common';
import {
HealthCheck,
HealthCheckService,
TypeOrmHealthIndicator,
DiskHealthIndicator,
} from '@nestjs/terminus';
@Controller('health')
export class HealthController {
constructor(
private health: HealthCheckService,
private db: TypeOrmHealthIndicator,
private disk: DiskHealthIndicator
) {}
@Get()
@HealthCheck()
check() {
return this.health.check([
// Database health
() => this.db.pingCheck('database'),
// Disk health
() =>
this.disk.checkStorage('storage', {
path: '/',
thresholdPercent: 0.9,
}),
// Redis health
async () => {
const redis = await this.redis.ping();
return { redis: { status: redis === 'PONG' ? 'up' : 'down' } };
},
]);
}
}
Health Check Response
{
"status": "ok",
"info": {
"database": {
"status": "up"
},
"storage": {
"status": "up",
"freePercent": 0.75
},
"redis": {
"status": "up"
}
},
"error": {},
"details": {
"database": {
"status": "up"
},
"storage": {
"status": "up",
"freePercent": 0.75
},
"redis": {
"status": "up"
}
}
}
🐳 Docker Container Monitoring
Health Check in docker-compose.yml
services:
backend:
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:3000/health']
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
mariadb:
healthcheck:
test: ['CMD', 'mysqladmin', 'ping', '-h', 'localhost']
interval: 30s
timeout: 10s
retries: 3
redis:
healthcheck:
test: ['CMD', 'redis-cli', 'ping']
interval: 30s
timeout: 10s
retries: 3
Monitor Container Status
#!/bin/bash
# File: /scripts/monitor-containers.sh
# Check all containers are healthy
CONTAINERS=("lcbp3-backend" "lcbp3-frontend" "lcbp3-mariadb" "lcbp3-redis")
for CONTAINER in "${CONTAINERS[@]}"; do
HEALTH=$(docker inspect --format='{{.State.Health.Status}}' $CONTAINER 2>/dev/null)
if [ "$HEALTH" != "healthy" ]; then
echo "ALERT: $CONTAINER is $HEALTH"
# Send alert (email, Slack, etc.)
fi
done
📈 Application Performance Monitoring (APM)
Log-Based Monitoring (MVP Phase)
// File: backend/src/common/interceptors/performance.interceptor.ts
import {
Injectable,
NestInterceptor,
ExecutionContext,
CallHandler,
} from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { logger } from 'src/config/logger.config';
@Injectable()
export class PerformanceInterceptor implements NestInterceptor {
intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
const request = context.switchToHttp().getRequest();
const start = Date.now();
return next.handle().pipe(
tap({
next: () => {
const duration = Date.now() - start;
logger.info('Request completed', {
method: request.method,
url: request.url,
statusCode: context.switchToHttp().getResponse().statusCode,
duration: `${duration}ms`,
userId: request.user?.user_id,
});
// Alert on slow requests
if (duration > 1000) {
logger.warn('Slow request detected', {
method: request.method,
url: request.url,
duration: `${duration}ms`,
});
}
},
error: (error) => {
const duration = Date.now() - start;
logger.error('Request failed', {
method: request.method,
url: request.url,
duration: `${duration}ms`,
error: error.message,
});
},
})
);
}
}
🚨 Alerting Rules
Critical Alerts (Immediate Action Required)
| Alert | Condition | Action |
|---|---|---|
| Service Down | Health check fails for 3 consecutive checks | Page on-call engineer |
| Database Down | Cannot connect to database | Page DBA + on-call engineer |
| Disk Full | Disk usage > 95% | Page operations team |
| High Error Rate | Error rate > 10% for 5 min | Page on-call engineer |
Warning Alerts (Review Within 1 Hour)
| Alert | Condition | Action |
|---|---|---|
| High CPU | CPU > 90% for 10 min | Notify operations team |
| High Memory | Memory > 95% for 10 min | Notify operations team |
| Slow Queries | > 50 slow queries/min | Notify DBA |
| Queue Backlog | BullMQ queue > 500 jobs | Notify backend team |
Info Alerts (Review During Business Hours)
| Alert | Condition | Action |
|---|---|---|
| Backup Failed | Daily backup job failed | Email operations team |
| SSL Expiring | SSL certificate expires in < 30 days | Email operations team |
| Disk Space Warning | Disk usage > 80% | Email operations team |
📧 Alert Notification Channels
Email Alerts
#!/bin/bash
# File: /scripts/send-alert-email.sh
TO="ops-team@example.com"
SUBJECT="$1"
MESSAGE="$2"
echo "$MESSAGE" | mail -s "[LCBP3-DMS] $SUBJECT" "$TO"
Slack Alerts
#!/bin/bash
# File: /scripts/send-alert-slack.sh
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
MESSAGE="$1"
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"🚨 LCBP3-DMS Alert: $MESSAGE\"}" \
"$WEBHOOK_URL"
📊 Monitoring Dashboard
Metrics to Display
System Overview:
- Service status (up/down)
- Overall system health score
- Active user count
- Request rate (req/s)
Performance:
- API response time (P50, P95, P99)
- Database query time
- Queue processing time
Resources:
- CPU usage %
- Memory usage %
- Disk usage %
- Network I/O
Business Metrics:
- Documents created today
- Workflows completed today
- Active correspondences
- Pending approvals
🔧 Log Aggregation
Centralized Logging with Docker
# Configure Docker logging driver
# File: /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3",
"labels": "service,environment"
}
}
View Aggregated Logs
# View all LCBP3 container logs
docker-compose logs -f --tail=100
# View specific service logs
docker logs lcbp3-backend -f --since=1h
# Search logs
docker logs lcbp3-backend 2>&1 | grep "ERROR"
# Export logs for analysis
docker logs lcbp3-backend > backend-logs.txt
📈 Performance Baseline
Establish Baselines
Run load tests to establish performance baselines:
# Install Apache Bench
apt-get install apache2-utils
# Test API endpoint
ab -n 1000 -c 10 \
-H "Authorization: Bearer <TOKEN>" \
https://lcbp3-dms.example.com/api/correspondences
# Results to record:
# - Requests per second
# - Mean response time
# - P95 response time
# - Error rate
Regular Performance Testing
- Weekly: Quick health check (100 requests)
- Monthly: Full load test (10,000 requests)
- Quarterly: Stress test (find breaking point)
✅ Monitoring Checklist
Daily
- Check service health dashboard
- Review error logs
- Verify backup completion
- Check disk space
Weekly
- Review performance metrics trends
- Analyze slow query log
- Check SSL certificate expiry
- Review security alerts
Monthly
- Capacity planning review
- Update monitoring thresholds
- Test alert notifications
- Review and tune performance
🔗 Related Documents
Version: 1.8.0 Last Review: 2025-12-01 Next Review: 2026-03-01
การติดตั้ง Monitoring Stack บน ASUSTOR
📝 คำอธิบายและข้อควรพิจารณา
⚠️ หมายเหตุ: Monitoring Stack ทั้งหมดติดตั้งบน ASUSTOR AS5403T ไม่ใช่ QNAP เพื่อแยก Application workload ออกจาก Infrastructure/Monitoring workload
Stack สำหรับ Monitoring ประกอบด้วย:
| Service | Port | Purpose | Host |
|---|---|---|---|
| Prometheus | 9090 | เก็บ Metrics และ Time-series data | ASUSTOR |
| Grafana | 3000 | Dashboard สำหรับแสดงผล Metrics | ASUSTOR |
| Node Exporter | 9100 | เก็บ Metrics ของ Host system | Both |
| cAdvisor | 8080 (ASUSTOR) / 8088 (QNAP) | เก็บ Metrics ของ Docker containers | Both |
| Uptime Kuma | 3001 | Service Availability Monitoring | ASUSTOR |
| Loki | 3100 | Log aggregation | ASUSTOR |
| Promtail | - | Log shipper (Sender) | ASUSTOR |
🏗️ Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ ASUSTOR AS5403T (Monitoring Hub) │
├─────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Prometheus │───▶│ Grafana │ │ Uptime Kuma │ │
│ │ :9090 │ │ :3000 │ │ :3001 │ │
│ └──────┬──────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ │ Scrape Metrics │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │node-exporter│ │ cAdvisor │ │ Promtail │ │
│ │ :9100 │ │ :8080 │ │ (Log Ship) │ │
│ │ (Local) │ │ (Local) │ │ (Local) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│ Remote Scrape
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ QNAP TS-473A (App Server) │
├─────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │node-exporter│ │ cAdvisor │ │ Backend │ │
│ │ :9100 │ │ :8080 │ │ /metrics │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
กำหนดสิทธิ (บน ASUSTOR)
# SSH เข้า ASUSTOR
ssh admin@192.168.10.9
# สร้าง Directory
mkdir -p /volume1/np-dms/monitoring/prometheus/data
mkdir -p /volume1/np-dms/monitoring/prometheus/config
mkdir -p /volume1/np-dms/monitoring/grafana/data
mkdir -p /volume1/np-dms/monitoring/uptime-kuma/data
mkdir -p /volume1/np-dms/monitoring/loki/data
mkdir -p /volume1/np-dms/monitoring/promtail/config
# กำหนดสิทธิ์ให้ตรงกับ User ID ใน Container
# Prometheus (UID 65534 - nobody)
chown -R 65534:65534 /volume1/np-dms/monitoring/prometheus
chmod -R 750 /volume1/np-dms/monitoring/prometheus
# Grafana (UID 472)
chown -R 472:472 /volume1/np-dms/monitoring/grafana/data
chmod -R 750 /volume1/np-dms/monitoring/grafana/data
# Uptime Kuma (UID 1000)
chown -R 1000:1000 /volume1/np-dms/monitoring/uptime-kuma/data
chmod -R 750 /volume1/np-dms/monitoring/uptime-kuma/data
# Loki (UID 10001)
chown -R 10001:10001 /volume1/np-dms/monitoring/loki/data
chmod -R 750 /volume1/np-dms/monitoring/loki/data
# Promtail (Runs as root to read docker logs - no specific chown needed for config dir if created by admin)
# But ensure config file is readable
chmod -R 755 /volume1/np-dms/monitoring/promtail/config
🔗 สร้าง Docker Network (ทำครั้งแรกครั้งเดียว)
⚠️ ต้องสร้าง network ก่อน deploy docker-compose ทุกตัว เพราะทุก service ใช้
lcbp3เป็น external network
สร้างผ่าน Portainer (แนะนำ)
- เปิด Portainer → เลือก Environment ของ ASUSTOR
- ไปที่ Networks → Add network
- กรอกข้อมูล:
- Name:
lcbp3 - Driver:
bridge
- Name:
- กด Create the network
สร้างผ่าน SSH
# SSH เข้า ASUSTOR
ssh admin@192.168.10.9
# สร้าง external network
docker network create lcbp3
# ตรวจสอบ
docker network ls | grep lcbp3
docker network inspect lcbp3
📖 QNAP ก็ต้องมี network ชื่อ
lcbp3เช่นกัน (สร้างผ่าน Container Station หรือ SSH) ดู README.md – Quick Reference สำหรับคำสั่งบน QNAP
Note: NPM Proxy Configuration (NPM รันบน QNAP → Forward ไป ASUSTOR)
⚠️ เนื่องจาก NPM อยู่บน QNAP แต่ Monitoring services อยู่บน ASUSTOR ต้องใช้ IP Address (
192.168.10.9) แทนชื่อ container (resolve ข้ามเครื่องไม่ได้)
| Domain Names | Scheme | Forward Hostname | Forward Port | Block Common Exploits | Websockets | Force SSL | HTTP/2 |
|---|---|---|---|---|---|---|---|
| grafana.np-dms.work | http |
192.168.10.9 |
3000 | [x] | [x] | [x] | [x] |
| prometheus.np-dms.work | http |
192.168.10.9 |
9090 | [x] | [ ] | [x] | [x] |
| uptime.np-dms.work | http |
192.168.10.9 |
3001 | [x] | [x] | [x] | [x] |
Docker Compose File (ASUSTOR)
# File: /volume1/np-dms/monitoring/docker-compose.yml
# DMS Container v1.8.0: Application name: lcbp3-monitoring
# Deploy on: ASUSTOR AS5403T
# Services: prometheus, grafana, node-exporter, cadvisor, uptime-kuma, loki, promtail
x-restart: &restart_policy
restart: unless-stopped
x-logging: &default_logging
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "5"
networks:
lcbp3:
external: true
services:
# ----------------------------------------------------------------
# 1. Prometheus (Metrics Collection & Storage)
# ----------------------------------------------------------------
prometheus:
<<: [*restart_policy, *default_logging]
image: prom/prometheus:v2.48.0
container_name: prometheus
stdin_open: true
tty: true
deploy:
resources:
limits:
cpus: "1.0"
memory: 1G
reservations:
cpus: "0.25"
memory: 256M
environment:
TZ: "Asia/Bangkok"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
ports:
- "9090:9090"
networks:
- lcbp3
volumes:
- "/volume1/np-dms/monitoring/prometheus/config:/etc/prometheus:ro"
- "/volume1/np-dms/monitoring/prometheus/data:/prometheus"
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
interval: 30s
timeout: 10s
retries: 3
# ----------------------------------------------------------------
# 2. Grafana (Dashboard & Visualization)
# ----------------------------------------------------------------
grafana:
<<: [*restart_policy, *default_logging]
image: grafana/grafana:10.2.2
container_name: grafana
stdin_open: true
tty: true
deploy:
resources:
limits:
cpus: "1.0"
memory: 512M
reservations:
cpus: "0.25"
memory: 128M
environment:
TZ: "Asia/Bangkok"
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: "Center#2025"
GF_SERVER_ROOT_URL: "https://grafana.np-dms.work"
GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-piechart-panel
ports:
- "3000:3000"
networks:
- lcbp3
volumes:
- "/volume1/np-dms/monitoring/grafana/data:/var/lib/grafana"
depends_on:
- prometheus
healthcheck:
test: ["CMD-SHELL", "wget --spider -q http://localhost:3000/api/health || exit 1"]
interval: 30s
timeout: 10s
retries: 3
# ----------------------------------------------------------------
# 3. Uptime Kuma (Service Availability Monitoring)
# ----------------------------------------------------------------
uptime-kuma:
<<: [*restart_policy, *default_logging]
image: louislam/uptime-kuma:1
container_name: uptime-kuma
deploy:
resources:
limits:
cpus: "0.5"
memory: 256M
environment:
TZ: "Asia/Bangkok"
ports:
- "3001:3001"
networks:
- lcbp3
volumes:
- "/volume1/np-dms/monitoring/uptime-kuma/data:/app/data"
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:3001/api/entry-page || exit 1"]
interval: 30s
timeout: 10s
retries: 3
# ----------------------------------------------------------------
# 4. Node Exporter (Host Metrics - ASUSTOR)
# ----------------------------------------------------------------
node-exporter:
<<: [*restart_policy, *default_logging]
image: prom/node-exporter:v1.7.0
container_name: node-exporter
deploy:
resources:
limits:
cpus: "0.5"
memory: 128M
environment:
TZ: "Asia/Bangkok"
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- "9100:9100"
networks:
- lcbp3
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:9100/metrics"]
interval: 30s
timeout: 10s
retries: 3
# ----------------------------------------------------------------
# 5. cAdvisor (Container Metrics - ASUSTOR)
# ----------------------------------------------------------------
cadvisor:
<<: [*restart_policy, *default_logging]
image: gcr.io/cadvisor/cadvisor:v0.47.2
container_name: cadvisor
privileged: true
devices:
- /dev/kmsg
deploy:
resources:
limits:
cpus: "0.5"
memory: 256M
environment:
TZ: "Asia/Bangkok"
ports:
- "8088:8088"
networks:
- lcbp3
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/healthz"]
interval: 30s
timeout: 10s
retries: 3
# ----------------------------------------------------------------
# 6. Loki (Log Aggregation)
# ----------------------------------------------------------------
loki:
<<: [*restart_policy, *default_logging]
image: grafana/loki:2.9.0
container_name: loki
deploy:
resources:
limits:
cpus: "0.5"
memory: 512M
environment:
TZ: "Asia/Bangkok"
command: -config.file=/etc/loki/local-config.yaml
ports:
- "3100:3100"
networks:
- lcbp3
volumes:
- "/volume1/np-dms/monitoring/loki/data:/loki"
healthcheck:
test: ["CMD", "wget", "--spider", "-q", "http://localhost:3100/ready"]
interval: 30s
timeout: 10s
retries: 3
# ----------------------------------------------------------------
# 7. Promtail (Log Shipper)
# ----------------------------------------------------------------
promtail:
<<: [*restart_policy, *default_logging]
image: grafana/promtail:2.9.0
container_name: promtail
user: "0:0"
deploy:
resources:
limits:
cpus: "0.5"
memory: 256M
environment:
TZ: "Asia/Bangkok"
command: -config.file=/etc/promtail/promtail-config.yml
networks:
- lcbp3
volumes:
- "/volume1/np-dms/monitoring/promtail/config:/etc/promtail:ro"
- "/var/run/docker.sock:/var/run/docker.sock:ro"
- "/var/lib/docker/containers:/var/lib/docker/containers:ro"
depends_on:
- loki
QNAP Node Exporter & cAdvisor
ติดตั้ง node-exporter และ cAdvisor บน QNAP เพื่อให้ Prometheus บน ASUSTOR scrape metrics ได้:
# File: /share/np-dms/monitoring/docker-compose.yml (QNAP)
# เฉพาะ exporters เท่านั้น - metrics ถูก scrape โดย Prometheus บน ASUSTOR
version: '3.8'
networks:
lcbp3:
external: true
services:
node-exporter:
image: prom/node-exporter:v1.7.0
container_name: node-exporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- "9100:9100"
networks:
- lcbp3
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.2
container_name: cadvisor
restart: unless-stopped
privileged: true
ports:
- "8088:8080"
networks:
- lcbp3
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /sys/fs/cgroup:/sys/fs/cgroup:ro
mysqld-exporter:
image: prom/mysqld-exporter:v0.15.0
container_name: mysqld-exporter
restart: unless-stopped
user: root
command:
- '--config.my-cnf=/etc/mysql/my.cnf'
ports:
- "9104:9104"
networks:
- lcbp3
volumes:
- "/share/np-dms/monitoring/mysqld-exporter/.my.cnf:/etc/mysql/my.cnf:ro"
Prometheus Configuration
สร้างไฟล์ /volume1/np-dms/monitoring/prometheus/config/prometheus.yml บน ASUSTOR:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# Prometheus self-monitoring (ASUSTOR)
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# ============================================
# ASUSTOR Metrics (Local)
# ============================================
# Host metrics from Node Exporter (ASUSTOR)
- job_name: 'asustor-node'
static_configs:
- targets: ['node-exporter:9100']
labels:
host: 'asustor'
# Container metrics from cAdvisor (ASUSTOR)
- job_name: 'asustor-cadvisor'
static_configs:
- targets: ['cadvisor:8080']
labels:
host: 'asustor'
# ============================================
# QNAP Metrics (Remote - 192.168.10.8)
# ============================================
# Host metrics from Node Exporter (QNAP)
- job_name: 'qnap-node'
static_configs:
- targets: ['192.168.10.8:9100']
labels:
host: 'qnap'
# Container metrics from cAdvisor (QNAP)
- job_name: 'qnap-cadvisor'
static_configs:
- targets: ['192.168.10.8:8088']
labels:
host: 'qnap'
# Backend NestJS application (QNAP)
- job_name: 'backend'
static_configs:
- targets: ['192.168.10.8:3000']
labels:
host: 'qnap'
metrics_path: '/metrics'
# MariaDB Exporter (QNAP)
- job_name: 'mariadb'
static_configs:
- targets: ['192.168.10.8:9104']
labels:
host: 'qnap'
Uptime Kuma Monitors
เมื่อ Uptime Kuma พร้อมใช้งาน ให้เพิ่ม monitors ต่อไปนี้:
| Monitor Name | Type | URL / Host | Interval |
|---|---|---|---|
| QNAP NPM | HTTP | https://npm.np-dms.work | 60s |
| Frontend | HTTP | https://lcbp3.np-dms.work | 60s |
| Backend API | HTTP | https://backend.np-dms.work/health | 60s |
| MariaDB | TCP | 192.168.10.8:3306 | 60s |
| Redis | TCP | 192.168.10.8:6379 | 60s |
| Elasticsearch | HTTP | http://192.168.10.8:9200 | 60s |
| Gitea | HTTP | https://git.np-dms.work | 60s |
| n8n | HTTP | https://n8n.np-dms.work | 60s |
| Grafana | HTTP | https://grafana.np-dms.work | 60s |
| QNAP Host | Ping | 192.168.10.8 | 60s |
| ASUSTOR Host | Ping | 192.168.10.9 | 60s |
Grafana Dashboards
Recommended Dashboards to Import
| Dashboard ID | Name | Purpose |
|---|---|---|
| 1860 | Node Exporter Full | Host system metrics |
| 14282 | cAdvisor exporter | Container metrics |
| 11074 | Node Exporter for Prometheus | Node overview |
| 893 | Docker and Container | Docker overview |
| 7362 | MySQL | MySQL view |
| 1214 | Redis | Redis view |
| 14204 | Elasticsearch | Elasticsearch view |
| 13106 | MySQL/MariaDB Overview | Detailed MySQL/MariaDB metrics |
Import Dashboard via Grafana UI
- Go to Dashboards → Import
- Enter Dashboard ID (e.g.,
1860) - Select Prometheus data source
- Click Import
🚀 Deploy lcbp3-monitoring บน ASUSTOR
📋 Prerequisites Checklist
| # | ขั้นตอน | Status |
|---|---|---|
| 1 | SSH เข้า ASUSTOR ได้ (ssh admin@192.168.10.9) |
✅ |
| 2 | Docker Network lcbp3 สร้างแล้ว (ดูหัวข้อ สร้าง Docker Network) |
✅ |
| 3 | สร้าง Directories และกำหนดสิทธิ์แล้ว (ดูหัวข้อ กำหนดสิทธิ) | ✅ |
| 4 | สร้าง prometheus.yml แล้ว (ดูหัวข้อ Prometheus Configuration) |
✅ |
| 5 | สร้าง promtail-config.yml แล้ว (ดูหัวข้อ Step 1.2) |
✅ |
Step 1: สร้าง prometheus.yml
# SSH เข้า ASUSTOR
ssh admin@192.168.10.9
# สร้างไฟล์ prometheus.yml
cat > /volume1/np-dms/monitoring/prometheus/config/prometheus.yml << 'EOF'
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'asustor-node'
static_configs:
- targets: ['node-exporter:9100']
labels:
host: 'asustor'
- job_name: 'asustor-cadvisor'
static_configs:
- targets: ['cadvisor:8080']
labels:
host: 'asustor'
- job_name: 'qnap-node'
static_configs:
- targets: ['192.168.10.8:9100']
labels:
host: 'qnap'
- job_name: 'qnap-cadvisor'
static_configs:
- targets: ['192.168.10.8:8088']
labels:
host: 'qnap'
- job_name: 'backend'
static_configs:
- targets: ['192.168.10.8:3000']
labels:
host: 'qnap'
metrics_path: '/metrics'
EOF
# ตรวจสอบ
cat /volume1/np-dms/monitoring/prometheus/config/prometheus.yml
Step 1.2: สร้าง promtail-config.yml
ต้องสร้าง Config ให้ Promtail อ่าน logs จาก Docker containers และส่งไป Loki:
# สร้างไฟล์ promtail-config.yml
cat > /volume1/np-dms/monitoring/promtail/config/promtail-config.yml << 'EOF'
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
relabel_configs:
- source_labels: ['__meta_docker_container_name']
regex: '/(.*)'
target_label: 'container'
- source_labels: ['__meta_docker_container_log_stream']
target_label: 'stream'
EOF
# ขั้นตอนการเตรียมระบบที่ QNAP (ก่อน Deploy Stack)
### 1. สร้าง Monitoring User ใน MariaDB
รันคำสั่ง SQL นี้ผ่าน **phpMyAdmin** หรือ `docker exec`:
```sql
CREATE USER 'exporter'@'%' IDENTIFIED BY 'Center2025' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT, SLAVE MONITOR ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
2. สร้างไฟล์คอนฟิก .my.cnf บน QNAP
เพื่อให้ mysqld-exporter อ่านรหัสผ่านที่มีตัวอักษรพิเศษได้ถูกต้อง:
- SSH เข้า QNAP (หรือใช้ File Station สร้าง Folder):
ssh admin@192.168.10.8 - สร้าง Directory สำหรับเก็บ Config:
mkdir -p /share/np-dms/monitoring/mysqld-exporter - สร้างไฟล์ .my.cnf:
cat > /share/np-dms/monitoring/mysqld-exporter/.my.cnf << 'EOF'
[client] user=exporter password=Center2025 host=mariadb EOF
4. **กำหนดสิทธิ์ไฟล์** (เพื่อให้ Container อ่านไฟล์ได้):
```bash
chmod 644 /share/np-dms/monitoring/mysqld-exporter/.my.cnf
ตรวจสอบ
cat /volume1/np-dms/monitoring/promtail/config/promtail-config.yml
---
### Step 2: Deploy ผ่าน Portainer (แนะนำ)
1. เปิด **Portainer** → เลือก Environment ของ **ASUSTOR**
2. ไปที่ **Stacks** → **Add stack**
3. กรอกข้อมูล:
- **Name:** `lcbp3-monitoring`
- **Build method:** เลือก **Web editor**
4. วาง (Paste) เนื้อหาจาก [Docker Compose File (ASUSTOR)](#docker-compose-file-asustor) ด้านบน
5. กด **Deploy the stack**
> ⚠️ **สำคัญ:** ตรวจสอบ Password ของ Grafana (`GF_SECURITY_ADMIN_PASSWORD`) ใน docker-compose ก่อน deploy
### Deploy ผ่าน SSH (วิธีสำรอง)
```bash
# SSH เข้า ASUSTOR
ssh admin@192.168.10.9
# คัดลอก docker-compose.yml ไปยัง path
# (วางไฟล์ที่ /volume1/np-dms/monitoring/docker-compose.yml)
# Deploy
cd /volume1/np-dms/monitoring
docker compose up -d
# ตรวจสอบ container status
docker compose ps
Step 3: Verify Services
# ตรวจสอบ containers ทั้งหมด
docker ps --filter "name=prometheus" --filter "name=grafana" \
--filter "name=uptime-kuma" --filter "name=node-exporter" \
--filter "name=cadvisor" --filter "name=loki" --filter "name=promtail"
| Service | วิธีตรวจสอบ | Expected Result |
|---|---|---|
| ✅ Prometheus | curl http://192.168.10.9:9090/-/healthy |
Prometheus Server is Healthy |
| ✅ Grafana | เปิด https://grafana.np-dms.work (หรือ http://192.168.10.9:3000) |
หน้า Login |
| ✅ Uptime Kuma | เปิด https://uptime.np-dms.work (หรือ http://192.168.10.9:3001) |
หน้า Setup |
| ✅ Node Exp. | curl http://192.168.10.9:9100/metrics | head |
Metrics output |
| ✅ cAdvisor | curl http://192.168.10.9:8080/healthz |
ok |
| ✅ Loki | curl http://192.168.10.9:3100/ready |
ready |
| ✅ Promtail | เช็ค Logs: docker logs promtail |
ไม่ควรมี Error + เห็น connection success |
Step 4: Deploy QNAP Exporters
ติดตั้ง node-exporter และ cAdvisor บน QNAP เพื่อให้ Prometheus scrape ข้ามเครื่องได้:
ผ่าน Container Station (QNAP)
- เปิด Container Station บน QNAP Web UI
- ไปที่ Applications → Create
- ตั้งชื่อ Application:
lcbp3-exporters - วาง (Paste) เนื้อหาจาก QNAP Node Exporter & cAdvisor
- กด Create
ตรวจสอบจาก ASUSTOR
# ตรวจว่า Prometheus scrape QNAP ได้
curl -s http://localhost:9090/api/v1/targets | grep -E '"qnap-(node|cadvisor)"'
# หรือเปิด Prometheus UI → Targets
# URL: http://192.168.10.9:9090/targets
# ดูว่า qnap-node, qnap-cadvisor เป็น State: UP
Step 5: ตั้งค่า Grafana & Uptime Kuma
Grafana — First Login
- เปิด
https://grafana.np-dms.work - Login:
admin/Center#2025(หรือ password ที่ตั้งไว้) - ไปที่ Connections → Data sources → Add data source
- เลือก Prometheus
- URL:
http://prometheus:9090 - กด Save & Test → ต้องขึ้น ✅
- URL:
- Import Dashboards (ดูรายละเอียดในหัวข้อ 6. Grafana Dashboards Setup)
Uptime Kuma — First Setup
- เปิด
https://uptime.np-dms.work - สร้าง Admin account
- เพิ่ม Monitors ตาม ตาราง Uptime Kuma Monitors
6. Grafana Dashboards Setup
เพื่อการ Monitor ที่สมบูรณ์ แนะนำให้ Import Dashboards ต่อไปนี้:
6.1 Host Monitoring (Node Exporter)
- Concept: ดู resource ของเครื่อง Host (CPU, RAM, Disk, Network)
- Dashboard ID:
1860(Node Exporter Full) - วิธี Import:
- ไปที่ Dashboards → New → Import
- ช่อง Import via grafana.com ใส่เลข
1860กด Load - เลือก Data source: Prometheus
- กด Import
6.2 Container Monitoring (cAdvisor)
- Concept: ดู resource ของแต่ละ Container (เชื่อม Logs ด้วย)
- Dashboard ID:
14282(Cadvisor exporter) - วิธี Import:
- ใส่เลข
14282กด Load - เลือก Data source: Prometheus
- กด Import
- ใส่เลข
6.3 Logs Monitoring (Loki Integration)
เพื่อให้ Dashboard ของ Container แสดง Logs จาก Loki ได้ด้วย:
- เปิด Dashboard Cadvisor exporter ที่เพิ่ง Import มา
- กดปุ่ม Add visualization (หรือ Edit dashboard)
- เลือก Data source: Loki
- ในช่อง Query ใส่:
{container="$name"}- (Note:
$nameมาจาก Variable ของ Dashboard 14282)
- (Note:
- ปรับ Visualization type เป็น Logs
- ตั้งชื่อ Panel ว่า "Container Logs"
- กด Apply และ Save Dashboard
ตอนนี้เราจะเห็นทั้ง กราฟการกินทรัพยากร และ Logs ของ Container นั้นๆ ในหน้าเดียวกันครับ
6.4 Integrated Dashboard (Recommended)
ผมได้เตรียม JSON file ที่รวม Metrics และ Logs ไว้ให้แล้วครับ:
- ไปที่ Dashboards → New → Import
- ลากไฟล์ หรือ Copy เนื้อหาจากไฟล์:
specs/08-infrastructure/grafana/dashboards/lcbp3-docker-monitoring.json - กด Load และ Import
7.3 Backup / Export Dashboards
เมื่อปรับแต่ง Dashboard จนพอใจแล้ว ควร Export เก็บเป็นไฟล์ JSON ไว้ backup หรือ version control:
- เปิด Dashboard ที่ต้องการ backup
- ไปที่ปุ่ม Share Dashboard (ไอคอน 🔗 หรือ Share มุมซ้ายบน)
- เลือกTab Export
- เปิดตัวเลือก Export for sharing externally (เพื่อให้ลบ hardcoded value)
- กด Save to file
- นำไฟล์ JSON มาเก็บไว้ที่ path:
specs/08-infrastructure/grafana/dashboards/
📝 หมายเหตุ: เอกสารนี้อ้างอิงจาก Architecture Document v1.8.0 - Monitoring Stack deploy บน ASUSTOR AS5403T
📈 Document Numbering Specific Monitoring
3. Monitoring & Metrics
3.1. Prometheus Metrics
Key Metrics to Collect
// metrics.service.ts
import { Counter, Histogram, Gauge } from 'prom-client';
// Lock acquisition metrics
export const lockAcquisitionDuration = new Histogram({
name: 'docnum_lock_acquisition_duration_ms',
help: 'Lock acquisition time in milliseconds',
labelNames: ['project', 'type'],
buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000],
});
export const lockAcquisitionFailures = new Counter({
name: 'docnum_lock_acquisition_failures_total',
help: 'Total number of lock acquisition failures',
labelNames: ['project', 'type', 'reason'],
});
// Generation metrics
export const generationDuration = new Histogram({
name: 'docnum_generation_duration_ms',
help: 'Total document number generation time',
labelNames: ['project', 'type', 'status'],
buckets: [100, 200, 500, 1000, 2000, 5000],
});
export const retryCount = new Histogram({
name: 'docnum_retry_count',
help: 'Number of retries per generation',
labelNames: ['project', 'type'],
buckets: [0, 1, 2, 3, 5, 10],
});
// Connection health
export const redisConnectionStatus = new Gauge({
name: 'docnum_redis_connection_status',
help: 'Redis connection status (1=up, 0=down)',
});
export const dbConnectionPoolUsage = new Gauge({
name: 'docnum_db_connection_pool_usage',
help: 'Database connection pool usage percentage',
});
3.2. Prometheus Alert Rules
# prometheus/alerts.yml
groups:
- name: document_numbering_alerts
interval: 30s
rules:
# CRITICAL: Redis unavailable
- alert: RedisUnavailable
expr: docnum_redis_connection_status == 0
for: 1m
labels:
severity: critical
component: document-numbering
annotations:
summary: "Redis is unavailable for document numbering"
description: "System is falling back to DB-only locking. Performance degraded by 30-50%."
runbook_url: "https://wiki.lcbp3/runbooks/redis-unavailable"
# CRITICAL: High lock failure rate
- alert: HighLockFailureRate
expr: |
rate(docnum_lock_acquisition_failures_total[5m]) > 0.1
for: 5m
labels:
severity: critical
component: document-numbering
annotations:
summary: "Lock acquisition failure rate > 10%"
description: "Check Redis and database performance immediately"
runbook_url: "https://wiki.lcbp3/runbooks/high-lock-failure"
# WARNING: Elevated lock failure rate
- alert: ElevatedLockFailureRate
expr: |
rate(docnum_lock_acquisition_failures_total[5m]) > 0.05
for: 5m
labels:
severity: warning
component: document-numbering
annotations:
summary: "Lock acquisition failure rate > 5%"
description: "Monitor closely. May escalate to critical soon."
# WARNING: Slow lock acquisition
- alert: SlowLockAcquisition
expr: |
histogram_quantile(0.95,
rate(docnum_lock_acquisition_duration_ms_bucket[5m])
) > 1000
for: 5m
labels:
severity: warning
component: document-numbering
annotations:
summary: "P95 lock acquisition time > 1 second"
description: "Lock acquisition is slower than expected. Check Redis latency."
# WARNING: High retry count
- alert: HighRetryCount
expr: |
sum by (project) (
rate(docnum_retry_count_sum[1h])
) > 100
for: 1h
labels:
severity: warning
component: document-numbering
annotations:
summary: "Retry count > 100 per hour in project {{ $labels.project }}"
description: "High contention detected. Consider scaling."
# WARNING: Slow generation
- alert: SlowDocumentNumberGeneration
expr: |
histogram_quantile(0.95,
rate(docnum_generation_duration_ms_bucket[5m])
) > 2000
for: 5m
labels:
severity: warning
component: document-numbering
annotations:
summary: "P95 generation time > 2 seconds"
description: "Document number generation is slower than SLA target"
3.3. AlertManager Configuration
# alertmanager/config.yml
global:
resolve_timeout: 5m
slack_api_url: ${SLACK_WEBHOOK_URL}
route:
group_by: ['alertname', 'severity', 'project']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'ops-team'
routes:
# CRITICAL alerts → PagerDuty + Slack
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
- match:
severity: critical
receiver: 'slack-critical'
continue: false
# WARNING alerts → Slack only
- match:
severity: warning
receiver: 'slack-warnings'
receivers:
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: ${PAGERDUTY_SERVICE_KEY}
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'
runbook: '{{ .CommonAnnotations.runbook_url }}'
- name: 'slack-critical'
slack_configs:
- channel: '#lcbp3-critical-alerts'
title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
text: |
*Summary:* {{ .CommonAnnotations.summary }}
*Description:* {{ .CommonAnnotations.description }}
*Runbook:* {{ .CommonAnnotations.runbook_url }}
color: 'danger'
- name: 'slack-warnings'
slack_configs:
- channel: '#lcbp3-alerts'
title: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
text: '{{ .CommonAnnotations.description }}'
color: 'warning'
- name: 'ops-team'
email_configs:
- to: 'ops@example.com'
subject: '[LCBP3] {{ .GroupLabels.alertname }}'
3.4. Grafana Dashboard
Dashboard panels ที่สำคัญ:
-
Lock Acquisition Success Rate (Gauge)
- Query:
1 - (rate(docnum_lock_acquisition_failures_total[5m]) / rate(docnum_lock_acquisition_total[5m])) - Alert threshold: < 95%
- Query:
-
Lock Acquisition Time Percentiles (Graph)
- P50:
histogram_quantile(0.50, rate(docnum_lock_acquisition_duration_ms_bucket[5m])) - P95:
histogram_quantile(0.95, rate(docnum_lock_acquisition_duration_ms_bucket[5m])) - P99:
histogram_quantile(0.99, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))
- P50:
-
Generation Rate (Stat)
- Query:
sum(rate(docnum_generation_duration_ms_count[1m])) * 60 - Unit: documents/minute
- Query:
-
Error Rate by Type (Graph)
- Query:
sum by (reason) (rate(docnum_lock_acquisition_failures_total[5m]))
- Query:
-
Redis Connection Status (Stat)
- Query:
docnum_redis_connection_status - Thresholds: 0 = red, 1 = green
- Query:
-
DB Connection Pool Usage (Gauge)
- Query:
docnum_db_connection_pool_usage - Alert threshold: > 80%
- Query: