Document Numbering Operations Guide
title: 'Operations Guide: Document Numbering System'
version: 1.6.0
status: draft
owner: Operations Team
last_updated: 2025-12-02
related:
- specs/01-requirements/03.11-document-numbering.md
- specs/03-implementation/document-numbering.md
- specs/04-operations/monitoring-alerting.md
Overview
This document describes the operations procedures, monitoring, and troubleshooting for the Document Numbering system.
1. Performance Requirements
1.1. Response Time Targets
| Metric | Target | Measurement |
|---|---|---|
| 95th percentile | ≤ 2 seconds | From request to response |
| 99th percentile | ≤ 5 seconds | From request to response |
| Normal operation | ≤ 500 ms | No retries |
1.2. Throughput Targets
| Load Level | Target | Notes |
|---|---|---|
| Normal load | ≥ 50 req/s | Normal usage |
| Peak load | ≥ 100 req/s | Peak work periods |
| Burst capacity | ≥ 200 req/s | Short duration (< 1 min) |
1.3. Availability SLA
- Uptime: ≥ 99.5% (excluding planned maintenance)
- Maximum downtime: ≤ 3.6 hours/month (~ 7.2 minutes/day)
- Recovery Time Objective (RTO): ≤ 30 minutes
- Recovery Point Objective (RPO): ≤ 5 minutes
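The downtime budget follows directly from the uptime target. A minimal sketch of the arithmetic, assuming a 30-day month; the function name is illustrative and not part of the system:

```typescript
// Allowed downtime for a given uptime target over a period.
// Assumption: a "month" is 30 days; downtimeBudgetHours is an illustrative name.
function downtimeBudgetHours(uptimePercent: number, periodHours: number): number {
  return periodHours * (1 - uptimePercent / 100);
}

const perMonthHours = downtimeBudgetHours(99.5, 30 * 24);
const perDayMinutes = downtimeBudgetHours(99.5, 24) * 60;

console.log(perMonthHours.toFixed(1)); // 3.6
console.log(perDayMinutes.toFixed(1)); // 7.2
```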
2. Infrastructure Setup
2.1. Database Configuration
MariaDB Connection Pool
// ormconfig.ts
{
  type: 'mysql',
  host: process.env.DB_HOST,
  // Explicit radix; 3306 is an assumed fallback when DB_PORT is unset
  port: parseInt(process.env.DB_PORT ?? '3306', 10),
  username: process.env.DB_USERNAME,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_DATABASE,
  extra: {
    connectionLimit: 20,   // Pool size
    queueLimit: 0,         // Unlimited queue
    acquireTimeout: 10000, // 10s timeout
    retryAttempts: 3,
    retryDelay: 1000
  }
}
High Availability Setup
# docker-compose.yml
# Note: the MYSQL_REPLICATION_* variables follow the Bitnami image convention;
# confirm that the image in use honors them.
services:
  mariadb-master:
    image: mariadb:10.11
    environment:
      MYSQL_REPLICATION_MODE: master
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
    volumes:
      - mariadb-master-data:/var/lib/mysql
    networks:
      - backend
  mariadb-replica:
    image: mariadb:10.11
    environment:
      MYSQL_REPLICATION_MODE: slave
      MYSQL_MASTER_HOST: mariadb-master
      MYSQL_MASTER_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
    volumes:
      - mariadb-replica-data:/var/lib/mysql
    networks:
      - backend
2.2. Redis Configuration
Redis Sentinel for High Availability
# docker-compose.yml
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis-master-data:/data
    networks:
      - backend
  redis-replica:
    image: redis:7-alpine
    command: redis-server --replicaof redis-master 6379 --appendonly yes
    volumes:
      - redis-replica-data:/data
    networks:
      - backend
  # The quorum below is 2, so at least two Sentinel instances must agree
  # before a failover starts; a single Sentinel container cannot reach it.
  redis-sentinel:
    image: redis:7-alpine
    command: >
      redis-sentinel /etc/redis/sentinel.conf
      --sentinel monitor mymaster redis-master 6379 2
      --sentinel down-after-milliseconds mymaster 5000
      --sentinel failover-timeout mymaster 10000
    networks:
      - backend
Redis Connection Pool
// redis.config.ts
import IORedis from 'ioredis';

export const redisConfig = {
  host: process.env.REDIS_HOST || 'localhost',
  // Explicit radix; falls back to the default Redis port when REDIS_PORT is unset
  port: parseInt(process.env.REDIS_PORT ?? '6379', 10),
  password: process.env.REDIS_PASSWORD,
  maxRetriesPerRequest: 3,
  enableReadyCheck: true,
  lazyConnect: false,
  poolSize: 10,
  retryStrategy: (times: number) => {
    if (times > 3) {
      return null; // Stop retrying
    }
    return Math.min(times * 100, 3000); // Delay in ms before the next attempt
  },
};
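To make the reconnect behaviour concrete, the sketch below extracts the same `retryStrategy` logic into a standalone function and prints the delay for each attempt. It is only an illustration of the arithmetic; nothing here connects to Redis.

```typescript
// Same logic as retryStrategy in redis.config.ts, extracted for inspection.
function retryStrategy(times: number): number | null {
  if (times > 3) {
    return null; // Stop retrying
  }
  return Math.min(times * 100, 3000);
}

const delays = [1, 2, 3, 4].map((attempt) => retryStrategy(attempt));
console.log(delays); // [ 100, 200, 300, null ]
```

With a limit of three attempts the 3000 ms cap is never reached: the client waits 100, 200, and 300 ms, then gives up.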
2.3. Load Balancing
Nginx Configuration
# nginx.conf
upstream backend {
    least_conn;  # Least connections algorithm
    server backend-1:3000 max_fails=3 fail_timeout=30s weight=1;
    server backend-2:3000 max_fails=3 fail_timeout=30s weight=1;
    server backend-3:3000 max_fails=3 fail_timeout=30s weight=1;
    keepalive 32;
}

server {
    listen 80;
    server_name api.lcbp3.local;

    location /api/v1/document-numbering/ {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_next_upstream error timeout;
        proxy_connect_timeout 10s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
Docker Compose Scaling
# docker-compose.yml
services:
  backend:
    image: lcbp3-backend:latest
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    environment:
      NODE_ENV: production
      DB_POOL_SIZE: 20
    networks:
      - backend
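Each backend replica opens its own pool, so `DB_POOL_SIZE` multiplies with the replica count. A rough sketch of that budget check follows; it assumes MariaDB's `max_connections` is left at 151, which should be replaced with the server's real setting, and the function name is illustrative.

```typescript
// Worst-case number of DB connections opened by all backend replicas.
function totalPoolConnections(replicas: number, poolSizePerReplica: number): number {
  return replicas * poolSizePerReplica;
}

// Assumption: max_connections is 151; adjust to the server's actual value.
const assumedMaxConnections = 151;

for (const replicas of [3, 5, 8]) {
  const total = totalPoolConnections(replicas, 20);
  const fits = total <= assumedMaxConnections;
  console.log(`${replicas} replicas -> ${total} connections, fits: ${fits}`);
}
```

Under these assumptions 3 and 5 replicas stay within the limit (60 and 100 connections), while 8 replicas (160) would not.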
3. Monitoring & Metrics
3.1. Prometheus Metrics
Key Metrics to Collect
// metrics.service.ts
import { Counter, Histogram, Gauge } from 'prom-client';

// Lock acquisition metrics
export const lockAcquisitionDuration = new Histogram({
  name: 'docnum_lock_acquisition_duration_ms',
  help: 'Lock acquisition time in milliseconds',
  labelNames: ['project', 'type'],
  buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000],
});

export const lockAcquisitionFailures = new Counter({
  name: 'docnum_lock_acquisition_failures_total',
  help: 'Total number of lock acquisition failures',
  labelNames: ['project', 'type', 'reason'],
});

// Generation metrics
export const generationDuration = new Histogram({
  name: 'docnum_generation_duration_ms',
  help: 'Total document number generation time',
  labelNames: ['project', 'type', 'status'],
  buckets: [100, 200, 500, 1000, 2000, 5000],
});

export const retryCount = new Histogram({
  name: 'docnum_retry_count',
  help: 'Number of retries per generation',
  labelNames: ['project', 'type'],
  buckets: [0, 1, 2, 3, 5, 10],
});

// Connection health
export const redisConnectionStatus = new Gauge({
  name: 'docnum_redis_connection_status',
  help: 'Redis connection status (1=up, 0=down)',
});

export const dbConnectionPoolUsage = new Gauge({
  name: 'docnum_db_connection_pool_usage',
  help: 'Database connection pool usage percentage',
});
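One possible way to feed `generationDuration` is a small wrapper that times each call and records a `status` label. The sketch below is an assumption about usage, not the service's actual code: `timeGeneration`, `HistogramLike`, and the status values `success` and `error` are hypothetical, and the interface is a structural stand-in so the example runs without prom-client.

```typescript
// Structural stand-in for a prom-client Histogram (observe(labels, value)).
interface HistogramLike {
  observe(labels: Record<string, string>, value: number): void;
}

// Hypothetical helper: runs fn(), then records its duration in milliseconds
// together with a status label, whether fn() resolved or threw.
async function timeGeneration<T>(
  histogram: HistogramLike,
  labels: { project: string; type: string },
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  let status = 'success'; // Assumed label value
  try {
    return await fn();
  } catch (err) {
    status = 'error'; // Assumed label value
    throw err;
  } finally {
    histogram.observe({ ...labels, status }, Date.now() - start);
  }
}
```

Recording in `finally` keeps failed generations in the histogram, so the percentile queries are not biased toward successful calls only.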
3.2. Prometheus Alert Rules
# prometheus/alerts.yml
groups:
  - name: document_numbering_alerts
    interval: 30s
    rules:
      # CRITICAL: Redis unavailable
      - alert: RedisUnavailable
        expr: docnum_redis_connection_status == 0
        for: 1m
        labels:
          severity: critical
          component: document-numbering
        annotations:
          summary: "Redis is unavailable for document numbering"
          description: "System is falling back to DB-only locking. Performance degraded by 30-50%."
          runbook_url: "https://wiki.lcbp3/runbooks/redis-unavailable"
      # CRITICAL: High lock failure rate
      # Failures are divided by attempts so the threshold is a ratio, not failures per second.
      # Assumes docnum_lock_acquisition_total (queried in section 3.4) is exported.
      - alert: HighLockFailureRate
        expr: |
          sum(rate(docnum_lock_acquisition_failures_total[5m]))
            / sum(rate(docnum_lock_acquisition_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
          component: document-numbering
        annotations:
          summary: "Lock acquisition failure rate > 10%"
          description: "Check Redis and database performance immediately"
          runbook_url: "https://wiki.lcbp3/runbooks/high-lock-failure"
      # WARNING: Elevated lock failure rate (same ratio, lower threshold)
      - alert: ElevatedLockFailureRate
        expr: |
          sum(rate(docnum_lock_acquisition_failures_total[5m]))
            / sum(rate(docnum_lock_acquisition_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "Lock acquisition failure rate > 5%"
          description: "Monitor closely. May escalate to critical soon."
      # WARNING: Slow lock acquisition
      - alert: SlowLockAcquisition
        expr: |
          histogram_quantile(0.95,
            rate(docnum_lock_acquisition_duration_ms_bucket[5m])
          ) > 1000
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "P95 lock acquisition time > 1 second"
          description: "Lock acquisition is slower than expected. Check Redis latency."
      # WARNING: High retry count
      # increase() returns the total over the window; rate() would be per second.
      - alert: HighRetryCount
        expr: |
          sum by (project) (
            increase(docnum_retry_count_sum[1h])
          ) > 100
        for: 1h
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "Retry count > 100 per hour in project {{ $labels.project }}"
          description: "High contention detected. Consider scaling."
      # WARNING: Slow generation
      - alert: SlowDocumentNumberGeneration
        expr: |
          histogram_quantile(0.95,
            rate(docnum_generation_duration_ms_bucket[5m])
          ) > 2000
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "P95 generation time > 2 seconds"
          description: "Document number generation is slower than SLA target"
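The 5% and 10% thresholds in the lock failure alert summaries describe a ratio of failed to attempted acquisitions. A small sketch of that classification, assuming per-second failure and attempt rates are available; the type and function names are illustrative:

```typescript
type Severity = 'ok' | 'warning' | 'critical';

// Thresholds taken from the alert summaries: > 5% warning, > 10% critical.
function lockFailureSeverity(failuresPerSec: number, attemptsPerSec: number): Severity {
  if (attemptsPerSec <= 0) return 'ok'; // No traffic: ratio undefined, treated as ok (assumption)
  const ratio = failuresPerSec / attemptsPerSec;
  if (ratio > 0.1) return 'critical';
  if (ratio > 0.05) return 'warning';
  return 'ok';
}

console.log(lockFailureSeverity(0.2, 50)); // ok (0.4%)
console.log(lockFailureSeverity(3, 50));   // warning (6%)
console.log(lockFailureSeverity(6, 50));   // critical (12%)
```

The first case shows why the denominator matters: 0.2 failures per second exceeds a raw threshold of 0.1, yet at 50 req/s it is a failure ratio of only 0.4%.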
3.3. AlertManager Configuration
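# alertmanager/config.yml
# Alertmanager does not expand ${VAR} placeholders itself; substitute them
# (for example with envsubst) before the file is loaded.
global:
  resolve_timeout: 5m
  slack_api_url: ${SLACK_WEBHOOK_URL}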
route:
  group_by: ['alertname', 'severity', 'project']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'ops-team'
  routes:
    # CRITICAL alerts → PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
      continue: false
    # WARNING alerts → Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: ${PAGERDUTY_SERVICE_KEY}
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
  - name: 'slack-critical'
    slack_configs:
      - channel: '#lcbp3-critical-alerts'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Runbook:* {{ .CommonAnnotations.runbook_url }}
        color: 'danger'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#lcbp3-alerts'
        title: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
        color: 'warning'
  - name: 'ops-team'
    email_configs:
      - to: 'ops@example.com'
        subject: '[LCBP3] {{ .GroupLabels.alertname }}'
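The effect of `continue` in the routing tree can be easier to see in code. The sketch below is a simplified model of how child routes are evaluated, not Alertmanager's implementation: routes are checked in order, a match stops evaluation unless `continue` is set, and an alert matching no child route falls back to the parent receiver. The `continueMatching` field stands in for the YAML key `continue`.

```typescript
interface Route {
  severity: string;
  receiver: string;
  continueMatching: boolean; // Mirrors the `continue` key in the YAML
}

const routes: Route[] = [
  { severity: 'critical', receiver: 'pagerduty-critical', continueMatching: true },
  { severity: 'critical', receiver: 'slack-critical', continueMatching: false },
  { severity: 'warning', receiver: 'slack-warnings', continueMatching: false },
];

// Returns the receivers an alert with the given severity would be sent to.
function resolveReceivers(severity: string, defaultReceiver = 'ops-team'): string[] {
  const matched: string[] = [];
  for (const route of routes) {
    if (route.severity !== severity) continue;
    matched.push(route.receiver);
    if (!route.continueMatching) break;
  }
  return matched.length > 0 ? matched : [defaultReceiver];
}

console.log(resolveReceivers('critical')); // [ 'pagerduty-critical', 'slack-critical' ]
console.log(resolveReceivers('warning'));  // [ 'slack-warnings' ]
console.log(resolveReceivers('info'));     // [ 'ops-team' ]
```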
3.4. Grafana Dashboard
Key dashboard panels:
- Lock Acquisition Success Rate (Gauge)
  - Query: 1 - (rate(docnum_lock_acquisition_failures_total[5m]) / rate(docnum_lock_acquisition_total[5m]))
  - Alert threshold: < 95%
  - Note: docnum_lock_acquisition_total is not among the metrics defined in section 3.1 and must be exported for this query to return data.
- Lock Acquisition Time Percentiles (Graph)
  - P50: histogram_quantile(0.50, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))
  - P95: histogram_quantile(0.95, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))
  - P99: histogram_quantile(0.99, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))
- Generation Rate (Stat)
  - Query: sum(rate(docnum_generation_duration_ms_count[1m])) * 60
  - Unit: documents/minute
- Error Rate by Type (Graph)
  - Query: sum by (reason) (rate(docnum_lock_acquisition_failures_total[5m]))
- Redis Connection Status (Stat)
  - Query: docnum_redis_connection_status
  - Thresholds: 0 = red, 1 = green
- DB Connection Pool Usage (Gauge)
  - Query: docnum_db_connection_pool_usage
  - Alert threshold: > 80%
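The percentile panels depend on how `histogram_quantile` estimates a value from bucket counts. The sketch below approximates that estimate with linear interpolation inside the bucket that contains the target rank. It is a simplified model for intuition, and the sample counts are invented for illustration.

```typescript
interface Bucket {
  le: number;              // Upper bound of the bucket; the last one is +Infinity
  cumulativeCount: number; // Observations with value <= le
}

// Approximate quantile from cumulative buckets, interpolating linearly
// between the lower and upper bound of the bucket holding the target rank.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulativeCount;
  if (total === 0) return NaN;
  const rank = q * total;
  const idx = buckets.findIndex((b) => b.cumulativeCount >= rank);
  const bucket = buckets[idx];
  if (bucket.le === Infinity) {
    // Beyond the last finite bound: report that bound.
    return buckets[buckets.length - 2].le;
  }
  const lowerBound = idx === 0 ? 0 : buckets[idx - 1].le;
  const lowerCount = idx === 0 ? 0 : buckets[idx - 1].cumulativeCount;
  const fraction = (rank - lowerCount) / (bucket.cumulativeCount - lowerCount);
  return lowerBound + (bucket.le - lowerBound) * fraction;
}

// Invented sample: 1000 lock acquisitions spread over the buckets from section 3.1.
const sample: Bucket[] = [
  { le: 10, cumulativeCount: 100 },
  { le: 50, cumulativeCount: 400 },
  { le: 100, cumulativeCount: 700 },
  { le: 200, cumulativeCount: 900 },
  { le: 500, cumulativeCount: 960 },
  { le: 1000, cumulativeCount: 990 },
  { le: 2000, cumulativeCount: 998 },
  { le: 5000, cumulativeCount: 1000 },
  { le: Infinity, cumulativeCount: 1000 },
];

console.log(Math.round(histogramQuantile(0.95, sample))); // 450
```

Because the result is interpolated, its precision is limited by the bucket boundaries: here P95 falls somewhere in the 200 to 500 ms bucket and is reported as 450 ms.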
4. Troubleshooting Runbooks
4.1. Scenario: Redis Unavailable
Symptoms:
- Alert: RedisUnavailable
- System falls back to DB-only locking
- Performance degraded 30-50%
Action Steps:
- Check Redis status:
  docker exec lcbp3-redis redis-cli ping   # Expected: PONG
- Check Redis logs:
  docker logs lcbp3-redis --tail=100
- Restart Redis (if needed):
  docker restart lcbp3-redis
- Verify failover (if using Sentinel):
  docker exec lcbp3-redis-sentinel redis-cli -p 26379 SENTINEL masters
- Monitor recovery:
  - Check metric: docnum_redis_connection_status returns to 1
  - Check performance: P95 latency returns to normal (< 500ms)
4.2. Scenario: High Lock Failure Rate
Symptoms:
- Alert: HighLockFailureRate (> 10%)
- Users report "ระบบกำลังยุ่ง" ("System is busy") errors
Action Steps:
- Check concurrent load:
  # Check current request rate (the query is URL-encoded so the shell and curl do not misread the brackets)
  curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(docnum_generation_duration_ms_count[1m])'
- Check database connections:
  SHOW PROCESSLIST; -- Look for waiting/locked queries
- Check Redis memory:
  docker exec lcbp3-redis redis-cli INFO memory
- Scale up if needed:
  # Increase backend replicas
  docker-compose up -d --scale backend=5
- Check for deadlocks:
  SHOW ENGINE INNODB STATUS; -- Look for LATEST DETECTED DEADLOCK section
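When the request-rate check is scripted rather than typed into a shell, the PromQL expression needs to be URL-encoded, since it contains parentheses and brackets. A minimal sketch using the standard `URL` class; the helper name is illustrative:

```typescript
// Builds an instant-query URL for the Prometheus HTTP API with the
// PromQL expression safely encoded as the `query` parameter.
function buildPrometheusQueryUrl(baseUrl: string, promql: string): string {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', promql);
  return url.toString();
}

const requestRateUrl = buildPrometheusQueryUrl(
  'http://prometheus:9090',
  'rate(docnum_generation_duration_ms_count[1m])',
);
console.log(requestRateUrl);
```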
4.3. Scenario: Slow Performance
Symptoms:
- Alert: SlowDocumentNumberGeneration
- P95 > 2 seconds
Action Steps:
- Check database query performance:
  SELECT * FROM document_number_counters USE INDEX (idx_counter_lookup)
  WHERE project_id = 2
    AND correspondence_type_id = 6
    AND current_year = 2025;
  -- Check execution plan
  EXPLAIN SELECT ...;
- Check for missing indexes:
  SHOW INDEX FROM document_number_counters;
- Check Redis latency:
  docker exec lcbp3-redis redis-cli --latency
- Check network latency:
  ping mariadb-master
  ping redis-master
- Review slow query log:
  docker exec lcbp3-mariadb-master cat /var/log/mysql/slow.log
4.4. Scenario: Version Conflicts
Symptoms:
- High retry count
- Users report "เลขที่เอกสารถูกเปลี่ยน" ("Document number was changed") errors
Action Steps:
- Check concurrent requests to same counter:
  SELECT project_id, correspondence_type_id, COUNT(*) AS concurrent_requests
  FROM document_number_audit
  WHERE created_at > NOW() - INTERVAL 5 MINUTE
  GROUP BY project_id, correspondence_type_id
  HAVING COUNT(*) > 10
  ORDER BY concurrent_requests DESC;
- Investigate specific counter:
  SELECT * FROM document_number_counters
  WHERE project_id = X AND correspondence_type_id = Y;
  -- Check audit trail
  SELECT * FROM document_number_audit
  WHERE counter_key LIKE '%project_id:X%'
  ORDER BY created_at DESC
  LIMIT 20;
- Check for application bugs:
  - Review error logs for stack traces
  - Check if retry logic is working correctly
- Temporary mitigation:
  - Increase retry count in application config
  - Consider manual counter adjustment (last resort)
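When reviewing whether the retry logic behaves correctly, it can help to compare it against a reference shape. The sketch below shows one common pattern, retrying only on version conflicts with capped exponential backoff. It is not the application's actual implementation: the `VERSION_CONFLICT` error code, the function names, and the delay values are all assumptions.

```typescript
// Hypothetical check: treats errors carrying code 'VERSION_CONFLICT' as retryable.
function isVersionConflict(err: unknown): boolean {
  return typeof err === 'object' && err !== null
    && (err as { code?: string }).code === 'VERSION_CONFLICT';
}

// Exponential backoff with a cap; base and cap values are illustrative.
function backoffDelayMs(attempt: number, baseMs = 50, capMs = 1000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Runs op(), retrying up to maxRetries times on version conflicts only.
async function withConflictRetry<T>(op: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (!isVersionConflict(err) || attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Raising `maxRetries` trades latency for success rate: with these illustrative values, three retries add at most 50 + 100 + 200 ms of waiting before the final attempt.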
5. Maintenance Procedures
5.1. Counter Reset (Manual)
Requires: SUPER_ADMIN role + 2-person approval
Steps:
- Request approval via API:
  POST /api/v1/document-numbering/configs/{configId}/reset-counter
  {
    "reason": "A clear reason, at least 20 characters",
    "approver_1": "user_id",
    "approver_2": "user_id"
  }
- Verify in audit log:
  SELECT * FROM document_number_config_history
  WHERE config_id = X
  ORDER BY changed_at DESC
  LIMIT 1;
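A client-side check before calling the reset endpoint can catch malformed requests early. The sketch below validates the payload fields shown above: the 20-character minimum comes from the request example, while the rule that the two approvers must differ is an assumption drawn from the 2-person approval requirement. The function name is illustrative.

```typescript
interface ResetCounterRequest {
  reason: string;
  approver_1: string;
  approver_2: string;
}

// Returns a list of validation errors; an empty list means the payload looks valid.
function validateResetRequest(req: ResetCounterRequest): string[] {
  const errors: string[] = [];
  if (req.reason.trim().length < 20) {
    errors.push('reason must be at least 20 characters');
  }
  if (!req.approver_1 || !req.approver_2) {
    errors.push('two approvers are required');
  } else if (req.approver_1 === req.approver_2) {
    // Assumption: 2-person approval means two different users.
    errors.push('approvers must be two different users');
  }
  return errors;
}

console.log(validateResetRequest({ reason: 'too short', approver_1: 'u1', approver_2: 'u1' }));
```

The server remains the authority on these rules; a check like this only avoids an unnecessary round trip.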
5.2. Template Update
Best Practices:
- Always test template in staging first
- Preview generated numbers before applying
- Document reason for change
- Template changes do NOT affect existing documents
API Call:
PUT /api/v1/document-numbering/configs/{configId}
{
  "template": "{ORIGINATOR}-{RECIPIENT}-{SEQ:4}-{YEAR:B.E.}",
  "change_reason": "Reason for the change"
}
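To preview what a template would produce before applying it, the tokens can be expanded locally. The sketch below is an illustrative renderer, not the system's own: it assumes `{SEQ:4}` means a sequence number zero-padded to 4 digits and `{YEAR:B.E.}` means the Buddhist Era year (Gregorian year + 543). The originator and recipient codes are made up.

```typescript
interface TemplateContext {
  originator: string;
  recipient: string;
  seq: number;
  yearCE: number; // Gregorian (Common Era) year
}

// Expands {NAME} and {NAME:ARG} tokens; unknown tokens are left untouched.
function previewDocumentNumber(template: string, ctx: TemplateContext): string {
  return template.replace(
    /\{([A-Z]+)(?::([^}]+))?\}/g,
    (token: string, name: string, arg?: string) => {
      switch (name) {
        case 'ORIGINATOR': return ctx.originator;
        case 'RECIPIENT': return ctx.recipient;
        case 'SEQ': return String(ctx.seq).padStart(Number(arg ?? '1'), '0');
        case 'YEAR': return String(arg === 'B.E.' ? ctx.yearCE + 543 : ctx.yearCE);
        default: return token;
      }
    },
  );
}

const preview = previewDocumentNumber(
  '{ORIGINATOR}-{RECIPIENT}-{SEQ:4}-{YEAR:B.E.}',
  { originator: 'AAA', recipient: 'BBB', seq: 7, yearCE: 2025 },
);
console.log(preview); // AAA-BBB-0007-2568
```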
5.3. Database Maintenance
Weekly Tasks:
- Check slow query log
- Optimize tables if needed:
  OPTIMIZE TABLE document_number_counters;
  OPTIMIZE TABLE document_number_audit;
Monthly Tasks:
- Review and archive old audit logs (> 2 years)
- Check index usage:
SELECT * FROM sys.schema_unused_indexes WHERE object_schema = 'lcbp3_db';
6. Backup & Recovery
6.1. Backup Strategy
Database:
- Full backup: Daily at 02:00 AM
- Incremental backup: Every 4 hours
- Retention: 30 days
Redis:
- AOF (Append-Only File) enabled
- Snapshot every 1 hour
- Retention: 7 days