---
title: 'Operations Guide: Document Numbering System'
version: 1.6.0
status: draft
owner: Operations Team
last_updated: 2025-12-02
related:
  - specs/01-requirements/03.11-document-numbering.md
  - specs/03-implementation/document-numbering.md
  - specs/04-operations/monitoring-alerting.md
---

# Document Numbering Operations Guide

## Overview

This document describes operations procedures, monitoring, and troubleshooting for the Document Numbering system.

## 1. Performance Requirements

### 1.1. Response Time Targets

| Metric           | Target   | Measurement              |
| ---------------- | -------- | ------------------------ |
| 95th percentile  | ≤ 2 s    | From request to response |
| 99th percentile  | ≤ 5 s    | From request to response |
| Normal operation | ≤ 500 ms | Without retries          |

### 1.2. Throughput Targets

| Load Level     | Target      | Notes                    |
| -------------- | ----------- | ------------------------ |
| Normal load    | ≥ 50 req/s  | Typical usage            |
| Peak load      | ≥ 100 req/s | Busy periods             |
| Burst capacity | ≥ 200 req/s | Short duration (< 1 min) |

### 1.3. Availability SLA

- **Uptime**: ≥ 99.5% (excluding planned maintenance)
- **Maximum downtime**: ≤ 3.6 hours/month (≈ 7.2 minutes/day)
- **Recovery Time Objective (RTO)**: ≤ 30 minutes
- **Recovery Point Objective (RPO)**: ≤ 5 minutes

## 2. Infrastructure Setup

### 2.1. Database Configuration

#### MariaDB Connection Pool

```typescript
// ormconfig.ts
{
  type: 'mysql',
  host: process.env.DB_HOST,
  port: parseInt(process.env.DB_PORT ?? '3306', 10),
  username: process.env.DB_USERNAME,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_DATABASE,
  extra: {
    connectionLimit: 20,    // Pool size
    queueLimit: 0,          // Unlimited queue
    acquireTimeout: 10000,  // 10s timeout
    retryAttempts: 3,
    retryDelay: 1000
  }
}
```

#### High Availability Setup

```yaml
# docker-compose.yml
services:
  mariadb-master:
    image: mariadb:11.8
    environment:
      MYSQL_REPLICATION_MODE: master
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
    volumes:
      - mariadb-master-data:/var/lib/mysql
    networks:
      - backend

  mariadb-replica:
    image: mariadb:11.8
    environment:
      MYSQL_REPLICATION_MODE: slave
      MYSQL_MASTER_HOST: mariadb-master
      MYSQL_MASTER_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
    volumes:
      - mariadb-replica-data:/var/lib/mysql
    networks:
      - backend
```

### 2.2. Redis Configuration

#### Redis Sentinel for High Availability

```yaml
# docker-compose.yml
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis-master-data:/data
    networks:
      - backend

  redis-replica:
    image: redis:7-alpine
    command: redis-server --replicaof redis-master 6379 --appendonly yes
    volumes:
      - redis-replica-data:/data
    networks:
      - backend

  redis-sentinel:
    image: redis:7-alpine
    command: >
      redis-sentinel /etc/redis/sentinel.conf
      --sentinel monitor mymaster redis-master 6379 2
      --sentinel down-after-milliseconds mymaster 5000
      --sentinel failover-timeout mymaster 10000
    networks:
      - backend
```

#### Redis Connection Pool

```typescript
// redis.config.ts
import IORedis from 'ioredis';

export const redisConfig = {
  host: process.env.REDIS_HOST || 'localhost',
  port: parseInt(process.env.REDIS_PORT ?? '6379', 10),
  password: process.env.REDIS_PASSWORD,
  maxRetriesPerRequest: 3,
  enableReadyCheck: true,
  lazyConnect: false,
  poolSize: 10,
  retryStrategy: (times: number) => {
    if (times > 3) {
      return null; // Stop retrying
    }
    return Math.min(times * 100, 3000);
  },
};
```
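For reference, a minimal sketch of how a service might construct its client from `redisConfig` and verify connectivity at startup. The `createRedisClient` and `checkRedis` helpers and the log calls are illustrative additions, not existing code:

```typescript
// redis.client.ts (illustrative sketch, not existing code)
import IORedis from 'ioredis';
import { redisConfig } from './redis.config';

export function createRedisClient(): IORedis {
  const client = new IORedis(redisConfig);

  // Surface connection state so operators can correlate it with the
  // docnum_redis_connection_status gauge and the RedisUnavailable alert.
  client.on('ready', () => console.log('[redis] connection ready'));
  client.on('error', (err) => console.error('[redis] connection error', err));

  return client;
}

// Startup health check: PING should answer PONG within the configured timeouts.
export async function checkRedis(client: IORedis): Promise<boolean> {
  try {
    return (await client.ping()) === 'PONG';
  } catch {
    return false;
  }
}
```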
### 2.3. Load Balancing

#### Nginx Configuration

```nginx
# nginx.conf
upstream backend {
    least_conn;  # Least-connections algorithm
    server backend-1:3000 max_fails=3 fail_timeout=30s weight=1;
    server backend-2:3000 max_fails=3 fail_timeout=30s weight=1;
    server backend-3:3000 max_fails=3 fail_timeout=30s weight=1;
    keepalive 32;
}

server {
    listen 80;
    server_name api.lcbp3.local;

    location /api/v1/document-numbering/ {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_next_upstream error timeout;
        proxy_connect_timeout 10s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
```

#### Docker Compose Scaling

```yaml
# docker-compose.yml
services:
  backend:
    image: lcbp3-backend:latest
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    environment:
      NODE_ENV: production
      DB_POOL_SIZE: 20
    networks:
      - backend
```

## 3. Monitoring & Metrics

### 3.1. Prometheus Metrics

#### Key Metrics to Collect

```typescript
// metrics.service.ts
import { Counter, Histogram, Gauge } from 'prom-client';

// Lock acquisition metrics
export const lockAcquisitionDuration = new Histogram({
  name: 'docnum_lock_acquisition_duration_ms',
  help: 'Lock acquisition time in milliseconds',
  labelNames: ['project', 'type'],
  buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000],
});

export const lockAcquisitionFailures = new Counter({
  name: 'docnum_lock_acquisition_failures_total',
  help: 'Total number of lock acquisition failures',
  labelNames: ['project', 'type', 'reason'],
});

// Generation metrics
export const generationDuration = new Histogram({
  name: 'docnum_generation_duration_ms',
  help: 'Total document number generation time',
  labelNames: ['project', 'type', 'status'],
  buckets: [100, 200, 500, 1000, 2000, 5000],
});

export const retryCount = new Histogram({
  name: 'docnum_retry_count',
  help: 'Number of retries per generation',
  labelNames: ['project', 'type'],
  buckets: [0, 1, 2, 3, 5, 10],
});

// Connection health
export const redisConnectionStatus = new Gauge({
  name: 'docnum_redis_connection_status',
  help: 'Redis connection status (1=up, 0=down)',
});

export const dbConnectionPoolUsage = new Gauge({
  name: 'docnum_db_connection_pool_usage',
  help: 'Database connection pool usage percentage',
});
```
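A minimal sketch of how the generation path might record these metrics. The `instrumentedGenerate` wrapper and the `generate` callback are assumptions, not the production code; the `lockAcquisitionTotal` counter is also an assumption, added because the Grafana success-rate panel in §3.4 queries `docnum_lock_acquisition_total`, which is not defined above:

```typescript
// metrics.usage.ts (illustrative sketch; names here are assumptions)
import { Counter } from 'prom-client';
import {
  lockAcquisitionFailures,
  generationDuration,
  retryCount,
} from './metrics.service';

// Assumed total-attempts counter; the success-rate panel in §3.4 divides
// failures by a counter with this name.
export const lockAcquisitionTotal = new Counter({
  name: 'docnum_lock_acquisition_total',
  help: 'Total number of lock acquisition attempts',
  labelNames: ['project', 'type'],
});

// Hypothetical wrapper around the real generation call.
export async function instrumentedGenerate(
  project: string,
  type: string,
  generate: () => Promise<{ number: string; retries: number }>,
): Promise<string> {
  const labels = { project, type };
  const startedAt = Date.now();
  lockAcquisitionTotal.inc(labels);
  try {
    const result = await generate();
    retryCount.observe(labels, result.retries);
    // Histogram buckets are in milliseconds, so record Date.now() deltas directly.
    generationDuration.observe({ ...labels, status: 'success' }, Date.now() - startedAt);
    return result.number;
  } catch (err) {
    lockAcquisitionFailures.inc({ ...labels, reason: (err as Error).name });
    generationDuration.observe({ ...labels, status: 'error' }, Date.now() - startedAt);
    throw err;
  }
}
```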
### 3.2. Prometheus Alert Rules

```yaml
# prometheus/alerts.yml
groups:
  - name: document_numbering_alerts
    interval: 30s
    rules:
      # CRITICAL: Redis unavailable
      - alert: RedisUnavailable
        expr: docnum_redis_connection_status == 0
        for: 1m
        labels:
          severity: critical
          component: document-numbering
        annotations:
          summary: "Redis is unavailable for document numbering"
          description: "System is falling back to DB-only locking. Performance degraded by 30-50%."
          runbook_url: "https://wiki.lcbp3/runbooks/redis-unavailable"

      # CRITICAL: High lock failure rate
      - alert: HighLockFailureRate
        expr: |
          rate(docnum_lock_acquisition_failures_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
          component: document-numbering
        annotations:
          summary: "Lock acquisition failure rate above 0.1/s"
          description: "Check Redis and database performance immediately"
          runbook_url: "https://wiki.lcbp3/runbooks/high-lock-failure"

      # WARNING: Elevated lock failure rate
      - alert: ElevatedLockFailureRate
        expr: |
          rate(docnum_lock_acquisition_failures_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "Lock acquisition failure rate above 0.05/s"
          description: "Monitor closely. May escalate to critical soon."

      # WARNING: Slow lock acquisition
      - alert: SlowLockAcquisition
        expr: |
          histogram_quantile(0.95,
            rate(docnum_lock_acquisition_duration_ms_bucket[5m])
          ) > 1000
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "P95 lock acquisition time > 1 second"
          description: "Lock acquisition is slower than expected. Check Redis latency."

      # WARNING: High retry count
      - alert: HighRetryCount
        expr: |
          sum by (project) (
            increase(docnum_retry_count_sum[1h])
          ) > 100
        for: 1h
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "Retry count > 100 per hour in project {{ $labels.project }}"
          description: "High contention detected. Consider scaling."

      # WARNING: Slow generation
      - alert: SlowDocumentNumberGeneration
        expr: |
          histogram_quantile(0.95,
            rate(docnum_generation_duration_ms_bucket[5m])
          ) > 2000
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "P95 generation time > 2 seconds"
          description: "Document number generation is slower than the SLA target"
```

### 3.3. AlertManager Configuration

```yaml
# alertmanager/config.yml
global:
  resolve_timeout: 5m
  slack_api_url: ${SLACK_WEBHOOK_URL}

route:
  group_by: ['alertname', 'severity', 'project']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'ops-team'
  routes:
    # CRITICAL alerts → PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
      continue: false
    # WARNING alerts → Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: ${PAGERDUTY_SERVICE_KEY}
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#lcbp3-critical-alerts'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Runbook:* {{ .CommonAnnotations.runbook_url }}
        color: 'danger'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#lcbp3-alerts'
        title: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
        color: 'warning'

  - name: 'ops-team'
    email_configs:
      - to: 'ops@example.com'
        subject: '[LCBP3] {{ .GroupLabels.alertname }}'
```

### 3.4. Grafana Dashboard

Key dashboard panels:

1. **Lock Acquisition Success Rate** (Gauge)
   - Query: `1 - (rate(docnum_lock_acquisition_failures_total[5m]) / rate(docnum_lock_acquisition_total[5m]))`
   - Alert threshold: < 95%

2. **Lock Acquisition Time Percentiles** (Graph)
   - P50: `histogram_quantile(0.50, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))`
   - P95: `histogram_quantile(0.95, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))`
   - P99: `histogram_quantile(0.99, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))`

3. **Generation Rate** (Stat)
   - Query: `sum(rate(docnum_generation_duration_ms_count[1m])) * 60`
   - Unit: documents/minute

4. **Error Rate by Type** (Graph)
   - Query: `sum by (reason) (rate(docnum_lock_acquisition_failures_total[5m]))`

5. **Redis Connection Status** (Stat)
   - Query: `docnum_redis_connection_status`
   - Thresholds: 0 = red, 1 = green

6. **DB Connection Pool Usage** (Gauge)
   - Query: `docnum_db_connection_pool_usage`
   - Alert threshold: > 80%
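For the alerts and panels above to have data, the backend must expose the Prometheus registry over HTTP. A minimal sketch, assuming an Express app, the conventional `/metrics` path, and port 3000 (all assumptions; the real service may already expose this through its framework):

```typescript
// metrics.endpoint.ts (illustrative sketch; Express, /metrics, and port 3000 are assumptions)
import express from 'express';
import { register } from 'prom-client';
import './metrics.service'; // importing registers the docnum_* metrics in the default registry

const app = express();

// Prometheus scrapes this endpoint.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
```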
## 4. Troubleshooting Runbooks

### 4.1. Scenario: Redis Unavailable

**Symptoms:**

- Alert: `RedisUnavailable`
- System falls back to DB-only locking
- Performance degraded 30-50%

**Action Steps:**

1. **Check Redis status:**

   ```bash
   docker exec lcbp3-redis redis-cli ping
   # Expected: PONG
   ```

2. **Check Redis logs:**

   ```bash
   docker logs lcbp3-redis --tail=100
   ```

3. **Restart Redis (if needed):**

   ```bash
   docker restart lcbp3-redis
   ```

4. **Verify failover (if using Sentinel):**

   ```bash
   docker exec lcbp3-redis-sentinel redis-cli -p 26379 SENTINEL masters
   ```

5. **Monitor recovery:**
   - Check metric: `docnum_redis_connection_status` returns to 1
   - Check performance: P95 latency returns to normal (< 500 ms)

### 4.2. Scenario: High Lock Failure Rate

**Symptoms:**

- Alert: `HighLockFailureRate` (failures above 0.1/s)
- Users report "system is busy" (ระบบกำลังยุ่ง) errors

**Action Steps:**

1. **Check concurrent load:**

   ```bash
   # Check current request rate
   curl -s 'http://prometheus:9090/api/v1/query?query=rate(docnum_generation_duration_ms_count[1m])'
   ```

2. **Check database connections:**

   ```sql
   SHOW PROCESSLIST;
   -- Look for waiting/locked queries
   ```

3. **Check Redis memory:**

   ```bash
   docker exec lcbp3-redis redis-cli INFO memory
   ```

4. **Scale up if needed:**

   ```bash
   # Increase backend replicas
   docker-compose up -d --scale backend=5
   ```

5. **Check for deadlocks:**

   ```sql
   SHOW ENGINE INNODB STATUS;
   -- Look for LATEST DETECTED DEADLOCK section
   ```

### 4.3. Scenario: Slow Performance

**Symptoms:**

- Alert: `SlowDocumentNumberGeneration`
- P95 > 2 seconds

**Action Steps:**

1. **Check database query performance:**

   ```sql
   SELECT *
   FROM document_number_counters USE INDEX (idx_counter_lookup)
   WHERE project_id = 2
     AND correspondence_type_id = 6
     AND current_year = 2025;

   -- Check execution plan
   EXPLAIN SELECT ...;
   ```

2. **Check for missing indexes:**

   ```sql
   SHOW INDEX FROM document_number_counters;
   ```

3. **Check Redis latency:**

   ```bash
   docker exec lcbp3-redis redis-cli --latency
   ```

4. **Check network latency:**

   ```bash
   ping mariadb-master
   ping redis-master
   ```

5. **Review slow query log:**

   ```bash
   docker exec lcbp3-mariadb-master cat /var/log/mysql/slow.log
   ```

### 4.4. Scenario: Version Conflicts

**Symptoms:**

- High retry count
- Users report "document number has changed" (เลขที่เอกสารถูกเปลี่ยน) errors

**Action Steps:**

1. **Check concurrent requests to the same counter:**

   ```sql
   SELECT project_id, correspondence_type_id, COUNT(*) AS concurrent_requests
   FROM document_number_audit
   WHERE created_at > NOW() - INTERVAL 5 MINUTE
   GROUP BY project_id, correspondence_type_id
   HAVING COUNT(*) > 10
   ORDER BY concurrent_requests DESC;
   ```

2. **Investigate the specific counter:**

   ```sql
   SELECT *
   FROM document_number_counters
   WHERE project_id = X AND correspondence_type_id = Y;

   -- Check audit trail
   SELECT *
   FROM document_number_audit
   WHERE counter_key LIKE '%project_id:X%'
   ORDER BY created_at DESC
   LIMIT 20;
   ```

3. **Check for application bugs:**
   - Review error logs for stack traces
   - Check that the retry logic is working correctly (a sketch of the expected shape follows this runbook)

4. **Temporary mitigation:**
   - Increase the retry count in the application config
   - Consider manual counter adjustment (last resort)
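Step 3 above assumes the generation path retries on optimistic-lock (version) conflicts with backoff. A minimal sketch of that pattern; the `OptimisticLockError` class, `tryGenerate` callback, and the limits shown are hypothetical, not the actual service code:

```typescript
// retry.sketch.ts (illustrative only; names and limits are assumptions)
class OptimisticLockError extends Error {}

async function generateWithRetry(
  tryGenerate: () => Promise<string>,
  maxRetries = 3,
): Promise<string> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await tryGenerate();
    } catch (err) {
      // Only version conflicts are worth retrying; other errors propagate.
      if (!(err instanceof OptimisticLockError) || attempt === maxRetries) {
        throw err;
      }
      // Exponential backoff with jitter to break up lock-step retries.
      const delayMs = Math.min(100 * 2 ** attempt, 2000) + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new OptimisticLockError('Retries exhausted');
}
```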
## 5. Maintenance Procedures

### 5.1. Counter Reset (Manual)

**Requires:** SUPER_ADMIN role + 2-person approval

**Steps:**

1. **Request approval via API:**

   ```http
   POST /api/v1/document-numbering/configs/{configId}/reset-counter
   {
     "reason": "A clear reason, at least 20 characters",
     "approver_1": "user_id",
     "approver_2": "user_id"
   }
   ```

2. **Verify in the audit log:**

   ```sql
   SELECT *
   FROM document_number_config_history
   WHERE config_id = X
   ORDER BY changed_at DESC
   LIMIT 1;
   ```

### 5.2. Template Update

**Best Practices:**

1. Always test the template in staging first
2. Preview generated numbers before applying
3. Document the reason for the change
4. Template changes do NOT affect existing documents

**API Call:**

```http
PUT /api/v1/document-numbering/configs/{configId}
{
  "template": "{ORIGINATOR}-{RECIPIENT}-{SEQ:4}-{YEAR:B.E.}",
  "change_reason": "Reason for the change"
}
```

### 5.3. Database Maintenance

**Weekly Tasks:**

- Check the slow query log
- Optimize tables if needed:

  ```sql
  OPTIMIZE TABLE document_number_counters;
  OPTIMIZE TABLE document_number_audit;
  ```

**Monthly Tasks:**

- Review and archive old audit logs (> 2 years)
- Check index usage:

  ```sql
  SELECT *
  FROM sys.schema_unused_indexes
  WHERE object_schema = 'lcbp3_db';
  ```

## 6. Backup & Recovery

### 6.1. Backup Strategy

**Database:**

- Full backup: daily at 02:00
- Incremental backup: every 4 hours
- Retention: 30 days

**Redis:**

- AOF (Append-Only File) enabled
- Snapshot every 1 hour
- Retention: 7 days

### 6.2. Recovery Procedures

See: [Backup & Recovery Guide](file:///e:/np-dms/lcbp3/specs/04-operations/backup-recovery.md)

## References

- [Requirements](file:///e:/np-dms/lcbp3/specs/01-requirements/03.11-document-numbering.md)
- [Implementation Guide](file:///e:/np-dms/lcbp3/specs/03-implementation/document-numbering.md)
- [Monitoring & Alerting](file:///e:/np-dms/lcbp3/specs/04-operations/monitoring-alerting.md)
- [Incident Response](file:///e:/np-dms/lcbp3/specs/04-operations/incident-response.md)