Document Numbering Operations Guide


title: 'Operations Guide: Document Numbering System'
version: 1.5.1
status: draft
owner: Operations Team
last_updated: 2025-12-02
related:

  • specs/01-requirements/03.11-document-numbering.md
  • specs/03-implementation/document-numbering.md
  • specs/04-operations/monitoring-alerting.md

Overview

This document describes the operations procedures, monitoring, and troubleshooting for the Document Numbering system.

1. Performance Requirements

1.1. Response Time Targets

| Metric | Target | Measurement |
|---|---|---|
| 95th percentile | ≤ 2 seconds | From request to response |
| 99th percentile | ≤ 5 seconds | From request to response |
| Normal operation | ≤ 500 ms | Without retries |

1.2. Throughput Targets

| Load Level | Target | Notes |
|---|---|---|
| Normal load | ≥ 50 req/s | Typical usage |
| Peak load | ≥ 100 req/s | Peak work periods |
| Burst capacity | ≥ 200 req/s | Short duration (< 1 min) |

1.3. Availability SLA

  • Uptime: ≥ 99.5% (excluding planned maintenance)
  • Maximum downtime: ≤ 3.6 hours/month (0.5% of a 30-day month; ~7.2 minutes/day)
  • Recovery Time Objective (RTO): ≤ 30 minutes
  • Recovery Point Objective (RPO): ≤ 5 minutes

2. Infrastructure Setup

2.1. Database Configuration

MariaDB Connection Pool

// ormconfig.ts
{
  type: 'mysql',
  host: process.env.DB_HOST,
  port: parseInt(process.env.DB_PORT ?? '3306', 10),
  username: process.env.DB_USERNAME,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_DATABASE,
  // Connection retry options (read by @nestjs/typeorm, not by the mysql2 pool)
  retryAttempts: 3,
  retryDelay: 1000,
  extra: {
    // Options passed through to the mysql2 connection pool
    connectionLimit: parseInt(process.env.DB_POOL_SIZE ?? '20', 10),  // Pool size (matches DB_POOL_SIZE in docker-compose)
    queueLimit: 0,                                                    // Unlimited queue
    acquireTimeout: 10000                                             // 10s timeout
  }
}

High Availability Setup

# docker-compose.yml
services:
  mariadb-master:
    image: mariadb:11.8
    # NOTE: the MYSQL_REPLICATION_MODE / MYSQL_MASTER_* variables used here are
    # Bitnami-style image variables; the official mariadb image ignores them, so
    # replication must be configured separately (or a Bitnami-based image used).
    environment:
      MYSQL_REPLICATION_MODE: master
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
    volumes:
      - mariadb-master-data:/var/lib/mysql
    networks:
      - backend

  mariadb-replica:
    image: mariadb:11.8
    environment:
      MYSQL_REPLICATION_MODE: slave
      MYSQL_MASTER_HOST: mariadb-master
      MYSQL_MASTER_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
    volumes:
      - mariadb-replica-data:/var/lib/mysql
    networks:
      - backend

2.2. Redis Configuration

Redis Sentinel for High Availability

# docker-compose.yml
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis-master-data:/data
    networks:
      - backend

  redis-replica:
    image: redis:7-alpine
    command: redis-server --replicaof redis-master 6379 --appendonly yes
    volumes:
      - redis-replica-data:/data
    networks:
      - backend

  redis-sentinel:
    image: redis:7-alpine
    # NOTE: /etc/redis/sentinel.conf must exist and be writable (e.g. mounted from
    # a volume). When monitoring the master by hostname, as done here, the config
    # also needs "sentinel resolve-hostnames yes" (Redis 6.2+).
    command: >
      redis-sentinel /etc/redis/sentinel.conf
      --sentinel monitor mymaster redis-master 6379 2
      --sentinel down-after-milliseconds mymaster 5000
      --sentinel failover-timeout mymaster 10000
    networks:
      - backend

Redis Connection Pool

// redis.config.ts
import IORedis from 'ioredis';

// ioredis keeps a single connection per client (there is no poolSize option);
// throughput is scaled by running more backend replicas instead.
export const redisConfig = {
  host: process.env.REDIS_HOST || 'localhost',
  port: parseInt(process.env.REDIS_PORT ?? '6379', 10),
  password: process.env.REDIS_PASSWORD,
  maxRetriesPerRequest: 3,
  enableReadyCheck: true,
  lazyConnect: false,
  retryStrategy: (times: number) => {
    if (times > 3) {
      return null; // Stop retrying; surface the error to the caller
    }
    return Math.min(times * 100, 3000); // Back off, up to 3s between attempts
  },
};
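
A minimal usage sketch: building a client from redisConfig and taking a short-lived, per-counter lock with SET NX PX. The key prefix, TTL, and file name are illustrative assumptions, not the production implementation (which would also release via an ownership-token check rather than a plain DEL):

// redis-lock.example.ts (illustrative sketch)
import IORedis from 'ioredis';
import { redisConfig } from './redis.config';

const redis = new IORedis(redisConfig);

// Try to take a per-counter lock for up to ttlMs milliseconds.
// Returns true when the lock was acquired, false when another request holds it.
export async function tryAcquireCounterLock(counterKey: string, ttlMs = 5000): Promise<boolean> {
  // SET key value PX ttl NX — succeeds only if the key does not exist yet.
  const result = await redis.set(`docnum:lock:${counterKey}`, '1', 'PX', ttlMs, 'NX');
  return result === 'OK';
}

export async function releaseCounterLock(counterKey: string): Promise<void> {
  // Simplified release; a production lock stores a unique token and deletes
  // the key only if the token still matches (e.g. via a small Lua script).
  await redis.del(`docnum:lock:${counterKey}`);
}

The TTL bounds how long a crashed worker can hold a lock; the retry and fallback behaviour described in sections 3 and 4 sits on top of this primitive.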

2.3. Load Balancing

Nginx Configuration

# nginx.conf
upstream backend {
  least_conn;  # Least connections algorithm
  server backend-1:3000 max_fails=3 fail_timeout=30s weight=1;
  server backend-2:3000 max_fails=3 fail_timeout=30s weight=1;
  server backend-3:3000 max_fails=3 fail_timeout=30s weight=1;

  keepalive 32;
}

server {
  listen 80;
  server_name api.lcbp3.local;

  location /api/v1/document-numbering/ {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    proxy_next_upstream error timeout;
    proxy_connect_timeout 10s;
    proxy_send_timeout 30s;
    proxy_read_timeout 30s;
  }
}

Docker Compose Scaling

# docker-compose.yml
services:
  backend:
    image: lcbp3-backend:latest
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    environment:
      NODE_ENV: production
      DB_POOL_SIZE: 20
    networks:
      - backend

3. Monitoring & Metrics

3.1. Prometheus Metrics

Key Metrics to Collect

// metrics.service.ts
import { Counter, Histogram, Gauge } from 'prom-client';

// Lock acquisition metrics
export const lockAcquisitionTotal = new Counter({
  name: 'docnum_lock_acquisition_total',
  help: 'Total number of lock acquisition attempts',
  labelNames: ['project', 'type'],
});

export const lockAcquisitionDuration = new Histogram({
  name: 'docnum_lock_acquisition_duration_ms',
  help: 'Lock acquisition time in milliseconds',
  labelNames: ['project', 'type'],
  buckets: [10, 50, 100, 200, 500, 1000, 2000, 5000],
});

export const lockAcquisitionFailures = new Counter({
  name: 'docnum_lock_acquisition_failures_total',
  help: 'Total number of lock acquisition failures',
  labelNames: ['project', 'type', 'reason'],
});

// Generation metrics
export const generationDuration = new Histogram({
  name: 'docnum_generation_duration_ms',
  help: 'Total document number generation time',
  labelNames: ['project', 'type', 'status'],
  buckets: [100, 200, 500, 1000, 2000, 5000],
});

export const retryCount = new Histogram({
  name: 'docnum_retry_count',
  help: 'Number of retries per generation',
  labelNames: ['project', 'type'],
  buckets: [0, 1, 2, 3, 5, 10],
});

// Connection health
export const redisConnectionStatus = new Gauge({
  name: 'docnum_redis_connection_status',
  help: 'Redis connection status (1=up, 0=down)',
});

export const dbConnectionPoolUsage = new Gauge({
  name: 'docnum_db_connection_pool_usage',
  help: 'Database connection pool usage percentage',
});
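
A minimal sketch of how these metrics might be recorded in the generation flow. acquireLock and generate stand in for the real helpers and the label values are illustrative; lockAcquisitionTotal is the attempt counter used as the denominator in the dashboards and alerts below:

// Illustrative recording of the metrics defined above.
import {
  generationDuration,
  lockAcquisitionDuration,
  lockAcquisitionFailures,
  lockAcquisitionTotal,
} from './metrics.service';

export async function generateWithMetrics(
  project: string,
  type: string,
  acquireLock: () => Promise<void>,     // assumed lock helper
  generate: () => Promise<string>,      // assumed generation routine
): Promise<string> {
  const startedAt = Date.now();
  lockAcquisitionTotal.labels(project, type).inc();

  const lockStartedAt = Date.now();
  try {
    await acquireLock();
    lockAcquisitionDuration.labels(project, type).observe(Date.now() - lockStartedAt);
  } catch (err) {
    lockAcquisitionFailures.labels(project, type, 'lock_timeout').inc();
    generationDuration.labels(project, type, 'lock_failed').observe(Date.now() - startedAt);
    throw err;
  }

  const documentNumber = await generate();
  generationDuration.labels(project, type, 'success').observe(Date.now() - startedAt);
  return documentNumber;
}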

3.2. Prometheus Alert Rules

# prometheus/alerts.yml
groups:
  - name: document_numbering_alerts
    interval: 30s
    rules:
      # CRITICAL: Redis unavailable
      - alert: RedisUnavailable
        expr: docnum_redis_connection_status == 0
        for: 1m
        labels:
          severity: critical
          component: document-numbering
        annotations:
          summary: "Redis is unavailable for document numbering"
          description: "System is falling back to DB-only locking. Performance degraded by 30-50%."
          runbook_url: "https://wiki.lcbp3/runbooks/redis-unavailable"

      # CRITICAL: High lock failure rate
      - alert: HighLockFailureRate
        expr: |
          sum(rate(docnum_lock_acquisition_failures_total[5m]))
            / sum(rate(docnum_lock_acquisition_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
          component: document-numbering
        annotations:
          summary: "Lock acquisition failure rate > 10%"
          description: "Check Redis and database performance immediately"
          runbook_url: "https://wiki.lcbp3/runbooks/high-lock-failure"

      # WARNING: Elevated lock failure rate
      - alert: ElevatedLockFailureRate
        expr: |
          sum(rate(docnum_lock_acquisition_failures_total[5m]))
            / sum(rate(docnum_lock_acquisition_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "Lock acquisition failure rate > 5%"
          description: "Monitor closely. May escalate to critical soon."

      # WARNING: Slow lock acquisition
      - alert: SlowLockAcquisition
        expr: |
          histogram_quantile(0.95,
            rate(docnum_lock_acquisition_duration_ms_bucket[5m])
          ) > 1000
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "P95 lock acquisition time > 1 second"
          description: "Lock acquisition is slower than expected. Check Redis latency."

      # WARNING: High retry count
      - alert: HighRetryCount
        expr: |
          sum by (project) (
            increase(docnum_retry_count_sum[1h])
          ) > 100
        for: 1h
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "Retry count > 100 per hour in project {{ $labels.project }}"
          description: "High contention detected. Consider scaling."

      # WARNING: Slow generation
      - alert: SlowDocumentNumberGeneration
        expr: |
          histogram_quantile(0.95,
            rate(docnum_generation_duration_ms_bucket[5m])
          ) > 2000
        for: 5m
        labels:
          severity: warning
          component: document-numbering
        annotations:
          summary: "P95 generation time > 2 seconds"
          description: "Document number generation is slower than SLA target"

3.3. AlertManager Configuration

# alertmanager/config.yml
global:
  resolve_timeout: 5m
  slack_api_url: ${SLACK_WEBHOOK_URL}

route:
  group_by: ['alertname', 'severity', 'project']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'ops-team'

  routes:
    # CRITICAL alerts → PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    - match:
        severity: critical
      receiver: 'slack-critical'
      continue: false

    # WARNING alerts → Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: ${PAGERDUTY_SERVICE_KEY}
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#lcbp3-critical-alerts'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Runbook:* {{ .CommonAnnotations.runbook_url }}
        color: 'danger'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#lcbp3-alerts'
        title: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
        color: 'warning'

  - name: 'ops-team'
    email_configs:
      - to: 'ops@example.com'
        headers:
          Subject: '[LCBP3] {{ .GroupLabels.alertname }}'

3.4. Grafana Dashboard

Key dashboard panels:

  1. Lock Acquisition Success Rate (Gauge)

    • Query: 1 - (rate(docnum_lock_acquisition_failures_total[5m]) / rate(docnum_lock_acquisition_total[5m]))
    • Alert threshold: < 95%
  2. Lock Acquisition Time Percentiles (Graph)

    • P50: histogram_quantile(0.50, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))
    • P95: histogram_quantile(0.95, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))
    • P99: histogram_quantile(0.99, rate(docnum_lock_acquisition_duration_ms_bucket[5m]))
  3. Generation Rate (Stat)

    • Query: sum(rate(docnum_generation_duration_ms_count[1m])) * 60
    • Unit: documents/minute
  4. Error Rate by Type (Graph)

    • Query: sum by (reason) (rate(docnum_lock_acquisition_failures_total[5m]))
  5. Redis Connection Status (Stat)

    • Query: docnum_redis_connection_status
    • Thresholds: 0 = red, 1 = green
  6. DB Connection Pool Usage (Gauge)

    • Query: docnum_db_connection_pool_usage
    • Alert threshold: > 80%

4. Troubleshooting Runbooks

4.1. Scenario: Redis Unavailable

Symptoms:

  • Alert: RedisUnavailable
  • System falls back to DB-only locking
  • Performance degraded 30-50%

Action Steps:

  1. Check Redis status:

    docker exec lcbp3-redis redis-cli ping
    # Expected: PONG
    
  2. Check Redis logs:

    docker logs lcbp3-redis --tail=100
    
  3. Restart Redis (if needed):

    docker restart lcbp3-redis
    
  4. Verify failover (if using Sentinel):

    docker exec lcbp3-redis-sentinel redis-cli -p 26379 SENTINEL masters
    
  5. Monitor recovery:

    • Check metric: docnum_redis_connection_status returns to 1 (see the sketch below for how this gauge is updated)
    • Check performance: P95 latency returns to normal (< 500ms)
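
The docnum_redis_connection_status gauge checked in step 5 is driven by the application. A minimal sketch wiring it to ioredis connection events (the function name and where it is called from are assumptions):

// Illustrative wiring of the Redis health gauge from section 3.1.
import IORedis from 'ioredis';
import { redisConnectionStatus } from './metrics.service';

export function trackRedisHealth(redis: IORedis): void {
  // 'ready' fires once the connection is established and usable again.
  redis.on('ready', () => redisConnectionStatus.set(1));
  // 'error' / 'end' mean the client is (temporarily) unusable; while the gauge
  // is 0 the generator runs in the degraded DB-only locking mode described above.
  redis.on('error', () => redisConnectionStatus.set(0));
  redis.on('end', () => redisConnectionStatus.set(0));
}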

4.2. Scenario: High Lock Failure Rate

Symptoms:

  • Alert: HighLockFailureRate (> 10%)
  • Users report "ระบบกำลังยุ่ง" ("the system is busy") errors

Action Steps:

  1. Check concurrent load:

    # Check current request rate
    curl -G 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=rate(docnum_generation_duration_ms_count[1m])'
    
  2. Check database connections:

    SHOW PROCESSLIST;
    -- Look for waiting/locked queries
    
  3. Check Redis memory:

    docker exec lcbp3-redis redis-cli INFO memory
    
  4. Scale up if needed:

    # Increase backend replicas
    docker-compose up -d --scale backend=5
    
  5. Check for deadlocks:

    SHOW ENGINE INNODB STATUS;
    -- Look for LATEST DETECTED DEADLOCK section
    

4.3. Scenario: Slow Performance

Symptoms:

  • Alert: SlowDocumentNumberGeneration
  • P95 > 2 seconds

Action Steps:

  1. Check database query performance:

    SELECT * FROM document_number_counters USE INDEX (idx_counter_lookup)
    WHERE project_id = 2 AND correspondence_type_id = 6 AND current_year = 2025;
    
    -- Check execution plan
    EXPLAIN SELECT ...;
    
  2. Check for missing indexes:

    SHOW INDEX FROM document_number_counters;
    
  3. Check Redis latency:

    docker exec lcbp3-redis redis-cli --latency
    
  4. Check network latency:

    ping mariadb-master
    ping redis-master
    
  5. Review slow query log:

    docker exec lcbp3-mariadb-master cat /var/log/mysql/slow.log
    

4.4. Scenario: Version Conflicts

Symptoms:

  • High retry count
  • Users report "เลขที่เอกสารถูกเปลี่ยน" ("the document number has been changed") errors

Action Steps:

  1. Check concurrent requests to same counter:

    SELECT
      project_id,
      correspondence_type_id,
      COUNT(*) as concurrent_requests
    FROM document_number_audit
    WHERE created_at > NOW() - INTERVAL 5 MINUTE
    GROUP BY project_id, correspondence_type_id
    HAVING COUNT(*) > 10
    ORDER BY concurrent_requests DESC;
    
  2. Investigate specific counter:

    SELECT * FROM document_number_counters
    WHERE project_id = X AND correspondence_type_id = Y;
    
    -- Check audit trail
    SELECT * FROM document_number_audit
    WHERE counter_key LIKE '%project_id:X%'
    ORDER BY created_at DESC
    LIMIT 20;
    
  3. Check for application bugs:

    • Review error logs for stack traces
    • Check that the retry logic is working correctly (see the sketch after this list)
  4. Temporary mitigation:

    • Increase retry count in application config
    • Consider manual counter adjustment (last resort)
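
For context, a minimal sketch of the optimistic-locking retry loop these steps assume. MAX_RETRIES, VersionConflictError, and updateCounter are illustrative names, not the actual implementation:

// Illustrative retry wrapper around a version-checked counter update.
const MAX_RETRIES = 3;  // the "retry count in application config" mentioned above

class VersionConflictError extends Error {}

export async function reserveNextNumber(
  updateCounter: () => Promise<number>,  // assumed: bumps the counter, throws VersionConflictError on a stale version
): Promise<number> {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await updateCounter();
    } catch (err) {
      if (!(err instanceof VersionConflictError) || attempt === MAX_RETRIES) {
        throw err;
      }
      // Jittered backoff so colliding requests do not retry in lockstep.
      await new Promise((resolve) => setTimeout(resolve, 50 * attempt + Math.random() * 50));
    }
  }
  throw new VersionConflictError('unreachable'); // the loop always returns or throws
}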

5. Maintenance Procedures

5.1. Counter Reset (Manual)

Requires: SUPER_ADMIN role + 2-person approval

Steps:

  1. Request approval via API:

    POST /api/v1/document-numbering/configs/{configId}/reset-counter
    {
      "reason": "เหตุผลที่ชัดเจน อย่างน้อย 20 ตัวอักษร",
      "approver_1": "user_id",
      "approver_2": "user_id"
    }
    
  2. Verify in audit log:

    SELECT * FROM document_number_config_history
    WHERE config_id = X
    ORDER BY changed_at DESC
    LIMIT 1;
    

5.2. Template Update

Best Practices:

  1. Always test template in staging first
  2. Preview generated numbers before applying (see the sketch below)
  3. Document reason for change
  4. Template changes do NOT affect existing documents

API Call:

PUT /api/v1/document-numbering/configs/{configId}
{
  "template": "{ORIGINATOR}-{RECIPIENT}-{SEQ:4}-{YEAR:B.E.}",
  "change_reason": "เหตุผลในการเปลี่ยนแปลง"
}
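
To make the preview step concrete, a minimal sketch of expanding the template placeholders into a candidate number. The function and the example codes ('PAT', 'CSC') are illustrative; {SEQ:4} is taken as a 4-digit zero-padded running number and {YEAR:B.E.} as the Buddhist Era year (Gregorian year + 543):

// Illustrative template preview for {ORIGINATOR}-{RECIPIENT}-{SEQ:4}-{YEAR:B.E.}
export function previewNumber(
  template: string,
  values: { originator: string; recipient: string; seq: number; date?: Date },
): string {
  const date = values.date ?? new Date();
  return template
    .replace('{ORIGINATOR}', values.originator)
    .replace('{RECIPIENT}', values.recipient)
    // {SEQ:n} → zero-padded running number, n digits wide
    .replace(/\{SEQ:(\d+)\}/, (_, width) => String(values.seq).padStart(Number(width), '0'))
    // {YEAR:B.E.} → Buddhist Era year
    .replace('{YEAR:B.E.}', String(date.getFullYear() + 543));
}

// previewNumber('{ORIGINATOR}-{RECIPIENT}-{SEQ:4}-{YEAR:B.E.}',
//   { originator: 'PAT', recipient: 'CSC', seq: 12, date: new Date('2025-12-02') })
// → 'PAT-CSC-0012-2568'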

5.3. Database Maintenance

Weekly Tasks:

  • Check slow query log
  • Optimize tables if needed:
    OPTIMIZE TABLE document_number_counters;
    OPTIMIZE TABLE document_number_audit;
    

Monthly Tasks:

  • Review and archive old audit logs (> 2 years)
  • Check index usage:
    SELECT * FROM sys.schema_unused_indexes
    WHERE object_schema = 'lcbp3_db';
    

6. Backup & Recovery

6.1. Backup Strategy

Database:

  • Full backup: Daily at 02:00 AM
  • Incremental backup: Every 4 hours
  • Retention: 30 days

Redis:

  • AOF (Append-Only File) enabled
  • Snapshot every 1 hour
  • Retention: 7 days

6.2. Recovery Procedures

See: Backup & Recovery Guide

References