Files
lcbp3/specs/99-archives/07_disaster_recovery.md
admin ef16817f38
All checks were successful
Build and Deploy / deploy (push) Successful in 4m44s
260223:1415 20260223 nextJS & nestJS Best pratices
2026-02-23 14:15:06 +07:00

5.5 KiB
Raw Permalink Blame History

Disaster Recovery Plan สำหรับ LCBP3-DMS

📍 Version: v1.8.0 🖥️ Primary Server: QNAP TS-473A (Application & Database) 💾 Backup Server: ASUSTOR AS5403T (Infrastructure & Backup)


RTO/RPO Targets

Scenario RTO RPO Priority
Single backend node failure 0 min 0 P0
Redis failure 5 min 0 P0
MariaDB failure 10 min 0 P0
QNAP total failure 2 hours 15 min P1
Data corruption 4 hours 1 day P2

1. Quick Recovery Procedures

1.1 Service Not Responding

# Check container status
docker ps -a | grep <service-name>

# Restart specific service
docker restart <container-name>

# Check logs for errors
docker logs <container-name> --tail 100

1.2 Redis Failure

# Check status
docker exec cache redis-cli ping

# Restart
docker restart cache

# Verify
docker exec cache redis-cli ping

1.3 MariaDB Failure

# Check status
docker exec mariadb mysql -u root -p -e "SELECT 1"

# Restart
docker restart mariadb

# Wait for startup
sleep 30

# Verify
docker exec mariadb mysql -u root -p -e "SHOW DATABASES"

2. Full System Recovery

2.1 Recovery Prerequisites (ASUSTOR)

ตรวจสอบว่า Backup files พร้อมใช้งาน:

# SSH to ASUSTOR
ssh admin@192.168.10.9

# List available backups
ls -la /volume1/backup/db/
ls -la /volume1/backup/redis/
ls -la /volume1/backup/config/

# Check Restic snapshots
restic -r /volume1/backup/restic-repo snapshots

2.2 QNAP Recovery Script

#!/bin/bash
# File: /volume1/np-dms/scripts/disaster-recovery.sh
# Run on: ASUSTOR (Push to QNAP)

QNAP_IP="192.168.10.8"
BACKUP_DIR="/volume1/backup"

echo "🚨 Starting Disaster Recovery..."
echo "================================"

# 1. Restore Docker Network
echo "1⃣ Creating Docker network..."
ssh admin@$QNAP_IP "docker network create lcbp3 || true"

# 2. Restore config files
echo "2⃣ Restoring configuration files..."
LATEST_CONFIG=$(ls -t $BACKUP_DIR/config/*.tar.gz | head -1)
tar -xzf $LATEST_CONFIG -C /tmp/
rsync -avz /tmp/np-dms/ admin@$QNAP_IP:/share/np-dms/

# 3. Start infrastructure services
echo "3⃣ Starting MariaDB..."
ssh admin@$QNAP_IP "cd /share/np-dms/mariadb && docker-compose up -d"
sleep 30

# 4. Restore database
echo "4⃣ Restoring database..."
LATEST_DB=$(ls -t $BACKUP_DIR/db/*.sql.gz | head -1)
gunzip -c $LATEST_DB | ssh admin@$QNAP_IP "docker exec -i mariadb mysql -u root -p\$MYSQL_ROOT_PASSWORD lcbp3_db"

# 5. Start Redis
echo "5⃣ Starting Redis..."
ssh admin@$QNAP_IP "cd /share/np-dms/services && docker-compose up -d cache"

# 6. Restore Redis data (if needed)
echo "6⃣ Restoring Redis data..."
LATEST_REDIS=$(ls -t $BACKUP_DIR/redis/*.tar.gz | head -1)
tar -xzf $LATEST_REDIS -C /tmp/
scp /tmp/redis_*.rdb admin@$QNAP_IP:/share/np-dms/services/cache/data/dump.rdb
ssh admin@$QNAP_IP "docker restart cache"

# 7. Start remaining services
echo "7⃣ Starting application services..."
ssh admin@$QNAP_IP "cd /share/np-dms/services && docker-compose up -d"
ssh admin@$QNAP_IP "cd /share/np-dms/npm && docker-compose up -d"

# 8. Health check
echo "8⃣ Running health checks..."
sleep 60
curl -f https://lcbp3.np-dms.work/health || echo "⚠️ Frontend not ready"
curl -f https://backend.np-dms.work/health || echo "⚠️ Backend not ready"

echo ""
echo "✅ Disaster Recovery Complete"
echo "⚠️ Please verify system functionality manually"

3. Data Corruption Recovery

3.1 Point-in-Time Recovery (Database)

# List available Restic snapshots
restic -r /volume1/backup/restic-repo snapshots

# Restore specific snapshot
restic -r /volume1/backup/restic-repo restore <snapshot-id> --target /tmp/restore/

# Apply restored backup
gunzip -c /tmp/restore/lcbp3_*.sql.gz | \
  ssh admin@192.168.10.8 "docker exec -i mariadb mysql -u root -p\$MYSQL_ROOT_PASSWORD lcbp3_db"

3.2 Selective Table Recovery

# Extract specific tables from backup
gunzip -c /volume1/backup/db/lcbp3_YYYYMMDD.sql.gz | \
  grep -A1000 "CREATE TABLE \`documents\`" | \
  grep -B1000 "UNLOCK TABLES" > /tmp/documents_table.sql

# Restore specific table
ssh admin@192.168.10.8 "docker exec -i mariadb mysql -u root -p\$MYSQL_ROOT_PASSWORD lcbp3_db" < /tmp/documents_table.sql

4. Communication & Escalation

4.1 Incident Response

Severity Response Time Notify
P0 Immediate Admin Team + Management
P1 30 minutes Admin Team
P2 2 hours Admin Team (next business day)

4.2 Post-Incident Checklist

  • Identify root cause
  • Document timeline of events
  • Verify all services restored
  • Check data integrity
  • Update monitoring alerts if needed
  • Create incident report

5. Testing Schedule

Test Type Frequency Last Tested Next Due
Backup Verification Weekly - -
Single Service Recovery Monthly - -
Full DR Test Quarterly - -

📝 หมายเหตุ: เอกสารนี้อ้างอิงจาก Architecture Document v1.8.0