Main: revise specs to 1.5.0 (completed)

2025-12-01 01:28:32 +07:00
parent ce795b26e2
commit 047e1b88ce
69 changed files with 28252 additions and 74 deletions
--- a/specs/04-operations/incident-response.md
+++ b/specs/04-operations/incident-response.md
@@ -0,0 +1,483 @@
+# Incident Response Procedures
+
+**Project:** LCBP3-DMS
+**Version:** 1.5.0
+**Last Updated:** 2025-12-01
+
+---
+
+## 📋 Overview
+
+This document outlines incident classification, response procedures, and post-incident reviews for LCBP3-DMS.
+
+---
+
+## 🚨 Incident Classification
+
+### Severity Levels
+
+| Severity          | Description                  | Response Time     | Examples                                        |
+| ----------------- | ---------------------------- | ----------------- | ----------------------------------------------- |
+| **P0 - Critical** | Complete system outage       | 15 minutes        | Database down, All services unavailable         |
+| **P1 - High**     | Major functionality impaired | 1 hour            | Authentication failing, Cannot create documents |
+| **P2 - Medium**   | Degraded performance         | 4 hours           | Slow response time, Some features broken        |
+| **P3 - Low**      | Minor issues                 | Next business day | UI glitch, Non-critical bug                     |
+
+---
+
+## 📞 Incident Response Team
+
+### Roles & Responsibilities
+
+**Incident Commander (IC)**
+
+- Coordinates response efforts
+- Makes final decisions
+- Communicates with stakeholders
+
+**Technical Lead (TL)**
+
+- Diagnoses technical issues
+- Implements fixes
+- Coordinates with engineers
+
+**Communications Lead (CL)**
+
+- Updates stakeholders
+- Manages internal/external communications
+- Documents incident timeline
+
+**On-Call Engineer**
+
+- First responder
+- Initial triage and investigation
+- Escalates to appropriate team
+
+---
+
+## 🔄 Incident Response Workflow
+
+```mermaid
+flowchart TD
+    Start([Incident Detected]) --> Acknowledge[Acknowledge Incident]
+    Acknowledge --> Assess[Assess Severity]
+    Assess --> P0{Severity?}
+
+    P0 -->|P0/P1| Alert[Page Incident Commander]
+    P0 -->|P2/P3| Assign[Assign to On-Call]
+
+    Alert --> Investigate[Investigate Root Cause]
+    Assign --> Investigate
+
+    Investigate --> Mitigate[Implement Mitigation]
+    Mitigate --> Verify[Verify Resolution]
+
+    Verify --> Resolved{Resolved?}
+    Resolved -->|No| Escalate[Escalate/Re-assess]
+    Escalate --> Investigate
+
+    Resolved -->|Yes| Communicate[Communicate Resolution]
+    Communicate --> PostMortem[Schedule Post-Mortem]
+    PostMortem --> End([Close Incident])
+```
+
+---
+
+## 📋 Incident Response Playbooks
+
+### P0: Database Down
+
+**Symptoms:**
+
+- Backend returns 500 errors
+- Cannot connect to database
+- Health check fails
+
+**Immediate Actions:**
+
+1. **Verify Issue**
+
+   ```bash
+   docker ps | grep mariadb
+   docker logs lcbp3-mariadb --tail=50
+   ```
+
+2. **Attempt Restart**
+
+   ```bash
+   docker restart lcbp3-mariadb
+   ```
+
+3. **Check Database Process**
+
+   ```bash
+   docker exec lcbp3-mariadb ps aux | grep mysql
+   ```
+
+4. **If Restart Fails:**
+
+   ```bash
+   # Check disk space
+   df -h
+
+   # Check database logs for corruption
+   docker exec lcbp3-mariadb cat /var/log/mysql/error.log
+
+   # If corrupted, restore from backup
+   # See backup-recovery.md
+   ```
+
+5. **Escalate to DBA** if not resolved in 30 minutes
+
+---
+
+### P0: Complete System Outage
+
+**Symptoms:**
+
+- All services return 502/503
+- Health checks fail
+- Users cannot access system
+
+**Immediate Actions:**
+
+1. **Check Container Status**
+
+   ```bash
+   docker-compose ps
+   # Identify which containers are down
+   ```
+
+2. **Restart All Services**
+
+   ```bash
+   docker-compose restart
+   ```
+
+3. **Check QNAP Server Resources**
+
+   ```bash
+   top
+   df -h
+   free -h
+   ```
+
+4. **Check Network**
+
+   ```bash
+   ping 8.8.8.8
+   netstat -tlnp
+   ```
+
+5. **If Server Issue:**
+   - Reboot QNAP server
+   - Contact QNAP support
+
+---
+
+### P1: Authentication System Failing
+
+**Symptoms:**
+
+- Users cannot log in
+- JWT validation fails
+- 401 errors
+
+**Immediate Actions:**
+
+1. **Check Redis (Session Store)**
+
+   ```bash
+   docker exec lcbp3-redis redis-cli ping
+   # Should return PONG
+   ```
+
+2. **Check JWT Secret Configuration**
+
+   ```bash
+   docker exec lcbp3-backend env | grep JWT_SECRET
+   # Verify not empty
+   ```
+
+3. **Check Backend Logs**
+
+   ```bash
+   docker logs lcbp3-backend --tail=100 | grep "JWT\|Auth"
+   ```
+
+4. **Temporary Mitigation:**
+   ```bash
+   # Restart backend to reload config
+   docker restart lcbp3-backend
+   ```
+
+---
+
+### P1: File Upload Failing
+
+**Symptoms:**
+
+- Users cannot upload files
+- 500 errors on file upload
+- "Disk full" errors
+
+**Immediate Actions:**
+
+1. **Check Disk Space**
+
+   ```bash
+   df -h /var/lib/docker/volumes/lcbp3_uploads
+   ```
+
+2. **If Disk Full:**
+
+   ```bash
+   # Clean up temp uploads
+   find /var/lib/docker/volumes/lcbp3_uploads/_data/temp \
+     -type f -mtime +1 -delete
+   ```
+
+3. **Check ClamAV (Virus Scanner)**
+
+   ```bash
+   docker logs lcbp3-clamav --tail=50
+   docker restart lcbp3-clamav
+   ```
+
+4. **Check File Permissions**
+   ```bash
+   docker exec lcbp3-backend ls -la /app/uploads
+   ```
+
+---
+
+### P2: Slow Performance
+
+**Symptoms:**
+
+- Pages load slowly
+- API response time > 2s
+- Users complain about slowness
+
+**Actions:**
+
+1. **Check System Resources**
+
+   ```bash
+   docker stats
+   # Identify high CPU/memory containers
+   ```
+
+2. **Check Database Performance**
+
+   ```sql
+   -- Show slow queries
+   SHOW PROCESSLIST;
+
+   -- Check connections
+   SHOW STATUS LIKE 'Threads_connected';
+   ```
+
+3. **Check Redis**
+
+   ```bash
+   docker exec lcbp3-redis redis-cli --stat
+   ```
+
+4. **Check Application Logs**
+
+   ```bash
+   docker logs lcbp3-backend | grep "Slow request"
+   ```
+
+5. **Temporary Mitigation:**
+   - Restart slow containers
+   - Clear Redis cache if needed
+   - Kill long-running queries
+
+---
+
+### P2: Email Notifications Not Sending
+
+**Symptoms:**
+
+- Users not receiving emails
+- Email queue backing up
+
+**Actions:**
+
+1. **Check Email Queue**
+
+   ```bash
+   # Access BullMQ dashboard or check Redis
+   docker exec lcbp3-redis redis-cli LLEN bull:email:waiting
+   ```
+
+2. **Check Email Processor Logs**
+
+   ```bash
+   docker logs lcbp3-backend | grep "email\|SMTP"
+   ```
+
+3. **Test SMTP Connection**
+
+   ```bash
+   docker exec lcbp3-backend node -e "
+   const nodemailer = require('nodemailer');
+   const transport = nodemailer.createTransport({
+     host: process.env.SMTP_HOST,
+     port: process.env.SMTP_PORT,
+     auth: {
+       user: process.env.SMTP_USER,
+       pass: process.env.SMTP_PASS
+     }
+   });
+   transport.verify().then(console.log).catch(console.error);
+   "
+   ```
+
+4. **Check SMTP Credentials**
+   - Verify not expired
+   - Check firewall/network access
+
+---
+
+## 📝 Incident Documentation
+
+### Incident Report Template
+
+```markdown
+# Incident Report: [Brief Description]
+
+**Incident ID:** INC-YYYYMMDD-001
+**Severity:** P1
+**Status:** Resolved
+**Incident Commander:** [Name]
+
+## Timeline
+
+| Time  | Event                                                     |
+| ----- | --------------------------------------------------------- |
+| 14:00 | Alert: High error rate detected                           |
+| 14:05 | On-call engineer acknowledged                             |
+| 14:10 | Identified root cause: Database connection pool exhausted |
+| 14:15 | Implemented mitigation: Increased pool size               |
+| 14:20 | Verified resolution                                       |
+| 14:30 | Incident resolved                                         |
+
+## Impact
+
+- **Duration:** 30 minutes
+- **Affected Users:** ~50 users
+- **Affected Services:** Document creation, Search
+- **Data Loss:** None
+
+## Root Cause
+
+Database connection pool was exhausted due to slow queries not releasing connections.
+
+## Resolution
+
+1. Increased connection pool size from 10 to 20
+2. Optimized slow queries
+3. Added connection pool monitoring
+
+## Action Items
+
+- [ ] Add connection pool size alert (Owner: DevOps, Due: Next Sprint)
+- [ ] Implement automatic query timeouts (Owner: Backend, Due: 2025-12-15)
+- [ ] Review all queries for optimization (Owner: DBA, Due: 2025-12-31)
+
+## Lessons Learned
+
+- Connection pool monitoring was insufficient
+- Need automated remediation for common issues
+```
+
+---
+
+## 🔍 Post-Incident Review (PIR)
+
+### PIR Meeting Agenda
+
+1. **Timeline Review** (10 min)
+
+   - What happened and when?
+   - What was the impact?
+
+2. **Root Cause Analysis** (15 min)
+
+   - Why did it happen?
+   - What were the contributing factors?
+
+3. **What Went Well** (10 min)
+
+   - What did we do right?
+   - What helped us resolve quickly?
+
+4. **What Went Wrong** (15 min)
+
+   - What could we have done better?
+   - What slowed us down?
+
+5. **Action Items** (10 min)
+   - What changes will prevent this?
+   - Who owns each action?
+   - When will they be completed?
+
+### PIR Best Practices
+
+- **Blameless Culture:** Focus on systems, not individuals
+- **Actionable Outcomes:** Every PIR should produce concrete actions
+- **Follow Through:** Track action items to completion
+- **Share Learnings:** Distribute PIR summary to entire team
+
+---
+
+## 📊 Incident Metrics
+
+### Track & Review Monthly
+
+- **MTTR (Mean Time To Resolution):** Average time to resolve incidents
+- **MTBF (Mean Time Between Failures):** Average time between incidents
+- **Incident Frequency:** Number of incidents per month
+- **Severity Distribution:** Breakdown by P0/P1/P2/P3
+- **Repeat Incidents:** Same root cause occurring multiple times
+
+---
+
+## ✅ Incident Response Checklist
+
+### During Incident
+
+- [ ] Acknowledge incident in tracking system
+- [ ] Assess severity and assign IC
+- [ ] Create incident channel (Slack/Teams)
+- [ ] Begin documenting timeline
+- [ ] Investigate and implement mitigation
+- [ ] Communicate status updates every 30 min (P0/P1)
+- [ ] Verify resolution
+- [ ] Communicate resolution to stakeholders
+
+### After Incident
+
+- [ ] Create incident report
+- [ ] Schedule PIR within 48 hours
+- [ ] Identify action items
+- [ ] Assign owners and deadlines
+- [ ] Update runbooks/playbooks
+- [ ] Share learnings with team
+
+---
+
+## 🔗 Related Documents
+
+- [Monitoring & Alerting](./monitoring-alerting.md)
+- [Backup & Recovery](./backup-recovery.md)
+- [Security Operations](./security-operations.md)
+
+---
+
+**Version:** 1.5.0
+**Last Review:** 2025-12-01
+**Next Review:** 2026-03-01