# Incident Response Procedures
**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01

---
## 📋 Overview
This document outlines incident classification, response procedures, and post-incident reviews for LCBP3-DMS.

---
## 🚨 Incident Classification
### Severity Levels
| Severity | Description | Response Time | Examples |
| ----------------- | ---------------------------- | ----------------- | ----------------------------------------------- |
| **P0 - Critical** | Complete system outage | 15 minutes | Database down, All services unavailable |
| **P1 - High** | Major functionality impaired | 1 hour | Authentication failing, Cannot create documents |
| **P2 - Medium** | Degraded performance | 4 hours | Slow response time, Some features broken |
| **P3 - Low** | Minor issues | Next business day | UI glitch, Non-critical bug |
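
The response targets in the table can be encoded so that alerting or paging scripts share one source of truth. The following is a minimal sketch; the function name `response_target` is hypothetical and not part of the LCBP3-DMS tooling.

```bash
# Hypothetical helper: map a severity label to the response target
# listed in the Severity Levels table.
response_target() {
  case "$1" in
    P0) echo "15 minutes" ;;
    P1) echo "1 hour" ;;
    P2) echo "4 hours" ;;
    P3) echo "next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Usage: response_target P1
```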
---
## 📞 Incident Response Team
### Roles & Responsibilities
**Incident Commander (IC)**
- Coordinates response efforts
- Makes final decisions
- Communicates with stakeholders
**Technical Lead (TL)**
- Diagnoses technical issues
- Implements fixes
- Coordinates with engineers
**Communications Lead (CL)**
- Updates stakeholders
- Manages internal/external communications
- Documents incident timeline
**On-Call Engineer**
- First responder
- Initial triage and investigation
- Escalates to appropriate team
---
## 🔄 Incident Response Workflow
```mermaid
flowchart TD
Start([Incident Detected]) --> Acknowledge[Acknowledge Incident]
Acknowledge --> Assess[Assess Severity]
Assess --> P0{Severity?}
P0 -->|P0/P1| Alert[Page Incident Commander]
P0 -->|P2/P3| Assign[Assign to On-Call]
Alert --> Investigate[Investigate Root Cause]
Assign --> Investigate
Investigate --> Mitigate[Implement Mitigation]
Mitigate --> Verify[Verify Resolution]
Verify --> Resolved{Resolved?}
Resolved -->|No| Escalate[Escalate/Re-assess]
Escalate --> Investigate
Resolved -->|Yes| Communicate[Communicate Resolution]
Communicate --> PostMortem[Schedule Post-Mortem]
PostMortem --> End([Close Incident])
```
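The severity branch in the flowchart could be expressed as a small routing helper, for example inside an alert handler. This is only a sketch: `route_incident` and the action labels it prints are assumptions made for illustration.

```bash
# Hypothetical helper mirroring the "Severity?" decision in the flowchart:
# P0/P1 page the Incident Commander, P2/P3 go to the on-call engineer.
route_incident() {
  case "$1" in
    P0|P1) echo "page-incident-commander" ;;
    P2|P3) echo "assign-to-on-call" ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Usage: action="$(route_incident P2)"
```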
---
## 📋 Incident Response Playbooks
### P0: Database Down
**Symptoms:**
- Backend returns 500 errors
- Cannot connect to database
- Health check fails
**Immediate Actions:**
1. **Verify Issue**
```bash
# -a also lists stopped containers, so an exited database still shows up
docker ps -a | grep mariadb
docker logs lcbp3-mariadb --tail=50
```
2. **Attempt Restart**
```bash
docker restart lcbp3-mariadb
```
3. **Check Database Process**
```bash
# Recent MariaDB images name the server process mariadbd instead of mysqld
docker exec lcbp3-mariadb ps aux | grep -E "mysqld|mariadbd"
```
4. **If Restart Fails:**
```bash
# Check disk space
df -h
# Check database logs for corruption
docker exec lcbp3-mariadb cat /var/log/mysql/error.log
# If corrupted, restore from backup
# See backup-recovery.md
```
5. **Escalate to DBA** if not resolved in 30 minutes
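
The first triage decisions above can be combined into one helper that suggests the next step. This is a sketch under assumptions: the function names and the 90% disk threshold are illustrative, and only the container name `lcbp3-mariadb` comes from this playbook.

```bash
# Succeeds if the named container is currently running.
container_running() {
  [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}

# Prints the usage percentage (digits only) of the filesystem holding the path.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Suggests the next playbook step. The 90% threshold is an assumption.
db_triage() {
  if container_running lcbp3-mariadb; then
    echo "container running: check logs before restarting"
  elif [ "$(disk_usage_pct /)" -ge 90 ] 2>/dev/null; then
    echo "container down, disk nearly full: free space before restart"
  else
    echo "container down: attempt restart"
  fi
}

# Usage: db_triage
```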
---
### P0: Complete System Outage
**Symptoms:**
- All services return 502/503
- Health checks fail
- Users cannot access system
**Immediate Actions:**
1. **Check Container Status**
```bash
docker-compose ps
# Identify which containers are down
```
2. **Restart All Services**
```bash
docker-compose restart
```
3. **Check QNAP Server Resources**
```bash
top
df -h
free -h
```
4. **Check Network**
```bash
ping 8.8.8.8
netstat -tlnp
```
5. **If Server Issue:**
- Reboot QNAP server
- Contact QNAP support
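
To identify quickly which containers are down, the running set can be compared against an expected list. The sketch below assumes the list contains only the container names mentioned in this document; a real deployment probably has more services, and `missing_containers` is a hypothetical helper.

```bash
# Container names taken from the playbooks in this document (assumed incomplete).
EXPECTED_CONTAINERS="lcbp3-mariadb lcbp3-redis lcbp3-backend lcbp3-clamav"

# Prints each expected container that does not appear in `docker ps`.
missing_containers() {
  running="$(docker ps --format '{{.Names}}')"
  for name in $EXPECTED_CONTAINERS; do
    # -x matches the whole line, -F treats the name as a literal string
    echo "$running" | grep -qxF "$name" || echo "$name"
  done
}

# Usage: missing_containers
```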
---
### P1: Authentication System Failing
**Symptoms:**
- Users cannot log in
- JWT validation fails
- 401 errors
**Immediate Actions:**
1. **Check Redis (Session Store)**
```bash
docker exec lcbp3-redis redis-cli ping
# Should return PONG
```
2. **Check JWT Secret Configuration**
```bash
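# Report whether JWT_SECRET is set without printing the secret itself
docker exec lcbp3-backend sh -c 'test -n "$JWT_SECRET" && echo "JWT_SECRET is set" || echo "JWT_SECRET is empty"'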
```
3. **Check Backend Logs**
```bash
# docker logs replays the container's stderr on stderr, so merge it before filtering
docker logs lcbp3-backend --tail=100 2>&1 | grep -E "JWT|Auth"
```
4. **Temporary Mitigation:**
```bash
# Restart backend to reload config
docker restart lcbp3-backend
```
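The Redis and JWT checks can be chained into a single pre-check that stops at the first failure. This is a minimal sketch: the function names are hypothetical, and it assumes the backend image provides `sh`. The secret value is never printed.

```bash
# Succeeds if Redis answers PONG.
redis_ok() {
  [ "$(docker exec lcbp3-redis redis-cli ping 2>/dev/null)" = "PONG" ]
}

# Succeeds if JWT_SECRET is non-empty inside the backend container.
jwt_secret_set() {
  docker exec lcbp3-backend sh -c 'test -n "$JWT_SECRET"'
}

# Runs both checks and reports the first failure.
auth_precheck() {
  redis_ok       || { echo "redis: no PONG"; return 1; }
  jwt_secret_set || { echo "JWT_SECRET is empty or unset"; return 1; }
  echo "redis and JWT_SECRET look healthy"
}

# Usage: auth_precheck || echo "fix the reported item before restarting the backend"
```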
---
### P1: File Upload Failing
**Symptoms:**
- Users cannot upload files
- 500 errors on file upload
- "Disk full" errors
**Immediate Actions:**
1. **Check Disk Space**
```bash
df -h /var/lib/docker/volumes/lcbp3_uploads
```
2. **If Disk Full:**
```bash
# Clean up temp uploads
find /var/lib/docker/volumes/lcbp3_uploads/_data/temp \
-type f -mtime +1 -delete
```
3. **Check ClamAV (Virus Scanner)**
```bash
docker logs lcbp3-clamav --tail=50
docker restart lcbp3-clamav
```
4. **Check File Permissions**
```bash
docker exec lcbp3-backend ls -la /app/uploads
```
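Before running the cleanup with `-delete`, it can help to preview what would be removed. The sketch below lists and counts matching files without deleting anything; `stale_files` and `stale_count` are hypothetical helpers. Note that `-mtime +1` matches files whose age, rounded down to whole days, is greater than 1, which means at least 48 hours old.

```bash
# Prints files under DIR last modified more than DAYS whole days ago (default 1).
stale_files() {
  find "$1" -type f -mtime +"${2:-1}" -print
}

# Prints how many files stale_files would report.
stale_count() {
  stale_files "$1" "${2:-1}" | wc -l | tr -d ' '
}

# Usage: stale_count /var/lib/docker/volumes/lcbp3_uploads/_data/temp
```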
---
### P2: Slow Performance
**Symptoms:**
- Pages load slowly
- API response time > 2s
- Users complain about slowness
**Actions:**
1. **Check System Resources**
```bash
docker stats
# Identify high CPU/memory containers
```
2. **Check Database Performance**
```sql
-- Show running queries; a high Time value indicates a long-running query
SHOW FULL PROCESSLIST;
-- Check the number of open connections
SHOW STATUS LIKE 'Threads_connected';
```
3. **Check Redis**
```bash
docker exec lcbp3-redis redis-cli --stat
```
4. **Check Application Logs**
```bash
docker logs lcbp3-backend 2>&1 | grep "Slow request"
```
5. **Temporary Mitigation:**
- Restart slow containers
- Clear Redis cache if needed
- Kill long-running queries
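
To confirm the "API response time > 2s" symptom objectively, a request can be timed with curl's `time_total` write-out variable and compared against the threshold. This is a sketch: the function names are hypothetical, and the health URL in the usage comment is an assumption, not a documented LCBP3-DMS endpoint.

```bash
# Prints the total request time in seconds for the given URL.
request_seconds() {
  curl -s -o /dev/null -w '%{time_total}' "$1"
}

# Succeeds when the measured time exceeds the limit (default 2 seconds).
is_slow() {
  awk -v t="$1" -v limit="${2:-2}" 'BEGIN { exit !(t + 0 > limit + 0) }'
}

# Usage (hypothetical URL):
#   is_slow "$(request_seconds http://localhost:3000/health)" && echo "slow"
```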
---
### P2: Email Notifications Not Sending
**Symptoms:**
- Users not receiving emails
- Email queue backing up
**Actions:**
1. **Check Email Queue**
```bash
# Access the BullMQ dashboard or check Redis directly
# (BullMQ keeps waiting jobs in the list key "<prefix>:<queue>:wait")
docker exec lcbp3-redis redis-cli LLEN bull:email:wait
```
2. **Check Email Processor Logs**
```bash
docker logs lcbp3-backend 2>&1 | grep -iE "email|SMTP"
```
3. **Test SMTP Connection**
```bash
docker exec lcbp3-backend node -e "
const nodemailer = require('nodemailer');
const transport = nodemailer.createTransport({
host: process.env.SMTP_HOST,
port: process.env.SMTP_PORT,
auth: {
user: process.env.SMTP_USER,
pass: process.env.SMTP_PASS
}
});
transport.verify().then(console.log).catch(console.error);
"
```
4. **Check SMTP Credentials**
- Verify not expired
- Check firewall/network access
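
A backing-up queue can be detected by comparing the waiting-job count against a threshold. The sketch below assumes the default `bull` key prefix and a queue named `email`, where BullMQ stores waiting jobs in the `wait` list; the function names and the threshold of 100 jobs are assumptions for illustration.

```bash
# Prints the number of waiting jobs in the email queue.
email_queue_depth() {
  docker exec lcbp3-redis redis-cli LLEN bull:email:wait
}

# Succeeds when the queue depth exceeds the threshold (default 100, an assumption).
queue_backed_up() {
  depth="$(email_queue_depth)"
  [ "${depth:-0}" -gt "${1:-100}" ]
}

# Usage: queue_backed_up 200 && echo "email queue is backing up"
```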
---
## 📝 Incident Documentation
### Incident Report Template
```markdown
# Incident Report: [Brief Description]
**Incident ID:** INC-YYYYMMDD-001
**Severity:** P1
**Status:** Resolved
**Incident Commander:** [Name]
## Timeline
| Time | Event |
| ----- | --------------------------------------------------------- |
| 14:00 | Alert: High error rate detected |
| 14:05 | On-call engineer acknowledged |
| 14:10 | Identified root cause: Database connection pool exhausted |
| 14:15 | Implemented mitigation: Increased pool size |
| 14:20 | Verified resolution |
| 14:30 | Incident resolved |
## Impact
- **Duration:** 30 minutes
- **Affected Users:** ~50 users
- **Affected Services:** Document creation, Search
- **Data Loss:** None
## Root Cause
Database connection pool was exhausted due to slow queries not releasing connections.
## Resolution
1. Increased connection pool size from 10 to 20
2. Optimized slow queries
3. Added connection pool monitoring
## Action Items
- [ ] Add connection pool size alert (Owner: DevOps, Due: Next Sprint)
- [ ] Implement automatic query timeouts (Owner: Backend, Due: 2025-12-15)
- [ ] Review all queries for optimization (Owner: DBA, Due: 2025-12-31)
## Lessons Learned
- Connection pool monitoring was insufficient
- Need automated remediation for common issues
```
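Two fields in the template are mechanical and can be generated: the incident ID in the `INC-YYYYMMDD-001` format and the duration derived from the timeline. This is a minimal sketch with hypothetical helper names; it assumes both timestamps fall on the same day and that the sequence number is passed without leading zeros.

```bash
# Prints an incident ID such as INC-20251201-001.
# Arguments: date as YYYYMMDD (default: today), sequence number (default: 1).
incident_id() {
  printf 'INC-%s-%03d\n' "${1:-$(date +%Y%m%d)}" "${2:-1}"
}

# Prints the minutes between two HH:MM times on the same day.
duration_minutes() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    split(s, a, ":"); split(e, b, ":")
    print (b[1] * 60 + b[2]) - (a[1] * 60 + a[2])
  }'
}

# Usage: incident_id 20251201 1; duration_minutes 14:00 14:30
```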
---
## 🔍 Post-Incident Review (PIR)
### PIR Meeting Agenda
1. **Timeline Review** (10 min)
- What happened and when?
- What was the impact?
2. **Root Cause Analysis** (15 min)
- Why did it happen?
- What were the contributing factors?
3. **What Went Well** (10 min)
- What did we do right?
- What helped us resolve quickly?
4. **What Went Wrong** (15 min)
- What could we have done better?
- What slowed us down?
5. **Action Items** (10 min)
- What changes will prevent this?
- Who owns each action?
- When will they be completed?
### PIR Best Practices
- **Blameless Culture:** Focus on systems, not individuals
- **Actionable Outcomes:** Every PIR should produce concrete actions
- **Follow Through:** Track action items to completion
- **Share Learnings:** Distribute PIR summary to entire team
---
## 📊 Incident Metrics
### Track & Review Monthly
- **MTTR (Mean Time To Resolution):** Average time to resolve incidents
- **MTBF (Mean Time Between Failures):** Average time between incidents
- **Incident Frequency:** Number of incidents per month
- **Severity Distribution:** Breakdown by P0/P1/P2/P3
- **Repeat Incidents:** Same root cause occurring multiple times
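
MTTR is the arithmetic mean of the resolution times in the review period. As a sketch, assuming resolution times are exported one per line in minutes (the export format and the helper name `mttr_minutes` are assumptions), it could be computed like this:

```bash
# Reads one resolution time in minutes per line on stdin and prints the
# mean rounded to the nearest minute, or "n/a" when there is no input.
mttr_minutes() {
  awk 'NF { sum += $1; n++ }
       END { if (n) printf "%.0f\n", sum / n; else print "n/a" }'
}

# Usage: printf '30\n60\n90\n' | mttr_minutes
```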
---
## ✅ Incident Response Checklist
### During Incident
- [ ] Acknowledge incident in tracking system
- [ ] Assess severity and assign IC
- [ ] Create incident channel (Slack/Teams)
- [ ] Begin documenting timeline
- [ ] Investigate and implement mitigation
- [ ] Communicate status updates every 30 min (P0/P1)
- [ ] Verify resolution
- [ ] Communicate resolution to stakeholders
### After Incident
- [ ] Create incident report
- [ ] Schedule PIR within 48 hours
- [ ] Identify action items
- [ ] Assign owners and deadlines
- [ ] Update runbooks/playbooks
- [ ] Share learnings with team
---
## 🔗 Related Documents
- [Monitoring & Alerting](./monitoring-alerting.md)
- [Backup & Recovery](./backup-recovery.md)
- [Security Operations](./security-operations.md)
---
**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01