484 lines
9.9 KiB
Markdown
484 lines
9.9 KiB
Markdown
# Incident Response Procedures
|
|
|
|
**Project:** LCBP3-DMS
|
|
**Version:** 1.5.0
|
|
**Last Updated:** 2025-12-01
|
|
|
|
---
|
|
|
|
## 📋 Overview
|
|
|
|
This document outlines incident classification, response procedures, and post-incident reviews for LCBP3-DMS.
|
|
|
|
---
|
|
|
|
## 🚨 Incident Classification
|
|
|
|
### Severity Levels
|
|
|
|
| Severity | Description | Response Time | Examples |
|
|
| ----------------- | ---------------------------- | ----------------- | ----------------------------------------------- |
|
|
| **P0 - Critical** | Complete system outage | 15 minutes | Database down, All services unavailable |
|
|
| **P1 - High** | Major functionality impaired | 1 hour | Authentication failing, Cannot create documents |
|
|
| **P2 - Medium** | Degraded performance | 4 hours | Slow response time, Some features broken |
|
|
| **P3 - Low** | Minor issues | Next business day | UI glitch, Non-critical bug |
|
|
|
|
---
|
|
|
|
## 📞 Incident Response Team
|
|
|
|
### Roles & Responsibilities
|
|
|
|
**Incident Commander (IC)**
|
|
|
|
- Coordinates response efforts
|
|
- Makes final decisions
|
|
- Communicates with stakeholders
|
|
|
|
**Technical Lead (TL)**
|
|
|
|
- Diagnoses technical issues
|
|
- Implements fixes
|
|
- Coordinates with engineers
|
|
|
|
**Communications Lead (CL)**
|
|
|
|
- Updates stakeholders
|
|
- Manages internal/external communications
|
|
- Documents incident timeline
|
|
|
|
**On-Call Engineer**
|
|
|
|
- First responder
|
|
- Initial triage and investigation
|
|
- Escalates to appropriate team
|
|
|
|
---
|
|
|
|
## 🔄 Incident Response Workflow
|
|
|
|
```mermaid
|
|
flowchart TD
|
|
Start([Incident Detected]) --> Acknowledge[Acknowledge Incident]
|
|
Acknowledge --> Assess[Assess Severity]
|
|
Assess --> P0{Severity?}
|
|
|
|
P0 -->|P0/P1| Alert[Page Incident Commander]
|
|
P0 -->|P2/P3| Assign[Assign to On-Call]
|
|
|
|
Alert --> Investigate[Investigate Root Cause]
|
|
Assign --> Investigate
|
|
|
|
Investigate --> Mitigate[Implement Mitigation]
|
|
Mitigate --> Verify[Verify Resolution]
|
|
|
|
Verify --> Resolved{Resolved?}
|
|
Resolved -->|No| Escalate[Escalate/Re-assess]
|
|
Escalate --> Investigate
|
|
|
|
Resolved -->|Yes| Communicate[Communicate Resolution]
|
|
Communicate --> PostMortem[Schedule Post-Mortem]
|
|
PostMortem --> End([Close Incident])
|
|
```
|
|
|
|
---
|
|
|
|
## 📋 Incident Response Playbooks
|
|
|
|
### P0: Database Down
|
|
|
|
**Symptoms:**
|
|
|
|
- Backend returns 500 errors
|
|
- Cannot connect to database
|
|
- Health check fails
|
|
|
|
**Immediate Actions:**
|
|
|
|
1. **Verify Issue**
|
|
|
|
```bash
|
|
docker ps | grep mariadb
|
|
docker logs lcbp3-mariadb --tail=50
|
|
```
|
|
|
|
2. **Attempt Restart**
|
|
|
|
```bash
|
|
docker restart lcbp3-mariadb
|
|
```
|
|
|
|
3. **Check Database Process**
|
|
|
|
```bash
|
|
docker exec lcbp3-mariadb ps aux | grep mysql
|
|
```
|
|
|
|
4. **If Restart Fails:**
|
|
|
|
```bash
|
|
# Check disk space
|
|
df -h
|
|
|
|
# Check database logs for corruption
|
|
docker exec lcbp3-mariadb cat /var/log/mysql/error.log
|
|
|
|
# If corrupted, restore from backup
|
|
# See backup-recovery.md
|
|
```
|
|
|
|
5. **Escalate to DBA** if not resolved in 30 minutes
|
|
|
|
---
|
|
|
|
### P0: Complete System Outage
|
|
|
|
**Symptoms:**
|
|
|
|
- All services return 502/503
|
|
- Health checks fail
|
|
- Users cannot access system
|
|
|
|
**Immediate Actions:**
|
|
|
|
1. **Check Container Status**
|
|
|
|
```bash
|
|
docker-compose ps
|
|
# Identify which containers are down
|
|
```
|
|
|
|
2. **Restart All Services**
|
|
|
|
```bash
|
|
docker-compose restart
|
|
```
|
|
|
|
3. **Check QNAP Server Resources**
|
|
|
|
```bash
|
|
top
|
|
df -h
|
|
free -h
|
|
```
|
|
|
|
4. **Check Network**
|
|
|
|
```bash
|
|
ping 8.8.8.8
|
|
netstat -tlnp
|
|
```
|
|
|
|
5. **If Server Issue:**
|
|
- Reboot QNAP server
|
|
- Contact QNAP support
|
|
|
|
---
|
|
|
|
### P1: Authentication System Failing
|
|
|
|
**Symptoms:**
|
|
|
|
- Users cannot log in
|
|
- JWT validation fails
|
|
- 401 errors
|
|
|
|
**Immediate Actions:**
|
|
|
|
1. **Check Redis (Session Store)**
|
|
|
|
```bash
|
|
docker exec lcbp3-redis redis-cli ping
|
|
# Should return PONG
|
|
```
|
|
|
|
2. **Check JWT Secret Configuration**
|
|
|
|
```bash
|
|
docker exec lcbp3-backend env | grep JWT_SECRET
|
|
# Verify not empty
|
|
```
|
|
|
|
3. **Check Backend Logs**
|
|
|
|
```bash
|
|
docker logs lcbp3-backend --tail=100 | grep "JWT\|Auth"
|
|
```
|
|
|
|
4. **Temporary Mitigation:**
|
|
```bash
|
|
# Restart backend to reload config
|
|
docker restart lcbp3-backend
|
|
```
|
|
|
|
---
|
|
|
|
### P1: File Upload Failing
|
|
|
|
**Symptoms:**
|
|
|
|
- Users cannot upload files
|
|
- 500 errors on file upload
|
|
- "Disk full" errors
|
|
|
|
**Immediate Actions:**
|
|
|
|
1. **Check Disk Space**
|
|
|
|
```bash
|
|
df -h /var/lib/docker/volumes/lcbp3_uploads
|
|
```
|
|
|
|
2. **If Disk Full:**
|
|
|
|
```bash
|
|
# Clean up temp uploads
|
|
find /var/lib/docker/volumes/lcbp3_uploads/_data/temp \
|
|
-type f -mtime +1 -delete
|
|
```
|
|
|
|
3. **Check ClamAV (Virus Scanner)**
|
|
|
|
```bash
|
|
docker logs lcbp3-clamav --tail=50
|
|
docker restart lcbp3-clamav
|
|
```
|
|
|
|
4. **Check File Permissions**
|
|
```bash
|
|
docker exec lcbp3-backend ls -la /app/uploads
|
|
```
|
|
|
|
---
|
|
|
|
### P2: Slow Performance
|
|
|
|
**Symptoms:**
|
|
|
|
- Pages load slowly
|
|
- API response time > 2s
|
|
- Users complain about slowness
|
|
|
|
**Actions:**
|
|
|
|
1. **Check System Resources**
|
|
|
|
```bash
|
|
docker stats
|
|
# Identify high CPU/memory containers
|
|
```
|
|
|
|
2. **Check Database Performance**
|
|
|
|
```sql
|
|
-- Show slow queries
|
|
SHOW PROCESSLIST;
|
|
|
|
-- Check connections
|
|
SHOW STATUS LIKE 'Threads_connected';
|
|
```
|
|
|
|
3. **Check Redis**
|
|
|
|
```bash
|
|
docker exec lcbp3-redis redis-cli --stat
|
|
```
|
|
|
|
4. **Check Application Logs**
|
|
|
|
```bash
|
|
docker logs lcbp3-backend | grep "Slow request"
|
|
```
|
|
|
|
5. **Temporary Mitigation:**
|
|
- Restart slow containers
|
|
- Clear Redis cache if needed
|
|
- Kill long-running queries
|
|
|
|
---
|
|
|
|
### P2: Email Notifications Not Sending
|
|
|
|
**Symptoms:**
|
|
|
|
- Users not receiving emails
|
|
- Email queue backing up
|
|
|
|
**Actions:**
|
|
|
|
1. **Check Email Queue**
|
|
|
|
```bash
|
|
# Access BullMQ dashboard or check Redis
|
|
docker exec lcbp3-redis redis-cli LLEN bull:email:waiting
|
|
```
|
|
|
|
2. **Check Email Processor Logs**
|
|
|
|
```bash
|
|
docker logs lcbp3-backend | grep "email\|SMTP"
|
|
```
|
|
|
|
3. **Test SMTP Connection**
|
|
|
|
```bash
|
|
docker exec lcbp3-backend node -e "
|
|
const nodemailer = require('nodemailer');
|
|
const transport = nodemailer.createTransport({
|
|
host: process.env.SMTP_HOST,
|
|
port: process.env.SMTP_PORT,
|
|
auth: {
|
|
user: process.env.SMTP_USER,
|
|
pass: process.env.SMTP_PASS
|
|
}
|
|
});
|
|
transport.verify().then(console.log).catch(console.error);
|
|
"
|
|
```
|
|
|
|
4. **Check SMTP Credentials**
|
|
- Verify not expired
|
|
- Check firewall/network access
|
|
|
|
---
|
|
|
|
## 📝 Incident Documentation
|
|
|
|
### Incident Report Template
|
|
|
|
```markdown
|
|
# Incident Report: [Brief Description]
|
|
|
|
**Incident ID:** INC-YYYYMMDD-001
|
|
**Severity:** P1
|
|
**Status:** Resolved
|
|
**Incident Commander:** [Name]
|
|
|
|
## Timeline
|
|
|
|
| Time | Event |
|
|
| ----- | --------------------------------------------------------- |
|
|
| 14:00 | Alert: High error rate detected |
|
|
| 14:05 | On-call engineer acknowledged |
|
|
| 14:10 | Identified root cause: Database connection pool exhausted |
|
|
| 14:15 | Implemented mitigation: Increased pool size |
|
|
| 14:20 | Verified resolution |
|
|
| 14:30 | Incident resolved |
|
|
|
|
## Impact
|
|
|
|
- **Duration:** 30 minutes
|
|
- **Affected Users:** ~50 users
|
|
- **Affected Services:** Document creation, Search
|
|
- **Data Loss:** None
|
|
|
|
## Root Cause
|
|
|
|
Database connection pool was exhausted due to slow queries not releasing connections.
|
|
|
|
## Resolution
|
|
|
|
1. Increased connection pool size from 10 to 20
|
|
2. Optimized slow queries
|
|
3. Added connection pool monitoring
|
|
|
|
## Action Items
|
|
|
|
- [ ] Add connection pool size alert (Owner: DevOps, Due: Next Sprint)
|
|
- [ ] Implement automatic query timeouts (Owner: Backend, Due: 2025-12-15)
|
|
- [ ] Review all queries for optimization (Owner: DBA, Due: 2025-12-31)
|
|
|
|
## Lessons Learned
|
|
|
|
- Connection pool monitoring was insufficient
|
|
- Need automated remediation for common issues
|
|
```
|
|
|
|
---
|
|
|
|
## 🔍 Post-Incident Review (PIR)
|
|
|
|
### PIR Meeting Agenda
|
|
|
|
1. **Timeline Review** (10 min)
|
|
|
|
- What happened and when?
|
|
- What was the impact?
|
|
|
|
2. **Root Cause Analysis** (15 min)
|
|
|
|
- Why did it happen?
|
|
- What were the contributing factors?
|
|
|
|
3. **What Went Well** (10 min)
|
|
|
|
- What did we do right?
|
|
- What helped us resolve quickly?
|
|
|
|
4. **What Went Wrong** (15 min)
|
|
|
|
- What could we have done better?
|
|
- What slowed us down?
|
|
|
|
5. **Action Items** (10 min)
|
|
- What changes will prevent this?
|
|
- Who owns each action?
|
|
- When will they be completed?
|
|
|
|
### PIR Best Practices
|
|
|
|
- **Blameless Culture:** Focus on systems, not individuals
|
|
- **Actionable Outcomes:** Every PIR should produce concrete actions
|
|
- **Follow Through:** Track action items to completion
|
|
- **Share Learnings:** Distribute PIR summary to entire team
|
|
|
|
---
|
|
|
|
## 📊 Incident Metrics
|
|
|
|
### Track & Review Monthly
|
|
|
|
- **MTTR (Mean Time To Resolution):** Average time to resolve incidents
|
|
- **MTBF (Mean Time Between Failures):** Average time between incidents
|
|
- **Incident Frequency:** Number of incidents per month
|
|
- **Severity Distribution:** Breakdown by P0/P1/P2/P3
|
|
- **Repeat Incidents:** Same root cause occurring multiple times
|
|
|
|
---
|
|
|
|
## ✅ Incident Response Checklist
|
|
|
|
### During Incident
|
|
|
|
- [ ] Acknowledge incident in tracking system
|
|
- [ ] Assess severity and assign IC
|
|
- [ ] Create incident channel (Slack/Teams)
|
|
- [ ] Begin documenting timeline
|
|
- [ ] Investigate and implement mitigation
|
|
- [ ] Communicate status updates every 30 min (P0/P1)
|
|
- [ ] Verify resolution
|
|
- [ ] Communicate resolution to stakeholders
|
|
|
|
### After Incident
|
|
|
|
- [ ] Create incident report
|
|
- [ ] Schedule PIR within 48 hours
|
|
- [ ] Identify action items
|
|
- [ ] Assign owners and deadlines
|
|
- [ ] Update runbooks/playbooks
|
|
- [ ] Share learnings with team
|
|
|
|
---
|
|
|
|
## 🔗 Related Documents
|
|
|
|
- [Monitoring & Alerting](./monitoring-alerting.md)
|
|
- [Backup & Recovery](./backup-recovery.md)
|
|
- [Security Operations](./security-operations.md)
|
|
|
|
---
|
|
|
|
**Version:** 1.5.0
|
|
**Last Review:** 2025-12-01
|
|
**Next Review:** 2026-03-01
|