# Incident Response Procedures

**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01

---

## 📋 Overview

This document outlines incident classification, response procedures, and post-incident reviews for LCBP3-DMS.

---

## 🚨 Incident Classification

### Severity Levels

| Severity          | Description                  | Response Time     | Examples                                        |
| ----------------- | ---------------------------- | ----------------- | ----------------------------------------------- |
| **P0 - Critical** | Complete system outage       | 15 minutes        | Database down, all services unavailable         |
| **P1 - High**     | Major functionality impaired | 1 hour            | Authentication failing, cannot create documents |
| **P2 - Medium**   | Degraded performance         | 4 hours           | Slow response time, some features broken        |
| **P3 - Low**      | Minor issues                 | Next business day | UI glitch, non-critical bug                     |

---

## 📞 Incident Response Team

### Roles & Responsibilities

**Incident Commander (IC)**

- Coordinates response efforts
- Makes final decisions
- Communicates with stakeholders

**Technical Lead (TL)**

- Diagnoses technical issues
- Implements fixes
- Coordinates with engineers

**Communications Lead (CL)**

- Updates stakeholders
- Manages internal/external communications
- Documents incident timeline

**On-Call Engineer**

- First responder
- Initial triage and investigation
- Escalates to appropriate team

---

## 🔄 Incident Response Workflow

```mermaid
flowchart TD
    Start([Incident Detected]) --> Acknowledge[Acknowledge Incident]
    Acknowledge --> Assess[Assess Severity]
    Assess --> P0{Severity?}

    P0 -->|P0/P1| Alert[Page Incident Commander]
    P0 -->|P2/P3| Assign[Assign to On-Call]

    Alert --> Investigate[Investigate Root Cause]
    Assign --> Investigate

    Investigate --> Mitigate[Implement Mitigation]
    Mitigate --> Verify[Verify Resolution]

    Verify --> Resolved{Resolved?}
    Resolved -->|No| Escalate[Escalate/Re-assess]
    Escalate --> Investigate

    Resolved -->|Yes| Communicate[Communicate Resolution]
    Communicate --> PostMortem[Schedule Post-Mortem]
    PostMortem --> End([Close Incident])
```
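
The routing decision in the flowchart, combined with the response targets from the severity table, can be sketched as a small shell helper. This is an illustrative sketch only: `route_incident` is a hypothetical function name and is not part of any existing LCBP3-DMS tooling.

```bash
# Hypothetical helper: print who handles an incident and the response target.
# Values are copied from the severity table and the flowchart in this document.
route_incident() {
  case "$1" in
    P0) echo "P0: page Incident Commander, respond within 15 minutes" ;;
    P1) echo "P1: page Incident Commander, respond within 1 hour" ;;
    P2) echo "P2: assign to On-Call Engineer, respond within 4 hours" ;;
    P3) echo "P3: assign to On-Call Engineer, respond by next business day" ;;
    *)
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}

# Example: route_incident P1
```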

---

## 📋 Incident Response Playbooks

### P0: Database Down

**Symptoms:**

- Backend returns 500 errors
- Cannot connect to database
- Health check fails

**Immediate Actions:**

1. **Verify Issue**

   ```bash
   # -a includes stopped containers; a crashed database is hidden from plain `docker ps`
   docker ps -a | grep mariadb
   docker logs lcbp3-mariadb --tail=50
   ```

2. **Attempt Restart**

   ```bash
   docker restart lcbp3-mariadb
   ```

3. **Check Database Process**

   ```bash
   # The server process is named mysqld or mariadbd depending on the MariaDB version
   docker exec lcbp3-mariadb ps aux | grep -E 'mysqld|mariadbd'
   ```

4. **If Restart Fails:**

   ```bash
   # Check disk space
   df -h

   # Check database logs for corruption
   # (if this file does not exist, the container logs to stderr: use `docker logs lcbp3-mariadb`)
   docker exec lcbp3-mariadb cat /var/log/mysql/error.log

   # If corrupted, restore from backup
   # See backup-recovery.md
   ```

5. **Escalate to DBA** if not resolved in 30 minutes

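Steps 2 and 3 above (restart, then confirm the container came back) could be combined into one helper along these lines. This is a minimal sketch, not an existing script: `restart_and_verify` is a hypothetical name, the retry count and delay are arbitrary defaults, and "running" only means Docker reports the container process as up, not that MariaDB is accepting connections yet.

```bash
# Hypothetical helper: restart a container, then poll until Docker reports it running.
# Usage: restart_and_verify [container] [attempts] [delay_seconds]
restart_and_verify() {
  local name="${1:-lcbp3-mariadb}"
  local attempts="${2:-10}"
  local delay="${3:-3}"
  local i=1
  local running

  docker restart "$name" >/dev/null || return 1

  while [ "$i" -le "$attempts" ]; do
    # `docker inspect -f` prints "true" once the container process is up
    running="$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)"
    if [ "$running" = "true" ]; then
      echo "OK: $name is running"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done

  # Not running after all checks: continue with step 4 of the playbook
  echo "FAIL: $name is not running after $attempts checks" >&2
  return 1
}
```
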
---

### P0: Complete System Outage

**Symptoms:**

- All services return 502/503
- Health checks fail
- Users cannot access system

**Immediate Actions:**

1. **Check Container Status**

   ```bash
   docker-compose ps
   # Identify which containers are down
   ```

2. **Restart All Services**

   ```bash
   docker-compose restart
   ```

3. **Check QNAP Server Resources**

   ```bash
   top
   df -h
   free -h
   ```

4. **Check Network**

   ```bash
   ping -c 4 8.8.8.8   # send 4 probes, then exit
   netstat -tlnp
   ```

5. **If Server Issue:**
   - Reboot QNAP server
   - Contact QNAP support

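To make step 1 quicker to read during an outage, the container list can be filtered down to the containers that are not up. The sketch below assumes input in the form `name<TAB>status`, as produced by `docker ps -a --format '{{.Names}}\t{{.Status}}'`; `list_down_containers` is a hypothetical helper name.

```bash
# Hypothetical helper: read "name<TAB>status" lines on stdin and print the names
# of containers whose status does not start with "Up" (exited, created, restarting).
list_down_containers() {
  awk -F'\t' '$2 !~ /^Up/ { print $1 }'
}

# Example:
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | list_down_containers
```
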
---

### P1: Authentication System Failing

**Symptoms:**

- Users cannot log in
- JWT validation fails
- 401 errors

**Immediate Actions:**

1. **Check Redis (Session Store)**

   ```bash
   docker exec lcbp3-redis redis-cli ping
   # Should return PONG
   ```

2. **Check JWT Secret Configuration**

   ```bash
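   # Check that JWT_SECRET is set without printing the secret to the terminal
   docker exec lcbp3-backend sh -c '[ -n "$JWT_SECRET" ] && echo "JWT_SECRET is set" || echo "JWT_SECRET is EMPTY"'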
   ```

3. **Check Backend Logs**

   ```bash
   # 2>&1 merges the container's stderr stream so grep can search it too
   docker logs lcbp3-backend --tail=100 2>&1 | grep "JWT\|Auth"
   ```

4. **Temporary Mitigation:**
   ```bash
   # Restart backend to reload config
   docker restart lcbp3-backend
   ```

---

### P1: File Upload Failing

**Symptoms:**

- Users cannot upload files
- 500 errors on file upload
- "Disk full" errors

**Immediate Actions:**

1. **Check Disk Space**

   ```bash
   df -h /var/lib/docker/volumes/lcbp3_uploads
   ```

2. **If Disk Full:**

   ```bash
   # Delete temp uploads older than one day
   # (-mtime +1 ignores fractional days, so files must be at least 48 hours old)
   find /var/lib/docker/volumes/lcbp3_uploads/_data/temp \
     -type f -mtime +1 -delete
   ```

3. **Check ClamAV (Virus Scanner)**

   ```bash
   docker logs lcbp3-clamav --tail=50
   docker restart lcbp3-clamav
   ```

4. **Check File Permissions**
   ```bash
   docker exec lcbp3-backend ls -la /app/uploads
   ```
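
The disk space check in step 1 can be turned into a pass/fail filter so that only filesystems near capacity are shown. This is a sketch under assumptions: `disks_over_threshold` is a hypothetical helper, the 90% default is an arbitrary choice, and it expects POSIX-format `df -P` output with mount points that contain no spaces.

```bash
# Hypothetical helper: read `df -P` output on stdin and print mount points
# whose usage is at or above the given percentage (default 90).
disks_over_threshold() {
  local limit="${1:-90}"
  awk -v limit="$limit" '
    NR > 1 {
      use = $5              # Capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6 " " use "%"
    }
  '
}

# Example:
#   df -P /var/lib/docker/volumes/lcbp3_uploads | disks_over_threshold 90
```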

---

### P2: Slow Performance

**Symptoms:**

- Pages load slowly
- API response time > 2s
- Users complain about slowness

**Actions:**

1. **Check System Resources**

   ```bash
   docker stats
   # Identify high CPU/memory containers
   ```

2. **Check Database Performance**

   ```sql
   -- Show running queries; a high Time value marks a slow one
   SHOW FULL PROCESSLIST;

   -- Check connections
   SHOW STATUS LIKE 'Threads_connected';
   ```

3. **Check Redis**

   ```bash
   docker exec lcbp3-redis redis-cli --stat
   ```

4. **Check Application Logs**

   ```bash
   # 2>&1 merges the container's stderr stream so grep can search it too
   docker logs lcbp3-backend 2>&1 | grep "Slow request"
   ```

5. **Temporary Mitigation:**
   - Restart slow containers
   - Clear Redis cache if needed
   - Kill long-running queries

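Before killing long-running queries, it helps to list only the candidates. The sketch below filters tab-separated `SHOW FULL PROCESSLIST` output, which the `mysql`/`mariadb` command-line client emits when its output is piped. `long_running_queries` is a hypothetical helper, the 30-second default is an arbitrary threshold, and the client invocation in the example (binary name, credentials) is an assumption about this deployment.

```bash
# Hypothetical helper: read tab-separated SHOW FULL PROCESSLIST output on stdin
# (columns: Id, User, Host, db, Command, Time, State, Info) and print
# "Id<TAB>Time<TAB>Info" for non-sleeping threads running at least N seconds.
long_running_queries() {
  local min_seconds="${1:-30}"
  awk -F'\t' -v min="$min_seconds" '
    NR > 1 && $5 != "Sleep" && ($6 + 0) >= (min + 0) {
      print $1 "\t" $6 "s\t" $8
    }
  '
}

# Example (client name and credentials are assumptions):
#   docker exec lcbp3-mariadb mariadb -u root -p"$DB_ROOT_PASSWORD" \
#     -e 'SHOW FULL PROCESSLIST' | long_running_queries 30
# Review the list, then run KILL <Id>; for each query that should be stopped.
```
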
---

### P2: Email Notifications Not Sending

**Symptoms:**

- Users not receiving emails
- Email queue backing up

**Actions:**

1. **Check Email Queue**

   ```bash
   # Access BullMQ dashboard or check Redis
   # (BullMQ keeps waiting jobs in the list bull:<queue>:wait)
   docker exec lcbp3-redis redis-cli LLEN bull:email:wait
   ```

2. **Check Email Processor Logs**

   ```bash
   # 2>&1 merges the container's stderr stream; -i matches "Email", "smtp", etc.
   docker logs lcbp3-backend 2>&1 | grep -i "email\|SMTP"
   ```

3. **Test SMTP Connection**

   ```bash
   docker exec lcbp3-backend node -e "
   const nodemailer = require('nodemailer');
   const transport = nodemailer.createTransport({
     host: process.env.SMTP_HOST,
     port: process.env.SMTP_PORT,
     auth: {
       user: process.env.SMTP_USER,
       pass: process.env.SMTP_PASS
     }
   });
   transport.verify().then(console.log).catch(console.error);
   "
   ```

4. **Check SMTP Credentials**
   - Verify not expired
   - Check firewall/network access

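The queue check in step 1 can be wrapped so that it reports whether the backlog exceeds a threshold. This is a sketch under assumptions: `email_queue_backlog` is a hypothetical helper, the threshold of 100 is arbitrary, the queue is assumed to be named `email`, and the key name assumes BullMQ's default layout, in which waiting jobs live in the Redis list `bull:<queue>:wait`.

```bash
# Hypothetical helper: report whether the waiting list of a BullMQ queue
# is longer than a threshold. Returns 0 if OK, 1 on backlog, 2 on error.
# Usage: email_queue_backlog [queue] [threshold]
email_queue_backlog() {
  local queue="${1:-email}"
  local threshold="${2:-100}"
  local depth

  depth="$(docker exec lcbp3-redis redis-cli LLEN "bull:${queue}:wait")" || return 2

  # Guard against non-numeric replies (for example a Redis error message)
  case "$depth" in
    '' | *[!0-9]*)
      echo "unexpected reply from Redis: $depth" >&2
      return 2
      ;;
  esac

  if [ "$depth" -gt "$threshold" ]; then
    echo "BACKLOG: $depth jobs waiting in $queue (threshold $threshold)"
    return 1
  fi
  echo "OK: $depth jobs waiting in $queue"
}
```
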
---

## 📝 Incident Documentation

### Incident Report Template

```markdown
# Incident Report: [Brief Description]

**Incident ID:** INC-YYYYMMDD-001
**Severity:** P1
**Status:** Resolved
**Incident Commander:** [Name]

## Timeline

| Time  | Event                                                     |
| ----- | --------------------------------------------------------- |
| 14:00 | Alert: High error rate detected                           |
| 14:05 | On-call engineer acknowledged                             |
| 14:10 | Identified root cause: Database connection pool exhausted |
| 14:15 | Implemented mitigation: Increased pool size               |
| 14:20 | Verified resolution                                       |
| 14:30 | Incident resolved                                         |

## Impact

- **Duration:** 30 minutes
- **Affected Users:** ~50 users
- **Affected Services:** Document creation, Search
- **Data Loss:** None

## Root Cause

Database connection pool was exhausted due to slow queries not releasing connections.

## Resolution

1. Increased connection pool size from 10 to 20
2. Optimized slow queries
3. Added connection pool monitoring

## Action Items

- [ ] Add connection pool size alert (Owner: DevOps, Due: Next Sprint)
- [ ] Implement automatic query timeouts (Owner: Backend, Due: 2025-12-15)
- [ ] Review all queries for optimization (Owner: DBA, Due: 2025-12-31)

## Lessons Learned

- Connection pool monitoring was insufficient
- Need automated remediation for common issues
```
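
The template's `INC-YYYYMMDD-001` identifier can be generated consistently with a small helper. This is only a sketch: `incident_id` is a hypothetical function, and the rule that the trailing number is a per-day sequence is an assumption inferred from the template, not a documented convention.

```bash
# Hypothetical helper: build an incident ID in the INC-YYYYMMDD-NNN format.
# Usage: incident_id [YYYYMMDD] [sequence]
# Pass the sequence as a plain decimal (1, 12), not zero-padded, because
# printf would interpret a leading zero as octal.
incident_id() {
  local day="${1:-$(date +%Y%m%d)}"
  local seq="${2:-1}"
  printf 'INC-%s-%03d\n' "$day" "$seq"
}

# Example: incident_id 20251201 1
```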

---

## 🔍 Post-Incident Review (PIR)

### PIR Meeting Agenda

1. **Timeline Review** (10 min)

   - What happened and when?
   - What was the impact?

2. **Root Cause Analysis** (15 min)

   - Why did it happen?
   - What were the contributing factors?

3. **What Went Well** (10 min)

   - What did we do right?
   - What helped us resolve quickly?

4. **What Went Wrong** (15 min)

   - What could we have done better?
   - What slowed us down?

5. **Action Items** (10 min)
   - What changes will prevent this?
   - Who owns each action?
   - When will they be completed?

### PIR Best Practices

- **Blameless Culture:** Focus on systems, not individuals
- **Actionable Outcomes:** Every PIR should produce concrete actions
- **Follow Through:** Track action items to completion
- **Share Learnings:** Distribute PIR summary to entire team

---

## 📊 Incident Metrics

### Track & Review Monthly

- **MTTR (Mean Time To Resolution):** Average time to resolve incidents
- **MTBF (Mean Time Between Failures):** Average time between incidents
- **Incident Frequency:** Number of incidents per month
- **Severity Distribution:** Breakdown by P0/P1/P2/P3
- **Repeat Incidents:** Same root cause occurring multiple times

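As a worked example of the monthly review, MTTR and the severity distribution can be computed from a simple export of resolved incidents. The sketch assumes a hypothetical CSV with a header row and the columns `id,severity,minutes_to_resolve`; neither the file format nor the `incident_metrics` helper exists in the project today.

```bash
# Hypothetical helper: read "id,severity,minutes_to_resolve" CSV (with header)
# on stdin; print the incident count, MTTR in minutes, and a count per severity.
incident_metrics() {
  awk -F',' '
    NR > 1 { total += $3; count++; by_sev[$2]++ }
    END {
      if (count == 0) { print "no incidents"; exit }
      printf "incidents=%d mttr_minutes=%.1f\n", count, total / count
      n = split("P0 P1 P2 P3", sev, " ")
      for (i = 1; i <= n; i++) printf "%s=%d\n", sev[i], by_sev[sev[i]]
    }
  '
}

# Example: incident_metrics < incidents-2025-12.csv
```
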
---

## ✅ Incident Response Checklist

### During Incident

- [ ] Acknowledge incident in tracking system
- [ ] Assess severity and assign IC
- [ ] Create incident channel (Slack/Teams)
- [ ] Begin documenting timeline
- [ ] Investigate and implement mitigation
- [ ] Communicate status updates every 30 min (P0/P1)
- [ ] Verify resolution
- [ ] Communicate resolution to stakeholders

### After Incident

- [ ] Create incident report
- [ ] Schedule PIR within 48 hours
- [ ] Identify action items
- [ ] Assign owners and deadlines
- [ ] Update runbooks/playbooks
- [ ] Share learnings with team

---

## 🔗 Related Documents

- [Monitoring & Alerting](./monitoring-alerting.md)
- [Backup & Recovery](./backup-recovery.md)
- [Security Operations](./security-operations.md)

---

**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01