Incident Response Procedures

Project: LCBP3-DMS Version: 1.6.0 Last Updated: 2025-12-02


📋 Overview

This document outlines incident classification, response procedures, and post-incident reviews for LCBP3-DMS.


🚨 Incident Classification

Severity Levels

| Severity      | Description                  | Response Time     | Examples                                         |
| ------------- | ---------------------------- | ----------------- | ------------------------------------------------ |
| P0 - Critical | Complete system outage       | 15 minutes        | Database down, all services unavailable          |
| P1 - High     | Major functionality impaired | 1 hour            | Authentication failing, cannot create documents  |
| P2 - Medium   | Degraded performance         | 4 hours           | Slow response time, some features broken         |
| P3 - Low      | Minor issues                 | Next business day | UI glitch, non-critical bug                      |

📞 Incident Response Team

Roles & Responsibilities

Incident Commander (IC)

  • Coordinates response efforts
  • Makes final decisions
  • Communicates with stakeholders

Technical Lead (TL)

  • Diagnoses technical issues
  • Implements fixes
  • Coordinates with engineers

Communications Lead (CL)

  • Updates stakeholders
  • Manages internal/external communications
  • Documents incident timeline

On-Call Engineer

  • First responder
  • Initial triage and investigation
  • Escalates to appropriate team

🔄 Incident Response Workflow

flowchart TD
    Start([Incident Detected]) --> Acknowledge[Acknowledge Incident]
    Acknowledge --> Assess[Assess Severity]
    Assess --> P0{Severity?}

    P0 -->|P0/P1| Alert[Page Incident Commander]
    P0 -->|P2/P3| Assign[Assign to On-Call]

    Alert --> Investigate[Investigate Root Cause]
    Assign --> Investigate

    Investigate --> Mitigate[Implement Mitigation]
    Mitigate --> Verify[Verify Resolution]

    Verify --> Resolved{Resolved?}
    Resolved -->|No| Escalate[Escalate/Re-assess]
    Escalate --> Investigate

    Resolved -->|Yes| Communicate[Communicate Resolution]
    Communicate --> PostMortem[Schedule Post-Mortem]
    PostMortem --> End([Close Incident])

📋 Incident Response Playbooks

P0: Database Down

Symptoms:

  • Backend returns 500 errors
  • Cannot connect to database
  • Health check fails

Immediate Actions:

  1. Verify Issue

    docker ps | grep mariadb
    docker logs lcbp3-mariadb --tail=50
    
  2. Attempt Restart

    docker restart lcbp3-mariadb
    
  3. Check Database Process

    docker exec lcbp3-mariadb ps aux | grep mysql
    
  4. If Restart Fails:

    # Check disk space
    df -h
    
    # Check database logs for corruption
    docker exec lcbp3-mariadb cat /var/log/mysql/error.log
    
    # If corrupted, restore from backup
    # See backup-recovery.md
    
  5. Escalate to DBA if not resolved in 30 minutes
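If the container is running but the backend still reports connection errors, a direct connectivity check separates a dead server process from a networking or credentials problem. A minimal sketch, assuming the official MariaDB image conventions (root password exposed to the container as MARIADB_ROOT_PASSWORD) and that the backend carries its database host/port as DB_HOST/DB_PORT; adjust names to the actual environment:

    # Ping the database server from inside its own container
    # (mariadb-admin ships with MariaDB 10.4+; older images use mysqladmin)
    docker exec lcbp3-mariadb sh -c 'mariadb-admin ping -uroot -p"$MARIADB_ROOT_PASSWORD"'
    # Expected output: "mysqld is alive"

    # Confirm the backend can reach the database over the Docker network
    # (requires netcat in the backend image)
    docker exec lcbp3-backend sh -c 'nc -zv "$DB_HOST" "${DB_PORT:-3306}"'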


P0: Complete System Outage

Symptoms:

  • All services return 502/503
  • Health checks fail
  • Users cannot access system

Immediate Actions:

  1. Check Container Status

    docker-compose ps
    # Identify which containers are down
    
  2. Restart All Services

    docker-compose restart
    
  3. Check QNAP Server Resources

    top
    df -h
    free -h
    
  4. Check Network

    ping 8.8.8.8
    netstat -tlnp
    
  5. If Server Issue:

    • Reboot QNAP server
    • Contact QNAP support
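Once containers are back up, confirming that each service actually answers is faster than waiting for user reports. A minimal sketch, assuming each service exposes an HTTP health endpoint; the ports and /health paths below are illustrative and should match the actual compose configuration:

    # Probe each health endpoint and print the HTTP status code
    for url in http://localhost:3000/health http://localhost:8080/health; do
      printf '%s -> ' "$url"
      curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 "$url"
    done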

P1: Authentication System Failing

Symptoms:

  • Users cannot log in
  • JWT validation fails
  • 401 errors

Immediate Actions:

  1. Check Redis (Session Store)

    docker exec lcbp3-redis redis-cli ping
    # Should return PONG
    
  2. Check JWT Secret Configuration

    docker exec lcbp3-backend env | grep JWT_SECRET
    # Verify not empty
    
  3. Check Backend Logs

    docker logs lcbp3-backend --tail=100 | grep "JWT\|Auth"
    
  4. Temporary Mitigation:

    # Restart backend to reload config
    docker restart lcbp3-backend
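
If Redis answers PONG and JWT_SECRET is set but logins still fail, it can help to confirm that sessions are actually being written. A minimal sketch, assuming sessions are stored in Redis under a sess: prefix (the prefix is an assumption; adjust to whatever the backend actually uses):

    # List a few session keys without blocking Redis (SCAN rather than KEYS)
    docker exec lcbp3-redis redis-cli --scan --pattern 'sess:*' | head -n 5

    # Overview of databases and key counts
    docker exec lcbp3-redis redis-cli INFO keyspace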
    

P1: File Upload Failing

Symptoms:

  • Users cannot upload files
  • 500 errors on file upload
  • "Disk full" errors

Immediate Actions:

  1. Check Disk Space

    df -h /var/lib/docker/volumes/lcbp3_uploads
    
  2. If Disk Full:

    # Clean up temp uploads
    find /var/lib/docker/volumes/lcbp3_uploads/_data/temp \
      -type f -mtime +1 -delete
    
  3. Check ClamAV (Virus Scanner)

    docker logs lcbp3-clamav --tail=50
    docker restart lcbp3-clamav
    
  4. Check File Permissions

    docker exec lcbp3-backend ls -la /app/uploads
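
If uploads still fail after restarting ClamAV, the scanner daemon can be probed directly. A minimal sketch using the clamd PING/PONG protocol command, assuming clamd listens on its default TCP port 3310 and that netcat is available in the container (both assumptions):

    # Ask clamd directly whether it is accepting connections
    docker exec lcbp3-clamav sh -c 'echo PING | nc -w 3 localhost 3310'
    # Expected reply: PONG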
    

P2: Slow Performance

Symptoms:

  • Pages load slowly
  • API response time > 2s
  • Users complain about slowness

Actions:

  1. Check System Resources

    docker stats
    # Identify high CPU/memory containers
    
  2. Check Database Performance

    -- Show currently running queries; long values in the Time column indicate slow queries
    SHOW PROCESSLIST;
    
    -- Check connections
    SHOW STATUS LIKE 'Threads_connected';
    
  3. Check Redis

    docker exec lcbp3-redis redis-cli --stat
    
  4. Check Application Logs

    docker logs lcbp3-backend | grep "Slow request"
    
  5. Temporary Mitigation:

    • Restart slow containers
    • Clear Redis cache if needed
    • Kill long-running queries
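
Long-running queries can be identified and, after review, terminated from the command line. A minimal sketch, assuming root credentials are available inside the MariaDB container as MARIADB_ROOT_PASSWORD and that the mariadb client is present (use mysql on older images); substitute the actual credentials:

    # List queries that have been running for more than 60 seconds
    docker exec lcbp3-mariadb sh -c \
      'mariadb -uroot -p"$MARIADB_ROOT_PASSWORD" -e "
        SELECT id, user, time, LEFT(info, 80) AS query
          FROM information_schema.processlist
         WHERE info IS NOT NULL AND time > 60
         ORDER BY time DESC;"'

    # After reviewing the output, terminate a specific offender by its id:
    #   docker exec lcbp3-mariadb sh -c 'mariadb -uroot -p"$MARIADB_ROOT_PASSWORD" -e "KILL <id>;"'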

P2: Email Notifications Not Sending

Symptoms:

  • Users not receiving emails
  • Email queue backing up

Actions:

  1. Check Email Queue

    # Access the BullMQ dashboard or check the queue length in Redis
    # (depending on the BullMQ version, the waiting-list key may be bull:email:wait)
    docker exec lcbp3-redis redis-cli LLEN bull:email:waiting
    
  2. Check Email Processor Logs

    docker logs lcbp3-backend | grep "email\|SMTP"
    
  3. Test SMTP Connection

    docker exec lcbp3-backend node -e "
    const nodemailer = require('nodemailer');
    const transport = nodemailer.createTransport({
      host: process.env.SMTP_HOST,
      port: process.env.SMTP_PORT,
      auth: {
        user: process.env.SMTP_USER,
        pass: process.env.SMTP_PASS
      }
    });
    transport.verify().then(console.log).catch(console.error);
    "
    
  4. Check SMTP Credentials

    • Verify not expired
    • Check firewall/network access
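
If the SMTP transport verifies but mail still does not go out, inspecting the queue keys in Redis can show whether jobs are piling up or failing. A minimal sketch, assuming the queue is named email and uses the default bull: key prefix (both assumptions):

    # List the queue's keys (wait, active, failed, delayed, ...)
    docker exec lcbp3-redis redis-cli --scan --pattern 'bull:email:*'

    # Count failed jobs (stored as a sorted set in BullMQ)
    docker exec lcbp3-redis redis-cli ZCARD bull:email:failed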

📝 Incident Documentation

Incident Report Template

# Incident Report: [Brief Description]

**Incident ID:** INC-YYYYMMDD-001
**Severity:** P1
**Status:** Resolved
**Incident Commander:** [Name]

## Timeline

| Time  | Event                                                     |
| ----- | --------------------------------------------------------- |
| 14:00 | Alert: High error rate detected                           |
| 14:05 | On-call engineer acknowledged                             |
| 14:10 | Identified root cause: Database connection pool exhausted |
| 14:15 | Implemented mitigation: Increased pool size               |
| 14:20 | Verified resolution                                       |
| 14:30 | Incident resolved                                         |

## Impact

- **Duration:** 30 minutes
- **Affected Users:** ~50 users
- **Affected Services:** Document creation, Search
- **Data Loss:** None

## Root Cause

Database connection pool was exhausted due to slow queries not releasing connections.

## Resolution

1. Increased connection pool size from 10 to 20
2. Optimized slow queries
3. Added connection pool monitoring

## Action Items

- [ ] Add connection pool size alert (Owner: DevOps, Due: Next Sprint)
- [ ] Implement automatic query timeouts (Owner: Backend, Due: 2025-12-15)
- [ ] Review all queries for optimization (Owner: DBA, Due: 2025-12-31)

## Lessons Learned

- Connection pool monitoring was insufficient
- Need automated remediation for common issues

🔍 Post-Incident Review (PIR)

PIR Meeting Agenda

  1. Timeline Review (10 min)

    • What happened and when?
    • What was the impact?
  2. Root Cause Analysis (15 min)

    • Why did it happen?
    • What were the contributing factors?
  3. What Went Well (10 min)

    • What did we do right?
    • What helped us resolve quickly?
  4. What Went Wrong (15 min)

    • What could we have done better?
    • What slowed us down?
  5. Action Items (10 min)

    • What changes will prevent this?
    • Who owns each action?
    • When will they be completed?

PIR Best Practices

  • Blameless Culture: Focus on systems, not individuals
  • Actionable Outcomes: Every PIR should produce concrete actions
  • Follow Through: Track action items to completion
  • Share Learnings: Distribute PIR summary to entire team

📊 Incident Metrics

Track & Review Monthly

  • MTTR (Mean Time To Resolution): Average time to resolve incidents
  • MTBF (Mean Time Between Failures): Average time between incidents
  • Incident Frequency: Number of incidents per month
  • Severity Distribution: Breakdown by P0/P1/P2/P3
  • Repeat Incidents: Same root cause occurring multiple times
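
These metrics can be computed directly from the incident reports. A minimal worked example, assuming incidents are exported to a CSV with a header row and the duration in minutes in the third column (the file name and column position are illustrative):

    # MTTR in minutes: average of the duration column across all incidents
    awk -F',' 'NR > 1 { total += $3; n++ } END { if (n) printf "MTTR: %.1f min over %d incidents\n", total / n, n }' incidents.csv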

Incident Response Checklist

During Incident

  • Acknowledge incident in tracking system
  • Assess severity and assign IC
  • Create incident channel (Slack/Teams)
  • Begin documenting timeline
  • Investigate and implement mitigation
  • Communicate status updates every 30 min (P0/P1)
  • Verify resolution
  • Communicate resolution to stakeholders

After Incident

  • Create incident report
  • Schedule PIR within 48 hours
  • Identify action items
  • Assign owners and deadlines
  • Update runbooks/playbooks
  • Share learnings with team


Version: 1.6.0 Last Review: 2025-12-01 Next Review: 2026-03-01