Main: revise specs to 1.5.0 (completed)
190
specs/04-operations/README.md
Normal file
@@ -0,0 +1,190 @@
# Operations Documentation
|
||||
|
||||
**Project:** LCBP3-DMS (Laem Chabang Port Phase 3 - Document Management System)
|
||||
**Version:** 1.5.0
|
||||
**Last Updated:** 2025-12-01
|
||||
|
||||
---
|
||||
|
||||
## 📋 Overview
|
||||
|
||||
This directory contains operational documentation for deploying, maintaining, and monitoring the LCBP3-DMS system.
|
||||
|
||||
---
|
||||
|
||||
## 📚 Documentation Index
|
||||
|
||||
### Deployment & Infrastructure
|
||||
|
||||
| Document                                       | Description                                            | Status      |
| ---------------------------------------------- | ------------------------------------------------------ | ----------- |
| [deployment-guide.md](./deployment-guide.md)   | Docker deployment procedures on QNAP Container Station | ✅ Complete |
| [environment-setup.md](./environment-setup.md) | Environment variables and configuration management     | ✅ Complete |
|
||||
|
||||
### Monitoring & Maintenance
|
||||
|
||||
| Document                                                 | Description                                         | Status      |
| -------------------------------------------------------- | --------------------------------------------------- | ----------- |
| [monitoring-alerting.md](./monitoring-alerting.md)       | Monitoring setup, health checks, and alerting rules | ✅ Complete |
| [backup-recovery.md](./backup-recovery.md)               | Backup strategies and disaster recovery procedures  | ✅ Complete |
| [maintenance-procedures.md](./maintenance-procedures.md) | Routine maintenance and update procedures           | ✅ Complete |
|
||||
|
||||
### Security & Compliance
|
||||
|
||||
| Document                                           | Description                                    | Status      |
| -------------------------------------------------- | ---------------------------------------------- | ----------- |
| [security-operations.md](./security-operations.md) | Security monitoring and incident response      | ✅ Complete |
| [incident-response.md](./incident-response.md)     | Incident classification and response playbooks | ✅ Complete |
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Quick Start for Operations Team
|
||||
|
||||
### Initial Setup
|
||||
|
||||
1. **Read Deployment Guide** - [deployment-guide.md](./deployment-guide.md)
|
||||
2. **Configure Environment** - [environment-setup.md](./environment-setup.md)
|
||||
3. **Setup Monitoring** - [monitoring-alerting.md](./monitoring-alerting.md)
|
||||
4. **Configure Backups** - [backup-recovery.md](./backup-recovery.md)
|
||||
|
||||
### Daily Operations
|
||||
|
||||
1. Monitor system health via logs and metrics
|
||||
2. Review backup status (automated daily)
|
||||
3. Check for security alerts
|
||||
4. Review system performance metrics
|
||||
|
||||
### Weekly/Monthly Tasks
|
||||
|
||||
- Review and update SSL certificates (90 days before expiry)
|
||||
- Database optimization and cleanup
|
||||
- Log rotation and archival
|
||||
- Security patch review and application
|
||||
|
||||
---
|
||||
|
||||
## 🏗️ Infrastructure Overview
|
||||
|
||||
### QNAP Container Station Architecture
|
||||
|
||||
```mermaid
graph TB
    subgraph "QNAP Server"
        subgraph "Container Station"
            NGINX[NGINX<br/>Reverse Proxy<br/>Port 80/443]
            Backend[NestJS Backend<br/>Port 3000]
            Frontend[Next.js Frontend<br/>Port 3001]
            MariaDB[(MariaDB 10.11<br/>Port 3306)]
            Redis[(Redis 7.2<br/>Port 6379)]
            ES[(Elasticsearch<br/>Port 9200)]
        end

        Volumes[("Persistent Volumes<br/>- database<br/>- uploads<br/>- logs")]
    end

    Internet([Internet]) --> NGINX
    NGINX --> Frontend
    NGINX --> Backend
    Backend --> MariaDB
    Backend --> Redis
    Backend --> ES
    MariaDB --> Volumes
    Backend --> Volumes
```
|
||||
|
||||
### Container Services
|
||||
|
||||
| Service       | Container Name      | Ports   | Persistent Volume             |
| ------------- | ------------------- | ------- | ----------------------------- |
| NGINX         | lcbp3-nginx         | 80, 443 | /config/nginx                 |
| Backend       | lcbp3-backend       | 3000    | /app/uploads, /app/logs       |
| Frontend      | lcbp3-frontend      | 3001    | -                             |
| MariaDB       | lcbp3-mariadb       | 3306    | /var/lib/mysql                |
| Redis         | lcbp3-redis         | 6379    | /data                         |
| Elasticsearch | lcbp3-elasticsearch | 9200    | /usr/share/elasticsearch/data |
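
As a rough sketch, the table above can be turned into a quick check that every expected container is running. The helper name `missing_containers` is hypothetical and not part of the project; it reads running container names from standard input (one per line) and prints each expected name that is absent.

```bash
#!/bin/bash
# Sketch only: missing_containers is a hypothetical helper, not part of LCBP3-DMS.
# The container names are taken from the Container Services table.
EXPECTED_CONTAINERS="lcbp3-nginx lcbp3-backend lcbp3-frontend lcbp3-mariadb lcbp3-redis lcbp3-elasticsearch"

# Reads running container names (one per line) on stdin and
# prints each expected container that is not in that list.
missing_containers() {
  local running name
  running="$(cat)"
  for name in $EXPECTED_CONTAINERS; do
    # -x matches whole lines only, -F treats the name as a literal string
    if ! printf '%s\n' "$running" | grep -qxF -- "$name"; then
      echo "$name"
    fi
  done
}
```

One possible use is `docker ps --format '{{.Names}}' | missing_containers`, which prints nothing when all six containers are up.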
|
||||
|
||||
---
|
||||
|
||||
## 👥 Roles & Responsibilities
|
||||
|
||||
### System Administrator
|
||||
|
||||
- Deploy and configure infrastructure
|
||||
- Manage QNAP server and Container Station
|
||||
- Configure networking and firewall rules
|
||||
- SSL certificate management
|
||||
|
||||
### Database Administrator (DBA)
|
||||
|
||||
- Database backup and recovery
|
||||
- Performance tuning and optimization
|
||||
- Migration execution
|
||||
- Access control management
|
||||
|
||||
### DevOps Engineer
|
||||
|
||||
- CI/CD pipeline maintenance
|
||||
- Container orchestration
|
||||
- Monitoring and alerting setup
|
||||
- Log aggregation
|
||||
|
||||
### Security Officer
|
||||
|
||||
- Security monitoring
|
||||
- Incident response coordination
|
||||
- Access audit reviews
|
||||
- Vulnerability management
|
||||
|
||||
---
|
||||
|
||||
## 📞 Support & Escalation
|
||||
|
||||
### Support Tiers
|
||||
|
||||
**Tier 1: User Support**
|
||||
|
||||
- User access issues
|
||||
- Password resets
|
||||
- Basic troubleshooting
|
||||
|
||||
**Tier 2: Technical Support**
|
||||
|
||||
- Application errors
|
||||
- Performance issues
|
||||
- Feature bugs
|
||||
|
||||
**Tier 3: Operations Team**
|
||||
|
||||
- Infrastructure failures
|
||||
- Database issues
|
||||
- Security incidents
|
||||
|
||||
### Escalation Path
|
||||
|
||||
1. **Minor Issues** → Tier 1/2 Support → Resolution within 24h
|
||||
2. **Major Issues** → Tier 3 Operations → Resolution within 4h
|
||||
3. **Critical Issues** → Immediate escalation to System Architect → Resolution within 1h
|
||||
|
||||
---
|
||||
|
||||
## 🔗 Related Documentation
|
||||
|
||||
- [Architecture Documentation](../02-architecture/)
|
||||
- [Implementation Guidelines](../03-implementation/)
|
||||
- [Architecture Decision Records](../05-decisions/)
|
||||
- [Backend Development Tasks](../06-tasks/)
|
||||
|
||||
---
|
||||
|
||||
## 📝 Document Maintenance
|
||||
|
||||
- **Review Frequency:** Monthly
|
||||
- **Owner:** Operations Team
|
||||
- **Last Review:** 2025-12-01
|
||||
- **Next Review:** 2026-01-01
|
||||
|
||||
---
|
||||
|
||||
**Version:** 1.5.0
|
||||
**Status:** Active
|
||||
**Classification:** Internal Use Only
|
||||
374
specs/04-operations/backup-recovery.md
Normal file
@@ -0,0 +1,374 @@
# Backup & Recovery Procedures
|
||||
|
||||
**Project:** LCBP3-DMS
|
||||
**Version:** 1.5.0
|
||||
**Last Updated:** 2025-12-01
|
||||
|
||||
---
|
||||
|
||||
## 📋 Overview
|
||||
|
||||
This document outlines backup strategies, recovery procedures, and disaster recovery planning for LCBP3-DMS.
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Backup Strategy
|
||||
|
||||
### Backup Schedule
|
||||
|
||||
| Data Type              | Frequency      | Retention | Method                  |
| ---------------------- | -------------- | --------- | ----------------------- |
| Database (Full)        | Daily at 02:00 | 30 days   | mysqldump + compression |
| Database (Incremental) | Every 6 hours  | 7 days    | Binary logs             |
| File Uploads           | Daily at 03:00 | 30 days   | rsync to backup server  |
| Configuration Files    | Weekly         | 90 days   | Git repository          |
| Elasticsearch Indexes  | Weekly         | 14 days   | Snapshot to S3/NFS      |
| Application Logs       | Daily          | 90 days   | Rotation + archival     |
|
||||
|
||||
### Backup Locations
|
||||
|
||||
**Primary Backup:** QNAP NAS `/backup/lcbp3-dms`
|
||||
**Secondary Backup:** External backup server (rsync)
|
||||
**Offsite Backup:** Cloud storage (optional - for critical data)
|
||||
|
||||
---
|
||||
|
||||
## 💾 Database Backup
|
||||
|
||||
### Automated Daily Backup Script
|
||||
|
||||
```bash
#!/bin/bash
# File: /scripts/backup-database.sh

# Fail the pipeline if mysqldump fails, not only if gzip fails
set -o pipefail

# Configuration
BACKUP_DIR="/backup/lcbp3-dms/database"
DB_CONTAINER="lcbp3-mariadb"
DB_NAME="lcbp3_dms"
DB_USER="backup_user"
DB_PASS="<BACKUP_USER_PASSWORD>"
RETENTION_DAYS=30

# Build the timestamped file name and create the backup directory
BACKUP_FILE="$BACKUP_DIR/lcbp3_$(date +%Y%m%d_%H%M%S).sql.gz"
mkdir -p "$BACKUP_DIR"

# Perform backup
echo "Starting database backup to $BACKUP_FILE"
docker exec "$DB_CONTAINER" mysqldump \
  --user="$DB_USER" \
  --password="$DB_PASS" \
  --single-transaction \
  --routines \
  --triggers \
  --databases "$DB_NAME" \
  | gzip > "$BACKUP_FILE"

# Check backup success (with pipefail, $? covers mysqldump as well as gzip)
if [ $? -eq 0 ]; then
  echo "Backup completed successfully"

  # Delete old backups
  find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +"$RETENTION_DAYS" -delete
  echo "Old backups cleaned up (retention: $RETENTION_DAYS days)"
else
  echo "ERROR: Backup failed!"
  # Remove the partial file so it is not mistaken for a valid backup
  rm -f "$BACKUP_FILE"
  exit 1
fi
```
|
||||
|
||||
### Schedule with Cron
|
||||
|
||||
```bash
|
||||
# Edit crontab
|
||||
crontab -e
|
||||
|
||||
# Add backup job (runs daily at 2 AM)
|
||||
0 2 * * * /scripts/backup-database.sh >> /var/log/backup-database.log 2>&1
|
||||
```
|
||||
|
||||
### Manual Database Backup
|
||||
|
||||
```bash
# Backup specific database
# A "-p" password prompt needs an interactive TTY, which a plain
# "docker exec" does not provide, so DB_ROOT_PASS must be set in the shell.
BACKUP_NAME="backup_$(date +%Y%m%d).sql"
docker exec lcbp3-mariadb mysqldump \
  -u root -p"$DB_ROOT_PASS" \
  --single-transaction \
  lcbp3_dms > "$BACKUP_NAME"

# Compress backup
gzip "$BACKUP_NAME"
```
|
||||
|
||||
---
|
||||
|
||||
## 📂 File Uploads Backup
|
||||
|
||||
### Automated Rsync Backup
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/backup-uploads.sh
|
||||
|
||||
SOURCE="/var/lib/docker/volumes/lcbp3_uploads/_data"
|
||||
DEST="/backup/lcbp3-dms/uploads"
|
||||
RETENTION_DAYS=30
|
||||
|
||||
# Create incremental backup with rsync
|
||||
rsync -av --delete \
|
||||
--backup --backup-dir="$DEST/backup-$(date +%Y%m%d)" \
|
||||
"$SOURCE/" "$DEST/current/"
|
||||
|
||||
# Cleanup old backups
|
||||
find "$DEST" -maxdepth 1 -type d -name "backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
|
||||
|
||||
echo "Upload backup completed: $(date)"
|
||||
```
|
||||
|
||||
### Schedule Uploads Backup
|
||||
|
||||
```bash
|
||||
# Run daily at 3 AM
|
||||
0 3 * * * /scripts/backup-uploads.sh >> /var/log/backup-uploads.log 2>&1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🔄 Database Recovery
|
||||
|
||||
### Full Database Restore
|
||||
|
||||
```bash
# Step 1: Stop backend application
docker stop lcbp3-backend

# Step 2: Restore database from backup
# stdin carries the dump, so the password cannot be prompted for;
# DB_ROOT_PASS must be set in the shell.
gunzip < backup_20241201.sql.gz | \
  docker exec -i lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASS" lcbp3_dms

# Step 3: Verify restore
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASS" -e "
  USE lcbp3_dms;
  SELECT COUNT(*) FROM users;
  SELECT COUNT(*) FROM correspondences;
"

# Step 4: Restart backend
docker start lcbp3-backend
```
|
||||
|
||||
### Point-in-Time Recovery (Using Binary Logs)
|
||||
|
||||
```bash
# Step 1: Restore last full backup
# stdin carries the dump, so the password cannot be prompted for;
# DB_ROOT_PASS must be set in the shell.
gunzip < backup_20241201_020000.sql.gz | \
  docker exec -i lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASS" lcbp3_dms

# Step 2: Apply binary logs since backup
docker exec lcbp3-mariadb mysqlbinlog \
  --start-datetime="2024-12-01 02:00:00" \
  --stop-datetime="2024-12-01 14:30:00" \
  /var/lib/mysql/mysql-bin.000001 | \
  docker exec -i lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASS" lcbp3_dms
```
|
||||
|
||||
---
|
||||
|
||||
## 📁 File Uploads Recovery
|
||||
|
||||
### Restore from Backup
|
||||
|
||||
```bash
# Stop backend to prevent file operations
docker stop lcbp3-backend

# Restore files
rsync -av \
  /backup/lcbp3-dms/uploads/current/ \
  /var/lib/docker/volumes/lcbp3_uploads/_data/

# Restart backend (docker exec only works on a running container)
docker start lcbp3-backend

# Fix ownership of the restored files (run as root inside the container)
docker exec -u root lcbp3-backend chown -R node:node /app/uploads
```
|
||||
|
||||
---
|
||||
|
||||
## 🚨 Disaster Recovery Plan
|
||||
|
||||
### RTO & RPO
|
||||
|
||||
- **RTO (Recovery Time Objective):** 4 hours
|
||||
- **RPO (Recovery Point Objective):** 24 hours (for files), 6 hours (for database)
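
A minimal sketch of how the RPO figures above could be checked automatically, assuming backup timestamps are available as Unix epoch seconds. `within_rpo` and the two `*_RPO_HOURS` variables are hypothetical names, not part of the existing scripts.

```bash
#!/bin/bash
# Sketch only: within_rpo is a hypothetical helper.
# RPO values come from the list above: 6 hours for the database, 24 hours for files.
DB_RPO_HOURS=6
FILES_RPO_HOURS=24

# within_rpo <last_backup_epoch> <now_epoch> <rpo_hours>
# Succeeds (exit 0) when the newest backup is no older than the RPO.
within_rpo() {
  local last_backup="$1" now="$2" rpo_hours="$3"
  local age_seconds=$(( now - last_backup ))
  [ "$age_seconds" -le $(( rpo_hours * 3600 )) ]
}
```

For example, `within_rpo "$last" "$(date +%s)" "$DB_RPO_HOURS" || echo "Database RPO exceeded"`, where `$last` holds the timestamp of the newest backup or binary log. The 6-hour database figure relies on the binary logs from the backup schedule; the daily full dump on its own only covers 24 hours.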
|
||||
|
||||
### DR Scenarios
|
||||
|
||||
#### Scenario 1: Database Corruption
|
||||
|
||||
**Detection:** Database errors in logs, application errors
|
||||
**Recovery Time:** 30 minutes
|
||||
**Steps:**
|
||||
|
||||
1. Stop backend
|
||||
2. Restore last full backup
|
||||
3. Apply binary logs (if needed)
|
||||
4. Verify data integrity
|
||||
5. Restart services
|
||||
|
||||
#### Scenario 2: Complete Server Failure
|
||||
|
||||
**Detection:** Server unresponsive
|
||||
**Recovery Time:** 4 hours
|
||||
**Steps:**
|
||||
|
||||
1. Provision new QNAP server or VM
|
||||
2. Install Docker & Container Station
|
||||
3. Clone Git repository
|
||||
4. Restore database backup
|
||||
5. Restore file uploads
|
||||
6. Deploy containers
|
||||
7. Update DNS (if needed)
|
||||
8. Verify functionality
|
||||
|
||||
#### Scenario 3: Ransomware Attack
|
||||
|
||||
**Detection:** Encrypted files, ransom note
|
||||
**Recovery Time:** 6 hours
|
||||
**Steps:**
|
||||
|
||||
1. **DO NOT pay ransom**
|
||||
2. Isolate infected server
|
||||
3. Provision clean environment
|
||||
4. Restore from offsite backup
|
||||
5. Scan restored backup for malware
|
||||
6. Deploy and verify
|
||||
7. Review security logs
|
||||
8. Implement additional security measures
|
||||
|
||||
---
|
||||
|
||||
## ✅ Backup Verification
|
||||
|
||||
### Weekly Backup Testing
|
||||
|
||||
```bash
#!/bin/bash
# File: /scripts/test-backup.sh
# Requires DB_ROOT_PASS in the environment: an interactive "-p" prompt
# cannot be answered from cron or while stdin carries the dump.

# Create temporary test database
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASS" -e "
  CREATE DATABASE IF NOT EXISTS test_restore;
"

# Restore latest backup to test database
LATEST_BACKUP=$(ls -t /backup/lcbp3-dms/database/*.sql.gz | head -1)
gunzip < "$LATEST_BACKUP" | \
  sed 's/USE `lcbp3_dms`/USE `test_restore`/g' | \
  docker exec -i lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASS"

# Verify table counts
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASS" -e "
  SELECT COUNT(*) FROM test_restore.users;
  SELECT COUNT(*) FROM test_restore.correspondences;
"

# Cleanup
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASS" -e "
  DROP DATABASE test_restore;
"

echo "Backup verification completed: $(date)"
```
|
||||
|
||||
### Monthly DR Drill
|
||||
|
||||
- Test full system restore on standby server
|
||||
- Document time taken and issues encountered
|
||||
- Update DR procedures based on findings
|
||||
|
||||
---
|
||||
|
||||
## 📊 Backup Monitoring
|
||||
|
||||
### Backup Status Dashboard
|
||||
|
||||
Monitor:
|
||||
|
||||
- ✅ Last successful backup timestamp
|
||||
- ✅ Backup file size (detect anomalies)
|
||||
- ✅ Backup success/failure rate
|
||||
- ✅ Available backup storage space
|
||||
|
||||
### Alerts
|
||||
|
||||
Send alert if:
|
||||
|
||||
- ❌ Backup fails
|
||||
- ❌ Backup file size < 50% of average (possible corruption)
|
||||
- ❌ No backup in last 48 hours
|
||||
- ❌ Backup storage < 20% free
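
The two numeric thresholds above can be sketched as small shell predicates. Both function names are hypothetical, and the sketch assumes sizes are passed in as whole numbers (bytes or kilobytes) collected elsewhere, for example from `du` and `df`.

```bash
#!/bin/bash
# Sketch only: both helpers are hypothetical; thresholds come from the alert list above.

# backup_size_suspicious <size> <average_size>
# Succeeds when the backup is smaller than 50% of the average size.
backup_size_suspicious() {
  local size="$1" average="$2"
  # size < average / 2, written without division to stay in integer arithmetic
  [ $(( size * 2 )) -lt "$average" ]
}

# storage_low <free> <total>
# Succeeds when less than 20% of the backup storage is free.
storage_low() {
  local free="$1" total="$2"
  # free < total / 5, again without division
  [ $(( free * 5 )) -lt "$total" ]
}
```

A monitoring script could call these after each backup run and send an alert whenever either predicate succeeds.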
|
||||
|
||||
---
|
||||
|
||||
## 🔧 Maintenance
|
||||
|
||||
### Optimize Backup Performance
|
||||
|
||||
```sql
-- Enable InnoDB compression for large tables
ALTER TABLE correspondences ROW_FORMAT=COMPRESSED;
ALTER TABLE workflow_history ROW_FORMAT=COMPRESSED;

-- Archive old audit logs
-- Move records older than 1 year to archive table
-- Use one fixed cutoff and one transaction so the DELETE removes
-- exactly the rows that the INSERT copied
SET @cutoff = DATE_SUB(NOW(), INTERVAL 1 YEAR);

START TRANSACTION;

INSERT INTO audit_logs_archive
SELECT * FROM audit_logs
WHERE created_at < @cutoff;

DELETE FROM audit_logs
WHERE created_at < @cutoff;

COMMIT;
```
|
||||
|
||||
---
|
||||
|
||||
## 📚 Backup Checklist
|
||||
|
||||
### Daily Tasks
|
||||
|
||||
- [ ] Verify automated backups completed
|
||||
- [ ] Check backup log files for errors
|
||||
- [ ] Monitor backup storage space
|
||||
|
||||
### Weekly Tasks
|
||||
|
||||
- [ ] Test restore from random backup
|
||||
- [ ] Review backup size trends
|
||||
- [ ] Verify offsite backups synced
|
||||
|
||||
### Monthly Tasks
|
||||
|
||||
- [ ] Full DR drill
|
||||
- [ ] Review and update DR procedures
|
||||
- [ ] Test backup restoration on different server
|
||||
|
||||
### Quarterly Tasks
|
||||
|
||||
- [ ] Audit backup access controls
|
||||
- [ ] Review backup retention policies
|
||||
- [ ] Update backup documentation
|
||||
|
||||
---
|
||||
|
||||
## 🔗 Related Documents
|
||||
|
||||
- [Deployment Guide](./deployment-guide.md)
|
||||
- [Monitoring & Alerting](./monitoring-alerting.md)
|
||||
- [Incident Response](./incident-response.md)
|
||||
|
||||
---
|
||||
|
||||
**Version:** 1.5.0
|
||||
**Last Review:** 2025-12-01
|
||||
**Next Review:** 2026-03-01
|
||||
0
specs/04-operations/deployment.md
Normal file
0
specs/04-operations/disaster-recovery.md
Normal file
463
specs/04-operations/environment-setup.md
Normal file
@@ -0,0 +1,463 @@
# Environment Setup & Configuration
|
||||
|
||||
**Project:** LCBP3-DMS
|
||||
**Version:** 1.5.0
|
||||
**Last Updated:** 2025-12-01
|
||||
|
||||
---
|
||||
|
||||
## 📋 Overview
|
||||
|
||||
This document describes environment variables, configuration files, and secrets management for LCBP3-DMS deployment.
|
||||
|
||||
---
|
||||
|
||||
## 🔐 Environment Variables
|
||||
|
||||
### Backend (.env)
|
||||
|
||||
```bash
|
||||
# File: backend/.env (DO NOT commit to Git)
|
||||
|
||||
# Application
|
||||
NODE_ENV=production
|
||||
APP_PORT=3000
|
||||
APP_URL=https://lcbp3-dms.example.com
|
||||
|
||||
# Database
|
||||
DB_HOST=lcbp3-mariadb
|
||||
DB_PORT=3306
|
||||
DB_USER=lcbp3_user
|
||||
DB_PASS=<STRONG_PASSWORD>
|
||||
DB_NAME=lcbp3_dms
|
||||
|
||||
# Redis
|
||||
REDIS_HOST=lcbp3-redis
|
||||
REDIS_PORT=6379
|
||||
REDIS_PASSWORD=<STRONG_PASSWORD>
|
||||
|
||||
# JWT Authentication
|
||||
JWT_SECRET=<RANDOM_256_BIT_SECRET>
|
||||
JWT_EXPIRATION=1h
|
||||
JWT_REFRESH_SECRET=<RANDOM_256_BIT_SECRET>
|
||||
JWT_REFRESH_EXPIRATION=7d
|
||||
|
||||
# File Storage
|
||||
UPLOAD_DIR=/app/uploads
|
||||
TEMP_UPLOAD_DIR=/app/uploads/temp
|
||||
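# Maximum upload size in bytes (100 MB); kept on its own line because some env-file parsers treat a trailing comment as part of the value
MAX_FILE_SIZE=104857600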
|
||||
ALLOWED_FILE_TYPES=pdf,doc,docx,xls,xlsx,dwg,jpg,png
|
||||
|
||||
# SMTP Email
|
||||
SMTP_HOST=smtp.gmail.com
|
||||
SMTP_PORT=587
|
||||
SMTP_USER=noreply@example.com
|
||||
SMTP_PASS=<APP_PASSWORD>
|
||||
SMTP_FROM="LCBP3-DMS System <noreply@example.com>"
|
||||
|
||||
# LINE Notify (Optional)
|
||||
LINE_NOTIFY_ENABLED=true
|
||||
|
||||
# ClamAV Virus Scanner
|
||||
CLAMAV_HOST=clamav
|
||||
CLAMAV_PORT=3310
|
||||
|
||||
# Elasticsearch
|
||||
ELASTICSEARCH_NODE=http://lcbp3-elasticsearch:9200
|
||||
ELASTICSEARCH_INDEX_PREFIX=lcbp3_
|
||||
|
||||
# Logging
|
||||
LOG_LEVEL=info
|
||||
LOG_FILE_PATH=/app/logs
|
||||
|
||||
# Frontend URL (for email links)
|
||||
FRONTEND_URL=https://lcbp3-dms.example.com
|
||||
|
||||
# Rate Limiting
|
||||
RATE_LIMIT_TTL=60
|
||||
RATE_LIMIT_MAX=100
|
||||
```
|
||||
|
||||
### Frontend (.env.local)
|
||||
|
||||
```bash
|
||||
# File: frontend/.env.local (DO NOT commit to Git)
|
||||
|
||||
# API Backend
|
||||
NEXT_PUBLIC_API_URL=https://lcbp3-dms.example.com/api
|
||||
|
||||
# Application
|
||||
NEXT_PUBLIC_APP_NAME=LCBP3-DMS
|
||||
NEXT_PUBLIC_APP_VERSION=1.5.0
|
||||
|
||||
# Feature Flags
|
||||
NEXT_PUBLIC_ENABLE_NOTIFICATIONS=true
|
||||
NEXT_PUBLIC_ENABLE_LINE_NOTIFY=true
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🐳 Docker Compose Configuration
|
||||
|
||||
### Production docker-compose.yml
|
||||
|
||||
```yaml
|
||||
# File: docker-compose.yml
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
# NGINX Reverse Proxy
|
||||
nginx:
|
||||
image: nginx:alpine
|
||||
container_name: lcbp3-nginx
|
||||
ports:
|
||||
- '80:80'
|
||||
- '443:443'
|
||||
volumes:
|
||||
- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
|
||||
- ./nginx/ssl:/etc/nginx/ssl:ro
|
||||
- nginx-logs:/var/log/nginx
|
||||
depends_on:
|
||||
- backend
|
||||
- frontend
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- lcbp3-network
|
||||
|
||||
# NestJS Backend
|
||||
backend:
|
||||
image: lcbp3-backend:latest
|
||||
container_name: lcbp3-backend
|
||||
environment:
|
||||
- NODE_ENV=production
|
||||
env_file:
|
||||
- ./backend/.env
|
||||
volumes:
|
||||
- uploads:/app/uploads
|
||||
- backend-logs:/app/logs
|
||||
depends_on:
|
||||
- mariadb
|
||||
- redis
|
||||
- elasticsearch
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- lcbp3-network
|
||||
healthcheck:
|
||||
test: ['CMD', 'curl', '-f', 'http://localhost:3000/health']
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
|
||||
# Next.js Frontend
|
||||
frontend:
|
||||
image: lcbp3-frontend:latest
|
||||
container_name: lcbp3-frontend
|
||||
environment:
|
||||
- NODE_ENV=production
|
||||
env_file:
|
||||
- ./frontend/.env.local
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- lcbp3-network
|
||||
|
||||
# MariaDB Database
|
||||
mariadb:
|
||||
image: mariadb:10.11
|
||||
container_name: lcbp3-mariadb
|
||||
environment:
|
||||
MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASS}
|
||||
MYSQL_DATABASE: ${DB_NAME}
|
||||
MYSQL_USER: ${DB_USER}
|
||||
MYSQL_PASSWORD: ${DB_PASS}
|
||||
volumes:
|
||||
- mariadb-data:/var/lib/mysql
|
||||
- ./mariadb/init:/docker-entrypoint-initdb.d:ro
|
||||
ports:
|
||||
- '3306:3306'
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- lcbp3-network
|
||||
command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
|
||||
|
||||
# Redis Cache & Queue
|
||||
redis:
|
||||
image: redis:7.2-alpine
|
||||
container_name: lcbp3-redis
|
||||
command: redis-server --requirepass ${REDIS_PASSWORD}
|
||||
volumes:
|
||||
- redis-data:/data
|
||||
ports:
|
||||
- '6379:6379'
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- lcbp3-network
|
||||
|
||||
# Elasticsearch
|
||||
elasticsearch:
|
||||
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
|
||||
container_name: lcbp3-elasticsearch
|
||||
environment:
|
||||
- discovery.type=single-node
|
||||
- 'ES_JAVA_OPTS=-Xms512m -Xmx512m'
|
||||
- xpack.security.enabled=false
|
||||
volumes:
|
||||
- elasticsearch-data:/usr/share/elasticsearch/data
|
||||
ports:
|
||||
- '9200:9200'
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- lcbp3-network
|
||||
|
||||
# ClamAV (Optional - for virus scanning)
|
||||
clamav:
|
||||
image: clamav/clamav:latest
|
||||
container_name: lcbp3-clamav
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- lcbp3-network
|
||||
|
||||
networks:
|
||||
lcbp3-network:
|
||||
driver: bridge
|
||||
|
||||
volumes:
|
||||
mariadb-data:
|
||||
redis-data:
|
||||
elasticsearch-data:
|
||||
uploads:
|
||||
backend-logs:
|
||||
nginx-logs:
|
||||
```
|
||||
|
||||
### Development docker-compose.override.yml
|
||||
|
||||
```yaml
|
||||
# File: docker-compose.override.yml (Local development only)
|
||||
# Add to .gitignore
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
backend:
|
||||
build:
|
||||
context: ./backend
|
||||
dockerfile: Dockerfile.dev
|
||||
volumes:
|
||||
- ./backend:/app
|
||||
- /app/node_modules
|
||||
environment:
|
||||
- NODE_ENV=development
|
||||
- LOG_LEVEL=debug
|
||||
ports:
|
||||
- '3000:3000'
|
||||
- '9229:9229' # Node.js debugger
|
||||
|
||||
frontend:
|
||||
build:
|
||||
context: ./frontend
|
||||
dockerfile: Dockerfile.dev
|
||||
volumes:
|
||||
- ./frontend:/app
|
||||
- /app/node_modules
|
||||
- /app/.next
|
||||
ports:
|
||||
- '3001:3000'
|
||||
|
||||
mariadb:
|
||||
ports:
|
||||
- '3307:3306' # Avoid conflict with local MySQL
|
||||
|
||||
redis:
|
||||
ports:
|
||||
- '6380:6379'
|
||||
|
||||
elasticsearch:
|
||||
environment:
|
||||
- 'ES_JAVA_OPTS=-Xms256m -Xmx256m' # Lower memory for dev
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🔑 Secrets Management
|
||||
|
||||
### Using Docker Secrets (Recommended for Production)
|
||||
|
||||
```yaml
|
||||
# docker-compose.yml
|
||||
services:
|
||||
backend:
|
||||
secrets:
|
||||
- db_password
|
||||
- jwt_secret
|
||||
environment:
|
||||
DB_PASS_FILE: /run/secrets/db_password
|
||||
JWT_SECRET_FILE: /run/secrets/jwt_secret
|
||||
|
||||
secrets:
|
||||
db_password:
|
||||
file: ./secrets/db_password.txt
|
||||
jwt_secret:
|
||||
file: ./secrets/jwt_secret.txt
|
||||
```
|
||||
|
||||
### Generate Strong Secrets
|
||||
|
||||
```bash
|
||||
# Generate JWT Secret
|
||||
openssl rand -base64 64
|
||||
|
||||
# Generate Database Password
|
||||
openssl rand -base64 32
|
||||
|
||||
# Generate Redis Password
|
||||
openssl rand -base64 32
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📁 Directory Structure
|
||||
|
||||
```
|
||||
lcbp3/
|
||||
├── backend/
|
||||
│ ├── .env # Backend environment (DO NOT commit)
|
||||
│ ├── .env.example # Example template (commit this)
|
||||
│ └── ...
|
||||
├── frontend/
|
||||
│ ├── .env.local # Frontend environment (DO NOT commit)
|
||||
│ ├── .env.example # Example template
|
||||
│ └── ...
|
||||
├── nginx/
|
||||
│ ├── nginx.conf
|
||||
│ └── ssl/
|
||||
│ ├── cert.pem
|
||||
│ └── key.pem
|
||||
├── secrets/ # Docker secrets (DO NOT commit)
|
||||
│ ├── db_password.txt
|
||||
│ ├── jwt_secret.txt
|
||||
│ └── redis_password.txt
|
||||
├── docker-compose.yml # Production config
|
||||
└── docker-compose.override.yml # Development config (DO NOT commit)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ⚙️ Configuration Management
|
||||
|
||||
### Environment-Specific Configs
|
||||
|
||||
**Development:**
|
||||
|
||||
```bash
|
||||
NODE_ENV=development
|
||||
LOG_LEVEL=debug
|
||||
DB_HOST=localhost
|
||||
```
|
||||
|
||||
**Staging:**
|
||||
|
||||
```bash
|
||||
NODE_ENV=staging
|
||||
LOG_LEVEL=info
|
||||
DB_HOST=staging-db.internal
|
||||
```
|
||||
|
||||
**Production:**
|
||||
|
||||
```bash
|
||||
NODE_ENV=production
|
||||
LOG_LEVEL=warn
|
||||
DB_HOST=prod-db.internal
|
||||
```
|
||||
|
||||
### Configuration Validation
|
||||
|
||||
The backend validates its environment variables at startup and refuses to start if a required value is missing or invalid:
|
||||
|
||||
```typescript
|
||||
// File: backend/src/config/env.validation.ts
|
||||
import * as Joi from 'joi';
|
||||
|
||||
export const envValidationSchema = Joi.object({
|
||||
NODE_ENV: Joi.string()
|
||||
.valid('development', 'staging', 'production')
|
||||
.required(),
|
||||
DB_HOST: Joi.string().required(),
|
||||
DB_PORT: Joi.number().default(3306),
|
||||
DB_USER: Joi.string().required(),
|
||||
DB_PASS: Joi.string().required(),
|
||||
JWT_SECRET: Joi.string().min(32).required(),
|
||||
// ...
|
||||
});
|
||||
```
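
The Joi schema only runs once the backend starts. As a complementary sketch, a shell pre-flight check could catch missing keys before the containers are brought up. `missing_env_keys` and `REQUIRED_KEYS` are hypothetical names; the key list simply mirrors the fields marked `.required()` in the schema above.

```bash
#!/bin/bash
# Sketch only: missing_env_keys is a hypothetical helper, not part of the backend.
# The key list mirrors the required fields in the Joi schema above.
REQUIRED_KEYS="NODE_ENV DB_HOST DB_USER DB_PASS JWT_SECRET"

# missing_env_keys <env_file>
# Prints every required key that has no non-empty KEY=value line in the file.
missing_env_keys() {
  local env_file="$1" key
  for key in $REQUIRED_KEYS; do
    # Commented-out lines and empty values do not count as set
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "$key"
    fi
  done
}
```

Running `missing_env_keys backend/.env` before deployment prints nothing when all required keys are present. It checks presence only; value rules such as the allowed `NODE_ENV` names remain the job of the Joi schema.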
|
||||
|
||||
---
|
||||
|
||||
## 🔒 Security Best Practices
|
||||
|
||||
### DO:
|
||||
|
||||
- ✅ Use strong, random passwords (minimum 32 characters)
|
||||
- ✅ Rotate secrets every 90 days
|
||||
- ✅ Use Docker secrets for production
|
||||
- ✅ Add `.env` files to `.gitignore`
|
||||
- ✅ Provide `.env.example` templates
|
||||
- ✅ Validate environment variables at startup
|
||||
|
||||
### DON'T:
|
||||
|
||||
- ❌ Commit `.env` files to Git
|
||||
- ❌ Use weak or default passwords
|
||||
- ❌ Share production credentials via email/chat
|
||||
- ❌ Reuse passwords across environments
|
||||
- ❌ Hardcode secrets in source code
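
The 32-character minimum from the DO list can be enforced with a one-line check. This is a sketch under the assumption that length is the only property being tested; `secret_long_enough` and `MIN_SECRET_LENGTH` are hypothetical names.

```bash
#!/bin/bash
# Sketch only: secret_long_enough is a hypothetical helper.
# It checks length only; it cannot tell whether a secret is actually random.
MIN_SECRET_LENGTH=32

# secret_long_enough <secret>
# Succeeds when the secret has at least MIN_SECRET_LENGTH characters.
secret_long_enough() {
  local secret="$1"
  [ "${#secret}" -ge "$MIN_SECRET_LENGTH" ]
}
```

For instance, `secret_long_enough "$(cat secrets/db_password.txt)" || echo "db_password is too short"` could run as part of a deployment checklist.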
|
||||
|
||||
---
|
||||
|
||||
## 🛠️ Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
**Backend can't connect to database:**
|
||||
|
||||
```bash
|
||||
# Check database container is running
|
||||
docker ps | grep mariadb
|
||||
|
||||
# Check database logs
|
||||
docker logs lcbp3-mariadb
|
||||
|
||||
# Verify credentials
|
||||
docker exec lcbp3-backend env | grep DB_
|
||||
```
|
||||
|
||||
**Redis connection refused:**
|
||||
|
||||
```bash
|
||||
# Test Redis connection
|
||||
docker exec lcbp3-redis redis-cli -a <PASSWORD> ping
|
||||
# Should return: PONG
|
||||
```
|
||||
|
||||
**Environment variable not loading:**
|
||||
|
||||
```bash
|
||||
# Check if env file exists
|
||||
ls -la backend/.env
|
||||
|
||||
# Check if backend loaded the env
|
||||
docker exec lcbp3-backend env | grep NODE_ENV
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📚 Related Documents
|
||||
|
||||
- [Deployment Guide](./deployment-guide.md)
|
||||
- [Security Operations](./security-operations.md)
|
||||
- [ADR-005: Technology Stack](../05-decisions/ADR-005-technology-stack.md)
|
||||
|
||||
---
|
||||
|
||||
**Version:** 1.5.0
|
||||
**Last Review:** 2025-12-01
|
||||
**Next Review:** 2026-03-01
|
||||
483
specs/04-operations/incident-response.md
Normal file
@@ -0,0 +1,483 @@
# Incident Response Procedures
|
||||
|
||||
**Project:** LCBP3-DMS
|
||||
**Version:** 1.5.0
|
||||
**Last Updated:** 2025-12-01
|
||||
|
||||
---
|
||||
|
||||
## 📋 Overview
|
||||
|
||||
This document outlines incident classification, response procedures, and post-incident reviews for LCBP3-DMS.
|
||||
|
||||
---
|
||||
|
||||
## 🚨 Incident Classification
|
||||
|
||||
### Severity Levels
|
||||
|
||||
| Severity          | Description                  | Response Time     | Examples                                        |
| ----------------- | ---------------------------- | ----------------- | ----------------------------------------------- |
| **P0 - Critical** | Complete system outage       | 15 minutes        | Database down, All services unavailable         |
| **P1 - High**     | Major functionality impaired | 1 hour            | Authentication failing, Cannot create documents |
| **P2 - Medium**   | Degraded performance         | 4 hours           | Slow response time, Some features broken        |
| **P3 - Low**      | Minor issues                 | Next business day | UI glitch, Non-critical bug                     |
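
For alerting or ticketing scripts, the table above could be encoded as a simple lookup. The sketch below is illustrative only; `response_time_for` is a hypothetical helper and the strings are copied from the Response Time column.

```bash
#!/bin/bash
# Sketch only: response_time_for is a hypothetical helper.
# Response times are copied from the Severity Levels table.

# response_time_for <severity>
# Prints the target response time, or fails for an unknown severity.
response_time_for() {
  case "$1" in
    P0) echo "15 minutes" ;;
    P1) echo "1 hour" ;;
    P2) echo "4 hours" ;;
    P3) echo "next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Failing on an unknown severity, rather than silently defaulting to the lowest one, keeps a mistyped level from being handled as P3.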
|
||||
|
||||
---
|
||||
|
||||
## 📞 Incident Response Team
|
||||
|
||||
### Roles & Responsibilities
|
||||
|
||||
**Incident Commander (IC)**
|
||||
|
||||
- Coordinates response efforts
|
||||
- Makes final decisions
|
||||
- Communicates with stakeholders
|
||||
|
||||
**Technical Lead (TL)**
|
||||
|
||||
- Diagnoses technical issues
|
||||
- Implements fixes
|
||||
- Coordinates with engineers
|
||||
|
||||
**Communications Lead (CL)**
|
||||
|
||||
- Updates stakeholders
|
||||
- Manages internal/external communications
|
||||
- Documents incident timeline
|
||||
|
||||
**On-Call Engineer**
|
||||
|
||||
- First responder
|
||||
- Initial triage and investigation
|
||||
- Escalates to appropriate team
|
||||
|
||||
---
|
||||
|
||||
## 🔄 Incident Response Workflow
|
||||
|
||||
```mermaid
flowchart TD
    Start([Incident Detected]) --> Acknowledge[Acknowledge Incident]
    Acknowledge --> Assess[Assess Severity]
    Assess --> P0{Severity?}

    P0 -->|P0/P1| Alert[Page Incident Commander]
    P0 -->|P2/P3| Assign[Assign to On-Call]

    Alert --> Investigate[Investigate Root Cause]
    Assign --> Investigate

    Investigate --> Mitigate[Implement Mitigation]
    Mitigate --> Verify[Verify Resolution]

    Verify --> Resolved{Resolved?}
    Resolved -->|No| Escalate[Escalate/Re-assess]
    Escalate --> Investigate

    Resolved -->|Yes| Communicate[Communicate Resolution]
    Communicate --> PostMortem[Schedule Post-Mortem]
    PostMortem --> End([Close Incident])
```
|
||||
|
||||
---
|
||||
|
||||
## 📋 Incident Response Playbooks
|
||||
|
||||
### P0: Database Down
|
||||
|
||||
**Symptoms:**
|
||||
|
||||
- Backend returns 500 errors
|
||||
- Cannot connect to database
|
||||
- Health check fails
|
||||
|
||||
**Immediate Actions:**
|
||||
|
||||
1. **Verify Issue**
|
||||
|
||||
```bash
|
||||
docker ps | grep mariadb
|
||||
docker logs lcbp3-mariadb --tail=50
|
||||
```
|
||||
|
||||
2. **Attempt Restart**
|
||||
|
||||
```bash
|
||||
docker restart lcbp3-mariadb
|
||||
```
|
||||
|
||||
3. **Check Database Process**
|
||||
|
||||
```bash
|
||||
docker exec lcbp3-mariadb ps aux | grep -E 'mysqld|mariadbd'
|
||||
```
|
||||
|
||||
4. **If Restart Fails:**
|
||||
|
||||
```bash
|
||||
# Check disk space
|
||||
df -h
|
||||
|
||||
# Check database logs for corruption
|
||||
docker exec lcbp3-mariadb cat /var/log/mysql/error.log
|
||||
|
||||
# If corrupted, restore from backup
|
||||
# See backup-recovery.md
|
||||
```
|
||||
|
||||
5. **Escalate to DBA** if not resolved in 30 minutes
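
The 30-minute escalation rule in step 5 is easy to lose track of during an outage. A minimal sketch of a timer check follows; the function name `should_escalate` and the idea of recording the incident start as a Unix timestamp are assumptions for illustration.

```bash
#!/bin/bash
# should_escalate START_EPOCH NOW_EPOCH [LIMIT_MINUTES]
# Succeeds (exit 0) once the incident has been open for at least LIMIT_MINUTES
# (default 30, matching the playbook). Hypothetical helper, not an existing script.
should_escalate() (
  start=$1
  now=$2
  limit_min=${3:-30}
  elapsed=$(( now - start ))
  [ "$elapsed" -ge $(( limit_min * 60 )) ]
)
```

A responder could record `INCIDENT_START=$(date +%s)` when acknowledging and later run `should_escalate "$INCIDENT_START" "$(date +%s)" && echo "Escalate to DBA"`.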
|
||||
|
||||
---
|
||||
|
||||
### P0: Complete System Outage
|
||||
|
||||
**Symptoms:**
|
||||
|
||||
- All services return 502/503
|
||||
- Health checks fail
|
||||
- Users cannot access system
|
||||
|
||||
**Immediate Actions:**
|
||||
|
||||
1. **Check Container Status**
|
||||
|
||||
```bash
|
||||
docker-compose ps
|
||||
# Identify which containers are down
|
||||
```
|
||||
|
||||
2. **Restart All Services**
|
||||
|
||||
```bash
|
||||
docker-compose restart
|
||||
```
|
||||
|
||||
3. **Check QNAP Server Resources**
|
||||
|
||||
```bash
|
||||
top
|
||||
df -h
|
||||
free -h
|
||||
```
|
||||
|
||||
4. **Check Network**
|
||||
|
||||
```bash
|
||||
ping -c 4 8.8.8.8
|
||||
netstat -tlnp
|
||||
```
|
||||
|
||||
5. **If Server Issue:**
|
||||
- Reboot QNAP server
|
||||
- Contact QNAP support
|
||||
|
||||
---
|
||||
|
||||
### P1: Authentication System Failing
|
||||
|
||||
**Symptoms:**
|
||||
|
||||
- Users cannot log in
|
||||
- JWT validation fails
|
||||
- 401 errors
|
||||
|
||||
**Immediate Actions:**
|
||||
|
||||
1. **Check Redis (Session Store)**
|
||||
|
||||
```bash
|
||||
docker exec lcbp3-redis redis-cli ping
|
||||
# Should return PONG
|
||||
```
|
||||
|
||||
2. **Check JWT Secret Configuration**
|
||||
|
||||
```bash
|
||||
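docker exec lcbp3-backend sh -c 'test -n "$JWT_SECRET" && echo "JWT_SECRET is set" || echo "JWT_SECRET is EMPTY"'
# Verifies the variable is not empty without printing the secret to the terminal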
|
||||
```
|
||||
|
||||
3. **Check Backend Logs**
|
||||
|
||||
```bash
|
||||
docker logs lcbp3-backend --tail=100 2>&1 | grep "JWT\|Auth"
|
||||
```
|
||||
|
||||
4. **Temporary Mitigation:**
|
||||
```bash
|
||||
# Restart backend to reload config
|
||||
docker restart lcbp3-backend
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### P1: File Upload Failing
|
||||
|
||||
**Symptoms:**
|
||||
|
||||
- Users cannot upload files
|
||||
- 500 errors on file upload
|
||||
- "Disk full" errors
|
||||
|
||||
**Immediate Actions:**
|
||||
|
||||
1. **Check Disk Space**
|
||||
|
||||
```bash
|
||||
df -h /var/lib/docker/volumes/lcbp3_uploads
|
||||
```
|
||||
|
||||
2. **If Disk Full:**
|
||||
|
||||
```bash
|
||||
# Clean up temp uploads
|
||||
find /var/lib/docker/volumes/lcbp3_uploads/_data/temp \
|
||||
-type f -mtime +1 -delete
|
||||
```
|
||||
|
||||
3. **Check ClamAV (Virus Scanner)**
|
||||
|
||||
```bash
|
||||
docker logs lcbp3-clamav --tail=50
|
||||
docker restart lcbp3-clamav
|
||||
```
|
||||
|
||||
4. **Check File Permissions**
|
||||
```bash
|
||||
docker exec lcbp3-backend ls -la /app/uploads
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### P2: Slow Performance
|
||||
|
||||
**Symptoms:**
|
||||
|
||||
- Pages load slowly
|
||||
- API response time > 2s
|
||||
- Users complain about slowness
|
||||
|
||||
**Actions:**
|
||||
|
||||
1. **Check System Resources**
|
||||
|
||||
```bash
|
||||
docker stats
|
||||
# Identify high CPU/memory containers
|
||||
```
|
||||
|
||||
2. **Check Database Performance**
|
||||
|
||||
```sql
|
||||
-- Show currently running queries (a high Time value indicates a slow query)
SHOW FULL PROCESSLIST;
|
||||
|
||||
-- Check connections
|
||||
SHOW STATUS LIKE 'Threads_connected';
|
||||
```
|
||||
|
||||
3. **Check Redis**
|
||||
|
||||
```bash
|
||||
docker exec lcbp3-redis redis-cli --stat
|
||||
```
|
||||
|
||||
4. **Check Application Logs**
|
||||
|
||||
```bash
|
||||
docker logs lcbp3-backend 2>&1 | grep "Slow request"
|
||||
```
|
||||
|
||||
5. **Temporary Mitigation:**
|
||||
- Restart slow containers
|
||||
- Clear Redis cache if needed
|
||||
- Kill long-running queries
|
||||
|
||||
---
|
||||
|
||||
### P2: Email Notifications Not Sending
|
||||
|
||||
**Symptoms:**
|
||||
|
||||
- Users not receiving emails
|
||||
- Email queue backing up
|
||||
|
||||
**Actions:**
|
||||
|
||||
1. **Check Email Queue**
|
||||
|
||||
```bash
|
||||
# Access BullMQ dashboard or check Redis (BullMQ keeps waiting jobs in the ":wait" list)
docker exec lcbp3-redis redis-cli LLEN bull:email:wait
|
||||
```
|
||||
|
||||
2. **Check Email Processor Logs**
|
||||
|
||||
```bash
|
||||
docker logs lcbp3-backend 2>&1 | grep -i "email\|SMTP"
|
||||
```
|
||||
|
||||
3. **Test SMTP Connection**
|
||||
|
||||
```bash
|
||||
docker exec lcbp3-backend node -e "
|
||||
const nodemailer = require('nodemailer');
|
||||
const transport = nodemailer.createTransport({
|
||||
host: process.env.SMTP_HOST,
|
||||
port: process.env.SMTP_PORT,
|
||||
auth: {
|
||||
user: process.env.SMTP_USER,
|
||||
pass: process.env.SMTP_PASS
|
||||
}
|
||||
});
|
||||
transport.verify().then(console.log).catch(console.error);
|
||||
"
|
||||
```
|
||||
|
||||
4. **Check SMTP Credentials**
|
||||
- Verify not expired
|
||||
- Check firewall/network access
|
||||
|
||||
---
|
||||
|
||||
## 📝 Incident Documentation
|
||||
|
||||
### Incident Report Template
|
||||
|
||||
```markdown
|
||||
# Incident Report: [Brief Description]
|
||||
|
||||
**Incident ID:** INC-YYYYMMDD-001
|
||||
**Severity:** P1
|
||||
**Status:** Resolved
|
||||
**Incident Commander:** [Name]
|
||||
|
||||
## Timeline
|
||||
|
||||
| Time | Event |
|
||||
| ----- | --------------------------------------------------------- |
|
||||
| 14:00 | Alert: High error rate detected |
|
||||
| 14:05 | On-call engineer acknowledged |
|
||||
| 14:10 | Identified root cause: Database connection pool exhausted |
|
||||
| 14:15 | Implemented mitigation: Increased pool size |
|
||||
| 14:20 | Verified resolution |
|
||||
| 14:30 | Incident resolved |
|
||||
|
||||
## Impact
|
||||
|
||||
- **Duration:** 30 minutes
|
||||
- **Affected Users:** ~50 users
|
||||
- **Affected Services:** Document creation, Search
|
||||
- **Data Loss:** None
|
||||
|
||||
## Root Cause
|
||||
|
||||
Database connection pool was exhausted due to slow queries not releasing connections.
|
||||
|
||||
## Resolution
|
||||
|
||||
1. Increased connection pool size from 10 to 20
|
||||
2. Optimized slow queries
|
||||
3. Added connection pool monitoring
|
||||
|
||||
## Action Items
|
||||
|
||||
- [ ] Add connection pool size alert (Owner: DevOps, Due: Next Sprint)
|
||||
- [ ] Implement automatic query timeouts (Owner: Backend, Due: 2025-12-15)
|
||||
- [ ] Review all queries for optimization (Owner: DBA, Due: 2025-12-31)
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
- Connection pool monitoring was insufficient
|
||||
- Need automated remediation for common issues
|
||||
```
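
The template uses IDs of the form `INC-YYYYMMDD-001`. A small sketch for generating them consistently is shown below; the helper name `incident_id` and the choice to pass the daily sequence number by hand are assumptions, since the source does not say how IDs are allocated.

```bash
#!/bin/bash
# incident_id SEQ [YYYYMMDD]
# Prints an ID in the template's INC-YYYYMMDD-NNN format.
# SEQ is a plain integer without leading zeros; the date defaults to today.
# Hypothetical helper for illustration only.
incident_id() (
  seq_no=$1
  day=${2:-$(date +%Y%m%d)}
  printf 'INC-%s-%03d\n' "$day" "$seq_no"
)
```

For example, `incident_id 1 20251201` yields `INC-20251201-001`.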
|
||||
|
||||
---
|
||||
|
||||
## 🔍 Post-Incident Review (PIR)
|
||||
|
||||
### PIR Meeting Agenda
|
||||
|
||||
1. **Timeline Review** (10 min)
|
||||
|
||||
- What happened and when?
|
||||
- What was the impact?
|
||||
|
||||
2. **Root Cause Analysis** (15 min)
|
||||
|
||||
- Why did it happen?
|
||||
- What were the contributing factors?
|
||||
|
||||
3. **What Went Well** (10 min)
|
||||
|
||||
- What did we do right?
|
||||
- What helped us resolve quickly?
|
||||
|
||||
4. **What Went Wrong** (15 min)
|
||||
|
||||
- What could we have done better?
|
||||
- What slowed us down?
|
||||
|
||||
5. **Action Items** (10 min)
|
||||
- What changes will prevent this?
|
||||
- Who owns each action?
|
||||
- When will they be completed?
|
||||
|
||||
### PIR Best Practices
|
||||
|
||||
- **Blameless Culture:** Focus on systems, not individuals
|
||||
- **Actionable Outcomes:** Every PIR should produce concrete actions
|
||||
- **Follow Through:** Track action items to completion
|
||||
- **Share Learnings:** Distribute PIR summary to entire team
|
||||
|
||||
---
|
||||
|
||||
## 📊 Incident Metrics
|
||||
|
||||
### Track & Review Monthly
|
||||
|
||||
- **MTTR (Mean Time To Resolution):** Average time to resolve incidents
|
||||
- **MTBF (Mean Time Between Failures):** Average time between incidents
|
||||
- **Incident Frequency:** Number of incidents per month
|
||||
- **Severity Distribution:** Breakdown by P0/P1/P2/P3
|
||||
- **Repeat Incidents:** Same root cause occurring multiple times
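
As a rough illustration of the MTTR metric above, the sketch below averages a list of incident durations. The function name `mttr_minutes` is hypothetical, durations are assumed to be whole minutes, and the result is truncated by integer division.

```bash
#!/bin/bash
# mttr_minutes DURATION...
# Prints the mean of the given resolution times (whole minutes, integer division).
# Hypothetical helper; how durations are collected is left to the tracking system.
mttr_minutes() (
  [ "$#" -gt 0 ] || { echo "no incidents given" >&2; exit 1; }
  total=0
  for d in "$@"; do
    total=$(( total + d ))
  done
  echo $(( total / $# ))
)
```

With three incidents resolved in 30, 45, and 15 minutes, `mttr_minutes 30 45 15` prints `30`.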
|
||||
|
||||
---
|
||||
|
||||
## ✅ Incident Response Checklist
|
||||
|
||||
### During Incident
|
||||
|
||||
- [ ] Acknowledge incident in tracking system
|
||||
- [ ] Assess severity and assign IC
|
||||
- [ ] Create incident channel (Slack/Teams)
|
||||
- [ ] Begin documenting timeline
|
||||
- [ ] Investigate and implement mitigation
|
||||
- [ ] Communicate status updates every 30 min (P0/P1)
|
||||
- [ ] Verify resolution
|
||||
- [ ] Communicate resolution to stakeholders
|
||||
|
||||
### After Incident
|
||||
|
||||
- [ ] Create incident report
|
||||
- [ ] Schedule PIR within 48 hours
|
||||
- [ ] Identify action items
|
||||
- [ ] Assign owners and deadlines
|
||||
- [ ] Update runbooks/playbooks
|
||||
- [ ] Share learnings with team
|
||||
|
||||
---
|
||||
|
||||
## 🔗 Related Documents
|
||||
|
||||
- [Monitoring & Alerting](./monitoring-alerting.md)
|
||||
- [Backup & Recovery](./backup-recovery.md)
|
||||
- [Security Operations](./security-operations.md)
|
||||
|
||||
---
|
||||
|
||||
**Version:** 1.5.0
|
||||
**Last Review:** 2025-12-01
|
||||
**Next Review:** 2026-03-01
|
||||
501
specs/04-operations/maintenance-procedures.md
Normal file
@@ -0,0 +1,501 @@
|
||||
# Maintenance Procedures
|
||||
|
||||
**Project:** LCBP3-DMS
|
||||
**Version:** 1.5.0
|
||||
**Last Updated:** 2025-12-01
|
||||
|
||||
---
|
||||
|
||||
## 📋 Overview
|
||||
|
||||
This document outlines routine maintenance tasks, update procedures, and optimization guidelines for LCBP3-DMS.
|
||||
|
||||
---
|
||||
|
||||
## 📅 Maintenance Schedule
|
||||
|
||||
### Daily Tasks
|
||||
|
||||
- Monitor system health and backups
|
||||
- Review error logs
|
||||
- Check disk space
|
||||
|
||||
### Weekly Tasks
|
||||
|
||||
- Database optimization
|
||||
- Log rotation and cleanup
|
||||
- Security patch review
|
||||
- Performance monitoring review
|
||||
|
||||
### Monthly Tasks
|
||||
|
||||
- SSL certificate check
|
||||
- Dependency updates (Security patches)
|
||||
- Database maintenance
|
||||
- Backup restoration test
|
||||
|
||||
### Quarterly Tasks
|
||||
|
||||
- Full system update
|
||||
- Capacity planning review
|
||||
- Security audit
|
||||
- Disaster recovery drill
|
||||
|
||||
---
|
||||
|
||||
## 🔄 Update Procedures
|
||||
|
||||
### Application Updates
|
||||
|
||||
#### Backend Update
|
||||
|
||||
```bash
|
||||
#!/bin/bash
# File: /scripts/update-backend.sh
set -euo pipefail  # Abort on the first failed step (for example a failed backup or migration)
|
||||
|
||||
# Step 1: Backup database
|
||||
/scripts/backup-database.sh
|
||||
|
||||
# Step 2: Pull latest code
|
||||
cd /app/lcbp3/backend
|
||||
git pull origin main
|
||||
|
||||
# Step 3: Install dependencies
|
||||
docker exec lcbp3-backend npm install
|
||||
|
||||
# Step 4: Run migrations
|
||||
docker exec lcbp3-backend npm run migration:run
|
||||
|
||||
# Step 5: Build application
|
||||
docker exec lcbp3-backend npm run build
|
||||
|
||||
# Step 6: Restart backend
|
||||
docker restart lcbp3-backend
|
||||
|
||||
# Step 7: Verify health
|
||||
sleep 10
|
||||
curl -f http://localhost:3000/health || {
|
||||
echo "Health check failed! Rolling back..."
|
||||
docker exec lcbp3-backend npm run migration:revert
|
||||
docker restart lcbp3-backend
|
||||
exit 1
|
||||
}
|
||||
|
||||
echo "Backend updated successfully"
|
||||
```
|
||||
|
||||
#### Frontend Update
|
||||
|
||||
```bash
|
||||
#!/bin/bash
# File: /scripts/update-frontend.sh
set -euo pipefail  # Abort on the first failed step (for example a failed build)
|
||||
|
||||
# Step 1: Pull latest code
|
||||
cd /app/lcbp3/frontend
|
||||
git pull origin main
|
||||
|
||||
# Step 2: Install dependencies
|
||||
docker exec lcbp3-frontend npm install
|
||||
|
||||
# Step 3: Build application
|
||||
docker exec lcbp3-frontend npm run build
|
||||
|
||||
# Step 4: Restart frontend
|
||||
docker restart lcbp3-frontend
|
||||
|
||||
# Step 5: Verify
|
||||
sleep 10
|
||||
curl -f http://localhost:3001 || {
|
||||
echo "Frontend failed to start!"
|
||||
exit 1
|
||||
}
|
||||
|
||||
echo "Frontend updated successfully"
|
||||
```
|
||||
|
||||
### Zero-Downtime Deployment
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/zero-downtime-deploy.sh
|
||||
|
||||
# Using blue-green deployment strategy
|
||||
|
||||
# Step 1: Start new "green" backend
|
||||
docker-compose -f docker-compose.green.yml up -d backend
|
||||
|
||||
# Step 2: Wait for health check
|
||||
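HEALTHY=false
for i in {1..30}; do
  if curl -fsS http://localhost:3002/health > /dev/null; then
    HEALTHY=true
    break
  fi
  sleep 2
done

# Never switch traffic to a backend that did not become healthy
if [ "$HEALTHY" != "true" ]; then
  echo "Green backend failed its health check; aborting and leaving blue in service"
  docker-compose -f docker-compose.green.yml stop backend
  exit 1
fi

# Step 3: Switch NGINX to green
# (the NGINX upstream must already point at the green backend before reloading)
docker exec lcbp3-nginx nginx -s reload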
|
||||
|
||||
# Step 4: Stop old "blue" backend
|
||||
docker stop lcbp3-backend-blue
|
||||
|
||||
echo "Deployment completed with zero downtime"
|
||||
```
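
The "wait for health check" step is a retry loop, and the same pattern recurs in the update scripts. A generic sketch is shown below; the helper name `retry` is an assumption and is not part of the existing scripts.

```bash
#!/bin/bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping DELAY_SECONDS
# after each failed attempt. Returns 0 on success, 1 if every attempt failed.
# Hypothetical helper for illustration.
retry() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && exit 0
    i=$(( i + 1 ))
    sleep "$delay"
  done
  exit 1
)
```

It could be used as `retry 30 2 curl -fsS http://localhost:3002/health || { echo "green backend unhealthy"; exit 1; }`, which makes the failure path explicit instead of falling through to the NGINX switch.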
|
||||
|
||||
---
|
||||
|
||||
## 🗄️ Database Maintenance
|
||||
|
||||
### Weekly Database Optimization
|
||||
|
||||
```sql
|
||||
-- File: /scripts/optimize-database.sql
|
||||
|
||||
-- Optimize tables
|
||||
OPTIMIZE TABLE correspondences;
|
||||
OPTIMIZE TABLE rfas;
|
||||
OPTIMIZE TABLE workflow_instances;
|
||||
OPTIMIZE TABLE attachments;
|
||||
|
||||
-- Analyze tables for query optimization
|
||||
ANALYZE TABLE correspondences;
|
||||
ANALYZE TABLE rfas;
|
||||
|
||||
-- Check for table corruption
|
||||
CHECK TABLE correspondences;
|
||||
CHECK TABLE rfas;
|
||||
|
||||
-- Rebuild indexes if fragmented
|
||||
ALTER TABLE correspondences ENGINE=InnoDB;
|
||||
```
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/weekly-db-maintenance.sh
|
||||
|
||||
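# -i forwards stdin (the SQL file) into the container.
# Assumes MYSQL_ROOT_PASSWORD is set in the container environment; a bare -p would wait for an interactive prompt.
docker exec -i lcbp3-mariadb sh -c 'exec mysql -u root -p"$MYSQL_ROOT_PASSWORD" lcbp3_dms' < /scripts/optimize-database.sql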
|
||||
|
||||
echo "Database optimization completed: $(date)"
|
||||
```
|
||||
|
||||
### Monthly Database Cleanup
|
||||
|
||||
```sql
|
||||
-- Archive old audit logs (older than 1 year)
-- Use one fixed cutoff inside a transaction so the INSERT and DELETE cover exactly the same rows
SET @cutoff = DATE_SUB(NOW(), INTERVAL 1 YEAR);

START TRANSACTION;

INSERT INTO audit_logs_archive
SELECT * FROM audit_logs
WHERE created_at < @cutoff;

DELETE FROM audit_logs
WHERE created_at < @cutoff;

COMMIT;
|
||||
|
||||
-- Clean up deleted notifications (older than 90 days)
|
||||
DELETE FROM notifications
|
||||
WHERE deleted_at IS NOT NULL
|
||||
AND deleted_at < DATE_SUB(NOW(), INTERVAL 90 DAY);
|
||||
|
||||
-- Clean up expired temp uploads (older than 24h)
|
||||
DELETE FROM temp_uploads
|
||||
WHERE created_at < DATE_SUB(NOW(), INTERVAL 1 DAY);
|
||||
|
||||
-- Optimize after cleanup
|
||||
OPTIMIZE TABLE audit_logs;
|
||||
OPTIMIZE TABLE notifications;
|
||||
OPTIMIZE TABLE temp_uploads;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📦 Dependency Updates
|
||||
|
||||
### Security Patch Updates (Monthly)
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/update-dependencies.sh
|
||||
|
||||
cd /app/lcbp3/backend
|
||||
|
||||
# Check for security vulnerabilities
|
||||
npm audit
|
||||
|
||||
# Update security patches only (no major versions)
|
||||
npm audit fix
|
||||
|
||||
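# Run tests; stop here if they fail so a broken update is never committed
npm test || { echo "Tests failed; not committing"; exit 1; }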
|
||||
|
||||
# If tests pass, commit and deploy
|
||||
git add package*.json
|
||||
git commit -m "chore: security patch updates"
|
||||
git push origin main
|
||||
```
|
||||
|
||||
### Major Version Updates (Quarterly)
|
||||
|
||||
```bash
|
||||
# Check for outdated packages
|
||||
npm outdated
|
||||
|
||||
# Update one major dependency at a time
|
||||
npm install @nestjs/core@latest
|
||||
|
||||
# Test thoroughly
|
||||
npm test
|
||||
npm run test:e2e
|
||||
|
||||
# If successful, commit
|
||||
git commit -am "chore: update @nestjs/core to vX.X.X"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🧹 Log Management
|
||||
|
||||
### Log Rotation Configuration
|
||||
|
||||
```bash
|
||||
# File: /etc/logrotate.d/lcbp3-dms
|
||||
|
||||
/app/logs/*.log {
|
||||
daily
|
||||
rotate 30
|
||||
compress
|
||||
delaycompress
|
||||
missingok
|
||||
notifempty
|
||||
# Truncate the log in place so the application keeps writing to its open file handle.
# Signalling the backend with SIGUSR1 is not an option: Node.js reserves that signal to start the inspector.
copytruncate
|
||||
}
|
||||
```
|
||||
|
||||
### Manual Log Cleanup
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/cleanup-logs.sh
|
||||
|
||||
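# Delete logs older than 90 days (including logs that were already compressed)
find /app/logs \( -name "*.log" -o -name "*.log.gz" \) -type f -mtime +90 -delete

# Compress logs older than 7 days
find /app/logs -name "*.log" -type f -mtime +7 -exec gzip {} \;

# Clean unused Docker resources older than 30 days (stopped containers, unused networks, dangling images)
# This does not remove container logs; those are capped by the json-file log options.
# --volumes is deliberately omitted: it would delete unused data volumes and cannot be combined with "until".
docker system prune -f --filter "until=720h"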
|
||||
|
||||
echo "Log cleanup completed: $(date)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🔐 SSL Certificate Renewal
|
||||
|
||||
### Check Certificate Expiry
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/check-ssl-cert.sh
|
||||
|
||||
CERT_FILE="/app/nginx/ssl/cert.pem"
|
||||
EXPIRY_DATE=$(openssl x509 -enddate -noout -in "$CERT_FILE" | cut -d= -f2)
|
||||
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
|
||||
NOW_EPOCH=$(date +%s)
|
||||
DAYS_LEFT=$(( ($EXPIRY_EPOCH - $NOW_EPOCH) / 86400 ))
|
||||
|
||||
echo "SSL certificate expires in $DAYS_LEFT days"
|
||||
|
||||
if [ $DAYS_LEFT -lt 30 ]; then
|
||||
echo "WARNING: SSL certificate expires soon!"
|
||||
# Send alert
|
||||
/scripts/send-alert-email.sh "SSL Certificate Expiring" "Certificate expires in $DAYS_LEFT days"
|
||||
fi
|
||||
```
|
||||
|
||||
### Renew SSL Certificate (Let's Encrypt)
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/renew-ssl.sh
|
||||
|
||||
# Renew certificate
|
||||
certbot renew --webroot -w /app/nginx/html
|
||||
|
||||
# Copy new certificate
|
||||
cp /etc/letsencrypt/live/lcbp3-dms.example.com/fullchain.pem /app/nginx/ssl/cert.pem
|
||||
cp /etc/letsencrypt/live/lcbp3-dms.example.com/privkey.pem /app/nginx/ssl/key.pem
|
||||
|
||||
# Reload NGINX
|
||||
docker exec lcbp3-nginx nginx -s reload
|
||||
|
||||
echo "SSL certificate renewed: $(date)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🧪 Performance Optimization
|
||||
|
||||
### Database Query Optimization
|
||||
|
||||
```sql
|
||||
-- Find slow queries (requires slow_query_log = ON and log_output = 'TABLE')
SELECT * FROM mysql.slow_log
|
||||
ORDER BY query_time DESC
|
||||
LIMIT 10;
|
||||
|
||||
-- Add indexes for frequently queried columns
|
||||
CREATE INDEX idx_correspondences_status ON correspondences(status);
|
||||
CREATE INDEX idx_rfas_workflow_status ON rfas(workflow_status);
|
||||
CREATE INDEX idx_attachments_entity ON attachments(entity_type, entity_id);
|
||||
|
||||
-- Analyze query execution plan
|
||||
EXPLAIN SELECT * FROM correspondences
|
||||
WHERE status = 'PENDING'
|
||||
AND created_at > DATE_SUB(NOW(), INTERVAL 30 DAY);
|
||||
```
|
||||
|
||||
### Redis Cache Optimization
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/optimize-redis.sh
|
||||
|
||||
# Check Redis memory usage
|
||||
docker exec lcbp3-redis redis-cli INFO memory
|
||||
|
||||
# Set max memory policy
|
||||
docker exec lcbp3-redis redis-cli CONFIG SET maxmemory 1gb
|
||||
docker exec lcbp3-redis redis-cli CONFIG SET maxmemory-policy allkeys-lru
|
||||
|
||||
# Save configuration
|
||||
docker exec lcbp3-redis redis-cli CONFIG REWRITE
|
||||
|
||||
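# Clear stale cache (only if needed, never as part of a routine run)
# WARNING: FLUSHDB deletes every key in the current database, which also removes
# sessions and BullMQ jobs if they share that database with the cache.
# Run manually after confirming what the database holds:
# docker exec lcbp3-redis redis-cli FLUSHDB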
|
||||
```
|
||||
|
||||
### Application Performance Tuning
|
||||
|
||||
```typescript
|
||||
// Enable production optimizations in NestJS
// File: backend/src/main.ts
import { NestFactory } from '@nestjs/core';
import * as compression from 'compression';
// Assumption: `timeout` is the connect-timeout Express middleware
import * as timeout from 'connect-timeout';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule, {
    logger:
      process.env.NODE_ENV === 'production'
        ? ['error', 'warn']
        : ['log', 'error', 'warn', 'debug'],
  });

  // Enable compression
  app.use(compression());

  // Enable caching: CacheInterceptor depends on injected providers, so register it
  // in AppModule through the APP_INTERCEPTOR token rather than calling `new` here.

  // Set global timeout
  app.use(timeout('30s'));

  await app.listen(3000);
}

bootstrap();
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🔒 Security Maintenance
|
||||
|
||||
### Monthly Security Tasks
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/security-maintenance.sh
|
||||
|
||||
# Update system packages
|
||||
apt-get update && apt-get upgrade -y
|
||||
|
||||
# Update ClamAV virus definitions
|
||||
docker exec lcbp3-clamav freshclam
|
||||
|
||||
# Scan for rootkits
|
||||
rkhunter --check --skip-keypress
|
||||
|
||||
# Check for unauthorized users
|
||||
awk -F: '($3 >= 1000) {print $1}' /etc/passwd
|
||||
|
||||
# Review sudo access
|
||||
cat /etc/sudoers
|
||||
|
||||
# Check firewall rules
|
||||
iptables -L -n -v
|
||||
|
||||
echo "Security maintenance completed: $(date)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ✅ Maintenance Checklist
|
||||
|
||||
### Pre-Maintenance
|
||||
|
||||
- [ ] Announce maintenance window to users
|
||||
- [ ] Backup database and files
|
||||
- [ ] Document current system state
|
||||
- [ ] Prepare rollback plan
|
||||
|
||||
### During Maintenance
|
||||
|
||||
- [ ] Put system in maintenance mode (if needed)
|
||||
- [ ] Perform updates/changes
|
||||
- [ ] Run smoke tests
|
||||
- [ ] Monitor system health
|
||||
|
||||
### Post-Maintenance
|
||||
|
||||
- [ ] Verify all services running
|
||||
- [ ] Run full test suite
|
||||
- [ ] Monitor performance metrics
|
||||
- [ ] Communicate completion to users
|
||||
- [ ] Document changes made
|
||||
|
||||
---
|
||||
|
||||
## 🔧 Emergency Maintenance
|
||||
|
||||
### Unplanned Maintenance Procedures
|
||||
|
||||
1. **Assess Urgency**
|
||||
|
||||
- Can it wait for scheduled maintenance?
|
||||
- Is it causing active issues?
|
||||
|
||||
2. **Communicate Impact**
|
||||
|
||||
- Notify stakeholders immediately
|
||||
- Estimate downtime
|
||||
- Provide updates every 30 minutes
|
||||
|
||||
3. **Execute Carefully**
|
||||
|
||||
- Always backup first
|
||||
- Have rollback plan ready
|
||||
- Test in staging if possible
|
||||
|
||||
4. **Post-Maintenance Review**
|
||||
- Document what happened
|
||||
- Identify preventive measures
|
||||
- Update runbooks
|
||||
|
||||
---
|
||||
|
||||
## 📚 Related Documents
|
||||
|
||||
- [Deployment Guide](./deployment-guide.md)
|
||||
- [Backup & Recovery](./backup-recovery.md)
|
||||
- [Monitoring & Alerting](./monitoring-alerting.md)
|
||||
|
||||
---
|
||||
|
||||
**Version:** 1.5.0
|
||||
**Last Review:** 2025-12-01
|
||||
**Next Review:** 2026-03-01
|
||||
443
specs/04-operations/monitoring-alerting.md
Normal file
@@ -0,0 +1,443 @@
|
||||
# Monitoring & Alerting
|
||||
|
||||
**Project:** LCBP3-DMS
|
||||
**Version:** 1.5.0
|
||||
**Last Updated:** 2025-12-01
|
||||
|
||||
---
|
||||
|
||||
## 📋 Overview
|
||||
|
||||
This document describes monitoring setup, health checks, and alerting rules for LCBP3-DMS.
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Monitoring Objectives
|
||||
|
||||
- **Availability:** System uptime > 99.5%
|
||||
- **Performance:** API response time < 500ms (P95)
|
||||
- **Reliability:** Error rate < 1%
|
||||
- **Capacity:** Resource utilization < 80%
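
To make the availability objective concrete, the sketch below converts an uptime target into a downtime budget. The function name `downtime_budget_minutes` is hypothetical, and the 30-day window in the example is an assumption; the source does not state the measurement period.

```bash
#!/bin/bash
# downtime_budget_minutes TARGET_PERCENT DAYS
# Prints the allowed downtime in minutes (rounded) for an availability target
# over a window of DAYS days. Hypothetical helper for illustration.
downtime_budget_minutes() (
  awk -v target="$1" -v days="$2" \
    'BEGIN { printf "%.0f\n", (100 - target) / 100 * days * 24 * 60 }'
)
```

Under these assumptions, `downtime_budget_minutes 99.5 30` prints `216`, so a 99.5% target leaves roughly 3.6 hours of downtime per 30 days.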
|
||||
|
||||
---
|
||||
|
||||
## 📊 Key Metrics
|
||||
|
||||
### Application Metrics
|
||||
|
||||
| Metric                  | Target  | Alert Threshold    |
| ----------------------- | ------- | ------------------ |
| API Response Time (P95) | < 500ms | > 1000ms           |
| Error Rate              | < 1%    | > 5%               |
| Request Rate            | N/A     | Sudden ±50% change |
| Active Users            | N/A     | -                  |
| Queue Length (BullMQ)   | < 100   | > 500              |
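
The "Sudden ±50% change" threshold for request rate needs a baseline to compare against. One possible reading, sketched below, compares the current rate with the previous sample; the function name `rate_change_alert`, the use of the previous sample as baseline, and treating exactly 50% as an alert are all assumptions.

```bash
#!/bin/bash
# rate_change_alert PREVIOUS_RATE CURRENT_RATE
# Prints "alert" when the rate moved by 50% or more in either direction,
# "ok" otherwise, and "no-baseline" when the previous rate is zero.
# Integer rates only; hypothetical helper for illustration.
rate_change_alert() (
  prev=$1
  cur=$2
  if [ "$prev" -le 0 ]; then
    echo "no-baseline"
    exit 0
  fi
  diff=$(( cur - prev ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  # Compare diff/prev >= 0.5 without floating point
  if [ $(( diff * 100 )) -ge $(( prev * 50 )) ]; then
    echo "alert"
  else
    echo "ok"
  fi
)
```

For example, a jump from 100 to 160 requests per second prints `alert`, while 100 to 120 prints `ok`.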
|
||||
|
||||
### Infrastructure Metrics
|
||||
|
||||
| Metric       | Target | Alert Threshold   |
| ------------ | ------ | ----------------- |
| CPU Usage    | < 70%  | > 90%             |
| Memory Usage | < 80%  | > 95%             |
| Disk Usage   | < 80%  | > 90%             |
| Network I/O  | N/A    | Anomaly detection |
|
||||
|
||||
### Database Metrics
|
||||
|
||||
| Metric                | Target  | Alert Threshold |
| --------------------- | ------- | --------------- |
| Query Time (P95)      | < 100ms | > 500ms         |
| Connection Pool Usage | < 80%   | > 95%           |
| Slow Queries          | 0       | > 10/min        |
| Replication Lag       | 0s      | > 30s           |
|
||||
|
||||
---
|
||||
|
||||
## 🔍 Health Checks
|
||||
|
||||
### Backend Health Endpoint
|
||||
|
||||
```typescript
|
||||
// File: backend/src/health/health.controller.ts
|
||||
import { Controller, Get, Inject } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  TypeOrmHealthIndicator,
  DiskHealthIndicator,
} from '@nestjs/terminus';

@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: TypeOrmHealthIndicator,
    private disk: DiskHealthIndicator,
    // Redis client used by the Redis check below. The 'REDIS_CLIENT' token is an
    // assumption; substitute whichever provider the project registers.
    @Inject('REDIS_CLIENT') private redis: { ping(): Promise<string> }
  ) {}
|
||||
|
||||
@Get()
|
||||
@HealthCheck()
|
||||
check() {
|
||||
return this.health.check([
|
||||
// Database health
|
||||
() => this.db.pingCheck('database'),
|
||||
|
||||
// Disk health
|
||||
() =>
|
||||
this.disk.checkStorage('storage', {
|
||||
path: '/',
|
||||
thresholdPercent: 0.9,
|
||||
}),
|
||||
|
||||
// Redis health
|
||||
async () => {
|
||||
const redis = await this.redis.ping();
|
||||
return { redis: { status: redis === 'PONG' ? 'up' : 'down' } };
|
||||
},
|
||||
]);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Health Check Response
|
||||
|
||||
```json
|
||||
{
|
||||
"status": "ok",
|
||||
"info": {
|
||||
"database": {
|
||||
"status": "up"
|
||||
},
|
||||
"storage": {
|
||||
"status": "up",
|
||||
"freePercent": 0.75
|
||||
},
|
||||
"redis": {
|
||||
"status": "up"
|
||||
}
|
||||
},
|
||||
"error": {},
|
||||
"details": {
|
||||
"database": {
|
||||
"status": "up"
|
||||
},
|
||||
"storage": {
|
||||
"status": "up",
|
||||
"freePercent": 0.75
|
||||
},
|
||||
"redis": {
|
||||
"status": "up"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🐳 Docker Container Monitoring
|
||||
|
||||
### Health Check in docker-compose.yml
|
||||
|
||||
```yaml
|
||||
services:
  backend:
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:3000/health']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  mariadb:
    healthcheck:
      test: ['CMD', 'mysqladmin', 'ping', '-h', 'localhost']
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    healthcheck:
      test: ['CMD', 'redis-cli', 'ping']
      interval: 30s
      timeout: 10s
      retries: 3
|
||||
```
|
||||
|
||||
### Monitor Container Status
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/monitor-containers.sh
|
||||
|
||||
# Check all containers are healthy
|
||||
CONTAINERS=("lcbp3-backend" "lcbp3-frontend" "lcbp3-mariadb" "lcbp3-redis")
|
||||
|
||||
for CONTAINER in "${CONTAINERS[@]}"; do
  # Containers without a healthcheck have no .State.Health, so fall back to the run state
  HEALTH=$(docker inspect --format='{{if .State.Health}}{{.State.Health.Status}}{{else}}{{.State.Status}}{{end}}' "$CONTAINER" 2>/dev/null)

  if [ "$HEALTH" != "healthy" ] && [ "$HEALTH" != "running" ]; then
    echo "ALERT: $CONTAINER is ${HEALTH:-missing}"
    # Send alert (email, Slack, etc.)
  fi
done
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📈 Application Performance Monitoring (APM)
|
||||
|
||||
### Log-Based Monitoring (MVP Phase)
|
||||
|
||||
```typescript
|
||||
// File: backend/src/common/interceptors/performance.interceptor.ts
|
||||
import {
|
||||
Injectable,
|
||||
NestInterceptor,
|
||||
ExecutionContext,
|
||||
CallHandler,
|
||||
} from '@nestjs/common';
|
||||
import { Observable } from 'rxjs';
|
||||
import { tap } from 'rxjs/operators';
|
||||
import { logger } from 'src/config/logger.config';
|
||||
|
||||
@Injectable()
|
||||
export class PerformanceInterceptor implements NestInterceptor {
|
||||
intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
|
||||
const request = context.switchToHttp().getRequest();
|
||||
const start = Date.now();
|
||||
|
||||
return next.handle().pipe(
|
||||
tap({
|
||||
next: () => {
|
||||
const duration = Date.now() - start;
|
||||
|
||||
logger.info('Request completed', {
|
||||
method: request.method,
|
||||
url: request.url,
|
||||
statusCode: context.switchToHttp().getResponse().statusCode,
|
||||
duration: `${duration}ms`,
|
||||
userId: request.user?.user_id,
|
||||
});
|
||||
|
||||
// Alert on slow requests
|
||||
if (duration > 1000) {
|
||||
logger.warn('Slow request detected', {
|
||||
method: request.method,
|
||||
url: request.url,
|
||||
duration: `${duration}ms`,
|
||||
});
|
||||
}
|
||||
},
|
||||
error: (error) => {
|
||||
const duration = Date.now() - start;
|
||||
|
||||
logger.error('Request failed', {
|
||||
method: request.method,
|
||||
url: request.url,
|
||||
duration: `${duration}ms`,
|
||||
error: error.message,
|
||||
});
|
||||
},
|
||||
})
|
||||
);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🚨 Alerting Rules
|
||||
|
||||
### Critical Alerts (Immediate Action Required)
|
||||
|
||||
| Alert           | Condition                                   | Action                      |
| --------------- | ------------------------------------------- | --------------------------- |
| Service Down    | Health check fails for 3 consecutive checks | Page on-call engineer       |
| Database Down   | Cannot connect to database                  | Page DBA + on-call engineer |
| Disk Full       | Disk usage > 95%                            | Page operations team        |
| High Error Rate | Error rate > 10% for 5 min                  | Page on-call engineer       |
|
||||
|
||||
### Warning Alerts (Review Within 1 Hour)
|
||||
|
||||
| Alert         | Condition               | Action                 |
| ------------- | ----------------------- | ---------------------- |
| High CPU      | CPU > 90% for 10 min    | Notify operations team |
| High Memory   | Memory > 95% for 10 min | Notify operations team |
| Slow Queries  | > 50 slow queries/min   | Notify DBA             |
| Queue Backlog | BullMQ queue > 500 jobs | Notify backend team    |
|
||||
|
||||
### Info Alerts (Review During Business Hours)
|
||||
|
||||
| Alert              | Condition                            | Action                |
| ------------------ | ------------------------------------ | --------------------- |
| Backup Failed      | Daily backup job failed              | Email operations team |
| SSL Expiring       | SSL certificate expires in < 30 days | Email operations team |
| Disk Space Warning | Disk usage > 80%                     | Email operations team |
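
Disk usage appears in two tiers in the rules above: above 95% is critical, above 80% is informational. A minimal sketch of that tiering follows; the function name `disk_alert_level` and the tier labels it prints are assumptions for illustration.

```bash
#!/bin/bash
# disk_alert_level USAGE_PERCENT
# Maps an integer disk usage percentage to the alert tier from the rules:
#   > 95  -> critical (page operations team)
#   > 80  -> info     (email operations team)
#   else  -> ok
# Hypothetical helper; tier names are illustrative.
disk_alert_level() (
  usage=$1
  if [ "$usage" -gt 95 ]; then
    echo "critical"
  elif [ "$usage" -gt 80 ]; then
    echo "info"
  else
    echo "ok"
  fi
)
```

A monitoring script could feed it the percentage parsed from `df` and pick the notification channel from the result.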
|
||||
|
||||
---
|
||||
|
||||
## 📧 Alert Notification Channels
|
||||
|
||||
### Email Alerts
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/send-alert-email.sh
|
||||
|
||||
TO="ops-team@example.com"
|
||||
SUBJECT="$1"
|
||||
MESSAGE="$2"
|
||||
|
||||
echo "$MESSAGE" | mail -s "[LCBP3-DMS] $SUBJECT" "$TO"
|
||||
```
|
||||
|
||||
### Slack Alerts
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# File: /scripts/send-alert-slack.sh
|
||||
|
||||
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
|
||||
MESSAGE="$1"
|
||||
|
||||
curl -X POST -H 'Content-type: application/json' \
|
||||
--data "{\"text\":\"🚨 LCBP3-DMS Alert: $MESSAGE\"}" \
|
||||
"$WEBHOOK_URL"
|
||||
```
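
One caveat with building the Slack payload by string interpolation is that a message containing double quotes or backslashes produces invalid JSON. The sketch below escapes those two characters before the message is embedded; `json_escape` is a hypothetical helper and assumes a single-line message (newlines and control characters are not handled).

```bash
#!/bin/bash
# json_escape STRING
# Escapes backslashes and double quotes so STRING can sit inside a JSON string.
# Backslashes are escaped first so the quotes' escapes are not doubled.
# Hypothetical helper; single-line input only.
json_escape() (
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
)

# Possible use with the variables from send-alert-slack.sh:
#   curl -X POST -H 'Content-type: application/json' \
#     --data "{\"text\":\"$(json_escape "$MESSAGE")\"}" "$WEBHOOK_URL"
```

If `jq` is available on the host, building the payload with it would be a more complete alternative.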
|
||||
|
||||
---

## 📊 Monitoring Dashboard

### Metrics to Display

**System Overview:**

- Service status (up/down)
- Overall system health score
- Active user count
- Request rate (req/s)

**Performance:**

- API response time (P50, P95, P99)
- Database query time
- Queue processing time

**Resources:**

- CPU usage %
- Memory usage %
- Disk usage %
- Network I/O

**Business Metrics:**

- Documents created today
- Workflows completed today
- Active correspondences
- Pending approvals

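The P50/P95/P99 response-time figures listed under Performance can be derived from raw timings in several ways. The sketch below uses the nearest-rank method and assumes one response time per line on standard input; `percentile` is a hypothetical helper, not an existing script.

```bash
#!/bin/bash
# Sketch only: `percentile` is an illustrative helper, not an existing script.

# Read one numeric response time per line on stdin and print the P-th
# percentile using the nearest-rank method: rank = ceil(P/100 * N).
percentile() {
  sort -n | awk -v p="$1" '
    { values[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # integer ceiling of p*NR/100
      if (rank < 1) rank = 1
      print values[rank]
    }'
}

# Example (response-times.txt is a hypothetical file of timings in ms):
#   percentile 95 < response-times.txt
```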
---

## 🔧 Log Aggregation

### Centralized Logging with Docker

Configure the Docker logging driver in `/etc/docker/daemon.json`. Restart the Docker daemon after editing; the settings apply to newly created containers.

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "labels": "service,environment"
  }
}
```

### View Aggregated Logs

```bash
# View all LCBP3 container logs
docker-compose logs -f --tail=100

# View specific service logs
docker logs lcbp3-backend -f --since=1h

# Search logs (2>&1 so that lines written to stderr are searched too)
docker logs lcbp3-backend 2>&1 | grep "ERROR"

# Export logs for analysis (stdout and stderr)
docker logs lcbp3-backend > backend-logs.txt 2>&1
```

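For a quick summary rather than a raw search, the log stream can be piped through a small counter. This is a sketch under the assumption that each log line carries its level as a whole word (`ERROR` or `WARN`); the backend's real log format may differ, and `count_log_levels` is a hypothetical name.

```bash
#!/bin/bash
# Sketch only: assumes each log line contains its level as a whole word
# (ERROR or WARN); adjust the patterns to the backend's real log format.

# Read log lines on stdin and print "ERROR=<n> WARN=<n>".
count_log_levels() {
  awk '
    /(^|[^A-Za-z])ERROR([^A-Za-z]|$)/ { errors++ }
    /(^|[^A-Za-z])WARN([^A-Za-z]|$)/  { warns++ }
    END { printf "ERROR=%d WARN=%d\n", errors, warns }'
}

# Example:
#   docker logs lcbp3-backend --since=1h 2>&1 | count_log_levels
```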
---

## 📈 Performance Baseline

### Establish Baselines

Run load tests to establish performance baselines:

```bash
# Install Apache Bench
apt-get install apache2-utils

# Test API endpoint
ab -n 1000 -c 10 \
  -H "Authorization: Bearer <TOKEN>" \
  https://lcbp3-dms.example.com/api/correspondences

# Results to record:
# - Requests per second
# - Mean response time
# - P95 response time
# - Error rate
```

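The four figures to record can be pulled out of a saved `ab` report instead of being copied by hand. The sketch below assumes the standard text report that `ab` prints; `summarize_ab` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch only: `summarize_ab` is an illustrative helper, not an existing script.

# Read an Apache Bench text report on stdin and print the baseline figures.
summarize_ab() {
  awk '
    /^Requests per second:/            { rps = $4 }
    /^Time per request:/ && !seen_mean { mean_ms = $4; seen_mean = 1 }
    /^Complete requests:/              { complete = $3 }
    /^Failed requests:/                { failed = $3 }
    $1 == "95%"                        { p95_ms = $2 }
    END {
      error_rate = (complete > 0) ? 100 * failed / complete : 0
      printf "rps=%s mean_ms=%s p95_ms=%s error_rate=%.2f%%\n", rps, mean_ms, p95_ms, error_rate
    }'
}
```

A run could then be recorded with `ab -n 1000 -c 10 <URL> | summarize_ab`. Note that `ab` reports non-2xx responses on a separate line, so the error rate here covers only what `ab` itself counts as failed requests.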
### Regular Performance Testing

- **Weekly:** Quick health check (100 requests)
- **Monthly:** Full load test (10,000 requests)
- **Quarterly:** Stress test (find breaking point)

---

## ✅ Monitoring Checklist

### Daily

- [ ] Check service health dashboard
- [ ] Review error logs
- [ ] Verify backup completion
- [ ] Check disk space

### Weekly

- [ ] Review performance metrics trends
- [ ] Analyze slow query log
- [ ] Check SSL certificate expiry
- [ ] Review security alerts

### Monthly

- [ ] Capacity planning review
- [ ] Update monitoring thresholds
- [ ] Test alert notifications
- [ ] Review and tune performance

---

## 🔗 Related Documents

- [Backup & Recovery](./backup-recovery.md)
- [Incident Response](./incident-response.md)
- [ADR-010: Logging Strategy](../05-decisions/ADR-010-logging-monitoring-strategy.md)

---

**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01
0
specs/04-operations/monitoring.md
Normal file
444
specs/04-operations/security-operations.md
Normal file
@@ -0,0 +1,444 @@
# Security Operations

**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01

---

## 📋 Overview

This document outlines security monitoring, access control management, vulnerability management, and security incident response for LCBP3-DMS.

---

## 🔒 Access Control Management

### User Access Review

**Monthly Tasks:**

```bash
#!/bin/bash
# File: /scripts/audit-user-access.sh
# Assumes DB_ROOT_PASSWORD is exported (hypothetical variable name);
# a bare -p prompts for a password and fails when run without a TTY.

# Export active users (mysql emits tab-separated rows when not on a terminal)
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASSWORD" -e "
  SELECT user_id, username, email, primary_organization_id, is_active, last_login_at
  FROM lcbp3_dms.users
  WHERE is_active = 1
  ORDER BY last_login_at DESC;
" > /reports/active-users-$(date +%Y%m%d).tsv

# Find dormant accounts (no login > 90 days)
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASSWORD" -e "
  SELECT user_id, username, email, last_login_at,
         DATEDIFF(NOW(), last_login_at) AS days_inactive
  FROM lcbp3_dms.users
  WHERE is_active = 1
    AND (last_login_at IS NULL OR last_login_at < DATE_SUB(NOW(), INTERVAL 90 DAY));
"

echo "User access audit completed: $(date)"
```

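The 90-day dormancy rule can also be applied outside the database, for example when post-processing the exported user list. The following is a minimal sketch that assumes GNU `date` and `YYYY-MM-DD` inputs; `days_between` and `is_dormant` are hypothetical helper names.

```bash
#!/bin/bash
# Sketch only: helper names are illustrative; requires GNU date (-d).

# Print the whole number of days between two YYYY-MM-DD dates (UTC).
days_between() {
  local start end
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( (end - start) / 86400 ))
}

# Succeed when the last login ($1) is more than 90 days before $2.
is_dormant() {
  [ "$(days_between "$1" "$2")" -gt 90 ]
}
```

For instance, `is_dormant 2025-08-15 "$(date -u +%F)"` succeeds once more than 90 days have passed since 15 August 2025, which matches the `INTERVAL 90 DAY` condition in the audit query.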
### Role & Permission Audit

```sql
-- Review users with elevated permissions
SELECT u.username, u.email, r.role_name, r.scope
FROM users u
JOIN user_assignments ua ON u.user_id = ua.user_id
JOIN roles r ON ua.role_id = r.role_id
WHERE r.role_name IN ('Superadmin', 'Document Controller', 'Project Manager')
ORDER BY r.role_name, u.username;

-- Review Global scope roles (highest privilege)
SELECT u.username, r.role_name
FROM users u
JOIN user_assignments ua ON u.user_id = ua.user_id
JOIN roles r ON ua.role_id = r.role_id
WHERE r.scope = 'Global';
```

---

## 🛡️ Security Monitoring

### Log Monitoring for Security Events

```bash
#!/bin/bash
# File: /scripts/monitor-security-events.sh
# 2>&1 is needed because docker logs replays the container's stderr on stderr

# Check for failed login attempts
docker logs lcbp3-backend 2>&1 | grep "Failed login" | tail -20

# Check for unauthorized access attempts (403 as a whole word)
docker logs lcbp3-backend 2>&1 | grep -w "403" | tail -20

# Check for unusual activity patterns
docker logs lcbp3-backend 2>&1 | grep -E "DELETE|DROP|TRUNCATE" | tail -20

# Check for SQL injection attempts
# "legitimate" is a placeholder: replace it with a pattern matching known-good query logging
docker logs lcbp3-backend 2>&1 | grep -i "SELECT.*FROM.*WHERE" | grep -v "legitimate" | tail -20
```

### Failed Login Monitoring

```sql
-- Find accounts with multiple failed login attempts
SELECT username, failed_attempts, locked_until
FROM users
WHERE failed_attempts >= 3
ORDER BY failed_attempts DESC;

-- Unlock user account after verification
UPDATE users
SET failed_attempts = 0, locked_until = NULL
WHERE user_id = ?;
```

---

## 🔐 Secrets & Credentials Management

### Password Rotation Schedule

| Credential             | Rotation Frequency       | Owner        |
| ---------------------- | ------------------------ | ------------ |
| Database Root Password | Every 90 days            | DBA          |
| Database App Password  | Every 90 days            | DevOps       |
| JWT Secret             | Every 180 days           | Backend Team |
| Redis Password         | Every 90 days            | DevOps       |
| SMTP Password          | When provider requires   | Operations   |
| SSL Private Key        | With certificate renewal | Operations   |

### Password Rotation Procedure

```bash
#!/bin/bash
# File: /scripts/rotate-db-password.sh
# Assumes DB_ROOT_PASSWORD is exported (hypothetical variable name);
# a bare -p prompts for a password and fails when run without a TTY.

# Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)

# Update database user password
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASSWORD" -e "
  ALTER USER 'lcbp3_user'@'%' IDENTIFIED BY '$NEW_PASSWORD';
  FLUSH PRIVILEGES;
"

# Update application .env file
# '|' is the sed delimiter because base64 output can contain '/'
sed -i "s|^DB_PASS=.*|DB_PASS=$NEW_PASSWORD|" /app/backend/.env

# Restart backend to apply new password
docker restart lcbp3-backend

# Verify connection
sleep 10
curl -f http://localhost:3000/health || {
  echo "FAILED: Backend cannot connect with new password"
  # Rollback procedure...
  exit 1
}

echo "Database password rotated successfully: $(date)"
# Store password securely (e.g., password manager)
```

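Editing the `.env` file with `sed` depends on the new value containing no character that is special to the substitution. One alternative, shown here only as a sketch, rewrites the file without `sed` so the value needs no escaping; `set_env_value` is a hypothetical helper, and it moves the key to the end of the file.

```bash
#!/bin/bash
# Sketch only: `set_env_value` is an illustrative helper, not an existing script.

# Replace (or append) KEY=VALUE in an env file without sed, so characters
# such as '/', '&' or '|' in the value need no escaping.
set_env_value() {
  local file="$1" key="$2" value="$3" tmp
  tmp=$(mktemp) || return 1
  # Keep every line that does not define KEY (grep exits 1 when nothing is left).
  grep -v "^${key}=" "$file" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Overwrite in place so the original file keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example:
#   set_env_value /app/backend/.env DB_PASS "$NEW_PASSWORD"
```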
---

## 🚨 Vulnerability Management

### Dependency Vulnerability Scanning

```bash
#!/bin/bash
# File: /scripts/scan-vulnerabilities.sh
# --omit=dev replaces the deprecated --production flag on npm 8+;
# keep --production if the host still runs an older npm.

# Backend dependencies
cd /app/backend || exit 1
npm audit --omit=dev

# Critical/High vulnerabilities
VULNERABILITIES=$(npm audit --omit=dev --json | jq '.metadata.vulnerabilities.high + .metadata.vulnerabilities.critical')

if [ "$VULNERABILITIES" -gt 0 ]; then
  echo "WARNING: $VULNERABILITIES critical/high vulnerabilities found!"
  npm audit --omit=dev > /reports/security-audit-$(date +%Y%m%d).txt
  # Send alert
  /scripts/send-alert-email.sh "Security Vulnerabilities Detected" "Found $VULNERABILITIES critical/high vulnerabilities"
fi

# Frontend dependencies
cd /app/frontend || exit 1
npm audit --omit=dev
```

### Container Image Scanning

```bash
#!/bin/bash
# File: /scripts/scan-images.sh

# Install Trivy (if not installed)
# wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | apt-key add -
# echo "deb https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main" | tee -a /etc/apt/sources.list.d/trivy.list
# apt-get update && apt-get install trivy

# Scan Docker images
trivy image --severity HIGH,CRITICAL lcbp3-backend:latest
trivy image --severity HIGH,CRITICAL lcbp3-frontend:latest
trivy image --severity HIGH,CRITICAL mariadb:10.11
trivy image --severity HIGH,CRITICAL redis:7.2-alpine
```

---

## 🔍 Security Hardening

### Server Hardening Checklist

- [ ] Disable root SSH login
- [ ] Use SSH key authentication only
- [ ] Configure firewall (allow only necessary ports)
- [ ] Enable automatic security updates
- [ ] Remove unnecessary services
- [ ] Configure fail2ban for brute-force protection
- [ ] Enable SELinux/AppArmor
- [ ] Regular security patch updates

### Docker Security

```yaml
# docker-compose.yml - Security best practices

services:
  backend:
    # Run as non-root user
    user: 'node:node'

    # Read-only root filesystem
    read_only: true

    # No new privileges
    security_opt:
      - no-new-privileges:true

    # Limit capabilities
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          memory: 512M
```

### Database Security

```sql
-- MariaDB 10.4+ keeps accounts in mysql.global_priv; mysql.user is a view there,
-- so the account rows are removed from global_priv instead.

-- Remove anonymous users
DELETE FROM mysql.global_priv WHERE User='';

-- Remove test database
DROP DATABASE IF EXISTS test;

-- Remove remote root login
DELETE FROM mysql.global_priv WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1');

-- Create dedicated backup user with minimal privileges
CREATE USER 'backup_user'@'localhost' IDENTIFIED BY 'STRONG_PASSWORD';
GRANT SELECT, LOCK TABLES, SHOW VIEW, EVENT, TRIGGER ON lcbp3_dms.* TO 'backup_user'@'localhost';

-- Enable SSL for database connections
-- GRANT USAGE ON *.* TO 'lcbp3_user'@'%' REQUIRE SSL;

FLUSH PRIVILEGES;
```

---

## 🚨 Security Incident Response

### Incident Classification

| Type                    | Examples                     | Response Time    |
| ----------------------- | ---------------------------- | ---------------- |
| **Data Breach**         | Unauthorized data access     | Immediate (< 1h) |
| **Account Compromise**  | Stolen credentials           | Immediate (< 1h) |
| **DDoS Attack**         | Service unavailable          | Immediate (< 1h) |
| **Malware/Ransomware**  | Infected systems             | Immediate (< 1h) |
| **Unauthorized Access** | Failed authentication spikes | High (< 4h)      |
| **Suspicious Activity** | Unusual patterns             | Medium (< 24h)   |

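If alerting scripts need the response deadline for a given incident type, the classification table can be encoded as a simple lookup. This is a sketch only; `response_deadline_hours` is a hypothetical function and the hours are taken from the Response Time column.

```bash
#!/bin/bash
# Sketch only: `response_deadline_hours` is an illustrative helper.

# Map an incident type from the classification table to its response deadline in hours.
response_deadline_hours() {
  case "$1" in
    "Data Breach"|"Account Compromise"|"DDoS Attack"|"Malware/Ransomware") echo 1 ;;
    "Unauthorized Access") echo 4 ;;
    "Suspicious Activity") echo 24 ;;
    *) echo "unknown incident type: $1" >&2; return 1 ;;
  esac
}
```

A caller might include the result in a notification, for example `/scripts/send-alert-email.sh "Data Breach" "Respond within $(response_deadline_hours "Data Breach")h"`.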
### Data Breach Response

**Immediate Actions:**

1. **Contain the breach**

   ```bash
   # Block suspicious IPs at firewall level
   iptables -A INPUT -s <SUSPICIOUS_IP> -j DROP

   # Disable compromised user accounts
   # -it gives mysql a terminal so the -p password prompt works
   docker exec -it lcbp3-mariadb mysql -u root -p -e "
     UPDATE lcbp3_dms.users
     SET is_active = 0
     WHERE user_id = <COMPROMISED_USER_ID>;
   "
   ```

2. **Assess impact**

   ```sql
   -- Check audit logs for unauthorized access
   SELECT * FROM audit_logs
   WHERE user_id = <COMPROMISED_USER_ID>
     AND created_at >= '<SUSPECTED_START_TIME>'
   ORDER BY created_at DESC;

   -- Check what documents were accessed
   SELECT DISTINCT entity_id, entity_type, action
   FROM audit_logs
   WHERE user_id = <COMPROMISED_USER_ID>;
   ```

3. **Notify stakeholders**

   - Security officer
   - Management
   - Affected users (if applicable)
   - Legal team (if required by law)

4. **Document everything**
   - Timeline of events
   - Data accessed/compromised
   - Actions taken
   - Lessons learned

### Account Compromise Response

```bash
#!/bin/bash
# File: /scripts/respond-account-compromise.sh
# Assumes DB_ROOT_PASSWORD is exported (hypothetical variable name);
# a bare -p prompts for a password and fails when run without a TTY.

USER_ID=$1

# Accept only a plain number before it is interpolated into SQL
case "$USER_ID" in
  ''|*[!0-9]*) echo "Usage: $0 <numeric user id>" >&2; exit 1 ;;
esac

# 1. Immediately disable account
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASSWORD" -e "
  UPDATE lcbp3_dms.users
  SET is_active = 0,
      locked_until = DATE_ADD(NOW(), INTERVAL 24 HOUR)
  WHERE user_id = $USER_ID;
"

# 2. Invalidate all sessions
# DEL does not expand wildcards, so matching keys are listed with --scan first
docker exec lcbp3-redis sh -c \
  "redis-cli --scan --pattern 'session:user:$USER_ID:*' | xargs -r redis-cli DEL"

# 3. Generate audit report
docker exec lcbp3-mariadb mysql -u root -p"$DB_ROOT_PASSWORD" -e "
  SELECT * FROM lcbp3_dms.audit_logs
  WHERE user_id = $USER_ID
    AND created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
  ORDER BY created_at DESC;
" > /reports/compromise-audit-user-$USER_ID-$(date +%Y%m%d).txt

# 4. Notify security team
/scripts/send-alert-email.sh "Account Compromise" "User ID $USER_ID has been compromised and disabled"

echo "Account compromise response completed for User ID: $USER_ID"
```

---

## 📊 Security Metrics & KPIs

### Monthly Security Report

| Metric                      | Target    | Actual |
| --------------------------- | --------- | ------ |
| Failed Login Attempts       | < 100/day | Track  |
| Locked Accounts             | < 5/month | Track  |
| Critical Vulnerabilities    | 0         | Track  |
| High Vulnerabilities        | < 5       | Track  |
| Unpatched Systems           | 0         | Track  |
| Security Incidents          | 0         | Track  |
| Mean Time To Detect (MTTD)  | < 1 hour  | Track  |
| Mean Time To Respond (MTTR) | < 4 hours | Track  |

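MTTD and MTTR are averages over the month's incidents. The sketch below assumes one duration in minutes per line on standard input (a hypothetical input format) and checks the mean against a target such as 240 minutes for the 4-hour MTTR goal; both helper names are illustrative.

```bash
#!/bin/bash
# Sketch only: helper names and the one-duration-per-line input are assumptions.

# Read one duration in minutes per line on stdin and print the mean.
mean_minutes() {
  awk '{ total += $1; n++ } END { if (n == 0) exit 1; printf "%.1f\n", total / n }'
}

# Succeed when the mean ($1, minutes) is strictly below the target ($2, minutes).
meets_target() {
  awk -v mean="$1" -v target="$2" 'BEGIN { exit !(mean < target) }'
}

# Example (240 minutes = the 4-hour MTTR target):
#   mttr=$(mean_minutes < response-durations.txt)
#   meets_target "$mttr" 240 || echo "MTTR target missed: ${mttr} min"
```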
---

## 🔐 Compliance & Audit

### Audit Log Retention

- **Access Logs:** 1 year
- **Security Events:** 2 years
- **Admin Actions:** 3 years
- **Data Changes:** 7 years (as required)

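A purge job needs to turn these retention periods into a cutoff date. The following is a minimal sketch assuming `YYYY-MM-DD` dates; `retention_years` and `retention_cutoff` are hypothetical helpers, and the year arithmetic does not adjust 29 February.

```bash
#!/bin/bash
# Sketch only: helper names are illustrative, not part of LCBP3-DMS.

# Print the retention period in years for a log category from the list above.
retention_years() {
  case "$1" in
    "Access Logs") echo 1 ;;
    "Security Events") echo 2 ;;
    "Admin Actions") echo 3 ;;
    "Data Changes") echo 7 ;;
    *) return 1 ;;
  esac
}

# Print the cutoff date for category $1 relative to reference date $2
# (YYYY-MM-DD). Records older than the cutoff are past retention.
# Note: plain year subtraction, so 29 February is not adjusted.
retention_cutoff() {
  local years year rest
  years=$(retention_years "$1") || return 1
  year=${2%%-*}
  rest=${2#*-}
  echo "$((year - years))-$rest"
}
```

For example, `retention_cutoff "Security Events" 2025-12-01` prints `2023-12-01`.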
### Compliance Checklist

- [ ] Regular security audits (quarterly)
- [ ] Penetration testing (annually)
- [ ] Access control reviews (monthly)
- [ ] Encryption at rest and in transit
- [ ] Secure password policies enforced
- [ ] Multi-factor authentication (if required)
- [ ] Data backup and recovery tested
- [ ] Incident response plan documented and tested

---

## ✅ Security Operations Checklist

### Daily

- [ ] Review security alerts and logs
- [ ] Monitor failed login attempts
- [ ] Check for unusual access patterns
- [ ] Verify backup completion

### Weekly

- [ ] Review user access logs
- [ ] Scan for vulnerabilities
- [ ] Update virus definitions
- [ ] Review firewall logs

### Monthly

- [ ] User access audit
- [ ] Role and permission review
- [ ] Security patch application
- [ ] Compliance review

### Quarterly

- [ ] Full security audit
- [ ] Penetration testing
- [ ] Disaster recovery drill
- [ ] Update security policies

---

## 🔗 Related Documents

- [Incident Response](./incident-response.md)
- [Monitoring & Alerting](./monitoring-alerting.md)
- [ADR-004: RBAC Implementation](../05-decisions/ADR-004-rbac-implementation.md)

---

**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01