Main: revise specs to 1.5.0 (completed)

This commit is contained in:
2025-12-01 01:28:32 +07:00
parent 241022ada6
commit 71c091055a
69 changed files with 28252 additions and 74 deletions

View File

@@ -0,0 +1,190 @@
# Operations Documentation
**Project:** LCBP3-DMS (Laem Chabang Port Phase 3 - Document Management System)
**Version:** 1.5.0
**Last Updated:** 2025-12-01
---
## 📋 Overview
This directory contains operational documentation for deploying, maintaining, and monitoring the LCBP3-DMS system.
---
## 📚 Documentation Index
### Deployment & Infrastructure
| Document | Description | Status |
| ---------------------------------------------- | ------------------------------------------------------ | ----------- |
| [deployment-guide.md](./deployment-guide.md) | Docker deployment procedures on QNAP Container Station | ✅ Complete |
| [environment-setup.md](./environment-setup.md) | Environment variables and configuration management | ✅ Complete |
### Monitoring & Maintenance
| Document | Description | Status |
| -------------------------------------------------------- | --------------------------------------------------- | ----------- |
| [monitoring-alerting.md](./monitoring-alerting.md) | Monitoring setup, health checks, and alerting rules | ✅ Complete |
| [backup-recovery.md](./backup-recovery.md) | Backup strategies and disaster recovery procedures | ✅ Complete |
| [maintenance-procedures.md](./maintenance-procedures.md) | Routine maintenance and update procedures | ✅ Complete |
### Security & Compliance
| Document | Description | Status |
| -------------------------------------------------- | ---------------------------------------------- | ----------- |
| [security-operations.md](./security-operations.md) | Security monitoring and incident response | ✅ Complete |
| [incident-response.md](./incident-response.md) | Incident classification and response playbooks | ✅ Complete |
---
## 🚀 Quick Start for Operations Team
### Initial Setup
1. **Read Deployment Guide** - [deployment-guide.md](./deployment-guide.md)
2. **Configure Environment** - [environment-setup.md](./environment-setup.md)
3. **Set Up Monitoring** - [monitoring-alerting.md](./monitoring-alerting.md)
4. **Configure Backups** - [backup-recovery.md](./backup-recovery.md)
### Daily Operations
1. Monitor system health via logs and metrics
2. Review backup status (automated daily)
3. Check for security alerts
4. Review system performance metrics
### Weekly/Monthly Tasks
- Review and update SSL certificates (90 days before expiry)
- Database optimization and cleanup
- Log rotation and archival
- Security patch review and application
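
The certificate review lends itself to a scheduled check. The following is a minimal sketch, not part of the documented tooling: the certificate path `./nginx/ssl/cert.pem` and the helper name `cert_expires_within` are assumptions for illustration, and the 90-day default mirrors the review window listed above.

```bash
#!/bin/bash
# Sketch: warn when a PEM certificate expires within N days.
# CERT path and the helper name are illustrative assumptions.

cert_expires_within() {
  local cert="$1" days="$2"
  # `openssl x509 -checkend S` exits 0 if the certificate is still valid
  # S seconds from now, and non-zero if it will have expired by then.
  if openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$cert" >/dev/null; then
    return 1   # not expiring within the window
  fi
  return 0     # expiring within the window (or the file could not be read)
}

CERT="${CERT:-./nginx/ssl/cert.pem}"
WARN_DAYS="${WARN_DAYS:-90}"

if [ -f "$CERT" ]; then
  if cert_expires_within "$CERT" "$WARN_DAYS"; then
    echo "WARNING: $CERT expires within $WARN_DAYS days"
  else
    echo "OK: $CERT is valid for more than $WARN_DAYS days"
  fi
fi
```

Run from cron, the warning line can feed whatever alert channel the team already uses.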
---
## 🏗️ Infrastructure Overview
### QNAP Container Station Architecture
```mermaid
graph TB
subgraph "QNAP Server"
subgraph "Container Station"
NGINX[NGINX<br/>Reverse Proxy<br/>Port 80/443]
Backend[NestJS Backend<br/>Port 3000]
Frontend[Next.js Frontend<br/>Port 3001]
MariaDB[(MariaDB 10.11<br/>Port 3306)]
Redis[(Redis 7.2<br/>Port 6379)]
ES[(Elasticsearch<br/>Port 9200)]
end
Volumes[("Persistent Volumes<br/>- database<br/>- uploads<br/>- logs")]
end
Internet([Internet]) --> NGINX
NGINX --> Frontend
NGINX --> Backend
Backend --> MariaDB
Backend --> Redis
Backend --> ES
MariaDB --> Volumes
Backend --> Volumes
```
### Container Services
| Service | Container Name | Ports | Persistent Volume |
| ------------- | ------------------- | ------- | ----------------------------- |
| NGINX | lcbp3-nginx | 80, 443 | /config/nginx |
| Backend | lcbp3-backend | 3000 | /app/uploads, /app/logs |
| Frontend | lcbp3-frontend | 3001 | - |
| MariaDB | lcbp3-mariadb | 3306 | /var/lib/mysql |
| Redis | lcbp3-redis | 6379 | /data |
| Elasticsearch | lcbp3-elasticsearch | 9200 | /usr/share/elasticsearch/data |
---
## 👥 Roles & Responsibilities
### System Administrator
- Deploy and configure infrastructure
- Manage QNAP server and Container Station
- Configure networking and firewall rules
- SSL certificate management
### Database Administrator (DBA)
- Database backup and recovery
- Performance tuning and optimization
- Migration execution
- Access control management
### DevOps Engineer
- CI/CD pipeline maintenance
- Container orchestration
- Monitoring and alerting setup
- Log aggregation
### Security Officer
- Security monitoring
- Incident response coordination
- Access audit reviews
- Vulnerability management
---
## 📞 Support & Escalation
### Support Tiers
**Tier 1: User Support**
- User access issues
- Password resets
- Basic troubleshooting
**Tier 2: Technical Support**
- Application errors
- Performance issues
- Feature bugs
**Tier 3: Operations Team**
- Infrastructure failures
- Database issues
- Security incidents
### Escalation Path
1. **Minor Issues** → Tier 1/2 Support → Resolution within 24h
2. **Major Issues** → Tier 3 Operations → Resolution within 4h
3. **Critical Issues** → Immediate escalation to System Architect → Resolution within 1h
---
## 🔗 Related Documentation
- [Architecture Documentation](../02-architecture/)
- [Implementation Guidelines](../03-implementation/)
- [Architecture Decision Records](../05-decisions/)
- [Backend Development Tasks](../06-tasks/)
---
## 📝 Document Maintenance
- **Review Frequency:** Monthly
- **Owner:** Operations Team
- **Last Review:** 2025-12-01
- **Next Review:** 2026-01-01
---
**Version:** 1.5.0
**Status:** Active
**Classification:** Internal Use Only

View File

@@ -0,0 +1,374 @@
# Backup & Recovery Procedures
**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01
---
## 📋 Overview
This document outlines backup strategies, recovery procedures, and disaster recovery planning for LCBP3-DMS.
---
## 🎯 Backup Strategy
### Backup Schedule
| Data Type | Frequency | Retention | Method |
| ---------------------- | -------------- | --------- | ----------------------- |
| Database (Full) | Daily at 02:00 | 30 days | mysqldump + compression |
| Database (Incremental) | Every 6 hours | 7 days | Binary logs |
| File Uploads | Daily at 03:00 | 30 days | rsync to backup server |
| Configuration Files | Weekly | 90 days | Git repository |
| Elasticsearch Indexes | Weekly | 14 days | Snapshot to S3/NFS |
| Application Logs | Daily | 90 days | Rotation + archival |
### Backup Locations
**Primary Backup:** QNAP NAS `/backup/lcbp3-dms`
**Secondary Backup:** External backup server (rsync)
**Offsite Backup:** Cloud storage (optional - for critical data)
---
## 💾 Database Backup
### Automated Daily Backup Script
```bash
#!/bin/bash
# File: /scripts/backup-database.sh
# Make the pipeline fail if mysqldump fails, not only if gzip fails
set -o pipefail
# Configuration
BACKUP_DIR="/backup/lcbp3-dms/database"
DB_CONTAINER="lcbp3-mariadb"
DB_NAME="lcbp3_dms"
DB_USER="backup_user"
DB_PASS="<BACKUP_USER_PASSWORD>"
RETENTION_DAYS=30
# Create backup directory and build a timestamped file name
mkdir -p "$BACKUP_DIR"
BACKUP_FILE="$BACKUP_DIR/lcbp3_$(date +%Y%m%d_%H%M%S).sql.gz"
# Perform backup
echo "Starting database backup to $BACKUP_FILE"
docker exec "$DB_CONTAINER" mysqldump \
  --user="$DB_USER" \
  --password="$DB_PASS" \
  --single-transaction \
  --routines \
  --triggers \
  --databases "$DB_NAME" \
  | gzip > "$BACKUP_FILE"
# Check backup success (with pipefail, $? covers mysqldump as well as gzip)
if [ $? -eq 0 ]; then
  echo "Backup completed successfully"
  # Delete old backups
  find "$BACKUP_DIR" -name "*.sql.gz" -type f -mtime +$RETENTION_DAYS -delete
  echo "Old backups cleaned up (retention: $RETENTION_DAYS days)"
else
  echo "ERROR: Backup failed!"
  # Remove the partial file so it is not mistaken for a valid backup
  rm -f "$BACKUP_FILE"
  exit 1
fi
```
### Schedule with Cron
```bash
# Edit crontab
crontab -e
# Add backup job (runs daily at 2 AM)
0 2 * * * /scripts/backup-database.sh >> /var/log/backup-database.log 2>&1
```
### Manual Database Backup
```bash
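# Prompt for the root password on the host
# (docker exec without a TTY cannot answer the -p prompt)
read -rs -p "MariaDB root password: " DB_ROOT_PASS; echo
# Backup specific database
BACKUP_FILE="backup_$(date +%Y%m%d).sql"
docker exec -e MYSQL_PWD="$DB_ROOT_PASS" lcbp3-mariadb mysqldump \
  -u root \
  --single-transaction \
  lcbp3_dms > "$BACKUP_FILE"
# Compress backup
gzip "$BACKUP_FILE"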
```
---
## 📂 File Uploads Backup
### Automated Rsync Backup
```bash
#!/bin/bash
# File: /scripts/backup-uploads.sh
SOURCE="/var/lib/docker/volumes/lcbp3_uploads/_data"
DEST="/backup/lcbp3-dms/uploads"
RETENTION_DAYS=30
# Make sure the destination exists before the first run
mkdir -p "$DEST/current"
# Create incremental backup with rsync
# (files changed or deleted since the last run are moved to the dated backup-dir)
rsync -av --delete \
  --backup --backup-dir="$DEST/backup-$(date +%Y%m%d)" \
  "$SOURCE/" "$DEST/current/"
# Cleanup old backups
find "$DEST" -maxdepth 1 -type d -name "backup-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
echo "Upload backup completed: $(date)"
```
### Schedule Uploads Backup
```bash
# Run daily at 3 AM
0 3 * * * /scripts/backup-uploads.sh >> /var/log/backup-uploads.log 2>&1
```
---
## 🔄 Database Recovery
### Full Database Restore
```bash
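# Step 1: Stop backend application
docker stop lcbp3-backend
# Step 2: Restore database from backup
# (stdin carries the dump, so the password is passed via MYSQL_PWD instead of a -p prompt)
read -rs -p "MariaDB root password: " DB_ROOT_PASS; echo
gunzip < backup_20241201.sql.gz | \
  docker exec -i -e MYSQL_PWD="$DB_ROOT_PASS" lcbp3-mariadb mysql -u root lcbp3_dms
# Step 3: Verify restore
docker exec -e MYSQL_PWD="$DB_ROOT_PASS" lcbp3-mariadb mysql -u root -e "
  USE lcbp3_dms;
  SELECT COUNT(*) FROM users;
  SELECT COUNT(*) FROM correspondences;
"
# Step 4: Restart backend
docker start lcbp3-backend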
```
### Point-in-Time Recovery (Using Binary Logs)
```bash
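# Step 1: Restore last full backup
# (stdin carries the dump, so the password is passed via MYSQL_PWD instead of a -p prompt)
read -rs -p "MariaDB root password: " DB_ROOT_PASS; echo
gunzip < backup_20241201_020000.sql.gz | \
  docker exec -i -e MYSQL_PWD="$DB_ROOT_PASS" lcbp3-mariadb mysql -u root lcbp3_dms
# Step 2: Apply binary logs since backup
# (start at the backup time, stop just before the event to be undone)
docker exec lcbp3-mariadb mysqlbinlog \
  --start-datetime="2024-12-01 02:00:00" \
  --stop-datetime="2024-12-01 14:30:00" \
  /var/lib/mysql/mysql-bin.000001 | \
  docker exec -i -e MYSQL_PWD="$DB_ROOT_PASS" lcbp3-mariadb mysql -u root lcbp3_dms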
```
---
## 📁 File Uploads Recovery
### Restore from Backup
```bash
# Stop backend to prevent file operations
docker stop lcbp3-backend
# Restore files
rsync -av \
  /backup/lcbp3-dms/uploads/current/ \
  /var/lib/docker/volumes/lcbp3_uploads/_data/
# Restart backend (docker exec only works on a running container)
docker start lcbp3-backend
# Fix ownership of the restored files
docker exec -u root lcbp3-backend chown -R node:node /app/uploads
```
---
## 🚨 Disaster Recovery Plan
### RTO & RPO
- **RTO (Recovery Time Objective):** 4 hours
- **RPO (Recovery Point Objective):** 24 hours (for files), 6 hours (for database)
### DR Scenarios
#### Scenario 1: Database Corruption
**Detection:** Database errors in logs, application errors
**Recovery Time:** 30 minutes
**Steps:**
1. Stop backend
2. Restore last full backup
3. Apply binary logs (if needed)
4. Verify data integrity
5. Restart services
#### Scenario 2: Complete Server Failure
**Detection:** Server unresponsive
**Recovery Time:** 4 hours
**Steps:**
1. Provision new QNAP server or VM
2. Install Docker & Container Station
3. Clone Git repository
4. Restore database backup
5. Restore file uploads
6. Deploy containers
7. Update DNS (if needed)
8. Verify functionality
#### Scenario 3: Ransomware Attack
**Detection:** Encrypted files, ransom note
**Recovery Time:** 6 hours
**Steps:**
1. **DO NOT pay ransom**
2. Isolate infected server
3. Provision clean environment
4. Restore from offsite backup
5. Scan restored backup for malware
6. Deploy and verify
7. Review security logs
8. Implement additional security measures
---
## ✅ Backup Verification
### Weekly Backup Testing
```bash
#!/bin/bash
# File: /scripts/test-backup.sh
# Runs unattended, so the password is supplied via MYSQL_PWD rather than a -p prompt
DB_ROOT_PASS="<ROOT_PASSWORD>"
run_mysql() {
  docker exec -i -e MYSQL_PWD="$DB_ROOT_PASS" lcbp3-mariadb mysql -u root "$@"
}
# Create temporary test database
run_mysql -e "
  CREATE DATABASE IF NOT EXISTS test_restore;
"
# Restore latest backup to test database
# (the sed rewrite must match, otherwise the dump would be replayed into lcbp3_dms)
LATEST_BACKUP=$(ls -t /backup/lcbp3-dms/database/*.sql.gz | head -1)
gunzip < "$LATEST_BACKUP" | \
  sed 's/USE `lcbp3_dms`/USE `test_restore`/g' | \
  run_mysql
# Verify table counts
run_mysql -e "
  SELECT COUNT(*) FROM test_restore.users;
  SELECT COUNT(*) FROM test_restore.correspondences;
"
# Cleanup
run_mysql -e "
  DROP DATABASE test_restore;
"
echo "Backup verification completed: $(date)"
```
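
Before a full test restore, a cheaper integrity check can catch truncated or aborted dumps. This is a sketch under assumptions: the helper name `verify_dump` is hypothetical, and the trailer check relies on mysqldump writing a `-- Dump completed` comment at the end of a successful dump, which holds only when dump comments are enabled (the default).

```bash
#!/bin/bash
# Sketch: quick integrity checks for a compressed mysqldump file.
# `verify_dump` is a hypothetical helper, not part of the documented scripts.

verify_dump() {
  local file="$1"
  # 1. The gzip stream must be intact (detects truncated or corrupted archives).
  if ! gzip -t "$file" 2>/dev/null; then
    echo "FAIL: $file is not a valid gzip file"
    return 1
  fi
  # 2. A successful mysqldump ends with a "-- Dump completed" comment,
  #    so a missing trailer suggests the dump was interrupted.
  if ! gunzip -c "$file" | tail -n 5 | grep -q -e '-- Dump completed'; then
    echo "FAIL: $file has no 'Dump completed' trailer"
    return 1
  fi
  echo "OK: $file"
}

BACKUP_DIR="${BACKUP_DIR:-/backup/lcbp3-dms/database}"
latest=$(ls -t "$BACKUP_DIR"/*.sql.gz 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then
  verify_dump "$latest"
fi
```

This does not replace the restore test: it proves the file is complete, not that its contents are usable.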
### Monthly DR Drill
- Test full system restore on standby server
- Document time taken and issues encountered
- Update DR procedures based on findings
---
## 📊 Backup Monitoring
### Backup Status Dashboard
Monitor:
- ✅ Last successful backup timestamp
- ✅ Backup file size (detect anomalies)
- ✅ Backup success/failure rate
- ✅ Available backup storage space
### Alerts
Send alert if:
- ❌ Backup fails
- ❌ Backup file size < 50% of average (possible corruption)
- ❌ No backup in last 48 hours
- ❌ Backup storage < 20% free
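
The alert conditions above can be evaluated with standard tools. A minimal sketch follows; the function names are hypothetical, the average includes the newest file, and delivery of the alert (email, LINE, and so on) is left out.

```bash
#!/bin/bash
# Sketch: evaluate the backup alert conditions for one directory.
# Function names are illustrative assumptions; thresholds mirror the list above.

# Succeeds if at least one *.sql.gz in $1 was modified within the last $2 hours.
backup_is_fresh() {
  local dir="$1" max_hours="$2"
  [ -n "$(find "$dir" -name '*.sql.gz' -type f -mmin "-$(( max_hours * 60 ))" | head -n 1)" ]
}

# Succeeds if the newest backup is at least 50% of the average backup size.
backup_size_is_normal() {
  local dir="$1" latest latest_size avg
  latest=$(ls -t "$dir"/*.sql.gz 2>/dev/null | head -n 1)
  [ -n "$latest" ] || return 1
  latest_size=$(wc -c < "$latest")
  # Average size in bytes over all backups in the directory.
  avg=$(for f in "$dir"/*.sql.gz; do wc -c < "$f"; done |
        awk '{ s += $1 } END { if (NR) printf "%d", s / NR }')
  [ $(( latest_size * 2 )) -ge "$avg" ]
}

# Succeeds if the filesystem holding $1 has at least $2 percent free space.
storage_has_room() {
  local dir="$1" min_free="$2" used
  used=$(df -P "$dir" | awk 'NR == 2 { gsub("%", "", $5); print $5 }')
  [ $(( 100 - used )) -ge "$min_free" ]
}

BACKUP_DIR="${BACKUP_DIR:-/backup/lcbp3-dms/database}"
if [ -d "$BACKUP_DIR" ]; then
  backup_is_fresh "$BACKUP_DIR" 48 || echo "ALERT: no backup in the last 48 hours"
  backup_size_is_normal "$BACKUP_DIR" || echo "ALERT: latest backup is under 50% of the average size"
  storage_has_room "$BACKUP_DIR" 20 || echo "ALERT: backup storage is below 20% free"
fi
```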
---
## 🔧 Maintenance
### Optimize Backup Performance
```sql
-- Enable InnoDB compression for large tables
ALTER TABLE correspondences ROW_FORMAT=COMPRESSED;
ALTER TABLE workflow_history ROW_FORMAT=COMPRESSED;
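-- Archive old audit logs
-- Move records older than 1 year to archive table
-- Use one fixed cutoff inside a transaction so the DELETE removes
-- exactly the rows that were copied (NOW() would differ between statements)
SET @cutoff = DATE_SUB(NOW(), INTERVAL 1 YEAR);
START TRANSACTION;
INSERT INTO audit_logs_archive
SELECT * FROM audit_logs
WHERE created_at < @cutoff;
DELETE FROM audit_logs
WHERE created_at < @cutoff;
COMMIT;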
```
---
## 📚 Backup Checklist
### Daily Tasks
- [ ] Verify automated backups completed
- [ ] Check backup log files for errors
- [ ] Monitor backup storage space
### Weekly Tasks
- [ ] Test restore from random backup
- [ ] Review backup size trends
- [ ] Verify offsite backups synced
### Monthly Tasks
- [ ] Full DR drill
- [ ] Review and update DR procedures
- [ ] Test backup restoration on different server
### Quarterly Tasks
- [ ] Audit backup access controls
- [ ] Review backup retention policies
- [ ] Update backup documentation
---
## 🔗 Related Documents
- [Deployment Guide](./deployment-guide.md)
- [Monitoring & Alerting](./monitoring-alerting.md)
- [Incident Response](./incident-response.md)
---
**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01

View File

@@ -0,0 +1,463 @@
# Environment Setup & Configuration
**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01
---
## 📋 Overview
This document describes environment variables, configuration files, and secrets management for LCBP3-DMS deployment.
---
## 🔐 Environment Variables
### Backend (.env)
```bash
# File: backend/.env (DO NOT commit to Git)
# Application
NODE_ENV=production
APP_PORT=3000
APP_URL=https://lcbp3-dms.example.com
# Database
DB_HOST=lcbp3-mariadb
DB_PORT=3306
DB_USER=lcbp3_user
DB_PASS=<STRONG_PASSWORD>
DB_NAME=lcbp3_dms
# Redis
REDIS_HOST=lcbp3-redis
REDIS_PORT=6379
REDIS_PASSWORD=<STRONG_PASSWORD>
# JWT Authentication
JWT_SECRET=<RANDOM_256_BIT_SECRET>
JWT_EXPIRATION=1h
JWT_REFRESH_SECRET=<RANDOM_256_BIT_SECRET>
JWT_REFRESH_EXPIRATION=7d
# File Storage
UPLOAD_DIR=/app/uploads
TEMP_UPLOAD_DIR=/app/uploads/temp
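# 100 MB in bytes (comment kept on its own line: some env-file parsers treat a trailing "# ..." as part of the value)
MAX_FILE_SIZE=104857600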
ALLOWED_FILE_TYPES=pdf,doc,docx,xls,xlsx,dwg,jpg,png
# SMTP Email
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=noreply@example.com
SMTP_PASS=<APP_PASSWORD>
SMTP_FROM="LCBP3-DMS System <noreply@example.com>"
# LINE Notify (Optional)
LINE_NOTIFY_ENABLED=true
# ClamAV Virus Scanner
CLAMAV_HOST=clamav
CLAMAV_PORT=3310
# Elasticsearch
ELASTICSEARCH_NODE=http://lcbp3-elasticsearch:9200
ELASTICSEARCH_INDEX_PREFIX=lcbp3_
# Logging
LOG_LEVEL=info
LOG_FILE_PATH=/app/logs
# Frontend URL (for email links)
FRONTEND_URL=https://lcbp3-dms.example.com
# Rate Limiting
RATE_LIMIT_TTL=60
RATE_LIMIT_MAX=100
```
### Frontend (.env.local)
```bash
# File: frontend/.env.local (DO NOT commit to Git)
# API Backend
NEXT_PUBLIC_API_URL=https://lcbp3-dms.example.com/api
# Application
NEXT_PUBLIC_APP_NAME=LCBP3-DMS
NEXT_PUBLIC_APP_VERSION=1.5.0
# Feature Flags
NEXT_PUBLIC_ENABLE_NOTIFICATIONS=true
NEXT_PUBLIC_ENABLE_LINE_NOTIFY=true
```
---
## 🐳 Docker Compose Configuration
### Production docker-compose.yml
```yaml
# File: docker-compose.yml
version: '3.8'
services:
  # NGINX Reverse Proxy
  nginx:
    image: nginx:alpine
    container_name: lcbp3-nginx
    ports:
      - '80:80'
      - '443:443'
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
      - nginx-logs:/var/log/nginx
    depends_on:
      - backend
      - frontend
    restart: unless-stopped
    networks:
      - lcbp3-network
  # NestJS Backend
  backend:
    image: lcbp3-backend:latest
    container_name: lcbp3-backend
    environment:
      - NODE_ENV=production
    env_file:
      - ./backend/.env
    volumes:
      - uploads:/app/uploads
      - backend-logs:/app/logs
    depends_on:
      - mariadb
      - redis
      - elasticsearch
    restart: unless-stopped
    networks:
      - lcbp3-network
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:3000/health']
      interval: 30s
      timeout: 10s
      retries: 3
  # Next.js Frontend
  frontend:
    image: lcbp3-frontend:latest
    container_name: lcbp3-frontend
    environment:
      - NODE_ENV=production
    env_file:
      - ./frontend/.env.local
    restart: unless-stopped
    networks:
      - lcbp3-network
  # MariaDB Database
  mariadb:
    image: mariadb:10.11
    container_name: lcbp3-mariadb
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASS}
      MYSQL_DATABASE: ${DB_NAME}
      MYSQL_USER: ${DB_USER}
      MYSQL_PASSWORD: ${DB_PASS}
    volumes:
      - mariadb-data:/var/lib/mysql
      - ./mariadb/init:/docker-entrypoint-initdb.d:ro
    ports:
      - '3306:3306'
    restart: unless-stopped
    networks:
      - lcbp3-network
    command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
  # Redis Cache & Queue
  redis:
    image: redis:7.2-alpine
    container_name: lcbp3-redis
    command: redis-server --requirepass ${REDIS_PASSWORD}
    volumes:
      - redis-data:/data
    ports:
      - '6379:6379'
    restart: unless-stopped
    networks:
      - lcbp3-network
  # Elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: lcbp3-elasticsearch
    environment:
      - discovery.type=single-node
      - 'ES_JAVA_OPTS=-Xms512m -Xmx512m'
      - xpack.security.enabled=false
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - '9200:9200'
    restart: unless-stopped
    networks:
      - lcbp3-network
  # ClamAV (Optional - for virus scanning)
  clamav:
    image: clamav/clamav:latest
    container_name: lcbp3-clamav
    restart: unless-stopped
    networks:
      - lcbp3-network
networks:
  lcbp3-network:
    driver: bridge
volumes:
  mariadb-data:
  redis-data:
  elasticsearch-data:
  uploads:
  backend-logs:
  nginx-logs:
```
### Development docker-compose.override.yml
```yaml
# File: docker-compose.override.yml (Local development only)
# Add to .gitignore
version: '3.8'
services:
  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile.dev
    volumes:
      - ./backend:/app
      - /app/node_modules
    environment:
      - NODE_ENV=development
      - LOG_LEVEL=debug
    ports:
      - '3000:3000'
      - '9229:9229' # Node.js debugger
  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile.dev
    volumes:
      - ./frontend:/app
      - /app/node_modules
      - /app/.next
    ports:
      - '3001:3000'
  mariadb:
    ports:
      - '3307:3306' # Avoid conflict with local MySQL
  redis:
    ports:
      - '6380:6379'
  elasticsearch:
    environment:
      - 'ES_JAVA_OPTS=-Xms256m -Xmx256m' # Lower memory for dev
```
---
## 🔑 Secrets Management
### Using Docker Secrets (Recommended for Production)
```yaml
# docker-compose.yml
services:
  backend:
    secrets:
      - db_password
      - jwt_secret
    environment:
      DB_PASS_FILE: /run/secrets/db_password
      JWT_SECRET_FILE: /run/secrets/jwt_secret
secrets:
  db_password:
    file: ./secrets/db_password.txt
  jwt_secret:
    file: ./secrets/jwt_secret.txt
```
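
The `*_FILE` variables only take effect if something reads the files. Whether the LCBP3 backend does so itself is not stated here, so the following is a sketch of one common approach: an entrypoint helper that copies the file contents into the plain variable before the application starts. The name `file_env` is hypothetical, and the script requires bash, which a minimal base image may not include.

```bash
#!/bin/bash
# Sketch of an entrypoint helper: populate VAR from VAR_FILE when the latter is set.
# `file_env` is a hypothetical helper name, not part of the LCBP3 codebase.

file_env() {
  local var="$1" file_var="${1}_FILE"
  local value="${!var:-}" file_path="${!file_var:-}"
  if [ -n "$value" ] && [ -n "$file_path" ]; then
    echo "ERROR: both $var and $file_var are set" >&2
    return 1
  fi
  if [ -n "$file_path" ]; then
    # Command substitution strips the trailing newline most editors add.
    value="$(cat "$file_path")" || return 1
  fi
  if [ -n "$value" ]; then
    export "$var=$value"
  fi
  unset "$file_var"
}

# Wiring for the two secrets declared in the compose fragment.
file_env DB_PASS
file_env JWT_SECRET
# exec "$@"   # hand over to the real container command
```

Refusing to start when both forms are set avoids silently picking the wrong credential.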
### Generate Strong Secrets
```bash
# Generate JWT Secret
openssl rand -base64 64
# Generate Database Password
openssl rand -base64 32
# Generate Redis Password
openssl rand -base64 32
```
---
## 📁 Directory Structure
```
lcbp3/
├── backend/
│   ├── .env                     # Backend environment (DO NOT commit)
│   ├── .env.example             # Example template (commit this)
│   └── ...
├── frontend/
│   ├── .env.local               # Frontend environment (DO NOT commit)
│   ├── .env.example             # Example template
│   └── ...
├── nginx/
│   ├── nginx.conf
│   └── ssl/
│       ├── cert.pem
│       └── key.pem
├── secrets/                     # Docker secrets (DO NOT commit)
│   ├── db_password.txt
│   ├── jwt_secret.txt
│   └── redis_password.txt
├── docker-compose.yml           # Production config
└── docker-compose.override.yml  # Development config (DO NOT commit)
```
---
## ⚙️ Configuration Management
### Environment-Specific Configs
**Development:**
```bash
NODE_ENV=development
LOG_LEVEL=debug
DB_HOST=localhost
```
**Staging:**
```bash
NODE_ENV=staging
LOG_LEVEL=info
DB_HOST=staging-db.internal
```
**Production:**
```bash
NODE_ENV=production
LOG_LEVEL=warn
DB_HOST=prod-db.internal
```
### Configuration Validation
Backend validates environment variables at startup:
```typescript
// File: backend/src/config/env.validation.ts
import * as Joi from 'joi';
export const envValidationSchema = Joi.object({
  NODE_ENV: Joi.string()
    .valid('development', 'staging', 'production')
    .required(),
  DB_HOST: Joi.string().required(),
  DB_PORT: Joi.number().default(3306),
  DB_USER: Joi.string().required(),
  DB_PASS: Joi.string().required(),
  JWT_SECRET: Joi.string().min(32).required(),
  // ...
});
```
---
## 🔒 Security Best Practices
### DO:
- ✅ Use strong, random passwords (minimum 32 characters)
- ✅ Rotate secrets every 90 days
- ✅ Use Docker secrets for production
- ✅ Add `.env` files to `.gitignore`
- ✅ Provide `.env.example` templates
- ✅ Validate environment variables at startup
### DON'T:
- ❌ Commit `.env` files to Git
- ❌ Use weak or default passwords
- ❌ Share production credentials via email/chat
- ❌ Reuse passwords across environments
- ❌ Hardcode secrets in source code
---
## 🛠️ Troubleshooting
### Common Issues
**Backend can't connect to database:**
```bash
# Check database container is running
docker ps | grep mariadb
# Check database logs
docker logs lcbp3-mariadb
# Verify credentials
docker exec lcbp3-backend env | grep DB_
```
**Redis connection refused:**
```bash
# Test Redis connection
docker exec lcbp3-redis redis-cli -a <PASSWORD> ping
# Should return: PONG
```
**Environment variable not loading:**
```bash
# Check if env file exists
ls -la backend/.env
# Check if backend loaded the env
docker exec lcbp3-backend env | grep NODE_ENV
```
---
## 📚 Related Documents
- [Deployment Guide](./deployment-guide.md)
- [Security Operations](./security-operations.md)
- [ADR-005: Technology Stack](../05-decisions/ADR-005-technology-stack.md)
---
**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01

View File

@@ -0,0 +1,483 @@
# Incident Response Procedures
**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01
---
## 📋 Overview
This document outlines incident classification, response procedures, and post-incident reviews for LCBP3-DMS.
---
## 🚨 Incident Classification
### Severity Levels
| Severity | Description | Response Time | Examples |
| ----------------- | ---------------------------- | ----------------- | ----------------------------------------------- |
| **P0 - Critical** | Complete system outage | 15 minutes | Database down, All services unavailable |
| **P1 - High** | Major functionality impaired | 1 hour | Authentication failing, Cannot create documents |
| **P2 - Medium** | Degraded performance | 4 hours | Slow response time, Some features broken |
| **P3 - Low** | Minor issues | Next business day | UI glitch, Non-critical bug |
---
## 📞 Incident Response Team
### Roles & Responsibilities
**Incident Commander (IC)**
- Coordinates response efforts
- Makes final decisions
- Communicates with stakeholders
**Technical Lead (TL)**
- Diagnoses technical issues
- Implements fixes
- Coordinates with engineers
**Communications Lead (CL)**
- Updates stakeholders
- Manages internal/external communications
- Documents incident timeline
**On-Call Engineer**
- First responder
- Initial triage and investigation
- Escalates to appropriate team
---
## 🔄 Incident Response Workflow
```mermaid
flowchart TD
Start([Incident Detected]) --> Acknowledge[Acknowledge Incident]
Acknowledge --> Assess[Assess Severity]
Assess --> P0{Severity?}
P0 -->|P0/P1| Alert[Page Incident Commander]
P0 -->|P2/P3| Assign[Assign to On-Call]
Alert --> Investigate[Investigate Root Cause]
Assign --> Investigate
Investigate --> Mitigate[Implement Mitigation]
Mitigate --> Verify[Verify Resolution]
Verify --> Resolved{Resolved?}
Resolved -->|No| Escalate[Escalate/Re-assess]
Escalate --> Investigate
Resolved -->|Yes| Communicate[Communicate Resolution]
Communicate --> PostMortem[Schedule Post-Mortem]
PostMortem --> End([Close Incident])
```
---
## 📋 Incident Response Playbooks
### P0: Database Down
**Symptoms:**
- Backend returns 500 errors
- Cannot connect to database
- Health check fails
**Immediate Actions:**
1. **Verify Issue**
```bash
docker ps | grep mariadb
docker logs lcbp3-mariadb --tail=50
```
2. **Attempt Restart**
```bash
docker restart lcbp3-mariadb
```
3. **Check Database Process**
```bash
# The server process may be named mysqld or mariadbd depending on the image version
docker exec lcbp3-mariadb ps aux | grep -E 'mysqld|mariadbd'
```
4. **If Restart Fails:**
```bash
# Check disk space
df -h
# Check database logs for corruption
docker exec lcbp3-mariadb cat /var/log/mysql/error.log
# If corrupted, restore from backup
# See backup-recovery.md
```
5. **Escalate to DBA** if not resolved in 30 minutes
---
### P0: Complete System Outage
**Symptoms:**
- All services return 502/503
- Health checks fail
- Users cannot access system
**Immediate Actions:**
1. **Check Container Status**
```bash
docker-compose ps
# Identify which containers are down
```
2. **Restart All Services**
```bash
docker-compose restart
```
3. **Check QNAP Server Resources**
```bash
top
df -h
free -h
```
4. **Check Network**
```bash
ping 8.8.8.8
netstat -tlnp
```
5. **If Server Issue:**
- Reboot QNAP server
- Contact QNAP support
---
### P1: Authentication System Failing
**Symptoms:**
- Users cannot log in
- JWT validation fails
- 401 errors
**Immediate Actions:**
1. **Check Redis (Session Store)**
```bash
docker exec lcbp3-redis redis-cli -a <PASSWORD> ping
# Should return PONG (Redis runs with --requirepass, so -a is needed)
```
2. **Check JWT Secret Configuration**
```bash
docker exec lcbp3-backend env | grep JWT_SECRET
# Verify not empty
```
3. **Check Backend Logs**
```bash
# 2>&1 includes lines the container wrote to stderr
docker logs lcbp3-backend --tail=100 2>&1 | grep "JWT\|Auth"
```
4. **Temporary Mitigation:**
```bash
# Restart backend to reload config
docker restart lcbp3-backend
```
---
### P1: File Upload Failing
**Symptoms:**
- Users cannot upload files
- 500 errors on file upload
- "Disk full" errors
**Immediate Actions:**
1. **Check Disk Space**
```bash
df -h /var/lib/docker/volumes/lcbp3_uploads
```
2. **If Disk Full:**
```bash
# Clean up temp uploads
find /var/lib/docker/volumes/lcbp3_uploads/_data/temp \
-type f -mtime +1 -delete
```
3. **Check ClamAV (Virus Scanner)**
```bash
docker logs lcbp3-clamav --tail=50
docker restart lcbp3-clamav
```
4. **Check File Permissions**
```bash
docker exec lcbp3-backend ls -la /app/uploads
```
---
### P2: Slow Performance
**Symptoms:**
- Pages load slowly
- API response time > 2s
- Users complain about slowness
**Actions:**
1. **Check System Resources**
```bash
docker stats
# Identify high CPU/memory containers
```
2. **Check Database Performance**
```sql
-- Show slow queries
SHOW PROCESSLIST;
-- Check connections
SHOW STATUS LIKE 'Threads_connected';
```
3. **Check Redis**
```bash
docker exec lcbp3-redis redis-cli -a <PASSWORD> --stat
```
4. **Check Application Logs**
```bash
docker logs lcbp3-backend 2>&1 | grep "Slow request"
```
5. **Temporary Mitigation:**
- Restart slow containers
- Clear Redis cache if needed
- Kill long-running queries
---
### P2: Email Notifications Not Sending
**Symptoms:**
- Users not receiving emails
- Email queue backing up
**Actions:**
1. **Check Email Queue**
```bash
# Access BullMQ dashboard or check Redis
# (BullMQ stores waiting jobs in the list key <prefix>:<queue>:wait)
docker exec lcbp3-redis redis-cli -a <PASSWORD> LLEN bull:email:wait
```
2. **Check Email Processor Logs**
```bash
docker logs lcbp3-backend 2>&1 | grep "email\|SMTP"
```
3. **Test SMTP Connection**
```bash
docker exec lcbp3-backend node -e "
  const nodemailer = require('nodemailer');
  const transport = nodemailer.createTransport({
    host: process.env.SMTP_HOST,
    port: process.env.SMTP_PORT,
    auth: {
      user: process.env.SMTP_USER,
      pass: process.env.SMTP_PASS
    }
  });
  transport.verify().then(console.log).catch(console.error);
"
```
4. **Check SMTP Credentials**
- Verify not expired
- Check firewall/network access
---
## 📝 Incident Documentation
### Incident Report Template
```markdown
# Incident Report: [Brief Description]
**Incident ID:** INC-YYYYMMDD-001
**Severity:** P1
**Status:** Resolved
**Incident Commander:** [Name]
## Timeline
| Time | Event |
| ----- | --------------------------------------------------------- |
| 14:00 | Alert: High error rate detected |
| 14:05 | On-call engineer acknowledged |
| 14:10 | Identified root cause: Database connection pool exhausted |
| 14:15 | Implemented mitigation: Increased pool size |
| 14:20 | Verified resolution |
| 14:30 | Incident resolved |
## Impact
- **Duration:** 30 minutes
- **Affected Users:** ~50 users
- **Affected Services:** Document creation, Search
- **Data Loss:** None
## Root Cause
Database connection pool was exhausted due to slow queries not releasing connections.
## Resolution
1. Increased connection pool size from 10 to 20
2. Optimized slow queries
3. Added connection pool monitoring
## Action Items
- [ ] Add connection pool size alert (Owner: DevOps, Due: Next Sprint)
- [ ] Implement automatic query timeouts (Owner: Backend, Due: 2025-12-15)
- [ ] Review all queries for optimization (Owner: DBA, Due: 2025-12-31)
## Lessons Learned
- Connection pool monitoring was insufficient
- Need automated remediation for common issues
```
---
## 🔍 Post-Incident Review (PIR)
### PIR Meeting Agenda
1. **Timeline Review** (10 min)
- What happened and when?
- What was the impact?
2. **Root Cause Analysis** (15 min)
- Why did it happen?
- What were the contributing factors?
3. **What Went Well** (10 min)
- What did we do right?
- What helped us resolve quickly?
4. **What Went Wrong** (15 min)
- What could we have done better?
- What slowed us down?
5. **Action Items** (10 min)
- What changes will prevent this?
- Who owns each action?
- When will they be completed?
### PIR Best Practices
- **Blameless Culture:** Focus on systems, not individuals
- **Actionable Outcomes:** Every PIR should produce concrete actions
- **Follow Through:** Track action items to completion
- **Share Learnings:** Distribute PIR summary to entire team
---
## 📊 Incident Metrics
### Track & Review Monthly
- **MTTR (Mean Time To Resolution):** Average time to resolve incidents
- **MTBF (Mean Time Between Failures):** Average time between incidents
- **Incident Frequency:** Number of incidents per month
- **Severity Distribution:** Breakdown by P0/P1/P2/P3
- **Repeat Incidents:** Same root cause occurring multiple times
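
Several of these metrics can be derived from a simple incident log. The sketch below assumes a headerless CSV with the columns `id,severity,detected_epoch,resolved_epoch`; that layout and the function name `incident_metrics` are illustrative assumptions, not an existing LCBP3 format.

```bash
#!/bin/bash
# Sketch: compute MTTR and severity distribution from a CSV incident log.
# Assumed columns (no header): id,severity,detected_epoch,resolved_epoch

incident_metrics() {
  awk -F',' '
    # Accumulate resolution time in seconds and count incidents per severity.
    { total += ($4 - $3); count[$2]++; n++ }
    END {
      if (n == 0) { print "no incidents"; exit }
      printf "incidents=%d mttr_minutes=%.1f\n", n, total / n / 60
      split("P0 P1 P2 P3", levels, " ")
      for (i = 1; i <= 4; i++) printf "%s=%d\n", levels[i], count[levels[i]]
    }
  ' "$1"
}

# Usage: incident_metrics /path/to/incidents.csv
```

MTBF and repeat-incident counts would need a longer observation window and a root-cause field, which this layout does not carry.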
---
## ✅ Incident Response Checklist
### During Incident
- [ ] Acknowledge incident in tracking system
- [ ] Assess severity and assign IC
- [ ] Create incident channel (Slack/Teams)
- [ ] Begin documenting timeline
- [ ] Investigate and implement mitigation
- [ ] Communicate status updates every 30 min (P0/P1)
- [ ] Verify resolution
- [ ] Communicate resolution to stakeholders
### After Incident
- [ ] Create incident report
- [ ] Schedule PIR within 48 hours
- [ ] Identify action items
- [ ] Assign owners and deadlines
- [ ] Update runbooks/playbooks
- [ ] Share learnings with team
---
## 🔗 Related Documents
- [Monitoring & Alerting](./monitoring-alerting.md)
- [Backup & Recovery](./backup-recovery.md)
- [Security Operations](./security-operations.md)
---
**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01

View File

@@ -0,0 +1,501 @@
# Maintenance Procedures
**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01
---
## 📋 Overview
This document outlines routine maintenance tasks, update procedures, and optimization guidelines for LCBP3-DMS.
---
## 📅 Maintenance Schedule
### Daily Tasks
- Monitor system health and backups
- Review error logs
- Check disk space
### Weekly Tasks
- Database optimization
- Log rotation and cleanup
- Security patch review
- Performance monitoring review
### Monthly Tasks
- SSL certificate check
- Dependency updates (Security patches)
- Database maintenance
- Backup restoration test
### Quarterly Tasks
- Full system update
- Capacity planning review
- Security audit
- Disaster recovery drill
---
## 🔄 Update Procedures
### Application Updates
#### Backend Update
```bash
#!/bin/bash
# File: /scripts/update-backend.sh
# Abort on the first failed step so a broken build is never restarted into service
set -euo pipefail
# Step 1: Backup database
/scripts/backup-database.sh
# Step 2: Pull latest code
cd /app/lcbp3/backend
git pull origin main
# Step 3: Install dependencies
docker exec lcbp3-backend npm install
# Step 4: Run migrations
docker exec lcbp3-backend npm run migration:run
# Step 5: Build application
docker exec lcbp3-backend npm run build
# Step 6: Restart backend
docker restart lcbp3-backend
# Step 7: Verify health
sleep 10
curl -f http://localhost:3000/health || {
  echo "Health check failed! Rolling back..."
  docker exec lcbp3-backend npm run migration:revert
  docker restart lcbp3-backend
  exit 1
}
echo "Backend updated successfully"
```
#### Frontend Update
```bash
#!/bin/bash
# File: /scripts/update-frontend.sh
# Abort on the first failed step so a broken build is never restarted
set -euo pipefail
# Step 1: Pull latest code
cd /app/lcbp3/frontend
git pull origin main
# Step 2: Install dependencies
docker exec lcbp3-frontend npm install
# Step 3: Build application
docker exec lcbp3-frontend npm run build
# Step 4: Restart frontend
docker restart lcbp3-frontend
# Step 5: Verify
sleep 10
curl -f http://localhost:3001 || {
  echo "Frontend failed to start!"
  exit 1
}
echo "Frontend updated successfully"
```
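
Both update scripts above wait a fixed 10 seconds before checking the service, which can fail on a slow start and wastes time on a fast one. A minimal sketch of a polling alternative follows; the function name `wait_for_health` and its argument order are illustrative and not part of the existing scripts.

```bash
#!/bin/bash
# Hypothetical helper: retry a check command until it succeeds.
# Usage: wait_for_health <max_attempts> <delay_seconds> <command...>
wait_for_health() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # An exit status of 0 from the check command counts as healthy
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the final one
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

In the backend script this could replace the `sleep 10` line, for example `wait_for_health 15 2 curl -f http://localhost:3000/health`, with the existing rollback block running when the function returns non-zero.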
### Zero-Downtime Deployment
```bash
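#!/bin/bash
# File: /scripts/zero-downtime-deploy.sh
# Using blue-green deployment strategy
# Step 1: Start new "green" backend
docker-compose -f docker-compose.green.yml up -d backend
# Step 2: Wait for health check (up to 60 seconds)
HEALTHY=0
for i in {1..30}; do
  if curl -fsS http://localhost:3002/health > /dev/null; then
    HEALTHY=1
    break
  fi
  sleep 2
done
if [ "$HEALTHY" -ne 1 ]; then
  echo "Green backend never became healthy; aborting and leaving blue in service"
  exit 1
fi
# Step 3: Switch NGINX to green
# Note: the reload only moves traffic once the NGINX upstream config points at the green backend
docker exec lcbp3-nginx nginx -s reload
# Step 4: Stop old "blue" backend
docker stop lcbp3-backend-blue
echo "Deployment completed with zero downtime"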
```
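
The script above reloads NGINX in Step 3 but does not show how the upstream is repointed at the green backend. One possible approach, sketched below under stated assumptions, is to rewrite the port in an upstream include file before the reload. The function name `switch_upstream_port` and the idea of a dedicated include file with a single `server <host>:<port>;` line are hypothetical; the real NGINX layout for LCBP3-DMS is not described in this document.

```bash
#!/bin/bash
# Hypothetical helper: point an NGINX upstream include file at a new port.
# Assumes the file contains a line of the form "server <host>:<port>;".
switch_upstream_port() {
  local conf_file="$1"
  local new_port="$2"
  # Replace only the port of the "server host:port;" line and keep the host
  sed -i -E "s/^([[:space:]]*server[[:space:]]+[^:;]+):[0-9]+;/\1:${new_port};/" "$conf_file"
  # Succeed only if the new port is now present in the file
  grep -q ":${new_port};" "$conf_file"
}
```

After such a rewrite, running `docker exec lcbp3-nginx nginx -t` before `nginx -s reload` would catch a broken configuration while the blue backend is still serving traffic.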
---
## 🗄️ Database Maintenance
### Weekly Database Optimization
```sql
-- File: /scripts/optimize-database.sql
-- Optimize tables
OPTIMIZE TABLE correspondences;
OPTIMIZE TABLE rfas;
OPTIMIZE TABLE workflow_instances;
OPTIMIZE TABLE attachments;
-- Analyze tables for query optimization
ANALYZE TABLE correspondences;
ANALYZE TABLE rfas;
-- Check for table corruption
CHECK TABLE correspondences;
CHECK TABLE rfas;
-- Rebuild indexes if fragmented
ALTER TABLE correspondences ENGINE=InnoDB;
```
```bash
#!/bin/bash
# File: /scripts/weekly-db-maintenance.sh
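# -i forwards the redirected SQL file to mysql inside the container
# Assumption: DB_ROOT_PASSWORD is exported in the environment (a bare -p would wait for a prompt)
docker exec -i -e MYSQL_PWD="$DB_ROOT_PASSWORD" lcbp3-mariadb mysql -u root lcbp3_dms < /scripts/optimize-database.sql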
echo "Database optimization completed: $(date)"
```
### Monthly Database Cleanup
```sql
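-- Archive old audit logs (older than 1 year)
-- Use one fixed cutoff inside a transaction so the DELETE removes exactly the rows that were archived
SET @audit_cutoff = DATE_SUB(NOW(), INTERVAL 1 YEAR);
START TRANSACTION;
INSERT INTO audit_logs_archive
SELECT * FROM audit_logs
WHERE created_at < @audit_cutoff;
DELETE FROM audit_logs
WHERE created_at < @audit_cutoff;
COMMIT;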
-- Clean up deleted notifications (older than 90 days)
DELETE FROM notifications
WHERE deleted_at IS NOT NULL
AND deleted_at < DATE_SUB(NOW(), INTERVAL 90 DAY);
-- Clean up expired temp uploads (older than 24h)
DELETE FROM temp_uploads
WHERE created_at < DATE_SUB(NOW(), INTERVAL 1 DAY);
-- Optimize after cleanup
OPTIMIZE TABLE audit_logs;
OPTIMIZE TABLE notifications;
OPTIMIZE TABLE temp_uploads;
```
---
## 📦 Dependency Updates
### Security Patch Updates (Monthly)
```bash
#!/bin/bash
# File: /scripts/update-dependencies.sh
cd /app/lcbp3/backend
# Check for security vulnerabilities
npm audit
# Update security patches only (no major versions)
npm audit fix
# Run tests; stop here if they fail so nothing is committed
npm test || { echo "Tests failed; not committing"; exit 1; }
# Tests passed: commit and deploy
git add package*.json
git commit -m "chore: security patch updates"
git push origin main
```
### Major Version Updates (Quarterly)
```bash
# Check for outdated packages
npm outdated
# Update one major dependency at a time
npm install @nestjs/core@latest
# Test thoroughly
npm test
npm run test:e2e
# If successful, commit
git commit -am "chore: update @nestjs/core to vX.X.X"
```
---
## 🧹 Log Management
### Log Rotation Configuration
```bash
# File: /etc/logrotate.d/lcbp3-dms
/app/logs/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 0640 node node
sharedscripts
postrotate
docker exec lcbp3-backend kill -USR1 1
endscript
}
```
### Manual Log Cleanup
```bash
#!/bin/bash
# File: /scripts/cleanup-logs.sh
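# Delete logs older than 90 days (including the archives compressed below)
find /app/logs \( -name "*.log" -o -name "*.log.gz" \) -type f -mtime +90 -delete
# Compress logs older than 7 days
find /app/logs -name "*.log" -type f -mtime +7 -exec gzip {} \;
# Clean unused Docker resources older than 30 days
# --volumes is omitted: it would delete unused data volumes, and Docker rejects it together with the "until" filter
docker system prune -f --filter "until=720h"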
echo "Log cleanup completed: $(date)"
```
---
## 🔐 SSL Certificate Renewal
### Check Certificate Expiry
```bash
#!/bin/bash
# File: /scripts/check-ssl-cert.sh
CERT_FILE="/app/nginx/ssl/cert.pem"
EXPIRY_DATE=$(openssl x509 -enddate -noout -in "$CERT_FILE" | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( ($EXPIRY_EPOCH - $NOW_EPOCH) / 86400 ))
echo "SSL certificate expires in $DAYS_LEFT days"
if [ "$DAYS_LEFT" -lt 30 ]; then
  echo "WARNING: SSL certificate expires soon!"
  # Send alert
  /scripts/send-alert-email.sh "SSL Certificate Expiring" "Certificate expires in $DAYS_LEFT days"
fi
```
### Renew SSL Certificate (Let's Encrypt)
```bash
#!/bin/bash
# File: /scripts/renew-ssl.sh
# Stop if any step fails so NGINX is not reloaded with mismatched files
set -e
# Renew certificate
certbot renew --webroot -w /app/nginx/html
# Copy new certificate
cp /etc/letsencrypt/live/lcbp3-dms.example.com/fullchain.pem /app/nginx/ssl/cert.pem
cp /etc/letsencrypt/live/lcbp3-dms.example.com/privkey.pem /app/nginx/ssl/key.pem
# Reload NGINX
docker exec lcbp3-nginx nginx -s reload
echo "SSL certificate renewed: $(date)"
```
---
## 🧪 Performance Optimization
### Database Query Optimization
```sql
-- Find slow queries (mysql.slow_log is populated only when slow_query_log = ON and log_output includes TABLE)
SELECT * FROM mysql.slow_log
ORDER BY query_time DESC
LIMIT 10;
-- Add indexes for frequently queried columns
CREATE INDEX idx_correspondences_status ON correspondences(status);
CREATE INDEX idx_rfas_workflow_status ON rfas(workflow_status);
CREATE INDEX idx_attachments_entity ON attachments(entity_type, entity_id);
-- Analyze query execution plan
EXPLAIN SELECT * FROM correspondences
WHERE status = 'PENDING'
AND created_at > DATE_SUB(NOW(), INTERVAL 30 DAY);
```
### Redis Cache Optimization
```bash
#!/bin/bash
# File: /scripts/optimize-redis.sh
# Check Redis memory usage
docker exec lcbp3-redis redis-cli INFO memory
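# Set max memory policy
# Caution: BullMQ requires maxmemory-policy noeviction; apply allkeys-lru only to a cache-only Redis instance
docker exec lcbp3-redis redis-cli CONFIG SET maxmemory 1gb
docker exec lcbp3-redis redis-cli CONFIG SET maxmemory-policy allkeys-lru
# Save configuration
docker exec lcbp3-redis redis-cli CONFIG REWRITE
# Clear stale cache (only if needed; run manually)
# FLUSHDB deletes every key in the current database, including sessions and queued jobs
# docker exec lcbp3-redis redis-cli FLUSHDB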
```
### Application Performance Tuning
```typescript
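// Enable production optimizations in NestJS
// File: backend/src/main.ts
// Assumption: the compression, connect-timeout and @nestjs/cache-manager packages are installed
import { NestFactory, Reflector } from '@nestjs/core';
import { CacheInterceptor, CACHE_MANAGER } from '@nestjs/cache-manager';
import * as compression from 'compression';
import * as timeout from 'connect-timeout';
import { AppModule } from './app.module';
async function bootstrap() {
  const app = await NestFactory.create(AppModule, {
    logger:
      process.env.NODE_ENV === 'production'
        ? ['error', 'warn']
        : ['log', 'error', 'warn', 'debug'],
  });
  // Enable compression
  app.use(compression());
  // Enable caching (assumes CacheModule is registered in AppModule)
  app.useGlobalInterceptors(
    new CacheInterceptor(app.get(CACHE_MANAGER), app.get(Reflector))
  );
  // Set global timeout
  app.use(timeout('30s'));
  await app.listen(3000);
}
bootstrap();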
```
---
## 🔒 Security Maintenance
### Monthly Security Tasks
```bash
#!/bin/bash
# File: /scripts/security-maintenance.sh
# Update system packages
apt-get update && apt-get upgrade -y
# Update ClamAV virus definitions
docker exec lcbp3-clamav freshclam
# Scan for rootkits
rkhunter --check --skip-keypress
# Check for unauthorized users
awk -F: '($3 >= 1000) {print $1}' /etc/passwd
# Review sudo access
cat /etc/sudoers
# Check firewall rules
iptables -L -n -v
echo "Security maintenance completed: $(date)"
```
---
## ✅ Maintenance Checklist
### Pre-Maintenance
- [ ] Announce maintenance window to users
- [ ] Backup database and files
- [ ] Document current system state
- [ ] Prepare rollback plan
### During Maintenance
- [ ] Put system in maintenance mode (if needed)
- [ ] Perform updates/changes
- [ ] Run smoke tests
- [ ] Monitor system health
### Post-Maintenance
- [ ] Verify all services running
- [ ] Run full test suite
- [ ] Monitor performance metrics
- [ ] Communicate completion to users
- [ ] Document changes made
---
## 🔧 Emergency Maintenance
### Unplanned Maintenance Procedures
1. **Assess Urgency**
- Can it wait for scheduled maintenance?
- Is it causing active issues?
2. **Communicate Impact**
- Notify stakeholders immediately
- Estimate downtime
- Provide updates every 30 minutes
3. **Execute Carefully**
- Always backup first
- Have rollback plan ready
- Test in staging if possible
4. **Post-Maintenance Review**
- Document what happened
- Identify preventive measures
- Update runbooks
---
## 📚 Related Documents
- [Deployment Guide](./deployment-guide.md)
- [Backup & Recovery](./backup-recovery.md)
- [Monitoring & Alerting](./monitoring-alerting.md)
---
**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01

@@ -0,0 +1,443 @@
# Monitoring & Alerting
**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01
---
## 📋 Overview
This document describes monitoring setup, health checks, and alerting rules for LCBP3-DMS.
---
## 🎯 Monitoring Objectives
- **Availability:** System uptime > 99.5%
- **Performance:** API response time < 500ms (P95)
- **Reliability:** Error rate < 1%
- **Capacity:** Resource utilization < 80%
---
## 📊 Key Metrics
### Application Metrics
| Metric | Target | Alert Threshold |
| ----------------------- | ------- | ------------------ |
| API Response Time (P95) | < 500ms | > 1000ms |
| Error Rate | < 1% | > 5% |
| Request Rate | N/A | Sudden ±50% change |
| Active Users | N/A | - |
| Queue Length (BullMQ) | < 100 | > 500 |
### Infrastructure Metrics
| Metric | Target | Alert Threshold |
| ------------ | ------ | ----------------- |
| CPU Usage | < 70% | > 90% |
| Memory Usage | < 80% | > 95% |
| Disk Usage | < 80% | > 90% |
| Network I/O | N/A | Anomaly detection |
### Database Metrics
| Metric | Target | Alert Threshold |
| --------------------- | ------- | --------------- |
| Query Time (P95) | < 100ms | > 500ms |
| Connection Pool Usage | < 80% | > 95% |
| Slow Queries | 0 | > 10/min |
| Replication Lag | 0s | > 30s |
---
## 🔍 Health Checks
### Backend Health Endpoint
```typescript
// File: backend/src/health/health.controller.ts
import { Controller, Get, Inject } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  TypeOrmHealthIndicator,
  DiskHealthIndicator,
} from '@nestjs/terminus';
import { Redis } from 'ioredis';
@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: TypeOrmHealthIndicator,
    private disk: DiskHealthIndicator,
    // Assumption: an ioredis client is registered under the 'REDIS_CLIENT' provider token
    @Inject('REDIS_CLIENT') private redis: Redis
  ) {}
  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      // Database health
      () => this.db.pingCheck('database'),
      // Disk health
      () =>
        this.disk.checkStorage('storage', {
          path: '/',
          thresholdPercent: 0.9,
        }),
      // Redis health
      async () => {
        const reply = await this.redis.ping();
        return { redis: { status: reply === 'PONG' ? 'up' : 'down' } };
      },
    ]);
  }
}
```
### Health Check Response
```json
{
  "status": "ok",
  "info": {
    "database": {
      "status": "up"
    },
    "storage": {
      "status": "up",
      "freePercent": 0.75
    },
    "redis": {
      "status": "up"
    }
  },
  "error": {},
  "details": {
    "database": {
      "status": "up"
    },
    "storage": {
      "status": "up",
      "freePercent": 0.75
    },
    "redis": {
      "status": "up"
    }
  }
}
```
---
## 🐳 Docker Container Monitoring
### Health Check in docker-compose.yml
```yaml
services:
  backend:
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:3000/health']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
  mariadb:
    healthcheck:
      test: ['CMD', 'mysqladmin', 'ping', '-h', 'localhost']
      interval: 30s
      timeout: 10s
      retries: 3
  redis:
    healthcheck:
      test: ['CMD', 'redis-cli', 'ping']
      interval: 30s
      timeout: 10s
      retries: 3
```
### Monitor Container Status
```bash
#!/bin/bash
# File: /scripts/monitor-containers.sh
# Check all containers are healthy
CONTAINERS=("lcbp3-backend" "lcbp3-frontend" "lcbp3-mariadb" "lcbp3-redis")
for CONTAINER in "${CONTAINERS[@]}"; do
  # Empty output means the container is missing or defines no healthcheck
  HEALTH=$(docker inspect --format='{{.State.Health.Status}}' "$CONTAINER" 2>/dev/null)
  HEALTH=${HEALTH:-unknown}
  if [ "$HEALTH" != "healthy" ]; then
    echo "ALERT: $CONTAINER is $HEALTH"
    # Send alert (email, Slack, etc.)
  fi
done
```
---
## 📈 Application Performance Monitoring (APM)
### Log-Based Monitoring (MVP Phase)
```typescript
// File: backend/src/common/interceptors/performance.interceptor.ts
import {
  Injectable,
  NestInterceptor,
  ExecutionContext,
  CallHandler,
} from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { logger } from 'src/config/logger.config';
@Injectable()
export class PerformanceInterceptor implements NestInterceptor {
  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const start = Date.now();
    return next.handle().pipe(
      tap({
        next: () => {
          const duration = Date.now() - start;
          logger.info('Request completed', {
            method: request.method,
            url: request.url,
            statusCode: context.switchToHttp().getResponse().statusCode,
            duration: `${duration}ms`,
            userId: request.user?.user_id,
          });
          // Alert on slow requests
          if (duration > 1000) {
            logger.warn('Slow request detected', {
              method: request.method,
              url: request.url,
              duration: `${duration}ms`,
            });
          }
        },
        error: (error) => {
          const duration = Date.now() - start;
          logger.error('Request failed', {
            method: request.method,
            url: request.url,
            duration: `${duration}ms`,
            error: error.message,
          });
        },
      })
    );
  }
}
```
---
## 🚨 Alerting Rules
### Critical Alerts (Immediate Action Required)
| Alert | Condition | Action |
| --------------- | ------------------------------------------- | --------------------------- |
| Service Down | Health check fails for 3 consecutive checks | Page on-call engineer |
| Database Down | Cannot connect to database | Page DBA + on-call engineer |
| Disk Full | Disk usage > 95% | Page operations team |
| High Error Rate | Error rate > 10% for 5 min | Page on-call engineer |
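
The "Service Down" rule fires only after three consecutive failed checks, so a monitoring script needs to remember the failure streak between runs. The sketch below keeps that count in a state file; the function name `record_check_result` and the state file location are hypothetical and would need to fit the actual alerting setup.

```bash
#!/bin/bash
# Hypothetical helper: track consecutive health-check failures in a state file.
# Prints the current failure streak; returns 0 once the streak reaches the threshold.
record_check_result() {
  local state_file="$1"      # e.g. /tmp/lcbp3-backend.failcount (assumed path)
  local result="$2"          # "ok" or "fail"
  local threshold="${3:-3}"  # 3 matches the Service Down rule
  local count=0
  if [ -f "$state_file" ]; then
    count=$(cat "$state_file")
  fi
  if [ "$result" = "ok" ]; then
    count=0                  # any success resets the streak
  else
    count=$((count + 1))
  fi
  echo "$count" > "$state_file"
  echo "$count"
  [ "$count" -ge "$threshold" ]
}
```

A cron job could call it after each probe and page the on-call engineer only when it returns 0, which avoids alerting on a single transient failure.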
### Warning Alerts (Review Within 1 Hour)
| Alert | Condition | Action |
| ------------- | ----------------------- | ---------------------- |
| High CPU | CPU > 90% for 10 min | Notify operations team |
| High Memory | Memory > 95% for 10 min | Notify operations team |
| Slow Queries | > 50 slow queries/min | Notify DBA |
| Queue Backlog | BullMQ queue > 500 jobs | Notify backend team |
### Info Alerts (Review During Business Hours)
| Alert | Condition | Action |
| ------------------ | ------------------------------------ | --------------------- |
| Backup Failed | Daily backup job failed | Email operations team |
| SSL Expiring | SSL certificate expires in < 30 days | Email operations team |
| Disk Space Warning | Disk usage > 80% | Email operations team |
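
Disk usage appears in two of the alert tables: above 95% it is a critical "Disk Full" page, and above 80% it is an informational warning. A minimal sketch of that mapping is shown below; the function name `classify_disk_usage` and the tier labels it prints are illustrative only.

```bash
#!/bin/bash
# Map a disk usage percentage (integer) to the alert tier from the alerting rules.
classify_disk_usage() {
  local used_percent="$1"
  if [ "$used_percent" -gt 95 ]; then
    echo "critical"   # Disk Full: page operations team
  elif [ "$used_percent" -gt 80 ]; then
    echo "info"       # Disk Space Warning: email operations team
  else
    echo "ok"
  fi
}
```

On a host with GNU coreutils, the input could come from `df --output=pcent / | tail -1 | tr -dc '0-9'`; other `df` implementations may need different parsing.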
---
## 📧 Alert Notification Channels
### Email Alerts
```bash
#!/bin/bash
# File: /scripts/send-alert-email.sh
TO="ops-team@example.com"
SUBJECT="$1"
MESSAGE="$2"
echo "$MESSAGE" | mail -s "[LCBP3-DMS] $SUBJECT" "$TO"
```
### Slack Alerts
```bash
#!/bin/bash
# File: /scripts/send-alert-slack.sh
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
MESSAGE="$1"
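# Build the payload with jq so quotes and newlines in the message are JSON-escaped
PAYLOAD=$(jq -n --arg text "🚨 LCBP3-DMS Alert: $MESSAGE" '{text: $text}')
curl -X POST -H 'Content-type: application/json' \
  --data "$PAYLOAD" \
  "$WEBHOOK_URL"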
```
---
## 📊 Monitoring Dashboard
### Metrics to Display
**System Overview:**
- Service status (up/down)
- Overall system health score
- Active user count
- Request rate (req/s)
**Performance:**
- API response time (P50, P95, P99)
- Database query time
- Queue processing time
**Resources:**
- CPU usage %
- Memory usage %
- Disk usage %
- Network I/O
**Business Metrics:**
- Documents created today
- Workflows completed today
- Active correspondences
- Pending approvals
---
## 🔧 Log Aggregation
### Centralized Logging with Docker
```bash
# Configure Docker logging driver
# File: /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3",
"labels": "service,environment"
}
}
```
### View Aggregated Logs
```bash
# View all LCBP3 container logs
docker-compose logs -f --tail=100
# View specific service logs
docker logs lcbp3-backend -f --since=1h
# Search logs
docker logs lcbp3-backend 2>&1 | grep "ERROR"
# Export logs for analysis (stdout and stderr)
docker logs lcbp3-backend > backend-logs.txt 2>&1
```
---
## 📈 Performance Baseline
### Establish Baselines
Run load tests to establish performance baselines:
```bash
# Install Apache Bench
apt-get install apache2-utils
# Test API endpoint
ab -n 1000 -c 10 \
-H "Authorization: Bearer <TOKEN>" \
https://lcbp3-dms.example.com/api/correspondences
# Results to record:
# - Requests per second
# - Mean response time
# - P95 response time
# - Error rate
```
### Regular Performance Testing
- **Weekly:** Quick health check (100 requests)
- **Monthly:** Full load test (10,000 requests)
- **Quarterly:** Stress test (find breaking point)
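
The performance objective is stated as a P95 value, so baseline runs need a percentile rather than a mean. `ab` prints its own percentile table; for durations collected some other way (for example, values extracted from the request logs), a nearest-rank P95 can be sketched as below. The function name `p95` is hypothetical, and extracting the numbers from the log format is left out because that format is not shown here.

```bash
#!/bin/bash
# Nearest-rank 95th percentile of numbers read from stdin, one value per line.
p95() {
  sort -n | awk '
    { values[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # ceil(0.95 * NR) computed with integer-friendly arithmetic
      rank = int((NR * 95 + 99) / 100)
      print values[rank]
    }'
}
```

For example, piping one duration in milliseconds per line into `p95` yields a value that can be compared with the 500ms target and the 1000ms alert threshold.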
---
## ✅ Monitoring Checklist
### Daily
- [ ] Check service health dashboard
- [ ] Review error logs
- [ ] Verify backup completion
- [ ] Check disk space
### Weekly
- [ ] Review performance metrics trends
- [ ] Analyze slow query log
- [ ] Check SSL certificate expiry
- [ ] Review security alerts
### Monthly
- [ ] Capacity planning review
- [ ] Update monitoring thresholds
- [ ] Test alert notifications
- [ ] Review and tune performance
---
## 🔗 Related Documents
- [Backup & Recovery](./backup-recovery.md)
- [Incident Response](./incident-response.md)
- [ADR-010: Logging Strategy](../05-decisions/ADR-010-logging-monitoring-strategy.md)
---
**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01

@@ -0,0 +1,444 @@
# Security Operations
**Project:** LCBP3-DMS
**Version:** 1.5.0
**Last Updated:** 2025-12-01
---
## 📋 Overview
This document outlines security monitoring, access control management, vulnerability management, and security incident response for LCBP3-DMS.
---
## 🔒 Access Control Management
### User Access Review
**Monthly Tasks:**
```bash
#!/bin/bash
# File: /scripts/audit-user-access.sh
# Assumption: DB_ROOT_PASSWORD is exported in the environment; a bare -p prompts for a
# password, which fails without an interactive terminal
# Export active users (mysql writes tab-separated output in batch mode)
docker exec -e MYSQL_PWD="$DB_ROOT_PASSWORD" lcbp3-mariadb mysql -u root -e "
  SELECT user_id, username, email, primary_organization_id, is_active, last_login_at
  FROM lcbp3_dms.users
  WHERE is_active = 1
  ORDER BY last_login_at DESC;
" > /reports/active-users-$(date +%Y%m%d).tsv
# Find dormant accounts (no login > 90 days)
docker exec -e MYSQL_PWD="$DB_ROOT_PASSWORD" lcbp3-mariadb mysql -u root -e "
  SELECT user_id, username, email, last_login_at,
         DATEDIFF(NOW(), last_login_at) AS days_inactive
  FROM lcbp3_dms.users
  WHERE is_active = 1
    AND (last_login_at IS NULL OR last_login_at < DATE_SUB(NOW(), INTERVAL 90 DAY));
"
echo "User access audit completed: $(date)"
```
### Role & Permission Audit
```sql
-- Review users with elevated permissions
SELECT u.username, u.email, r.role_name, r.scope
FROM users u
JOIN user_assignments ua ON u.user_id = ua.user_id
JOIN roles r ON ua.role_id = r.role_id
WHERE r.role_name IN ('Superadmin', 'Document Controller', 'Project Manager')
ORDER BY r.role_name, u.username;
-- Review Global scope roles (highest privilege)
SELECT u.username, r.role_name
FROM users u
JOIN user_assignments ua ON u.user_id = ua.user_id
JOIN roles r ON ua.role_id = r.role_id
WHERE r.scope = 'Global';
```
---
## 🛡️ Security Monitoring
### Log Monitoring for Security Events
```bash
#!/bin/bash
# File: /scripts/monitor-security-events.sh
# docker logs replays container stderr on stderr, so merge it with 2>&1 before filtering
# Check for failed login attempts
docker logs lcbp3-backend 2>&1 | grep "Failed login" | tail -20
# Check for unauthorized access attempts (403)
docker logs lcbp3-backend 2>&1 | grep "403" | tail -20
# Check for unusual activity patterns
docker logs lcbp3-backend 2>&1 | grep -E "DELETE|DROP|TRUNCATE" | tail -20
# Check for SQL injection attempts
docker logs lcbp3-backend 2>&1 | grep -i "SELECT.*FROM.*WHERE" | grep -v "legitimate" | tail -20
```
### Failed Login Monitoring
```sql
-- Find accounts with multiple failed login attempts
SELECT username, failed_attempts, locked_until
FROM users
WHERE failed_attempts >= 3
ORDER BY failed_attempts DESC;
-- Unlock user account after verification
UPDATE users
SET failed_attempts = 0, locked_until = NULL
WHERE user_id = ?;
```
---
## 🔐 Secrets & Credentials Management
### Password Rotation Schedule
| Credential | Rotation Frequency | Owner |
| ---------------------- | ------------------------ | ------------ |
| Database Root Password | Every 90 days | DBA |
| Database App Password | Every 90 days | DevOps |
| JWT Secret | Every 180 days | Backend Team |
| Redis Password | Every 90 days | DevOps |
| SMTP Password | When provider requires | Operations |
| SSL Private Key | With certificate renewal | Operations |
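
The fixed rotation periods in the table lend themselves to an automated reminder. The sketch below decides whether a credential is due, given when it was last rotated; the function name `rotation_due` is hypothetical, and where the last-rotation timestamp is stored is an open assumption that this document does not cover.

```bash
#!/bin/bash
# Hypothetical helper: decide whether a credential is due for rotation.
# Arguments: last rotation time (epoch seconds), rotation period (days),
#            current time (epoch seconds, defaults to now).
rotation_due() {
  local last_rotated="$1"
  local period_days="$2"
  local now="${3:-$(date +%s)}"
  local age_days=$(( (now - last_rotated) / 86400 ))
  # Due when the credential is at least as old as its rotation period
  [ "$age_days" -ge "$period_days" ]
}
```

For the database passwords the period would be 90 and for the JWT secret 180; a daily job could call `rotation_due` for each credential and send a reminder through the existing alert scripts.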
### Password Rotation Procedure
```bash
#!/bin/bash
# File: /scripts/rotate-db-password.sh
# Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)
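# Update database user password
# Assumption: DB_ROOT_PASSWORD is exported in the environment (a bare -p needs an interactive prompt)
docker exec -e MYSQL_PWD="$DB_ROOT_PASSWORD" lcbp3-mariadb mysql -u root -e "
  ALTER USER 'lcbp3_user'@'%' IDENTIFIED BY '$NEW_PASSWORD';
  FLUSH PRIVILEGES;
"
# Update application .env file
# Use | as the sed delimiter because base64 output can contain /
sed -i "s|^DB_PASS=.*|DB_PASS=$NEW_PASSWORD|" /app/backend/.env
# Restart backend to apply new password
# Note: if DB_PASS is injected by docker-compose (env_file/environment), recreate the container instead;
# docker restart keeps the environment the container was created with
docker restart lcbp3-backend
# Verify connection
sleep 10
curl -f http://localhost:3000/health || {
  echo "FAILED: Backend cannot connect with new password"
  # Rollback procedure...
  exit 1
}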
echo "Database password rotated successfully: $(date)"
# Store password securely (e.g., password manager)
```
---
## 🚨 Vulnerability Management
### Dependency Vulnerability Scanning
```bash
#!/bin/bash
# File: /scripts/scan-vulnerabilities.sh
# Backend dependencies
cd /app/backend
npm audit --production
# Critical/High vulnerabilities
VULNERABILITIES=$(npm audit --production --json | jq '.metadata.vulnerabilities.high + .metadata.vulnerabilities.critical')
if [ "$VULNERABILITIES" -gt 0 ]; then
echo "WARNING: $VULNERABILITIES critical/high vulnerabilities found!"
npm audit --production > /reports/security-audit-$(date +%Y%m%d).txt
# Send alert
/scripts/send-alert-email.sh "Security Vulnerabilities Detected" "Found $VULNERABILITIES critical/high vulnerabilities"
fi
# Frontend dependencies
cd /app/frontend
npm audit --production
```
### Container Image Scanning
```bash
#!/bin/bash
# File: /scripts/scan-images.sh
# Install Trivy (if not installed)
# wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | apt-key add -
# echo "deb https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main" | tee -a /etc/apt/sources.list.d/trivy.list
# apt-get update && apt-get install trivy
# Scan Docker images
trivy image --severity HIGH,CRITICAL lcbp3-backend:latest
trivy image --severity HIGH,CRITICAL lcbp3-frontend:latest
trivy image --severity HIGH,CRITICAL mariadb:10.11
trivy image --severity HIGH,CRITICAL redis:7.2-alpine
```
---
## 🔍 Security Hardening
### Server Hardening Checklist
- [ ] Disable root SSH login
- [ ] Use SSH key authentication only
- [ ] Configure firewall (allow only necessary ports)
- [ ] Enable automatic security updates
- [ ] Remove unnecessary services
- [ ] Configure fail2ban for brute-force protection
- [ ] Enable SELinux/AppArmor
- [ ] Regular security patch updates
### Docker Security
```yaml
# docker-compose.yml - Security best practices
services:
  backend:
    # Run as non-root user
    user: 'node:node'
    # Read-only root filesystem
    read_only: true
    # No new privileges
    security_opt:
      - no-new-privileges:true
    # Limit capabilities
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          memory: 512M
```
### Database Security
```sql
-- Remove anonymous users
DELETE FROM mysql.user WHERE User='';
-- Remove test database
DROP DATABASE IF EXISTS test;
-- Remove remote root login
DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1');
-- Create dedicated backup user with minimal privileges
CREATE USER 'backup_user'@'localhost' IDENTIFIED BY 'STRONG_PASSWORD';
GRANT SELECT, LOCK TABLES, SHOW VIEW, EVENT, TRIGGER ON lcbp3_dms.* TO 'backup_user'@'localhost';
-- Enable SSL for database connections
-- GRANT USAGE ON *.* TO 'lcbp3_user'@'%' REQUIRE SSL;
FLUSH PRIVILEGES;
```
---
## 🚨 Security Incident Response
### Incident Classification
| Type | Examples | Response Time |
| ----------------------- | ---------------------------- | ---------------- |
| **Data Breach** | Unauthorized data access | Immediate (< 1h) |
| **Account Compromise** | Stolen credentials | Immediate (< 1h) |
| **DDoS Attack** | Service unavailable | Immediate (< 1h) |
| **Malware/Ransomware** | Infected systems | Immediate (< 1h) |
| **Unauthorized Access** | Failed authentication spikes | High (< 4h) |
| **Suspicious Activity** | Unusual patterns | Medium (< 24h) |
### Data Breach Response
**Immediate Actions:**
1. **Contain the breach**
```bash
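# Block suspicious IPs at firewall level
# -I inserts the rule at the top of the chain; a rule appended with -A is skipped when an earlier ACCEPT matches
iptables -I INPUT 1 -s <SUSPICIOUS_IP> -j DROP
# Traffic to published container ports bypasses INPUT, so block it in Docker's DOCKER-USER chain as well
iptables -I DOCKER-USER -s <SUSPICIOUS_IP> -j DROP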
# Disable compromised user accounts
docker exec lcbp3-mariadb mysql -u root -p -e "
UPDATE lcbp3_dms.users
SET is_active = 0
WHERE user_id = <COMPROMISED_USER_ID>;
"
```
2. **Assess impact**
```sql
-- Check audit logs for unauthorized access
SELECT * FROM audit_logs
WHERE user_id = <COMPROMISED_USER_ID>
AND created_at >= '<SUSPECTED_START_TIME>'
ORDER BY created_at DESC;
-- Check what documents were accessed
SELECT DISTINCT entity_id, entity_type, action
FROM audit_logs
WHERE user_id = <COMPROMISED_USER_ID>;
```
3. **Notify stakeholders**
- Security officer
- Management
- Affected users (if applicable)
- Legal team (if required by law)
4. **Document everything**
- Timeline of events
- Data accessed/compromised
- Actions taken
- Lessons learned
### Account Compromise Response
```bash
#!/bin/bash
# File: /scripts/respond-account-compromise.sh
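# Assumption: DB_ROOT_PASSWORD is exported in the environment (a bare -p needs an interactive prompt)
USER_ID=$1
# Accept only a numeric ID, since it is interpolated into SQL below
case "$USER_ID" in
  ''|*[!0-9]*) echo "Usage: $0 <numeric_user_id>"; exit 1 ;;
esac
# 1. Immediately disable account
docker exec -e MYSQL_PWD="$DB_ROOT_PASSWORD" lcbp3-mariadb mysql -u root -e "
  UPDATE lcbp3_dms.users
  SET is_active = 0,
      locked_until = DATE_ADD(NOW(), INTERVAL 24 HOUR)
  WHERE user_id = $USER_ID;
"
# 2. Invalidate all sessions
# DEL does not expand glob patterns, so list matching keys with --scan and delete each one
docker exec lcbp3-redis redis-cli --scan --pattern "session:user:$USER_ID:*" |
  while read -r KEY; do
    docker exec lcbp3-redis redis-cli DEL "$KEY"
  done
# 3. Generate audit report
docker exec -e MYSQL_PWD="$DB_ROOT_PASSWORD" lcbp3-mariadb mysql -u root -e "
  SELECT * FROM lcbp3_dms.audit_logs
  WHERE user_id = $USER_ID
    AND created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
  ORDER BY created_at DESC;
" > /reports/compromise-audit-user-$USER_ID-$(date +%Y%m%d).txt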
# 4. Notify security team
/scripts/send-alert-email.sh "Account Compromise" "User ID $USER_ID has been compromised and disabled"
echo "Account compromise response completed for User ID: $USER_ID"
```
---
## 📊 Security Metrics & KPIs
### Monthly Security Report
| Metric | Target | Actual |
| --------------------------- | --------- | ------ |
| Failed Login Attempts | < 100/day | Track |
| Locked Accounts | < 5/month | Track |
| Critical Vulnerabilities | 0 | Track |
| High Vulnerabilities | < 5 | Track |
| Unpatched Systems | 0 | Track |
| Security Incidents | 0 | Track |
| Mean Time To Detect (MTTD) | < 1 hour | Track |
| Mean Time To Respond (MTTR) | < 4 hours | Track |
---
## 🔐 Compliance & Audit
### Audit Log Retention
- **Access Logs:** 1 year
- **Security Events:** 2 years
- **Admin Actions:** 3 years
- **Data Changes:** 7 years (as required)
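
The retention periods above can be encoded once and reused by any cleanup or archival job, so no script hard-codes its own interval. The sketch below is illustrative: the function name `retention_days` and the category keys (`access`, `security`, `admin`, `data`) are hypothetical, and years are approximated as 365 days.

```bash
#!/bin/bash
# Map an audit log category to its retention period in days (365-day years).
retention_days() {
  case "$1" in
    access)   echo 365 ;;    # Access Logs: 1 year
    security) echo 730 ;;    # Security Events: 2 years
    admin)    echo 1095 ;;   # Admin Actions: 3 years
    data)     echo 2555 ;;   # Data Changes: 7 years
    *)        return 1 ;;    # unknown category
  esac
}
```

A cleanup script could then build its cutoff from the result, for example `INTERVAL $(retention_days access) DAY` in the SQL it sends to MariaDB.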
### Compliance Checklist
- [ ] Regular security audits (quarterly)
- [ ] Penetration testing (annually)
- [ ] Access control reviews (monthly)
- [ ] Encryption at rest and in transit
- [ ] Secure password policies enforced
- [ ] Multi-factor authentication (if required)
- [ ] Data backup and recovery tested
- [ ] Incident response plan documented and tested
---
## ✅ Security Operations Checklist
### Daily
- [ ] Review security alerts and logs
- [ ] Monitor failed login attempts
- [ ] Check for unusual access patterns
- [ ] Verify backup completion
### Weekly
- [ ] Review user access logs
- [ ] Scan for vulnerabilities
- [ ] Update virus definitions
- [ ] Review firewall logs
### Monthly
- [ ] User access audit
- [ ] Role and permission review
- [ ] Security patch application
- [ ] Compliance review
### Quarterly
- [ ] Full security audit
- [ ] Penetration testing
- [ ] Disaster recovery drill
- [ ] Update security policies
---
## 🔗 Related Documents
- [Incident Response](./incident-response.md)
- [Monitoring & Alerting](./monitoring-alerting.md)
- [ADR-004: RBAC Implementation](../05-decisions/ADR-004-rbac-implementation.md)
---
**Version:** 1.5.0
**Last Review:** 2025-12-01
**Next Review:** 2026-03-01