diff --git a/specs/002-infra-ops/checklists/requirements.md b/specs/002-infra-ops/checklists/requirements.md new file mode 100644 index 0000000..8207ffe --- /dev/null +++ b/specs/002-infra-ops/checklists/requirements.md @@ -0,0 +1,34 @@ +# Specification Quality Checklist: Infrastructure Operations & Deployment Automation + +**Purpose**: Validate specification completeness and quality before proceeding to planning +**Created**: 2026-04-20 +**Feature**: [Infrastructure Operations & Deployment Automation](../spec.md) + +## Content Quality + +- [x] No implementation details (languages, frameworks, APIs) +- [x] Focused on user value and business needs +- [x] Written for non-technical stakeholders +- [x] All mandatory sections completed + +## Requirement Completeness + +- [x] No [NEEDS CLARIFICATION] markers remain +- [x] Requirements are testable and unambiguous +- [x] Success criteria are measurable +- [x] Success criteria are technology-agnostic (no implementation details) +- [x] All acceptance scenarios are defined +- [x] Edge cases are identified +- [x] Scope is clearly bounded +- [x] Dependencies and assumptions identified + +## Feature Readiness + +- [x] All functional requirements have clear acceptance criteria +- [x] User scenarios cover primary flows +- [x] Feature meets measurable outcomes defined in Success Criteria +- [x] No implementation details leak into specification + +## Notes + +- Items marked incomplete require spec updates before `/speckit-clarify` or `/speckit-plan` diff --git a/specs/002-infra-ops/contracts/infrastructure-api.yaml b/specs/002-infra-ops/contracts/infrastructure-api.yaml new file mode 100644 index 0000000..786cf65 --- /dev/null +++ b/specs/002-infra-ops/contracts/infrastructure-api.yaml @@ -0,0 +1,500 @@ +openapi: 3.0.3 +info: + title: Infrastructure Operations API + description: API for managing infrastructure operations, deployments, and monitoring + version: 1.0.0 + contact: + name: Infrastructure Team + email: infra@np-dms.work 
+ +paths: + /deployments: + get: + summary: List all deployments + description: Retrieve status of all deployment environments + tags: + - Deployments + responses: + '200': + description: List of deployments retrieved successfully + content: + application/json: + schema: + type: object + properties: + deployments: + type: array + items: + $ref: '#/components/schemas/Deployment' + + post: + summary: Create new deployment + description: Initiate a new deployment to specified environment + tags: + - Deployments + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/DeploymentRequest' + responses: + '201': + description: Deployment initiated successfully + content: + application/json: + schema: + $ref: '#/components/schemas/Deployment' + '400': + description: Invalid deployment request + '409': + description: Deployment already in progress + + /deployments/{deploymentId}: + get: + summary: Get deployment details + description: Retrieve detailed information about a specific deployment + tags: + - Deployments + parameters: + - name: deploymentId + in: path + required: true + schema: + type: string + format: uuid + responses: + '200': + description: Deployment details retrieved successfully + content: + application/json: + schema: + $ref: '#/components/schemas/Deployment' + '404': + description: Deployment not found + + patch: + summary: Update deployment status + description: Update deployment status or trigger rollback + tags: + - Deployments + parameters: + - name: deploymentId + in: path + required: true + schema: + type: string + format: uuid + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/DeploymentUpdate' + responses: + '200': + description: Deployment updated successfully + content: + application/json: + schema: + $ref: '#/components/schemas/Deployment' + '404': + description: Deployment not found + '409': + description: Invalid state transition + + /backups: + 
get: + summary: List backup archives + description: Retrieve list of available backup archives + tags: + - Backups + parameters: + - name: status + in: query + schema: + type: string + enum: [completed, in_progress, failed, validated] + - name: environment + in: query + schema: + type: string + responses: + '200': + description: List of backup archives retrieved successfully + content: + application/json: + schema: + type: object + properties: + backups: + type: array + items: + $ref: '#/components/schemas/BackupArchive' + + post: + summary: Create backup + description: Initiate a new backup operation + tags: + - Backups + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/BackupRequest' + responses: + '201': + description: Backup initiated successfully + content: + application/json: + schema: + $ref: '#/components/schemas/BackupArchive' + '409': + description: Backup already in progress + + /backups/{backupId}/restore: + post: + summary: Restore from backup + description: Initiate restore operation from specified backup + tags: + - Backups + parameters: + - name: backupId + in: path + required: true + schema: + type: string + format: uuid + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/RestoreRequest' + responses: + '202': + description: Restore operation initiated + content: + application/json: + schema: + $ref: '#/components/schemas/RestoreOperation' + '404': + description: Backup not found + '409': + description: Restore operation already in progress + + /monitoring/metrics: + get: + summary: Get monitoring metrics + description: Retrieve current monitoring metrics for all services + tags: + - Monitoring + parameters: + - name: service + in: query + schema: + type: string + - name: metric + in: query + schema: + type: string + - name: timeRange + in: query + schema: + type: string + enum: [1h, 6h, 24h, 7d, 30d] + responses: + '200': + description: Metrics 
retrieved successfully + content: + application/json: + schema: + type: object + properties: + metrics: + type: array + items: + $ref: '#/components/schemas/MonitoringMetric' + + /monitoring/alerts: + get: + summary: Get active alerts + description: Retrieve list of active monitoring alerts + tags: + - Monitoring + parameters: + - name: severity + in: query + schema: + type: string + enum: [critical, warning, info] + - name: status + in: query + schema: + type: string + enum: [active, acknowledged, resolved] + responses: + '200': + description: Alerts retrieved successfully + content: + application/json: + schema: + type: object + properties: + alerts: + type: array + items: + $ref: '#/components/schemas/Alert' + + post: + summary: Acknowledge alert + description: Acknowledge an active alert + tags: + - Monitoring + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/AlertAcknowledgment' + responses: + '200': + description: Alert acknowledged successfully + '404': + description: Alert not found + +components: + schemas: + Deployment: + type: object + properties: + id: + type: string + format: uuid + environment: + type: string + enum: [blue, green, staging, production] + status: + type: string + enum: [planned, in_progress, testing, live, failed, decommissioned] + version: + type: string + services: + type: array + items: + type: string + createdAt: + type: string + format: date-time + updatedAt: + type: string + format: date-time + healthStatus: + type: string + enum: [healthy, unhealthy, unknown] + + DeploymentRequest: + type: object + required: + - environment + - version + properties: + environment: + type: string + enum: [blue, green, staging, production] + version: + type: string + services: + type: array + items: + type: string + rollbackPlan: + type: boolean + healthCheckTimeout: + type: integer + format: int32 + + DeploymentUpdate: + type: object + properties: + status: + type: string + enum: [testing, live, 
failed, decommissioned] + rollback: + type: boolean + reason: + type: string + + BackupArchive: + type: object + properties: + id: + type: string + format: uuid + type: + type: string + enum: [full, incremental, differential] + status: + type: string + enum: [scheduled, in_progress, completed, failed, validated, expired] + environment: + type: string + size: + type: integer + format: int64 + compressionRatio: + type: number + format: float + encrypted: + type: boolean + validated: + type: boolean + createdAt: + type: string + format: date-time + expiresAt: + type: string + format: date-time + retentionDays: + type: integer + format: int32 + + BackupRequest: + type: object + required: + - type + - environment + properties: + type: + type: string + enum: [full, incremental, differential] + environment: + type: string + include: + type: array + items: + type: string + enum: [databases, files, configurations, logs] + compression: + type: boolean + encryption: + type: boolean + validation: + type: boolean + + RestoreRequest: + type: object + required: + - targetEnvironment + properties: + targetEnvironment: + type: string + include: + type: array + items: + type: string + enum: [databases, files, configurations, logs] + confirm: + type: boolean + reason: + type: string + + RestoreOperation: + type: object + properties: + id: + type: string + format: uuid + backupId: + type: string + format: uuid + targetEnvironment: + type: string + status: + type: string + enum: [pending, in_progress, completed, failed] + progress: + type: integer + format: int32 + estimatedCompletion: + type: string + format: date-time + startedAt: + type: string + format: date-time + + MonitoringMetric: + type: object + properties: + id: + type: string + format: uuid + service: + type: string + metric: + type: string + value: + type: number + format: float + unit: + type: string + timestamp: + type: string + format: date-time + labels: + type: object + additionalProperties: + type: string + + Alert: 
+ type: object + properties: + id: + type: string + format: uuid + rule: + type: string + severity: + type: string + enum: [critical, warning, info] + status: + type: string + enum: [active, acknowledged, resolved] + service: + type: string + message: + type: string + triggeredAt: + type: string + format: date-time + acknowledgedAt: + type: string + format: date-time + acknowledgedBy: + type: string + resolvedAt: + type: string + format: date-time + + AlertAcknowledgment: + type: object + required: + - alertId + properties: + alertId: + type: string + format: uuid + acknowledgedBy: + type: string + note: + type: string + + securitySchemes: + BearerAuth: + type: http + scheme: bearer + bearerFormat: JWT + +security: + - BearerAuth: [] diff --git a/specs/002-infra-ops/data-model.md b/specs/002-infra-ops/data-model.md new file mode 100644 index 0000000..ea03aab --- /dev/null +++ b/specs/002-infra-ops/data-model.md @@ -0,0 +1,249 @@ +# Data Model: Infrastructure Operations & Deployment Automation + +**Date**: 2026-04-20 +**Feature**: Infrastructure Operations & Deployment Automation +**Status**: Complete + +## Infrastructure Entities + +### Docker Compose Configuration + +**Description**: Infrastructure as code definitions for all services, environments, and deployments +**Key Attributes**: +- Configuration ID (unique identifier) +- Environment (development/staging/production) +- Service definitions and dependencies +- Network configurations +- Volume mappings +- Environment variables (secrets excluded) +- Health check definitions +- Resource limits +- Security policies (user, capabilities, read-only) + +**Validation Rules**: +- All services must have health checks +- All containers must specify non-root user where possible +- All secrets must use external env files +- All images must use specific tags (no :latest) +- Resource limits must be defined for CPU and memory + +### Backup Archive + +**Description**: Complete system snapshots including databases, files, and 
configurations with metadata +**Key Attributes**: +- Archive ID (unique identifier) +- Timestamp (creation time) +- Backup type (full/incremental) +- Source environment +- Data sources (databases, files, configs) +- Compression status +- Encryption status +- Validation status +- Retention period +- Storage location + +**Validation Rules**: +- All archives must be encrypted +- All archives must have integrity validation +- Backup frequency: daily for critical data +- Retention: 30 days daily, 90 days weekly, 1 year monthly +- Must include database consistency checks + +### Monitoring Metric + +**Description**: Performance and health data points collected from all infrastructure components +**Key Attributes**: +- Metric ID (unique identifier) +- Source service/container +- Metric name and type +- Value and timestamp +- Labels and dimensions +- Threshold definitions +- Alert status +- Aggregation rules + +**Validation Rules**: +- All services must expose health metrics +- Critical metrics must have alert thresholds +- Data retention: 90 days detailed, 1 year aggregated +- Metrics must include CPU, memory, disk, network +- Application-specific metrics for business logic + +### Security Policy + +**Description**: Container hardening rules and compliance requirements for all deployments +**Key Attributes**: +- Policy ID (unique identifier) +- Policy type (user, capabilities, filesystem) +- Rule definitions +- Applicable services +- Compliance status +- Violation tracking +- Remediation procedures + +**Validation Rules**: +- All containers must run with non-root users +- All containers must drop unnecessary capabilities +- All containers must use read-only filesystems where possible +- All containers must have security options defined +- Regular vulnerability scanning required + +### Deployment Environment + +**Description**: Isolated runtime spaces with consistent configurations +**Key Attributes**: +- Environment ID (unique identifier) +- Environment type (blue/green) 
+- Service instances +- Network configuration +- Storage configuration +- Access controls +- Deployment status +- Health status + +**Validation Rules**: +- Blue and green environments must be identical +- Network isolation between environments +- Consistent configuration across environments +- Automated health checks required +- Traffic switching must be atomic + +### Alert Rule + +**Description**: Threshold-based conditions that trigger notifications when system metrics exceed limits +**Key Attributes**: +- Rule ID (unique identifier) +- Metric source +- Threshold conditions +- Severity levels +- Notification channels +- Escalation rules +- Suppression rules +- Acknowledgment status + +**Validation Rules**: +- All critical services must have alert rules +- Alert response time must be < 30 seconds +- Must include escalation paths +- Must define recovery procedures +- Regular alert testing required + +### Secret Configuration + +**Description**: Sensitive information managed outside version control +**Key Attributes**: +- Secret ID (unique identifier) +- Secret type (password, key, certificate) +- Usage context +- Access controls +- Rotation schedule +- Expiration date +- Compliance requirements + +**Validation Rules**: +- No secrets in version control +- All secrets must be encrypted at rest +- Access must be role-based +- Regular rotation required +- Audit trail for all access + +### Service Instance + +**Description**: Running container with specific configuration and health status +**Key Attributes**: +- Instance ID (unique identifier) +- Service name and version +- Container configuration +- Resource allocation +- Health status +- Start time +- Network endpoints +- Log configuration + +**Validation Rules**: +- All instances must have health checks +- Resource limits must be enforced +- Restart policies must be defined +- Log aggregation must be configured +- Performance monitoring required + +### Infrastructure Change + +**Description**: Version-controlled 
modification to system configuration or deployment +**Key Attributes**: +- Change ID (unique identifier) +- Change type (configuration, deployment, security) +- Description and rationale +- Approval status +- Implementation status +- Rollback plan +- Impact assessment +- Compliance validation + +**Validation Rules**: +- All changes must be version-controlled +- Changes require approval before production +- Rollback plans must be tested +- Impact assessment required +- Compliance validation mandatory + +### Recovery Point + +**Description**: Validated backup state that can be restored for disaster recovery +**Key Attributes**: +- Recovery point ID (unique identifier) +- Archive reference +- Validation status +- Recovery time objective +- Recovery procedures +- Test results +- Dependencies + +**Validation Rules**: +- All recovery points must be tested +- RTO must be < 4 hours +- Recovery procedures must be documented +- Regular testing required +- Success rate must be > 95% + +## State Transitions + +### Deployment Lifecycle +``` +Planned -> In Progress -> Testing -> Live -> Decommissioned +``` + +### Backup Lifecycle +``` +Scheduled -> In Progress -> Completed -> Validated -> Expired +``` + +### Alert Lifecycle +``` +Triggered -> Acknowledged -> Resolved -> Closed +``` + +### Change Management +``` +Requested -> Approved -> Implemented -> Validated -> Closed +``` + +## Relationships + +- **Environment** contains many **Service Instances** +- **Service Instance** generates **Monitoring Metrics** +- **Backup Archive** contains data from **Service Instances** +- **Alert Rule** monitors **Monitoring Metrics** +- **Security Policy** applies to **Service Instances** +- **Infrastructure Change** modifies **Deployment Environments** +- **Recovery Point** references **Backup Archive** +- **Secret Configuration** used by **Service Instances** + +## Data Integrity Constraints + +- All entities must have unique identifiers +- All timestamps must be UTC +- All audit fields must 
be immutable +- Foreign key relationships must be validated +- All sensitive data must be encrypted +- All changes must be auditable diff --git a/specs/002-infra-ops/plan.md b/specs/002-infra-ops/plan.md new file mode 100644 index 0000000..146d716 --- /dev/null +++ b/specs/002-infra-ops/plan.md @@ -0,0 +1,105 @@ +# Implementation Plan: [FEATURE] + +**Branch**: `[###-feature-name]` | **Date**: [DATE] | **Spec**: [link] +**Input**: Feature specification from `/specs/[###-feature-name]/spec.md` + +**Note**: This template is filled in by the `/speckit.plan` command. See `.specify/templates/commands/plan.md` for the execution workflow. + +## Summary + +[Extract from feature spec: primary requirement + technical approach from research] + +## Technical Context + + + +**Language/Version**: [e.g., Python 3.11, Swift 5.9, Rust 1.75 or NEEDS CLARIFICATION] +**Primary Dependencies**: [e.g., FastAPI, UIKit, LLVM or NEEDS CLARIFICATION] +**Storage**: [if applicable, e.g., PostgreSQL, CoreData, files or N/A] +**Testing**: [e.g., pytest, XCTest, cargo test or NEEDS CLARIFICATION] +**Target Platform**: [e.g., Linux server, iOS 15+, WASM or NEEDS CLARIFICATION] +**Project Type**: [single/web/mobile - determines source structure] +**Performance Goals**: [domain-specific, e.g., 1000 req/s, 10k lines/sec, 60 fps or NEEDS CLARIFICATION] +**Constraints**: [domain-specific, e.g., <200ms p95, <100MB memory, offline-capable or NEEDS CLARIFICATION] +**Scale/Scope**: [domain-specific, e.g., 10k users, 1M LOC, 50 screens or NEEDS CLARIFICATION] + +## Constitution Check + +_GATE: Must pass before Phase 0 research. 
Re-check after Phase 1 design._ + +[Gates determined based on constitution file] + +## Project Structure + +### Documentation (this feature) + +```text +specs/[###-feature]/ +├── plan.md # This file (/speckit.plan command output) +├── research.md # Phase 0 output (/speckit.plan command) +├── data-model.md # Phase 1 output (/speckit.plan command) +├── quickstart.md # Phase 1 output (/speckit.plan command) +├── contracts/ # Phase 1 output (/speckit.plan command) +└── tasks.md # Phase 2 output (/speckit.tasks command - NOT created by /speckit.plan) +``` + +### Source Code (repository root) + + + +```text +# [REMOVE IF UNUSED] Option 1: Single project (DEFAULT) +src/ +├── models/ +├── services/ +├── cli/ +└── lib/ + +tests/ +├── contract/ +├── integration/ +└── unit/ + +# [REMOVE IF UNUSED] Option 2: Web application (when "frontend" + "backend" detected) +backend/ +├── src/ +│ ├── models/ +│ ├── services/ +│ └── api/ +└── tests/ + +frontend/ +├── src/ +│ ├── components/ +│ ├── pages/ +│ └── services/ +└── tests/ + +# [REMOVE IF UNUSED] Option 3: Mobile + API (when "iOS/Android" detected) +api/ +└── [same as backend above] + +ios/ or android/ +└── [platform-specific structure: feature modules, UI flows, platform tests] +``` + +**Structure Decision**: [Document the selected structure and reference the real +directories captured above] + +## Complexity Tracking + +> **Fill ONLY if Constitution Check has violations that must be justified** + +| Violation | Why Needed | Simpler Alternative Rejected Because | +| -------------------------- | ------------------ | ------------------------------------ | +| [e.g., 4th project] | [current need] | [why 3 projects insufficient] | +| [e.g., Repository pattern] | [specific problem] | [why direct DB access insufficient] | diff --git a/specs/002-infra-ops/quickstart.md b/specs/002-infra-ops/quickstart.md new file mode 100644 index 0000000..6246439 --- /dev/null +++ b/specs/002-infra-ops/quickstart.md @@ -0,0 +1,293 @@ +# Quick Start 
Guide: Infrastructure Operations & Deployment Automation + +**Purpose**: Get started with the Infrastructure Operations & Deployment Automation feature +**Date**: 2026-04-20 +**Target Audience**: DevOps Engineers, System Administrators + +## Prerequisites + +### Hardware Requirements +- QNAP NAS (192.168.10.8) with Docker support +- ASUSTOR NAS (192.168.10.9) with Docker support +- SSH access between NAS devices configured +- Minimum 100GB storage for backups + +### Software Requirements +- Docker 20.10+ +- Docker Compose 2.0+ +- Bash 5.0+ or PowerShell 7.2+ +- Git client +- SSH key authentication + +### Network Requirements +- Static IP addresses for both NAS devices +- Open ports: 22 (SSH), 80/443 (HTTP/HTTPS), 8080 (applications) +- VPN or secure network connection for remote access + +## Initial Setup + +### 1. Repository Configuration + +```bash +# Clone the repository +git clone https://git.np-dms.work/np-dms/lcbp3.git +cd lcbp3 + +# Switch to the infrastructure branch +git checkout 002-infra-ops +``` + +### 2. SSH Key Authentication + +Ensure SSH keys are configured between QNAP and ASUSTOR: + +```bash +# Test SSH connectivity +ssh admin@192.168.10.8 "docker --version" +ssh admin@192.168.10.9 "docker --version" +``` + +### 3. Environment Configuration + +Copy and configure environment files: + +```bash +# QNAP environments +cp specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/app/.env.example \ + specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/app/.env + +# ASUSTOR environments +cp specs/04-Infrastructure-OPS/04-00-docker-compose/ASUSTOR/registry/.env.example \ + specs/04-Infrastructure-OPS/04-00-docker-compose/ASUSTOR/registry/.env +``` + +Edit the `.env` files with your specific configurations: +- Database passwords +- SSL certificate paths +- Backup storage locations +- Monitoring endpoints + +## Core Services Deployment + +### 1. 
Database Services (QNAP) + +```bash +# From the repository root, navigate to the QNAP database directory +cd specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/mariadb + +# Deploy MariaDB with phpMyAdmin +docker-compose -f docker-compose-lcbp3-db.yml up -d + +# Verify deployment +docker-compose -f docker-compose-lcbp3-db.yml ps +``` + +### 2. Application Services (QNAP) + +```bash +# From the repository root, navigate to the QNAP app directory +cd specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/app + +# Deploy backend, frontend, and ClamAV +docker-compose -f docker-compose-app.yml up -d + +# Verify deployment +docker-compose -f docker-compose-app.yml ps +``` + +### 3. Reverse Proxy (QNAP) + +```bash +# From the repository root, navigate to the Nginx Proxy Manager directory +cd specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/npm + +# Deploy reverse proxy +docker-compose -f docker-compose.yml up -d + +# Access Nginx Proxy Manager +# URL: http://192.168.10.8:81 +# Default: admin@example.com / changeme (change on first login) +``` + +### 4. Monitoring Stack (ASUSTOR) + +```bash +# From the repository root, navigate to the ASUSTOR monitoring directory +cd specs/04-Infrastructure-OPS/04-00-docker-compose/ASUSTOR/monitoring + +# Deploy Prometheus, Grafana, and supporting services +docker-compose -f docker-compose.yml up -d + +# Verify deployment +docker-compose -f docker-compose.yml ps +``` + +## SSL Certificate Setup + +### 1. Initial Certificate Generation + +```bash +# On QNAP, from the repository root, generate Let's Encrypt certificates +cd specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/npm + +# Run certbot for initial certificate +docker-compose exec npm certbot --nginx -d your-domain.com +``` + +### 2. Automated Renewal + +Add to crontab for automatic renewal: + +```bash +# Edit crontab +crontab -e + +# Add renewal task (runs daily at 2 AM; -T disables TTY allocation, which cron cannot provide) +0 2 * * * cd /path/to/npm && docker-compose exec -T npm certbot renew +``` + +## Backup Configuration + +### 1.
Initial Backup Setup + +```bash +# Navigate to backup scripts directory +cd specs/04-Infrastructure-OPS/04-02-backup-recovery + +# Configure backup destinations +cp backup-config.example.yml backup-config.yml + +# Edit backup-config.yml with your storage locations +nano backup-config.yml +``` + +### 2. Automated Backup Schedule + +```bash +# Add backup cron job (runs daily at 1 AM) +0 1 * * * /path/to/backup-scripts/daily-backup.sh + +# Add backup validation (runs weekly on Sunday at 3 AM) +0 3 * * 0 /path/to/backup-scripts/validate-backups.sh +``` + +## Monitoring Configuration + +### 1. Grafana Dashboard Access + +1. Access Grafana: `http://192.168.10.9:3000` +2. Default credentials: `admin / admin` (change on first login) +3. Import dashboards from `specs/04-Infrastructure-OPS/04-03-monitoring/dashboards/` + +### 2. Alert Configuration + +1. Access AlertManager: `http://192.168.10.9:9093` +2. Configure notification channels (email, Slack, etc.) +3. Test alert rules to ensure notifications work + +## Blue-Green Deployment + +### 1. Environment Setup + +```bash +# Create blue environment (current production) +cd specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/app +docker-compose -f docker-compose-app.yml -p app-blue up -d + +# Create green environment (new version) +docker-compose -f docker-compose-app.yml -p app-green up -d +``` + +### 2. Traffic Switching + +```bash +# Switch traffic to green environment +# Update Nginx Proxy Manager upstream configuration +# Point to green environment containers +# Test green environment functionality +``` + +### 3. Rollback Procedure + +```bash +# If issues detected, rollback to blue +# Update Nginx Proxy Manager upstream configuration +# Point back to blue environment containers +# Stop green environment containers +``` + +## Security Hardening + +### 1. 
Container Security Scan + +```bash +# Install Trivy +curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin + +# Scan the image of each running container (trivy image accepts one image per invocation) +docker ps --format "{{.Image}}" | sort -u | xargs -n 1 trivy image --severity HIGH,CRITICAL +``` + +### 2. Security Policy Validation + +```bash +# Run security validation script +cd specs/04-Infrastructure-OPS/04-06-security-operations +./validate-security-policies.sh +``` + +## Troubleshooting + +### Common Issues + +1. **Container won't start** + ```bash + # Check logs + docker-compose logs [service-name] + + # Check resource usage (single snapshot instead of a live stream) + docker stats --no-stream + ``` + +2. **Backup failures** + ```bash + # Check backup logs + tail -f /var/log/backup.log + + # Test connectivity to backup storage + ping backup-storage-host + ``` + +3. **Monitoring alerts not working** + ```bash + # Check Prometheus targets + curl http://192.168.10.9:9090/api/v1/targets + + # List current alerts via the AlertManager v2 API + curl http://192.168.10.9:9093/api/v2/alerts + ``` + +### Health Checks + +```bash +# Check all services health +curl -f http://192.168.10.8:3000/health || echo "Backend unhealthy" +curl -f http://192.168.10.8/health || echo "Frontend unhealthy" +curl -f http://192.168.10.9:9090/-/healthy || echo "Prometheus unhealthy" +``` + +## Next Steps + +1. **Configure automated monitoring alerts** for your specific thresholds +2. **Set up backup retention policies** based on your compliance requirements +3. **Implement disaster recovery testing** on a regular schedule +4. **Configure log aggregation** for centralized monitoring +5.
**Set up automated security scanning** in your CI/CD pipeline + +## Support + +For issues and questions: +- Check the troubleshooting section above +- Review logs in `/var/log/` directories +- Consult the full documentation in `specs/04-Infrastructure-OPS/` +- Contact the infrastructure team for escalated issues diff --git a/specs/002-infra-ops/research.md b/specs/002-infra-ops/research.md new file mode 100644 index 0000000..57fb2ae --- /dev/null +++ b/specs/002-infra-ops/research.md @@ -0,0 +1,82 @@ +# Phase 0 Research: Infrastructure Operations & Deployment Automation + +**Date**: 2026-04-20 +**Feature**: Infrastructure Operations & Deployment Automation +**Status**: Complete + +## Research Findings + +### Blue-Green Deployment Strategy + +**Decision**: Docker Compose with Nginx Proxy Manager for traffic switching +**Rationale**: Provides zero-downtime deployments by maintaining two identical production environments (blue/green) and switching traffic via reverse proxy configuration updates +**Alternatives Considered**: Kubernetes (too complex for current scale), Docker Swarm (limited networking features), Manual deployment scripts (prone to human error) + +### Backup & Recovery Solution + +**Decision**: Restic for encrypted backups + MariaDB dump scripts + automated validation +**Rationale**: Restic provides deduplication, encryption, and cloud storage support. 
Combined with native database dumps, this ensures complete system state capture +**Alternatives Considered**: Borg Backup (steeper learning curve), rsync only (no encryption/deduplication), commercial solutions (cost constraints) + +### Monitoring Stack + +**Decision**: Prometheus + Grafana + AlertManager + Node Exporter + cAdvisor +**Rationale**: Industry-standard monitoring stack with extensive community support, flexible alerting rules, and container-native metrics collection +**Alternatives Considered**: Zabbix (more complex setup), Nagios (older architecture), Datadog (commercial cost) + +### Container Security Hardening + +**Decision**: Docker security hardening with non-root users, read-only filesystems, capability dropping, and Trivy scanning +**Rationale**: Provides defense-in-depth security while maintaining functionality. Trivy offers comprehensive vulnerability scanning +**Alternatives Considered**: Podman (stronger default security, but ecosystem compatibility issues), Kubernetes security policies (overkill for current scale) + +### Multi-NAS Architecture + +**Decision**: QNAP for primary services, ASUSTOR for backup, monitoring, and registry +**Rationale**: Leverages existing hardware investment, keeps backup and monitoring on hardware separate from the primary services, and maintains established SSH key authentication +**Alternatives Considered**: Cloud hosting (recurring costs, data sovereignty concerns), Single NAS (single point of failure) + +### SSL Certificate Management + +**Decision**: Certbot with Let's Encrypt + automated renewal via cron jobs +**Rationale**: Free, automated certificate management with established reliability.
Integration with Nginx Proxy Manager simplifies deployment +**Alternatives Considered**: Commercial CAs (cost), Self-signed certificates (browser warnings), Cloudflare certificates (dependency on external service) + +### Secrets Management + +**Decision**: Environment files with .gitignore + SSH key authentication +**Rationale**: Simple, secure approach that works across both NAS environments. No additional infrastructure required +**Alternatives Considered**: HashiCorp Vault (complex setup), Docker Swarm secrets (require Swarm mode, which this Docker Compose setup does not use), Infisical/SOPS (additional learning curve) + +## Technical Decisions Summary + +1. **Docker Compose** as primary orchestration tool +2. **Blue-Green deployment** pattern for zero downtime +3. **Restic** for backup encryption and deduplication +4. **Prometheus/Grafana** stack for monitoring +5. **Nginx Proxy Manager** for reverse proxy and SSL termination +6. **Trivy** for container vulnerability scanning +7. **Environment files** for secrets management +8. **SSH key authentication** for cross-NAS communication + +## Implementation Constraints + +- Must maintain existing QNAP/ASUSTOR IP addresses (192.168.10.8/9) +- Must preserve current data storage locations +- Must integrate with existing Gitea Actions CI/CD pipeline +- Must comply with ADR-016 security requirements +- Must support Thai language documentation per project standards + +## Success Metrics Alignment + +All technical decisions support the success criteria defined in the specification: + +- 99.9% uptime through redundant infrastructure +- 30-second alert generation via Prometheus monitoring +- 4-hour RTO through automated backup validation +- Zero-downtime deployments via blue-green strategy +- 100% security compliance via container hardening + +## Next Steps + +Proceed to Phase 1: Design & Contracts with these technical foundations established. 
diff --git a/specs/002-infra-ops/spec.md b/specs/002-infra-ops/spec.md new file mode 100644 index 0000000..a6bcd96 --- /dev/null +++ b/specs/002-infra-ops/spec.md @@ -0,0 +1,187 @@ +# Feature Specification: Infrastructure Operations & Deployment Automation + +**Feature Branch**: `002-infra-ops` +**Created**: 2026-04-20 +**Status**: Draft +**Input**: User description: "Infrastructure operations and deployment automation including Docker Compose configurations, container orchestration, monitoring, backup/recovery, and maintenance procedures for the NAP-DMS system" + +## Clarifications + +### Session 2026-04-20 + +- Q: Which services are included in Infrastructure Operations scope beyond NAP-DMS applications? +- A: All services in Docker Compose stacks including Gitea, n8n, RocketChat, and supporting services + +- Q: What is the expected data volume and annual growth rate for all services? +- A: 500GB current data with 20% annual growth + +- Q: What external services or third-party integrations are required beyond internal services? +- A: Email SMTP for notifications and Let's Encrypt for SSL certificates + +- Q: What are the concurrent user count and performance targets for response time? +- A: 100 concurrent users with 2-second average response time + +- Q: What technical constraints exist (budget, hardware, compliance requirements)? +- A: Must work with existing QNAP/ASUSTOR hardware infrastructure + +## User Scenarios & Testing _(mandatory)_ + + + +### User Story 1 - Zero-Downtime Deployment (Priority: P1) + +As a DevOps engineer, I need to deploy updates for all services (NAP-DMS applications, databases, monitoring, Gitea, n8n, RocketChat, and supporting services) without interrupting user access to any system components. + +**Why this priority**: Critical for business continuity - system cannot afford downtime during regular maintenance windows. 
+ +**Independent Test**: Can be fully tested by deploying a test application version using blue-green containers and verifying traffic switches seamlessly without user session interruption. + +**Acceptance Scenarios**: + +1. **Given** a running production environment, **When** I deploy a new version, **Then** users continue accessing the system without interruption +2. **Given** a deployment failure, **When** the rollback is triggered, **Then** the system immediately switches back to the previous stable version + +--- + +### User Story 2 - Automated Backup & Recovery (Priority: P1) + +As a system administrator, I need automated daily backups of all services data (NAP-DMS applications, databases, monitoring, Gitea, n8n, RocketChat, configurations, and supporting services) and the ability to restore the entire system within 4 hours of a catastrophic failure. + +**Why this priority**: Essential for data protection and business continuity compliance with document management regulations. + +**Independent Test**: Can be fully tested by running backup procedures and performing a full system restore in a test environment to verify all data is recoverable. + +**Acceptance Scenarios**: + +1. **Given** the backup schedule is configured, **When** the daily backup runs, **Then** all databases, files, and configurations are successfully backed up +2. **Given** a system failure occurs, **When** I initiate recovery, **Then** the entire system is restored to its last known good state within 4 hours + +--- + +### User Story 3 - Real-time Monitoring & Alerting (Priority: P1) + +As an on-call engineer, I need to receive immediate alerts when any system components (NAP-DMS applications, databases, monitoring, Gitea, n8n, RocketChat, and supporting services) fail or performance degrades below acceptable thresholds. + +**Why this priority**: Prevents minor issues from becoming major outages and ensures rapid response to system problems. 
+ +**Independent Test**: Can be fully tested by simulating various failure scenarios and verifying appropriate alerts are generated and delivered to the correct channels. + +**Acceptance Scenarios**: + +1. **Given** monitoring is active, **When** a service becomes unresponsive, **Then** an alert is sent within 30 seconds +2. **Given** system resources exceed 80% utilization, **When** the threshold is crossed, **Then** a performance alert is generated with actionable diagnostics + +--- + +### User Story 4 - Container Security Hardening (Priority: P2) + +As a security administrator, I need all containers (NAP-DMS applications, databases, monitoring, Gitea, n8n, RocketChat, and supporting services) to run with minimal privileges and no exposed secrets to maintain compliance with security policies. + +**Why this priority**: Prevents privilege escalation attacks and protects sensitive configuration data. + +**Independent Test**: Can be fully tested by running security scans on all containers and verifying they meet hardening requirements. + +**Acceptance Scenarios**: + +1. **Given** containers are deployed, **When** I run a security audit, **Then** all containers pass privilege escalation and secret exposure checks +2. **Given** new containers are added, **When** they are deployed, **Then** they automatically inherit security hardening policies + +--- + +### User Story 5 - Infrastructure as Code Management (Priority: P2) + +As a DevOps engineer, I need to manage all infrastructure configurations (NAP-DMS applications, databases, monitoring, Gitea, n8n, RocketChat, and supporting services) through version-controlled code files rather than manual server changes. + +**Why this priority**: Ensures consistency across environments and enables reproducible infrastructure deployments. + +**Independent Test**: Can be fully tested by deploying a complete environment from code and verifying it matches the production configuration. + +**Acceptance Scenarios**: + +1. 
**Given** infrastructure code changes, **When** I apply the changes, **Then** the environment configuration matches exactly what's defined in the code +2. **Given** a new environment is needed, **When** I deploy from code, **Then** the environment is created with all required services and configurations + +### Edge Cases + +- What happens when network connectivity between QNAP and ASUSTOR fails during backup operations? +- How does the system handle container registry authentication failures during deployment? +- What happens when Docker Compose files contain syntax errors during environment startup? +- How does the system handle SSL certificate expiration for reverse proxy services? +- What happens when monitoring services become unavailable while the system is running? +- How does the system handle storage space exhaustion on production servers? +- What happens when multiple deployment processes are initiated simultaneously? +- How does the system handle database connection pool exhaustion during high load? +- What happens when automated security updates conflict with custom container configurations? +- How does the system handle partial backup failures where some services complete but others fail? +- How does the system handle Email SMTP service failures for alert notifications? +- What happens when Let's Encrypt certificate renewal fails due to network issues?
+ +## Requirements _(mandatory)_ + + + +### Functional Requirements + +- **FR-001**: System MUST support blue-green deployment strategy for zero-downtime updates of all services (NAP-DMS applications, databases, monitoring, Gitea, n8n, RocketChat, and supporting services) +- **FR-002**: System MUST automate daily backups of all services data including databases, application files, configurations, and supporting service data +- **FR-003**: System MUST provide complete disaster recovery capabilities with 4-hour RTO (Recovery Time Objective) +- **FR-004**: System MUST monitor all infrastructure components (all services) and generate alerts for failures or performance degradation +- **FR-005**: System MUST enforce container security hardening including non-root users, privilege dropping, and read-only filesystems for all services +- **FR-006**: System MUST manage all infrastructure configurations through version-controlled Docker Compose files for all services +- **FR-007**: System MUST support automated SSL certificate management and renewal for all web services +- **FR-008**: System MUST provide centralized logging aggregation for all containers and services +- **FR-009**: System MUST implement resource limits and health checks for all containers +- **FR-010**: System MUST support multi-environment deployments (development, staging, production) with consistent configurations +- **FR-011**: System MUST provide automated vulnerability scanning for all container images +- **FR-012**: System MUST support infrastructure secrets management without exposing them in version control +- **FR-013**: System MUST implement backup validation procedures to ensure data integrity +- **FR-014**: System MUST provide rollback capabilities for failed deployments +- **FR-015**: System MUST generate audit trails for all infrastructure changes and deployments + +### Key Entities _(include if feature involves data)_ + +- **Docker Compose Configuration**: Infrastructure as code definitions 
for all services, environments, and deployments +- **Backup Archive**: Complete system snapshots including databases, files, and configurations with metadata (500GB current data, 20% annual growth) +- **Monitoring Metric**: Performance and health data points collected from all infrastructure components +- **Security Policy**: Container hardening rules and compliance requirements for all deployments +- **Deployment Environment**: Isolated runtime spaces (development, staging, production) with consistent configurations (constrained by existing QNAP/ASUSTOR hardware) +- **Alert Rule**: Threshold-based conditions that trigger notifications when system metrics exceed limits +- **Secret Configuration**: Sensitive information (passwords, keys, certificates) managed outside version control +- **Service Instance**: Running container with specific configuration, resource limits, and health status +- **Infrastructure Change**: Version-controlled modification to system configuration or deployment +- **Recovery Point**: Validated backup state that can be restored for disaster recovery + +## Success Criteria _(mandatory)_ + + + +### Measurable Outcomes + +- **SC-001**: Deployments complete with zero user-visible downtime in 99.9% of attempts +- **SC-002**: System recovery from backup completes within 4 hours with 100% data integrity +- **SC-003**: Critical system alerts are generated and delivered within 30 seconds of failure detection +- **SC-004**: All containers pass security hardening compliance checks with 100% success rate +- **SC-005**: Infrastructure changes are applied from version-controlled code with 100% consistency across environments +- **SC-006**: SSL certificates are renewed automatically with 0 expiration incidents per year +- **SC-007**: Backup validation procedures achieve 99.9% success rate with automated integrity verification +- **SC-008**: Failed deployments are automatically rolled back within 60 seconds with 100% success rate +- **SC-009**: System uptime 
exceeds 99.9% monthly availability target +- **SC-010**: Infrastructure audit trail captures 100% of configuration changes and deployments +- **SC-011**: System supports 100 concurrent users with 2-second average response time under normal load diff --git a/specs/04-Infrastructure-OPS/04-00-docker-compose/ASUSTOR/gitea-runner/docker-compose.yml b/specs/04-Infrastructure-OPS/04-00-docker-compose/ASUSTOR/gitea-runner/docker-compose.yml index 7fb273c..933a05b 100644 --- a/specs/04-Infrastructure-OPS/04-00-docker-compose/ASUSTOR/gitea-runner/docker-compose.yml +++ b/specs/04-Infrastructure-OPS/04-00-docker-compose/ASUSTOR/gitea-runner/docker-compose.yml @@ -1,4 +1,5 @@ # File: /volume1/np-dms/gitea-runner/docker-compose.yml +# DMS Container v1.8.6: Application name: lcbp3-gitea-runner # Deploy on: ASUSTOR AS5403T # Connects to Gitea on QNAP via the domain URL # diff --git a/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/app/docker-compose-app.yml b/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/app/docker-compose-app.yml index 9cc9b66..6813f4b 100644 --- a/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/app/docker-compose-app.yml +++ b/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/app/docker-compose-app.yml @@ -61,7 +61,7 @@ services: cpus: '0.5' memory: 512M env_file: - - .env + - /share/np-dms/app/.env environment: TZ: 'Asia/Bangkok' NODE_ENV: 'production' @@ -142,7 +142,7 @@ services: cpus: '0.25' memory: 512M env_file: - - .env + - /share/np-dms/app/.env environment: TZ: 'Asia/Bangkok' NODE_ENV: 'production' diff --git a/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/gitea/docker-compose.yml b/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/gitea/docker-compose.yml index 772f3e4..19b7b5c 100644 --- a/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/gitea/docker-compose.yml +++ b/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/gitea/docker-compose.yml @@ -1,5 +1,5 @@ -# File: /share/np-dms/git/docker-compose.yml -# 
DMS Container v1.8.6 — Application: git, Service: gitea +# File: /share/np-dms/gitea/docker-compose.yml +# DMS Container v1.8.6 — Application name: lcbp3-git, Service: gitea x-restart: &restart_policy restart: unless-stopped @@ -21,8 +21,17 @@ networks: services: gitea: <<: [*restart_policy, *default_logging] - image: gitea/gitea:latest-rootless + image: gitea/gitea:1.26.0-rootless container_name: gitea + # M4: container hardening (Gitea rootless runs as 'git' user) + # user: '1000:1000' + # tmpfs: + # - /tmp:rw,noexec,nosuid,size=256m + # - /var/run/gitea:rw,size=128m + # security_opt: + # - no-new-privileges:true + # cap_drop: + # - ALL deploy: resources: limits: @@ -31,10 +40,8 @@ services: reservations: cpus: '0.25' memory: 512M - security_opt: - - no-new-privileges:true env_file: - - .env + - /share/np-dms/gitea/.env environment: # ---- File ownership in QNAP ---- USER_UID: '1000' @@ -78,13 +85,13 @@ services: - /etc/timezone:/etc/timezone:ro - /etc/localtime:/etc/localtime:ro ports: - - '3003:3000' # HTTP (ไปหลัง NPM) - - '2222:22' # SSH สำหรับ git clone/push + - '3003:3000' # HTTP (to NPM) + - '2222:22' # SSH for git clone/push networks: - lcbp3 - giteanet healthcheck: - test: ['CMD', 'wget', '--spider', '-q', 'http://localhost:3000/api/healthz'] + test: ['CMD', 'curl', '-f', 'http://localhost:3000/api/healthz'] interval: 30s timeout: 10s retries: 3 diff --git a/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/monitoring/docker-compose.yml.bak b/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/monitoring/docker-compose.yml.bak new file mode 100644 index 0000000..c898042 --- /dev/null +++ b/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/monitoring/docker-compose.yml.bak @@ -0,0 +1,56 @@ +# File: /share/np-dms/monitoring/docker-compose.yml (QNAP) +# Exporters only - metrics are scraped by Prometheus on ASUSTOR +# Application name lcbp3-monitoring-exporter +version: '3.8' + +networks: + lcbp3: + external: true + +services: + 
node-exporter: + image: prom/node-exporter:v1.7.0 + container_name: node-exporter + restart: unless-stopped + command: + - '--path.procfs=/host/proc' + - '--path.sysfs=/host/sys' + - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)' + ports: + - "9100:9100" + networks: + - lcbp3 + volumes: + - /proc:/host/proc:ro + - /sys:/host/sys:ro + - /:/rootfs:ro + + cadvisor: + image: gcr.io/cadvisor/cadvisor:v0.47.2 + container_name: cadvisor + restart: unless-stopped + privileged: true + ports: + - "8088:8080" + networks: + - lcbp3 + volumes: + - /:/rootfs:ro + - /var/run:/var/run:ro + - /sys:/sys:ro + - /var/lib/docker/:/var/lib/docker:ro + - /sys/fs/cgroup:/sys/fs/cgroup:ro + + mysqld-exporter: + image: prom/mysqld-exporter:v0.15.0 + container_name: mysqld-exporter + restart: unless-stopped + user: root + command: + - '--config.my-cnf=/etc/mysql/my.cnf' + ports: + - "9104:9104" + networks: + - lcbp3 + volumes: + - "/share/np-dms/monitoring/mysqld-exporter/.my.cnf:/etc/mysql/my.cnf:ro" diff --git a/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/service/docker-compose.yml b/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/service/docker-compose.yml index cc818ca..22a1e0b 100644 --- a/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/service/docker-compose.yml +++ b/specs/04-Infrastructure-OPS/04-00-docker-compose/QNAP/service/docker-compose.yml @@ -31,7 +31,7 @@ services: # ---------------------------------------------------------------- cache: <<: [*restart_policy, *default_logging] - image: redis:7-alpine # ใช้ Alpine image เพื่อให้มีขน + image: redis:7-alpine # Use the Alpine image to keep the image size small container_name: cache deploy: resources: @@ -86,7 +86,7 @@ services: deploy: resources: limits: - cpus: '2.0' # Elasticsearch ใช้ CPU และ Memory ค่อนข้างห + cpus: '2.0' # Elasticsearch is fairly heavy on CPU and memory memory: 4G reservations: cpus: '0.5' diff --git a/specs/04-Infrastructure-OPS/04-00-docker-compose/README.md 
b/specs/04-Infrastructure-OPS/04-00-docker-compose/README.md index a91aec4..222ecbd 100644 --- a/specs/04-Infrastructure-OPS/04-00-docker-compose/README.md +++ b/specs/04-Infrastructure-OPS/04-00-docker-compose/README.md @@ -62,6 +62,48 @@ services: Otherwise, keep the inline anchor pattern (current repo-wide convention). +## Image Pinning Strategy + +The LCBP3 platform uses a **hybrid image pinning approach**: + +### Infrastructure Services (Pinned) +All infrastructure services use **explicitly pinned versions** for stability: + +```yaml +# Examples +redis:7-alpine +elasticsearch:8.11.1 +mariadb:11.8 +gitea/gitea:1.26.0-rootless +n8nio/n8n:1.66.0 +``` + +**Rationale:** +- Infrastructure services evolve independently +- Breaking changes in Redis/Elasticsearch/MariaDB can cause data corruption +- Pinned versions ensure predictable behavior across deployments + +### Application Services (Variable) +Application images use **environment variable tags** for CI/CD flexibility: + +```yaml +backend: + image: lcbp3-backend:${BACKEND_IMAGE_TAG:-latest} +frontend: + image: lcbp3-frontend:${FRONTEND_IMAGE_TAG:-latest} +``` + +**Rationale:** +- Application code changes frequently with each release +- CI pipelines inject SHA-specific tags per release +- `:latest` fallback enables local development +- Environment variables allow rollback to specific versions + +### Version Control +- **Infrastructure versions** updated manually in compose files +- **Application versions** controlled via CI/CD pipeline environment variables +- **Release policy** documented in `04-08-release-management-policy.md` + ## Secret Management Roadmap (S1) Current: `env_file: .env` (gitignored) per stack.
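
As a rough illustration of how the variable image tag and the per-stack `env_file` fit together, a service definition might look like the sketch below. The service layout is assumed for illustration and is not copied from the repository. The two mechanisms are separate: Compose interpolates `${BACKEND_IMAGE_TAG:-latest}` when it parses the file, while `env_file` only injects variables into the running container.

```yaml
# Sketch only; the backend service layout is an assumption.
services:
  backend:
    # Compose substitutes the tag from the shell environment (or the
    # project-level .env file) and falls back to "latest" when the
    # variable is unset or empty.
    image: lcbp3-backend:${BACKEND_IMAGE_TAG:-latest}
    # Variables in this file are passed to the container at runtime;
    # the file is gitignored, so secrets never enter version control.
    env_file:
      - .env
```

Under these assumptions, rolling back amounts to re-running `docker compose up -d backend` with `BACKEND_IMAGE_TAG` set to the tag of the earlier release.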