Files
lcbp3/specs/100-Infrastructures/141-server-consolidation/tasks.md
T
admin a80ebef285
CI / CD Pipeline / build (push) Successful in 7m37s
CI / CD Pipeline / deploy (push) Failing after 20m15s
refactor(ai): OCR sidecar canonical naming cleanup — typhoon→np-dms, remove hardcoded keys, asyncio.to_thread, ADR-040/041
2026-06-20 16:37:04 +07:00

15 KiB

// File: specs/100-Infrastructures/141-server-consolidation/tasks.md // Change Log: // - 2026-06-20: Initial task list for Single-Host Server Consolidation // - 2026-06-20: Fix C1-C5 from analysis: backend env var update, port conflict, GPU residency, ollama-metrics port, n8n endpoints

Tasks: Single-Host Server Consolidation

Input: Design documents from /specs/100-Infrastructures/141-server-consolidation/ Prerequisites: plan.md, spec.md, research.md, data-model.md, contracts/ Related ADRs: ADR-041, ADR-040, ADR-016, ADR-023A, ADR-034

Organization: Tasks are grouped by user story to enable independent implementation and testing of each story.

Format: [ID] [P?] [Story] Description

  • [P]: Can run in parallel (different files, no dependencies)
  • [Story]: Which user story this task belongs to (e.g., US1, US2, US3)
  • Include exact file paths in descriptions

Phase 1: Setup (Shared Infrastructure)

Purpose: Create directory structure and initial files for the new host deployment

  • T001 Create specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/ directory structure with subdirectories: ocr-sidecar/, scripts/
  • T002 [P] Create .env.template at specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/.env.template with all required env vars from contracts
  • T003 [P] Create README.md at specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/README.md with deployment overview

Phase 2: Foundational (Blocking Prerequisites)

Purpose: Provision the new host OS and create the unified Docker Compose stack — MUST be complete before any user story can proceed

⚠️ CRITICAL: No user story work can begin until this phase is complete

  • T004 Create specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/scripts/provision-host.sh — installs Docker Engine, Docker Compose v2, NVIDIA drivers, nvidia-container-toolkit, CIFS utils, creates directory structure
  • T005 Create specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/docker-compose.new-host.yml — unified compose with all 10 services, 2 networks (dms-internal, dms-frontend), CIFS volume, named volumes, memory limits per data-model.md. Backend publishes 3001:3000 to LAN (NPM routes backend.np-dms.work → :3001); Frontend publishes 3000:3000; ollama-metrics publishes 9924:9924 to LAN for Prometheus scraping from ASUSTOR
  • T006 [P] Copy OCR sidecar code from specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/ to specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/ocr-sidecar/ — adapt OLLAMA_API_URL to http://ollama:11434 (Docker DNS), remove ports mapping, use expose only
  • T007 [P] Update specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/ocr-sidecar/Dockerfile — verify GPU access via nvidia-container-toolkit, ensure poppler-utils installed
  • T008 [P] Update specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/ocr-sidecar/requirements.txt — verify typhoon-ocr, PyMuPDF, httpx, fastapi versions match Desk-5439
  • T008b Update backend environment variables for renamed service names: REDIS_HOST=redis (was cache), ELASTICSEARCH_HOST=elasticsearch (was search) in specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/.env.template and specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/docker-compose.new-host.yml backend environment section — these service names changed from QNAP compose where Redis was cache and ES was search

Checkpoint: New host directory structure and unified compose file ready — user story implementation can now begin


Phase 3: User Story 1 - Provision and Deploy on New Host (Priority: P1) 🎯 MVP

Goal: Administrator provisions the new host, mounts ASUSTOR CIFS, and deploys all services with Docker internal network isolation

Independent Test: Run docker compose up -d on the new host and verify all containers are healthy via docker ps and health check endpoints

Implementation for User Story 1

  • T009 [US1] Run provision-host.sh on new host — verify Docker, NVIDIA, CIFS mount at /mnt/uploads
  • T010 [US1] Pull Ollama models on new host: ollama pull np-dms-ai:latest, ollama pull np-dms-ocr:latest, ollama pull nomic-embed-text:latest — verify with ollama list
  • T011 [US1] Copy .env.template to .env, fill in all secrets from QNAP .env (DB passwords, JWT secrets, Redis password, ASUSTOR CIFS credentials)
  • T012 [US1] Run docker compose --env-file .env -f docker-compose.new-host.yml up -d and verify all 10 containers start
  • T013 [US1] Verify network isolation: nmap -p 11434 <new-host-ip> from another VLAN 10 machine should show closed/refused; nmap -p 8765 should show closed/refused; nmap -p 3000 (frontend) and nmap -p 3001 (backend) should show open; nmap -p 9924 (ollama-metrics) should show open for Prometheus
  • T014 [US1] Verify health checks: curl http://localhost:3001/health (backend on published port 3001), curl http://localhost:3000/ (frontend), curl http://ocr-sidecar:8765/health (from inside backend container via Docker DNS)

Checkpoint: All services running on new host with correct network isolation — MVP achieved


Phase 4: User Story 2 - Migrate Data from QNAP to New Host (Priority: P2)

Goal: Migrate MariaDB and Elasticsearch data from QNAP to the new host with zero data loss

Independent Test: Compare row counts and index document counts between QNAP (source) and new host (destination) after migration

Implementation for User Story 2

  • T015 [P] [US2] Create specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/scripts/migrate-mariadb.sh — dump from QNAP MariaDB 11.8 via mariadb-dump --single-transaction --routines --triggers, pipe to new host container
  • T016 [P] [US2] Create specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/scripts/migrate-elasticsearch.sh — create snapshot on QNAP ES, transfer files, register repo on new host, restore
  • T017 [US2] Run migrate-mariadb.sh — verify all table row counts match between QNAP and new host
  • T018 [US2] Run migrate-elasticsearch.sh — verify all index document counts match between QNAP and new host
  • T019 [US2] Create and run specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/scripts/verify-data-parity.sh — automated row count + document count comparison script
  • T020 [US2] Verify CIFS file access: list files in /app/uploads/temp and /app/uploads/permanent from backend container, compare with ASUSTOR share

Checkpoint: All data migrated and verified — new host has complete production data


Phase 5: User Story 3 - Cutover and Smoke Test (Priority: P3)

Goal: Perform production cutover from old 2-host architecture to new single host, verify all DMS functions work end-to-end

Independent Test: Access application via new host IP, perform core DMS operations (login, document upload, search, AI inference)

Implementation for User Story 3

  • T021 [P] [US3] Create specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/scripts/smoke-test.sh — automated tests for: backend health, frontend accessible, login flow, document list, OCR endpoint, AI inference, full-text search
  • T022 [US3] Update Gitea secrets: HOST → new host IP, COMPOSE_FILEspecs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/docker-compose.new-host.yml
  • T023 [US3] Update scripts/deploy.sh — change COMPOSE_FILE path to New-Host directory
  • T024 [US3] Update NPM (Nginx Proxy Manager) on QNAP: lcbp3.np-dms.work → new host IP:3000 (frontend), backend.np-dms.work → new host IP:3001 (backend)
  • T024b [US3] Update n8n workflow endpoints on QNAP: change all backend API URLs from http://192.168.10.8:3000/api (QNAP) to http://<new-host-ip>:3001/api (new host) — n8n stays on QNAP but must reach backend on new host via LAN port 3001
  • T025 [US3] Run smoke-test.sh on new host — verify all 7 smoke tests pass
  • T026 [US3] Verify from external machine on VLAN 10: access https://lcbp3.np-dms.work, login, create a test Correspondence, upload a PDF, trigger OCR, perform search

Checkpoint: New host is production-active — all DMS functions verified end-to-end


Phase 6: User Story 4 - Remove X-API-Key and Verify Network-Only Auth (Priority: P4)

Goal: Remove X-API-Key authentication from sidecar and backend, relying solely on Docker-internal network isolation per ADR-040 D5

Independent Test: Attempt to access sidecar from outside Docker network (should fail); verify backend calls sidecar without API key (should succeed)

Implementation for User Story 4

  • T027 [P] [US4] Remove OCR_SIDECAR_API_KEY from specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/docker-compose.new-host.yml ocr-sidecar environment
  • T028 [P] [US4] Remove API key validation from specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/ocr-sidecar/app.py — remove X-API-Key header check middleware
  • T029 [US4] Remove X-API-Key header from backend/src/modules/ai/services/ocr.service.ts — remove API key from HTTP client headers
  • T030 [US4] Remove OCR_SIDECAR_API_KEY from backend/.env.example and any backend config that sets it
  • T031 [US4] Rebuild and redeploy sidecar + backend containers — verify backend can call sidecar without API key
  • T032 [US4] Verify external access blocked: curl http://<new-host-ip>:8765/health from VLAN 10 machine should fail (connection refused)

Checkpoint: Network-only auth verified — no API key needed, Docker isolation sufficient


Phase 7: User Story 5 - Decommission Old Hosts (Priority: P5)

Goal: Stop services on QNAP (becomes backup) and retire Desk-5439, completing the consolidation

Independent Test: Verify QNAP services stopped (except backup), Desk-5439 powered off, new host unaffected

Implementation for User Story 5

  • T033 [P] [US5] Create specs/04-Infrastructure-OPS/04-00-docker-compose/New-Host/scripts/rollback.sh — emergency rollback: stop new host, restore QNAP + Desk-5439 services, revert DNS, revert CI/CD
  • T034 [US5] Monitor new host for 24-48 hours: RAM usage (docker stats), VRAM usage (nvidia-smi), container health, application logs
  • T034b [US5] Verify Adaptive OCR Residency (ADR-040 D3) on new RTX 5060 Ti: load np-dms-ai and np-dms-ocr concurrently, confirm calculate_ocr_residency() unloads OCR model when LLM needs VRAM; verify CPU Fallback Retrieval (ADR-040 D4) activates for BGE-M3/Reranker when GPU is occupied by LLM
  • T035 [US5] Stop QNAP app services: ssh admin@192.168.10.8 'cd /share/np-dms/app && docker compose down'
  • T036 [US5] Stop QNAP service stack: ssh admin@192.168.10.8 'cd /share/np-dms/services && docker compose down'
  • T037 [US5] Retire Desk-5439: ssh user@192.168.10.100 'sudo shutdown -h now' (or repurpose)
  • T038 [US5] Verify new host still fully operational after old hosts decommissioned — re-run smoke-test.sh
  • T039 [US5] Take QNAP backup snapshot: mariadb-dump on QNAP MariaDB (if still running) or verify existing backup is current

Checkpoint: Consolidation complete — single host is sole production, old hosts decommissioned


Phase 8: Polish & Cross-Cutting Concerns

Purpose: Documentation, monitoring, and final verification

  • T040 [P] Update specs/04-Infrastructure-OPS/04-00-docker-compose/README.md — add New-Host section, mark QNAP as backup, mark Desk-5439 as retired
  • T041 [P] Update CONTEXT.md — update infrastructure topology to reflect single-host architecture
  • T042 [P] Update AGENTS.md — update infrastructure references (Desk-5439 → New Host, QNAP → backup)
  • T043 Update specs/04-Infrastructure-OPS/04-00-docker-compose/.env.template — add ASUSTOR_USER, ASUSTOR_PASS, NEW_HOST_IP variables
  • T044 [P] Update Prometheus/Grafana scrape config on ASUSTOR — update ollama-metrics target from 192.168.10.100:9924 to new host internal or host-published port
  • T045 Run quickstart.md validation — follow all steps end-to-end on a fresh provision
  • T046 [P] Document disaster recovery procedure — backup schedule, restore from QNAP backup, estimated RTO/RPO

Dependencies & Execution Order

Phase Dependencies

  • Setup (Phase 1): No dependencies — can start immediately
  • Foundational (Phase 2): Depends on Setup — BLOCKS all user stories
  • US1 (Phase 3): Depends on Foundational — requires physical access to new host
  • US2 (Phase 4): Depends on US1 (services must be running to receive migrated data)
  • US3 (Phase 5): Depends on US1 + US2 (services running + data migrated for cutover)
  • US4 (Phase 6): Depends on US3 (cutover complete, network isolation verified)
  • US5 (Phase 7): Depends on US3 + US4 (stable production before decommissioning)
  • Polish (Phase 8): Can start after US3; some tasks depend on US5

User Story Dependencies

  • US1 (P1): Foundational → US1 — no dependencies on other stories
  • US2 (P2): US1 → US2 — needs running services to receive data
  • US3 (P3): US1 + US2 → US3 — needs running services + migrated data
  • US4 (P4): US3 → US4 — needs cutover complete to verify network isolation in production
  • US5 (P5): US3 + US4 → US5 — needs stable production before decommissioning

Parallel Opportunities

  • T002, T003 can run in parallel (different files)
  • T006, T007, T008 can run in parallel (sidecar files, no dependencies)
  • T015, T016 can run in parallel (different migration scripts)
  • T027, T028 can run in parallel (different files: compose vs app.py)
  • T040, T041, T042, T044 can run in parallel (different doc files)
  • T027, T028, T030 can run in parallel (different files: compose, app.py, .env.example)

Implementation Strategy

MVP First (User Story 1 Only)

  1. Complete Phase 1: Setup (create directory structure)
  2. Complete Phase 2: Foundational (provision host + create compose)
  3. Complete Phase 3: User Story 1 (deploy services)
  4. STOP and VALIDATE: All containers healthy, network isolation verified
  5. Demo to stakeholders if ready

Incremental Delivery

  1. Setup + Foundational → Infrastructure ready
  2. Add US1 → Services deployed → Validate (MVP!)
  3. Add US2 → Data migrated → Validate parity
  4. Add US3 → Cutover complete → Validate end-to-end
  5. Add US4 → Security hardened → Validate network-only auth
  6. Add US5 → Old hosts retired → Validate stability
  7. Polish → Documentation updated → Final validation

Notes

  • This is an infrastructure task — most work is shell scripts, Docker Compose YAML, and manual operations
  • Physical access to the new host is required for US1
  • Data migration (US2) requires SSH access to QNAP
  • Cutover (US3) requires DNS/NPM access and coordination with users
  • Decommission (US5) should only proceed after 24-48 hours of stable monitoring
  • Rollback plan must be tested before cutover
  • All env secrets must come from .env (gitignored) — never commit real secrets