// File: specs/100-Infrastructures/141-server-consolidation/research.md
// Change Log:
// - 2026-06-20: Initial research for Single-Host Server Consolidation

Research: Single-Host Server Consolidation

Branch: 141-server-consolidation | Date: 2026-06-20

R1: Docker Network Isolation Strategy

Decision: Use two Docker bridge networks — dms-internal (all services) and dms-frontend (Frontend + Backend only, for LAN publish).

Rationale: Docker bridge networks provide L2 isolation. Services attached only to dms-internal, with no published port mapping, are unreachable from the LAN. Only Frontend (3000) and Backend (3000) need LAN access; if both really listen on container port 3000, they need distinct host ports when published. This replaces reliance on VLAN/firewall ACLs with Docker-native isolation.
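One way to check the "only Frontend and Backend are published" rule after deployment is to audit the port column of docker ps. The sketch below is illustrative only: the function name and the container names lcbp3-frontend and lcbp3-backend are assumptions (only lcbp3-mariadb appears in this document), and it relies on published ports showing up with "->" in the Ports field.

```shell
#!/usr/bin/env bash
# Lists containers that publish a host port but are not on the allow-list.
# Input (stdin): lines of "<name><TAB><ports>", as printed by
#   docker ps --format '{{.Names}}\t{{.Ports}}'
# Published ports contain "->"; exposed-only ports (e.g. "3306/tcp") do not.
unexpected_published() {
  local allowed="$1"   # space-separated container names allowed to publish
  awk -F'\t' -v allowed="$allowed" '
    BEGIN { n = split(allowed, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    $2 ~ /->/ && !($1 in ok) { print $1 }
  '
}
```

Usage on the new host would look like: docker ps --format '{{.Names}}\t{{.Ports}}' | unexpected_published "lcbp3-frontend lcbp3-backend". An empty result means no other service is reachable through a published port.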

Alternatives Considered:

  • Single bridge network + iptables rules — more complex, error-prone
  • Docker Swarm overlay network — overkill for single host
  • Host network mode — no isolation, security risk

R2: CIFS Mount Strategy for ASUSTOR

Decision: Use a Docker named volume (local driver with CIFS mount options) to mount the ASUSTOR share //192.168.10.9/np-dms-as/data/uploads as the asustor_uploads volume, mounted at /mnt/uploads in the sidecar and /app/uploads in the backend.

Rationale: Docker handles the CIFS mount lifecycle as containers start and stop, so no host-level mount is needed. Credentials live in .env (gitignored). Backend and sidecar see the same files through the same volume, each at its own mount path.
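As a rough illustration, the same volume can be created from the command line with Docker's built-in local driver. This is a sketch, not the project's actual compose definition: the function names and the CIFS_USER and CIFS_PASS variable names are assumptions, and further mount options (SMB version, uid/gid, file modes) may be needed for the real share.

```shell
#!/usr/bin/env bash
# Builds the "o=" mount option string for a CIFS-backed named volume.
# $1 = NAS address, $2 = username, $3 = password
cifs_mount_opts() {
  printf 'addr=%s,username=%s,password=%s' "$1" "$2" "$3"
}

# Creates the asustor_uploads volume through the local driver.
# CIFS_USER and CIFS_PASS are hypothetical names for the .env credentials.
create_uploads_volume() {
  docker volume create --driver local \
    --opt type=cifs \
    --opt device=//192.168.10.9/np-dms-as/data/uploads \
    --opt o="$(cifs_mount_opts 192.168.10.9 "$CIFS_USER" "$CIFS_PASS")" \
    asustor_uploads
}
```

Because the options are comma-separated, a password containing a comma would break this form; a credentials file on the host is a common workaround.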

Alternatives Considered:

  • Host-level mount -t cifs then bind mount — requires host OS config, not portable
  • SSHFS — slower than CIFS for file operations
  • Sync files to local SSD — adds complexity, storage duplication

Key Consideration: The previous Desk-5439 setup had issues with Docker Desktop on WSL2 + CIFS (see memory). On a Linux host the CIFS volume mounts natively, without the WSL2 layer.

R3: MariaDB Migration Strategy

Decision: Take a logical dump with mariadb-dump from the QNAP MariaDB 11.8 and restore it into the MariaDB 11.8 container on the new host.

Rationale: Same MariaDB version (11.8) on both hosts → logical dump is safest. Database is small enough (<10GB estimated) that dump/restore completes within maintenance window.

Alternatives Considered:

  • mariabackup (physical backup) — faster but requires same filesystem layout
  • Replication (binlog) — overkill for one-time migration
  • Copy raw data files — risky, requires same version + config

Migration Command:

# From QNAP (source) — dump all databases
mariadb-dump --single-transaction --routines --triggers \
  -h 127.0.0.1 -u root -p"$DB_ROOT_PASSWORD" \
  --all-databases > qnap-full-dump.sql

# On new host — restore
docker exec -i lcbp3-mariadb mariadb -u root -p"$DB_ROOT_PASSWORD" < qnap-full-dump.sql
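
Since the dump file travels between hosts before the restore, one possible safeguard is to compare its checksum on both sides. This is an optional sketch, not part of the documented procedure; the function names are illustrative and it assumes sha256sum (GNU coreutils) is available on both hosts.

```shell
#!/usr/bin/env bash
# Prints only the SHA-256 hex digest of a file.
dump_digest() {
  sha256sum "$1" | awk '{ print $1 }'
}

# Succeeds (exit 0) when both files have the same digest.
same_dump() {
  [ "$(dump_digest "$1")" = "$(dump_digest "$2")" ]
}
```

In practice the digest would be computed once on QNAP with dump_digest qnap-full-dump.sql and compared with the value computed on the new host after the copy.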

R4: Elasticsearch Migration Strategy

Decision: Use ES snapshot/restore API — create snapshot on QNAP ES, transfer to new host, restore.

Rationale: The ES snapshot API is the official migration path. It carries index mappings, settings, and data. Both hosts run the same ES version (8.11.x), so snapshot compatibility is not a concern.

Alternatives Considered:

  • Copy raw data directory — risky, requires identical ES config
  • Re-index from MariaDB — slow, loses search index tuning
  • Logstash pipeline — overkill for one-time migration

Migration Steps:

  1. Register shared filesystem repo on QNAP ES
  2. Create snapshot of all indices
  3. Copy snapshot files to new host ES data volume
  4. Register repo on new host ES
  5. Restore snapshot
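
The steps above map onto the snapshot REST endpoints roughly as follows. This is a sketch under several assumptions: ES_URL, the function names, and the repository and snapshot names passed in are placeholders, the cluster is reachable over plain HTTP without authentication (ES 8 enables security by default, so real calls will likely need credentials and TLS options), and the repository location is already listed under path.repo in elasticsearch.yml on each host.

```shell
#!/usr/bin/env bash
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: default HTTP port, no auth

# JSON body that registers a shared-filesystem ("fs") repository.
fs_repo_body() {   # $1 = repository location on disk
  printf '{"type":"fs","settings":{"location":"%s"}}' "$1"
}

# Steps 1 and 4: register the repository on a cluster.
register_repo() {   # $1 = repo name, $2 = location
  curl -fsS -X PUT "$ES_URL/_snapshot/$1" \
    -H 'Content-Type: application/json' -d "$(fs_repo_body "$2")"
}

# Step 2: snapshot all indices and wait until it finishes.
create_snapshot() {   # $1 = repo name, $2 = snapshot name
  curl -fsS -X PUT "$ES_URL/_snapshot/$1/$2?wait_for_completion=true"
}

# Step 5: restore the snapshot on the new host.
restore_snapshot() {   # $1 = repo name, $2 = snapshot name
  curl -fsS -X POST "$ES_URL/_snapshot/$1/$2/_restore"
}
```

Step 3 (copying the repository directory between hosts) happens outside the API. A restore is rejected for any index that already exists and is open on the target, so it is simplest to run against a fresh cluster.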

R5: GPU VRAM Management on Single Host

Decision: Rely on ADR-040 D3 (Adaptive OCR Residency via calculate_ocr_residency()) and ADR-040 D4 (CPU Fallback Retrieval), together with the LLM-First GPU Ownership policy from CONTEXT.md.

Rationale: RTX 5060 Ti 16GB must serve:

  • np-dms-ai (Typhoon-2.5 ~7-8B): ~6-8GB VRAM
  • np-dms-ocr (Typhoon OCR): ~5GB VRAM
  • nomic-embed-text: ~0.5GB VRAM
  • CUDA overhead: ~1.5GB
  • Total: ~13-15GB → tight but feasible with adaptive residency
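
The total can be reproduced by summing the low and high estimates from the list. The snippet is only a worked check of that arithmetic; the row labels are informal and the numbers are the estimates stated above.

```shell
#!/usr/bin/env bash
# VRAM estimates in GB as "name low high", taken from the list above.
vram_budget() {
  awk '
    { low += $2; high += $3 }
    END { printf "%.1f %.1f\n", low, high }
  ' <<'EOF'
np-dms-ai 6 8
np-dms-ocr 5 5
nomic-embed-text 0.5 0.5
cuda-overhead 1.5 1.5
EOF
}

vram_budget   # prints "13.0 15.0" (low and high totals in GB)
```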

Key Policy: When LLM (np-dms-ai) needs to load, OCR model is unloaded first (keep_alive=0 for OCR). BGE-M3 + Reranker use CPU fallback when GPU is occupied.
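
If the models are served through Ollama, the "unload OCR first" step can be expressed as a generate request with keep_alive set to 0 and no prompt, which asks Ollama to unload that model. The sketch assumes Ollama's default port 11434 and that np-dms-ocr is the model name registered in Ollama; OLLAMA_URL and the function names are illustrative, and the actual policy lives in calculate_ocr_residency().

```shell
#!/usr/bin/env bash
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"   # assumption: default Ollama port

# JSON body that asks Ollama to unload a model immediately.
unload_body() {   # $1 = model name
  printf '{"model":"%s","keep_alive":0}' "$1"
}

# Sends the unload request for the given model.
unload_model() {   # $1 = model name
  curl -fsS "$OLLAMA_URL/api/generate" -d "$(unload_body "$1")"
}
```

Calling unload_model np-dms-ocr before loading np-dms-ai would free the OCR model's VRAM under these assumptions.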

Alternatives Considered:

  • Force GPU-resident for all models — OOM risk (~15.5GB with overhead leaves almost no headroom on a 16GB card)
  • CPU-only for all AI — too slow for production
  • Second GPU — not available on new host

R6: RAM Budget Allocation

Decision: Per-container memory limits in Docker Compose:

| Service | Memory Limit | Notes |
|---|---|---|
| MariaDB | 8G | Largest consumer, tune innodb_buffer_pool |
| Elasticsearch | 4G | ES_JAVA_OPTS=-Xms2g -Xmx2g |
| Backend (NestJS) | 2G | Node.js + BullMQ workers |
| Frontend (Next.js) | 1G | Standalone mode |
| Redis | 1G | In-memory + AOF |
| Qdrant | 1G | Vector DB |
| OCR Sidecar | 1G | Python + PyMuPDF |
| Ollama | 2G | Model loading + inference |
| ClamAV | 2G | Virus definitions |
| ollama-metrics | 256M | Lightweight proxy |
| Total | ~22.3G | Leaves ~9.7G for OS + swap |

Rationale: 32GB total - 22.3GB containers = ~9.7GB for OS kernel + page cache + swap. Comfortable margin.
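
The budget can be checked by summing the per-container limits. This is a worked check of the arithmetic only; the service labels are informal, and 256M is entered as 0.25G.

```shell
#!/usr/bin/env bash
# Sums the per-container memory limits (GB) and prints the remaining host RAM.
ram_budget() {   # $1 = host RAM in GB
  awk -v host="$1" '
    { total += $2 }
    END { printf "limits=%.2f headroom=%.2f\n", total, host - total }
  ' <<'EOF'
mariadb 8
elasticsearch 4
backend 2
frontend 1
redis 1
qdrant 1
ocr-sidecar 1
ollama 2
clamav 2
ollama-metrics 0.25
EOF
}

ram_budget 32   # prints "limits=22.25 headroom=9.75"
```

The exact figures (22.25G and 9.75G) round to the ~22.3G and ~9.7G quoted in this section.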

Alternatives Considered:

  • No limits — risk of OOM killer affecting critical services
  • Tighter limits — may cause ES/MariaDB instability

R7: CI/CD Pipeline Update

Decision: Update Gitea Actions ci-deploy.yml to SSH-deploy to new host IP instead of QNAP IP. ASUSTOR Gitea runner stays unchanged.

Rationale: The Gitea runner on ASUSTOR (192.168.10.9) can reach the new host via VLAN 10, so only the deploy target IP changes. In deploy.sh, the compose file path changes to New-Host/docker-compose.new-host.yml.

Alternatives Considered:

  • Move Gitea runner to new host — unnecessary, runner works remotely
  • Manual deployment — not sustainable for ongoing releases

R8: Rollback Strategy

Decision: Multi-step rollback plan documented in rollback.sh:

  1. Stop services on new host (docker compose down)
  2. Restore services on QNAP (start existing containers with old data)
  3. Restore services on Desk-5439 (start Ollama + sidecar)
  4. Revert DNS/NPM to point to QNAP
  5. Revert Gitea CI/CD deploy target to QNAP
  6. Re-enable X-API-Key in sidecar + backend

Rationale: QNAP retains all data (MariaDB, ES, Redis, files) until verified stable. Rollback is fast (<2 hours) because old infrastructure is intact.
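
A possible skeleton for rollback.sh is shown below. It is a sketch, not the actual script: only step 1 is written as a real command (using the compose path from R7), the DRY_RUN switch and function names are assumptions, and steps 2 to 6 are printed as reminders because their commands depend on site-specific tooling not described here.

```shell
#!/usr/bin/env bash
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them

# Runs a command, or only prints it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

rollback() {
  # Step 1: stop services on the new host.
  run docker compose -f New-Host/docker-compose.new-host.yml down
  # Steps 2-6: site-specific, listed here as manual reminders.
  printf 'TODO: %s\n' \
    "restore services on QNAP" \
    "restore services on Desk-5439 (Ollama + sidecar)" \
    "revert DNS/NPM to QNAP" \
    "revert Gitea CI/CD deploy target to QNAP" \
    "re-enable X-API-Key in sidecar + backend"
}
```

Running rollback with the default DRY_RUN=1 prints the plan without touching any service, which makes it safe to rehearse before the maintenance window.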

Alternatives Considered:

  • No rollback (accept SPOF) — too risky for production DMS
  • Hot failover with replication — overkill for current scale