// File: specs/100-Infrastructures/141-server-consolidation/research.md
// Change Log:
// - 2026-06-20: Initial research for Single-Host Server Consolidation

Research: Single-Host Server Consolidation

Branch: 141-server-consolidation | Date: 2026-06-20

R1: Docker Network Isolation Strategy

Decision: Use two Docker bridge networks: dms-internal (all services) and dms-frontend (Frontend and Backend only, the two services published to the LAN).

Rationale: Docker bridge networks provide L2 isolation. Services on dms-internal that have no published port mapping are unreachable from the LAN. Only Frontend (3000) and Backend (3000) need LAN access. This replaces the reliance on VLAN/firewall ACLs with Docker-native isolation.
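As a rough illustration of the two-network layout, the sketch below creates both bridge networks and attaches the LAN-facing services to dms-frontend. The container names lcbp3-frontend and lcbp3-backend are assumptions, as is the DRY_RUN helper, which only prints the commands by default.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN=1 (the default) prints each command instead of running it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

create_networks() {
  # Network names come from the decision above.
  run docker network create --driver bridge dms-internal
  run docker network create --driver bridge dms-frontend
}

attach_lan_facing() {
  # Container names are hypothetical placeholders.
  for c in lcbp3-frontend lcbp3-backend; do
    run docker network connect dms-frontend "$c"
  done
}

create_networks
attach_lan_facing
```

In a Compose-based deployment the same layout would more likely be declared under the networks key of the compose file; the CLI form is used here only to keep the example short.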

Alternatives Considered:

Single bridge network + iptables rules — more complex, error-prone
Docker Swarm overlay network — overkill for single host
Host network mode — no isolation, security risk

R2: CIFS Mount Strategy for ASUSTOR

Decision: Use a Docker named volume (local driver with type=cifs) to mount the ASUSTOR share //192.168.10.9/np-dms-as/data/uploads as the asustor_uploads volume, mounted at /mnt/uploads in the sidecar and /app/uploads in the backend.

Rationale: Docker mounts the share when a container that uses the volume starts and releases it when the last such container stops, so no host-level mount is needed. Credentials live in .env (gitignored). Backend and sidecar see the same files because both mount the same volume, each at its own path.
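A minimal sketch of how such a volume could be created from the CLI, assuming the credentials are exported as CIFS_USER and CIFS_PASS (hypothetical variable names; the source only says they live in .env). The share path and volume name are taken from the decision above.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN=1 (the default) prints the command instead of running it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

create_uploads_volume() {
  # CIFS_USER / CIFS_PASS are assumed names; the dry-run output contains them,
  # so avoid writing it to shared logs.
  run docker volume create --driver local \
    --opt type=cifs \
    --opt device=//192.168.10.9/np-dms-as/data/uploads \
    --opt "o=addr=192.168.10.9,username=${CIFS_USER:-CHANGE_ME},password=${CIFS_PASS:-CHANGE_ME}" \
    asustor_uploads
}

create_uploads_volume
```

Extra mount options such as file_mode, dir_mode, or vers can be appended to the o= string if the share requires them; which ones are needed depends on the ASUSTOR configuration and is not stated in the source.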

Alternatives Considered:

Host-level mount -t cifs then bind mount — requires host OS config, not portable
SSHFS — slower than CIFS for file operations
Sync files to local SSD — adds complexity, storage duplication

Key Consideration: The previous Desk-5439 setup had issues with Docker Desktop WSL2 + CIFS (see memory). On a Linux host, the CIFS volume mount works natively, without the WSL2 layer.

R3: MariaDB Migration Strategy

Decision: Use mariadb-dump (logical dump) on the QNAP MariaDB 11.8 instance, write the dump to a file, and restore it into the MariaDB 11.8 container on the new host.

Rationale: Same MariaDB version (11.8) on both hosts → logical dump is safest. Database is small enough (<10GB estimated) that dump/restore completes within maintenance window.

Alternatives Considered:

mariabackup (physical backup) — faster but requires same filesystem layout
Replication (binlog) — overkill for one-time migration
Copy raw data files — risky, requires same version + config

Migration Command:

# From QNAP (source) — dump all databases
mariadb-dump --single-transaction --routines --triggers \
  -h 127.0.0.1 -u root -p"$DB_ROOT_PASSWORD" \
  --all-databases > qnap-full-dump.sql

# On new host — restore
docker exec -i lcbp3-mariadb mariadb -u root -p"$DB_ROOT_PASSWORD" < qnap-full-dump.sql

R4: Elasticsearch Migration Strategy

Decision: Use ES snapshot/restore API — create snapshot on QNAP ES, transfer to new host, restore.

Rationale: ES snapshot API is the official migration path. Handles index mappings, settings, and data. Works across same ES version (8.11.x).

Alternatives Considered:

Copy raw data directory — risky, requires identical ES config
Re-index from MariaDB — slow, loses search index tuning
Logstash pipeline — overkill for one-time migration

Migration Steps:

Register shared filesystem repo on QNAP ES
Create snapshot of all indices
Copy snapshot files to the new host, into a directory that is listed under path.repo for the new ES node
Register repo on new host ES
Restore snapshot
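The steps above can be sketched with the snapshot REST endpoints. The host URLs, repository name, snapshot name, and repository path below are all assumptions, and authentication/TLS flags are omitted even though an ES 8.x cluster with security enabled would need them. By default the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN=1 (the default) prints each command instead of running it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical values; none of these names appear in the source.
SRC_ES="${SRC_ES:-http://qnap-es:9200}"
DST_ES="${DST_ES:-http://localhost:9200}"
REPO="${REPO:-migration_repo}"
SNAP="${SNAP:-pre_migration}"
REPO_PATH="${REPO_PATH:-/usr/share/elasticsearch/snapshots}"

repo_body() {
  # The location must be listed under path.repo on the node that registers it.
  printf '{"type":"fs","settings":{"location":"%s"}}' "$REPO_PATH"
}

register_repo() {  # $1 = base URL of the cluster
  run curl -fsS -X PUT "$1/_snapshot/$REPO" \
    -H 'Content-Type: application/json' -d "$(repo_body)"
}

create_snapshot() {
  run curl -fsS -X PUT "$SRC_ES/_snapshot/$REPO/$SNAP?wait_for_completion=true"
}

restore_snapshot() {
  run curl -fsS -X POST "$DST_ES/_snapshot/$REPO/$SNAP/_restore"
}

register_repo "$SRC_ES"   # step 1
create_snapshot           # step 2
# step 3: copy the files under REPO_PATH to the new host (for example with rsync)
register_repo "$DST_ES"   # step 4
restore_snapshot          # step 5
```

A restore fails for any index that already exists and is open on the target, so it is simplest to run it against a freshly started ES container.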

R5: GPU VRAM Management on Single Host

Decision: Rely on ADR-040 D3 (Adaptive OCR Residency via calculate_ocr_residency()) and ADR-040 D4 (CPU Fallback Retrieval). LLM-First GPU Ownership from CONTEXT.md.

Rationale: RTX 5060 Ti 16GB must serve:

np-dms-ai (Typhoon-2.5 ~7-8B): ~6-8GB VRAM
np-dms-ocr (Typhoon OCR): ~5GB VRAM
nomic-embed-text: ~0.5GB VRAM
CUDA overhead: ~1.5GB
Total: ~13-15GB → tight but feasible with adaptive residency
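The total can be checked with integer arithmetic. The sketch below works in tenths of a GB, using only the estimates listed above.

```shell
#!/usr/bin/env bash
# All values are in tenths of a GB so plain integer arithmetic works.
llm_low=60; llm_high=80   # np-dms-ai: ~6-8GB
ocr=50                    # np-dms-ocr: ~5GB
embed=5                   # nomic-embed-text: ~0.5GB
cuda=15                   # CUDA overhead: ~1.5GB
card=160                  # RTX 5060 Ti: 16GB

# Format a value in tenths of a GB as a decimal number.
gb() { printf '%d.%d' $(( $1 / 10 )) $(( $1 % 10 )); }

low=$(( llm_low + ocr + embed + cuda ))
high=$(( llm_high + ocr + embed + cuda ))

echo "all resident: $(gb "$low")-$(gb "$high") GB"
echo "headroom on card: $(gb $(( card - high )))-$(gb $(( card - low ))) GB"
```

With everything resident, the worst case leaves about 1GB free, which is why the policy below unloads the OCR model instead of keeping all models on the GPU.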

Key Policy: When LLM (np-dms-ai) needs to load, OCR model is unloaded first (keep_alive=0 for OCR). BGE-M3 + Reranker use CPU fallback when GPU is occupied.
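If the OCR model is served through Ollama (an assumption; the source only mentions keep_alive=0), an explicit unload can be requested by sending a generate request with keep_alive set to 0. The URL below is Ollama's default local address and is also an assumption about this deployment. By default the command is only printed.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN=1 (the default) prints the command instead of running it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

unload_model() {  # $1 = model name
  # keep_alive=0 asks Ollama to release the model from memory right away.
  run curl -fsS "$OLLAMA_URL/api/generate" \
    -d "{\"model\":\"$1\",\"keep_alive\":0}"
}

unload_model np-dms-ocr
```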

Alternatives Considered:

Force GPU-resident for all models — OOM risk (~15.5GB of 16GB leaves almost no headroom once overhead is counted)
CPU-only for all AI — too slow for production
Second GPU — not available on new host

R6: RAM Budget Allocation

Decision: Per-container memory limits in Docker Compose:

| Service | Memory Limit | Notes |
|---|---|---|
| MariaDB | 8G | Largest consumer, tune innodb_buffer_pool |
| Elasticsearch | 4G | ES_JAVA_OPTS=-Xms2g -Xmx2g |
| Backend (NestJS) | 2G | Node.js + BullMQ workers |
| Frontend (Next.js) | 1G | Standalone mode |
| Redis | 1G | In-memory + AOF |
| Qdrant | 1G | Vector DB |
| OCR Sidecar | 1G | Python + PyMuPDF |
| Ollama | 2G | Model loading + inference |
| ClamAV | 2G | Virus definitions |
| ollama-metrics | 256M | Lightweight proxy |
| Total | ~22.3G | Leaves ~9.7G for OS + swap |

Rationale: 32GB total - 22.3GB containers = ~9.7GB for OS kernel + page cache + swap. Comfortable margin.
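The budget can be re-derived from the per-service limits. The sketch below sums them in MiB (treating 1G as 1024 MiB) and subtracts the result from a 32GB host; the lowercase service keys are just labels for this example.

```shell
#!/usr/bin/env bash
# Per-container limits from the table above, in MiB (1G = 1024 MiB).
limits="mariadb:8192 elasticsearch:4096 backend:2048 frontend:1024 redis:1024
qdrant:1024 ocr-sidecar:1024 ollama:2048 clamav:2048 ollama-metrics:256"
host_mib=$(( 32 * 1024 ))

total=0
for entry in $limits; do
  mib=${entry##*:}            # keep only the number after the colon
  total=$(( total + mib ))
done
remaining=$(( host_mib - total ))

echo "container limits: ${total} MiB"
echo "left for host:    ${remaining} MiB"
```

The sum is 22784 MiB (22.25G) and the remainder is 9984 MiB (9.75G), which matches the rounded ~22.3G and ~9.7G figures.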

Alternatives Considered:

No limits — risk of OOM killer affecting critical services
Tighter limits — may cause ES/MariaDB instability

R7: CI/CD Pipeline Update

Decision: Update Gitea Actions ci-deploy.yml to SSH-deploy to new host IP instead of QNAP IP. ASUSTOR Gitea runner stays unchanged.

Rationale: The Gitea runner on ASUSTOR (192.168.10.9) can reach the new host via VLAN 10, so only the deploy target IP changes. The compose file path in deploy.sh changes to New-Host/docker-compose.new-host.yml.

Alternatives Considered:

Move Gitea runner to new host — unnecessary, runner works remotely
Manual deployment — not sustainable for ongoing releases

R8: Rollback Strategy

Decision: Multi-step rollback plan documented in rollback.sh:

Stop services on new host (docker compose down)
Restore services on QNAP (start existing containers with old data)
Restore services on Desk-5439 (start Ollama + sidecar)
Revert DNS/NPM to point to QNAP
Revert Gitea CI/CD deploy target to QNAP
Re-enable X-API-Key in sidecar + backend
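One way to structure a script around these steps is an ordered runner that stops at the first failure, so a half-finished rollback is visible immediately. The skeleton below is hypothetical and is not the contents of the real rollback.sh; every step body is a placeholder that only echoes what it would do.

```shell
#!/usr/bin/env bash
# Hypothetical skeleton: each step function is a placeholder to be filled in.
stop_new_host()    { echo "TODO: docker compose down on the new host"; }
restore_qnap()     { echo "TODO: start existing QNAP containers with old data"; }
restore_desk()     { echo "TODO: start Ollama + sidecar on Desk-5439"; }
revert_dns_npm()   { echo "TODO: point DNS/NPM back to QNAP"; }
revert_ci_target() { echo "TODO: set Gitea CI/CD deploy target back to QNAP"; }
reenable_api_key() { echo "TODO: re-enable X-API-Key in sidecar + backend"; }

# Run the given step functions in order; stop and report on the first failure.
run_steps() {
  n=0
  for step in "$@"; do
    n=$(( n + 1 ))
    echo "[$n/$#] $step"
    if ! "$step"; then
      echo "rollback halted at step $n: $step" >&2
      return 1
    fi
  done
}

run_steps stop_new_host restore_qnap restore_desk \
          revert_dns_npm revert_ci_target reenable_api_key
```

Halting on the first failure keeps later steps, such as the DNS/NPM revert, from running against a QNAP that did not come back up.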

Rationale: QNAP retains all data (MariaDB, ES, Redis, files) until verified stable. Rollback is fast (<2 hours) because old infrastructure is intact.

Alternatives Considered:

No rollback (accept SPOF) — too risky for production DMS
Hot failover with replication — overkill for current scale
