690524:1919 ADR-028-228-migration #04

2026-05-24 19:19:46 +07:00
parent 93fd95a6b3
commit 1564f8648d
22 changed files with 1422 additions and 255 deletions
@@ -9,7 +9,7 @@

 ### 1. `ai_audit_logs` (existing table — verify against schema)

-No new column is added: the existing `model_name` column records `gemma4:e4b` instead of `gemma4:9b`
+No new column is added: the existing `model_name` column records `gemma4:e2b` instead of `gemma4:9b`

 **Key fields (existing)**:
 ```
@@ -18,7 +18,7 @@ public_id        BINARY(16) → UUIDv7
 document_id      INT FK documents.id
 project_id       INT FK projects.id
 job_type         VARCHAR(50)     -- 'classification', 'tagging', 'rag', 'embed'
-model_name       VARCHAR(100)    -- 'gemma4:e4b', 'nomic-embed-text'
+model_name       VARCHAR(100)    -- 'gemma4:e2b', 'nomic-embed-text'
 confidence_score DECIMAL(5,4)    -- 0.0000 – 1.0000
 ai_suggestion_json   JSON
 human_override_json  JSON NULL
@@ -127,7 +127,7 @@ interface AiRealtimeJobData {
  projectPublicId: string;    // UUIDv7 — required
  userId: number;             // INT internal ID (for audit only)
  payload: {
-    // ai-suggest: { pdfPath: string; pages: 1-3 }
+    // ai-suggest: { pdfPath: string; pages: 1-5 }
    // rag-query:  { question: string; topK: number }
  };
  idempotencyKey: string;
@@ -142,7 +142,7 @@ interface AiBatchJobData {
  projectPublicId: string;    // UUIDv7 — required
  payload: {
    // ocr:              { pdfPath: string }
-    // extract-metadata: { textContent: string; maxPages: 3 }
+    // extract-metadata: { textContent: string; maxPages: 5 }
    // embed-document:   { pdfPath: string; chunkSize: 512; overlap: 64 }
  };
  batchId?: string;           // for Legacy Migration only
@@ -7,7 +7,7 @@

 ## Summary

-Implement ADR-023A AI Architecture Revision: switch the model stack from 3 models (gemma4:9b + Typhoon + nomic-embed-text) to 2 models (gemma4:e4b Q8_0 + nomic-embed-text), split BullMQ into 2 queues (`ai-realtime`/`ai-batch`), add OCR auto-detection, enforce multi-tenant QdrantService, implement the Legacy Migration pipeline and migration_review_queue, and remove the Typhoon Cloud API from the entire codebase
+Implement ADR-023A AI Architecture Revision: switch the model stack from 3 models (gemma4:9b + Typhoon + nomic-embed-text) to 2 models (gemma4:e2b + nomic-embed-text), split BullMQ into 2 queues (`ai-realtime`/`ai-batch`), add OCR auto-detection, enforce multi-tenant QdrantService, implement the Legacy Migration pipeline and migration_review_queue, and remove the Typhoon Cloud API from the entire codebase

 ---

@@ -22,7 +22,7 @@ Implement ADR-023A AI Architecture Revision: switch the model stack
 **Testing**: Jest (NestJS unit/integration)
 **Target Platform**: QNAP NAS (NestJS container) + Admin Desktop Desk-5439 (Ollama)
 **Performance Goals**: ai-suggest < 30s; rag-query < 10s (p95 dequeue-to-response)
-**Constraints**: VRAM ≤ 5GB peak, concurrency=1 per queue (prevent GPU overflow)
+**Constraints**: VRAM ≤ 3GB peak, concurrency=1 per queue (prevent GPU overflow)
 **Scale/Scope**: ~20,000 legacy docs (migration), ~50 new docs/day (production)

 ---
@@ -82,7 +82,7 @@ backend/src/modules/ai/
 │   ├── ai-realtime.processor.ts         # new: ai-realtime consumer
 │   └── ai-batch.processor.ts            # new: ai-batch consumer (replaces existing)
 ├── services/
-│   ├── ollama.service.ts                # update: model → gemma4:e4b
+│   ├── ollama.service.ts                # update: model → gemma4:e2b
 │   ├── qdrant.service.ts                # update: enforce projectPublicId param
 │   ├── ocr.service.ts                   # new: OCR auto-detect + PaddleOCR routing
 │   ├── migration.service.ts             # new: Legacy Migration pipeline
@@ -123,7 +123,7 @@ Tasks: T001–T008

 ### Phase 1: Core AI Pipeline

-**Goal**: OCR auto-detect, gemma4:e4b integration, ai-suggest + embed-document flows
+**Goal**: OCR auto-detect, gemma4:e2b integration, ai-suggest + embed-document flows

 Tasks: T009–T022

@@ -11,9 +11,9 @@

 **Requirements:**
 - **OS**: Windows 10/11 or Linux (Desk-5439)
-- **GPU**: NVIDIA GPU with CUDA 11.8+ support (VRAM ≥ 6GB recommended)
+- **GPU**: NVIDIA GPU with CUDA 11.8+ support (VRAM ≥ 4GB recommended)
 - **Ollama Version**: ≥ 0.5.0
-- **Models**: `gemma4:e2b` (Q4_K_M quantization) + `nomic-embed-text`
+- **Models**: `gemma4:e2b` (Q4 quantization) + `nomic-embed-text`

 **Verification Steps:**

@@ -30,7 +30,7 @@ nvidia-smi
 ollama list
 # Expected output:
 # NAME                    ID              SIZE      MODIFIED
-# gemma4:e2b              <hash>          2.4 GB    <timestamp>
+# gemma4:e2b              <hash>          2.0 GB    <timestamp>
 # nomic-embed-text        <hash>          274 MB    <timestamp>

 # 4. Test model inference (quick test)
@@ -54,7 +54,7 @@ ollama pull nomic-embed-text

 # Verify VRAM usage during inference
 nvidia-smi --query-gpu=memory.used --format=csv,noheader
-# Expected: < 5120 MB (5GB threshold per SC-003)
+# Expected: < 3072 MB (3GB threshold per SC-003)
 ```

 **Troubleshooting:**
@@ -285,7 +285,7 @@ curl http://192.168.10.XX:8765/health
 ### 7. GPU Resource Monitoring (Critical for SC-003)

 **Requirements:**
-- **VRAM Limit**: ≤ 5GB peak (per SC-003)
+- **VRAM Limit**: ≤ 3GB peak (per SC-003)
 - **Concurrency**: 1 job per queue (enforced by BullMQ)

 **Verification Commands:**
@@ -303,12 +303,12 @@ nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu \
 ```

 **Expected Behavior:**
-- **ai-batch job**: VRAM peaks at ~2.5GB (gemma4:e2b Q4_K_M)
-- **ai-realtime job**: VRAM peaks at ~2.5GB (same model)
+- **ai-batch job**: VRAM peaks at ~2.0GB (gemma4:e2b Q4)
+- **ai-realtime job**: VRAM peaks at ~2.0GB (same model)
 - **No concurrent jobs**: ai-batch pauses when ai-realtime active (GPU protection)

 **Troubleshooting:**
-- **VRAM overflow (>5GB)**: Reduce model quantization or increase GPU memory
+- **VRAM overflow (>3GB)**: Reduce model quantization or increase GPU memory
 - **GPU contention**: Verify BullMQ concurrency=1 enforcement
 - **Slow inference**: Check GPU utilization, consider faster model quantization

@@ -399,7 +399,7 @@ grep -r "typhoon" backend/src --include="*.ts"

 # 2. Measure VRAM peak during job run (verify SC-003):
 nvidia-smi --query-gpu=memory.used --format=csv,noheader
-# Expected: value < 5120 MB (5GB threshold per SC-003)
+# Expected: value < 3072 MB (3GB threshold per SC-003)
 # Repeat during both ai-batch and ai-realtime jobs to verify peak
 ```

@@ -8,13 +8,13 @@

 ## Decision 1: Model Stack Reduction

-- **Decision**: Use a 2-model stack, `gemma4:e4b Q8_0` + `nomic-embed-text`, in place of the previous 3-model stack
-- **Rationale**: The VRAM budget is an RTX 2060 Super with 8GB. The 3-model stack (gemma4:9b + Typhoon + nomic-embed-text) uses ~7.8GB and leaves no headroom; the 2-model stack peaks at ~4.5GB, leaving ~3.5GB of headroom
+- **Decision**: Use a 2-model stack, `gemma4:e2b` + `nomic-embed-text`, in place of the previous 3-model stack
+- **Rationale**: The VRAM budget is an RTX 2060 Super with 8GB. The 3-model stack (gemma4:9b + Typhoon + nomic-embed-text) uses ~7.8GB and leaves no headroom; the 2-model stack peaks at ~2.5GB, leaving ~5.5GB of headroom
 - **Alternatives considered**:
  - gemma4:9b + nomic-embed-text (no Typhoon): still over budget at ~6.8GB
-  - gemma4:e4b Q4_K_M (lower-bit quantization): saves more VRAM but gives lower quality than Q8_0
+  - gemma4:e4b Q8_0: uses ~4.5GB of VRAM but has a smaller context window
  - Move to Cloud AI: conflicts with ADR-023 (INTERNAL data; Cloud is prohibited)
-- **VRAM Detail**: gemma4:e4b Q8_0 = ~4.0GB weights + ~0.2GB KV Cache (bounded by the 3-page input limit) + nomic-embed-text ~0.3GB = **~4.5GB peak**
+- **VRAM Detail**: gemma4:e2b Q4 = ~2GB weights + ~0.2GB KV Cache (bounded by the 5-page input limit) + nomic-embed-text ~0.3GB = **~2.5GB peak**

 ---

@@ -24,7 +24,7 @@
 - **Rationale**: With a single queue, RAG Q&A (interactive, p95 < 10s) gets blocked by OCR/Embed batch jobs (no SLA); 2 queues give priority separation without adding a Worker that would cause VRAM overflow
 - **Alternatives considered**:
  - Single queue + priority field: priority in BullMQ does not stop an already-running long job from blocking the jobs queued behind it
-  - 2 Queues + 2 Workers running concurrently: VRAM overflow when both use gemma4:e4b at the same time
+  - 2 Queues + 2 Workers running concurrently: VRAM overflow when both use gemma4:e2b at the same time
 - **Implementation**: BullMQ `active` event on `ai-realtime` → pause `ai-batch`; `completed`/`failed` → resume `ai-batch`

 ---
@@ -72,7 +72,7 @@
 ## Decision 7: Threshold Recalibration Policy

 - **Decision**: Use the default thresholds 0.85/0.60 for the first Migration Phase, then recalibrate after the first 100-500 documents
-- **Rationale**: The original values were set in the gemma4:9b era; the distribution may shift with gemma4:e4b, and recalibrating from real data is better than hardcoding new values without data
+- **Rationale**: The original values were set in the gemma4:9b era; the distribution may shift with gemma4:e2b, and recalibrating from real data is better than hardcoding new values without data
 - **Trigger**: REJECTED rate > 30% or Admin override rate > 40% → lower the threshold

 ---
@@ -81,6 +81,6 @@

 | Assumption | Risk | Mitigation |
 |-----------|------|-----------|
-| gemma4:e4b Q8_0 handles Thai well enough | HIGH: no qualitative evidence | Test 50-100 documents before Go-live; prepare Prompt Engineering to compensate |
-| 3-page limit is sufficient for metadata extraction | MEDIUM: some documents may have the title block on page 4+ | Review 20 sample documents before implementation |
+| gemma4:e2b handles Thai well enough | HIGH: no qualitative evidence | Test 50-100 documents before Go-live; prepare Prompt Engineering to compensate |
+| 5-page limit is sufficient for metadata extraction | MEDIUM: some documents may have the title block on page 6+ | Review 20 sample documents before implementation |
 | RTX 2060 Super has the full 8GB of VRAM available | LOW: the GPU may carry overhead from the OS and driver | Monitor actual usage with `nvidia-smi` during UAT |
@@ -95,7 +95,7 @@ Admin can view AI Performance metrics from ai_audit_logs (c
 ### Functional Requirements

 - **FR-001**: The system MUST detect the PDF type (Digital vs Scanned) automatically using `extracted_chars > OCR_CHAR_THRESHOLD`, without asking the User to choose
-- **FR-002**: The system MUST send at most the first 3 pages of a PDF to gemma4:e4b for Classification and Tagging jobs
+- **FR-002**: The system MUST send at most the first 5 pages of a PDF to gemma4:e2b for Classification and Tagging jobs
 - **FR-003**: The system MUST embed vectors from the whole document (full-document chunking) for RAG; the 3-page limit does not apply
 - **FR-004**: All AI Inference MUST go through the BullMQ Worker on NestJS; n8n must not call Ollama directly
 - **FR-005**: `QdrantService.search()` MUST always take `projectPublicId: string` as a required parameter
@@ -129,7 +129,7 @@ Admin can view AI Performance metrics from ai_audit_logs (c

 - **SC-001**: The AI Suggestion appears on the form within 30 seconds for a Digital PDF and 90 seconds for a Scanned PDF (p95)
 - **SC-002**: RAG Q&A responds within 10 seconds (p95, measured from dequeue from `ai-realtime`)
-- **SC-003**: VRAM peak does not exceed 5GB when running 2 models concurrently (gemma4:e4b + nomic-embed-text), measured with `nvidia-smi --query-gpu=memory.used --format=csv,noheader` during a job run (see verification in quickstart.md Scenario 6, QuizMe 2026-05-15)
+- **SC-003**: VRAM peak does not exceed 3GB when running 2 models concurrently (gemma4:e2b + nomic-embed-text), measured with `nvidia-smi --query-gpu=memory.used --format=csv,noheader` during a job run (see verification in quickstart.md Scenario 6, QuizMe 2026-05-15)
 - **SC-004**: No cross-project data leak in RAG: every Qdrant query carries a `project_public_id` filter (verifiable from the query log)
 - **SC-005**: The Legacy Migration Batch of 20,000 documents completes without duplicate records (verified with Idempotency-Key)
 - **SC-006**: admin_override_rate < 40% after the Calibration Phase (first 100-500 documents)
@@ -140,7 +140,7 @@ Admin can view AI Performance metrics from ai_audit_logs (c

 ## Assumptions

-- Desk-5439 is available and has Ollama with `gemma4:e4b Q8_0` and `nomic-embed-text` already installed
+- Desk-5439 is available and has Ollama with `gemma4:e2b` and `nomic-embed-text` already installed
 - The Qdrant instance is available and accessible from the NestJS backend
 - The n8n instance can call the DMS API over HTTP
 - PaddleOCR is installed on Desk-5439 with Thai language support
@@ -150,7 +150,7 @@ Admin can view AI Performance metrics from ai_audit_logs (c
 ## Clarifications

 ### Session 2026-05-15
-- Q: RAG embedding scope: embed the whole document or only 3 pages? → A: The whole document (chunked 512t/64t overlap); the 3-page limit applies only to Classification/Tagging
+- Q: RAG embedding scope: embed the whole document or only 5 pages? → A: The whole document (chunked 512t/64t overlap); the 5-page limit applies only to Classification/Tagging
 - Q: embed-document trigger timing → A: AUTO immediately after commit (in parallel with AI Suggestion), without waiting for Human confirm
 - Q: n8n role → A: n8n calls the DMS API only (`POST /api/ai/jobs`); it does not call Ollama/Qdrant directly
 - Q: QdrantService enforcement → A: `projectPublicId: string` is a required param; no optional fallback
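
The "chunked 512t/64t overlap" scheme named in the clarifications can be illustrated with a minimal TypeScript sketch. It assumes the document has already been tokenized into an array; the function name `chunkTokens` and its signature are hypothetical and are not taken from the codebase in this commit. Only the 512/64 defaults come from the spec.

```typescript
// Hypothetical helper (not from the repository): splits a token sequence
// into overlapping windows. The defaults mirror the 512-token chunk size
// and 64-token overlap stated in the spec.
function chunkTokens<T>(tokens: T[], chunkSize = 512, overlap = 64): T[][] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  // Each window starts (chunkSize - overlap) tokens after the previous one,
  // so consecutive windows share exactly `overlap` tokens.
  const step = chunkSize - overlap;
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the input, so no trailing
    // chunk is emitted that consists only of already-covered overlap.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}
```

With the defaults, a 1,000-token document yields three chunks starting at tokens 0, 448, and 896. Because this chunking covers the whole document, it is independent of the page limit applied to Classification/Tagging.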