690611:1705 ADR-035-235 #00 [skip CI]

2026-06-11 17:05:17 +07:00
parent cd7d20ccd4
commit 71c5e88181
14 changed files with 1422 additions and 682 deletions
@@ -200,6 +200,24 @@ _Avoid_: Throw exception from tool, Untyped error
 | **Thai-Optimized Model** | An AI model fine-tuned specifically for Thai (for example, the Typhoon series from SCB10X) | Generic model, English-only model |
 | **Model Unload/Load** | The process of unloading a model from VRAM and loading another in its place in order to switch between models | Model switching (ambiguous), Hot swap |
 | **Cold Start Penalty** | The 5-15 second delay incurred while model weights are loaded into VRAM after the model has been unloaded (keep_alive: 0) | Initial delay, First-run latency |
 | **Canonical AI Model Identity** | The primary model name that the backend, admin console, and architecture documents all reference as a single source of truth | Alias-only model name, temporary deploy tag |
 | **Adaptive OCR Residency** | A keep_alive policy for the OCR model that adapts to the current VRAM headroom and active model, instead of always staying resident or always unloading | Fixed keep_alive, always-resident OCR |
 | **Execution Profile** | A policy-level signal sent by the caller to indicate the desired speed/accuracy/context level; the backend maps it to a permitted model and parameters | Free-form model key, direct model override |
 | **Canonical Profile Set** | The standard set of `Execution Profile` values, fixed at the contract level, such as `fast`, `balanced`, `thai-accurate`, `large-context`, instead of splitting profiles by internal pipeline | Job-specific routing key, per-endpoint profile taxonomy |
 | **Policy-Enforced Profile Override** | A rule that lets the backend force a profile for jobs that affect data or metadata, regardless of the value the caller sent | Caller-controlled quality for write-affecting jobs, advisory-only governance |
 | **LLM-First GPU Ownership** | A VRAM priority policy that puts the main LLM and OCR path ahead of embedding/reranking; the retrieval side may use the GPU only when policy confirms there is headroom | Flat shared GPU pool, equal-priority GPU consumers |
 | **CPU Fallback Retrieval** | Degraded behavior in which embedding/reranking switches back to CPU immediately when GPU headroom is insufficient, without waiting in a GPU queue | GPU wait queue for retrieval, hard failure on low VRAM |
 | **Selective Realtime Concurrency** | A policy that raises `ai-realtime` concurrency only for job types that touch neither the OCR path nor model switching; the core pause/resume coordination stays in place | Global realtime concurrency uplift, scheduler rewrite |
 | **Lightweight Realtime Job** | A job in `ai-realtime` that does not call OCR, does not force a model switch, and does not depend on a GPU-heavy generation path, and is therefore eligible for the concurrency uplift set | RAG query, OCR-triggering job, GPU-heavy generation |
 | **Generation-Centric RAG Query** | Classifying `rag-query` as primarily a generation job, where retrieval prepares context and is allowed to degrade | Retrieval-first RAG, search-only job |
 | **Restricted Large-Context Profile** | The `large-context` profile is a special capability limited to admin or special workflows that the backend permits; it is not a general option for `rag-query` | Public long-context option, caller-driven context inflation |
 | **Big Bang AI Runtime Rollout** | Changing runtime policy, model identity, and GPU scheduling together in a single deploy cycle, because the system is not yet in production | Phase-gated rollout, incremental policy cutover |
 | **Big Bang Cutover Gate** | Pre-cutover pass criteria that require policy contract, model switching, adaptive OCR residency, and RAG fallback to pass as a complete set; partial success is not accepted | Best-effort rollout, partial completion gate |
 | **Executable-First Verification** | The primary evidence for the AI runtime rollout must come from repeatable tests, logs, metrics, or traces, and every axis must also be paired with a manual validation path that confirms real-world behavior | Manual-only signoff, unverifiable smoke check |
 | **Single-Name Canonical Model Policy** | Once a new canonical model identity is declared, the same name must be used consistently in every layer of the system that users and developers see; the actual base runtime name is an implementation detail confined to ops/runtime internals | Dual naming, mixed canonical and base model labels |
 | **Canonical OCR Identity** | The OCR model must use a single canonical name in every layer of the system, e.g. `np-dms-ocr`, without exposing the old runtime name as a primary public/internal contract | Legacy OCR runtime label as primary name, mixed OCR naming |
 | **Profile-Only Parameter Governance** | API callers may send only an `Execution Profile`; temperature, top_p, max tokens, and the actual runtime parameters are set solely by backend policy | Caller parameter override, free-form runtime tuning |
 | **Integrated Retrieval Acceleration Policy** | Retrieval acceleration, such as BGE embedding/reranking on GPU, is part of the same AI runtime resource policy as the main model and OCR, not an independent optimization task | Standalone retrieval tuning, separate GPU policy for RAG only |
 ---
@@ -226,6 +244,24 @@ _Avoid_: Throw exception from tool, Untyped error
 - **"AI = Document Controller"** - resolved: use **AI Document Assistant** (Suggest + Insight) instead, to prevent scope creep toward an autonomous agent
 - **OpenRAG vs ADR-023A** - resolved: **ADR-023A is the canonical source**: use Qdrant + nomic-embed-text for vector search; Elasticsearch is for keyword/full-text only; `specs/03-Data-and-Storage/03-07-OpenRAG.md` is a reference document, not an active spec
 - **".agents/ vs Production AI"** - resolved: `.agents/` is the Dev AI toolkit (coding assistance); Production AI is AI Gateway + n8n + Ollama; they are separate layers
 - **"np-dms-ai" vs `typhoon2.5-np-dms:latest`** - resolved: under the new AI refactor, `np-dms-ai` is the system's new **Canonical AI Model Identity**, not merely a deploy alias
 - **"OCR keep_alive"** - resolved: the new policy should be described as **Adaptive OCR Residency**, driven by VRAM headroom and the active model, not a fixed `0` or a fixed `300`
 - **"`model.key` in the API job request"** - resolved: callers should not pick a model name directly; they should send an **Execution Profile** and let backend policy map it to the permitted model/parameters
 - **"profile names"** - resolved: use a small, stable **Canonical Profile Set** (`fast`, `balanced`, `thai-accurate`, `large-context`) instead of splitting profile names by internal job
 - **"profile for migrate-document / auto-fill-document / OCR extraction"** - resolved: use **Policy-Enforced Profile Override**; the backend enforces the profile itself for data-affecting jobs and does not let callers choose quality freely
 - **"BGE-M3 / Reranker on GPU"** - resolved: if moved to the GPU, they must fall under **LLM-First GPU Ownership**; LLM/OCR always has higher priority than the retrieval path
 - **"embed/rerank when VRAM is insufficient"** - resolved: use **CPU Fallback Retrieval**; the retrieval path must degrade to CPU immediately, without waiting in a GPU queue
 - **"`ai-realtime = 2`"** - resolved: use **Selective Realtime Concurrency**; raise it only for realtime jobs that do not collide with OCR/model switching, and keep the existing pause/resume model as the core
 - **"Which jobs qualify for realtime concurrency 2"** - resolved: limited to **Lightweight Realtime Job**; excludes `rag-query`
 - **"How should `rag-query` be classified"** - resolved: use **Generation-Centric RAG Query**; the main model path is the primary policy, and retrieval is a context-preparation step that can fall back to CPU
 - **"What is `large-context` used for"** - resolved: use **Restricted Large-Context Profile**; limited to admin/special workflows and not offered as a general option for `rag-query`
 - **"Rollout of the AI refactor"** - resolved: use **Big Bang AI Runtime Rollout** even though several runtime policy changes land together, because the system is not yet in production
 - **"What are the pass criteria for the big bang"** - resolved: use **Big Bang Cutover Gate**; policy contract, model switching, adaptive OCR residency, and RAG fallback must all pass
 - **"What evidence counts as passing the gate"** - resolved: use **Executable-First Verification** as the primary evidence, paired with a manual validation path on every axis
 - **"How should `np-dms-ai` be named across the system"** - resolved: use **Single-Name Canonical Model Policy**; `np-dms-ai` is the single name in every layer that users and developers see
 - **"Should `np-dms-ocr` follow the same naming policy"** - resolved: use **Canonical OCR Identity**; `np-dms-ocr` is the single canonical name in every layer, just like `np-dms-ai`
 - **"Who controls `temperature/topP/maxTokens`"** - resolved: use **Profile-Only Parameter Governance**; callers send only a profile, and backend policy controls all actual runtime parameters
 - **"Is the BGE GPU uplift in the same scope"** - resolved: use **Integrated Retrieval Acceleration Policy**; retrieval acceleration is part of the same runtime resource policy
 ## ADRs related to the AI Runtime Layer
@@ -0,0 +1,12 @@
 # AI Runtime Policy Refactor for RTX 5060 Ti 16GB
 The LCBP3-DMS AI runtime will move to the canonical identities `np-dms-ai` and `np-dms-ocr`, use `executionProfile` as a policy-level contract in place of model key/parameter overrides, and bring GPU scheduling for the main model, OCR, embedding, and reranking under a single policy. This decision supports the upgrade to the RTX 5060 Ti 16GB while preserving the system's existing AI governance: backend policy decides the actual model/parameters, `rag-query` is a generation-centric job, retrieval may use the GPU only under LLM-first ownership and must be able to fall back to CPU, and the rollout uses a big bang cutover with executable-first verification plus a manual validation path for every key axis.
 ## Considered Options
 - Keep the old canonical names (`typhoon2.5-np-dms:latest` / `typhoon-np-dms-ocr:latest`) and use aliases only at deploy time
 - Allow callers to send `model.key` and runtime parameters in the job request
 - Use a shared GPU pool with equal priority across LLM, OCR, embed, and rerank
 - Use a phase-gated rollout that splits naming, residency, retrieval acceleration, and queue policy into multiple rounds
 We rejected these options because they duplicate governance, open a path to bypass the central policy, or split resource policies that are in fact coupled into separate concerns. For this refactor, the system will use the single-name canonical model policy, profile-only parameter governance, adaptive OCR residency, LLM-first GPU ownership, CPU fallback retrieval, selective realtime concurrency for lightweight realtime jobs only, and a big bang cutover gate that requires contract, model switching, OCR residency, and RAG fallback to all pass.
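
The profile-to-policy mapping chosen in this ADR could be sketched roughly as follows. This is only an illustration: the four profile names, the two canonical model names, and the data-affecting job types come from the documents in this commit, while `RuntimePolicy`, `resolve_policy`, the `privileged` flag, the choice of `thai-accurate` as the enforced profile, and every numeric value are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimePolicy:
    model: str          # canonical name only, never a base runtime tag
    temperature: float  # placeholder value, not taken from the ADR
    num_ctx: int        # placeholder value, not taken from the ADR


# Hypothetical mapping: profile names are from the ADR, numbers are assumptions.
PROFILE_POLICY = {
    "fast": RuntimePolicy("np-dms-ai", 0.2, 4096),
    "balanced": RuntimePolicy("np-dms-ai", 0.1, 8192),
    "thai-accurate": RuntimePolicy("np-dms-ai", 0.0, 8192),
    "large-context": RuntimePolicy("np-dms-ai", 0.1, 16384),
}

DATA_AFFECTING_JOBS = frozenset({"migrate-document", "auto-fill-document"})
ENFORCED_PROFILE = "thai-accurate"  # assumption: which profile the backend forces
DEFAULT_PROFILE = "balanced"


def resolve_policy(
    job_type: str,
    requested_profile: Optional[str] = None,
    privileged: bool = False,
) -> RuntimePolicy:
    """Return the runtime policy the backend would apply to a job."""
    if job_type in DATA_AFFECTING_JOBS:
        # Policy-enforced override: the caller's requested profile is ignored.
        profile = ENFORCED_PROFILE
    else:
        profile = requested_profile or DEFAULT_PROFILE
    if profile not in PROFILE_POLICY:
        raise ValueError(f"unknown execution profile: {profile!r}")
    if profile == "large-context" and not privileged:
        # Restricted profile: only admin/special workflows may use it.
        raise PermissionError("large-context is restricted")
    return PROFILE_POLICY[profile]
```

Keeping the table in one place means a base model or parameter change touches only `PROFILE_POLICY`, not the public contract.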
@@ -1,537 +1,315 @@
-# AI Refactor
+# AI Runtime Refactor
-เนื่องจากการอัพเกรด จาก RTX2060 SUPER 8GB เป็น ASUS DUAL **RTX5060 Ti 16GB**
+
 This document summarizes the grilling session on refactoring the AI runtime after the GPU upgrade from RTX 2060 SUPER 8GB to ASUS DUAL RTX 5060 Ti 16GB
 References:
 - [ADR-033](../specs/06-Decision-Records/ADR-033-active-model-and-ocr-management.md)
 - [ADR-034](../specs/06-Decision-Records/ADR-034-AI-model-change.md)
 - [New ADR: AI Runtime Policy Refactor](./adr/0001-ai-runtime-policy-refactor.md)
 - [CONTEXT.md](../CONTEXT.md)
 ## Goals
 Improve AI processing performance by making appropriate use of the new hardware, and adjust the workflow to match it
-```text
+- Rename the main model and the OCR model to the new canonical identities
-Typhoon OCR 1.5
+- Move the API contract from caller-driven model selection to policy-driven `executionProfile`
-Typhoon2.5-Qwen3-4B
+- Bring VRAM management for the main model, OCR, embedding, and reranking under a single policy
-BGE-M3
+- Use a big bang rollout with explicit cutover rules and repeatable verification
 การตั้งค่าระบบคิว (BullMQ) ร่วมกับ AI
 ```
 ## Model
-|Model Name|Size|Base FROM|PARAMETER|File|
+## Decision Summary
 |-|-|-|-|-|
 |np-dms-ocr|2.9GB|scb10x/typhoon-ocr1.5-3b:latest|num_ctx 8192|np-dms-ocr-model.md|
 |np-dms-typhoon2.5|3.6GB|scb10x/typhoon2.5-qwen3-4b:latest|num_ctx 8192|np-dms-typhoon2.5.model.md|
 |np-dms-llama3.1-typhoon2-8b|5.5GB|scb10x/llama3.1-typhoon2-8b-instruct|num_ctx 8192|np-dms-llama3.1-typhoon2-8b.model.md|
 |np-dms-gemma4-4eb|3.2GB|gemma4:e4b|num_ctx 8192|np-dms-gemma4-4eb.model.md|
 |np-dms-openthaigpt-7b|8GB|promptnow/openthaigpt1.5-7b-instruct-q4_k_m|num_ctx 8192|np-dms-openthaigpt-7b.model.md|
 |np-dms-openthaigpt-14b|9.7GB|promptnow/openthaigpt1.5-14b-instruct-q4_k_m|num_ctx 8192|np-dms-openthaigpt-14b.model.md|
 ### 1. Canonical naming
 - Use `np-dms-ai` as the single canonical model identity in every layer that users and developers see
 - Use `np-dms-ocr` as the single canonical OCR identity in every layer
 - The actual runtime/base model name is an implementation detail confined to the Modelfile, deploy script, or ops internals
-ollama create np-dms-typhoon2.5 -f np-dms-typhoon2.5.model.md
+### 2. API contract
-ollama create np-dms-llama3.1-typhoon2-8b -f np-dms-llama3.1-typhoon2-8b.model.md
+- Callers may send only `executionProfile`
 - Callers must not send `model.key`
 - Callers must not directly override `temperature`, `top_p`, `maxTokens`, or any other runtime parameter
 - Backend policy maps `executionProfile` to the canonical model, runtime parameters, and keep_alive policy
-ollama create np-dms-gemma4-4eb -f np-dms-gemma4-4eb.model.md
+### 3. Canonical profile set
-ollama create np-dms-openthaigpt-7b -f np-dms-openthaigpt-7b.model.md
+โปรไฟล์ระดับ contract มีแค่:
-ollama create np-dms-openthaigpt-14b -f np-dms-openthaigpt-14b.model.md
+- `fast`
 - `balanced`
 - `thai-accurate`
 - `large-context`
---
+กฎเพิ่ม:
-## Architecture Decisions (RTX5060 Ti 16GB Optimized)
+- `large-context` จำกัดเฉพาะ admin/special workflows
 - งานที่มีผลต่อข้อมูล เช่น `migrate-document`, `auto-fill-document`, OCR extraction ใช้ backend override profile เอง
-> สรุปการตัดสินใจจาก grilling session — อัปเกรดจาก RTX2060 SUPER 8GB
+### 4. Runtime resource policy
-### VRAM Budget
+- `np-dms-ai` is the primary workload of the generation path
 - `np-dms-ocr` uses adaptive residency instead of a fixed `keep_alive`
 - Retrieval acceleration (`BGE-M3`, `BGE-Reranker-Large`) falls under the same policy as main/OCR
 - GPU ownership follows the LLM-first principle
 - If VRAM headroom is insufficient, retrieval must fall back to CPU immediately
-| คอมโพเนนต์ | VRAM | หมายเหตุ |
+### 5. Queue policy
 |-----------|------|----------|
 | `typhoon2.5-np-dms` | 3.6GB | โหลดค้างตลอด (resident) |
 | `typhoon-np-dms-ocr` | 2.9GB | transient (load on-demand) |
 | BGE-M3 | 2.3GB | ย้ายเข้า GPU (Sidecar device='cuda') |
 | BGE-Reranker-Large | 1.5GB | ย้ายเข้า GPU (Sidecar device='cuda') |
 | **รวมสูงสุด** | **~10.3GB** | เหลือ headroom ~5.7GB |
-### BullMQ Concurrency
+- คงโครง `ai-realtime` / `ai-batch` และ pause/resume coordination เดิมเป็นแกน
 - Allow `ai-realtime = 2` only for lightweight realtime jobs
 - `rag-query` is not a lightweight realtime job
 - `rag-query` is a generation-centric job: retrieval prepares context and may fall back to CPU
-| Queue | Concurrency | เหตุผล |
+### 6. Rollout policy
 |-------|-------------|--------|
 | `ai-realtime` | **2** | VRAM เหลือเยอะ, response เร็วขึ้น |
 | `ai-batch` | **1** | background job, ป้องกัน VRAM overflow |
-### Model Loading Strategy
+- rollout ใช้ `Big Bang`
 - cutover จะถือว่าสำเร็จต่อเมื่อผ่านครบทั้ง:
  - policy contract
  - model switching
  - adaptive OCR residency
  - RAG fallback
-| โมเดล | กลยุทธ์ | keep_alive |
+## Canonical Models
 |-------|---------|------------|
 | `typhoon2.5-np-dms` | โหลดค้างตลอด (ไม่ unload) | — |
 | `typhoon-np-dms-ocr` | โหลดตาม demand, unload อัตโนมัติหลัง 5 นาที | 300 |
-### Sidecar Changes (port 8765)
+| Canonical Name | บทบาท | Residency policy | หมายเหตุ |
 |---|---|---|---|
 | `np-dms-ai` | main generation model | resident by default | backend policy คุม runtime parameters |
 | `np-dms-ocr` | OCR model | adaptive | ใช้ policy ตาม VRAM headroom และ active workload |
-```diff
+หมายเหตุ:
-# ปัจจุบัน (CPU RAM)
+- เอกสารนี้ไม่บังคับว่าฐานจริงต้องเป็น model family ใดเสมอไป
-POST /embed   → BGE-M3 (CPU)
+- การเปลี่ยน base runtime model ในอนาคตไม่ควรเปลี่ยน canonical API/UI name ถ้า semantics เดิมยังอยู่
 POST /rerank  → BGE-Reranker (CPU)
-# หลังอัปเกรด (GPU)
+## Execution Profile Contract
 POST /embed   → BGE-M3 (GPU via device='cuda')
 POST /rerank  → BGE-Reranker (GPU via device='cuda')
 POST /ocr-upload  → Typhoon OCR (Ollama) ← ไม่เปลี่ยน
 POST /normalize   → PyThaiNLP (CPU)      ← ไม่เปลี่ยน
 ```
-### Implementation Tasks
+### Request DTO
- [ ] แก้ไข Sidecar Dockerfile — เพิ่ม CUDA runtime
+```typescript
- [ ] แก้ไข Sidecar app.py — เปลี่ยน `device='cuda'` สำหรับ BGE models
+interface CreateAiJobRequest {
- [ ] แก้ไข docker-compose.yml — เพิ่ม NVIDIA Container Toolkit
+  type: 'auto-fill-document' | 'migrate-document' | 'rag-query';
- [ ] อัปเดต BullMQ concurrency config (ai-realtime=2)
+  documentId?: string;
- [ ] อัปเดต OCR keep_alive จาก 0 เป็น 300
+  attachmentId?: string;
- [ ] ตรวจสอบ OllamaService รองรับ resident model
+  executionProfile?: 'fast' | 'balanced' | 'thai-accurate' | 'large-context';
 - [ ] ทดสอบ VRAM usage จริงกับเอกสารขนาดใหญ่
 ### Rollout Strategy
 **Big Bang** — ระบบยังไม่เปิดใช้งาน production ทำการเปลี่ยนแปลงทั้งหมดในครั้งเดียว
 ---
 # Phase 1 : Foundation
 ## 1. Infrastructure
 ### AI Services
 ```text
 Ollama
 ├── Typhoon OCR 1.5
 ├── Typhoon2.5-Qwen3-4B
 └── BGE-M3
 ```
 ### Database
 ```text
 Qdrant
 ```
 ### Storage AI
 ```text
 File Serv
 ├── OCR Output
 └── Processed Data
 ```
 ---
 # Phase 2 : Ingestion Pipeline
 ## Step 1 Upload
 ```text
 PDF Upload
 ↓
 Store Original File
 ↓
 Create Job
 ```
 ---
 ## Step 2 OCR
 ### Input
 ```text
 PDF
 ```
 ### Process
 ```text
 Typhoon OCR
 ```
 ### Output
 ```json
 {
  "page": 1,
  "content": "..."
 }
 ```
-Store
+### Policy rules
-```text
+- `migrate-document`: backend override เป็น profile ที่ deterministic สูงเสมอ
-raw_ocr
+- `auto-fill-document`: backend override ได้ตาม data-affecting policy
-```
+- `rag-query`: ปกติใช้ `balanced` หรือ policy ที่ backend กำหนด
 - `large-context`: ใช้ได้เฉพาะ admin/special workflows ที่ backend whitelist
-Table
+### Forbidden contract
-```sql
+สิ่งต่อไปนี้ต้องไม่มีใน public contract:
 document_pages
 ```
-```sql
+```typescript
-document_id
+model: {
-page_no
+  key: string;
-raw_text
+  parameters: {
-```
+    temperature?: number;
-
+    topP?: number;
---
+    maxTokens?: number;
-
+  };
 ## Step 3 Structure
 ### Input
 ```text
 Raw OCR Text
 ```
 ### Process
 ```text
 Typhoon2.5
 ```
 Prompt
 ```text
 จัดโครงสร้างเอกสาร
 แยก Heading
 Section
 Metadata
 ห้ามสรุป
 ```
 Output
 ```json
 {
  "document_type": "ITP",
  "project": "...",
  "heading": "...",
  "content": "..."
 }
 ```
-Store
+Rationale:
 - Callers could bypass governance
 - The verification matrix would grow beyond what is needed
 - The profile abstraction would immediately lose its meaning
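
A minimal sketch of how the forbidden contract might be enforced at the request boundary. The field names (`type`, `documentId`, `attachmentId`, `executionProfile`), job types, and profile names mirror the Request DTO in this document; the function `validate_job_request` and its allow-list approach are assumptions for illustration, not the project's actual validator.

```python
# Allow-list validation: anything not in the DTO (for example `model`,
# `temperature`, `topP`, `maxTokens`) is rejected rather than ignored.
ALLOWED_FIELDS = frozenset({"type", "documentId", "attachmentId", "executionProfile"})
ALLOWED_TYPES = frozenset({"auto-fill-document", "migrate-document", "rag-query"})
ALLOWED_PROFILES = frozenset({"fast", "balanced", "thai-accurate", "large-context"})


def validate_job_request(payload: dict) -> dict:
    """Return a copy of the payload if it respects the public contract."""
    forbidden = set(payload) - ALLOWED_FIELDS
    if forbidden:
        raise ValueError(f"forbidden fields in job request: {sorted(forbidden)}")
    if payload.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"unsupported job type: {payload.get('type')!r}")
    profile = payload.get("executionProfile")
    if profile is not None and profile not in ALLOWED_PROFILES:
        raise ValueError(f"unknown execution profile: {profile!r}")
    return dict(payload)
```

Rejecting unknown fields outright, instead of silently dropping them, makes an attempted bypass visible in tests and logs.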
 ## Adaptive OCR Residency
 Principles:
 - `np-dms-ocr` does not use a hard-coded `keep_alive: 0` or `keep_alive: 300`
 - Backend policy computes residency from VRAM headroom and the current active model/workload
 - If the active workload consumes a lot of VRAM, or the current profile risks exhausting headroom, fall back to `keep_alive: 0`
 - If headroom remains and there is no significant contention, a temporary residency window is allowed
 Example policy:
 ```text
-structured_document
+if active_profile == 'large-context' => OCR keep_alive = 0
 if active_main_model_pressure == high => OCR keep_alive = 0
 if headroom >= policy threshold => OCR keep_alive = short residency window
 ```
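
The three rules above could be expressed as a small decision function. This is a sketch under stated assumptions: the function name `ocr_keep_alive`, the 4096 MB headroom threshold, and the 120 second residency window are hypothetical values chosen for illustration; only the rule order and the meaning of `keep_alive = 0` (unload right after the request) come from this document.

```python
def ocr_keep_alive(
    active_profile: str,
    main_model_pressure: str,
    headroom_mb: int,
    threshold_mb: int = 4096,       # assumption: policy threshold
    residency_seconds: int = 120,   # assumption: "short residency window"
) -> int:
    """Return the keep_alive value (seconds) to send with an OCR request."""
    if active_profile == "large-context":
        return 0  # large context is assumed to leave no safe headroom
    if main_model_pressure == "high":
        return 0  # main model has priority, unload OCR immediately
    if headroom_mb >= threshold_mb:
        return residency_seconds  # enough headroom: keep OCR warm briefly
    return 0  # default to the conservative choice
```

Defaulting to `0` when no rule matches keeps the behavior identical to the pre-refactor `keep_alive: 0` setting whenever the policy is unsure.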
---
+## LLM-First GPU Ownership
-## Step 4 Chunking
+ลำดับสิทธิ์ VRAM:
-### ไม่ใช้ LLM
+1. `np-dms-ai`
 2. `np-dms-ocr`
 3. `BGE-M3`
 4. `BGE-Reranker-Large`
-ใช้
+Behavioral consequences:
 - The retrieval path may use the GPU only when policy confirms there is real headroom
 - The retrieval path has no right to force a wait on the GPU in order to take resources from the main/OCR path
 - If headroom is insufficient, `embed` and `rerank` must fall back to CPU immediately
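
One way to picture the priority order is a greedy placement pass over the four workloads. The sketch below is illustrative only: the priority list comes from this section and the example sizes in the usage note come from the VRAM budget table of the previous revision, while `place_workloads` and the `"gpu"` / `"cpu"` / `"wait"` labels are hypothetical names.

```python
# Priority order from this section (highest first).
PRIORITY = ["np-dms-ai", "np-dms-ocr", "BGE-M3", "BGE-Reranker-Large"]
RETRIEVAL = frozenset({"BGE-M3", "BGE-Reranker-Large"})


def place_workloads(requested_mb: dict, total_vram_mb: int) -> dict:
    """Assign each requested workload to 'gpu', 'cpu', or 'wait'."""
    remaining = total_vram_mb
    placement = {}
    for name in PRIORITY:
        if name not in requested_mb:
            continue
        need = requested_mb[name]
        if need <= remaining:
            placement[name] = "gpu"
            remaining -= need
        elif name in RETRIEVAL:
            # Retrieval never waits for the GPU: degrade to CPU immediately.
            placement[name] = "cpu"
        else:
            # Main/OCR contention is left to model switching, outside this sketch.
            placement[name] = "wait"
    return placement
```

With `{"np-dms-ai": 3600, "np-dms-ocr": 2900, "BGE-M3": 2300, "BGE-Reranker-Large": 1500}` everything fits on a 16000 MB budget, whereas a 7000 MB budget pushes both retrieval models to CPU while main and OCR stay on the GPU.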
 ## Retrieval Acceleration
 ### Scope
 This document treats retrieval acceleration as part of the same runtime resource policy, not as separate tuning
 ### Sidecar policy
 Current:
 ```text
-Markdown Header Splitter
+POST /embed   -> CPU
-+
+POST /rerank  -> CPU
 Recursive Splitter
 ```
-Config
+เป้าหมาย:
 ```yaml
 chunk_size: 800
 chunk_overlap: 120
 ```
 Output
 ```json
 {
  "chunk_id": "...",
  "heading": "...",
  "content": "...",
  "page": 12
 }
 ```
 ---
 ## Step 5 Embedding
 ### Input
 ```text
-Chunk
+POST /embed   -> GPU เมื่อ headroom ผ่าน policy, ไม่เช่นนั้นใช้ CPU
 POST /rerank  -> GPU เมื่อ headroom ผ่าน policy, ไม่เช่นนั้นใช้ CPU
 POST /ocr-upload -> OCR path ตาม adaptive OCR residency
 POST /normalize  -> CPU
 ```
-### Process
+### Retrieval fallback rule
-```text
+- ห้าม queue รอ GPU เพื่อให้ retrieval ได้ acceleration
-BGE-M3
+- ห้าม fail hard เพียงเพราะ GPU ไม่พอ
-```
+- ให้ degrade ไป CPU แล้วตอบงานต่อ
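
The fallback rule can be sketched as a thin wrapper around whatever computes embeddings or rerank scores. Everything here is an assumption made for illustration: `run_with_cpu_fallback`, the `task(device)` callable shape, and the `gpu_allowed` flag are hypothetical and are not part of the existing sidecar. The wrapper catches `RuntimeError` because CUDA out-of-memory errors in PyTorch are raised as a subclass of it.

```python
import logging
from typing import Any, Callable, Tuple

logger = logging.getLogger("retrieval-fallback")


def run_with_cpu_fallback(
    task: Callable[[str], Any],
    gpu_allowed: bool,
) -> Tuple[Any, str]:
    """Run task on GPU when policy allows it, otherwise (or on failure) on CPU.

    Returns the task result together with the device that produced it.
    """
    if gpu_allowed:
        try:
            return task("cuda"), "cuda"
        except RuntimeError as exc:
            # No queueing and no hard failure: log the decision and degrade.
            logger.warning("GPU retrieval path failed, falling back to CPU: %s", exc)
    return task("cpu"), "cpu"
```

Returning the device alongside the result gives the caller something concrete to record in telemetry for each fallback decision.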
-### Output
+## Queue and Scheduling
-```text
+### Baseline
 Vector
 ```
---
+- `ai-batch` ยังสามารถถูก pause/resume โดย realtime path ตาม coordination model เดิม
 - `ai-realtime = 1` ยังคงเป็น baseline สำหรับงาน generation-heavy
-## Step 6 Index
+### Selective realtime uplift
-Store in
+อนุญาต `ai-realtime = 2` เฉพาะกลุ่มงานที่เป็น lightweight realtime jobs เช่น:
-```text
+- intent classification ที่ไม่เรียก OCR
-Qdrant
+- tool-only suggestion path ที่ไม่บังคับ model switching
-```
+- metadata-free chat steps ที่ไม่ใช้ GPU-heavy generation
-Payload
+ไม่รวม:
-```json
+- `rag-query`
-{
+- OCR-triggering jobs
-  "document_id": "...",
+- งานที่บังคับ model switching
-  "page": 12,
+- generation-heavy jobs
  "document_type": "ITP",
  "heading": "Inspection",
  "content": "..."
 }
 ```
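
The selective realtime uplift rules can be illustrated with a small classifier. This is a sketch, not the scheduler itself: `RealtimeJob`, its boolean flags, and `realtime_concurrency` are hypothetical names, and how the real BullMQ workers would consume the result is left open.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RealtimeJob:
    job_type: str
    triggers_ocr: bool = False
    forces_model_switch: bool = False
    gpu_heavy_generation: bool = False


def is_lightweight(job: RealtimeJob) -> bool:
    """A job is lightweight only if it avoids OCR, model switching, and heavy generation."""
    if job.job_type == "rag-query":
        return False  # rag-query is generation-centric by policy
    return not (job.triggers_ocr or job.forces_model_switch or job.gpu_heavy_generation)


def realtime_concurrency(jobs: Iterable[RealtimeJob]) -> int:
    """Allow 2 concurrent realtime jobs only when every candidate is lightweight."""
    return 2 if all(is_lightweight(job) for job in jobs) else 1
```

A single generation-heavy or OCR-triggering job in the candidate set is enough to drop the queue back to the baseline of 1.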
---
+## Big Bang Rollout
-# Phase 3 : Retrieval
+### Decision
-## Step 1 User Query
+refactor รอบนี้ใช้ big bang rollout เพราะระบบยังไม่เปิด production
-```text
+### Consequence
 Slump Test สำหรับงานพื้นชั้น 2 คืออะไร
 ```
---
+ห้ามใช้เกณฑ์ partial success แบบ "บางแกนผ่านก็ถือว่าปล่อยได้"
-## Step 2 Query Embedding
+### Cutover gate
-```text
+ต้องผ่านครบทุกแกน:
 BGE-M3
 ```
---
+1. policy contract ใหม่ทำงานจริง
 2. canonical naming ใหม่ทำงานจริง
 3. model switching และ OCR residency ตรง policy ใหม่
 4. retrieval GPU/CPU fallback ทำงานจริง
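
The all-or-nothing nature of the gate can be made explicit in code. The sketch below assumes each axis is reported with two booleans, one for executable evidence and one for manual validation; the axis keys, `AxisEvidence`, and `cutover_allowed` are hypothetical names used only to illustrate the rule.

```python
from dataclasses import dataclass
from typing import Mapping

# The four axes of the cutover gate listed above (keys are illustrative).
GATE_AXES = (
    "policy-contract",
    "canonical-naming",
    "model-switching-and-ocr-residency",
    "retrieval-fallback",
)


@dataclass(frozen=True)
class AxisEvidence:
    executable_passed: bool  # repeatable tests, logs, metrics, or traces
    manual_validated: bool   # paired manual validation path


def cutover_allowed(evidence: Mapping[str, AxisEvidence]) -> bool:
    """Return True only when every axis has both kinds of evidence."""
    return all(
        axis in evidence
        and evidence[axis].executable_passed
        and evidence[axis].manual_validated
        for axis in GATE_AXES
    )
```

A missing axis counts as a failure, so a partially reported run can never pass the gate by omission.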
-## Step 3 Search
+## Verification
-```text
+ใช้แนวทาง executable-first แต่ทุกแกนต้องมี manual validation path ประกบ
 Qdrant
 ```
-Top K
+### 1. Policy contract
-```text
+Executable:
 10-20
 ```
---
+- unit/integration tests สำหรับ DTO และ policy mapping
 - tests ว่า caller ส่ง `model.key` หรือ parameter overrides ไม่ได้
 - tests ว่า data-affecting jobs ถูก backend override profile จริง
-## Step 4 Re-rank (แนะนำ)
+Manual:
-ใช้
+- ยิง request จาก admin/sandbox แล้วตรวจว่า UI/API ไม่ expose free-form model selection
-```text
+### 2. Canonical naming
 Typhoon2.5
 ```
-หรือภายหลังเพิ่ม
+Executable:
-```text
+- search-based checks ว่า public-facing contract ใช้ `np-dms-ai` / `np-dms-ocr`
-bge-reranker-v2
+- tests สำหรับ settings/service/controller ที่คืนชื่อ canonical
 ```
-Flow
+Manual:
-```text
+- เปิด AI Admin Console และ OCR sandbox ตรวจ label/option/log surface ที่ผู้ใช้เห็น
 Top20
 ↓
 Top5
 ```
---
+### 3. Adaptive OCR residency
-## Step 5 Answer
+Executable:
-ใช้
+- tests ว่า residency policy ให้ `keep_alive` ต่างกันตาม headroom scenario
 - logs/trace ว่า OCR requests ใช้ residency decision ตาม policy
-```text
+Manual:
 Typhoon2.5
 ```
-Prompt
+- รัน OCR ซ้ำหลายงานในเงื่อนไข headroom ต่างกันและตรวจ behavior จริง
-```text
+### 4. Retrieval fallback
 ตอบจาก Context เท่านั้น
 อ้างอิงเอกสาร
 อ้างอิงหน้า
 ห้ามเดา
 ```
-Output
+Executable:
-```text
+- tests ว่า `/embed` และ `/rerank` fallback CPU เมื่อ GPU threshold ไม่ผ่าน
-คำตอบ
+- trace/log ว่า `rag-query` ยังตอบได้เมื่อ GPU retrieval path ถูกปิด
-อ้างอิง:
+Manual:
 ITP-001 หน้า 12
 MS-005 หน้า 8
 ```
---
+- ทดลอง RAG query ภายใต้ภาระ GPU สูงและยืนยันว่าคำตอบยังออกได้แม้ช้าลง
-# Phase 4 : Metadata Extraction
+## Implementation Workstreams
-เพิ่มภายหลัง
+### Workstream A: Contract and naming
-Typhoon2.5 Extract
+- เปลี่ยน public contract ให้ใช้ `executionProfile`
 - ลบ `model.key` และ parameter override จาก API docs/DTO ที่เกี่ยวข้อง
 - เปลี่ยน public-facing names เป็น `np-dms-ai` และ `np-dms-ocr`
-```text
+### Workstream B: Runtime policy
 Project
 Contractor
 Subcontractor
 Discipline
 Document Type
 Revision
 Date
 ```
-เก็บใน PostgreSQL
+- สร้าง policy mapping profile -> runtime configuration
 - เพิ่ม adaptive OCR residency logic
 - แยก policy ของ data-affecting jobs ออกจาก caller input
-ช่วยทำ Filter Search เช่น
+### Workstream C: Retrieval acceleration
-```text
+- เพิ่ม GPU eligibility check สำหรับ `embed` และ `rerank`
-Project = ABC
+- เพิ่ม CPU fallback path ที่ explicit
-Type = MIR
+- บันทึก telemetry/log สำหรับ fallback decisions
 Revision = C
 ```
-ก่อนเข้า Qdrant
+### Workstream D: Queue policy
---
+- คง pause/resume coordination เดิม
 - แยก lightweight realtime jobs ออกจาก generation-heavy jobs
 - ใช้ selective concurrency uplift เฉพาะ job ที่ allowed
-# Ollama Models
+### Workstream E: Verification
-## Typhoon OCR
+- เพิ่ม automated tests ตาม cutover gate
 - เพิ่ม manual validation checklist สำหรับ admin console, OCR sandbox, และ RAG path
-```dockerfile
+## Non-Goals
 FROM scb10x/typhoon-ocr1.5-3b:latest
 ```
-ไม่ต้อง custom
+- Do not let callers choose runtime parameters themselves
 - Do not turn `rag-query` into a retrieval-first job
 - Do not remove the existing pause/resume coordination entirely
 - Do not split retrieval acceleration into a policy set separate from main/OCR
 - Do not use a phased rollout in this document
---
+## Migration Note for Current Repo
-## Typhoon2.5
+The current repo still has places that rely on the old names and policy, such as `typhoon2.5-np-dms:latest`, `typhoon-np-dms-ocr:latest`, and `keep_alive: 0`, across several services/specs. This document is therefore the new target architecture/policy, and the code, tests, cross-spec docs, and admin UI must be updated to match before the cutover can be considered successful.
 ```dockerfile
 FROM scb10x/typhoon2.5-qwen3-4b:latest
 PARAMETER temperature 0.1
 PARAMETER top_p 0.9
 PARAMETER repeat_penalty 1.05
 PARAMETER num_ctx 8192
 ```
 **ไม่มี SYSTEM**
 ---
 ## Runtime Config
 ### Structure
 ```json
 {
  "num_ctx": 8192,
  "temperature": 0
 }
 ```
 ### Answer
 ```json
 {
  "num_ctx": 16384,
  "temperature": 0.1
 }
 ```
 ---
 # MVP Roadmap
 ## Sprint 1
 ✅ Upload PDF
 ✅ OCR
 ✅ Store OCR
 ✅ Chunking
 ✅ Embedding
 ✅ Qdrant Search
 ---
 ## Sprint 2
 ✅ Typhoon2.5 Structuring
 ✅ Metadata Extraction
 ✅ Better Chunking
 ---
 ## Sprint 3
 ✅ RAG QA
 ✅ Citation
 ✅ Source Reference
 ---
 ## Sprint 4
 ✅ Hybrid Search (Vector + Metadata)
 ✅ Re-ranking
 ✅ Multi-document QA
 ---
 ### Architecture สุดท้าย
 ```text
 PDF
 ↓
 Typhoon OCR
 ↓
 Raw OCR
 ↓
 Typhoon2.5
 (Structure + Metadata)
 ↓
 Markdown/Header Splitter
 ↓
 Recursive Splitter
 ↓
 BGE-M3
 ↓
 Qdrant
 --------------------------------
 Question
 ↓
 BGE-M3
 ↓
 Qdrant
 ↓
 Top-K Chunks
 ↓
 Typhoon2.5
 ↓
 Answer + Citation
 ```
 สำหรับ MVP ผมจะ **ตัด Metadata Extraction ขั้นสูงและ Re-ranker ออกก่อน** แล้วทำให้ OCR → Search → Answer ใช้งานได้จริงภายใน 2–3 สัปดาห์แรก จากนั้นค่อยเพิ่มความแม่นยำทีละส่วน.
@@ -1,5 +1,5 @@
 # File: specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/Dockerfile
-# Tesseract OCR Sidecar — HTTP API server สำหรับสกัดข้อความจาก PDF/Image
+# Typhoon OCR Sidecar — HTTP API server สำหรับสกัดข้อความจาก PDF ผ่าน np-dms-ocr (Ollama)
 # รันบน Desk-5439 ตาม ADR-023A
 # Change Log:
 # - 2026-05-25: Initial Dockerfile สำหรับ OCR sidecar (port 8765)
@@ -7,23 +7,17 @@
 # - 2026-05-30: เพิ่ม system dependencies สำหรับ OpenCV (libsm6, libxext6, libxrender1, libfontconfig1, libx11-6)
 # - 2026-05-30: Typhoon OCR ใช้ httpx เรียก Ollama ผ่าน OLLAMA_API_URL (T009a, ADR-032)
 #              Container รันบน CPU เท่านั้น ไม่ต้องการ CUDA/GPU ใน container
 # - 2026-06-11: เพิ่ม typhoon-ocr ใน requirements.txt — poppler-utils มีอยู่แล้ว (ใช้โดย prepare_ocr_messages)
 # - 2026-06-11: ตัด tesseract-ocr, tesseract-ocr-tha, tesseract-ocr-eng, libsm6, libxext6, libxrender1, libfontconfig1, libx11-6 — ไม่ใช้ Tesseract อีกต่อไป
 FROM python:3.10-slim
-# ติดตั้ง system dependencies สำหรับ PDF processing, Tesseract OCR, ภาษาไทย และ OpenCV
+# ติดตั้ง system dependencies สำหรับ PDF processing และ PyMuPDF
 RUN apt-get update && apt-get install -y --no-install-recommends \
    libglib2.0-0 \
    libgl1 \
    libgomp1 \
    poppler-utils \
    tesseract-ocr \
    tesseract-ocr-tha \
    tesseract-ocr-eng \
    libsm6 \
    libxext6 \
    libxrender1 \
    libfontconfig1 \
    libx11-6 \
    && rm -rf /var/lib/apt/lists/*
 WORKDIR /app
@@ -1,6 +1,6 @@
 # File: specs/04-Infrastructure-OPS/04-00-docker-compose\Desk-5439\ocr-sidecar\app.py
-# Tesseract OCR HTTP Sidecar API — รับ POST /ocr แล้วคืนข้อความที่สกัดจาก PDF/Image
+# Typhoon OCR HTTP Sidecar API — รับ POST /ocr แล้วคืนข้อความที่สกัดจาก PDF/Image
-# ตาม ADR-023A: OCR auto-detect (PyMuPDF chars > 100 → Fast path, else Tesseract)
+# ตาม ADR-023A (revised 2026-06-11): ใช้ typhoon_ocr library + np-dms-ocr (Ollama) แทน Tesseract
 # Change Log:
 # - 2026-05-25: Initial FastAPI server สำหรับ Tesseract OCR sidecar
 # - 2026-05-30: เปลี่ยน lang='en' เป็น lang='ch' (CTJK) เพื่อรองรับภาษาไทย
@@ -20,20 +20,21 @@
 # - 2026-06-02: เพิ่มการตรวจสอบ API Key (X-API-Key Header) สำหรับ endpoints หลัก เพื่อความมั่นคงปลอดภัยตามข้อเสนอแนะ Code Review
 # - 2026-06-05: เพิ่ม Option 2 (aggressive preprocessing: deskew + Otsu threshold + morphology) และ Option 3 (smart post-processing: regex-based hallucination removal) เพื่อลด Tesseract noise/hallucination (T025)
 # - 2026-06-06: เปลี่ยน keep_alive จาก 300s เป็น 0 เพื่อ unload model ทันทีหลังเสร็จงาน (แก้ปัญหา VRAM ไม่พอเมื่อ typhoon2.5-np-dms load พร้อมกัน)
 # - 2026-06-11: เปลี่ยน process_with_typhoon_ocr ให้ใช้ prepare_ocr_messages จาก typhoon_ocr library + inject DMS tags; เปลี่ยน endpoint เป็น /v1/chat/completions
 import os
 import logging
 import re
 import base64
-import fitz  # PyMuPDF
+import json
 import tempfile
 import fitz  # PyMuPDF (ใช้สำหรับ page count + fast-path text extraction)
 import httpx
 from pathlib import Path
 from typing import Optional
 from PIL import Image
 import pytesseract
 import io
-import cv2
+from typhoon_ocr import prepare_ocr_messages
 import numpy as np
 from fastapi import FastAPI, HTTPException, UploadFile, File, Form, Depends, Security, status
 from fastapi.security.api_key import APIKeyHeader
@@ -46,7 +47,7 @@ from FlagEmbedding import BGEM3FlagModel, FlagReranker
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger("ocr-sidecar")
-app = FastAPI(title="Tesseract OCR Sidecar", version="1.0.0")
+app = FastAPI(title="Typhoon OCR Sidecar", version="2.0.0")
 # Initialize BGE-M3 and Reranker singletons
 bge_model = None
@@ -79,162 +80,25 @@ async def get_api_key(api_key: str = Security(api_key_header)):
 # อ่านค่า config จาก environment
 OCR_CHAR_THRESHOLD = int(os.getenv("OCR_CHAR_THRESHOLD", "100"))
 MAX_PAGES = int(os.getenv("OCR_MAX_PAGES", "0"))  # 0 = ทุกหน้า
 OCR_LANG = os.getenv("OCR_LANG", "tha+eng")  # Tesseract language code (tha+eng = Thai + English)
 OLLAMA_API_URL = os.getenv("OLLAMA_API_URL", "http://host.docker.internal:11434")
 TYPHOON_OCR_MODEL = os.getenv("TYPHOON_OCR_MODEL", "typhoon-np-dms-ocr:latest")
 TYPHOON_OCR_TIMEOUT = int(os.getenv("TYPHOON_OCR_TIMEOUT", "360"))  # รองรับ cold-start ~65s + inference ~30s/page
 # DPI สำหรับ Typhoon OCR — ต่ำกว่า Tesseract เพราะ vision model ใช้ image patches (150 DPI ลด token ~4x)
 TYPHOON_OCR_DPI = int(os.getenv("TYPHOON_OCR_DPI", "150"))
 # PSM mode: 3 (default, fully automatic) หรือ 6 (assume single column, ลด noise)
 TESSERACT_PSM = os.getenv("TESSERACT_PSM", "3")
 # PSM 3 = Fully automatic page segmentation (เหมาะกับเอกสารที่มี layout หลายส่วน เช่น วันที่/เลขที่)
 # PSM 6 = Assume single column of text (ลด hallucination จาก noise)
 # OEM 1 = LSTM only (ดีกว่า legacy engine)
 TESSERACT_CONFIG = f"--psm {TESSERACT_PSM} --oem 1"
 # Crop margin: trim header/footer (top 5%, bottom 2%)
 CROP_TOP_RATIO = 0.05
 CROP_BOTTOM_RATIO = 0.02
 # Enable aggressive preprocessing (Option 2) สำหรับ Tesseract
 USE_AGGRESSIVE_PREPROCESSING = os.getenv("TESSERACT_AGGRESSIVE_PREPROCESS", "true").lower() == "true"
 # Enable smart post-processing (Option 3) สำหรับลบ hallucination
 USE_SMART_CLEANING = os.getenv("TESSERACT_SMART_CLEAN", "true").lower() == "true"
-logger.info(f"Tesseract OCR Sidecar initialized (lang={OCR_LANG}, config={TESSERACT_CONFIG}, aggressive={USE_AGGRESSIVE_PREPROCESSING}, smart_clean={USE_SMART_CLEANING})")
+logger.info(f"Typhoon OCR Sidecar initialized (model={TYPHOON_OCR_MODEL}, ollama={OLLAMA_API_URL})")
 def filter_ocr_noise(text: str) -> str:
-    """Filter OCR junk such as short lines/meaningless symbols"""
+    """Filter meaningless symbols out of the Markdown output"""
    lines = text.split("\n")
-    filtered_lines = []
+    filtered = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Drop lines that are too short (fewer than 3 characters)
        if len(line) < 3:
            continue
        # Drop lines made only of symbols/stray digits (no Thai/English letters)
        thai_chars = sum(1 for c in line if '\u0E00' <= c <= '\u0E7F')
        english_chars = sum(1 for c in line if c.isalpha() and c.isascii())
        total_chars = len(line)
        # Treat the line as junk when Thai or English letters make up less than 20% of it
        if total_chars > 0 and (thai_chars + english_chars) / total_chars < 0.2:
            continue
        filtered_lines.append(line)
    return "\n".join(filtered_lines)
 def crop_header_footer(pil_image: Image.Image, top_ratio: float = 0.10, bottom_ratio: float = 0.10) -> Image.Image:
    """Crop the header/footer off the image to remove unneeded text"""
    width, height = pil_image.size
    top_crop = int(height * top_ratio)
    bottom_crop = int(height * bottom_ratio)
    # Crop: (left, top, right, bottom)
    cropped = pil_image.crop((0, top_crop, width, height - bottom_crop))
    return cropped
 def preprocess_image(pil_image: Image.Image) -> Image.Image:
    """Preprocess the image with OpenCV to improve OCR accuracy (gentle variant)"""
    # Convert the PIL Image to a numpy array (OpenCV format)
    img_array = np.array(pil_image)
    # Convert to grayscale
    gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
    # Denoise with a light median blur (removes noise without damaging the glyphs)
    denoised = cv2.medianBlur(gray, 3)
    # Grayscale only (no adaptive threshold, because it distorts the glyphs)
    # Convert back to a PIL Image
    return Image.fromarray(denoised)
 def preprocess_image_aggressive(pil_image: Image.Image) -> Image.Image:
    """
    Aggressive preprocessing (Option 2): reduces hallucination by:
    1. Deskewing if the page is tilted
    2. Denoising with a bilateral filter
    3. Otsu thresholding (threshold picked automatically from the histogram)
    4. Morphological operations
    """
    img_array = np.array(pil_image)
    gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
    # 1. Deskew if the page is tilted (angle detected from Canny edges + Hough lines)
    try:
        edges = cv2.Canny(gray, 100, 200)
        lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100, minLineLength=100, maxLineGap=10)
        if lines is not None and len(lines) > 0:
            angles = [np.arctan2(y2-y1, x2-x1) for x1,y1,x2,y2 in lines[:10, 0]]  # HoughLinesP returns shape (N, 1, 4)
            angle = np.median(angles) * 180 / np.pi
            if abs(angle) > 0.5:  # rotate only when the tilt exceeds 0.5 degrees
                h, w = gray.shape
                M = cv2.getRotationMatrix2D((w/2, h/2), angle, 1.0)
                gray = cv2.warpAffine(gray, M, (w, h), borderMode=cv2.BORDER_REFLECT)
                logger.info(f"[PREPROCESS] Deskewed {angle:.1f}°")
    except Exception as e:
        logger.warning(f"[PREPROCESS] Deskew failed: {e}")
    # 2. Denoise: median blur + bilateral filter
    denoised = cv2.medianBlur(gray, 3)
    denoised = cv2.bilateralFilter(denoised, 9, 75, 75)
    # 3. Otsu threshold (computed from the histogram, not a fixed value)
    _, thresh = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # 4. Morphological operations: remove small line noise (counters speckle artifacts)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2, 2))
    morph = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)  # remove small white noise
    morph = cv2.morphologyEx(morph, cv2.MORPH_CLOSE, kernel)  # fill small black holes
    logger.info("[PREPROCESS] Aggressive: Otsu threshold + morphology applied")
    return Image.fromarray(morph)
 def clean_ocr_output(text: str) -> str:
    """
    Smart post-processing (Option 3): removes Tesseract hallucinations by:
    1. Dropping lines that are only repeated symbols
    2. Dropping lines that are only odd symbols
    3. Dropping lines that repeat a single character (artifact noise)
    """
    lines = text.split("\n")
    cleaned = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # ✗ Drop lines that are only stray symbols/punctuation with no letters
        alphanumeric_part = re.sub(r'[^\w\u0E00-\u0E7F]', '', line)
        if len(alphanumeric_part) < 2:
            logger.debug(f"[CLEAN] Reject (no alphanum): {line[:50]}")
            continue
-
+        filtered.append(line)
-        # ✗ Drop repeated-pattern lines: unique chars under ~20% of the length (e.g., "-----", ">>>>>>>")
+    return "\n".join(filtered)
        unique_chars = len(set(line))
        if unique_chars < max(2, len(line) // 5):
            logger.debug(f"[CLEAN] Reject (repeated pattern): {line[:50]}")
            continue
        # ✗ Drop lines of odd symbols (< 20% Thai/English alphanumeric)
        thai_chars = sum(1 for c in line if '\u0E00' <= c <= '\u0E7F')
        eng_chars = sum(1 for c in line if c.isascii() and c.isalnum())
        if len(line) > 0 and (thai_chars + eng_chars) / len(line) < 0.2:
            logger.debug(f"[CLEAN] Reject (low language content): {line[:50]}")
            continue
        # ✓ Keep the line
        cleaned.append(line)
    result = "\n".join(cleaned)
    logger.info(f"[CLEAN] Input {len(lines)} lines → {len(cleaned)} lines")
    return result
 class OcrRequest(BaseModel):
    pdfPath: str
@@ -252,11 +116,9 @@ class OcrResponse(BaseModel):
 def health():
    return {
        "status": "ok",
-        "engines": ["tesseract", "typhoon-np-dms-ocr"],
+        "engine": "typhoon-np-dms-ocr",
        "typhoonModel": TYPHOON_OCR_MODEL,
-        "tesseractConfig": TESSERACT_CONFIG,
+        "ollamaUrl": OLLAMA_API_URL,
        "aggressivePreprocess": USE_AGGRESSIVE_PREPROCESSING,
        "smartCleaning": USE_SMART_CLEANING,
    }
 # Alias map: legacy engine names → canonical name
@@ -266,7 +128,7 @@ _ENGINE_ALIASES: dict[str, str] = {
    "typhoon_ocr": "typhoon-np-dms-ocr",
 }
-def _process_pdf_doc(doc: fitz.Document, selected_engine: str, max_pages: int, typhoon_options: dict = {}) -> OcrResponse:
+def _process_pdf_doc(doc: fitz.Document, selected_engine: str, max_pages: int, typhoon_options: dict = {}, pdf_path: str | None = None) -> OcrResponse:
    """Process a fitz.Document with the selected engine; shared logic for /ocr and /ocr-upload"""
    selected_engine = _ENGINE_ALIASES.get(selected_engine, selected_engine)
    pages_to_process = list(range(min(len(doc), max_pages) if max_pages > 0 else len(doc)))
@@ -291,15 +153,13 @@ def _process_pdf_doc(doc: fitz.Document, selected_engine: str, max_pages: int, t
            )
    if selected_engine == "typhoon-np-dms-ocr":
        # prepare_ocr_messages takes the PDF path directly, so converting to a PIL Image is no longer needed
        resolved_path = pdf_path or (str(doc.name) if hasattr(doc, 'name') and doc.name else None)
        if not resolved_path:
            raise ValueError("ไม่สามารถหา PDF path — ต้องส่ง pdf_path เข้ามาด้วย")
        typhoon_text_parts = []
        for i in pages_to_process:
-            page = doc[i]
+            typhoon_text_parts.append(process_with_typhoon_ocr(resolved_path, page_num=i + 1, options_override=typhoon_options))
            pix = page.get_pixmap(dpi=TYPHOON_OCR_DPI)
            img_bytes = pix.tobytes("png")
            img = Image.open(io.BytesIO(img_bytes))
            # Send the color image as-is: Typhoon OCR (a vision model) needs color, not binarized grayscale
            cropped_img = crop_header_footer(img, CROP_TOP_RATIO, CROP_BOTTOM_RATIO)
            typhoon_text_parts.append(process_with_typhoon_ocr(cropped_img, typhoon_options))
        typhoon_text = filter_ocr_noise("\n".join(typhoon_text_parts).strip())
        return OcrResponse(
            text=typhoon_text,
@@ -309,89 +169,65 @@ def _process_pdf_doc(doc: fitz.Document, selected_engine: str, max_pages: int, t
            engineUsed=selected_engine,
        )
-    logger.info(f"Slow path (Tesseract): {total_chars} chars too few")
+    # Unknown engine: fall back to typhoon-np-dms-ocr
-    ocr_text_parts = []
+    logger.warning(f"Unknown engine '{selected_engine}' — fallback to typhoon-np-dms-ocr")
    resolved_path = pdf_path or (str(doc.name) if hasattr(doc, 'name') and doc.name else None)
    if not resolved_path:
        raise ValueError("ไม่สามารถหา PDF path — ต้องส่ง pdf_path เข้ามาด้วย")
    fallback_parts = []
    for i in pages_to_process:
-        page = doc[i]
+        fallback_parts.append(process_with_typhoon_ocr(resolved_path, page_num=i + 1, options_override=typhoon_options))
-        pix = page.get_pixmap(dpi=300)
+    fallback_text = filter_ocr_noise("\n".join(fallback_parts).strip())
        img_bytes = pix.tobytes("png")
        img = Image.open(io.BytesIO(img_bytes))
        cropped_img = crop_header_footer(img, CROP_TOP_RATIO, CROP_BOTTOM_RATIO)
        # Option 2: Choose preprocessing strategy
        if USE_AGGRESSIVE_PREPROCESSING:
            processed_img = preprocess_image_aggressive(cropped_img)
        else:
            processed_img = preprocess_image(cropped_img)
        text = pytesseract.image_to_string(processed_img, lang=OCR_LANG, config=TESSERACT_CONFIG)
        ocr_text_parts.append(text.strip())
    ocr_text = "\n".join(ocr_text_parts).strip()
    # Option 3: Apply smart post-processing
    if USE_SMART_CLEANING:
        ocr_text = clean_ocr_output(ocr_text)
    else:
        ocr_text = filter_ocr_noise(ocr_text)
    logger.info(f"Tesseract extracted {len(ocr_text)} chars")
    return OcrResponse(
-        text=ocr_text,
+        text=fallback_text,
        ocrUsed=True,
        pageCount=page_count,
-        charCount=len(ocr_text),
+        charCount=len(fallback_text),
-        engineUsed="tesseract",
+        engineUsed="typhoon-np-dms-ocr",
    )
-def process_with_typhoon_ocr(pil_image: Image.Image, options_override: dict = {}) -> str:
+def process_with_typhoon_ocr(pdf_path: str, page_num: int = 1, options_override: dict = {}) -> str:
-    """Call Typhoon OCR through Ollama; the Modelfile SYSTEM prompt is the main instruction, and options_override can still override Modelfile values"""
+    """Call Typhoon OCR through Ollama /v1/chat/completions; takes the PDF path directly, with no PIL Image conversion"""
    model_name = TYPHOON_OCR_MODEL
-    img_buffer = io.BytesIO()
+    # prepare_ocr_messages handles PDF → image conversion internally via poppler/pdftoppm
-    pil_image.save(img_buffer, format="PNG")
+    messages = prepare_ocr_messages(pdf_path, task_type="structure", page_num=page_num)
-    image_base64 = base64.b64encode(img_buffer.getvalue()).decode("utf-8")
+    # Append DMS-specific extraction tags to the content
-    # Defaults follow the Modelfile; the frontend can override some or all of them
+    messages[0]["content"].append({
-    options = {
+        "type": "text",
-        "temperature": 0.1,
+        "text": (
-        "top_p": 0.1,
+            "Additionally:\n"
-        "repeat_penalty": 1.1,
+            "- Wrap document number with <document_number>...</document_number>\n"
-        "num_gpu": 99,  # บังคับ GPU layers สูงสุด — ป้องกัน Ollama fallback ไป CPU โดยไม่จำเป็น
+            "- Wrap document date with <document_date>...</document_date>\n"
-        "num_ctx": 4096,  # image tokens ~2772 → ต้องการ context > 2048; 4096 รองรับ image + output โดยไม่ truncate
+            "- Wrap received date with <received_date>...</received_date>\n"
-        **options_override,
+            "If a field is not found, omit the tag."
-    }
+        ),
    })
    # Defaults follow the official settings; options_override can still override some of them
    payload = {
        "model": model_name,
-        "prompt": """You are an expert in structuring Thai documents
+        "messages": messages,
-
+        "max_tokens": 16000,
 Task: Extract the information from the image in the most correct and organized format.
 Output Rules:
 - Return ONLY clean Markdown output
 - Include ALL information visible on the page
 - Preserve document structure and hierarchy
 - Do NOT add explanations or interpretations
 - Do NOT include these instructions in your response
 Formatting:
 - Tables: Use HTML <table> tags
 - Math: $inline$ and $$block$$ LaTeX
 - Figures: <figure>Thai description</figure>
 - Pages: <page_number>N</page_number>
 - Boxes: ☐ / ☑
 - Unclear: [unclear: context]
 - Signatures/Stamps: Describe location and context
 Extract all text from this image.""",
        "images": [image_base64],
        "stream": False,
-        "options": options,
+        "repetition_penalty": options_override.get("repeat_penalty", 1.2),
-        "keep_alive": 0,  # Unload model ทันทีหลังเสร็จงานเพื่อคืน VRAM ให้ typhoon2.5-np-dms ใช้งานได้
+        "temperature": options_override.get("temperature", 0.1),
        "top_p": options_override.get("top_p", 0.6),
        "keep_alive": 0,  # Unload model ทันทีหลังเสร็จงานเพื่อคืน VRAM ให้ np-dms-ai ใช้งานได้
    }
    # Use Ollama's OpenAI-compatible endpoint (/v1/chat/completions)
    with httpx.Client(timeout=TYPHOON_OCR_TIMEOUT) as client:
-        response = client.post(f"{OLLAMA_API_URL}/api/generate", json=payload)
+        response = client.post(
            f"{OLLAMA_API_URL}/v1/chat/completions",
            json=payload,
            headers={"Authorization": "Bearer ollama"},
        )
        response.raise_for_status()
        data = response.json()
-        result_text = str(data.get("response", "")).strip()
+        raw_text = str(data.get("choices", [{}])[0].get("message", {}).get("content", "")).strip()
        # Parse the model's JSON output (format: {"natural_text": "..."})
        try:
            result_text = json.loads(raw_text).get("natural_text", raw_text)
        except (json.JSONDecodeError, AttributeError):
            result_text = raw_text
        logger.info(
            f"[DIAG] Ollama response — model={model_name} "
            f"textLen={len(result_text)} "
@@ -440,12 +276,22 @@ def ocr_upload(
    if repeatPenalty is not None:
        typhoon_options["repeat_penalty"] = repeatPenalty
    pdf_bytes = file.file.read()
    import tempfile
    tmp_pdf_path: str | None = None
    try:
-        doc = fitz.open(stream=pdf_bytes, filetype="pdf")
+        # Save the PDF to a temp file so prepare_ocr_messages can read it by path
-    except Exception as e:
+        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
-        raise HTTPException(status_code=422, detail=f"เปิดไฟล์ PDF ล้มเหลว: {e}")
+            tmp.write(pdf_bytes)
-    logger.info(f"OCR upload: {file.filename} engine={selected_engine} options={typhoon_options or 'modelfile-defaults'}")
+            tmp_pdf_path = tmp.name
-    return _process_pdf_doc(doc, selected_engine, max_pages, typhoon_options)
+        try:
            doc = fitz.open(stream=pdf_bytes, filetype="pdf")
        except Exception as e:
            raise HTTPException(status_code=422, detail=f"เปิดไฟล์ PDF ล้มเหลว: {e}")
        logger.info(f"OCR upload: {file.filename} engine={selected_engine} options={typhoon_options or 'modelfile-defaults'}")
        return _process_pdf_doc(doc, selected_engine, max_pages, typhoon_options, pdf_path=tmp_pdf_path)
    finally:
        if tmp_pdf_path:
            Path(tmp_pdf_path).unlink(missing_ok=True)
 class NormalizeRequest(BaseModel):
    text: str
@@ -1,18 +1,18 @@
-# OCR Sidecar Requirements (Tesseract-based)
+# OCR Sidecar Requirements (Typhoon OCR via Ollama)
 # Change Log:
 # - 2026-05-30: Switched from PaddleOCR to Tesseract OCR for compatibility with older CPUs (no AVX required)
 # - 2026-05-30: Removed paddlepaddle/paddleocr dependencies because of SIGILL on CPUs without AVX support
 # - 2026-05-30: Added opencv-python for image preprocessing (threshold, denoise) to improve OCR accuracy
 # - 2026-06-11: Added typhoon-ocr for prepare_ocr_messages (official prompt builder for typhoon-ocr1.5-3b)
 # - 2026-06-11: Dropped pytesseract, opencv-python, numpy; Tesseract is no longer used
 numpy<2.0
 PyMuPDF==1.24.0
 pytesseract==0.3.13
 fastapi==0.111.0
 uvicorn[standard]==0.30.1
 python-multipart==0.0.9
 pythainlp==5.0.4
 httpx==0.27.0
 Pillow==10.0.0
 opencv-python==4.8.1.78
 FlagEmbedding>=1.2.0
 typhoon-ocr>=0.4.1
@@ -0,0 +1,40 @@
 // File: specs/200-fullstacks/235-ai-runtime-policy-refactor/checklists/requirements.md
 // Change Log:
 // - 2026-06-11: Initial spec quality checklist
 # Specification Quality Checklist: AI Runtime Policy Refactor
 **Purpose**: Validate specification completeness and quality before proceeding to planning
 **Created**: 2026-06-11
 **Feature**: [spec.md](../spec.md)
 ## Content Quality
 - [x] No implementation details (languages, frameworks, APIs)
 - [x] Focused on user value and business needs
 - [x] Written for non-technical stakeholders
 - [x] All mandatory sections completed
 ## Requirement Completeness
 - [x] No [NEEDS CLARIFICATION] markers remain
 - [x] Requirements are testable and unambiguous
 - [x] Success criteria are measurable
 - [x] Success criteria are technology-agnostic (no implementation details)
 - [x] All acceptance scenarios are defined
 - [x] Edge cases are identified
 - [x] Scope is clearly bounded
 - [x] Dependencies and assumptions identified
 ## Feature Readiness
 - [x] All functional requirements have clear acceptance criteria
 - [x] User scenarios cover primary flows (5 user stories covering all 5 workstreams)
 - [x] Feature meets measurable outcomes defined in Success Criteria
 - [x] No implementation details leak into specification
 ## Notes
 - Spec draws from grilling session output (AI-Refactor.md) — all ambiguities resolved per CONTEXT.md flagged items
 - Canonical terminology from CONTEXT.md Glossary Updates (ADR-034) used throughout
 - Big bang cutover gate explicitly captured in US5
@@ -0,0 +1,51 @@
 // File: specs/200-fullstacks/235-ai-runtime-policy-refactor/contracts/create-ai-job.dto.md
 // Change Log:
 // - 2026-06-11: API contract for CreateAiJobDto
 # Contract: POST /api/ai/jobs
 ## Request DTO
 ```typescript
 interface CreateAiJobRequest {
  type: 'auto-fill-document' | 'migrate-document' | 'rag-query';
  documentPublicId?: string;     // UUIDv7 — ADR-019
  attachmentPublicId?: string;   // UUIDv7 — ADR-019
  executionProfile?: 'fast' | 'balanced' | 'thai-accurate' | 'large-context';
  // [FORBIDDEN] model.key — HTTP 400 if present
  // [FORBIDDEN] temperature, top_p, maxTokens — HTTP 400 if present
 }
 ```
 ## Validation Rules
 | Field | Rule |
 |-------|------|
 | `type` | Required; enum |
 | `executionProfile` | Optional; enum; defaults to `balanced` |
 | `large-context` | Requires admin role (CASL `ai.use_large_context`) — HTTP 403 if unauthorized |
 | `model.*` | ANY model subfield → HTTP 400 |
 | `temperature` | Present at root → HTTP 400 |
 | `top_p` | Present at root → HTTP 400 |
 | `maxTokens` | Present at root → HTTP 400 |
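The rules in the table above can be sketched as a single check. This is a language-neutral illustration written in Python for brevity (the backend itself is NestJS/TypeScript); the function name `validate_create_job` and the `is_admin` flag are hypothetical stand-ins for the DTO validation and the CASL `ai.use_large_context` check, and the 201 success status is taken from the quickstart expectations.

```python
from typing import Any

ALLOWED_TYPES = {"auto-fill-document", "migrate-document", "rag-query"}
ALLOWED_PROFILES = {"fast", "balanced", "thai-accurate", "large-context"}
# Root-level fields the contract forbids; "model" covers any model.* subfield.
FORBIDDEN_ROOT_FIELDS = {"model", "temperature", "top_p", "maxTokens"}


def validate_create_job(body: dict[str, Any], is_admin: bool) -> tuple[int, str]:
    """Return (http_status, effective_profile_or_error_message)."""
    forbidden = FORBIDDEN_ROOT_FIELDS.intersection(body)
    if forbidden:
        return 400, f"forbidden field(s): {sorted(forbidden)}"
    if body.get("type") not in ALLOWED_TYPES:
        return 400, "invalid type"
    # executionProfile is optional and defaults to balanced.
    profile = body.get("executionProfile", "balanced")
    if profile not in ALLOWED_PROFILES:
        return 400, "invalid executionProfile"
    # is_admin is a hypothetical stand-in for the CASL ability check.
    if profile == "large-context" and not is_admin:
        return 403, "large-context requires ai.use_large_context"
    return 201, profile
```

Checking forbidden fields first means a request that mixes a valid profile with `model.key` is still rejected, which matches the "ANY model subfield" rule.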
 ## Response DTO
 ```typescript
 interface AiJobResponse {
  jobId: string;             // BullMQ job ID
  status: 'queued' | 'completed' | 'failed';
  modelUsed: 'np-dms-ai' | 'np-dms-ocr';   // Canonical name — never runtime tag
  executionProfile: ExecutionProfile;         // Effective profile (after backend override)
  queueName: 'ai-realtime' | 'ai-batch';
 }
 ```
 ## Error Responses
 | Status | When |
 |--------|------|
 | 400 | `model.key` present, or parameter overrides present, or invalid `executionProfile` |
 | 403 | `large-context` by non-admin |
 | 422 | `documentPublicId` not found |
 | 504 | CPU fallback retrieval timeout |
@@ -0,0 +1,131 @@
 // File: specs/200-fullstacks/235-ai-runtime-policy-refactor/data-model.md
 // Change Log:
 // - 2026-06-11: Data model for AI Runtime Policy Refactor
 # Data Model: AI Runtime Policy Refactor
 > Note: This feature adds no new DB schema (ADR-009 compliant); it changes only TypeScript interfaces, DTO shapes, and Python data structures on the sidecar.
 ---
 ## TypeScript Types (Backend)
 ### ExecutionProfile (enum)
 ```typescript
 // File: backend/src/modules/ai/interfaces/execution-policy.interface.ts
 export type ExecutionProfile = 'fast' | 'balanced' | 'thai-accurate' | 'large-context';
 ```
 ### RuntimePolicy (interface)
 ```typescript
 // File: backend/src/modules/ai/interfaces/execution-policy.interface.ts
 export interface RuntimePolicy {
  canonicalModel: 'np-dms-ai' | 'np-dms-ocr';  // canonical names only
  temperature: number;
  topP: number;
  maxTokens: number;
  keepAliveSeconds: number;                        // for the main model
 }
 ```
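A minimal sketch of how an `ExecutionProfile` could map to a `RuntimePolicy`, including the policy-enforced override for data-affecting jobs. It is written in Python for brevity and mirrors the interface above; every numeric value, the `DATA_AFFECTING_TYPES` set, and the pinned `thai-accurate` profile are assumptions made for illustration, not values stated in the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimePolicy:
    canonical_model: str
    temperature: float
    top_p: float
    max_tokens: int
    keep_alive_seconds: int


# Hypothetical values for illustration only; the spec does not state them.
POLICY_BY_PROFILE = {
    "fast": RuntimePolicy("np-dms-ai", 0.2, 0.8, 1024, 300),
    "balanced": RuntimePolicy("np-dms-ai", 0.2, 0.8, 2048, 300),
    "thai-accurate": RuntimePolicy("np-dms-ai", 0.1, 0.6, 4096, 300),
    "large-context": RuntimePolicy("np-dms-ai", 0.2, 0.8, 8192, 0),
}

# Assumption: these job types write data or metadata, so the backend pins their profile.
DATA_AFFECTING_TYPES = {"auto-fill-document", "migrate-document"}
ENFORCED_PROFILE = "thai-accurate"


def resolve_policy(job_type: str, requested: Optional[str] = None) -> tuple[str, RuntimePolicy]:
    """Return the effective profile and its policy; pinned jobs ignore the caller."""
    if job_type in DATA_AFFECTING_TYPES:
        effective = ENFORCED_PROFILE
    else:
        effective = requested or "balanced"
    return effective, POLICY_BY_PROFILE[effective]
```

Returning the effective profile alongside the policy lets the response DTO report the profile that actually ran, rather than the one the caller asked for.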
 ### OcrResidencyDecision (interface)
 ```typescript
 // File: backend/src/modules/ai/interfaces/ocr-residency.interface.ts
 export interface OcrResidencyDecision {
  keepAliveSeconds: number;          // 0 = unload; > 0 = residency window
  vramHeadroomMb: number;            // or -1 if the query fails
  activeProfile: ExecutionProfile | null;
  reason: 'large-context-active' | 'high-pressure' | 'headroom-sufficient' | 'query-failed';
 }
 ```
 ### VramHeadroom (interface)
 ```typescript
 // File: backend/src/modules/ai/interfaces/execution-policy.interface.ts
 export interface VramHeadroom {
  totalMb: number;         // total VRAM (hardcoded, from env)
  usedMb: number;          // value from Ollama /api/ps
  availableMb: number;     // totalMb - usedMb
  querySuccess: boolean;   // false = use the safe default
 }
 ```
 ### CreateAiJobDto (updated)
 ```typescript
 // File: backend/src/modules/ai/dto/create-ai-job.dto.ts
 // [CHANGE] Removed the model field and parameter overrides
 export class CreateAiJobDto {
  @IsEnum(['auto-fill-document', 'migrate-document', 'rag-query'])
  type: 'auto-fill-document' | 'migrate-document' | 'rag-query';
  @IsOptional()
  @IsUUID('all')
  documentPublicId?: string;
  @IsOptional()
  @IsUUID('all')
  attachmentPublicId?: string;
  @IsOptional()
  @IsEnum(['fast', 'balanced', 'thai-accurate', 'large-context'])
  executionProfile?: ExecutionProfile;
  // [REMOVED] model: { key, parameters } is no longer allowed
 }
 ```
 ---
 ## Python Types (OCR Sidecar)
 ### VramHeadroom (dataclass)
 ```python
 # File: specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/services/vram_monitor.py
@dataclass
 class VramHeadroom:
    total_mb: float
    used_mb: float
    available_mb: float
    query_success: bool
 ```
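As a rough sketch, a `VramHeadroom` could be derived from an already-fetched Ollama `/api/ps` payload by summing each loaded model's `size_vram` (reported in bytes). The helper name `headroom_from_ps` is hypothetical, and treating a failed query as zero headroom is an assumption about what the "safe default" means.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class VramHeadroom:
    total_mb: float
    used_mb: float
    available_mb: float
    query_success: bool


def headroom_from_ps(payload: Optional[dict[str, Any]], total_mb: float) -> VramHeadroom:
    """Build a VramHeadroom from an /api/ps response body; None means the query failed."""
    if payload is None:
        # Assumption: on failure, report no headroom so callers take the conservative path.
        return VramHeadroom(total_mb, total_mb, 0.0, False)
    used_bytes = sum(m.get("size_vram", 0) for m in payload.get("models", []))
    used_mb = used_bytes / (1024 * 1024)
    return VramHeadroom(total_mb, used_mb, max(total_mb - used_mb, 0.0), True)
```

Keeping the HTTP call outside this function makes the arithmetic easy to unit test with canned payloads.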
 ### OcrResidencyDecision (dataclass)
 ```python
 # File: specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/services/residency_policy.py
@dataclass
 class OcrResidencyDecision:
    keep_alive_seconds: int   # 0 = unload
    vram_headroom_mb: float
    reason: str               # 'large-context-active' | 'high-pressure' | 'headroom-sufficient' | 'query-failed'
 ```
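One possible shape for the decision logic, using the four `reason` values listed above. The 120-second window and the `keep_alive=0` outcome for an active `large-context` profile follow the expected log lines in the quickstart; the `MIN_HEADROOM_MB` threshold and the order in which the checks run are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResidencyDecision:
    keep_alive_seconds: int   # 0 = unload
    vram_headroom_mb: float
    reason: str


MIN_HEADROOM_MB = 4096.0   # hypothetical threshold, not from the spec
RESIDENCY_SECONDS = 120    # matches the quickstart's expected log line


def decide_residency(
    available_mb: float, query_success: bool, active_profile: Optional[str]
) -> OcrResidencyDecision:
    """Pick a keep_alive for the OCR model from headroom and the active profile."""
    if not query_success:
        # -1 marks an unknown headroom, as in the TypeScript interface.
        return OcrResidencyDecision(0, -1.0, "query-failed")
    if active_profile == "large-context":
        return OcrResidencyDecision(0, available_mb, "large-context-active")
    if available_mb < MIN_HEADROOM_MB:
        return OcrResidencyDecision(0, available_mb, "high-pressure")
    return OcrResidencyDecision(RESIDENCY_SECONDS, available_mb, "headroom-sufficient")
```

Every non-sufficient branch unloads immediately, which keeps the main LLM first in line for VRAM.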
 ### EmbedRequest (updated)
 ```python
 # No model selection field: backend policy decides every model
 class EmbedRequest(BaseModel):
    texts: List[str]
    # [NO model field]: device selection is internal sidecar logic
 ```
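Since the request carries no model or device field, the sidecar has to choose the device itself. A minimal sketch of that internal choice, assuming a hypothetical `EMBED_GPU_MIN_MB` threshold and a hypothetical helper name; the `"gpu"` / `"cpu"` strings match the `retrievalDevice` values recorded in the audit log.

```python
EMBED_GPU_MIN_MB = 2048.0  # hypothetical headroom needed to run retrieval on the GPU


def select_retrieval_device(available_mb: float, query_success: bool) -> str:
    """Return "gpu" only when headroom is known and sufficient; otherwise fall back to "cpu"."""
    if query_success and available_mb >= EMBED_GPU_MIN_MB:
        return "gpu"
    return "cpu"
```

Defaulting to the CPU whenever the headroom query fails keeps retrieval from competing with the main LLM and OCR path for VRAM.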
 ---
 ## ai_audit_logs: Added Fields (schema unchanged, JSON payload changes)
 ```text
 ai_audit_logs.metadata (the existing JSON column) gains these fields:
  - modelUsed: "np-dms-ai" | "np-dms-ocr"  (always the canonical name)
  - executionProfile: ExecutionProfile
  - ocrResidencyDecision: OcrResidencyDecision (for OCR jobs)
  - retrievalDevice: "gpu" | "cpu"           (for RAG jobs)
  - vramHeadroomMb: number                   (when the job starts running)
 ```
 > Uses the existing JSON column; no ALTER TABLE needed (ADR-009 compliant)
@@ -0,0 +1,170 @@
 // File: specs/200-fullstacks/235-ai-runtime-policy-refactor/plan.md
 // Change Log:
 // - 2026-06-11: Initial implementation plan for AI Runtime Policy Refactor
 # Implementation Plan: AI Runtime Policy Refactor
 **Branch**: `235-ai-runtime-policy-refactor` | **Date**: 2026-06-11 | **Spec**: [spec.md](./spec.md)
 **Input**: Feature specification from `specs/200-fullstacks/235-ai-runtime-policy-refactor/spec.md`
 ## Summary
 Refactor the LCBP3-DMS AI runtime to support the new GPU (RTX 5060 Ti 16GB) by: (A) changing the API contract to use `executionProfile` instead of caller-driven model selection, (B) building a backend policy mapping layer, (C) adding adaptive OCR residency, (D) adding a CPU fallback for retrieval acceleration, and (E) tuning BullMQ queue concurrency, together with a verification suite that covers all 4 axes of the big bang cutover gate.
 ---
 ## Technical Context
 **Language/Version**: TypeScript 5.x (NestJS 10, Next.js 14), Python 3.11 (OCR sidecar FastAPI)
 **Primary Dependencies**:
 - Backend: NestJS, BullMQ, TypeORM, CASL, class-validator, class-transformer
 - Frontend: Next.js, TanStack Query, Zod, shadcn/ui
 - Sidecar: FastAPI, PyMuPDF (fitz), typhoon-ocr, httpx, FlagEmbedding
 - Infrastructure: Ollama (Desk-5439), Redis, MariaDB
 **Storage**: MariaDB (ai_audit_logs, ai_prompts, ai_intent_patterns), Redis (BullMQ, cache)
 **Testing**: Jest (backend unit/integration), Vitest (frontend), Pytest (sidecar)
 **Target Platform**: QNAP NAS (backend/frontend containers), Desk-5439 (Ollama + OCR sidecar)
 **Performance Goals**: OCR cold start < 5s (with residency), retrieval CPU fallback < 30s timeout
 **Constraints**: Big bang rollout — no legacy parallel path; LLM-First GPU ownership must be enforced
 **Scale/Scope**: Single-server AI stack on Desk-5439; BullMQ concurrency max `ai-realtime: 2`, `ai-batch: 1`
 ---
 ## Constitution Check
 _GATE: Must pass before Phase 0 research. Re-check after Phase 1 design._
 | Rule | Status | Notes |
 |------|--------|-------|
 | ADR-019 UUID: no parseInt on UUID | ✅ Pass | No new UUID handling in this feature |
 | ADR-009: No TypeORM migrations | ✅ Pass | No schema changes required |
 | ADR-016 Security: CASL Guard on all API | ✅ Required | `large-context` profile must have CASL admin check |
 | ADR-007 Error Handling: layered classification | ✅ Required | 400 (validation), 403 (profile auth), 504 (CPU timeout) |
 | ADR-008 BullMQ: no inline jobs | ✅ Pass | Queue policy adjustment, not new inline processing |
 | ADR-023/023A AI Boundary: no direct DB/storage | ✅ Pass | Policy layer stays in NestJS service |
 | ADR-023A BullMQ 2-queue: ai-realtime + ai-batch | ✅ Required | concurrency adjustment within existing queues |
 | ADR-002 Doc Numbering: Redis Redlock | ✅ N/A | Not applicable to this feature |
 | TypeScript: no `any`, no `console.log` | ✅ Required | All new TypeScript code must comply |
 | File headers: `// File: path/filename` | ✅ Required | All new files must have header |
 **No constitution violations.** Proceeding to Phase 0.
 ---
 ## Project Structure
 ### Documentation (this feature)
 ```text
 specs/200-fullstacks/235-ai-runtime-policy-refactor/
 ├── spec.md               # Feature specification
 ├── plan.md               # This file
 ├── research.md           # Phase 0 output
 ├── data-model.md         # Phase 1 output
 ├── quickstart.md         # Phase 1 output
 ├── tasks.md              # Phase 2 output
 ├── checklists/
 │   └── requirements.md
 └── contracts/
    ├── create-ai-job.dto.ts.md
    ├── execution-policy.interface.ts.md
    └── ocr-residency-policy.interface.ts.md
 ```
 ### Source Code (repository root)
 ```text
 backend/src/modules/ai/
 ├── dto/
 │   ├── create-ai-job.dto.ts          # [MODIFY] remove model.key, add executionProfile
 │   └── ai-job-response.dto.ts        # [MODIFY] add modelUsed canonical name
 ├── services/
 │   ├── ai.service.ts                  # [MODIFY] add profile validation + canonical name
 │   ├── ai-policy.service.ts           # [NEW] ExecutionProfile → RuntimePolicy mapping
 │   ├── ocr.service.ts                 # [MODIFY] add adaptive residency calculation
 │   └── vram-monitor.service.ts        # [NEW] VRAM headroom query service
 ├── processors/
 │   ├── ai-batch.processor.ts          # [MODIFY] use the policy from AiPolicyService
 │   └── ai-realtime.processor.ts       # [MODIFY] lightweight job classification + concurrency
 ├── interfaces/
 │   ├── execution-policy.interface.ts  # [NEW] RuntimePolicy type definition
 │   └── ocr-residency.interface.ts     # [NEW] OcrResidencyDecision type
 ├── guards/
 │   └── execution-profile.guard.ts     # [NEW] large-context profile admin check
 └── ai.module.ts                       # [MODIFY] register new services + guard
 backend/src/config/
 └── bullmq.config.ts                   # [MODIFY] ai-realtime concurrency uplift config
 specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/
 ├── app.py                             # [MODIFY] adaptive keep_alive, CPU fallback embed/rerank
 ├── services/
 │   ├── vram_monitor.py                # [NEW] VRAM headroom query via Ollama API
 │   └── residency_policy.py           # [NEW] keep_alive calculation policy
 └── requirements.txt                   # [MODIFY] add nvidia-ml-py or pynvml if needed
 frontend/
 ├── types/
 │   └── ai.ts                          # [MODIFY] remove model fields, add executionProfile
 ├── lib/services/
 │   └── admin-ai.service.ts            # [MODIFY] update types + canonical name display
 └── components/admin/ai/
    └── OcrSandboxPromptManager.tsx    # [MODIFY] show canonical names in the UI
 backend/src/modules/ai/
 └── tests/
    ├── ai-policy.service.spec.ts       # [NEW] unit tests profile mapping
    ├── ocr-residency.spec.ts           # [NEW] unit tests adaptive residency
    └── execution-profile.guard.spec.ts # [NEW] unit tests CASL guard
 ```
 ---
 ## Phases
 ### Phase 1: Foundational — Policy Infrastructure
 Must be finished before every other workstream:
 1. Create `VramMonitorService`: query VRAM headroom from the Ollama `/api/ps` endpoint
 2. Create `AiPolicyService`: mapping `ExecutionProfile` → `RuntimePolicy`
 3. Create `ExecutionProfileGuard`: CASL check for `large-context`
 4. Update `CreateAiJobDto`: remove `model.key` + parameter overrides
 5. Update `vram_monitor.py` on the sidecar: query GPU headroom
 ### Phase 2: Contract & Canonical Naming (Workstream A)
 1. Update `AiService`: validate the profile, override data-affecting jobs, log canonical names
 2. Update `ai-job-response.dto.ts`: `modelUsed` carries the canonical name
 3. Update the frontend types and Admin Console UI: show canonical names
 4. Add rejection tests for `model.key` and parameter overrides
 ### Phase 3: Adaptive OCR Residency (Workstream B)
 1. Update `OcrService`: inject `VramMonitorService`, compute `keep_alive` dynamically
 2. Update `residency_policy.py` on the sidecar: accept `keep_alive` from the backend policy
 3. Add unit tests for residency scenarios
 ### Phase 4: Retrieval Acceleration (Workstream C)
 1. Update `app.py`: add a GPU headroom check in `/embed` and `/rerank`
 2. Add a CPU fallback path with logging
 3. Update `ai-batch.processor.ts` for RAG query fallback handling
 ### Phase 5: Queue Policy (Workstream D)
 1. Update `bullmq.config.ts`: `ai-realtime` concurrency = 2 (configurable)
 2. Update `ai-realtime.processor.ts`: classify lightweight vs generation-heavy jobs
 3. Verify that `rag-query` is routed to `ai-batch` only
 ### Phase 6: Verification & Cutover (Workstream E)
 1. Consolidate the test suite across all 4 axes
 2. Manual validation checklist (Admin Console, OCR Sandbox)
 3. Cutover gate verification
 ---
 ## Complexity Tracking
 No constitution violations to justify.
@@ -0,0 +1,136 @@
 // File: specs/200-fullstacks/235-ai-runtime-policy-refactor/quickstart.md
 // Change Log:
 // - 2026-06-11: Verification quickstart for AI Runtime Policy Refactor
 # Quickstart: AI Runtime Policy Refactor — Verification Guide
 ## Prerequisites
 - Backend running (`pnpm run start:dev` in `backend/`)
 - OCR sidecar running on Desk-5439 (`docker compose up` in ocr-sidecar/)
 - Ollama running with `np-dms-ai` and `np-dms-ocr` tags registered
 - Admin user token available
 ---
 ## Gate 1: Policy Contract Verification
 ### 1A. Reject model.key (should return 400)
 ```bash
 curl -X POST http://localhost:3001/api/ai/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type": "rag-query", "model": {"key": "typhoon2.5-np-dms:latest"}}' \
  | jq '.statusCode, .message'
 # Expected: 400, message about model.key not allowed
 ```
 ### 1B. Reject parameter overrides (should return 400)
 ```bash
 curl -X POST http://localhost:3001/api/ai/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type": "rag-query", "temperature": 0.9}' \
  | jq '.statusCode'
 # Expected: 400
 ```
 ### 1C. Valid executionProfile (should return 201)
 ```bash
 curl -X POST http://localhost:3001/api/ai/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type": "rag-query", "executionProfile": "balanced", "documentPublicId": "<uuid>"}' \
  | jq '.data.modelUsed'
 # Expected: "np-dms-ai"
 ```
 ### 1D. large-context by non-admin (should return 403)
 ```bash
 curl -X POST http://localhost:3001/api/ai/jobs \
  -H "Authorization: Bearer $NON_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type": "rag-query", "executionProfile": "large-context"}' \
  | jq '.statusCode'
 # Expected: 403
 ```
 ---
 ## Gate 2: Canonical Naming Verification
 ### 2A. Check audit log after job
 ```sql
 SELECT metadata->>'$.modelUsed' FROM ai_audit_logs ORDER BY created_at DESC LIMIT 1;
 -- Expected: "np-dms-ai" (not "typhoon2.5-np-dms:latest")
 ```
 ### 2B. Check Admin Console (Manual)
 1. Open `/admin/ai` in a browser
 2. Verify that every model label shows `np-dms-ai` or `np-dms-ocr`
 3. Verify that no `typhoon*` name appears in the UI
 ---
 ## Gate 3: Adaptive OCR Residency Verification
 ### 3A. OCR under large-context profile
 ```bash
 # Submit an OCR job while a large-context job is active
 # Then check the sidecar log
 docker logs ocr-sidecar --tail 20
 # Expected log line: keep_alive=0 reason=large-context-active
 ```
 ### 3B. OCR with headroom sufficient
 ```bash
 # Submit an OCR job while GPU headroom is high (no heavy model loaded)
 docker logs ocr-sidecar --tail 20
 # Expected log line: keep_alive=120 reason=headroom-sufficient
 ```
 ---
 ## Gate 4: Retrieval CPU Fallback Verification
 ### 4A. Force GPU pressure then run RAG
 ```bash
 # 1. Force load large model
 curl http://localhost:11434/api/generate -d '{"model":"np-dms-ai","prompt":"warmup","keep_alive":-1}'
 # 2. Run RAG query
 curl -X POST http://localhost:3001/api/ai/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type":"rag-query","executionProfile":"balanced","documentPublicId":"<uuid>"}' \
  | jq '.data.status'
 # Expected: "completed" (must not fail)
 # 3. Check the sidecar log
 docker logs ocr-sidecar --tail 20
 # Expected: device=cpu reason=gpu-headroom-below-threshold
 ```
 ---
 ## Automated Test Suite
 ```bash
 # Backend unit + integration tests
 cd backend
 pnpm test -- --testPathPattern="ai-policy|ocr-residency|execution-profile"
 # Sidecar tests
 cd specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar
 pytest tests/ -v
 ```
 **All tests must pass** before cutover gate is considered complete.
@@ -0,0 +1,149 @@
 // File: specs/200-fullstacks/235-ai-runtime-policy-refactor/research.md
 // Change Log:
 // - 2026-06-11: Phase 0 research for AI Runtime Policy Refactor
 # Research: AI Runtime Policy Refactor
 ## 1. VRAM Headroom Query Strategy
 **Decision**: Use the Ollama `/api/ps` endpoint to list running models and their VRAM usage, then compute headroom as total VRAM (16GB on the RTX 5060 Ti) minus the VRAM held by loaded models
 **Rationale**:
 - The Ollama `/api/ps` response includes `size_vram` for each loaded model
 - No reliance on `pynvml` or `nvidia-ml-py`, which would add a dependency and platform coupling
 - If `/api/ps` times out or errors → safe default = 0 headroom (unload)
 **Alternatives considered**:
 - `pynvml` direct NVIDIA API: platform-specific, requires the CUDA toolkit, not needed
 - `nvidia-smi` subprocess: fragile in container environments, parsing overhead
 - Hardcoded threshold per model: not adaptive, must be updated every time the model changes
 **Response shape from Ollama `/api/ps`**:
 ```json
 {
  "models": [
    {
      "name": "np-dms-ai:latest",
      "model": "np-dms-ai:latest",
      "size": 8192000000,
      "size_vram": 7680000000,
      "digest": "...",
      "expires_at": "..."
    }
  ]
 }
 ```
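The headroom calculation described in the decision above can be sketched as a pure function over the `/api/ps` response. This is a minimal sketch, not the actual `VramMonitorService`: the function name `computeVramHeadroomMb`, the `null` convention for a failed query, and the use of 16384 MB for the 16GB card are assumptions made for illustration.

```typescript
// Only the fields used here; the full /api/ps shape is shown above.
interface PsModel {
  name: string;
  size_vram: number; // bytes of VRAM held by this loaded model
}

interface PsResponse {
  models: PsModel[];
}

const BYTES_PER_MB = 1024 * 1024;

// Returns headroom in MB. A null response (timeout or error) yields 0,
// the safe default that forces an unload or CPU fallback downstream.
function computeVramHeadroomMb(ps: PsResponse | null, totalVramMb: number): number {
  if (ps === null || !Array.isArray(ps.models)) {
    return 0;
  }
  const usedBytes = ps.models.reduce((sum, m) => sum + m.size_vram, 0);
  return Math.max(0, totalVramMb - usedBytes / BYTES_PER_MB);
}

// Usage with the sample response above (16384 MB total is an assumed value).
const sample: PsResponse = {
  models: [{ name: "np-dms-ai:latest", size_vram: 7680000000 }],
};
console.log(computeVramHeadroomMb(sample, 16384));
```

Keeping the arithmetic separate from the HTTP call makes the timeout and empty-response cases easy to unit test without a running Ollama instance.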
 ---
 ## 2. ExecutionProfile → RuntimePolicy Mapping
 **Decision**: The mapping table lives in `AiPolicyService` as a `readonly` constant. It is not stored in the DB because it is an architecture decision, not operational config
 **Rationale**:
 - The profile set is small and stable (4 values), so DB overhead is not worth it
 - Changing profile behavior must go through code review (governance)
 - Runtime parameters are an implementation detail of the backend policy and are not exposed in the API
 **Policy mapping (draft)**:
 | Profile | Canonical Model | Temperature | Top-P | Max Tokens | Notes |
 |---------|----------------|-------------|-------|------------|-------|
 | `fast` | `np-dms-ai` | 0.1 | 0.9 | 1024 | Quick suggestions |
 | `balanced` | `np-dms-ai` | 0.3 | 0.9 | 2048 | Default RAG/suggest |
 | `thai-accurate` | `np-dms-ai` | 0.1 | 0.8 | 2048 | Thai doc extraction |
 | `large-context` | `np-dms-ai` | 0.3 | 0.9 | 8192 | Admin-only, long docs |
 **Data-affecting overrides**:
 - `migrate-document` → force `thai-accurate` profile parameters
 - `auto-fill-document` → force `thai-accurate` profile parameters
 - `ocr-extraction` → handled by OCR sidecar policy, not main LLM
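The draft mapping and the data-affecting overrides above could be expressed as readonly constants along these lines. This is a sketch under stated assumptions: the names `PROFILE_POLICY`, `FORCED_PROFILE`, and `resolvePolicy` are hypothetical, and defaulting to `balanced` when the caller omits a profile is an assumption inferred from the "Default RAG/suggest" note, not a confirmed rule.

```typescript
type ExecutionProfile = "fast" | "balanced" | "thai-accurate" | "large-context";

interface RuntimePolicy {
  canonicalModel: "np-dms-ai" | "np-dms-ocr";
  temperature: number;
  topP: number;
  maxTokens: number;
}

// Values copied from the draft policy mapping table.
const PROFILE_POLICY: Readonly<Record<ExecutionProfile, RuntimePolicy>> = {
  fast: { canonicalModel: "np-dms-ai", temperature: 0.1, topP: 0.9, maxTokens: 1024 },
  balanced: { canonicalModel: "np-dms-ai", temperature: 0.3, topP: 0.9, maxTokens: 2048 },
  "thai-accurate": { canonicalModel: "np-dms-ai", temperature: 0.1, topP: 0.8, maxTokens: 2048 },
  "large-context": { canonicalModel: "np-dms-ai", temperature: 0.3, topP: 0.9, maxTokens: 8192 },
};

// Job types whose profile is forced by the backend regardless of caller input.
const FORCED_PROFILE: ReadonlyMap<string, ExecutionProfile> = new Map<string, ExecutionProfile>([
  ["migrate-document", "thai-accurate"],
  ["auto-fill-document", "thai-accurate"],
]);

// Resolves the effective profile: forced override first, then the caller's
// request, then "balanced" (assumed default).
function resolvePolicy(
  jobType: string,
  requested?: ExecutionProfile,
): { effectiveProfile: ExecutionProfile; policy: RuntimePolicy } {
  const effectiveProfile = FORCED_PROFILE.get(jobType) ?? requested ?? "balanced";
  return { effectiveProfile, policy: PROFILE_POLICY[effectiveProfile] };
}

console.log(resolvePolicy("auto-fill-document", "fast").effectiveProfile);
```

Typing the table as `Record<ExecutionProfile, RuntimePolicy>` means adding a fifth profile without a policy row fails at compile time, which fits the code-review governance argument above.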
 ---
 ## 3. Adaptive OCR Residency Calculation
 **Decision**: A policy function in `OcrService` (backend) computes `keep_alive` and sends it in the OCR request header/body; the sidecar uses that value as-is
 **Rationale**:
 - The backend has the active job profile context, which the sidecar lacks
 - A central policy is simpler than a distributed decision
 **Algorithm**:
 ```
 function calculateOcrKeepAlive(activeProfile, vramHeadroomMb):
  if headroom query failed (timeout or error): return 0   # safe default, checked first
  if activeProfile == 'large-context': return 0
  if vramHeadroomMb < VRAM_HEADROOM_THRESHOLD_MB: return 0
  return OCR_RESIDENCY_WINDOW_SECONDS (default: 120)
 ```
 **Default values**:
 - `VRAM_HEADROOM_THRESHOLD_MB`: 3000 (3GB) — configurable env variable
 - `OCR_RESIDENCY_WINDOW_SECONDS`: 120 (2 min) — configurable env variable
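A typed version of the residency algorithm might look like the sketch below. It assumes a failed headroom query is represented as `null` and checks that case first. The `OcrResidencyDecision` field names and the reason strings `headroom-query-failed` and `headroom-below-threshold` are illustrative assumptions; `large-context-active` and `headroom-sufficient` mirror the log lines expected in the quickstart.

```typescript
interface OcrResidencyDecision {
  keepAliveSeconds: number;
  reason: string;
  headroomMb: number | null; // recorded so the decision can be logged with its input
}

function calculateOcrKeepAlive(
  activeProfile: string | null,
  vramHeadroomMb: number | null, // null means the headroom query failed
  thresholdMb: number = 3000, // VRAM_HEADROOM_THRESHOLD_MB default
  residencyWindowSeconds: number = 120, // OCR_RESIDENCY_WINDOW_SECONDS default
): OcrResidencyDecision {
  if (vramHeadroomMb === null) {
    return { keepAliveSeconds: 0, reason: "headroom-query-failed", headroomMb: null };
  }
  if (activeProfile === "large-context") {
    return { keepAliveSeconds: 0, reason: "large-context-active", headroomMb: vramHeadroomMb };
  }
  if (vramHeadroomMb < thresholdMb) {
    return { keepAliveSeconds: 0, reason: "headroom-below-threshold", headroomMb: vramHeadroomMb };
  }
  return {
    keepAliveSeconds: residencyWindowSeconds,
    reason: "headroom-sufficient",
    headroomMb: vramHeadroomMb,
  };
}

console.log(calculateOcrKeepAlive("balanced", 9000));
```

Returning the reason and headroom together with the value keeps the logging requirement for residency decisions a one-line call at the call site.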
 ---
 ## 4. CPU Fallback for Retrieval (FlagEmbedding + BGE-Reranker)
 **Decision**: `FlagEmbedding` supports `use_fp16=False` and device selection; pass `device="cpu"` when headroom is insufficient
 **Rationale**:
 - FlagEmbedding (`BGE-M3`) supports CPU inference natively, so no rewrite is needed
 - `BGE-Reranker-Large` also supports CPU
 - A timeout guard is required: CPU embedding may take 10–30s for a long document
 **Pattern in the sidecar**:
 ```python
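 import logging

 async def embed_with_fallback(texts: list[str], vram_headroom_mb: float) -> EmbedResponse:
     # Use the GPU only when headroom meets the configured threshold
     device = "cuda" if vram_headroom_mb >= settings.VRAM_HEADROOM_THRESHOLD_MB else "cpu"
     if device == "cpu":
         logging.info("device=cpu reason=gpu-headroom-below-threshold")  # log the fallback decision
     # encode_on_device is a placeholder (assumption) for the FlagEmbedding call on that device
     return await encode_on_device(texts, device)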
 ```
 ---
 ## 5. BullMQ Concurrency Uplift Pattern
 **Decision**: Use job-type classification in `ai-realtime.processor.ts`: check `job.data.type` before processing; lightweight jobs (intent-classify, tool-suggest) run concurrently; generation-heavy jobs are serialized behind a semaphore
 **Rationale**:
 - The BullMQ Worker supports `concurrency: 2` at the worker configuration level
 - Lightweight jobs do not call Ollama, so there is no real GPU contention
 - No new queue is needed; a config change plus a guard in the processor is enough
 **Lightweight job types** (allowed concurrency = 2):
 - `intent-classify` (Pattern Layer only)
 - `tool-suggest` (no model switch)
 **Generation-heavy** (still serialized):
 - `rag-query`
 - `auto-fill-document`
 - `migrate-document`
 - `ocr-extraction`
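The classification above could be captured in a small pure function, sketched here without any BullMQ dependency. The function name `classifyJob` and the choice to treat unknown job types as generation-heavy are assumptions; routing generation-heavy jobs to `ai-batch` follows the requirement that `rag-query` never runs in `ai-realtime`.

```typescript
type QueueName = "ai-realtime" | "ai-batch";

const LIGHTWEIGHT_REALTIME_JOBS: ReadonlySet<string> = new Set<string>([
  "intent-classify",
  "tool-suggest",
]);

interface JobClassification {
  lightweight: boolean;
  queue: QueueName;
}

// Lightweight jobs may share the ai-realtime worker (concurrency = 2).
// Everything else, including unknown types (assumed conservative default),
// is treated as generation-heavy and routed to ai-batch.
function classifyJob(jobType: string): JobClassification {
  if (LIGHTWEIGHT_REALTIME_JOBS.has(jobType)) {
    return { lightweight: true, queue: "ai-realtime" };
  }
  return { lightweight: false, queue: "ai-batch" };
}

console.log(classifyJob("rag-query"));
```

Using an allow-list for the lightweight set means a newly added job type stays serialized until someone deliberately marks it as safe for concurrency.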
 ---
 ## 6. Canonical Name Enforcement Strategy
 **Decision**: Use an `AiPolicyService.getCanonicalModelName(runtimeModelTag)` function that maps runtime tag → canonical name, called before every log write or response
 **Pattern**:
 ```typescript
 // Whatever runtime tag Ollama returns, map it before exposing it
 const canonicalName = this.aiPolicyService.getCanonicalModelName(ollamaResponse.model);
 // canonicalName is always "np-dms-ai" or "np-dms-ocr"
 ```
 **Mapping table**:
 ```typescript
 const CANONICAL_MODEL_MAP: Record<string, string> = {
  'typhoon2.5-np-dms:latest': 'np-dms-ai',
  'np-dms-ai:latest': 'np-dms-ai',
  'np-dms-ai': 'np-dms-ai',
  'typhoon-np-dms-ocr:latest': 'np-dms-ocr',
  'np-dms-ocr:latest': 'np-dms-ocr',
  'np-dms-ocr': 'np-dms-ocr',
 };
 ```
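A standalone sketch of the lookup function, using the same tag pairs as the mapping table above. Throwing on an unmapped tag is an assumption made here so that an unknown runtime name fails closed instead of leaking into logs or responses; the actual service might choose a different policy.

```typescript
type CanonicalModel = "np-dms-ai" | "np-dms-ocr";

// Same entries as the mapping table, held in a Map so lookups ignore
// inherited object properties.
const CANONICAL_MODEL_LOOKUP: ReadonlyMap<string, CanonicalModel> = new Map<string, CanonicalModel>([
  ["typhoon2.5-np-dms:latest", "np-dms-ai"],
  ["np-dms-ai:latest", "np-dms-ai"],
  ["np-dms-ai", "np-dms-ai"],
  ["typhoon-np-dms-ocr:latest", "np-dms-ocr"],
  ["np-dms-ocr:latest", "np-dms-ocr"],
  ["np-dms-ocr", "np-dms-ocr"],
]);

// Fails closed: an unmapped tag raises instead of being exposed as-is
// (assumed behavior, not confirmed by the source).
function getCanonicalModelName(runtimeModelTag: string): CanonicalModel {
  const canonical = CANONICAL_MODEL_LOOKUP.get(runtimeModelTag.trim());
  if (canonical === undefined) {
    throw new Error(`Unmapped runtime model tag: ${runtimeModelTag}`);
  }
  return canonical;
}

console.log(getCanonicalModelName("typhoon2.5-np-dms:latest"));
```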
@@ -0,0 +1,193 @@
 // File: specs/200-fullstacks/235-ai-runtime-policy-refactor/spec.md
 // Change Log:
 // - 2026-06-11: Initial specification for AI Runtime Policy Refactor (RTX 5060 Ti 16GB)
 # Feature Specification: AI Runtime Policy Refactor
 **Feature Branch**: `235-ai-runtime-policy-refactor`
 **Created**: 2026-06-11
 **Status**: Draft
 **Category**: 200-fullstacks
 **Input**: User description from `docs/AI-Refactor.md` + `docs/0001-ai-runtime-policy-refactor.md`
 ## Overview
 Adapt the LCBP3-DMS AI runtime to the new GPU (RTX 5060 Ti 16GB) by adopting canonical model identities (`np-dms-ai`, `np-dms-ocr`), a policy-driven `executionProfile` contract, and LLM-First GPU ownership. These replace the previous system, in which callers could choose the model and parameters themselves and `keep_alive` was a fixed value.
 References: `docs/AI-Refactor.md`, `docs/0001-ai-runtime-policy-refactor.md`, ADR-033, ADR-034
 ---
 ## User Scenarios & Testing _(mandatory)_
 ### User Story 1 — Policy Contract & Canonical Naming (Priority: P1)
 Developers and admins who submit AI job requests through the AI Gateway can send only an `executionProfile` (`fast | balanced | thai-accurate | large-context`); they cannot name a model or override runtime parameters themselves. The system displays and records the model at every layer under the canonical names `np-dms-ai` and `np-dms-ocr` instead of the previous runtime names.
 **Why this priority**: It is the foundation for every workstream. If the contract remains caller-driven, the other workstreams are meaningless.
 **Independent Test**: POST to the AI Gateway endpoint with a payload containing `model.key` or `temperature` and verify that the API rejects it with 400 and an error message; POST with `executionProfile: "balanced"` and verify that it succeeds and that the log/response shows `np-dms-ai`
 **Acceptance Scenarios**:
 1. **Given** an AI job request containing `model: { key: "typhoon2.5-np-dms:latest" }`, **When** it is sent to `POST /api/ai/jobs`, **Then** the system responds HTTP 400 with a message that the field `model.key` is not allowed
 2. **Given** an AI job request with `executionProfile: "balanced"`, **When** the job is dispatched to the `ai-batch` queue, **Then** the job payload records `modelUsed: "np-dms-ai"` in the audit log
 3. **Given** an admin opens the AI Admin Console, **When** viewing the model information panel, **Then** it shows `np-dms-ai` and `np-dms-ocr`, not the actual runtime names (e.g. `typhoon2.5-np-dms:latest`)
 4. **Given** an `auto-fill-document` job submitted with `executionProfile: "fast"`, **When** the backend processes the job, **Then** the backend overrides it with the deterministic profile and ignores the caller-supplied `fast`
 5. **Given** the `large-context` profile is sent by a non-admin user, **When** the backend validates the request, **Then** it responds HTTP 403 because that profile is restricted to admin/special workflows
 ---
 ### User Story 2 — Adaptive OCR Residency (Priority: P2)
 The backend computes the `keep_alive` value for `np-dms-ocr` dynamically from the VRAM headroom and active workload at that moment, instead of using a fixed `keep_alive: 0` or `keep_alive: 300`.
 **Why this priority**: It addresses VRAM contention directly. If OCR stays resident in VRAM permanently it blocks the main model; if it is unloaded every time it pays a cold start penalty of 5–15 seconds.
 **Independent Test**: Run an OCR job while the `large-context` profile is active and verify that `keep_alive: 0` is sent to the OCR sidecar; run an OCR job while VRAM headroom is high and verify that the residency window is > 0
 **Acceptance Scenarios**:
 1. **Given** an active job is using the `large-context` profile, **When** an OCR job arrives, **Then** the `keep_alive` sent to Ollama = `0`
 2. **Given** no active main model pressure and VRAM headroom ≥ threshold, **When** an OCR job arrives, **Then** the `keep_alive` sent to Ollama > `0` (residency window)
 3. **Given** main model pressure is high (high VRAM utilization), **When** an OCR job arrives, **Then** `keep_alive` = `0` always
 4. **Given** the OCR residency policy is active, **When** viewing the trace/log of an OCR request, **Then** the log records the residency decision together with the headroom value used to decide
 ---
 ### User Story 3 — Retrieval Acceleration with CPU Fallback (Priority: P3)
 The `/embed` and `/rerank` endpoints on the OCR sidecar check VRAM headroom before using the GPU; if headroom does not meet the policy threshold they fall back to CPU immediately, without failing and without waiting in a GPU queue.
 **Why this priority**: It prevents RAG queries from failing during periods of high GPU usage; retrieval keeps working, only slower.
 **Independent Test**: Simulate high GPU pressure, then issue a RAG query; verify that the query still answers (possibly slower than usual) and that the log records `"retrieval: cpu-fallback"`
 **Acceptance Scenarios**:
 1. **Given** GPU headroom < threshold, **When** `POST /embed` is called, **Then** it uses CPU compute without returning an error
 2. **Given** GPU headroom < threshold, **When** `POST /rerank` is called, **Then** it uses CPU compute and responds normally
 3. **Given** a fallback occurred, **When** viewing the sidecar log, **Then** the log entry contains `device: "cpu"` and `reason: "gpu-headroom-below-threshold"`
 4. **Given** a `rag-query` job is running, **When** the GPU is in use by the main model, **Then** the RAG query still returns a response (no timeout or hard failure)
 ---
 ### User Story 4 — Queue Policy & Selective Realtime Concurrency (Priority: P4)
 The BullMQ `ai-realtime` queue is adjusted to support concurrency = 2 only for lightweight realtime jobs (intent classification that does not call OCR, tool-only suggestion that needs no model switching), while keeping the existing pause/resume coordination; `rag-query` remains classified as a generation-centric job that lives in `ai-batch`.
 **Why this priority**: It increases throughput for lightweight jobs without affecting GPU safety
 **Independent Test**: Submit 2 intent classification jobs at the same time and verify that both run concurrently in `ai-realtime`; submit a `rag-query` and verify that it lands in `ai-batch`, not `ai-realtime`
 **Acceptance Scenarios**:
 1. **Given** 2 intent classification jobs arrive at the same time, **When** both are dispatched, **Then** both are processed concurrently in the `ai-realtime` queue (concurrency = 2)
 2. **Given** a `rag-query` job arrives, **When** it is dispatched, **Then** the job is sent to the `ai-batch` queue, not `ai-realtime`
 3. **Given** `ai-batch` is paused due to realtime pressure, **When** pause/resume coordination runs, **Then** `ai-realtime` still sustains concurrency = 2 for lightweight jobs
 ---
 ### User Story 5 — Verification & Cutover Gate (Priority: P5)
 The system has automated tests and a manual validation checklist covering all 4 axes (policy contract, canonical naming, adaptive OCR residency, retrieval fallback) before the big bang cutover is considered successful; partial success is not allowed.
 **Why this priority**: It is the safety net for the whole refactor; a partial cutover could leave the system inconsistent
 **Independent Test**: Run the test suite covering all 4 axes and confirm every test passes; an admin can open the AI Admin Console and OCR Sandbox to check the actual labels and behavior
 **Acceptance Scenarios**:
 1. **Given** the test suite runs, **When** every test passes, **Then** the cutover gate is considered passed for the executable verification part
 2. **Given** an admin opens the AI Admin Console, **When** viewing all model labels, **Then** neither `typhoon2.5-np-dms:latest` nor `typhoon-np-dms-ocr:latest` appears in the UI
 3. **Given** an admin reruns the OCR Sandbox for several jobs under different headroom conditions, **When** observing the behavior, **Then** `keep_alive` differs according to the defined policy
 ---
 ### Edge Cases
 - If the VRAM headroom calculation service fails (timeout or error) → always fall back to `keep_alive: 0` (safe default)
 - If the caller sends an `executionProfile` outside the canonical set → respond with a 400 validation error
 - If the `large-context` profile is whitelisted for an admin but VRAM is insufficient → the backend must reject with a clear error, not fall back silently
 - If an OCR job arrives together with a main model generation job → the LLM-First rule applies: OCR must wait or use `keep_alive: 0`
 - If `/embed` falls back to CPU and the job exceeds the timeout → return a clear error (HTTP 504), never a partial result and never a hang (see FR-C03 and Clarifications)
 ---
 ## Requirements _(mandatory)_
 ### Functional Requirements
 **Workstream A: Contract & Canonical Naming**
 - **FR-A01**: System MUST reject AI job requests that contain a `model.key` field in the payload (HTTP 400)
 - **FR-A02**: System MUST reject AI job requests that contain direct `temperature`, `top_p`, or `maxTokens` overrides (HTTP 400)
 - **FR-A03**: `executionProfile` MUST accept only `fast | balanced | thai-accurate | large-context`
 - **FR-A04**: The `large-context` profile MUST be authorized only for the admin role or backend-whitelisted workflows
 - **FR-A05**: System MUST map `executionProfile` → canonical model name and runtime parameters in the backend policy layer
 - **FR-A06**: Data-affecting jobs (`migrate-document`, `auto-fill-document`) MUST have their profile overridden by the backend, ignoring the caller-supplied value
 - **FR-A07**: Every layer (API response, audit log, Admin Console, OCR Sandbox) MUST show `np-dms-ai` and `np-dms-ocr` instead of the actual runtime names
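The request-level rules in FR-A01 through FR-A04 can be illustrated with a framework-free validator. This is a sketch only: `validateAiJobRequest`, the `isAdmin` flag, and the message strings are hypothetical, the backend-whitelisted workflow path from FR-A04 is omitted, and a real implementation would presumably live in the DTO validation and guard layers.

```typescript
const CANONICAL_PROFILES: readonly string[] = ["fast", "balanced", "thai-accurate", "large-context"];
const FORBIDDEN_OVERRIDES: readonly string[] = ["temperature", "top_p", "maxTokens"];

interface ValidationResult {
  status: 201 | 400 | 403;
  message: string;
}

function validateAiJobRequest(body: Record<string, unknown>, isAdmin: boolean): ValidationResult {
  // FR-A01: callers may not pick a model.
  const model = body["model"];
  if (typeof model === "object" && model !== null && "key" in model) {
    return { status: 400, message: "field model.key is not allowed" };
  }
  // FR-A02: callers may not override runtime parameters.
  for (const field of FORBIDDEN_OVERRIDES) {
    if (field in body) {
      return { status: 400, message: `field ${field} is not allowed` };
    }
  }
  // FR-A03: the profile, when present, must be in the canonical set.
  const profile = body["executionProfile"];
  if (profile !== undefined && (typeof profile !== "string" || !CANONICAL_PROFILES.includes(profile))) {
    return { status: 400, message: "executionProfile is not in the canonical set" };
  }
  // FR-A04: large-context is restricted (admin only in this sketch).
  if (profile === "large-context" && !isAdmin) {
    return { status: 403, message: "large-context requires admin role" };
  }
  return { status: 201, message: "accepted" };
}

console.log(validateAiJobRequest({ type: "rag-query", temperature: 0.9 }, true));
```

Payload-shape errors (400) are checked before authorization (403) here so that a malformed request gets the same answer regardless of role; that ordering is a design choice of the sketch.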
 **Workstream B: Runtime Policy**
 - **FR-B01**: Backend MUST have a policy mapping: `executionProfile` → `{ canonicalModel, keep_alive, temperature, top_p, maxTokens }`
 - **FR-B02**: OCR residency MUST compute `keep_alive` dynamically from VRAM headroom and the active profile
 - **FR-B03**: If active profile = `large-context` or main model pressure = high → OCR `keep_alive` MUST = `0`
 - **FR-B04**: If VRAM headroom ≥ policy threshold → OCR may use a residency window > 0
 - **FR-B05**: If the VRAM headroom calculation fails → MUST fall back to `keep_alive: 0` (safe default)
 - **FR-B06**: The OCR residency decision MUST be logged together with the headroom value used to decide
 **Workstream C: Retrieval Acceleration**
 - **FR-C01**: The `/embed` endpoint MUST check VRAM headroom before GPU compute; if the check fails → fall back to CPU
 - **FR-C02**: The `/rerank` endpoint MUST check VRAM headroom before GPU compute; if the check fails → fall back to CPU
 - **FR-C03**: CPU fallback MUST NOT hard-fail and MUST NOT wait in a GPU queue; if CPU compute times out it must return HTTP 504 with a clear error message (no partial result)
 - **FR-C04**: The fallback event MUST be logged with `device: "cpu"` and `reason`
 - **FR-C05**: A `rag-query` job MUST still answer when the GPU retrieval path falls back to CPU
 - **FR-C06**: The VRAM headroom threshold MUST be a configurable env variable (`VRAM_HEADROOM_THRESHOLD_MB`); if the VRAM query fails, use the safe default of 0 MB headroom (forcing fallback)
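The device decision behind FR-C01, FR-C02, FR-C04, and FR-C06 reduces to a small function, sketched here in TypeScript for consistency with the backend examples even though the sidecar itself is Python. The name `selectRetrievalDevice` and the reason strings other than `gpu-headroom-below-threshold` are assumptions.

```typescript
type Device = "cuda" | "cpu";

interface DeviceDecision {
  device: Device;
  reason: string; // logged alongside the device, per FR-C04
}

// headroomMb is null when the VRAM query failed; that case always falls back
// to CPU, matching the safe default in FR-C06.
function selectRetrievalDevice(headroomMb: number | null, thresholdMb: number): DeviceDecision {
  if (headroomMb === null) {
    return { device: "cpu", reason: "vram-query-failed" };
  }
  if (headroomMb < thresholdMb) {
    return { device: "cpu", reason: "gpu-headroom-below-threshold" };
  }
  return { device: "cuda", reason: "gpu-headroom-sufficient" };
}

console.log(selectRetrievalDevice(1200, 3000));
```

Handling the failed query as its own branch, instead of coercing it to 0 MB, keeps the fallback correct even if the threshold is configured as 0.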
 **Workstream D: Queue Policy**
 - **FR-D01**: The `ai-realtime` queue MUST support concurrency = 2 for lightweight realtime jobs
 - **FR-D02**: Lightweight realtime jobs are: intent classification (does not call OCR), tool-only suggestion (no model switching)
 - **FR-D03**: `rag-query` MUST be dispatched to `ai-batch`, not `ai-realtime`
 - **FR-D04**: pause/resume coordination between `ai-realtime` and `ai-batch` MUST keep working as before
 ### Key Entities
 - **ExecutionProfile**: Enum value sent by the caller (`fast | balanced | thai-accurate | large-context`); the API-level contract
 - **RuntimePolicy**: Backend mapping from `ExecutionProfile` → `{ canonicalModel, keep_alive, temperature, top_p, maxTokens }`; not exposed in the API
 - **VramHeadroom**: Value computed at request time, used to decide OCR residency and retrieval acceleration; recorded in the log
 - **CanonicalModelIdentity**: The name `np-dms-ai` or `np-dms-ocr`; used at every user-visible layer
 - **OcrResidencyDecision**: The computed `keep_alive` value for each OCR job; recorded in the log with its input factors
 ---
 ## Success Criteria _(mandatory)_
 ### Measurable Outcomes
 - **SC-001**: 100% of AI job requests containing `model.key` or parameter overrides are rejected in every environment
 - **SC-002**: Every layer visible to users and developers (API response, audit log, Admin Console, OCR Sandbox) shows `np-dms-ai` / `np-dms-ocr` 100% of the time, with no runtime name leaking out
 - **SC-003**: OCR cold start penalty is reduced by adaptive residency in situations where VRAM headroom is sufficient (measured as average OCR latency in the non-contention scenario)
 - **SC-004**: RAG queries still return a response 100% of the time even when the GPU retrieval path falls back to CPU (no hard failure)
 - **SC-005**: The automated test suite covering all 4 axes of the cutover gate passes 100%
 - **SC-006**: Lightweight realtime job throughput increases (2 concurrent lightweight jobs can be processed) while pause/resume coordination keeps working
 ---
 ## Clarifications
 ### Session 2026-06-11
 - Q: If `/embed` falls back to CPU and the job exceeds the timeout, should it return a partial result or a clear error? → A: Return a clear error with an HTTP 504 timeout message. Do not return a partial result, because the downstream LLM context would be incomplete and would silently produce wrong results
 - Q: Should the spec define a default value for the VRAM headroom threshold? → A: No. The threshold is operational config (env variable `VRAM_HEADROOM_THRESHOLD_MB`) that ops/admin can adjust at runtime; the spec states only that "a configurable threshold must exist" and that "the safe default = 0 (unload) must be used when the query fails"
 ## Assumptions
 - The current GPU (RTX 5060 Ti 16GB) supports a VRAM monitoring API that Ollama or the sidecar can query
 - The default VRAM headroom threshold will be set in config/env and can be adjusted without a redeploy
 - The canonical model names (`np-dms-ai`, `np-dms-ocr`) are tagged in the Ollama registry on Desk-5439 before cutover
 - The OCR sidecar (`app.py`) on Desk-5439 will be updated as part of the cutover
 - Big bang rollout: there is no parallel legacy path; every change deploys together in a single round
 - The `ai-realtime` concurrency uplift is a configuration change, not a new architectural change
@@ -0,0 +1,204 @@
 // File: specs/200-fullstacks/235-ai-runtime-policy-refactor/tasks.md
 // Change Log:
 // - 2026-06-11: Initial task list for AI Runtime Policy Refactor
 # Tasks: AI Runtime Policy Refactor
 **Input**: Design documents from `specs/200-fullstacks/235-ai-runtime-policy-refactor/`
 **Prerequisites**: plan.md ✅, spec.md ✅, research.md ✅, data-model.md ✅, contracts/ ✅
 ## Format: `[ID] [P?] [Story] Description`
 - **[P]**: Can run in parallel (different files, no dependencies)
 - **[Story]**: US1=Contract&Naming, US2=OCR Residency, US3=Retrieval Fallback, US4=Queue Policy, US5=Verification
 ---
 ## Phase 1: Setup (Shared Infrastructure)
 **Purpose**: Create the foundational types and interfaces before any workstream
 - [ ] T001 Create interface file `backend/src/modules/ai/interfaces/execution-policy.interface.ts` (ExecutionProfile type, RuntimePolicy interface, VramHeadroom interface)
 - [ ] T002 Create interface file `backend/src/modules/ai/interfaces/ocr-residency.interface.ts` (OcrResidencyDecision interface)
 - [ ] T003 [P] Create `backend/src/modules/ai/services/vram-monitor.service.ts`: query Ollama `/api/ps` to compute VRAM headroom
 - [ ] T004 [P] Create `specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/services/vram_monitor.py`: Python VRAM headroom query via Ollama `/api/ps`
 ---
 ## Phase 2: Foundational (Blocking Prerequisites)
 **Purpose**: Policy infrastructure that every workstream depends on; MUST complete before any user story
 **⚠️ CRITICAL**: No user story work can begin until this phase is complete
 - [ ] T005 Create `backend/src/modules/ai/services/ai-policy.service.ts`: ExecutionProfile → RuntimePolicy mapping, canonical model name mapping, data-affecting job override logic
 - [ ] T006 Create `backend/src/modules/ai/guards/execution-profile.guard.ts`: CASL check, `large-context` for the admin role only
 - [ ] T007 [P] Update `backend/src/modules/ai/dto/create-ai-job.dto.ts`: remove `model.key` and the parameter override fields, add `executionProfile?: ExecutionProfile` with class-validator
 - [ ] T008 Create `specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/services/residency_policy.py`: OCR keep_alive calculation function
 - [ ] T009 Update `backend/src/modules/ai/ai.module.ts`: register `AiPolicyService`, `VramMonitorService`, `ExecutionProfileGuard`
 **Checkpoint**: Foundation ready — policy services, guard, and updated DTO available
 ---
 ## Phase 3: User Story 1 — Policy Contract & Canonical Naming (P1) 🎯 MVP
 **Goal**: The API rejects `model.key`/parameter overrides; every layer shows canonical names; data-affecting jobs are overridden
 **Independent Test**: POST with `model.key` → must return 400; POST with `executionProfile: "balanced"` → must return 201 + `modelUsed: "np-dms-ai"`
 ### Implementation for User Story 1
 - [ ] T010 [US1] Update `backend/src/modules/ai/ai.service.ts`: inject `AiPolicyService`, validate `executionProfile`, apply the backend override for `migrate-document` and `auto-fill-document`, set the canonical `modelUsed` name in the audit log
 - [ ] T011 [P] [US1] Update `backend/src/modules/ai/dto/ai-job-response.dto.ts`: add a `modelUsed: 'np-dms-ai' | 'np-dms-ocr'` field and an `executionProfile` field (effective profile after override)
 - [ ] T012 [P] [US1] Update `backend/src/modules/ai/ai.controller.ts`: use `ExecutionProfileGuard` on the create-job endpoint, validate forbidden fields in a pipe
 - [ ] T013 [P] [US1] Update `frontend/types/ai.ts`: remove the `model` field, add `executionProfile?: ExecutionProfile`, add `modelUsed?: string`
 - [ ] T014 [US1] Update `frontend/lib/services/admin-ai.service.ts`: update request/response types to match the new DTO
 - [ ] T015 [P] [US1] Update `frontend/components/admin/ai/OcrSandboxPromptManager.tsx`: show `np-dms-ai` / `np-dms-ocr` instead of runtime names in result cards and model info
 - [ ] T016 [US1] Update `frontend/app/(admin)/admin/ai/page.tsx`: show canonical names in the System Health panel and model status cards
 **Checkpoint**: US1 fully functional — policy contract enforced, canonical naming in all layers
 ---
 ## Phase 4: User Story 2 — Adaptive OCR Residency (P2)
 **Goal**: `OcrService` computes `keep_alive` dynamically from VRAM headroom + active profile; the sidecar receives and applies the value
 **Independent Test**: Check the log of an OCR job in a high-pressure scenario → `keep_alive=0`; in a headroom-sufficient scenario → `keep_alive>0`
 ### Implementation for User Story 2
 - [ ] T017 [US2] Update `backend/src/modules/ai/services/ocr.service.ts`: inject `VramMonitorService` and `AiPolicyService`, add a `calculateOcrResidency()` method, send the computed `keep_alive` in the OCR sidecar request, log the `OcrResidencyDecision`
 - [ ] T018 [P] [US2] Update `specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/app.py`: accept a `keep_alive` parameter from the request body instead of the hardcoded `keep_alive=0`, and pass that value to the Ollama `/v1/chat/completions` call
 - [ ] T019 [P] [US2] Add env variables to the Desk-5439 OCR sidecar docker-compose: `VRAM_HEADROOM_THRESHOLD_MB`, `OCR_RESIDENCY_WINDOW_SECONDS`, `GPU_TOTAL_VRAM_MB`
 - [ ] T020 [US2] Add unit tests `backend/src/modules/ai/tests/ocr-residency.spec.ts`; scenarios: large-context-active, high-pressure, headroom-sufficient, query-failed fallback
 **Checkpoint**: US2 functional — OCR keep_alive computed dynamically per policy
 ---
 ## Phase 5: User Story 3 — Retrieval Acceleration with CPU Fallback (P3)
 **Goal**: `/embed` and `/rerank` on the sidecar check VRAM headroom; fall back to CPU if insufficient; log the fallback decision
 **Independent Test**: Simulate GPU pressure → call `/embed` → must return a result (no failure) + log `device: "cpu"`
 ### Implementation for User Story 3
 - [ ] T021 [P] [US3] Update `specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/app.py`: add a VRAM headroom check to the `POST /embed` endpoint; use the GPU if the threshold is met, use the CPU if it is not met or the query fails; log `device` and `reason`
 - [ ] T022 [P] [US3] Update `specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/app.py`: add a VRAM headroom check to the `POST /rerank` endpoint; same CPU fallback logic as `/embed`; add a timeout guard (504 response on CPU timeout)
 - [ ] T023 [US3] Update `backend/src/modules/ai/processors/ai-batch.processor.ts`: handle the case where `/embed` or `/rerank` returns `device: "cpu"` in the response; log `retrievalDevice` to the ai_audit_logs metadata
 - [ ] T024 [P] [US3] Create `specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/tests/test_retrieval_fallback.py`: pytest tests for the CPU fallback behavior of `/embed` and `/rerank`
 **Checkpoint**: US3 functional — retrieval never hard-fails due to GPU pressure
 ---
 ## Phase 6: User Story 4 — Queue Policy & Selective Realtime Concurrency (P4)
 **Goal**: `ai-realtime` concurrency = 2 for lightweight jobs; `rag-query` routes to `ai-batch`; pause/resume keeps working
 **Independent Test**: Submit 2 intent-classify jobs at the same time → both run concurrently; submit a rag-query → it lands in `ai-batch`
 ### Implementation for User Story 4
 - [ ] T025 [US4] Update `backend/src/config/bullmq.config.ts`: add a `REALTIME_CONCURRENCY` env variable (default: 2); make the `ai-realtime` worker concurrency configurable
 - [ ] T026 [US4] Update `backend/src/modules/ai/processors/ai-realtime.processor.ts`: add job type classification: `LIGHTWEIGHT_REALTIME_JOBS = ['intent-classify', 'tool-suggest']`; generation-heavy jobs are redirected to `ai-batch` if they arrive on the wrong queue; add a log for the classification decision
 - [ ] T027 [P] [US4] Review `backend/src/modules/ai/ai.service.ts`: confirm that `rag-query` is always dispatched to `ai-batch` (never `ai-realtime`); add an explicit assertion in the dispatch logic
 - [ ] T028 [P] [US4] Add unit tests `backend/src/modules/ai/tests/queue-policy.spec.ts`: test job classification, rag-query routing, lightweight job concurrency
 **Checkpoint**: US4 functional — selective concurrency active, rag-query always in ai-batch
 ---
 ## Phase 7: User Story 5 — Verification & Cutover Gate (P5)
 **Goal**: The test suite covers all 4 axes of the cutover gate; the manual validation checklist is ready; Admin Console / OCR Sandbox display correctly
 **Independent Test**: `pnpm test -- --testPathPattern="ai-policy|ocr-residency|execution-profile|queue-policy"` with every test passing 100%
 ### Implementation for User Story 5
 - [ ] T029 [US5] Create `backend/src/modules/ai/tests/ai-policy.service.spec.ts`; unit tests cover: profile mapping for all 4 values, canonical name mapping, data-affecting override, `large-context` guard validation
 - [ ] T030 [P] [US5] Create `backend/src/modules/ai/tests/execution-profile.guard.spec.ts`; unit tests: admin passes, non-admin blocked, missing token blocked
 - [ ] T031 [P] [US5] Create `backend/src/modules/ai/tests/vram-monitor.service.spec.ts`; unit tests: successful query, Ollama timeout fallback, empty models response
 - [ ] T032 [US5] Run manual validation per `quickstart.md`: run the curl commands for Gates 1–4, check Admin Console labels, check OCR Sandbox behavior; record the results in the checklist
 - [ ] T033 [P] [US5] Update the env template file `specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/.env.template`: add `VRAM_HEADROOM_THRESHOLD_MB`, `OCR_RESIDENCY_WINDOW_SECONDS`, `GPU_TOTAL_VRAM_MB`, `REALTIME_CONCURRENCY`
 - [ ] T034 [P] [US5] Update `backend/.env.example`: add `AI_VRAM_HEADROOM_THRESHOLD_MB`, `AI_REALTIME_CONCURRENCY`
 **Checkpoint**: All 5 user stories complete — big bang cutover gate ready for validation
 ---
 ## Phase 8: Polish & Cross-Cutting Concerns
 - [ ] T039 [US1] Update `backend/src/modules/ai/processors/ai-batch.processor.ts`: change the `ocrUsed` label value from `"Typhoon OCR"` / `"PaddleOCR"` to `"np-dms-ocr"` in the Redis completed result (covers FR-A07: canonical names at every layer, including the OCR Sandbox badge)
 - [ ] T035 [P] Check which i18n keys must be added in `frontend/public/locales/` for the new error messages (400 model.key, 403 large-context, 504 CPU timeout)
 - [ ] T036 Update CONTEXT.md and AGENTS.md: add `np-dms-ai` / `np-dms-ocr` as the canonical identity in the System readiness summary; fix existing references that still use runtime names
 - [ ] T037 [P] Search the codebase for all ADR-034 references: which files still use `typhoon2.5-np-dms:latest` or `typhoon-np-dms-ocr:latest` in user-facing surfaces (not Modelfile/ops internals)
 - [ ] T038 Run `pnpm lint` and `pnpm type-check` for backend and frontend; fix every error before cutover
 ---
 ## Dependencies & Execution Order
 ### Phase Dependencies
 - **Setup (Phase 1)**: No dependencies; can start immediately
 - **Foundational (Phase 2)**: Waits for Phase 1 (T001, T002); BLOCKS every user story
 - **US1 (Phase 3)**: Waits for Phase 2 to complete; highest priority, do first
 - **US2 (Phase 4)**: Waits for Phase 2 to complete; depends on `VramMonitorService` from T003
 - **US3 (Phase 5)**: Waits for Phase 2 to complete; depends on `vram_monitor.py` from T004
 - **US4 (Phase 6)**: Waits for Phase 2 to complete; independent of US1/US2/US3
 - **US5 (Phase 7)**: Waits for US1+US2+US3+US4 to complete (tests every axis)
 - **Polish (Phase 8)**: Waits for US5 to pass the cutover gate
 ### User Story Dependencies
 - **US1 (P1)**: Highest priority; the contract is the foundation of canonical naming at every layer
 - **US2 (P2)**: Beyond Phase 2, depends only on `VramMonitorService` (T003, Phase 1); can run in parallel with US1
 - **US3 (P3)**: Beyond Phase 2, depends only on `vram_monitor.py` (T004, Phase 1); can run in parallel with US1/US2
 - **US4 (P4)**: Independent of US1/US2/US3; can run in parallel after Phase 2
 - **US5 (P5)**: Waits for every preceding US
 ### Parallel Opportunities
 - T001 + T002: parallel (different files)
 - T003 + T004: parallel (different stacks)
 - T005, T006, T007: do T005 first (T006 and T007 depend on types from T005)
 - US1 + US2 + US3 + US4: parallel after Phase 2 completes (if the team has capacity)
 - T029, T030, T031, T033, T034: parallel (different test files / env files)
 ---
 ## Implementation Strategy
 ### MVP First (US1 Only)
 1. Phase 1: Setup (T001–T004)
 2. Phase 2: Foundational (T005–T009)
 3. Phase 3: US1 (T010–T016)
 4. **STOP & VALIDATE**: Run the curl commands for Gate 1 and Gate 2 in quickstart.md
 5. Deploy/validate canonical naming in the Admin Console
 ### Incremental Delivery
 1. Phase 1+2 → Foundation
 2. US1 → Policy contract + canonical naming (MVP)
 3. US2 → Adaptive OCR residency
 4. US3 → Retrieval CPU fallback
 5. US4 → Queue policy
 6. US5 → Full cutover gate verification
 ### Total Task Count
 - **Total**: 39 tasks
 - **US1**: 7 tasks (T010–T016)
 - **US2**: 4 tasks (T017–T020)
 - **US3**: 4 tasks (T021–T024)
 - **US4**: 4 tasks (T025–T028)
 - **US5**: 6 tasks (T029–T034)
 - **Setup**: 4 tasks (T001–T004)
 - **Foundational**: 5 tasks (T005–T009)
 - **Polish**: 5 tasks (T035–T039; T039 is tagged US1)