np-dms/lcbp3

Fork 0

Files

T

admin 71c5e88181

CI / CD Pipeline / build (push) Has been skipped

Details

CI / CD Pipeline / deploy (push) Has been skipped

Details

690611:1705 ADR-035-235 #00 [skip CI]

2026-06-11 17:05:17 +07:00

13 KiB

Raw Blame History

AI Runtime Refactor

เอกสารนี้สรุปผล grilling session สำหรับการ refactor AI runtime หลังอัปเกรด GPU จาก RTX 2060 SUPER 8GB เป็น ASUS DUAL RTX 5060 Ti 16GB

เอกสารอ้างอิง:

เป้าหมาย

เปลี่ยนชื่อโมเดลหลักและ OCR ไปเป็น canonical identities ใหม่
ย้ายสัญญา API จาก caller-driven model selection ไปเป็น policy-driven executionProfile
รวมการจัดการ VRAM ของ main model, OCR, embedding, และ reranking ไว้ใน policy เดียว
ใช้ big bang rollout แบบมีกติกา cutover และ verification ที่รันซ้ำได้

Decision Summary

1. Canonical naming

ใช้ np-dms-ai เป็น canonical model identity เดียวทุกชั้นที่ผู้ใช้และนักพัฒนาเห็น
ใช้ np-dms-ocr เป็น canonical OCR identity เดียวทุกชั้น
ชื่อ runtime/base model จริงเป็น implementation detail ใน Modelfile, deploy script, หรือ ops internals เท่านั้น

2. API contract

caller ส่งได้เพียง executionProfile
caller ห้ามส่ง model.key
caller ห้าม override temperature, top_p, maxTokens, หรือ runtime parameters อื่นโดยตรง
backend policy เป็นผู้ map executionProfile ไปยัง canonical model, runtime parameters, และ keep_alive policy

3. Canonical profile set

โปรไฟล์ระดับ contract มีแค่:

fast
balanced
thai-accurate
large-context

กฎเพิ่ม:

large-context จำกัดเฉพาะ admin/special workflows
งานที่มีผลต่อข้อมูล เช่น migrate-document, auto-fill-document, OCR extraction ใช้ backend override profile เอง

4. Runtime resource policy

np-dms-ai เป็น workload หลักของ generation path
np-dms-ocr ใช้ adaptive residency แทน fixed keep_alive
retrieval acceleration (BGE-M3, BGE-Reranker-Large) อยู่ใน policy เดียวกับ main/OCR
GPU ownership ใช้หลัก LLM-first
ถ้า VRAM headroom ไม่พอ retrieval ต้อง fallback CPU ทันที

5. Queue policy

คงโครง ai-realtime / ai-batch และ pause/resume coordination เดิมเป็นแกน
อนุญาต ai-realtime = 2 ได้เฉพาะ lightweight realtime jobs
rag-query ไม่ใช่ lightweight realtime job
rag-query เป็น generation-centric job: retrieval เป็นขั้นเตรียม context และ fallback CPU ได้

6. Rollout policy

rollout ใช้ Big Bang
cutover จะถือว่าสำเร็จต่อเมื่อผ่านครบทั้ง:
- policy contract
- model switching
- adaptive OCR residency
- RAG fallback

Canonical Models

Canonical Name	บทบาท	Residency policy	หมายเหตุ
`np-dms-ai`	main generation model	resident by default	backend policy คุม runtime parameters
`np-dms-ocr`	OCR model	adaptive	ใช้ policy ตาม VRAM headroom และ active workload

หมายเหตุ:

เอกสารนี้ไม่บังคับว่าฐานจริงต้องเป็น model family ใดเสมอไป
การเปลี่ยน base runtime model ในอนาคตไม่ควรเปลี่ยน canonical API/UI name ถ้า semantics เดิมยังอยู่

Execution Profile Contract

Request DTO

interface CreateAiJobRequest {
  type: 'auto-fill-document' | 'migrate-document' | 'rag-query';
  documentId?: string;
  attachmentId?: string;
  executionProfile?: 'fast' | 'balanced' | 'thai-accurate' | 'large-context';
}

Policy rules

migrate-document: backend override เป็น profile ที่ deterministic สูงเสมอ
auto-fill-document: backend override ได้ตาม data-affecting policy
rag-query: ปกติใช้ balanced หรือ policy ที่ backend กำหนด
large-context: ใช้ได้เฉพาะ admin/special workflows ที่ backend whitelist

Forbidden contract

สิ่งต่อไปนี้ต้องไม่มีใน public contract:

model: {
  key: string;
  parameters: {
    temperature?: number;
    topP?: number;
    maxTokens?: number;
  };
}

เหตุผล:

caller bypass governance ได้
verification matrix โตเกินจำเป็น
profile abstraction หมดความหมายทันที

Adaptive OCR Residency

หลักการ:

np-dms-ocr ไม่ใช้ fixed keep_alive: 0 หรือ fixed keep_alive: 300 ตายตัว
backend policy คำนวณ residency จาก VRAM headroom และ active model/workload ปัจจุบัน
ถ้า active workload กิน VRAM สูง หรือ profile ปัจจุบันเสี่ยงชน headroom ให้ fallback เป็น keep_alive: 0
ถ้า headroom เหลือและไม่มี contention สำคัญ อนุญาต residency window ชั่วคราวได้

ตัวอย่าง policy:

if active_profile == 'large-context' => OCR keep_alive = 0
if active_main_model_pressure == high => OCR keep_alive = 0
if headroom >= policy threshold => OCR keep_alive = short residency window

LLM-First GPU Ownership

ลำดับสิทธิ์ VRAM:

np-dms-ai
np-dms-ocr
BGE-M3
BGE-Reranker-Large

ผลเชิงพฤติกรรม:

retrieval path ใช้ GPU ได้เฉพาะเมื่อ policy ระบุว่ามี headroom จริง
retrieval path ไม่มีสิทธิ์บังคับรอ GPU เพื่อแย่ง resource จาก main/OCR path
หาก headroom ไม่พอ embed และ rerank ต้อง fallback CPU ทันที

Retrieval Acceleration

Scope

เอกสารนี้ถือว่า retrieval acceleration เป็นส่วนหนึ่งของ runtime resource policy เดียวกัน ไม่ใช่ tuning แยก

Sidecar policy

ปัจจุบัน:

POST /embed   -> CPU
POST /rerank  -> CPU

เป้าหมาย:

POST /embed   -> GPU เมื่อ headroom ผ่าน policy, ไม่เช่นนั้นใช้ CPU
POST /rerank  -> GPU เมื่อ headroom ผ่าน policy, ไม่เช่นนั้นใช้ CPU
POST /ocr-upload -> OCR path ตาม adaptive OCR residency
POST /normalize  -> CPU

Retrieval fallback rule

ห้าม queue รอ GPU เพื่อให้ retrieval ได้ acceleration
ห้าม fail hard เพียงเพราะ GPU ไม่พอ
ให้ degrade ไป CPU แล้วตอบงานต่อ

Queue and Scheduling

Baseline

ai-batch ยังสามารถถูก pause/resume โดย realtime path ตาม coordination model เดิม
ai-realtime = 1 ยังคงเป็น baseline สำหรับงาน generation-heavy

Selective realtime uplift

อนุญาต ai-realtime = 2 เฉพาะกลุ่มงานที่เป็น lightweight realtime jobs เช่น:

intent classification ที่ไม่เรียก OCR
tool-only suggestion path ที่ไม่บังคับ model switching
metadata-free chat steps ที่ไม่ใช้ GPU-heavy generation

ไม่รวม:

rag-query
OCR-triggering jobs
งานที่บังคับ model switching
generation-heavy jobs

Big Bang Rollout

Decision

refactor รอบนี้ใช้ big bang rollout เพราะระบบยังไม่เปิด production

Consequence

ห้ามใช้เกณฑ์ partial success แบบ "บางแกนผ่านก็ถือว่าปล่อยได้"

Cutover gate

ต้องผ่านครบทุกแกน:

policy contract ใหม่ทำงานจริง
canonical naming ใหม่ทำงานจริง
model switching และ OCR residency ตรง policy ใหม่
retrieval GPU/CPU fallback ทำงานจริง

Verification

ใช้แนวทาง executable-first แต่ทุกแกนต้องมี manual validation path ประกบ

1. Policy contract

Executable:

unit/integration tests สำหรับ DTO และ policy mapping
tests ว่า caller ส่ง model.key หรือ parameter overrides ไม่ได้
tests ว่า data-affecting jobs ถูก backend override profile จริง

Manual:

ยิง request จาก admin/sandbox แล้วตรวจว่า UI/API ไม่ expose free-form model selection

2. Canonical naming

Executable:

search-based checks ว่า public-facing contract ใช้ np-dms-ai / np-dms-ocr
tests สำหรับ settings/service/controller ที่คืนชื่อ canonical

Manual:

เปิด AI Admin Console และ OCR sandbox ตรวจ label/option/log surface ที่ผู้ใช้เห็น

3. Adaptive OCR residency

Executable:

tests ว่า residency policy ให้ keep_alive ต่างกันตาม headroom scenario
logs/trace ว่า OCR requests ใช้ residency decision ตาม policy

Manual:

รัน OCR ซ้ำหลายงานในเงื่อนไข headroom ต่างกันและตรวจ behavior จริง

4. Retrieval fallback

Executable:

tests ว่า /embed และ /rerank fallback CPU เมื่อ GPU threshold ไม่ผ่าน
trace/log ว่า rag-query ยังตอบได้เมื่อ GPU retrieval path ถูกปิด

Manual:

ทดลอง RAG query ภายใต้ภาระ GPU สูงและยืนยันว่าคำตอบยังออกได้แม้ช้าลง

Implementation Workstreams

Workstream A: Contract and naming

เปลี่ยน public contract ให้ใช้ executionProfile
ลบ model.key และ parameter override จาก API docs/DTO ที่เกี่ยวข้อง
เปลี่ยน public-facing names เป็น np-dms-ai และ np-dms-ocr

Workstream B: Runtime policy

สร้าง policy mapping profile -> runtime configuration
เพิ่ม adaptive OCR residency logic
แยก policy ของ data-affecting jobs ออกจาก caller input

Workstream C: Retrieval acceleration

เพิ่ม GPU eligibility check สำหรับ embed และ rerank
เพิ่ม CPU fallback path ที่ explicit
บันทึก telemetry/log สำหรับ fallback decisions

Workstream D: Queue policy

คง pause/resume coordination เดิม
แยก lightweight realtime jobs ออกจาก generation-heavy jobs
ใช้ selective concurrency uplift เฉพาะ job ที่ allowed

Workstream E: Verification

เพิ่ม automated tests ตาม cutover gate
เพิ่ม manual validation checklist สำหรับ admin console, OCR sandbox, และ RAG path

Non-Goals

ไม่เปิดให้ caller เลือก runtime parameters เอง
ไม่เปลี่ยน rag-query ให้เป็น retrieval-first job
ไม่ยกเลิก pause/resume coordination เดิมทั้งหมด
ไม่แยก retrieval acceleration ออกเป็น policy คนละชุดกับ main/OCR
ไม่ใช้ phased rollout ในเอกสารฉบับนี้

Migration Note for Current Repo

repo ปัจจุบันยังมีจุดที่อิงชื่อและ policy เดิม เช่น typhoon2.5-np-dms:latest, typhoon-np-dms-ocr:latest, และ keep_alive: 0 ในหลาย service/spec. เอกสารนี้จึงเป็น target architecture/policy ใหม่ และต้องมีการอัปเดตโค้ด, tests, cross-spec docs, และ admin UI ให้สอดคล้องก่อนจะถือว่า cutover สำเร็จ.

13 KiB Raw Blame History

AI Runtime Refactor

เป้าหมาย

Decision Summary

1. Canonical naming

2. API contract

3. Canonical profile set

4. Runtime resource policy

5. Queue policy

6. Rollout policy

Canonical Models

Execution Profile Contract

Request DTO

Policy rules

Forbidden contract

Adaptive OCR Residency

LLM-First GPU Ownership

Retrieval Acceleration

Scope

Sidecar policy

Retrieval fallback rule

Queue and Scheduling

Baseline

Selective realtime uplift

Big Bang Rollout

Decision

Consequence

Cutover gate

Verification

1. Policy contract

2. Canonical naming

3. Adaptive OCR residency

4. Retrieval fallback

Implementation Workstreams

Workstream A: Contract and naming

Workstream B: Runtime policy

Workstream C: Retrieval acceleration

Workstream D: Queue policy

Workstream E: Verification

Non-Goals

Migration Note for Current Repo

13 KiB

Raw Blame History