// File: specs/200-fullstacks/227-ai-admin-console/spec.md
// Change Log:
// - 2026-05-20: Feature Specification for the AI Admin Console system
// - 2026-05-21: Restructured following spec-template.md with User Stories, FRs, Success Criteria
Feature Specification: AI Admin Console
Feature Branch: 227-ai-admin-console
Created: 2026-05-20
Status: Draft
Category: 200-fullstacks
Input: ADR-027 AI Admin Panel and Dynamic Control Architecture
User Scenarios & Testing
User Story 1 - Superadmin Toggles AI System On/Off (Priority: P1)
As a Superadmin, I need to dynamically enable or disable AI features for all regular users without redeploying the system, so that I can perform maintenance, manage system load, or handle AI infrastructure issues gracefully.
Why this priority: This is the core control mechanism of the feature. Without it, the admin cannot perform emergency maintenance or manage system resources during high load periods.
Independent Test: Can be fully tested by a Superadmin toggling the AI switch and observing that regular users immediately see the disabled state (within polling interval) while the Superadmin retains full access.
Acceptance Scenarios:
- Given the AI system is currently enabled, When a Superadmin toggles the switch to disabled, Then the setting is persisted to database and cache, and regular users see disabled AI buttons within 30 seconds
- Given the AI system is currently disabled, When a Superadmin toggles the switch to enabled, Then regular users can access AI features again after the polling interval
- Given a regular user has AI permissions, When they attempt to use AI features while the system is disabled, Then they receive HTTP 503 with a user-friendly message explaining temporary unavailability
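The persistence path in these scenarios can be sketched as a small service that writes the database first and then refreshes the cache. This is a minimal sketch under stated assumptions: `SettingsDb`, `SettingsCache`, and `AiToggleService` are hypothetical names, the storage calls are synchronous stand-ins for the real database and Redis clients, and treating a missing row as "enabled" is an assumption the spec does not state.

```typescript
// Hypothetical storage interfaces: the spec names a system_settings table and
// a Redis cache, but not their client APIs.
interface SettingsDb {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}
interface SettingsCache {
  get(key: string): string | undefined; // may throw when the cache is unreachable
  set(key: string, value: string): void;
}

const AI_FEATURES_ENABLED = "AI_FEATURES_ENABLED";

class AiToggleService {
  private readonly db: SettingsDb;
  private readonly cache: SettingsCache;

  constructor(db: SettingsDb, cache: SettingsCache) {
    this.db = db;
    this.cache = cache;
  }

  // Write the database first (source of truth), then refresh the cache entry.
  setEnabled(enabled: boolean): void {
    const value = String(enabled);
    this.db.set(AI_FEATURES_ENABLED, value);
    try {
      this.cache.set(AI_FEATURES_ENABLED, value);
    } catch {
      // Cache unavailable: reads will fall back to the database.
    }
  }

  // Cache-first read; a cache miss or cache error falls back to the database.
  isEnabled(): boolean {
    let raw: string | undefined;
    try {
      raw = this.cache.get(AI_FEATURES_ENABLED);
    } catch {
      raw = undefined;
    }
    if (raw === undefined) {
      // Assumption: a missing row means AI is enabled.
      raw = this.db.get(AI_FEATURES_ENABLED) ?? "true";
      try {
        this.cache.set(AI_FEATURES_ENABLED, raw);
      } catch {
        // Ignore: the value was still read from the database.
      }
    }
    return raw === "true";
  }
}
```

Writing the database before the cache means a failed cache write can at worst leave a stale cached value, never a cached value that was not persisted. A short TTL on the cache entry would bound that staleness.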
User Story 2 - Normal Users Experience Soft Fallback (Priority: P1)
As a regular user with AI permissions, I need clear visual feedback when AI features are temporarily disabled, so that I understand why AI buttons are unavailable and can complete my work manually without confusion.
Why this priority: Critical for user experience. Abrupt feature disappearance creates confusion and support tickets. Soft fallback maintains user trust.
Independent Test: Can be tested by disabling AI system and verifying that regular users see disabled buttons with tooltips and global banner, rather than errors or missing UI elements.
Acceptance Scenarios:
- Given the AI system is disabled by admin, When a regular user views a document form with AI suggestion buttons, Then those buttons appear disabled with a tooltip explaining "ระบบ AI ไม่พร้อมใช้งานชั่วคราว" ("The AI system is temporarily unavailable")
- Given the AI system is disabled, When a regular user loads any page, Then a global banner appears at the top stating AI is temporarily unavailable
- Given a regular user attempts direct API access to AI endpoints while disabled, When the request is made, Then the system returns HTTP 503 with recovery guidance
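The soft-fallback rules above can be expressed as one pure function from the polled AI status and the user's permissions to a UI state. This is a sketch only: `AiUiState`, `UiUser`, and `deriveAiUiState` are hypothetical names, hiding buttons from users who lack AI permissions is out of scope here, and not showing the banner to a Superadmin is an assumption.

```typescript
interface AiUiState {
  buttonsDisabled: boolean;
  tooltip: string | null;
  showBanner: boolean;
}

interface UiUser {
  hasAiPermission: boolean;
  isSuperadmin: boolean;
}

// Tooltip text taken from the acceptance scenario above.
const AI_UNAVAILABLE_TOOLTIP = "ระบบ AI ไม่พร้อมใช้งานชั่วคราว";

function deriveAiUiState(aiEnabled: boolean, user: UiUser): AiUiState {
  // Superadmins keep full access even while AI is disabled for everyone else.
  if (aiEnabled || user.isSuperadmin) {
    return { buttonsDisabled: false, tooltip: null, showBanner: false };
  }
  return {
    buttonsDisabled: true,
    tooltip: AI_UNAVAILABLE_TOOLTIP,
    // The banner is only relevant to users who could otherwise use AI features.
    showBanner: user.hasAiPermission,
  };
}
```

Keeping this logic in one function makes it straightforward to unit-test that buttons are disabled rather than removed.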
User Story 3 - Superadmin Monitors AI Health Status (Priority: P2)
As a Superadmin, I need real-time visibility into AI infrastructure health (Ollama, Qdrant, BullMQ queues), so that I can diagnose issues, monitor latency, and make informed decisions about enabling/disabling AI services.
Why this priority: Essential for operational awareness but secondary to the control mechanism itself.
Independent Test: Can be tested by accessing the AI Admin Console health dashboard and verifying all metrics display correctly with appropriate status indicators.
Acceptance Scenarios:
- Given the AI Admin Console is accessed, When a Superadmin views the health panel, Then they see Ollama latency, active model version, Qdrant collection stats, and BullMQ queue metrics (waiting/active/failed jobs)
- Given a service is experiencing issues, When health check runs, Then the status displays as degraded/down with relevant metrics highlighted
- Given the Superadmin is monitoring the system, When they refresh or view the dashboard, Then metrics are cached for 30 seconds to prevent excessive load
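The 30-second caching in the last scenario amounts to a single-value cache with a time-to-live. A minimal sketch follows, assuming a hypothetical `TtlCache` class and a synchronous loader for brevity. A real health check would be asynchronous, in which case the cached value could be a promise.

```typescript
class TtlCache<T> {
  private value: T | undefined;
  private hasValue = false;
  private expiresAt = 0;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so the expiry logic can be tested deterministically.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value while it is fresh; otherwise reload and re-stamp it.
  get(load: () => T): T {
    const t = this.now();
    if (this.hasValue && t < this.expiresAt) {
      return this.value as T;
    }
    const fresh = load();
    this.value = fresh;
    this.hasValue = true;
    this.expiresAt = t + this.ttlMs;
    return fresh;
  }
}

// 30 seconds, as stated in the scenario above.
const HEALTH_CACHE_TTL_MS = 30_000;
```

With this shape, repeated dashboard refreshes inside the window reuse one set of probe results instead of reaching Ollama, Qdrant, and BullMQ each time.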
User Story 4 - Superadmin Uses RAG Playground Sandbox (Priority: P2)
As a Superadmin, I need an isolated RAG testing environment where I can query documents and receive AI-generated responses with citations, so that I can test and refine AI behavior without affecting production queues or user experiences.
Why this priority: Enables safe testing and troubleshooting of AI capabilities during maintenance windows.
Independent Test: Can be tested by submitting a RAG query in the sandbox and receiving a complete response with document citations, while verifying the job runs through the isolated sandbox queue.
Acceptance Scenarios:
- Given the AI system is disabled for regular users, When a Superadmin submits a RAG query in the sandbox, Then the query processes through the isolated queue and returns results with citations
- Given a RAG job is submitted, When it is processing, Then the Superadmin can poll for status updates every 5 seconds and see progress
- Given the sandbox queue has multiple jobs, When jobs are processed, Then Superadmin jobs have SUPERADMIN priority (higher than regular batch jobs)
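The priority rule in the last scenario can be illustrated with a plain comparator. BullMQ treats a lower numeric priority as more urgent, so this sketch follows that convention. The numeric values, the `JobSource` and `QueuedJob` types, and `processingOrder` are hypothetical and not taken from the spec.

```typescript
type JobSource = "SUPERADMIN" | "BATCH";

// Lower number = processed first (the convention BullMQ uses for job priority).
// The concrete values are assumptions.
const JOB_PRIORITY: Record<JobSource, number> = {
  SUPERADMIN: 1,
  BATCH: 10,
};

interface QueuedJob {
  id: string;
  source: JobSource;
  enqueuedAt: number; // epoch milliseconds
}

// Order jobs by priority, then first-in-first-out within the same priority.
function processingOrder(jobs: readonly QueuedJob[]): QueuedJob[] {
  return [...jobs].sort(
    (a, b) =>
      JOB_PRIORITY[a.source] - JOB_PRIORITY[b.source] ||
      a.enqueuedAt - b.enqueuedAt,
  );
}
```

In an actual queue the same number would be passed as the job's priority option at enqueue time instead of sorting in application code.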
User Story 5 - Superadmin Uses OCR Sandbox for Metadata Extraction (Priority: P2)
As a Superadmin, I need to upload PDF files to an isolated OCR sandbox to test metadata extraction capabilities, so that I can validate AI accuracy and tune extraction parameters without impacting production document processing.
Why this priority: Supports AI tuning and validation workflows, enabling data-driven improvements to extraction accuracy.
Independent Test: Can be tested by uploading a PDF to the OCR sandbox and receiving extracted metadata in JSON format with confidence scores.
Acceptance Scenarios:
- Given a PDF file is uploaded to the OCR sandbox, When processing completes, Then the system returns extracted metadata as formatted JSON with syntax highlighting
- Given an OCR job is submitted, When processing fails, Then the error is displayed inline in a red box with actionable guidance
- Given the queue length is >= 3, When additional sandbox requests are made, Then dynamic rate limiting applies (10 requests/hour per user)
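The dynamic rate limit in the last scenario can be sketched as a pure decision function. The thresholds (queue length 3, 10 requests per hour per user) come from the spec. Using a sliding one-hour window over request timestamps is an assumption, and `isSandboxRequestAllowed` is a hypothetical name.

```typescript
const QUEUE_THRESHOLD = 3;
const MAX_REQUESTS_PER_HOUR = 10;
const ONE_HOUR_MS = 60 * 60 * 1000;

// Decide whether one user may submit another sandbox request.
// `userRequestTimestamps` holds that user's earlier request times in epoch ms.
function isSandboxRequestAllowed(
  queueLength: number,
  userRequestTimestamps: readonly number[],
  now: number,
): boolean {
  // Below the threshold the queue is considered idle, so no limit applies.
  if (queueLength < QUEUE_THRESHOLD) {
    return true;
  }
  // Count only requests made within the last hour (sliding window).
  const recent = userRequestTimestamps.filter((t) => now - t < ONE_HOUR_MS).length;
  return recent < MAX_REQUESTS_PER_HOUR;
}
```

Because the limit depends on the current queue length, a user who was throttled becomes unthrottled as soon as the queue drains below three jobs.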
Edge Cases
- EC-001: What happens when Redis cache is unavailable? System must fall back to database query with <100ms latency penalty
- EC-002: How does the system handle concurrent toggle requests? Last-write-wins with optimistic locking; the cache is invalidated after a successful write
- EC-003: What if Ollama/Qdrant times out during health check? Health service returns DEGRADED status, not DOWN; timeout is 5 seconds per service
- EC-004: How are long-running sandbox jobs handled? Job status polling available; jobs can be cancelled by admin; results cached for 1 hour
- EC-005: What happens if a Superadmin loses permissions mid-session? Next API request returns 403; UI redirects to unauthorized page
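EC-003 can be sketched by racing each health probe against a timer. The 5-second timeout and the DEGRADED-on-timeout rule come from the edge case above. Mapping a hard failure (a rejected probe) to DOWN is an assumption, and `probeWithTimeout` and `ProbeResult` are hypothetical names.

```typescript
type ServiceStatus = "HEALTHY" | "DEGRADED" | "DOWN";

interface ProbeResult {
  status: ServiceStatus;
  latencyMs: number | null;
}

const HEALTH_TIMEOUT_MS = 5_000;

// Run one health probe; a probe that does not settle in time is DEGRADED, not DOWN.
async function probeWithTimeout(
  probe: () => Promise<void>,
  timeoutMs: number = HEALTH_TIMEOUT_MS,
): Promise<ProbeResult> {
  const started = Date.now();
  let cancelTimer: () => void = () => {};
  const timeout = new Promise<"timeout">((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    cancelTimer = () => clearTimeout(timer);
  });
  try {
    const outcome = await Promise.race([probe().then(() => "ok" as const), timeout]);
    if (outcome === "timeout") {
      return { status: "DEGRADED", latencyMs: null };
    }
    return { status: "HEALTHY", latencyMs: Date.now() - started };
  } catch {
    // Assumption: an explicit failure (e.g. connection refused) counts as DOWN.
    return { status: "DOWN", latencyMs: null };
  } finally {
    cancelTimer();
  }
}
```

Each service (Ollama, Qdrant, BullMQ) would get its own call, so one slow dependency cannot delay the status of the others beyond the timeout.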
Requirements
Functional Requirements
- FR-001: System MUST provide a toggle switch accessible only to Superadmin (system.manage_all) to enable/disable AI features system-wide
- FR-002: System MUST persist AI enabled/disabled state to the system_settings table with Redis caching for <1ms latency on status checks
- FR-003: System MUST display disabled AI buttons with explanatory tooltips to regular users when AI is turned off
- FR-004: System MUST show a global banner at the top of all pages when AI is disabled, visible only to users with AI permissions
- FR-005: System MUST return HTTP 503 Service Unavailable to regular users attempting AI API calls when AI is disabled
- FR-006: System MUST allow Superadmins full AI access (including sandbox) even when AI is disabled for regular users
- FR-007: System MUST provide health monitoring dashboard showing Ollama latency, model version, Qdrant stats, and BullMQ queue metrics
- FR-008: System MUST cache health check results for 30 seconds to prevent excessive infrastructure load
- FR-009: System MUST provide isolated RAG sandbox queue (ai-admin-sandbox) with SUPERADMIN job priority
- FR-010: System MUST provide isolated OCR sandbox for PDF metadata extraction with JSON output and syntax highlighting
- FR-011: System MUST implement dynamic rate limiting for sandbox based on queue length (queue < 3: no limit, queue >= 3: 10 req/hr)
- FR-012: System MUST poll AI status every 30 seconds from frontend for users with AI permissions
- FR-013: System MUST support job status polling every 5 seconds for sandbox operations
- FR-014: System MUST implement AiEnabledGuard with layered permission check (system.manage_all + ai.suggest/ai.rag_query bypass)
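One possible reading of the layered check in FR-014, together with FR-005 and FR-006, is sketched below as a pure decision function. `checkAiAccess` and `GuardDecision` are hypothetical names. Returning 403 for callers with no AI permission and checking permissions before the enabled flag are both assumptions. The real AiEnabledGuard would wrap logic like this in the framework's guard mechanism.

```typescript
interface GuardDecision {
  allowed: boolean;
  httpStatus: 200 | 403 | 503;
  message?: string;
}

function checkAiAccess(
  aiEnabled: boolean,
  permissions: ReadonlySet<string>,
): GuardDecision {
  // Layer 1: Superadmins bypass the AI-enabled flag entirely (FR-006).
  if (permissions.has("system.manage_all")) {
    return { allowed: true, httpStatus: 200 };
  }
  // Layer 2: the caller needs at least one AI permission (403 is an assumption).
  const hasAiPermission =
    permissions.has("ai.suggest") || permissions.has("ai.rag_query");
  if (!hasAiPermission) {
    return { allowed: false, httpStatus: 403, message: "AI permission required" };
  }
  // Layer 3: regular users are blocked with 503 while AI is disabled (FR-005).
  if (!aiEnabled) {
    return {
      allowed: false,
      httpStatus: 503,
      message: "AI features are temporarily unavailable. Please try again later.",
    };
  }
  return { allowed: true, httpStatus: 200 };
}
```

Keeping the decision separate from the HTTP layer lets the security tests mentioned in SC-007 cover every permission combination without starting a server.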
Key Entities
- SystemSetting: Stores dynamic configuration values (AI_FEATURES_ENABLED, etc.) with metadata (data_type, category, validation_rules)
- SandboxJob: Represents a sandbox operation (RAG query or OCR extraction) with priority, status, and results
- HealthStatus: Aggregated health metrics from Ollama, Qdrant, and BullMQ with status indicators (HEALTHY/DEGRADED/DOWN)
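The three entities could be typed roughly as follows. Only the entity names, the metadata fields listed for SystemSetting, and the HEALTHY/DEGRADED/DOWN levels come from the spec. Every other field name, the job status values, and the worst-of aggregation rule in `overallHealth` are assumptions made for illustration.

```typescript
type HealthLevel = "HEALTHY" | "DEGRADED" | "DOWN";

interface SystemSetting {
  key: string; // e.g. "AI_FEATURES_ENABLED"
  value: string;
  dataType: "boolean" | "number" | "string" | "json";
  category: string;
  validationRules?: Record<string, unknown>;
}

interface SandboxJob {
  id: string;
  kind: "RAG_QUERY" | "OCR_EXTRACTION";
  priority: number;
  status: "WAITING" | "ACTIVE" | "COMPLETED" | "FAILED" | "CANCELLED";
  result?: unknown;
  error?: string;
}

interface HealthStatus {
  ollama: HealthLevel;
  qdrant: HealthLevel;
  bullmq: HealthLevel;
  checkedAt: string; // ISO 8601 timestamp
}

// Assumed aggregation rule: the overall level is the worst individual level.
function overallHealth(status: HealthStatus): HealthLevel {
  const levels: HealthLevel[] = [status.ollama, status.qdrant, status.bullmq];
  if (levels.includes("DOWN")) return "DOWN";
  if (levels.includes("DEGRADED")) return "DEGRADED";
  return "HEALTHY";
}
```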
Success Criteria
Measurable Outcomes
- SC-001: Superadmin can toggle AI system state with changes reflected to regular users within 30 seconds
- SC-002: AI status check API responds in under 1ms when cached, under 50ms on cache miss
- SC-003: 100% of regular users see disabled AI buttons with tooltips when AI is turned off (no hidden or broken UI)
- SC-004: Health dashboard displays all 3 services (Ollama, Qdrant, BullMQ) with <5 second data staleness
- SC-005: Sandbox RAG queries return complete responses with citations within 2x normal queue processing time
- SC-006: Sandbox OCR extraction returns valid JSON for 95% of test PDFs with clear error messages for failures
- SC-007: Zero unauthorized access to admin endpoints (verified by security tests)
- SC-008: System gracefully degrades when AI disabled with zero error reports from confused users