Files
lcbp3/specs/100-Infrastructures/140-ocr-sidecar-refactor/data-model.md
T
admin a80ebef285
CI / CD Pipeline / build (push) Successful in 7m37s
CI / CD Pipeline / deploy (push) Failing after 20m15s
refactor(ai): OCR sidecar canonical naming cleanup — typhoon→np-dms, remove hardcoded keys, asyncio.to_thread, ADR-040/041
2026-06-20 16:37:04 +07:00

9.0 KiB

Data Model: OCR Sidecar Refactor

Date: 2026-06-20
Purpose: Define data contracts and entity relationships for OCR sidecar refactor

Overview

The OCR sidecar is a pure compute worker with no database access (ADR-023/023A boundary). All data persistence and business logic remain in backend services. This document defines the data contracts between backend and sidecar.

Entities

OCR Request (Backend → Sidecar)

interface OcrRequest {
  pdfPath: string;           // Absolute path to PDF file (whitelisted)
  systemPrompt?: string;     // System prompt from Active Prompt
  dmsTags?: {                // DMS extraction tags from Active Prompt
    documentNumber?: string;
    documentDate?: string;
    receivedDate?: string;
  };
  runtimeParams: {           // Runtime parameters from ai_execution_profiles
    temperature: number;
    top_p: number;
    repeat_penalty: number;
    max_tokens: number;
  };
  pageRange?: {              // Page range for processing
    start: number;
    end: number;
  };
}

OCR Response (Sidecar → Backend)

interface OcrResponse {
  text: string;              // Extracted text (Markdown format)
  ocrUsed: boolean;          // Whether OCR was used (vs fast-path text layer)
  modelUsed: string;         // Model identifier (e.g., "typhoon-np-dms-ocr")
  processingTimeMs: number;  // Processing time in milliseconds
  error?: string;            // Error message if failed
}

AI Execution Profile (Database)

-- Existing table (no schema changes)
CREATE TABLE ai_execution_profiles (
  id INT AUTO_INCREMENT PRIMARY KEY,
  profile_name VARCHAR(100) UNIQUE NOT NULL,
  model_name VARCHAR(100) NOT NULL,
  parameters JSON NOT NULL,  -- { temperature, top_p, repeat_penalty, max_tokens, keep_alive }
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- Row for OCR extraction:
-- profile_name = 'ocr-extract'
-- parameters = { temperature: 0.7, top_p: 0.9, repeat_penalty: 1.1, max_tokens: 4096 }

Active Prompt (Database)

-- Existing table (no schema changes per ADR-029/037)
CREATE TABLE ai_prompts (
  id INT AUTO_INCREMENT PRIMARY KEY,
  public_id UUID,
  prompt_type VARCHAR(50) NOT NULL,  -- 'ocr_extraction'
  template TEXT NOT NULL,           -- System prompt template with {{ocr_text}} placeholder
  context_config JSON,               -- DMS tags configuration
  version INT NOT NULL,
  is_active TINYINT(1) DEFAULT 0,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  UNIQUE KEY (prompt_type, version)
);

-- Active prompt for OCR extraction:
-- prompt_type = 'ocr_extraction'
-- template = "Extract document metadata from: {{ocr_text}}..."
-- context_config = { dmsTags: { documentNumber: true, documentDate: true, receivedDate: true } }

Data Flow

Phase 1: OCR Request Flow (Before ADR-041)

Backend OcrService
  ↓
1. Resolve parameters from ai_execution_profiles (row 'ocr-extract')
2. Resolve Active Prompt from ai_prompts (type 'ocr_extraction')
3. Extract systemPrompt and DMS tags from Active Prompt
4. Build OcrRequest with parameters, systemPrompt, DMS tags
5. Send POST /ocr with X-API-Key header to sidecar
  ↓
Sidecar (app.py)
  ↓
1. Validate X-API-Key
2. Canonicalize pdfPath and check whitelist
3. Extract systemPrompt and DMS tags from request
4. Call calculate_ocr_residency(active_profile) for keep_alive
5. Process OCR with Ollama (inject systemPrompt + DMS tags)
6. Return OcrResponse
  ↓
Backend OcrService
  ↓
1. Parse OcrResponse
2. Return extracted text to caller

Phase 2: OCR Request Flow (After ADR-041)

Backend OcrService
  ↓
1. Resolve parameters from ai_execution_profiles (row 'ocr-extract')
2. Resolve Active Prompt from ai_prompts (type 'ocr_extraction')
3. Extract systemPrompt and DMS tags from Active Prompt
4. Build OcrRequest with parameters, systemPrompt, DMS tags
5. Send POST /ocr (NO X-API-Key header) to sidecar
  ↓
Sidecar (app.py)
  ↓
1. NO X-API-Key validation (network isolation only)
2. Canonicalize pdfPath and check whitelist
3. Extract systemPrompt and DMS tags from request
4. Call calculate_ocr_residency(active_profile) for keep_alive
5. Process OCR with Ollama (inject systemPrompt + DMS tags)
6. Return OcrResponse
  ↓
Backend OcrService
  ↓
1. Parse OcrResponse
2. Return extracted text to caller

Backend Service Changes

OcrService Parameter Resolution

// backend/src/modules/ai/services/ocr.service.ts

async extractMetadata(documentId: string): Promise<AIMetadata> {
  // 1. Resolve runtime parameters from ai_execution_profiles
  const profile = await this.aiProfilesService.getActiveProfile('ocr-extract');
  const runtimeParams = profile.parameters; // { temperature, top_p, repeat_penalty, max_tokens }

  // 2. Resolve Active Prompt
  const activePrompt = await this.aiPromptsService.getActivePrompt('ocr_extraction');
  const systemPrompt = activePrompt.template;
  const dmsTags = activePrompt.context_config?.dmsTags || {};

  // 3. Build request
  const ocrRequest: OcrRequest = {
    pdfPath: document.filePath,
    systemPrompt,
    dmsTags,
    runtimeParams,
  };

  // 4. Send to sidecar (with X-API-Key in Phase 1)
  const response = await this.httpClient.post(
    `${this.ocrApiUrl}/ocr`,
    ocrRequest,
    { headers: { 'X-API-Key': this.ocrApiKey } } // Phase 1 only
  );

  return response.data;
}

SandboxOcrEngineService Parameter Resolution

// backend/src/modules/ai/services/sandbox-ocr-engine.service.ts

async processSandboxOcr(request: SandboxOcrRequest): Promise<SandboxOcrResult> {
  // Same parameter resolution pattern as OcrService
  const profile = await this.aiProfilesService.getActiveProfile('ocr-extract');
  const activePrompt = await this.aiPromptsService.getActivePrompt('ocr_extraction');

  const ocrRequest: OcrRequest = {
    pdfPath: request.pdfPath,
    systemPrompt: activePrompt.template,
    dmsTags: activePrompt.context_config?.dmsTags || {},
    runtimeParams: profile.parameters,
  };

  const response = await this.httpClient.post(
    `${this.ocrApiUrl}/ocr`,
    ocrRequest,
    { headers: { 'X-API-Key': this.ocrApiKey } } // Phase 1 only
  );

  return response.data;
}

Sidecar API Changes

POST /ocr Request Body

# specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/app.py

from pydantic import BaseModel

class OcrRequest(BaseModel):
    pdf_path: str
    system_prompt: Optional[str] = None
    dms_tags: Optional[Dict[str, str]] = None
    runtime_params: RuntimeParams
    page_range: Optional[PageRange] = None

class RuntimeParams(BaseModel):
    temperature: float
    top_p: float
    repeat_penalty: float
    max_tokens: int

class PageRange(BaseModel):
    start: int
    end: int

POST /ocr Response Body

class OcrResponse(BaseModel):
    text: str
    ocr_used: bool
    model_used: str
    processing_time_ms: float
    error: Optional[str] = None

Environment Variables

Sidecar Environment Variables

# specs/04-Infrastructure-OPS/04-00-docker-compose/Desk-5439/ocr-sidecar/.env

# Phase 1 (before ADR-041)
OCR_SIDECAR_API_KEY=required_value  # Fail-fast if missing

# Phase 2 (after ADR-041) - remove OCR_SIDECAR_API_KEY

# Common variables
OCR_SIDECAR_UPLOAD_BASE=/mnt/uploads  # CIFS mount base path
OLLAMA_API_URL=http://localhost:11434
TYPHOON_OCR_MODEL=typhoon-np-dms-ocr:latest

Backend Environment Variables

# backend/.env

# Phase 1 (before ADR-041)
OCR_API_URL=http://192.168.10.100:8765
OCR_API_KEY=required_value  # Send-side X-API-Key

# Phase 2 (after ADR-041) - remove OCR_API_KEY

# Common variables
OCR_SIDECAR_UPLOAD_BASE=/app/uploads  # Backend view of uploads

Validation Rules

Path Canonicalization (Sidecar)

def validate_pdf_path(pdf_path: str, base_path: str) -> str:
    """Canonicalize and whitelist PDF path"""
    # 1. Canonicalize path
    canonical = os.path.abspath(os.path.realpath(pdf_path))
    
    # 2. Check whitelist
    if not canonical.startswith(base_path):
        raise HTTPException(
            status_code=403,
            detail="Path outside whitelisted base directory"
        )
    
    return canonical

Parameter Validation (Backend)

// Validate runtime parameters from ai_execution_profiles
function validateRuntimeParams(params: any): RuntimeParams {
  if (!params.temperature || params.temperature < 0 || params.temperature > 2) {
    throw new BusinessException('Invalid temperature value');
  }
  if (!params.top_p || params.top_p < 0 || params.top_p > 1) {
    throw new BusinessException('Invalid top_p value');
  }
  // ... similar validation for other params
  return params;
}

No Schema Changes

This refactor does not require database schema changes:

  • ai_execution_profiles table already exists (ADR-036)
  • ai_prompts table already exists (ADR-029/037)
  • No new tables or columns needed
  • Per ADR-009: No TypeORM migrations (edit SQL directly if needed, but not needed here)