690503:0135 Update workflow #01
CI / CD Pipeline / build (push) Failing after 6m6s
CI / CD Pipeline / deploy (push) Has been skipped

2026-05-03 01:35:05 +07:00
parent d239b58387
commit 2c24991f88
85 changed files with 6335 additions and 100 deletions
@@ -0,0 +1,45 @@
# Specification Quality Checklist: Unified Workflow Engine — Production Hardening & Integrated Context
**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-05-02
**Feature**: [spec.md](../spec.md)
---
## Content Quality
- [~] No implementation details (languages, frameworks, APIs) — *Note: Technology-specific terms (Redis, BullMQ, ClamAV, JSON Logic) are present in FRs as ADR-mandated architectural constraints (ADR-001/ADR-008/ADR-016), not spec-level implementation choices. Consistent with existing `001-transmittals-circulation/spec.md` pattern.*
- [x] Focused on user value and business needs
- [~] Written for non-technical stakeholders — *Note: Platform/infrastructure feature; technical Functional Requirements (FR-001 to FR-021) intentionally use ADR terminology. User Stories (P1-P3) and Success Criteria are non-technical.*
- [x] All mandatory sections completed
## Requirement Completeness
- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified
## Feature Readiness
- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification
## Notes
- Spec derived from ADR-001 (Unified Workflow Engine v1.1 — 2026-05-02 production hardening) and ADR-021 (Integrated Workflow Context & Step-specific Attachments)
- **Clarification session 2026-05-02 (5/5 questions resolved):**
  - Q1: DSL `require.role` → CASL ability check (FR-002a)
  - Q2: Observability = structured log + metrics (FR-022, FR-023, SC-009)
  - Q3: File rollback on DB failure = move back to temp, 24h TTL (FR-019)
  - Q4: Admin UI for DSL authoring is IN scope (FR-024, FR-025)
  - Q5: All 4 modules (RFA/Transmittal/Circulation/Correspondence) need banner gap-filling (FR-011, Assumptions updated)
- ADR-001 clarifications fully captured in FR-001 through FR-010 and SC-001 through SC-005
- ADR-021 requirements (REQ-01 to REQ-06) fully captured in FR-011 through FR-025 and SC-006 through SC-009
- Visual workflow builder (drag-and-drop DSL editor) is explicitly **out of scope** (Phase 2)
@@ -0,0 +1,205 @@
openapi: "3.1.0"
info:
  title: Workflow Engine - Definitions API
  version: "1.1.0"
  description: |
    Endpoints for managing workflow DSL definitions.
    Requires system.manage_all (Super Admin only) for all write operations (FR-009).
    Includes DSL validation endpoint for Admin UI inline feedback (FR-025).
paths:
  /workflow-engine/definitions:
    get:
      summary: List all workflow definitions (latest version per code)
      tags: [WorkflowDefinitions]
      security:
        - BearerAuth: []
      responses:
        "200":
          description: Array of latest definitions
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: "#/components/schemas/WorkflowDefinitionDto"
    post:
      summary: Create a new workflow definition (auto-increments version)
      description: |
        Creates a new version for the given workflow_code.
        DSL is compiled and validated (Phase 1 save-time check, FR-008).
        Requires system.manage_all permission.
      tags: [WorkflowDefinitions]
      security:
        - BearerAuth: []
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/CreateWorkflowDefinitionDto"
      responses:
        "201":
          description: Definition created
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/WorkflowDefinitionDto"
        "400":
          description: DSL structure validation failed (Phase 1)
        "403":
          description: Requires system.manage_all
  /workflow-engine/definitions/{id}:
    get:
      summary: Get a specific definition by UUID
      tags: [WorkflowDefinitions]
      security:
        - BearerAuth: []
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
            format: uuid
      responses:
        "200":
          description: Definition details
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/WorkflowDefinitionDto"
    patch:
      summary: Update a workflow definition (DSL or is_active toggle)
      description: |
        Updating DSL re-compiles and re-validates (Phase 1).
        Toggling is_active=true invalidates the Redis active pointer cache immediately (FR-007, SC-005).
        In-progress instances are NOT rebound (FR-010).
        Requires system.manage_all.
      tags: [WorkflowDefinitions]
      security:
        - BearerAuth: []
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
            format: uuid
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/UpdateWorkflowDefinitionDto"
      responses:
        "200":
          description: Definition updated
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/WorkflowDefinitionDto"
        "400":
          description: DSL validation failed
        "403":
          description: Requires system.manage_all
  /workflow-engine/definitions/validate:
    post:
      summary: Validate a DSL JSON without saving (for Admin UI inline feedback, FR-025)
      description: |
        Runs Phase 1 (structure) validation only. Returns errors per field.
        The endpoint is read-only and changes no state, but it is still
        protected by JWT (BearerAuth) because it is intended for Admin UI use.
      tags: [WorkflowDefinitions]
      security:
        - BearerAuth: []
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [dsl]
              properties:
                dsl:
                  type: object
                  description: DSL JSON to validate
      responses:
        "200":
          description: Validation result
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/DslValidationResultDto"
components:
  schemas:
    WorkflowDefinitionDto:
      type: object
      properties:
        id:
          type: string
          format: uuid
        workflowCode:
          type: string
          example: RFA_FLOW_V1
        version:
          type: integer
          example: 2
        isActive:
          type: boolean
        dsl:
          type: object
          description: Raw DSL JSON (JSON Logic conditions only; no eval/new Function)
        createdAt:
          type: string
          format: date-time
    CreateWorkflowDefinitionDto:
      type: object
      required: [workflow_code, dsl]
      properties:
        workflow_code:
          type: string
          example: RFA_FLOW_V2
        dsl:
          type: object
          description: DSL JSON; must use JSON Logic format for conditions (FR-001)
        is_active:
          type: boolean
          default: true
    UpdateWorkflowDefinitionDto:
      type: object
      properties:
        dsl:
          type: object
        is_active:
          type: boolean
        workflow_code:
          type: string
    DslValidationResultDto:
      type: object
      properties:
        valid:
          type: boolean
        errors:
          type: array
          items:
            type: object
            properties:
              path:
                type: string
                description: JSON path to the invalid field (e.g. "states.DRAFT.transitions")
              message:
                type: string
                description: Human-readable error description
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
@@ -0,0 +1,276 @@
openapi: "3.1.0"
info:
  title: Workflow Engine - Transition API
  version: "1.1.0"
  description: |
    Endpoints for triggering workflow state transitions.
    ADR-001 v1.1: Added version_no (optimistic lock) and action_by_user_uuid.
    ADR-021: Step-specific attachment support via attachmentPublicIds.
paths:
  /workflow-engine/instances/{id}/transition:
    post:
      summary: Trigger a workflow state transition
      description: |
        Transitions the workflow instance to the next state based on the DSL definition.
        Requires Idempotency-Key header (ADR-016).
        Optionally includes pre-uploaded attachment publicIds (ADR-021).
        Supports optimistic concurrency control via versionNo (ADR-001 v1.1).
      tags: [WorkflowEngine]
      security:
        - BearerAuth: []
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
            format: uuid
          description: Workflow Instance UUID
        - name: Idempotency-Key
          in: header
          required: true
          schema:
            type: string
            format: uuid
          description: UUIDv7 idempotency key; duplicate requests return cached response
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/WorkflowTransitionDto"
      responses:
        "200":
          description: Transition successful
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/WorkflowTransitionResponseDto"
        "409":
          description: |
            Conflict, one of:
            - version_no mismatch (optimistic lock): refresh and retry
            - Terminal state: cannot transition further
            - Upload rejected (state not in PENDING_REVIEW/PENDING_APPROVAL)
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/ErrorResponse"
        "422":
          description: DSL condition not met or required context field missing
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/ValidationErrorResponse"
        "403":
          description: User lacks the required CASL ability for this transition
        "503":
          description: Redlock unavailable; retry after brief delay
  /workflow-engine/instances/{id}:
    get:
      summary: Get workflow instance state
      description: Returns current state, available actions, and versionNo for optimistic locking.
      tags: [WorkflowEngine]
      security:
        - BearerAuth: []
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
            format: uuid
      responses:
        "200":
          description: Instance details
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/WorkflowInstanceDto"
  /workflow-engine/instances/{id}/history:
    get:
      summary: Get workflow history (timeline)
      description: Returns all transition records for a workflow instance, including step-specific attachments.
      tags: [WorkflowEngine]
      security:
        - BearerAuth: []
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
            format: uuid
      responses:
        "200":
          description: History items
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: "#/components/schemas/WorkflowHistoryItemDto"
components:
  schemas:
    WorkflowTransitionDto:
      type: object
      required: [action]
      properties:
        action:
          type: string
          example: APPROVE
          description: Action name matching a DSL transition key
        comment:
          type: string
          maxLength: 2000
          description: Optional decision comment
        versionNo:
          type: integer
          minimum: 1
          description: |
            Current version_no from the client. If provided, triggers optimistic
            lock check and returns 409 on mismatch (ADR-001 v1.1 FR-002).
          example: 5
        payload:
          type: object
          additionalProperties: true
          description: Additional context fields required by DSL conditions
        attachmentPublicIds:
          type: array
          items:
            type: string
            format: uuid
          maxItems: 20
          description: |
            Pre-uploaded attachment UUIDs (ADR-021). Files must have been
            uploaded via Two-Phase upload and passed ClamAV scan before
            this request. Only valid in PENDING_REVIEW or PENDING_APPROVAL.
    WorkflowTransitionResponseDto:
      type: object
      properties:
        success:
          type: boolean
          example: true
        previousState:
          type: string
          example: PENDING_REVIEW
        nextState:
          type: string
          example: PENDING_APPROVAL
        historyId:
          type: string
          format: uuid
          description: UUID of the created WorkflowHistory record
        isCompleted:
          type: boolean
          description: True if the transition reached a terminal state
        versionNo:
          type: integer
          description: Updated versionNo after successful transition; client must store it for the next request
    WorkflowInstanceDto:
      type: object
      properties:
        id:
          type: string
          format: uuid
        currentState:
          type: string
          example: PENDING_REVIEW
        status:
          type: string
          enum: [ACTIVE, COMPLETED, CANCELLED, TERMINATED]
        versionNo:
          type: integer
          description: Current optimistic lock version; include it in the next transition request
        availableActions:
          type: array
          items:
            type: string
          example: [APPROVE, REJECT, RETURN]
        workflowCode:
          type: string
          example: RFA_FLOW_V1
    WorkflowHistoryItemDto:
      type: object
      properties:
        id:
          type: string
          format: uuid
        fromState:
          type: string
        toState:
          type: string
        action:
          type: string
        actorUuid:
          type: string
          format: uuid
          description: UUID of the acting user (ADR-019; INT FK excluded from API)
        actorName:
          type: string
          description: Populated via user join for display
        comment:
          type: ["string", "null"]
        createdAt:
          type: string
          format: date-time
        attachments:
          type: array
          items:
            $ref: "#/components/schemas/AttachmentSummaryDto"
    AttachmentSummaryDto:
      type: object
      properties:
        publicId:
          type: string
          format: uuid
          description: ADR-019 public identifier
        originalFilename:
          type: string
        mimeType:
          type: string
        fileSize:
          type: integer
        createdAt:
          type: string
          format: date-time
    ErrorResponse:
      type: object
      properties:
        userMessage:
          type: string
        recoveryAction:
          type: string
        errorCode:
          type: string
    ValidationErrorResponse:
      type: object
      properties:
        userMessage:
          type: string
        fields:
          type: array
          items:
            type: object
            properties:
              field:
                type: string
              message:
                type: string
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
@@ -0,0 +1,388 @@
# Data Model: Unified Workflow Engine — Production Hardening
**Phase 1 Output** | Generated: 2026-05-02
**Extends**: `specs/08-Tasks/ADR-021-workflow-context/data-model.md` (deltas 01-08 already applied)
---
## 1. Schema Deltas
### Delta 09 — `version_no` on `workflow_instances`
**File**: `specs/03-Data-and-Storage/deltas/09-add-version-no-to-workflow-instances.sql`
```sql
-- ============================================================
-- Delta 09: ADR-001 v1.1 - Optimistic Lock
-- Add version_no to workflow_instances for Optimistic Concurrency Control
-- ============================================================
-- Caution: existing rows automatically receive DEFAULT 1, so there is no data loss
-- Rollback: ALTER TABLE workflow_instances DROP COLUMN version_no;
ALTER TABLE workflow_instances
  ADD COLUMN version_no INT NOT NULL DEFAULT 1
  COMMENT 'Optimistic lock counter, incremented on every successful transition (ADR-001 v1.1 FR-002)';
-- Index to support the CAS check: WHERE id = ? AND version_no = ?
CREATE INDEX idx_wf_inst_version
  ON workflow_instances (id, version_no);
```
**Migration Notes (ADR-009):**
- Apply via MariaDB CLI or n8n delta workflow; there is no TypeORM migration file
- Existing instances get `version_no = 1` — no disruption to active workflows
- Rollback: `ALTER TABLE workflow_instances DROP INDEX idx_wf_inst_version; ALTER TABLE workflow_instances DROP COLUMN version_no;`
---
### Delta 10 — `action_by_user_uuid` on `workflow_histories`
**File**: `specs/03-Data-and-Storage/deltas/10-add-action-by-user-uuid-to-workflow-histories.sql`
```sql
-- ============================================================
-- Delta 10: ADR-001 v1.1 / ADR-019 UUID Compliance
-- Add action_by_user_uuid to workflow_histories
-- to expose user identity through the API without revealing the INT PK (ADR-019)
-- ============================================================
-- Caution: NULL for historical records created before this delta (acceptable)
-- Rollback: ALTER TABLE workflow_histories DROP COLUMN action_by_user_uuid;
ALTER TABLE workflow_histories
  ADD COLUMN action_by_user_uuid VARCHAR(36) NULL
  COMMENT 'UUID of the acting user, used in API responses (ADR-019). INT FK action_by_user_id is retained for internal use';
```
**Migration Notes (ADR-009):**
- NULL for historical records is acceptable; API consumers treat NULL as "system action" or "pre-migration"
- Populate on all new transitions from this delta forward
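
As an illustration of the NULL handling described above, a consumer might map a history item to a display label as follows. This is only a sketch: the `HistoryItem` shape is trimmed to two fields, and the fallback label text is an assumption, not part of the contract.

```typescript
// Hypothetical consumer-side shape: only the fields this sketch needs.
interface HistoryItem {
  actorUuid?: string | null;
  actorName?: string | null;
}

// A missing UUID marks either a system action or a record written before
// Delta 10; the two cases cannot be told apart from this field alone.
function actorLabel(item: HistoryItem): string {
  if (!item.actorUuid) return 'System / pre-migration';
  return item.actorName ?? item.actorUuid;
}
```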
---
## 2. Backend Entity Changes
### 2.1 `workflow-instance.entity.ts` — Add `versionNo`
**File**: `backend/src/modules/workflow-engine/entities/workflow-instance.entity.ts`
```typescript
// Add after the updatedAt column
@Column({
  name: 'version_no',
  type: 'int',
  default: 1,
  comment: 'Optimistic lock, incremented on each successful transition (ADR-001 v1.1)',
})
versionNo!: number;
```
**Import to add**: No new imports needed.
---
### 2.2 `workflow-history.entity.ts` — Add `actionByUserUuid`
**File**: `backend/src/modules/workflow-engine/entities/workflow-history.entity.ts`
```typescript
// Add after the actionByUserId column
@Column({
  name: 'action_by_user_uuid',
  length: 36,
  nullable: true,
  comment: 'UUID of the acting user, exposed in API responses per ADR-019',
})
actionByUserUuid?: string;
```
---
### 2.3 `workflow-history-item.dto.ts` — Add `actorUuid`
**File**: `backend/src/modules/workflow-engine/dto/workflow-history-item.dto.ts`
```typescript
// Add this field to WorkflowHistoryItemDto
@ApiPropertyOptional({
  description: 'UUID of the acting user (ADR-019)',
  example: '019505a1-7c3e-7000-8000-abc123def456',
})
actorUuid?: string;
```
---
## 3. `processTransition()` — Optimistic Lock Changes
### Updated signature
```typescript
async processTransition(
  instanceId: string,
  action: string,
  userId: number,
  userUuid: string, // NEW: ADR-019 UUID for history record
  comment?: string,
  payload: Record<string, unknown> = {},
  attachmentPublicIds?: string[],
  clientVersionNo?: number, // NEW: optimistic lock value sent by the client
)
```
### Fast-fail check (before Redlock)
```typescript
if (clientVersionNo !== undefined) {
  const current = await this.instanceRepo.findOne({
    where: { id: instanceId },
    select: ['id', 'versionNo'],
  });
  if (!current) throw new NotFoundException('Workflow Instance', instanceId);
  if (current.versionNo !== clientVersionNo) {
    throw new ConflictException(
      'WORKFLOW_VERSION_CONFLICT',
      `Expected version_no=${clientVersionNo}, actual=${current.versionNo}`,
      // User message: "The document was already approved by someone else. Please refresh."
      'เอกสารถูกอนุมัติโดยผู้อื่นแล้ว กรุณารีเฟรช',
      // Recovery action: "Refresh the page and try again"
      ['รีเฟรชหน้าแล้วลองใหม่'],
    );
  }
}
```
### History creation — add `actionByUserUuid`
```typescript
const history = this.historyRepo.create({
  instanceId: instance.id,
  fromState,
  toState,
  action,
  actionByUserId: userId,
  actionByUserUuid: userUuid, // NEW
  comment,
  metadata: { events: evaluation.events },
});
```
### Version increment (inside DB transaction, after history save)
```typescript
// CAS update: if version_no changed in the meantime (TOCTOU), no row is updated
const result = await queryRunner.manager
  .createQueryBuilder()
  .update(WorkflowInstance)
  .set({ versionNo: () => 'version_no + 1' })
  .where('id = :id AND version_no = :expected', {
    id: instanceId,
    expected: instance.versionNo,
  })
  .execute();
if (result.affected === 0) {
  // TOCTOU: version changed under pessimistic lock (edge case; should not normally occur)
  throw new ConflictException(
    'WORKFLOW_VERSION_CONFLICT',
    'version_no changed between lock acquisition and update',
    // User message: "A conflict occurred. Please refresh and try again."
    'เกิด Conflict กรุณารีเฟรชและลองใหม่',
    // Recovery actions: "Refresh the page", "Retry the action"
    ['รีเฟรชหน้า', 'ลองดำเนินการอีกครั้ง'],
  );
}
```
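
To make the compare-and-swap behaviour concrete, here is a minimal in-memory sketch of the same rule: an update succeeds only when the caller's expected version still matches. `InMemoryInstanceStore` and `casTransition` are hypothetical names used for illustration, and folding the state change into the same call is an assumption of the sketch; the real update runs through the query builder shown above.

```typescript
// In-memory stand-in for the workflow_instances table (illustrative only).
interface InstanceRow {
  id: string;
  versionNo: number;
  currentState: string;
}

class InMemoryInstanceStore {
  private readonly rows = new Map<string, InstanceRow>();

  insert(row: InstanceRow): void {
    this.rows.set(row.id, { ...row });
  }

  get(id: string): InstanceRow | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }

  // Mirrors: UPDATE ... SET version_no = version_no + 1
  //          WHERE id = ? AND version_no = ?
  // Returns the number of affected rows (0 or 1).
  casTransition(id: string, expected: number, nextState: string): number {
    const row = this.rows.get(id);
    if (!row || row.versionNo !== expected) return 0;
    row.versionNo += 1;
    row.currentState = nextState;
    return 1;
  }
}

const store = new InMemoryInstanceStore();
store.insert({ id: 'wf-1', versionNo: 5, currentState: 'PENDING_REVIEW' });

// Two clients both read versionNo = 5, then race to transition.
const firstWriter = store.casTransition('wf-1', 5, 'PENDING_APPROVAL');
const secondWriter = store.casTransition('wf-1', 5, 'REJECTED');
```

Only the first writer affects a row; the second sees zero affected rows, which is the case the service maps to `WORKFLOW_VERSION_CONFLICT`.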
---
## 4. `processTransition()` — Structured Observability Changes
### New metric injections in constructor
```typescript
@InjectMetric('workflow_transitions_total')
private readonly transitionsTotal: Counter<string>,
@InjectMetric('workflow_transition_duration_ms')
private readonly transitionDuration: Histogram<string>,
```
### Wrap in timer + log
```typescript
const startMs = Date.now();
let outcome: 'success' | 'conflict' | 'forbidden' | 'validation_error' | 'system_error' = 'system_error';
let workflowCode = 'unknown';
// Declared outside the try block so the finally block can read them.
let loggedFromState: string | undefined;
let loggedToState: string | undefined;
try {
  // ... existing processTransition logic ...
  // As soon as `instance` is loaded, capture the values used for labels and logs:
  //   workflowCode = instance.definition.workflow_code;
  //   loggedFromState = instance.currentState;
  // Once the target state is known:
  //   loggedToState = toState;
  outcome = 'success';
} catch (err) {
  if (err instanceof ConflictException) outcome = 'conflict';
  else if (err instanceof ForbiddenException) outcome = 'forbidden';
  else if (err instanceof WorkflowException) outcome = 'validation_error';
  throw err;
} finally {
  const durationMs = Date.now() - startMs;
  this.transitionDuration.labels({ workflow_code: workflowCode }).observe(durationMs);
  this.transitionsTotal.labels({ workflow_code: workflowCode, action, outcome }).inc();
  this.logger.log(JSON.stringify({
    instanceId, action, fromState: loggedFromState,
    toState: outcome === 'success' ? loggedToState : undefined,
    userUuid, durationMs, outcome, workflowCode,
  }));
}
```
### Module registration (in `workflow-engine.module.ts`)
```typescript
import { makeCounterProvider, makeHistogramProvider } from '@willsoto/nestjs-prometheus';
// Add to providers array:
makeCounterProvider({
  name: 'workflow_transitions_total',
  help: 'Total workflow transitions by code, action, and outcome',
  labelNames: ['workflow_code', 'action', 'outcome'],
}),
makeHistogramProvider({
  name: 'workflow_transition_duration_ms',
  help: 'Workflow transition duration in milliseconds',
  labelNames: ['workflow_code'],
  buckets: [50, 100, 250, 500, 1000, 2500, 5000],
}),
---
## 5. DSL Cache Changes (FR-007)
### Cache methods in `workflow-engine.service.ts`
```typescript
// In createDefinition(), after save
await this.cacheManager.set(
  `wf:def:${saved.workflow_code}:${saved.version}`,
  saved,
  3600 * 1000, // 1 hour in ms (cache-manager v5 uses ms)
);
// In update(), before save (if the DSL changed)
await this.cacheManager.del(`wf:def:${definition.workflow_code}:${definition.version}`);
// On activate/deactivate: invalidate the active pointer.
// Delete through the same client that writes the key, so both calls address the same entry.
await this.cacheManager.del(`wf:def:${definition.workflow_code}:active`);
if (dto.is_active === true) {
  await this.cacheManager.set(
    `wf:def:${definition.workflow_code}:active`,
    saved,
    3600 * 1000,
  );
}
```
---
## 6. BullMQ DLQ + n8n Webhook Changes (FR-005, FR-006)
### `workflow-event.service.ts` additions
```typescript
// In WorkflowEventProcessor:
@OnWorkerEvent('failed')
async onJobFailed(job: Job, error: Error): Promise<void> {
  // Check whether the retries are exhausted
  if ((job.attemptsMade ?? 0) >= (job.opts.attempts ?? 3)) {
    // Send to the DLQ
    await this.failedQueue.add('dead-letter', {
      originalJobId: job.id,
      queue: 'workflow-events',
      data: job.data,
      failedAt: new Date().toISOString(),
      error: error.message,
    });
    // Notify Ops through the n8n webhook (if configured)
    const webhookUrl = process.env.N8N_WEBHOOK_URL;
    if (webhookUrl) {
      try {
        await fetch(webhookUrl, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({
            event: 'workflow_event_failed',
            jobId: job.id,
            workflowCode: job.data?.workflowCode,
            instanceId: job.data?.instanceId,
            error: error.message,
            timestamp: new Date().toISOString(),
          }),
        });
      } catch (webhookErr) {
        // Warn only; do not throw, so the DLQ add is not affected
        this.logger.warn(`n8n webhook failed: ${(webhookErr as Error).message}`);
      }
    } else {
      this.logger.warn('N8N_WEBHOOK_URL not configured; DLQ job created without ops notification');
    }
  }
}
```
### Worker configuration (verify/update in `workflow-engine.module.ts`)
```typescript
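// Worker options: in @nestjs/bullmq these are passed to @Processor on the
// processor class (WorkerHost is the base class, not a config function).
// connection: { ... } stays wherever the existing BullMQ connection is configured.
@Processor('workflow-events', {
  concurrency: 5,
  limiter: { max: 50, duration: 60000 },
})
export class WorkflowEventProcessor extends WorkerHost { /* ... */ }

// Job default options: set on the queue registration in the module imports
BullModule.registerQueue({
  name: 'workflow-events',
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 500 },
    removeOnComplete: { age: 86400 },
    removeOnFail: false, // Keep in failed state for Bull Board visibility
  },
}),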
```
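
For a sense of what `attempts: 3` with `backoff: { type: 'exponential', delay: 500 }` means in practice, the sketch below works out the retry delays. It assumes BullMQ's built-in exponential strategy without jitter, which computes `2^(attemptsMade - 1) * delay`; `exponentialBackoffMs` and `retrySchedule` are illustrative helper names, not BullMQ APIs.

```typescript
// Delay before the next retry, given how many attempts have failed so far.
// Assumption: BullMQ's built-in exponential backoff without jitter.
function exponentialBackoffMs(attemptsMade: number, baseDelayMs: number): number {
  return Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs);
}

// With `attempts` total tries there are at most `attempts - 1` retries
// before the job is handed to the dead-letter path.
function retrySchedule(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let failed = 1; failed < attempts; failed++) {
    delays.push(exponentialBackoffMs(failed, baseDelayMs));
  }
  return delays;
}

const schedule = retrySchedule(3, 500); // [500, 1000]
const worstCaseWaitMs = schedule.reduce((sum, d) => sum + d, 0); // 1500
```

Under these assumptions a failing job is retried after 500 ms and again after 1000 ms, so it reaches the DLQ handler roughly 1.5 s (plus processing time) after the first failure.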
---
## 7. Updated Entity Relationship Diagram
```
workflow_definitions
    workflow_code + version (unique)
    is_active: BOOLEAN
        │ 1
        ▼ N
workflow_instances
    version_no: INT DEFAULT 1 ← NEW (Delta 09)
    current_state: VARCHAR(50)
    context: JSON
    contract_id: INT NULL
        │ 1
        ▼ N
workflow_histories
    action_by_user_id: INT NULL ← existing (internal FK)
    action_by_user_uuid: VARCHAR(36) ← NEW (Delta 10, ADR-019)
    from_state / to_state / action
    metadata: JSON
        │ 1
        ▼ N
attachments
    workflow_history_id: CHAR(36) NULL ← Delta 04 (already applied)
    uuid: VARCHAR(36) ← publicId (ADR-019)
```
---
## 8. Index Strategy (updated)
| Table | Index | Columns | Purpose | Status |
|-------|-------|---------|---------|--------|
| `workflow_instances` | `idx_wf_inst_version` | `(id, version_no)` | Optimistic lock CAS check | **NEW** |
| `workflow_instances` | `idx_wf_inst_entity` | `(entity_type, entity_id)` | Polymorphic lookup | Existing |
| `workflow_histories` | `idx_wf_hist_instance` | `(instance_id)` | History per instance | Existing |
| `attachments` | `idx_att_wfhist_created` | `(workflow_history_id, created_at)` | Step attachments | Delta 04 |
@@ -0,0 +1,272 @@
# Implementation Plan: Unified Workflow Engine — Production Hardening & Integrated Context
**Branch**: `003-unified-workflow-engine` | **Date**: 2026-05-02 | **Spec**: [spec.md](./spec.md)
**Input**: Feature specification from `specs/003-unified-workflow-engine/spec.md`
---
## Summary
The Workflow Engine backend infrastructure is substantially implemented (service, entities, guards, DSL, Redlock, Prometheus metrics). This plan closes the remaining production-hardening gaps from ADR-001 v1.1 (optimistic lock, user UUID in history, CASL-mapped DSL roles, per-transition metrics, DSL Redis cache, DLQ + n8n webhook) and completes ADR-021 (step-specific attachment data-wiring in all 4 modules, file preview modal, Admin DSL editor UI).
Clarification decisions from `spec.md`:
- **Q1**: DSL `require.role` → CASL ability check (FR-002a)
- **Q2**: Observability = structured log + counter + histogram (FR-022, FR-023)
- **Q3**: File rollback on DB failure = move back to temp, 24h TTL (FR-019)
- **Q4**: Admin DSL editor UI is in scope (FR-024, FR-025)
- **Q5**: All 4 modules need banner gap-filling (FR-011)
---
## Technical Context
**Language/Version**: TypeScript 5.4, Node.js 20 LTS
**Primary Dependencies**: NestJS 10, TypeORM 0.3, BullMQ 5, `@willsoto/nestjs-prometheus`, `json-logic-js`, `redlock`, `ioredis`
**Frontend**: Next.js 14 (App Router), TanStack Query v5, React Hook Form + Zod, shadcn/ui
**Storage**: MariaDB 10.11, Redis 7, StorageService (Two-Phase Upload per ADR-016)
**Testing**: Jest + `@nestjs/testing` (backend), Vitest (frontend)
**Target Platform**: QNAP NAS Docker Compose (backend), Next.js SSR (frontend)
**Performance Goals**: Transition P95 < 1s (no upload); upload+transition P95 < 5s; cache invalidation < 1s across all instances
**Constraints**: ADR-009 (no TypeORM migrations), ADR-019 (UUID strings, no parseInt), ADR-016 (Two-Phase Upload), ADR-008 (BullMQ async)
**Scale/Scope**: 4 document modules × ~50 active workflows concurrently; up to 20 history records per instance
---
## Constitution Check
_GATE: Must pass before Phase 0. Re-checked after Phase 1 design._
| Gate | Rule | Status | Notes |
|------|------|--------|-------|
| ADR-019 UUID | No `parseInt` on UUIDs; expose `publicId` strings only | ✅ PASS | `WorkflowInstance.id` and `WorkflowHistory.id` are UUID PKs (native CHAR(36)); `action_by_user_uuid` addition follows pattern |
| ADR-009 Schema | No TypeORM migrations; edit SQL directly | ✅ PASS | Two new delta files planned (delta-09, delta-10) |
| ADR-016 Security | Two-Phase upload; ClamAV; whitelist | ✅ PASS | Already implemented in `processTransition()`; file preview uses existing attachment endpoint |
| ADR-008 BullMQ | Async notifications; no inline dispatch | ✅ PASS | `WorkflowEventService` dispatches to `workflow-events` queue; DLQ is the gap |
| ADR-007 Errors | Layered exception hierarchy | ✅ PASS | `WorkflowException`, `ConflictException`, `ServiceUnavailableException` already in use |
| ADR-002 Numbering | Redlock for document numbering | ✅ N/A | Workflow engine does not generate document numbers |
| ADR-018/020 AI | No AI direct DB access | ✅ N/A | No AI integration in this feature |
| FR-002 Optimistic Lock | `version_no` column on `workflow_instances` | ⚠️ GAP | Column missing — delta-09 required |
| FR-003 User UUID | `action_by_user_uuid` on `workflow_histories` | ⚠️ GAP | Column missing — delta-10 required |
**Post-gate verdict**: PASS with two schema deltas required before implementation begins.
---
## Project Structure
### Documentation (this feature)
```text
specs/003-unified-workflow-engine/
├── plan.md ← This file
├── research.md ← Phase 0 output
├── data-model.md ← Phase 1 output
├── quickstart.md ← Phase 1 output
└── contracts/ ← Phase 1 output
├── workflow-transition.yaml
└── workflow-definitions.yaml
```
### Source Code Layout
```text
backend/src/modules/workflow-engine/
├── entities/
│ ├── workflow-instance.entity.ts ← ADD versionNo column
│ └── workflow-history.entity.ts ← ADD actionByUserUuid column
├── guards/
│ └── workflow-transition.guard.ts ← ADD DSL require.role → CASL mapping (FR-002a)
├── dto/
│ └── workflow-history-item.dto.ts ← ADD actorUuid field
├── workflow-engine.service.ts ← ADD version_no check, structured log, metrics, cache invalidation
├── workflow-event.service.ts ← ADD DLQ processor + n8n webhook (FR-005/006)
└── workflow-engine.module.ts ← Register new metrics providers
specs/03-Data-and-Storage/deltas/
├── 09-add-version-no-to-workflow-instances.sql ← NEW
└── 10-add-action-by-user-uuid-to-workflow-histories.sql ← NEW
frontend/components/workflow/
├── integrated-banner.tsx ← GAP-FILL: step-attachment upload zone
├── workflow-lifecycle.tsx ← GAP-FILL: history items with attachment list
└── file-preview-modal.tsx ← NEW component
frontend/app/(admin)/admin/workflows/
└── definitions/
├── page.tsx ← NEW: DSL list + activate/deactivate
└── [id]/
└── page.tsx ← NEW: DSL JSON editor + inline validation
frontend/app/(admin)/admin/doc-control/
├── rfa/[uuid]/page.tsx ← GAP-FILL: availableActions, step-attach
├── transmittals/[uuid]/page.tsx ← GAP-FILL: step-attach upload zone
├── circulation/[uuid]/page.tsx ← GAP-FILL: step-attach upload zone
└── correspondence/[uuid]/page.tsx ← GAP-FILL + new IntegratedBanner wiring
```
---
## Implementation Phases
### Phase B1: Schema Deltas (prerequisite)
Apply before any code changes.
| Delta | File | Change |
|-------|------|--------|
| 09 | `09-add-version-no-to-workflow-instances.sql` | `ALTER TABLE workflow_instances ADD COLUMN version_no INT NOT NULL DEFAULT 1` |
| 10 | `10-add-action-by-user-uuid-to-workflow-histories.sql` | `ALTER TABLE workflow_histories ADD COLUMN action_by_user_uuid VARCHAR(36) NULL` |
### Phase B2: Entity & DTO Updates
| Task | File | Change |
|------|------|--------|
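| B2-1 | `workflow-instance.entity.ts` | Add `versionNo: number` as a plain `@Column({ name: 'version_no', type: 'int', default: 1 })`; the explicit CAS update in Phase B3 increments it, so no TypeORM version decorator is used |
| B2-2 | `workflow-history.entity.ts` | Add `@Column() actionByUserUuid?: string` |
| B2-3 | `workflow-history-item.dto.ts` | Add optional `actorUuid?: string` field (exposed in API per ADR-019; absent for pre-migration records) |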
### Phase B3: Optimistic Lock in `processTransition()` (FR-002)
In `workflow-engine.service.ts`:
1. Accept `clientVersionNo?: number` parameter in `processTransition()`
2. If provided: compare against `instance.versionNo` BEFORE Redlock acquisition → throw `ConflictException` (HTTP 409) if mismatch
3. Inside the DB transaction, after the history save: increment `version_no` via `UPDATE workflow_instances SET version_no = version_no + 1 WHERE id = :id AND version_no = :expected`; if no row is affected, throw `ConflictException` (HTTP 409)
4. No separate pessimistic lock change needed — keep both as defense-in-depth
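
From the client side, the optimistic lock might be exercised roughly as follows. This is a sketch under assumptions: `submitTransition`, `HttpFn` and the `TransitionOutcome` union are illustrative names, and the HTTP function is injected so the example does not depend on any particular fetch typings. The endpoint path, the `Idempotency-Key` header and the `versionNo` field come from the transition contract.

```typescript
// Minimal response/request shapes so the sketch carries no DOM or Node typings.
interface HttpResponse {
  status: number;
  json(): Promise<unknown>;
}
type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<HttpResponse>;

type TransitionOutcome =
  | { kind: 'ok'; versionNo: number }
  | { kind: 'conflict' } // 409: version mismatch, terminal state, or rejected upload
  | { kind: 'error'; status: number };

async function submitTransition(
  http: HttpFn,
  baseUrl: string,
  instanceId: string,
  action: string,
  versionNo: number,
  idempotencyKey: string,
): Promise<TransitionOutcome> {
  const res = await http(`${baseUrl}/workflow-engine/instances/${instanceId}/transition`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'Idempotency-Key': idempotencyKey },
    body: JSON.stringify({ action, versionNo }),
  });
  if (res.status === 409) return { kind: 'conflict' };
  if (res.status !== 200) return { kind: 'error', status: res.status };
  const body = (await res.json()) as { versionNo: number };
  // The caller should keep this value for its next transition request.
  return { kind: 'ok', versionNo: body.versionNo };
}
```

On `conflict`, a reasonable client reaction is to re-fetch the instance (which returns the current `versionNo` and `availableActions`) and let the user decide again rather than retrying blindly.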
### Phase B4: CASL Role Mapping in Guard (FR-002a)
In `workflow-transition.guard.ts`:
1. After Level 1 (Superadmin) check, extract DSL `require.role` from the current step config
2. Map each DSL role string to a CASL ability string via `DSL_ROLE_TO_CASL` config map
3. Check `userPermissions.includes(mappedAbility)` for any match → pass
4. Fall through to existing Level 3 (assignedUserId) check for `"AssignedHandler"` role
```typescript
const DSL_ROLE_TO_CASL: Record<string, string> = {
  'Superadmin': 'system.manage_all',
  'OrgAdmin': 'organization.manage_users',
  'ContractMember': 'contract.view',
  'AssignedHandler': '__assigned__', // resolved by existing Level 3 check
};
```
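
The four steps above could be combined into a single pure check along these lines. This is a sketch, not the guard itself: `TransitionCheckInput` and `canTransition` are hypothetical names, the map is repeated so the block is self-contained, and treating an unknown DSL role as granting nothing is an assumption.

```typescript
const DSL_ROLE_TO_CASL: Record<string, string> = {
  Superadmin: 'system.manage_all',
  OrgAdmin: 'organization.manage_users',
  ContractMember: 'contract.view',
  AssignedHandler: '__assigned__',
};

interface TransitionCheckInput {
  requiredRoles: string[]; // DSL require.role for the current step
  userPermissions: string[]; // CASL ability strings held by the user
  userId: number;
  assignedUserId?: number; // handler assigned to the step, if any
}

// Passes when any required role is satisfied by a mapped ability or,
// for AssignedHandler, by being the assigned user.
function canTransition(input: TransitionCheckInput): boolean {
  for (const role of input.requiredRoles) {
    const ability = DSL_ROLE_TO_CASL[role];
    if (ability === undefined) continue; // unknown DSL role: grants nothing (assumption)
    if (ability === '__assigned__') {
      if (input.assignedUserId !== undefined && input.assignedUserId === input.userId) return true;
      continue;
    }
    if (input.userPermissions.includes(ability)) return true;
  }
  return false;
}
```

Keeping the check pure makes it straightforward to unit-test each DSL role without constructing a NestJS execution context.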
### Phase B5: Structured Observability (FR-022, FR-023)
In `workflow-engine.service.ts`:
1. Inject two new metrics via `@InjectMetric()`:
- `workflow_transitions_total` (Counter: `workflow_code`, `action`, `outcome`)
- `workflow_transition_duration_ms` (Histogram: `workflow_code`)
2. Wrap `processTransition()` in a `startTimer` → `observe(duration)` block
3. Emit structured log on every outcome:
```typescript
this.logger.log(JSON.stringify({
  instanceId, action, fromState, toState, userUuid,
  durationMs, outcome, workflowCode,
}));
```
4. Register providers in `workflow-engine.module.ts`
### Phase B6: DSL Redis Cache Invalidation (FR-007)
In `workflow-engine.service.ts`:
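1. In `createDefinition()`: after save, call `cacheManager.set('wf:def:${code}:${version}', entity, 3600 * 1000)` (TTL is in milliseconds in cache-manager v5, so this is 1 hour)
2. In `update()`: call `cacheManager.del('wf:def:${code}:${oldVersion}')` before save
3. In `getDefinitionById()` / cached lookup: read-through with `cacheManager.get()` → fallback to DB
4. On `is_active` toggle: invalidate ALL `wf:def:{code}:*` keys (iterate with `SCAN`, for example ioredis `scanStream({ match })`, then `redis.del()`; avoid `redis.keys()` in production because it blocks Redis while walking the whole keyspace)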
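
The read-through lookup in step 3 might look like the sketch below. It is written against a small `AsyncCache` interface with a `Map`-backed stand-in so that it runs without Redis; `getActiveDefinition`, `MapCache` and the `loadFromDb` callback are illustrative names, and the stand-in ignores the TTL.

```typescript
interface Definition {
  workflowCode: string;
  version: number;
  dsl: unknown;
}

// Minimal async cache contract, similar in spirit to cache-manager's get/set/del.
interface AsyncCache {
  get(key: string): Promise<Definition | undefined>;
  set(key: string, value: Definition, ttlMs: number): Promise<void>;
  del(key: string): Promise<void>;
}

// Map-backed stand-in; the TTL is accepted but not enforced in this sketch.
class MapCache implements AsyncCache {
  private readonly store = new Map<string, Definition>();
  async get(key: string): Promise<Definition | undefined> {
    return this.store.get(key);
  }
  async set(key: string, value: Definition, _ttlMs: number): Promise<void> {
    this.store.set(key, value);
  }
  async del(key: string): Promise<void> {
    this.store.delete(key);
  }
}

const ONE_HOUR_MS = 3600 * 1000;

// Read-through: serve from cache when present, otherwise load and populate.
async function getActiveDefinition(
  cache: AsyncCache,
  workflowCode: string,
  loadFromDb: (code: string) => Promise<Definition | undefined>,
): Promise<Definition | undefined> {
  const key = `wf:def:${workflowCode}:active`;
  const cached = await cache.get(key);
  if (cached) return cached;
  const fresh = await loadFromDb(workflowCode);
  if (fresh) await cache.set(key, fresh, ONE_HOUR_MS);
  return fresh;
}
```

After an `is_active` toggle deletes the key, the next lookup falls through to the database and repopulates the entry, which is what keeps stale active pointers from outliving the invalidation.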
### Phase B7: BullMQ DLQ + n8n Webhook (FR-005, FR-006)
In `workflow-event.service.ts`:
1. Add `workflow-events-failed` queue registration
2. Add `@OnWorkerEvent('failed')` handler in the processor class
3. On `attempts === maxAttempts`: POST to `process.env.N8N_WEBHOOK_URL` with job payload (never hardcoded)
4. Verify existing `workflow-events` worker has `concurrency: 5, attempts: 3, backoff: { type: 'exponential', delay: 500 }`
### Phase B8: File Rollback on Transaction Failure (FR-019)
In `workflow-engine.service.ts` `processTransition()`:
1. After file linkage step inside transaction, if `queryRunner.commitTransaction()` throws:
- Call `storageService.moveToTemp(attachmentPublicIds)` in the `catch` block
- Log the rollback with attachment IDs for audit
2. The 24h TTL on temp files is handled by existing `FileCleanupService` cron
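
The rollback path could look roughly like the sketch below. `TransactionLike`, `StorageLike`, `LoggerLike`, and `commitWithFileRollback` are hypothetical names introduced for illustration; only `moveToTemp(attachmentPublicIds)` comes from the plan, and its exact signature is assumed.

```typescript
// Hypothetical interfaces (assumptions); the real code uses a TypeORM QueryRunner and StorageService.
interface TransactionLike {
  commit(): Promise<void>;
  rollback(): Promise<void>;
}
interface StorageLike {
  moveToTemp(publicIds: string[]): Promise<void>;
}
interface LoggerLike {
  warn(message: string): void;
}

// Commit the transaction; if the commit fails, roll back the DB work,
// move already-promoted files back to temp, log the IDs, and rethrow.
async function commitWithFileRollback(
  tx: TransactionLike,
  storage: StorageLike,
  logger: LoggerLike,
  attachmentPublicIds: string[],
): Promise<void> {
  try {
    await tx.commit();
  } catch (err) {
    await tx.rollback();
    if (attachmentPublicIds.length > 0) {
      await storage.moveToTemp(attachmentPublicIds);
      // Structured audit entry listing the attachments that were rolled back
      logger.warn(JSON.stringify({ event: 'file_rollback', attachmentPublicIds }));
    }
    throw err; // the caller still sees the original failure
  }
}
```

Rethrowing keeps the HTTP error handling unchanged, while the files return to temp storage so the user can retry without re-uploading.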
### Phase F1: File Preview Modal (FR-020)
New component: `frontend/components/workflow/file-preview-modal.tsx`
- Props: `attachment: WorkflowAttachmentSummary | null`, `onClose: () => void`
- Renders PDFs via `<iframe src="/api/files/{publicId}/preview" />`
- Renders `<img>` for image MIME types
- Falls back to download link for unsupported types
- Uses shadcn/ui `Dialog` component
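
The rendering rules above reduce to a small MIME-type decision, sketched here. `previewKind` and `previewUrl` are hypothetical helper names; the URL shape follows the `<iframe>` example in the list.

```typescript
type PreviewKind = 'pdf' | 'image' | 'download';

// Decide how the modal should render an attachment based on its MIME type.
function previewKind(mimeType: string): PreviewKind {
  const normalized = mimeType.trim().toLowerCase();
  if (normalized === 'application/pdf') return 'pdf'; // rendered in an <iframe>
  if (normalized.startsWith('image/')) return 'image'; // rendered in an <img>
  return 'download'; // unsupported types fall back to a download link
}

// Build the preview URL for a given attachment public ID.
function previewUrl(publicId: string): string {
  return `/api/files/${encodeURIComponent(publicId)}/preview`;
}
```

Keeping the decision in a pure function makes the three branches easy to cover in Vitest without mounting the dialog.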
### Phase F2: Step-Attachment Upload Zone (FR-014 to FR-019)
In `integrated-banner.tsx`:
1. Show upload zone only when `currentState ∈ {PENDING_REVIEW, PENDING_APPROVAL}` AND user is assigned handler/org-admin/superadmin
2. Upload zone calls existing Two-Phase upload endpoint, then appends `publicId` to pending list
3. On action button click, pass `attachmentPublicIds` array to `use-workflow-action.ts` hook
4. On success: invalidate TanStack Query cache for document + history
In `workflow-lifecycle.tsx`:
1. For each history item, render `attachments[]` as clickable file chips
2. On click: open `FilePreviewModal`
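
The visibility rule in step 1 of the `integrated-banner.tsx` list can be expressed as a pure predicate. This is a sketch under assumptions: `UploadZoneUser` and `canShowUploadZone` are hypothetical, and the two permission strings are borrowed from the `DSL_ROLE_TO_CASL` map to stand in for "org-admin" and "superadmin".

```typescript
// States in which step attachments may be uploaded (from the plan).
const UPLOAD_STATES: ReadonlySet<string> = new Set(['PENDING_REVIEW', 'PENDING_APPROVAL']);

// Hypothetical view of the current user (assumption).
interface UploadZoneUser {
  uuid: string;
  permissions: string[];
}

// Show the upload zone only in an uploadable state AND for an assigned handler or privileged user.
function canShowUploadZone(
  currentState: string,
  assignedUserUuid: string | null,
  user: UploadZoneUser,
): boolean {
  if (!UPLOAD_STATES.has(currentState)) return false;
  const isAssignedHandler = assignedUserUuid !== null && assignedUserUuid === user.uuid;
  const isPrivileged =
    user.permissions.includes('system.manage_all') ||
    user.permissions.includes('organization.manage_users');
  return isAssignedHandler || isPrivileged;
}
```

The backend must still enforce the same rule; the predicate only controls what the UI renders.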
### Phase F3: Module Banner Gap-Fill (FR-011, all 4 modules)
For each detail page (`rfa`, `transmittals`, `circulation`, `correspondence`):
1. Ensure service `findOneByUuid()` exposes: `workflowInstanceId`, `workflowState`, `availableActions`, `workflowPriority`
2. Pass live values to `<IntegratedBanner>` and `<WorkflowLifecycle>`
3. Add step-attachment upload zone via Phase F2 components
4. Verify `WorkflowHistoryItemDto` includes `attachments[]` in the history endpoint
Correspondence is the only module requiring new backend wiring (Transmittal + Circulation already done per v1.8.7; RFA has partial wiring — needs `availableActions` + step-attach).
### Phase F4: Admin DSL Editor UI (FR-024, FR-025)
New pages under `frontend/app/(admin)/admin/workflows/definitions/`:
**List page** (`page.tsx`):
- Table of all workflow definitions with columns: `workflow_code`, `version`, `is_active`, actions (Edit / Activate / Deactivate)
- Uses TanStack Query `useWorkflowDefinitions()` hook
- Activate/Deactivate via `PATCH /workflow-engine/definitions/:id` with `{ is_active: true/false }`
**Editor page** (`[id]/page.tsx`):
- Load definition via `useWorkflowDefinition(id)`
- JSON editor (Monaco Editor or `@uiw/react-codemirror` in JSON mode)
- Inline validation: call `POST /workflow-engine/definitions/validate` with DSL JSON → display errors inline
- Save button disabled when validation errors present (FR-025)
- Form managed with React Hook Form + Zod (for wrapper metadata fields)
---
## Complexity Tracking
No constitution violations requiring justification.
---
## Risk Register
| Risk | Impact | Mitigation |
|------|--------|-----------|
| `version_no` delta on live DB with existing instances | Medium | Delta sets `DEFAULT 1`; existing rows auto-initialize; no data loss |
| `action_by_user_uuid` delta — NULL for historical records | Low | Column is NULLABLE; historical records remain valid |
| DSL role mapping gaps (unknown role strings) | Medium | `DSL_ROLE_TO_CASL` unknown keys default to `__assigned__` check — fail-safe |
| Monaco Editor bundle size (~2MB) | Low | Lazy-loaded only on Admin DSL editor page; no impact to user-facing pages |
| n8n webhook URL not configured in some environments | Medium | Guard with `if (!N8N_WEBHOOK_URL)` → warn log, don't throw; ops can configure later |
---
## Test Plan
| Area | Tests Required | Target |
|------|---------------|--------|
| `WorkflowEngineService.processTransition` | Concurrent optimistic lock (409), version increment, structured log emission | Unit (Jest) |
| `WorkflowTransitionGuard` | DSL role → CASL mapping for each level | Unit (Jest) |
| `WorkflowEventService` DLQ | Failed job triggers n8n webhook | Unit (Jest + mock) |
| Transition metrics | Counter/histogram incremented on success + failure | Unit (Jest) |
| DSL cache invalidation | Activate triggers cache del | Integration (Jest) |
| File rollback (FR-019) | DB failure → `moveToTemp()` called | Unit (Jest + mock) |
| `FilePreviewModal` | Renders PDF/image/fallback correctly | Frontend (Vitest) |
| Admin DSL editor | Validation errors shown inline; save blocked | Frontend (Vitest) |
| Module gap-fill E2E | Each module detail page renders live `availableActions` | Manual / Playwright |
@@ -0,0 +1,205 @@
# Quickstart: Unified Workflow Engine — Production Hardening
**Phase 1 Output** | Generated: 2026-05-02
**For**: Developers implementing tasks from `tasks.md` (generated by `/speckit-tasks`)
---
## Pre-flight Checklist
Before writing any code:
- [ ] Apply Delta 09: `specs/03-Data-and-Storage/deltas/09-add-version-no-to-workflow-instances.sql`
- [ ] Apply Delta 10: `specs/03-Data-and-Storage/deltas/10-add-action-by-user-uuid-to-workflow-histories.sql`
- [ ] Confirm `workflow_instances` has `version_no` column: `DESCRIBE workflow_instances;`
- [ ] Confirm `workflow_histories` has `action_by_user_uuid` column: `DESCRIBE workflow_histories;`
- [ ] Verify existing tests pass: `pnpm test --testPathPattern=workflow-engine`
---
## Implementation Order
Tasks MUST be implemented in this order to avoid breaking existing functionality:
```
[B1] Schema Deltas (DB)
[B2] Entity + DTO updates
[B3] processTransition() — optimistic lock
[B4] WorkflowTransitionGuard — CASL role mapping
[B5] Observability — metrics + structured log
[B6] DSL Redis cache invalidation
[B7] BullMQ DLQ + n8n webhook
[F1] FilePreviewModal component
[F2] Step-attachment upload zone in IntegratedBanner
[F3] Module gap-fill (all 4 modules)
[F4] Admin DSL editor UI
```
---
## Key Files Reference
| Task | File | Action |
|------|------|--------|
| B1 | `specs/03-Data-and-Storage/deltas/09-*.sql` | CREATE |
| B1 | `specs/03-Data-and-Storage/deltas/10-*.sql` | CREATE |
| B2 | `backend/src/modules/workflow-engine/entities/workflow-instance.entity.ts` | EDIT — add `versionNo` |
| B2 | `backend/src/modules/workflow-engine/entities/workflow-history.entity.ts` | EDIT — add `actionByUserUuid` |
| B2 | `backend/src/modules/workflow-engine/dto/workflow-history-item.dto.ts` | EDIT — add `actorUuid` |
| B3 | `backend/src/modules/workflow-engine/workflow-engine.service.ts` | EDIT — optimistic lock, rollback, metrics |
| B4 | `backend/src/modules/workflow-engine/guards/workflow-transition.guard.ts` | EDIT — DSL role → CASL |
| B5 | `backend/src/modules/workflow-engine/workflow-engine.module.ts` | EDIT — register metrics providers |
| B6 | `backend/src/modules/workflow-engine/workflow-engine.service.ts` | EDIT — cache set/del in createDefinition/update |
| B7 | `backend/src/modules/workflow-engine/workflow-event.service.ts` | EDIT — DLQ + n8n webhook |
| F1 | `frontend/components/workflow/file-preview-modal.tsx` | CREATE |
| F2 | `frontend/components/workflow/integrated-banner.tsx` | EDIT — upload zone |
| F2 | `frontend/components/workflow/workflow-lifecycle.tsx` | EDIT — attachment chips |
| F3 | `frontend/app/(admin)/admin/doc-control/correspondence/[uuid]/page.tsx` | EDIT — banner wiring |
| F3 | `frontend/app/(admin)/admin/doc-control/rfa/[uuid]/page.tsx` | EDIT — step-attach gap |
| F3 | `frontend/app/(admin)/admin/doc-control/transmittals/[uuid]/page.tsx` | EDIT — step-attach gap |
| F3 | `frontend/app/(admin)/admin/doc-control/circulation/[uuid]/page.tsx` | EDIT — step-attach gap |
| F4 | `frontend/app/(admin)/admin/workflows/definitions/page.tsx` | CREATE |
| F4 | `frontend/app/(admin)/admin/workflows/definitions/[id]/page.tsx` | CREATE |
---
## Critical Patterns
### Optimistic Lock — Client Side
```typescript
// Frontend: store versionNo from GET /workflow-engine/instances/:id
const { data: instance } = useWorkflowInstance(instanceId);
// On transition: pass versionNo in body
await triggerTransition({
action: 'APPROVE',
versionNo: instance.versionNo, // ← MUST include
attachmentPublicIds: pendingFiles,
comment,
});
// On 409 → show toast "เอกสารถูกอนุมัติโดยผู้อื่นแล้ว กรุณารีเฟรช" ("This document was already approved by someone else. Please refresh.")
// Invalidate query cache → user sees updated state
```
### DSL Role Mapping — Guard
```typescript
// backend/src/modules/workflow-engine/guards/workflow-transition.guard.ts
const DSL_ROLE_TO_CASL: Record<string, string> = {
'Superadmin': 'system.manage_all',
'OrgAdmin': 'organization.manage_users',
'ContractMember': 'contract.view',
'AssignedHandler': '__assigned__',
};
// In canActivate: extract require.role from DSL compiled state
const stepConfig = compiled?.states?.[instance.currentState];
const requiredRoles: string[] = stepConfig?.require?.role ?? [];
for (const dslRole of requiredRoles) {
const caslAbility = DSL_ROLE_TO_CASL[dslRole];
if (!caslAbility) continue;
if (caslAbility === '__assigned__') continue; // handled by Level 3 check
if (userPermissions.includes(caslAbility)) return true;
}
// Fall through to Level 3 (assignedUserId) check as before
```
### File Preview Modal — Usage
```tsx
// In workflow-lifecycle.tsx
import { useState } from 'react';
import { FilePreviewModal } from './file-preview-modal';
const [preview, setPreview] = useState<WorkflowAttachmentSummary | null>(null);
// In attachment chip onClick:
<button onClick={() => setPreview(attachment)}>{attachment.originalFilename}</button>
<FilePreviewModal attachment={preview} onClose={() => setPreview(null)} />
```
### Admin DSL Editor — Monaco Setup
```tsx
// In definitions/[id]/page.tsx
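import { useMemo } from 'react';
import dynamic from 'next/dynamic';
import debounce from 'lodash/debounce'; // assumption: any debounce helper works here
const MonacoEditor = dynamic(() => import('@monaco-editor/react'), { ssr: false });
// Validate on change (debounced 800ms).
// useMemo creates the debounced function once per mount; useCallback(debounce(...), [])
// would build a fresh debounced wrapper on every render and then discard it.
const handleEditorChange = useMemo(
  () =>
    debounce(async (value: string | undefined) => {
      try {
        // Monaco may pass undefined; treat it as empty (invalid) JSON
        const parsed = JSON.parse(value ?? '');
        const result = await validateDsl(parsed);
        setValidationErrors(result.errors);
      } catch {
        setValidationErrors([{ path: 'root', message: 'Invalid JSON' }]);
      }
    }, 800),
  []
);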
```
---
## Testing Verification Commands
```bash
# Backend unit tests for workflow engine
cd backend
pnpm test --testPathPattern=workflow-engine --coverage
# Frontend typecheck
cd frontend
pnpm tsc --noEmit
# Frontend component tests
cd frontend
pnpm vitest run components/workflow
# Full backend test suite
cd backend
pnpm test --coverage
```
---
## Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `N8N_WEBHOOK_URL` | Prod only | URL for dead-letter job ops notifications |
| `REDIS_URL` | All | Redis connection for BullMQ + cache |
Set both through the `docker-compose.yml` environment configuration (`N8N_WEBHOOK_URL` may stay unset outside production); never hardcode either value.
---
## Commit Message Convention
```
feat(workflow-engine): add optimistic lock version_no (FR-002, ADR-001 v1.1)
feat(workflow-engine): add CASL DSL role mapping to guard (FR-002a)
feat(workflow-engine): structured transition log + metrics (FR-022/023)
feat(workflow-engine): DSL cache invalidation on activate (FR-007)
feat(workflow-engine): BullMQ DLQ + n8n webhook (FR-005/006)
feat(workflow-ui): FilePreviewModal component (FR-020)
feat(workflow-ui): step-attachment upload zone in IntegratedBanner (FR-014-019)
feat(workflow-ui): Admin DSL editor page (FR-024/025)
feat(correspondence): IntegratedBanner gap-fill wiring (FR-011)
chore(schema): delta-09 version_no, delta-10 action_by_user_uuid (ADR-009)
```
@@ -0,0 +1,209 @@
# Research: Unified Workflow Engine — Production Hardening Decisions
**Phase 0 Output** | Generated: 2026-05-02
**Builds on**: `specs/08-Tasks/ADR-021-workflow-context/research.md` (attachment strategy, FK structure, UUID type — all resolved previously)
---
## Decision 1: Optimistic Lock Strategy for `processTransition()` (FR-002)
**Question:** `processTransition()` already uses `pessimistic_write` DB lock. ADR-001 v1.1 requires adding `version_no` optimistic lock. Should they co-exist or replace?
### Option A: Replace pessimistic with optimistic (Rejected ❌)
Remove `lock: { mode: 'pessimistic_write' }` and rely solely on `version_no` CAS.
**Cons:**
- Two concurrent requests carrying the same `version_no` can both pass a read-time check, leaving a race window between the DB read and the UPDATE unless the UPDATE itself is conditioned on `version_no`
- Redlock already acquired before DB transaction — removing pessimistic adds no benefit to latency
### Option B: Dual-layer defense-in-depth (Selected ✅)
Keep `pessimistic_write` inside the transaction. Add `version_no` check as a **fast-fail before Redlock acquisition**.
**Flow:**
```
Client sends { action, version_no: N }
[Fast-fail] Read instance.version_no from DB (no lock)
If N ≠ instance.version_no → HTTP 409 immediately (no Redlock acquired)
[Acquire Redlock]
[DB Transaction with pessimistic_write]
Re-check version_no under lock (TOCTOU defense)
If still mismatch → 409 and release lock
Else: commit + increment version_no
```
**Pros:**
- Fast-fail saves Redlock round-trip for stale clients (SC-001 — no double approvals)
- Inner pessimistic lock prevents any residual race within the DB transaction
- Defense-in-depth: two independent barriers
**Decision:** Option B — dual-layer.
**Rationale:** Zero latency regression for non-conflicting requests; stale-client 409 fired before lock acquisition; inner lock remains for cross-process correctness.
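
The version check at the heart of both layers is a compare-and-set, which can be modelled in memory as below. This is a sketch only: `InstanceRow`, `TransitionResult`, and `applyTransition` are hypothetical, and in the real service the same comparison would run against the database row (first without a lock, then again under `pessimistic_write`).

```typescript
// Hypothetical in-memory view of a workflow_instances row (assumption).
interface InstanceRow {
  id: string;
  currentState: string;
  versionNo: number;
}

type TransitionResult =
  | { ok: true; versionNo: number }
  | { ok: false; status: 409 };

// Compare-and-set: apply the transition only if the client saw the current version.
function applyTransition(
  row: InstanceRow,
  expectedVersionNo: number,
  nextState: string,
): TransitionResult {
  if (row.versionNo !== expectedVersionNo) {
    return { ok: false, status: 409 }; // stale client: reject without changing anything
  }
  row.currentState = nextState;
  row.versionNo += 1; // the increment invalidates every other in-flight request
  return { ok: true, versionNo: row.versionNo };
}
```

Two requests that both carry `versionNo: 5` illustrate the guarantee: the first moves the row to version 6, and the second no longer matches and receives 409.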
---
## Decision 2: DSL `require.role` → CASL Ability Mapping (FR-002a)
**Question:** How should the guard resolve DSL `require.role: ["Admin"]` against the CASL permission model?
### Option A: Static config map in guard (Selected ✅)
```typescript
const DSL_ROLE_TO_CASL: Record<string, string> = {
'Superadmin': 'system.manage_all',
'OrgAdmin': 'organization.manage_users',
'ContractMember': 'contract.view',
'AssignedHandler': '__assigned__',
};
```
Guard resolves each DSL role string → CASL permission string → `userPermissions.includes(mapped)`.
**Pros:**
- No new DB tables or config entities
- Testable in isolation (mock `userPermissions`)
- Backward-compatible: unknown DSL roles fall through to `__assigned__` check
### Option B: Dynamic mapping table in DB (Rejected ❌)
Store DSL role → CASL ability mappings in a new `workflow_role_mappings` table.
**Cons:**
- New table requires ADR-009 delta + entity + service
- Over-engineering for a mapping that changes rarely
- Adds DB query to every transition guard check
**Decision:** Option A — static config map.
**Rationale:** The mapping is stable (tied to ADR-016 RBAC levels); config-driven in code is sufficient and avoids over-engineering.
---
## Decision 3: Per-Transition Prometheus Metrics (FR-023)
**Question:** The existing service has Redlock-specific metrics. Where should workflow-level transition metrics be registered?
### Existing metrics (keep)
- `workflow_redlock_acquire_duration_ms` (Histogram)
- `workflow_redlock_acquire_failures_total` (Counter)
### New metrics needed
- `workflow_transitions_total` (Counter, labels: `workflow_code`, `action`, `outcome`)
- `workflow_transition_duration_ms` (Histogram, labels: `workflow_code`)
**Registration approach:** Add `makeCounterProvider` and `makeHistogramProvider` in `workflow-engine.module.ts` via `@willsoto/nestjs-prometheus`. Inject with `@InjectMetric('workflow_transitions_total')`.
**Outcome label values:**
- `success` — transition committed
- `conflict` — optimistic lock mismatch (409) or TOCTOU
- `forbidden` — CASL guard rejection (403)
- `validation_error` — DSL condition failed (422)
- `system_error` — unexpected exception (500)
**Decision:** Register in `WorkflowEngineModule`; inject into `WorkflowEngineService`; record in `processTransition()` try/catch/finally block.
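
One possible way to derive the `outcome` label and record both metrics is sketched below. `CounterLike` and `HistogramLike` are hypothetical structural subsets of the `inc(labels)` and `observe(labels, value)` methods offered by prom-client metrics; `outcomeForStatus` and `recordTransition` are illustrative names. Mapping from the HTTP status alone is a simplification, since other 409 cases (such as a terminal-state request) would also land in `conflict`.

```typescript
type TransitionOutcome =
  | 'success'
  | 'conflict'
  | 'forbidden'
  | 'validation_error'
  | 'system_error';

// Map an HTTP status to the outcome label values listed above.
function outcomeForStatus(httpStatus: number): TransitionOutcome {
  if (httpStatus >= 200 && httpStatus < 300) return 'success';
  switch (httpStatus) {
    case 409: return 'conflict';
    case 403: return 'forbidden';
    case 422: return 'validation_error';
    default: return 'system_error';
  }
}

// Structural stand-ins for the injected metrics (assumption).
interface CounterLike {
  inc(labels: Record<string, string>): void;
}
interface HistogramLike {
  observe(labels: Record<string, string>, value: number): void;
}

// Record one transition: bump the counter and observe the duration.
function recordTransition(
  counter: CounterLike,
  histogram: HistogramLike,
  workflowCode: string,
  action: string,
  httpStatus: number,
  durationMs: number,
): TransitionOutcome {
  const outcome = outcomeForStatus(httpStatus);
  counter.inc({ workflow_code: workflowCode, action, outcome });
  histogram.observe({ workflow_code: workflowCode }, durationMs);
  return outcome;
}
```

Calling `recordTransition` from the `finally` branch means failures are counted as reliably as successes.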
---
## Decision 4: DSL Definition Redis Cache Pattern (FR-007)
**Question:** Cache key format, TTL, and invalidation strategy for `workflow_definitions`.
### Cache key design
```
wf:def:{workflow_code}:{version} → single definition
wf:def:{workflow_code}:active → pointer to active version (for fast active lookup)
```
### Invalidation triggers
| Event | Action |
|-------|--------|
| `createDefinition()` | SET new key; leave old active pointer |
| `update()` — DSL change | DEL old key; SET updated key |
| `is_active = true` | SET `wf:def:{code}:active` to the new version (overwrites the previous active pointer) |
| `is_active = false` | DEL `wf:def:{code}:active` |
### TTL
3600 seconds (1 hour). This is an acceptable stale window for inactive definitions; the active pointer is always invalidated on toggle.
**Decision:** Two-key pattern with a separate `:active` pointer key. Invalidate pointer immediately on `is_active` change → satisfies SC-005 (< 1s invalidation).
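
The invalidation table can be read as a function from a definition event to a list of cache operations, as in this sketch. `DefinitionEvent`, `CacheOp`, and `planCacheOps` are hypothetical; applying the same TTL to the `:active` pointer is an assumption, as is modelling an update with separate old and new versions (they may be equal).

```typescript
type CacheOp =
  | { op: 'SET'; key: string; ttlSeconds: number }
  | { op: 'DEL'; key: string };

// Hypothetical event shapes (assumption); they mirror the rows of the trigger table.
type DefinitionEvent =
  | { type: 'created'; code: string; version: number }
  | { type: 'updated'; code: string; oldVersion: number; newVersion: number }
  | { type: 'activated'; code: string; version: number }
  | { type: 'deactivated'; code: string };

const TTL_SECONDS = 3600;
const versionKey = (code: string, version: number): string => `wf:def:${code}:${version}`;
const activeKey = (code: string): string => `wf:def:${code}:active`;

// Translate a definition event into the cache operations to perform, in order.
function planCacheOps(event: DefinitionEvent): CacheOp[] {
  switch (event.type) {
    case 'created':
      return [{ op: 'SET', key: versionKey(event.code, event.version), ttlSeconds: TTL_SECONDS }];
    case 'updated':
      return [
        { op: 'DEL', key: versionKey(event.code, event.oldVersion) },
        { op: 'SET', key: versionKey(event.code, event.newVersion), ttlSeconds: TTL_SECONDS },
      ];
    case 'activated':
      // One pointer key per workflow code, so SET overwrites the previous pointer.
      return [{ op: 'SET', key: activeKey(event.code), ttlSeconds: TTL_SECONDS }];
    case 'deactivated':
      return [{ op: 'DEL', key: activeKey(event.code) }];
  }
}
```

Separating the plan from its execution keeps the key logic unit-testable without a Redis instance.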
---
## Decision 5: BullMQ Dead-Letter Queue Architecture (FR-005, FR-006)
**Question:** How to implement `workflow-events-failed` DLQ with n8n webhook notification?
### Option A: Separate `workflow-events-failed` queue (Selected ✅)
```typescript
// WorkflowEventProcessor
@OnWorkerEvent('failed')
async onFailed(job: Job, error: Error) {
if (job.attemptsMade >= (job.opts.attempts ?? 1)) {
// All retries exhausted → DLQ + webhook
await this.failedQueue.add('dead-letter', { jobId: job.id, ...job.data });
await this.notifyOps(job, error);
}
}
```
**Pros:**
- Failed jobs visible in Bull Board under `workflow-events-failed`
- Can be requeued manually via Bull Board UI
- n8n only notified on final failure (not on intermediate retries)
### Option B: BullMQ native `removeOnFail: false` only (Rejected ❌)
Keep jobs in `workflow-events` completed/failed states, no separate queue.
**Cons:**
- Bull Board has no separate DLQ view
- No ops notification mechanism
- Harder to isolate and requeue failed jobs
**Decision:** Option A — separate `workflow-events-failed` queue.
**n8n webhook:** Send via `fetch(process.env.N8N_WEBHOOK_URL, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(payload) })`. Guard with `if (!process.env.N8N_WEBHOOK_URL)` so that an unconfigured dev environment logs a warning instead of failing hard.
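
The guard and request construction can be isolated in a small pure helper, sketched here. `DeadLetterPayload`, `WebhookRequest`, and `buildDeadLetterWebhook` are hypothetical names, and the payload fields are illustrative rather than a defined contract with n8n.

```typescript
// Hypothetical payload shape (assumption).
interface DeadLetterPayload {
  jobId: string;
  queue: string;
  failedReason: string;
  data: unknown;
}

interface WebhookRequest {
  url: string;
  init: {
    method: 'POST';
    headers: Record<string, string>;
    body: string;
  };
}

// Return the request to send, or null when no webhook URL is configured.
function buildDeadLetterWebhook(
  webhookUrl: string | undefined,
  payload: DeadLetterPayload,
): WebhookRequest | null {
  if (!webhookUrl) return null; // caller logs a warning and skips the notification
  return {
    url: webhookUrl,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    },
  };
}
```

The processor would pass `process.env.N8N_WEBHOOK_URL` as the first argument and call `fetch(request.url, request.init)` only when the result is not `null`.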
---
## Decision 6: Admin DSL Editor — JSON Editor Library (FR-024, FR-025)
**Question:** Which JSON editor library for the DSL authoring UI?
### Option A: Monaco Editor (`@monaco-editor/react`) (Selected ✅)
Full VS Code-like editor with JSON syntax highlighting, bracket matching, and inline error markers.
**Pros:**
- Inline error decoration (squiggle underlines) — satisfies FR-025 inline validation feedback
- JSON schema validation via `monaco.languages.json.jsonDefaults.setDiagnosticsOptions()`
- Familiar to developers
- May already be in use elsewhere in the admin UIs (to be confirmed)
**Cons:**
- ~2MB bundle (lazy loaded via `dynamic(() => import('@monaco-editor/react'), { ssr: false })`)
### Option B: CodeMirror 6 (`@uiw/react-codemirror`) (Alternative)
Lighter (~400KB for JSON extension).
**Cons:**
- No native JSON Schema validation; requires custom linting extension
- Inline error decoration requires manual setup
**Decision:** Option A — Monaco Editor, lazy-loaded. The bundle cost is acceptable for an Admin-only page (not user-facing); inline validation is a critical FR-025 requirement.
**DSL Schema**: Provide the compiled DSL JSON Schema to Monaco for inline validation → errors shown before the user clicks Save.
---
## Carry-Forward from Prior Research
The following decisions from `specs/08-Tasks/ADR-021-workflow-context/research.md` remain valid and are not re-litigated:
- **File attachment strategy**: Upload-then-reference (Two-Phase, ADR-016) ✅
- **FK structure**: Direct `workflow_history_id` on `attachments` table ✅
- **UUID type for `workflow_histories.id`**: CHAR(36) UUID direct PK ✅
- **Redlock scope**: Transition-level Redlock (not document-numbering Redlock) ✅
- **Preview endpoint**: Use existing `/api/files/{publicId}` with `Content-Disposition: inline`
@@ -0,0 +1,216 @@
# Feature Specification: Unified Workflow Engine — Production Hardening & Integrated Context
**Feature Branch**: `003-unified-workflow-engine`
**Created**: 2026-05-02
**Status**: Draft
**References**: ADR-001 (Unified Workflow Engine v1.1), ADR-021 (Integrated Workflow Context & Step-specific Attachments)
---
## Clarifications
### Session 2026-05-02
- Q: How should the `WorkflowTransitionGuard` resolve DSL `require.role` values against the CASL permission system? → A: DSL `require.role` values map to **CASL ability checks**: each role string corresponds to a defined CASL `action:subject` permission pair (e.g., `"Admin"` → `workflow.manage`). The guard resolves permissions dynamically at transition time; it does NOT match DB role names directly.
- Q: What level of observability is required for workflow transition operations? → A: **Structured log + metrics** — one structured log entry per transition (instance ID, action, user UUID, duration ms, outcome: success/conflict/forbidden/error) plus a counter metric for transition throughput and a latency histogram. No distributed tracing required at this stage.
- Q: When a file has been moved to permanent storage but the DB transition subsequently fails, what is the recovery action? → A: **Move back to temp**: `StorageService` moves the file from permanent back to temp on DB failure; temp files expire after a 24-hour TTL, allowing the user to retry the transition without re-uploading or re-scanning.
- Q: Does this feature include a frontend Admin UI for DSL authoring, or is API-only sufficient? → A: **Full Admin UI in scope** — a frontend page for Super Admins to create, edit (JSON editor), activate, and deactivate workflow definitions with inline DSL validation feedback. Visual workflow builder (drag-and-drop) remains Phase 2 / out of scope.
- Q: Which modules still need new Integrated Banner + Workflow Lifecycle integration work? → A: **All four modules need gap-filling** — RFA, Transmittal, Circulation, and Correspondence all have the banner component mounted but have incomplete data wiring (e.g., missing `availableActions`, no step-attachment upload support). None are fully complete; all require targeted completion work.
---
## User Scenarios & Testing _(mandatory)_
### User Story 1 — Workflow Transition with State Integrity (Priority: P1)
A Reviewer or Approver assigned to an active workflow step transitions a document from one state to the next (e.g., `PENDING_REVIEW` → `APPROVED`). The system must guarantee that only one transition occurs even if two users click "Approve" simultaneously, that the workflow history records who acted and when, and that downstream notifications are dispatched asynchronously without slowing down the response.
**Why this priority**: Core correctness of the Workflow Engine — without reliable, race-condition-free transitions the entire approval chain is unreliable.
**Independent Test**: Can be fully tested by submitting two concurrent approval requests and verifying only one succeeds (the other returns 409), and that the history table contains exactly one new record.
**Acceptance Scenarios**:
1. **Given** a document in `PENDING_REVIEW` state with `version_no = 5`, **When** an assigned handler submits the `APPROVE` action, **Then** the state transitions to `APPROVED`, `version_no` increments to `6`, and a new `workflow_histories` record is written within the same DB transaction.
2. **Given** two concurrent `APPROVE` requests for the same instance at the same `version_no`, **When** both reach the server simultaneously, **Then** exactly one succeeds (200) and the other receives 409 "Concurrent transition detected — please retry" without any data corruption.
3. **Given** a successful transition, **When** the transition commits, **Then** a BullMQ job is enqueued on the `workflow-events` queue within the same request (no inline notification call).
4. **Given** a `PENDING_REVIEW` instance and a user who is NOT the assigned handler and does NOT have the required CASL ability (e.g., `workflow.manage`) mapped from the DSL `require.role` value, **When** they attempt to transition, **Then** they receive 403 Forbidden.
---
### User Story 2 — Condition-Gated Transitions via DSL (Priority: P1)
A workflow step requires a condition to be met (e.g., `requiresLegal > 0`) before a transition is allowed. The DSL defines this as a JSON Logic rule, and the engine evaluates it against the current `context` at transition time.
**Why this priority**: Without reliable condition evaluation, automated gating (legal review, approval thresholds) fails and documents could bypass required steps.
**Independent Test**: Can be fully tested by configuring a DSL with a JSON Logic condition, providing a context that both satisfies and fails the condition, and observing that transitions are allowed/blocked accordingly.
**Acceptance Scenarios**:
1. **Given** a DSL transition with `{ "type": "json-logic", "rule": { ">": [{ "var": "requiresLegal" }, 0] } }` and context `{ "requiresLegal": 1 }`, **When** the `SUBMIT` action is triggered, **Then** the transition proceeds.
2. **Given** the same DSL and context `{ "requiresLegal": 0 }`, **When** `SUBMIT` is triggered, **Then** the transition is blocked and the caller receives a `ValidationException` (HTTP 422) with a field-level error.
3. **Given** a DSL that uses a raw JS string expression (`"context.x === true"`) instead of JSON Logic format, **When** an Admin attempts to save the DSL, **Then** the save is rejected with a validation error explaining only JSON Logic format is permitted.
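
The three scenarios above can be illustrated with a deliberately tiny evaluator. This is not the JSON Logic library: it is a hand-written sketch that supports only `var` and `>` and, unlike real JSON Logic, refuses to coerce non-numeric operands. All type and function names are hypothetical.

```typescript
type Operand = number | string | boolean | null | { var: string };
type Rule = { '>': [Operand, Operand] };

interface JsonLogicCondition {
  type: 'json-logic';
  rule: Rule;
}

// Resolve a literal or a { var } reference against the transition context.
function resolveOperand(operand: Operand, context: Record<string, unknown>): unknown {
  if (typeof operand === 'object' && operand !== null) return context[operand.var];
  return operand;
}

// Evaluate the single supported operator; non-numeric operands never pass.
function evaluateRule(rule: Rule, context: Record<string, unknown>): boolean {
  const [left, right] = rule['>'];
  const a = resolveOperand(left, context);
  const b = resolveOperand(right, context);
  return typeof a === 'number' && typeof b === 'number' && a > b;
}

// Save-time shape check: raw JS strings such as "context.x === true" are rejected.
function isJsonLogicCondition(value: unknown): boolean {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as { type?: unknown; rule?: unknown };
  return (
    candidate.type === 'json-logic' &&
    typeof candidate.rule === 'object' &&
    candidate.rule !== null
  );
}

const requiresLegalCondition: JsonLogicCondition = {
  type: 'json-logic',
  rule: { '>': [{ var: 'requiresLegal' }, 0] },
};
```

With `requiresLegalCondition`, a context of `{ requiresLegal: 1 }` lets the transition proceed, while `{ requiresLegal: 0 }` blocks it. Because rules are plain data, nothing is ever passed to `eval` or `new Function`.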
---
### User Story 3 — Integrated Contextual Banner & Workflow Lifecycle View (Priority: P1)
A Reviewer opens a document detail page (RFA, Transmittal, Circulation, or Correspondence). Instead of navigating to a separate Workflow panel, the document header immediately shows the document number, current status, priority badge, and Approve/Reject action buttons. A "Workflow Engine" tab below displays a vertical timeline of all workflow steps — active step highlighted in indigo with a pulse animation.
**Why this priority**: Without the Integrated Banner and Lifecycle View (ADR-021 REQ-01 to REQ-03), Reviewers must switch between screens to understand context, increasing approval time and error rate.
**Independent Test**: Can be fully tested by opening any document in `PENDING_REVIEW` or `PENDING_APPROVAL` state and visually confirming the banner shows correct status + action buttons, and the timeline tab shows the active step in indigo.
**Acceptance Scenarios**:
1. **Given** an RFA in `PENDING_APPROVAL` state with priority `URGENT`, **When** the detail page loads, **Then** the banner at the top displays the document number, `PENDING_APPROVAL` status badge, `URGENT` priority badge, and `Approve`/`Reject` action buttons — all before the document body content.
2. **Given** a workflow with 4 steps (DRAFT → PENDING_REVIEW → PENDING_APPROVAL → APPROVED), **When** the document is in `PENDING_REVIEW`, **Then** step 2 shows indigo color with CSS pulse animation; steps 1, 3, 4 show no animation.
3. **Given** a completed document (`APPROVED` or `CLOSED`), **When** the detail page loads, **Then** the action buttons are disabled/hidden and no upload controls are visible.
---
### User Story 4 — Step-specific Attachment Upload & Preview (Priority: P2)
While reviewing a document in an active workflow step, a handler uploads evidence files (PDF, DWG, DOCX, XLSX, ZIP) to be linked specifically to that step's history record. Later, any authorized user can click the file to preview it inline via a modal without navigating away.
**Why this priority**: Step-specific attachments provide the audit trail required for compliance — files are traceable to the exact decision step. Preview reduces time spent downloading/opening files.
**Independent Test**: Can be fully tested by uploading a PDF during `PENDING_REVIEW`, transitioning to `APPROVED`, and verifying the file is visible under the `PENDING_REVIEW` history entry with inline preview working.
**Acceptance Scenarios**:
1. **Given** a document in `PENDING_REVIEW` state, **When** the assigned handler drags and drops a valid PDF onto the upload zone, **Then** the file is scanned by ClamAV, stored in permanent storage after a successful transition, and linked to the `workflow_histories` record for that step.
2. **Given** a document in `APPROVED` (terminal) state, **When** any user attempts to upload a file, **Then** the upload zone is disabled and the system returns HTTP 409 "Cannot upload to terminal state".
3. **Given** a file linked to a step, **When** any authorized user clicks the file name, **Then** a preview modal opens in-browser without navigating away from the detail page.
4. **Given** a file infected with malware detected by ClamAV, **When** upload is attempted, **Then** the temp file is deleted immediately, the upload is rejected, and the user sees "File rejected: security scan failed".
5. **Given** a duplicate upload request with the same `Idempotency-Key`, **When** the duplicate request arrives, **Then** the system returns the cached 201 response without creating a second record.
---
### User Story 5 — Workflow Definition Authoring (Super Admin Only) (Priority: P2)
A Super Admin creates or updates a workflow DSL definition via an **Admin UI page** (JSON editor with inline validation feedback). The system validates the DSL structure and activates the new version. In-progress workflow instances continue using their bound version until completion.
**Why this priority**: Without safe DSL authoring, new document types cannot be onboarded and workflow changes cannot be deployed without code releases.
**Independent Test**: Can be fully tested by creating a new DSL definition, activating it, and verifying existing in-progress instances still use the old version while new instances use the new version.
**Acceptance Scenarios**:
1. **Given** a Super Admin submits a valid DSL JSON, **When** the definition is saved and activated, **Then** the Redis cache key `wf:def:{workflow_code}:{version}` is invalidated immediately and new instances start using the new version.
2. **Given** an in-progress `workflow_instances` record bound to version 1, **When** version 2 is activated, **Then** the in-progress instance continues using version 1's `definition_id` until it reaches a terminal state.
3. **Given** a non-Super-Admin user, **When** they attempt to create or activate a DSL definition, **Then** they receive 403 Forbidden (`system.manage_all` required).
4. **Given** a context_schema with a `required` field, **When** a transition is triggered with a context missing that field, **Then** HTTP 422 is returned with `{ "field": "<context_field>", "message": "required field missing" }`.
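
Scenario 4 amounts to a required-field check that yields field-level errors. The sketch below assumes `context_schema` carries a JSON-Schema-style `required` array; `ContextSchema`, `FieldError`, and `validateRequiredContext` are hypothetical names.

```typescript
// Assumed schema shape: a JSON-Schema-like `required` list of context field names.
interface ContextSchema {
  required?: string[];
}

interface FieldError {
  field: string;
  message: string;
}

// Return one error per required field that is absent (undefined or null) in the context.
function validateRequiredContext(
  schema: ContextSchema,
  context: Record<string, unknown>,
): FieldError[] {
  return (schema.required ?? [])
    .filter((field) => context[field] === undefined || context[field] === null)
    .map((field) => ({ field, message: 'required field missing' }));
}
```

A non-empty result would be turned into the HTTP 422 response; an empty array lets the transition continue. Falsy values such as `0` count as present, which matters for fields like `requiresLegal`.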
---
### User Story 6 — Dead-letter Queue & Ops Recovery (Priority: P3)
A BullMQ `workflow-events` job fails all 3 retry attempts and moves to `workflow-events-failed`. Ops team is notified via n8n webhook and can manually requeue the job via Bull Board UI.
**Why this priority**: Without dead-letter recovery, failed event dispatches (notifications, downstream triggers) are silently lost, breaking audit trail integrity.
**Independent Test**: Can be fully tested by causing a simulated worker failure and verifying the n8n webhook fires and the job appears in the Bull Board dead-letter queue.
**Acceptance Scenarios**:
1. **Given** a `workflow-events` job that fails 3 times with exponential backoff, **When** attempts are exhausted, **Then** the job moves to `workflow-events-failed` queue and a webhook call is sent to `N8N_WEBHOOK_URL`.
2. **Given** a job in `workflow-events-failed`, **When** an Ops admin clicks "Retry" in Bull Board UI, **Then** the job re-enters `workflow-events` queue for processing.
3. **Given** a failed job, **When** the system auto-retries, **Then** it uses exponential backoff: attempt 1 immediately, attempt 2 after 500ms, attempt 3 after 1000ms — and does NOT auto-requeue after the dead-letter queue.
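
The retry timing in scenario 3 follows the usual doubling formula, sketched below. The helper names are hypothetical, and the formula is written out by hand to match the numbers in the scenario rather than taken from BullMQ's source.

```typescript
// Delay before the n-th retry (n starts at 1): base, 2*base, 4*base, ...
function exponentialBackoffMs(retryNumber: number, baseDelayMs: number): number {
  if (!Number.isInteger(retryNumber) || retryNumber < 1) {
    throw new RangeError('retryNumber must be a positive integer');
  }
  return baseDelayMs * 2 ** (retryNumber - 1);
}

// Delay preceding each attempt: the first attempt runs immediately.
function attemptSchedule(attempts: number, baseDelayMs: number): number[] {
  return Array.from({ length: attempts }, (_, index) =>
    index === 0 ? 0 : exponentialBackoffMs(index, baseDelayMs),
  );
}
```

For `attempts: 3` and `delay: 500`, `attemptSchedule(3, 500)` yields `[0, 500, 1000]`, matching the scenario. After the third failure the job goes to `workflow-events-failed` and is not retried automatically.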
---
### Edge Cases
- What happens when Redis is down during a workflow transition (no Redlock available for state transition)? The optimistic lock (`version_no`) alone handles concurrency for transitions — Redis is NOT required for transitions (only for Document Numbering per ADR-002). Transition proceeds normally; only file-upload-plus-transition uses Redlock.
- What happens when a Redis Redlock fails during file-upload-plus-transition? Retry 3 times (500ms exponential backoff); if still failing, return HTTP 503 "Service temporarily unavailable" (Fail-closed — no partial state).
- What happens when a terminal-state workflow receives a transition request? The engine returns 409 `BusinessException` — "Workflow is already in a terminal state".
- What happens when `context_schema.required` field is missing at transition time? HTTP 422 `ValidationException` with field-level error — transition is blocked; caller must supply the missing context field and retry.
- What happens when a file is deleted from storage after being linked to a workflow step? The UI shows "File unavailable" for that attachment; the `workflow_histories` metadata record is preserved.
- What happens when two Admins concurrently activate different DSL versions for the same `workflow_code`? Last-write-wins on `is_active`; Redis cache is invalidated by both writes; existing instances are unaffected (already bound to a `definition_id`).
---
## Requirements _(mandatory)_
### Functional Requirements
**Workflow Engine Core (ADR-001)**
- **FR-001**: The system MUST evaluate workflow transition conditions using JSON Logic format (`{ "type": "json-logic", "rule": {...} }`) exclusively — no JavaScript string evaluation (`eval` / `new Function`).
- **FR-002**: The system MUST use optimistic locking (`version_no INT NOT NULL DEFAULT 1`) on `workflow_instances` to prevent concurrent double-transitions — only one transition per `(id, current_state, version_no)` tuple succeeds; the other receives HTTP 409.
- **FR-002a**: The `WorkflowTransitionGuard` MUST resolve DSL `require.role` values as **CASL ability checks** — each string value maps to a defined CASL `action:subject` pair (e.g., `"Admin"` → `workflow.manage`). Direct DB role-name matching is forbidden; permissions are evaluated dynamically at transition time via the CASL `AbilityFactory`.
- **FR-003**: The system MUST record every state transition in `workflow_histories`, including `action_by_user_id` (INT FK, internal, excluded from API) and `action_by_user_uuid` (VARCHAR 36, exposed in API per ADR-019).
- **FR-004**: All workflow events (notifications, side effects) MUST be dispatched via the dedicated BullMQ queue `workflow-events` — never inline within the request thread.
- **FR-005**: The `workflow-events` worker MUST be configured with concurrency 5, 3 retry attempts with exponential backoff, and a `workflow-events-failed` dead-letter queue.
- **FR-006**: When a job enters `workflow-events-failed`, the system MUST send a webhook to `N8N_WEBHOOK_URL` (env var, never hardcoded) to alert the ops team.
- **FR-007**: `workflow_definitions` MUST be cached in Redis with key `wf:def:{workflow_code}:{version}` (TTL: 1 hour), invalidated immediately when a Super Admin saves or activates a definition.
- **FR-008**: Context schema validation MUST occur in two phases: Phase 1 at definition save-time (structure), Phase 2 at transition-time (values against required fields) — missing required fields return HTTP 422 with field-level errors.
- **FR-009**: Only users with `system.manage_all` permission MAY create, update, activate, or deactivate workflow definitions.
- **FR-010**: In-progress `workflow_instances` MUST remain bound to the `definition_id` at time of creation — activating a new DSL version MUST NOT rebind in-progress instances.
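The optimistic lock in FR-002 can be illustrated with an in-memory stand-in for the conditional `UPDATE`. This is a sketch under stated assumptions: the row type and function name are hypothetical, and a real implementation would rely on the database applying the `WHERE` clause atomically rather than on a JavaScript object.

```typescript
interface InstanceRow {
  id: number;
  currentState: string;
  versionNo: number;
}

type TransitionResult = { status: 200; versionNo: number } | { status: 409 };

// Stand-in for:
//   UPDATE workflow_instances SET current_state = ?, version_no = version_no + 1
//   WHERE id = ? AND current_state = ? AND version_no = ?
// The update applies only if the caller's snapshot is still current.
function compareAndTransition(
  row: InstanceRow,
  expectedState: string,
  expectedVersionNo: number,
  nextState: string,
): TransitionResult {
  if (row.currentState !== expectedState || row.versionNo !== expectedVersionNo) {
    return { status: 409 };
  }
  row.currentState = nextState;
  row.versionNo += 1;
  return { status: 200, versionNo: row.versionNo };
}

// 50 callers that all read the same snapshot (PENDING_APPROVAL, version 1).
const row: InstanceRow = { id: 1, currentState: "PENDING_APPROVAL", versionNo: 1 };
let successes = 0;
let conflicts = 0;
for (let i = 0; i < 50; i++) {
  const result = compareAndTransition(row, "PENDING_APPROVAL", 1, "APPROVED");
  if (result.status === 200) successes += 1;
  else conflicts += 1;
}
// successes === 1, conflicts === 49
```

Only the first caller matches the `(id, current_state, version_no)` tuple; every later caller sees a bumped version and receives 409, which is the behaviour SC-001 measures.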
**Integrated Banner & Lifecycle View (ADR-021 REQ-01 to REQ-03)**
- **FR-011**: Every document detail page (RFA, Transmittal, Circulation, Correspondence) MUST complete the Integrated Banner wiring — all four modules already have the component mounted but require gap-filling: live `workflowState`, `availableActions`, priority badge, and step-attachment upload support must be fully connected. No module is exempt.
- **FR-012**: The "Workflow Engine" tab on detail pages MUST display a vertical timeline of all workflow steps with: step role, handler name, description, and visual state (completed/active/pending).
- **FR-013**: The active step MUST be rendered with indigo color (`#6366f1`) and a CSS pulse animation; all other steps MUST NOT have the pulse animation.
**Step-specific Attachments (ADR-021 REQ-04 to REQ-05)**
- **FR-014**: The `attachments` table MUST have a nullable FK `workflow_history_id` — existing attachments without this FK are treated as main-document attachments.
- **FR-015**: Users MAY upload attachments only when the document is in an active-decision state (`PENDING_REVIEW` or `PENDING_APPROVAL`); uploads MUST be rejected with HTTP 409 when the document is in a terminal state (`APPROVED`, `REJECTED`, `CLOSED`).
- **FR-016**: Only the assigned step handler, organization admin, or Super Admin may upload step-specific attachments; unauthorized attempts return HTTP 403.
- **FR-017**: All uploaded files MUST be scanned by ClamAV before moving from temp to permanent storage; infected files MUST be deleted immediately and the user notified with "File rejected: security scan failed".
- **FR-018**: File uploads with a transition MUST require an `Idempotency-Key` header; duplicate requests with the same key return the cached result without re-processing.
- **FR-019**: Every step-specific attachment upload MUST be atomic with the workflow transition. Recovery on failure is: (1) if DB transition fails after file reaches permanent storage, `StorageService` MUST move the file back to temp storage; (2) temp files expire after a **24-hour TTL** and are automatically purged; (3) the user MAY retry the transition within the TTL window without re-uploading or re-scanning the file.
- **FR-020**: Any authorized user MAY preview PDF and image files inline via a modal without navigating away from the detail page.
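The upload gate described by FR-015 and FR-016 reduces to a small decision function. The sketch below is illustrative: the `UploaderRole` type and function name are hypothetical, the order of the two checks is an assumption, and so is returning 409 for any state outside the two active-decision states (the spec only names terminal states explicitly).

```typescript
// Hypothetical role labels for the three permitted uploaders plus everyone else.
type UploaderRole = "ASSIGNED_HANDLER" | "ORG_ADMIN" | "SUPER_ADMIN" | "OTHER";

const ACTIVE_DECISION_STATES: string[] = ["PENDING_REVIEW", "PENDING_APPROVAL"];

// Returns the HTTP status an upload attempt would receive.
function uploadStatus(state: string, role: UploaderRole): 200 | 403 | 409 {
  // Assumption: any state outside the active-decision set is rejected with 409.
  if (ACTIVE_DECISION_STATES.indexOf(state) === -1) return 409;
  // FR-016: only the step handler, organization admin, or Super Admin may upload.
  if (role === "OTHER") return 403;
  return 200;
}

const allowed = uploadStatus("PENDING_REVIEW", "ASSIGNED_HANDLER"); // 200
const terminal = uploadStatus("APPROVED", "SUPER_ADMIN"); // 409
const unauthorized = uploadStatus("PENDING_APPROVAL", "OTHER"); // 403
```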
**Admin UI — DSL Authoring (Super Admin)**
- **FR-024**: The system MUST provide an Admin UI page (accessible only to Super Admins) where DSL definitions can be created, edited (JSON editor), activated, and deactivated.
- **FR-025**: The DSL editor MUST display inline validation feedback — structure errors (Phase 1 save-time) are highlighted before the user saves; the page MUST NOT allow saving a DSL that fails Phase 1 validation.
**i18n (ADR-021 REQ-06)**
- **FR-021**: All UI text on new and updated components MUST use i18n keys — no hardcoded Thai or English strings.
**Observability**
- **FR-022**: The Workflow Engine MUST emit one structured log entry per transition containing: `instanceId`, `action`, `fromState`, `toState`, `userUuid`, `durationMs`, and `outcome` (`success` | `conflict` | `forbidden` | `validation_error` | `system_error`).
- **FR-023**: The Workflow Engine MUST record two metrics: (1) a **transition counter** labelled by `workflow_code`, `action`, and `outcome`; (2) a **transition latency histogram** (ms) labelled by `workflow_code`.
### Key Entities
- **WorkflowDefinition**: Versioned DSL template defining states, transitions, conditions, events, and context schema. Identified by `workflow_code` + `version`. One active version per code.
- **WorkflowInstance**: Running instance bound to a specific entity (RFA, Transmittal, Correspondence, Circulation). Tracks `current_state`, `context` (JSON), and `version_no` (optimistic lock).
- **WorkflowHistory**: Immutable record of every state transition. Linked to the acting user (both INT FK and UUID), comment, and metadata. Step-specific attachments link here.
- **Attachment**: File stored in permanent storage. May be a main-document attachment (`workflow_history_id = NULL`) or a step-specific attachment (`workflow_history_id` set).
---
## Success Criteria _(mandatory)_
### Measurable Outcomes
- **SC-001**: Zero concurrent double-approvals — a load test with 50 simultaneous `APPROVE` requests on the same workflow instance results in exactly 1 success and 49 responses with status 409.
- **SC-002**: Transition throughput — workflow state change (without file upload) completes in under 1 second (P95) for documents with up to 20 workflow history records under normal load.
- **SC-003**: Upload + transition SLA — `POST /workflow/:uuid/transition` with a file ≤ 10MB (including ClamAV scan, Redlock, and DB transaction) responds within 5 seconds (P95).
- **SC-004**: Event delivery reliability — less than 0.1% of `workflow-events` jobs reach the dead-letter queue under normal operating conditions.
- **SC-005**: DSL cache effectiveness — activating a new DSL version results in the stale cache entry being invalidated within 1 second on all app instances.
- **SC-006**: Integrated Banner adoption — 100% of document detail pages (RFA, Transmittal, Circulation, Correspondence) display the Integrated Banner and Workflow Engine tab after release.
- **SC-007**: No navigation required — reviewers complete document approval (view context + act) without leaving the detail page in 95%+ of sessions.
- **SC-008**: Audit completeness — every workflow transition has a corresponding `workflow_histories` record with user UUID, timestamp, action, and comment (if provided); zero orphaned transitions.
- **SC-009**: Observability coverage — 100% of workflow transitions (success, conflict, forbidden, error) produce a structured log entry and increment the transition counter metric; no silent failures.
---
## Assumptions
- ADR-001 Unified Workflow Engine backend infrastructure (`workflow_definitions`, `workflow_instances`, `workflow_histories` tables) is already partially implemented; this spec covers the production-hardening gaps (JSON Logic, `version_no`, dedicated BullMQ queue, context schema two-phase validation, ADR-019 UUID compliance for history records).
- ADR-021 Integrated Banner and Workflow Lifecycle components are **mounted but incompletely wired** across all four modules (RFA, Transmittal, Circulation, Correspondence). Common gaps include: missing live `availableActions`, no step-specific attachment upload zone, incomplete i18n. This spec closes all four modules to full completion.
- `json-logic-js` npm package is used for condition evaluation in `WorkflowDslService` (in-process, no external service).
- Redis and BullMQ infrastructure are available in all environments.
- ClamAV is available as a service and integrated via the existing `StorageService` two-phase upload pattern.
- `N8N_WEBHOOK_URL` environment variable will be set in `docker-compose.yml` for all environments before deploy.
- Bull Board UI (`@bull-board/nestjs`) will be installed for `workflow-events` and `workflow-events-failed` queue visibility.
@@ -0,0 +1,313 @@
# Tasks: Unified Workflow Engine — Production Hardening & Integrated Context
**Input**: Design documents from `specs/003-unified-workflow-engine/`
**Prerequisites**: plan.md ✅ | spec.md ✅ | data-model.md ✅ | research.md ✅ | contracts/ ✅ | quickstart.md ✅
**Tests**: Included for business-critical paths (per plan.md Test Plan)
**Organization**: Tasks grouped by user story (US1–US5), enabling independent implementation and testing.
## Format: `[ID] [P?] [Story] Description`
- **[P]**: Can run in parallel (different files, no shared dependencies)
- **[Story]**: Which user story this task belongs to
- **Exact file paths** included in all descriptions
---
## Phase 1: Setup (Schema Deltas — DB Prerequisites)
**Purpose**: Create and apply schema changes that ALL subsequent code depends on. No code changes until Phase 1 is complete.
**⚠️ MUST apply to DB before writing any entity code**
- [ ] T001 Create `specs/03-Data-and-Storage/deltas/09-add-version-no-to-workflow-instances.sql`: `ALTER TABLE workflow_instances ADD COLUMN version_no INT NOT NULL DEFAULT 1` with `idx_wf_inst_version` index (per data-model.md §1 Delta 09)
- [ ] T002 Create `specs/03-Data-and-Storage/deltas/10-add-action-by-user-uuid-to-workflow-histories.sql`: `ALTER TABLE workflow_histories ADD COLUMN action_by_user_uuid VARCHAR(36) NULL` (per data-model.md §1 Delta 10)
- [ ] T003 Apply Delta 09 to MariaDB: `source specs/03-Data-and-Storage/deltas/09-add-version-no-to-workflow-instances.sql` — verify with `DESCRIBE workflow_instances`
- [ ] T004 Apply Delta 10 to MariaDB: `source specs/03-Data-and-Storage/deltas/10-add-action-by-user-uuid-to-workflow-histories.sql` — verify with `DESCRIBE workflow_histories`
**Checkpoint**: Run `DESCRIBE workflow_instances` and `DESCRIBE workflow_histories` — both new columns must be present before Phase 2 begins.
---
## Phase 2: Foundational (Entity & Module Setup — Blocking Prerequisites)
**Purpose**: Entity/DTO/module changes that ALL user story implementations depend on. No user story work until Phase 2 is complete.
**⚠️ CRITICAL — blocks all phases 3+**
- [ ] T005 [P] Add `versionNo: number` column to `backend/src/modules/workflow-engine/entities/workflow-instance.entity.ts`: `@Column({ name: 'version_no', type: 'int', default: 1 })` (per data-model.md §2.1)
- [ ] T006 [P] Add `actionByUserUuid?: string` column to `backend/src/modules/workflow-engine/entities/workflow-history.entity.ts`: `@Column({ name: 'action_by_user_uuid', length: 36, nullable: true })` (per data-model.md §2.2)
- [ ] T007 [P] Add `actorUuid?: string` field to `backend/src/modules/workflow-engine/dto/workflow-history-item.dto.ts` with `@ApiPropertyOptional` decorator (per data-model.md §2.3)
- [ ] T008 Register `workflow_transitions_total` Counter and `workflow_transition_duration_ms` Histogram in `backend/src/modules/workflow-engine/workflow-engine.module.ts` via `makeCounterProvider` / `makeHistogramProvider` from `@willsoto/nestjs-prometheus` (per data-model.md §4, plan.md Phase B5)
- [ ] T009 [P] Verify backend TypeScript compiles with no errors after T005–T008: `pnpm tsc --noEmit` in `backend/`
**Checkpoint**: `pnpm tsc --noEmit` passes in backend. Existing workflow-engine tests still pass: `pnpm test --testPathPattern=workflow-engine`.
---
## Phase 3: User Story 1 — Workflow Transition with State Integrity (P1) 🎯 MVP
**Goal**: Guarantee race-condition-free state transitions with optimistic lock, CASL-mapped DSL role checks, structured observability, BullMQ dead-letter queue, and file rollback on DB failure.
**Independent Test**: POST 50 concurrent APPROVE requests on one instance → exactly 1 success (200) + 49 conflicts (409). Transition log entry appears for each outcome. The `workflow_transitions_total` counter increments for each outcome.
### Implementation — US1 Core: Optimistic Lock
- [ ] T010 [US1] Update `processTransition()` signature in `backend/src/modules/workflow-engine/workflow-engine.service.ts` — add `userUuid: string` and `clientVersionNo?: number` parameters (per data-model.md §3, quickstart.md)
- [ ] T011 [US1] Add fast-fail optimistic lock check in `processTransition()` BEFORE Redlock acquisition: read `instance.versionNo`, compare with `clientVersionNo`, throw `ConflictException('WORKFLOW_VERSION_CONFLICT')` HTTP 409 on mismatch (per data-model.md §3 "Fast-fail check")
- [ ] T012 [US1] Add CAS version increment inside DB transaction in `processTransition()`: `UPDATE workflow_instances SET version_no = version_no + 1 WHERE id = :id AND version_no = :expected` — throw `ConflictException` if `affected === 0` (per data-model.md §3 "Version increment")
- [ ] T013 [US1] Populate `actionByUserUuid: userUuid` when creating `WorkflowHistory` record inside `processTransition()` (per data-model.md §3 "History creation")
- [ ] T014 [US1] Return `versionNo` (post-increment value) in the transition response DTO so clients can update their local version
### Implementation — US1: CASL DSL Role Mapping (FR-002a)
- [ ] T015 [US1] Add `DSL_ROLE_TO_CASL` config map constant in `backend/src/modules/workflow-engine/guards/workflow-transition.guard.ts`: map `Superadmin → system.manage_all`, `OrgAdmin → organization.manage_users`, `ContractMember → contract.view`, `AssignedHandler → __assigned__` (per research.md Decision 2, quickstart.md)
- [ ] T016 [US1] Add DSL role resolution step in `WorkflowTransitionGuard.canActivate()`: load compiled definition from instance, extract `require.role[]` for `currentState`, map each via `DSL_ROLE_TO_CASL`, check `userPermissions.includes(mapped)` — pass if any match; fall through to existing Level 3 check for `__assigned__` (per plan.md Phase B4, quickstart.md "DSL Role Mapping" pattern)
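The mapping and resolution steps in T015 and T016 might look like the sketch below. The map entries come from T015; the function name, the return labels, and the treatment of unmapped roles are assumptions made for illustration, and the real guard would work with the CASL `AbilityFactory` rather than a plain permission list.

```typescript
const ASSIGNED_SENTINEL = "__assigned__";

// DSL role name -> permission string, as listed in T015.
const DSL_ROLE_TO_CASL: Record<string, string> = {
  Superadmin: "system.manage_all",
  OrgAdmin: "organization.manage_users",
  ContractMember: "contract.view",
  AssignedHandler: ASSIGNED_SENTINEL,
};

// "allow": a mapped permission matched.
// "fall-through": defer to the existing assigned-handler (Level 3) check.
type RoleResolution = "allow" | "fall-through";

function resolveDslRoles(requiredRoles: string[], userPermissions: string[]): RoleResolution {
  for (const role of requiredRoles) {
    // Guard against inherited keys such as "constructor".
    if (!Object.prototype.hasOwnProperty.call(DSL_ROLE_TO_CASL, role)) continue;
    const mapped = DSL_ROLE_TO_CASL[role];
    if (mapped === ASSIGNED_SENTINEL) continue; // handled by the Level 3 check
    if (userPermissions.indexOf(mapped) !== -1) return "allow";
  }
  return "fall-through";
}

const superAdmin = resolveDslRoles(["Superadmin"], ["system.manage_all"]);
const handlerOnly = resolveDslRoles(["AssignedHandler"], ["contract.view"]);
const unknownRole = resolveDslRoles(["Auditor"], ["system.manage_all"]);
```

Here `superAdmin` resolves to `"allow"`, while `handlerOnly` and `unknownRole` both fall through, matching the expectation in T025 that an unknown role defers to the assigned-user check.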
### Implementation — US1: Structured Observability (FR-022, FR-023)
- [ ] T017 [US1] Inject `workflow_transitions_total` Counter and `workflow_transition_duration_ms` Histogram via `@InjectMetric()` in `WorkflowEngineService` constructor (per data-model.md §4)
- [ ] T018 [US1] Wrap `processTransition()` body in `startMs = Date.now()` timer; add `try/catch/finally` block that: labels `outcome` from exception type, calls `transitionDuration.labels({workflow_code}).observe(durationMs)`, calls `transitionsTotal.labels({workflow_code, action, outcome}).inc()`, emits structured `this.logger.log(JSON.stringify({instanceId, action, fromState, toState, userUuid, durationMs, outcome, workflowCode}))` (per data-model.md §4, FR-022/023)
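The `try/catch/finally` pattern in T018 can be sketched independently of NestJS and Prometheus. Everything here is illustrative: the status-to-outcome mapping is an assumption (the task only says the outcome is derived from the exception type), the body is synchronous for brevity, and `emit` stands in for both the logger and the metric calls.

```typescript
type Outcome = "success" | "conflict" | "forbidden" | "validation_error" | "system_error";

interface TransitionLog {
  instanceId: string;
  action: string;
  fromState: string;
  toState: string;
  userUuid: string;
  durationMs: number;
  outcome: Outcome;
}

// Assumption: errors carry an HTTP-like numeric `status` property.
function classifyOutcome(error: unknown): Outcome {
  const status =
    typeof error === "object" && error !== null
      ? (error as { status?: unknown }).status
      : undefined;
  if (status === 409) return "conflict";
  if (status === 403) return "forbidden";
  if (status === 422) return "validation_error";
  return "system_error";
}

// Runs the transition body, then emits exactly one entry whether it
// succeeded or threw; the original error is rethrown to the caller.
function observeTransition(
  meta: Omit<TransitionLog, "durationMs" | "outcome">,
  run: () => void,
  emit: (entry: TransitionLog) => void,
): void {
  const startMs = Date.now();
  let outcome: Outcome = "success";
  try {
    run();
  } catch (error) {
    outcome = classifyOutcome(error);
    throw error;
  } finally {
    emit({ ...meta, durationMs: Date.now() - startMs, outcome });
  }
}

const entries: TransitionLog[] = [];
const meta = { instanceId: "i-1", action: "APPROVE", fromState: "PENDING_APPROVAL", toState: "APPROVED", userUuid: "u-1" };
observeTransition(meta, () => {}, (entry) => { entries.push(entry); });
try {
  observeTransition(meta, () => { throw { status: 409 }; }, (entry) => { entries.push(entry); });
} catch {
  // The conflict still reaches the caller after being recorded.
}
```

Emitting from `finally` is what guarantees one entry per transition regardless of outcome, which is the property SC-009 checks.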
### Implementation — US1: BullMQ Dead-Letter Queue (FR-005, FR-006)
- [ ] T019 [US1] Register `workflow-events-failed` queue in `backend/src/modules/workflow-engine/workflow-engine.module.ts` — inject via `BullModule.registerQueue({ name: 'workflow-events-failed' })` (per plan.md Phase B7)
- [ ] T020 [US1] Add `@OnWorkerEvent('failed')` handler `onJobFailed(job, error)` in `backend/src/modules/workflow-engine/workflow-event.service.ts`: if `job.attemptsMade >= job.opts.attempts`, add job to `workflow-events-failed` queue; if `N8N_WEBHOOK_URL` env var set, POST JSON payload via `fetch`; else `logger.warn('N8N_WEBHOOK_URL not configured')` (per data-model.md §6, research.md Decision 5)
- [ ] T021 [US1] Verify worker default options in `workflow-engine.module.ts` have `concurrency: 5`, `attempts: 3`, `backoff: { type: 'exponential', delay: 500 }`, `removeOnFail: false` (per FR-005, plan.md Phase B7)
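The decision logic in T020 can be separated from BullMQ and tested on its own. In the sketch below, `FailedJob` mirrors only the fields the task mentions, and the dependency interface, the payload shape, and the fallback of one attempt when `attempts` is unset are assumptions for illustration.

```typescript
// Only the job fields referenced in T020.
interface FailedJob {
  id: string;
  name: string;
  attemptsMade: number;
  opts: { attempts?: number };
}

// Hypothetical seams for the queue, webhook call, and logger.
interface DeadLetterDeps {
  enqueueFailed: (job: FailedJob, reason: string) => void;
  notify: (url: string, payload: { jobId: string; name: string; reason: string }) => void;
  warn: (message: string) => void;
  webhookUrl: string | undefined;
}

function isExhausted(job: FailedJob): boolean {
  // Assumption: treat an unset `attempts` as a single attempt.
  const attempts = job.opts.attempts ?? 1;
  return job.attemptsMade >= attempts;
}

function onJobFailed(job: FailedJob, error: Error, deps: DeadLetterDeps): void {
  if (!isExhausted(job)) return; // retries remain, nothing to do yet
  deps.enqueueFailed(job, error.message);
  if (deps.webhookUrl) {
    deps.notify(deps.webhookUrl, { jobId: job.id, name: job.name, reason: error.message });
  } else {
    deps.warn("N8N_WEBHOOK_URL not configured");
  }
}

const deadLettered: string[] = [];
const warnings: string[] = [];
const deps: DeadLetterDeps = {
  enqueueFailed: (job) => { deadLettered.push(job.id); },
  notify: () => {},
  warn: (message) => { warnings.push(message); },
  webhookUrl: undefined,
};
onJobFailed({ id: "j-1", name: "notify", attemptsMade: 1, opts: { attempts: 3 } }, new Error("timeout"), deps);
onJobFailed({ id: "j-2", name: "notify", attemptsMade: 3, opts: { attempts: 3 } }, new Error("timeout"), deps);
```

Only `j-2` is dead-lettered, and because no webhook URL is configured in this example, a warning is recorded instead of a notification.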
### Implementation — US1: File Rollback on DB Failure (FR-019)
- [ ] T022 [US1] In `processTransition()` `catch` block, after `queryRunner.rollbackTransaction()`, call `storageService.moveToTemp(attachmentPublicIds)` when `attachmentPublicIds` is non-empty — log rollback with attachment IDs for audit (per plan.md Phase B8, FR-019)
- [ ] T023 [US1] Inject `StorageService` (or `FileStorageService`) into `WorkflowEngineService` constructor for rollback call — add to `workflow-engine.module.ts` imports if not already present
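The rollback ordering in T022 (database rollback first, then moving files back to temp) can be sketched with two small seams. The `Tx` and `FileStore` interfaces are hypothetical stand-ins for the query runner and `StorageService`, and the flow is synchronous here only to keep the example short.

```typescript
// Hypothetical seams for the query runner and the storage service.
interface Tx {
  commit(): void;
  rollback(): void;
}

interface FileStore {
  moveToTemp(publicIds: string[]): void;
}

// Commits the transition; on failure rolls back the DB first, then returns
// any already-promoted files to temp storage, and rethrows the error.
function commitOrRollback(tx: Tx, store: FileStore, attachmentPublicIds: string[]): void {
  try {
    tx.commit();
  } catch (error) {
    tx.rollback();
    if (attachmentPublicIds.length > 0) {
      store.moveToTemp(attachmentPublicIds);
    }
    throw error;
  }
}

const steps: string[] = [];
const failingTx: Tx = {
  commit: () => { throw new Error("commit failed"); },
  rollback: () => { steps.push("rollback"); },
};
const store: FileStore = {
  moveToTemp: (ids) => { steps.push("moveToTemp:" + ids.join(",")); },
};
let rethrown = false;
try {
  commitOrRollback(failingTx, store, ["file-a", "file-b"]);
} catch {
  rethrown = true;
}
```

Because the files return to temp rather than being deleted, the user can retry the transition within the 24-hour TTL described in FR-019 without uploading again.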
### Tests — US1
- [ ] T024 [P] [US1] Write unit test in `backend/src/modules/workflow-engine/workflow-engine.service.spec.ts` — concurrent optimistic lock: mock two simultaneous calls with same `clientVersionNo`, assert first resolves success and second throws `ConflictException` with code `WORKFLOW_VERSION_CONFLICT`
- [ ] T025 [P] [US1] Write unit test in `backend/src/modules/workflow-engine/guards/workflow-transition.guard.spec.ts` — DSL role CASL mapping: assert `Superadmin` maps to `system.manage_all` pass, `OrgAdmin` with matching org passes, unknown role falls through to assignedUserId check
- [ ] T026 [P] [US1] Write unit test for `onJobFailed` in `workflow-event.service.ts` — assert `workflow-events-failed` queue receives dead-letter job and `fetch` is called with correct payload when `N8N_WEBHOOK_URL` is set; assert `logger.warn` when unset
**Checkpoint**: `pnpm test --testPathPattern=workflow-engine --coverage` — T024/T025/T026 green. Concurrent lock test passes.
---
## Phase 4: User Story 2 — Integrated Banner & Workflow Lifecycle View (P1)
**Goal**: All four document detail pages (RFA, Transmittal, Circulation, Correspondence) display live `workflowState`, `availableActions`, and priority badge with no navigation required for approval.
**Independent Test**: Open each detail page while a workflow instance is in `PENDING_REVIEW` — banner shows correct state + action buttons; Workflow Engine tab shows step timeline with active step highlighted in indigo + pulse animation.
### Implementation — US2: Correspondence Backend Gap-Fill
- [ ] T027 [US2] Update `backend/src/modules/correspondence/correspondence.service.ts` `findOneByUuid()` — call `workflowEngineService.getInstanceByEntity('correspondence', correspondence.uuid)` and expose `workflowInstanceId`, `workflowState`, `availableActions` in the response (same pattern as Transmittal/Circulation per v1.8.7 memory)
- [ ] T028 [US2] Update `backend/src/modules/correspondence/correspondence.module.ts` — import `WorkflowEngineModule` if not already imported
### Implementation — US2: Frontend Module Gap-Fill (all 4 modules)
- [ ] T029 [P] [US2] Gap-fill `frontend/app/(admin)/admin/doc-control/correspondence/[uuid]/page.tsx` — wire live `workflowInstanceId`, `workflowState`, `availableActions`, `workflowPriority` into `<IntegratedBanner>` and `<WorkflowLifecycle>` components; update Correspondence type in `frontend/types/` to include workflow fields
- [ ] T030 [P] [US2] Gap-fill `frontend/app/(admin)/admin/doc-control/rfa/[uuid]/page.tsx` — connect missing `availableActions` and `workflowPriority` props to `<IntegratedBanner>`; ensure `<WorkflowLifecycle>` receives live `instanceId`
- [ ] T031 [P] [US2] Gap-fill `frontend/app/(admin)/admin/doc-control/transmittals/[uuid]/page.tsx` — add step-attachment upload zone props (`canUpload` flag computed from `currentState ∈ {PENDING_REVIEW, PENDING_APPROVAL}` AND user is assigned/org-admin/superadmin)
- [ ] T032 [P] [US2] Gap-fill `frontend/app/(admin)/admin/doc-control/circulation/[uuid]/page.tsx` — same step-attachment upload zone props as T031
- [ ] T033 [US2] Update `frontend/types/correspondence.ts` (or equivalent) — add `workflowInstanceId?: string`, `workflowState?: string`, `availableActions?: string[]`, `workflowPriority?: 'URGENT' | 'HIGH' | 'MEDIUM' | 'LOW'` (ADR-019: string UUIDs only, no parseInt)
### Tests — US2
- [ ] T034 [P] [US2] Verify `pnpm tsc --noEmit` in `frontend/` passes after T029–T033 — all four detail pages type-check correctly
**Checkpoint**: All four detail pages render `<IntegratedBanner>` with live data. Switch a document to `PENDING_REVIEW` — banner shows correct action buttons without page navigation.
---
## Phase 5: User Story 3 — Step-specific Attachments with Preview (P1)
**Goal**: Users in `PENDING_REVIEW` / `PENDING_APPROVAL` states can upload files via drag-and-drop, attached atomically to the workflow step. All users can preview PDFs/images inline without navigation.
**Independent Test**: Upload a PDF during `PENDING_REVIEW` → click Approve → history timeline shows the file chip → click chip → preview modal opens inline. Force-fail DB transaction → file appears back in temp, permanent storage unchanged.
### Implementation — US3: File Preview Modal (FR-020)
- [ ] T035 [P] [US3] Create `frontend/components/workflow/file-preview-modal.tsx` — shadcn/ui `Dialog` component; accepts `attachment: WorkflowAttachmentSummary | null` and `onClose: () => void` props; renders `<iframe src="/api/files/{publicId}/preview" />` for PDFs; `<img>` for image MIME types; download link fallback for other types (per plan.md Phase F1, quickstart.md "File Preview Modal")
- [ ] T036 [P] [US3] Add `WorkflowAttachmentSummary` interface to `frontend/types/workflow.ts` if not present: `{ publicId: string; originalFilename: string; mimeType: string; fileSize: number; createdAt: string }` (ADR-019: `publicId` only, no `id` or `uuid` alias)
### Implementation — US3: Step-Attachment Upload Zone (FR-014–FR-019)
- [ ] T037 [US3] Update `frontend/components/workflow/integrated-banner.tsx` — add conditional upload zone rendered only when `props.currentState ∈ {PENDING_REVIEW, PENDING_APPROVAL}` AND `props.canUpload === true`; upload calls existing Two-Phase upload endpoint; appends returned `publicId` to `pendingAttachmentIds` state; passes `pendingAttachmentIds` to action button handler (per plan.md Phase F2)
- [ ] T038 [US3] Update `frontend/components/workflow/workflow-lifecycle.tsx` — for each history item render `attachments[]` as clickable file chips; on chip click open `<FilePreviewModal>`; import and use `FilePreviewModal` from T035 (per plan.md Phase F2)
- [ ] T039 [US3] Update `frontend/hooks/use-workflow-action.ts` — accept `attachmentPublicIds: string[]` parameter; include in POST body to `/workflow-engine/instances/:id/transition`; include `versionNo` from current instance state; on HTTP 409 show toast "เอกสารถูกอนุมัติโดยผู้อื่นแล้ว กรุณารีเฟรช" ("The document has already been approved by someone else. Please refresh."); on 503 show toast "ระบบยุ่งชั่วคราว กรุณาลองใหม่" ("The system is temporarily busy. Please try again.") (per quickstart.md "Optimistic Lock — Client Side")
- [ ] T040 [US3] Update `backend/src/modules/workflow-engine/workflow-engine.controller.ts` — ensure `POST /instances/:id/transition` accepts `Idempotency-Key` header and passes `userUuid` (from JWT) and `clientVersionNo` to `processTransition()` (per contracts/workflow-transition.yaml)
- [ ] T041 [US3] Verify `WorkflowHistoryItemDto` exposes `attachments: AttachmentSummaryDto[]` in the history list endpoint response — update `getHistory()` method in `workflow-engine.service.ts` to eagerly load `attachments` relation per `workflow_history_id` (per data-model.md §3, FR-014)
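The `Idempotency-Key` handling that T040 passes through (FR-018) amounts to caching the first result per key. The sketch below is a simplified in-memory illustration: the class name is hypothetical, the real endpoint is asynchronous, and a production version would presumably use a shared store with expiry rather than a process-local `Map`.

```typescript
// Caches the first result for each idempotency key and replays it for
// duplicates, so the handler runs at most once per key.
class IdempotencyCache<T> {
  private readonly results = new Map<string, T>();

  execute(key: string, handler: () => T): T {
    if (this.results.has(key)) {
      return this.results.get(key) as T;
    }
    const result = handler();
    this.results.set(key, result);
    return result;
  }
}

const idempotency = new IdempotencyCache<{ versionNo: number }>();
let executions = 0;
const transition = () => {
  executions += 1;
  return { versionNo: executions + 1 };
};

const first = idempotency.execute("key-123", transition);
const replay = idempotency.execute("key-123", transition);
const other = idempotency.execute("key-456", transition);
```

The replayed call returns the cached result without running the transition again, so a client that retries after a network timeout cannot trigger a second state change.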
### Tests — US3
- [ ] T042 [P] [US3] Write unit test in `backend/src/modules/workflow-engine/workflow-engine.service.spec.ts` — file rollback: mock `queryRunner.commitTransaction()` to throw; assert `storageService.moveToTemp()` is called with the correct `attachmentPublicIds` (per plan.md Test Plan)
- [ ] T043 [P] [US3] Write Vitest component test in `frontend/components/workflow/__tests__/file-preview-modal.test.tsx` — assert PDF renders `<iframe>`, image MIME type renders `<img>`, unsupported type renders download link, `onClose` called on dialog dismiss
**Checkpoint**: Upload a PDF on a document in `PENDING_REVIEW` → approve → check `workflow_histories` record has matching `workflow_history_id` in `attachments` table. Click the file chip → modal opens inline.
---
## Phase 6: User Story 4 — DSL Versioning & Instance Binding (P2)
**Goal**: Super Admins can activate new DSL versions; in-progress workflow instances continue on their bound definition version; Redis cache invalidates within 1 second of activation (SC-005).
**Independent Test**: Activate DSL v2 while v1 has an in-progress instance → existing instance still uses v1 DSL transitions; new instance created after activation uses v2.
### Implementation — US4: DSL Redis Cache Invalidation (FR-007, SC-005)
- [ ] T044 [US4] In `workflow-engine.service.ts` `createDefinition()` — after `workflowDefRepo.save()`, call `cacheManager.set('wf:def:${code}:${version}', saved, 3600000)` (1h TTL in ms) (per data-model.md §5, research.md Decision 4)
- [ ] T045 [US4] In `workflow-engine.service.ts` `update()` — before save, call `cacheManager.del('wf:def:${code}:${oldVersion}')` when DSL changes; when `is_active` toggles to `true`, call `redis.del('wf:def:${code}:active')` then set updated pointer; when `is_active` toggles to `false`, call `redis.del('wf:def:${code}:active')` (per data-model.md §5 "Invalidation triggers")
- [ ] T046 [US4] Add read-through cache in `getDefinitionById()`: call `cacheManager.get('wf:def:${id}')` first; fall back to `workflowDefRepo.findOne()` on miss; store result in cache before returning (per research.md Decision 4)
- [ ] T047 [US4] Verify `createInstance()` always uses latest active definition from DB (not cache) to prevent stale binding — confirm `findOne({ where: { workflow_code, is_active: true }, order: { version: 'DESC' } })` pattern is authoritative (per FR-010)
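The read-through and invalidation behaviour in T044 to T046 can be sketched with an in-memory cache. This is an assumption-laden illustration: the class is hypothetical, it uses a `Map` instead of Redis, and the key helper follows the `wf:def:{workflow_code}:{version}` format from FR-007.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAtMs: number;
}

class DefinitionCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();

  constructor(
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  static key(workflowCode: string, version: number): string {
    return `wf:def:${workflowCode}:${version}`;
  }

  // Read-through: serve a fresh hit, otherwise load and store the value.
  getOrLoad(key: string, load: () => T): T {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAtMs > this.now()) {
      return hit.value;
    }
    const value = load();
    this.entries.set(key, { value, expiresAtMs: this.now() + this.ttlMs });
    return value;
  }

  // Called when a definition is saved or activated.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

let clockMs = 0;
let loads = 0;
const definitions = new DefinitionCache<string>(3_600_000, () => clockMs);
const rfaKey = DefinitionCache.key("RFA", 1);
const loadDefinition = () => { loads += 1; return "compiled-dsl-v1"; };

definitions.getOrLoad(rfaKey, loadDefinition); // miss: loads from the source
definitions.getOrLoad(rfaKey, loadDefinition); // hit: no load
definitions.invalidate(rfaKey);
definitions.getOrLoad(rfaKey, loadDefinition); // miss again after invalidation
clockMs = 3_600_001;
definitions.getOrLoad(rfaKey, loadDefinition); // miss: entry expired after 1 hour
```

Injecting the clock keeps the TTL behaviour testable without waiting an hour.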
### Tests — US4
- [ ] T048 [P] [US4] Write unit test in `workflow-engine.service.spec.ts` — DSL activate cache invalidation: mock `cacheManager.del`, call `update({ is_active: true })`, assert `cacheManager.del` called with correct key within the same tick (per plan.md Test Plan)
**Checkpoint**: Activate DSL v2 via `PATCH /workflow-engine/definitions/:id` → Redis key `wf:def:{code}:active` updated immediately. In-progress v1 instance transitions still resolve against v1 compiled DSL.
---
## Phase 7: User Story 5 — Workflow Definition Authoring (Super Admin) (P2)
**Goal**: Super Admins can list, create, edit (JSON editor with inline validation), activate, and deactivate DSL definitions from an Admin UI page without touching the API directly.
**Independent Test**: Log in as Super Admin → navigate to `/admin/workflows/definitions` → create a new definition with an invalid DSL → see inline validation error before saving → fix → save → new definition appears in list.
### Implementation — US5: Backend `/validate` Endpoint (FR-025)
- [ ] T049 [US5] Add `POST /workflow-engine/definitions/validate` endpoint to `backend/src/modules/workflow-engine/workflow-engine.controller.ts` — accepts `{ dsl: object }`, calls `dslService.compile(dto.dsl)` in try/catch, returns `{ valid: true }` or `{ valid: false, errors: [{ path, message }] }` (per contracts/workflow-definitions.yaml, FR-025)
### Implementation — US5: TanStack Query Hooks
- [ ] T050 [P] [US5] Create `frontend/hooks/use-workflow-definitions.ts`: `useWorkflowDefinitions()` (GET list), `useWorkflowDefinition(id)` (GET single), `useCreateDefinition()` (POST mutation), `useUpdateDefinition()` (PATCH mutation), `useValidateDsl()` (POST validate mutation) — all using TanStack Query v5 patterns (per quickstart.md)
### Implementation — US5: Admin DSL List Page
- [ ] T051 [US5] Create `frontend/app/(admin)/admin/workflows/definitions/page.tsx` — Server Component shell + Client Component table; columns: `workflow_code`, `version`, `is_active` badge, created date, Actions (Edit link, Activate/Deactivate toggle button); uses `useWorkflowDefinitions()` hook; Activate/Deactivate calls `useUpdateDefinition()` mutation with `{ is_active: true/false }`; requires `system.manage_all` permission (CASL guard on page) (per plan.md Phase F4, FR-024)
### Implementation — US5: Admin DSL Editor Page
- [ ] T052 [US5] Create `frontend/app/(admin)/admin/workflows/definitions/[id]/page.tsx` — loads definition via `useWorkflowDefinition(id)`; renders Monaco Editor via `dynamic(() => import('@monaco-editor/react'), { ssr: false })`; `onChange` handler debounced 800ms calls `useValidateDsl()` mutation; displays validation errors as inline error list below editor; Save button disabled when `validationErrors.length > 0` (FR-025); on Save calls `useUpdateDefinition()` and shows success toast; i18n keys for all UI text (per research.md Decision 6, quickstart.md "Admin DSL Editor")
- [ ] T053 [US5] Create `frontend/app/(admin)/admin/workflows/definitions/new/page.tsx` — same editor as T052 but calls `useCreateDefinition()` mutation; `workflow_code` input field with validation; redirect to list page on success
### Tests — US5
- [ ] T054 [P] [US5] Write Vitest test for `frontend/app/(admin)/admin/workflows/definitions/[id]/page.tsx` — assert Save button is disabled when validation errors present; assert Save button enabled when `validationErrors` is empty; assert `useValidateDsl` is called on editor change (per plan.md Test Plan)
**Checkpoint**: Navigate to `/admin/workflows/definitions` — list renders all definitions. Click Edit → Monaco editor loads definition DSL. Paste invalid DSL → Save button disables and errors display inline. Fix DSL → Save enabled → save succeeds.
---
## Phase 8: Polish & Cross-Cutting Concerns
**Purpose**: i18n coverage, SC-009 verification, and spec compliance checks across all user stories.
- [ ] T055 [P] Audit all new UI text in `frontend/components/workflow/` and `frontend/app/(admin)/admin/workflows/` — replace any hardcoded Thai/English strings with i18n keys; add missing keys to `frontend/public/locales/th/` and `frontend/public/locales/en/` translation files (FR-021)
- [ ] T056 [P] Run full backend test suite: `pnpm test --coverage` in `backend/` — confirm no regressions; coverage ≥ 70% overall, ≥ 80% on `workflow-engine.service.ts` business logic (per plan.md Test Plan)
- [ ] T057 [P] Run full frontend typecheck: `pnpm tsc --noEmit` in `frontend/` — zero errors across all modified files
- [ ] T058 Verify SC-009 observability coverage: trigger one transition of each outcome type (success, conflict, forbidden, validation_error) and confirm structured log entries appear in the NestJS log output with all required fields (`instanceId`, `action`, `fromState`, `toState`, `userUuid`, `durationMs`, `outcome`, `workflowCode`)
- [ ] T059 Update `specs/003-unified-workflow-engine/spec.md` Status field from `Draft` to `Implemented` after all phases complete
---
## Dependencies & Execution Order
### Phase Dependencies
- **Phase 1 (Setup)**: No dependencies — start immediately
- **Phase 2 (Foundational)**: Depends on Phase 1 DB columns applied; **BLOCKS Phases 3–7**
- **Phase 3 (US1)**: Depends on Phase 2 — can start as soon as entities compile
- **Phase 4 (US2)**: Depends on Phase 2 — independent of Phase 3 (different files)
- **Phase 5 (US3)**: Depends on Phase 3 (uses updated `processTransition` + `use-workflow-action`) and Phase 4 (upload zone sits inside `IntegratedBanner`)
- **Phase 6 (US4)**: Depends on Phase 2 — independent of US1/US2/US3
- **Phase 7 (US5)**: Depends on Phase 6 (T049 validate endpoint, T044 cache) — `/validate` endpoint needed for editor inline feedback
- **Phase 8 (Polish)**: Depends on all phases complete
### User Story Dependencies
- **US1 (P1)**: Starts after Phase 2 — no US dependencies
- **US2 (P1)**: Starts after Phase 2 — no US dependencies (parallel with US1)
- **US3 (P1)**: Starts after US1 (T039 needs updated hook signature) and US2 (upload zone in banner)
- **US4 (P2)**: Starts after Phase 2 — independent (parallel with US1/US2)
- **US5 (P2)**: Starts after US4 (T049 validate endpoint depends on DSL cache from T044)
### Within Each Phase
- Schema before entities → entities before services → services before controllers → backend before frontend
- [P] tasks within a phase can run in parallel (different files)
---
## Parallel Execution Examples
### Phase 2 Parallel (T005–T007 run together)
```
T005: workflow-instance.entity.ts ← add versionNo
T006: workflow-history.entity.ts ← add actionByUserUuid
T007: workflow-history-item.dto.ts ← add actorUuid
```
### Phase 3 Parallel Groups
```
Group A (processTransition core): T010 → T011 → T012 → T013 → T014 (sequential)
Group B (guard): T015 → T016 (sequential, different file from Group A — parallel with Group A)
Group C (observability): T017 → T018 (different file — parallel with Groups A+B)
Group D (BullMQ): T019 → T020 → T021 (different service file — parallel with Groups A+B+C)
Tests: T024, T025, T026 (parallel with each other after Groups A+B+D complete)
```
### Phase 4 + Phase 6 Parallel (different feature areas)
```
Phase 4 (US2): T027–T034 - Correspondence backend + frontend gap-fill
Phase 6 (US4): T044–T048 - DSL cache invalidation
(Run simultaneously — no shared files)
```
---
## Implementation Strategy
### MVP Scope (US1 + US2 + US3 — all P1)
```
Phase 1 → Phase 2 → Phase 3 (US1) → Phase 4 (US2) → Phase 5 (US3) → Phase 8 Polish
```
Delivers: Race-condition-free transitions, live banner on all 4 modules, step-specific attachments with preview.
### Full Delivery (adds P2 stories)
```
MVP + Phase 6 (US4) + Phase 7 (US5)
```
Adds: Redis cache invalidation, Admin DSL editor.
### Suggested First Commit
After T001–T009 (schema + entities compile) → commit:
```
chore(schema): delta-09 version_no, delta-10 action_by_user_uuid (ADR-009)
feat(workflow-engine): add versionNo + actionByUserUuid entities + metrics registration (FR-002/003)
```
---
## Summary
| Phase | User Story | Tasks | Parallel Opportunities |
|-------|-----------|-------|----------------------|
| 1 - Setup | Schema | T001–T004 | T001+T002 parallel |
| 2 - Foundational | - | T005–T009 | T005+T006+T007 parallel |
| 3 - P1 US1 | Transition Integrity | T010–T026 | Guard + observability + BullMQ parallel; tests parallel |
| 4 - P1 US2 | Banner Gap-Fill | T027–T034 | T029+T030+T031+T032+T033 parallel |
| 5 - P1 US3 | Step Attachments | T035–T043 | T035+T036 parallel; tests parallel |
| 6 - P2 US4 | DSL Versioning | T044–T048 | T044+T046+T047 parallel |
| 7 - P2 US5 | Admin DSL Editor | T049–T054 | T050+T054 parallel |
| 8 - Polish | Cross-cutting | T055–T059 | T055+T056+T057 parallel |
| **Total** | | **59 tasks** | **~22 parallel opportunities** |
**MVP**: T001–T043 (43 tasks, Phases 1–5, all P1 stories)
**Full**: T001–T059 (59 tasks, all phases)
@@ -0,0 +1,17 @@
-- ============================================================
-- Delta 09: ADR-001 v1.1 — Optimistic Lock for Workflow Transitions
-- Add version_no to workflow_instances for Optimistic Concurrency Control
-- ============================================================
-- Feature: 003-unified-workflow-engine (FR-002)
-- Date: 2026-05-03
-- Caution: existing rows automatically receive DEFAULT 1; no data loss
-- Rollback: ALTER TABLE workflow_instances DROP INDEX idx_wf_inst_version;
-- ALTER TABLE workflow_instances DROP COLUMN version_no;
ALTER TABLE workflow_instances
ADD COLUMN version_no INT NOT NULL DEFAULT 1
COMMENT 'Optimistic lock counter — incremented on every successful transition (ADR-001 v1.1 FR-002). Client sends current value; server rejects with 409 if mismatch.';
-- Index to support the CAS check: WHERE id = ? AND version_no = ?
CREATE INDEX idx_wf_inst_version
ON workflow_instances (id, version_no);
@@ -0,0 +1,14 @@
-- ============================================================
-- Delta 10: ADR-001 v1.1 / ADR-019 UUID Compliance
-- Add action_by_user_uuid to workflow_histories
-- to expose user identity through the API without revealing the INT PK (ADR-019)
-- ============================================================
-- Feature: 003-unified-workflow-engine (FR-003)
-- Date: 2026-05-03
-- Caution: NULL is acceptable for historical records created before this delta
--   NULL in this context = "System Action" or "Pre-migration record"
-- Rollback: ALTER TABLE workflow_histories DROP COLUMN action_by_user_uuid;
ALTER TABLE workflow_histories
ADD COLUMN action_by_user_uuid VARCHAR(36) NULL
COMMENT 'UUID of the acting user; used in API responses instead of the INT FK (ADR-019). NULL = System Action or Pre-migration record';
@@ -2,6 +2,7 @@
**Status:** Accepted
**Date:** 2026-02-24
**Last Amended:** 2026-05-02
**Decision Makers:** Development Team, System Architect
**Related Documents:**
@@ -42,6 +43,34 @@ LCBP3-DMS ต้องจัดการเอกสารหลายประ
---
## Clarifications
### Session 2026-05-02 (Round 1 — ADR-001-add.md merge)
- Q: Event handling: Outbox Pattern or BullMQ (ADR-008)? → A: **BullMQ only**. WorkflowEngine enqueues BullMQ jobs directly, with no outbox table; consistent with ADR-008
- Q: Concurrency control: Optimistic Lock vs Redis Redlock vs separate concerns? → A: **Separate concerns**: `version_no` optimistic lock for state transitions; Redis Redlock only for Document Numbering (ADR-002)
- Q: Context schema: where is it validated, and at what scope? → A: **Two-phase validation** (save-time + transition-time); schema scope is **per `workflow_definition` version**
- Q: Condition Engine library? → A: **`json-logic-js` in-process** in `WorkflowDslService`; fall back to a custom parser if production issues arise
- Q: Auto-action worker: extend the existing one or use a dedicated queue? → A: **Dedicated `workflow-events` BullMQ queue**, separate from `notification-queue`
### Session 2026-05-02 (Round 2 — ADR-001 full review)
- Q: DDL gap: add `version_no` + `context_schema` to the DDL? → A: **Yes**: `version_no INT NOT NULL DEFAULT 1` in `workflow_instances`; `context_schema JSON NULL` in `workflow_definitions`
- Q: ConflictException retry strategy? → A: **409 surfaced to the frontend** via `BusinessException` (ADR-007); the frontend shows the toast "กรุณาลองใหม่" ("Please try again"); no auto-retry
- Q: Redis cache TTL/invalidation strategy? → A: **TTL 1h + event invalidation** when an admin saves/activates a DSL; key `wf:def:{workflow_code}:{version}`
- Q: WorkflowEventsWorker concurrency/retry config? → A: **concurrency 5, retry 3 + exponential backoff + dead-letter queue**
- Q: RBAC for DSL authoring? → A: **Super Admin only** (`system.manage_all`): create/update/activate/deactivate workflow definitions
### Session 2026-05-02 (Round 3 — ADR-019 compliance + ops)
- Q: `action_by_user_id INT NULL` in `workflow_histories`: ADR-019 compliance? → A: **Keep the INT FK + `@Exclude()`** on the Entity; add `action_by_user_uuid VARCHAR(36) NULL` for API responses
- Q: `validateContext()` fails at transition-time: which HTTP status? → A: **422 Unprocessable Entity** via `ValidationException` (ADR-007 Validation tier) with field-level errors
- Q: Dead-letter queue `workflow-events-failed`: ops procedure? → A: **n8n webhook alert + Bull Board UI** for manual requeue
- Q: n8n webhook URL: where is it stored? → A: **`N8N_WEBHOOK_URL` environment variable** in `docker-compose.yml`; read through `ConfigService`
- Q: `context_schema.required`: is it actually enforced? → A: **Enforce strictly**: a missing required field throws 422 `ValidationException`; this does not block the transition (the caller fixes the context and retries)
---
## Decision Drivers
- **DRY Principle:** Don't Repeat Yourself - reduce duplicated code
@@ -206,8 +235,9 @@ CREATE TABLE workflow_definitions (
workflow_code VARCHAR(50) NOT NULL,
version INT NOT NULL DEFAULT 1,
description TEXT NULL,
dsl JSON NOT NULL, -- Raw DSL from user
compiled JSON NOT NULL, -- Validated and optimized for Runtime
dsl JSON NOT NULL, -- Raw DSL from user
compiled JSON NOT NULL, -- Validated and optimized for Runtime
context_schema JSON NULL, -- JSON Schema for context validation (two-phase)
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
@@ -221,6 +251,7 @@ CREATE TABLE workflow_instances (
entity_type VARCHAR(50) NOT NULL, -- e.g. "correspondence", "rfa"
entity_id VARCHAR(50) NOT NULL,
current_state VARCHAR(50) NOT NULL,
version_no INT NOT NULL DEFAULT 1, -- Optimistic lock (@VersionColumn); prevents race conditions
status ENUM('ACTIVE', 'COMPLETED', 'CANCELLED', 'TERMINATED') DEFAULT 'ACTIVE',
context JSON NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
@@ -235,7 +266,8 @@ CREATE TABLE workflow_histories (
from_state VARCHAR(50) NOT NULL,
to_state VARCHAR(50) NOT NULL,
action VARCHAR(50) NOT NULL,
action_by_user_id INT NULL,
action_by_user_id INT NULL, -- Internal FK (@Exclude() in Entity); must not be exposed in the API
action_by_user_uuid VARCHAR(36) NULL, -- UUID for API responses (ADR-019)
comment TEXT NULL,
metadata JSON NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
@@ -250,6 +282,14 @@ CREATE TABLE workflow_histories (
"workflow": "CORRESPONDENCE_ROUTING",
"version": 1,
"description": "Standard correspondence routing",
"context_schema": {
"type": "object",
"properties": {
"requiresLegal": { "type": "number" },
"hasRecipient": { "type": "boolean" }
},
"required": []
},
"states": [
{
"name": "DRAFT",
@@ -261,7 +301,10 @@ CREATE TABLE workflow_histories (
"role": ["Admin"],
"user": "123"
},
"condition": "context.requiresLegal > 0",
"condition": {
"type": "json-logic",
"rule": { ">": [{ "var": "requiresLegal" }, 0] }
},
"events": [
{
"type": "notify",
@@ -299,6 +342,8 @@ CREATE TABLE workflow_histories (
}
```
> **⚠️ Note:** `condition` must use the JSON Logic format (`{ "type": "json-logic", "rule": {...} }`) only. JS string expressions (`"context.x === true"`) are forbidden because they are a security risk (code injection).
### NestJS Module Structure
```typescript
@@ -325,6 +370,11 @@ export class WorkflowEngineService {
order: { version: 'DESC' },
});
// Validate initial context against context_schema (save-time phase 1)
if (definition.compiled.contextSchema) {
this.dslService.validateContext(initialContext, definition.compiled.contextSchema);
}
// Initial state directly from compiled DSL
const initialState = definition.compiled.initialState;
@@ -333,6 +383,7 @@ export class WorkflowEngineService {
entityType,
entityId,
currentState: initialState,
versionNo: 1, // TypeORM @VersionColumn — optimistic lock
status: WorkflowStatus.ACTIVE,
context: initialContext,
});
@@ -345,19 +396,46 @@ export class WorkflowEngineService {
comment?: string,
payload: Record<string, unknown> = {}
) {
// Evaluation via WorkflowDslService
// Validate context values against schema (transition-time phase 2)
if (definition.compiled.contextSchema) {
this.dslService.validateContext(instance.context, definition.compiled.contextSchema);
}
// Evaluation via WorkflowDslService (uses json-logic-js in-process)
const evaluation = this.dslService.evaluate(compiled, instance.currentState, action, context);
// Update state to target State
instance.currentState = evaluation.nextState;
// Optimistic lock: update state only if current_state + version_no match
// ❌ Do not use Redis Redlock in workflow transitions (Redlock is only for Document Numbering, ADR-002)
const updated = await this.instanceRepo
.createQueryBuilder()
.update(WorkflowInstance)
.set({
currentState: evaluation.nextState,
versionNo: () => 'version_no + 1',
})
.where('id = :id AND current_state = :state AND version_no = :ver', {
id: instance.id,
state: instance.currentState,
ver: instance.versionNo,
})
.execute();
if (updated.affected === 0) {
throw new ConflictException('Concurrent transition detected — please retry');
}
if (compiled.states[evaluation.nextState].terminal) {
instance.status = WorkflowStatus.COMPLETED;
}
// Process background events asynchronously
// Dispatch events async via dedicated BullMQ queue 'workflow-events' (ADR-008)
// ❌ Never dispatch events synchronously in the request thread
if (evaluation.events && evaluation.events.length > 0) {
this.eventService.dispatchEvents(instance.id, evaluation.events, context);
await this.workflowEventsQueue.add('dispatch', {
instanceId: instance.id,
events: evaluation.events,
context,
});
}
}
}
@@ -365,6 +443,80 @@ export class WorkflowEngineService {
---
## 🏭 Production Architecture
### Runtime Flow
```
[ API / Service Layer ]
[ WorkflowEngineService ]
- validate context (two-phase: save-time + transition-time)
- evaluate condition (json-logic-js in-process, WorkflowDslService)
- optimistic lock: UPDATE WHERE current_state = ? AND version_no = ?
- write workflow_histories
- enqueue BullMQ job → queue: 'workflow-events'
[ DB (workflow_instances + workflow_histories) ]
↓ (async, dedicated queue)
[ WorkflowEventsWorker (BullMQ: 'workflow-events') ]
┌───────────────┐
│ n8n │ (webhook / notification dispatch)
└───────────────┘
```
### Production Rules (Non-Negotiable)
| # | Rule | Detail |
|---|------|--------|
| 1 | **Source of Truth** | Workflow state = DB only; never store state in memory/cache |
| 2 | **Deterministic Execution** | Every transition MUST be declared in the DSL; no dynamic transitions |
| 3 | **No Inline Code Execution** | Conditions MUST use the JSON Logic format; no JS string eval |
| 4 | **Async Side Effects** | Every event MUST go through the BullMQ `workflow-events` queue; no sync dispatch |
| 5 | **Idempotency** | Transitions MUST be safe to retry; the optimistic lock prevents double-apply |
| 6 | **Instance Isolation** | In-progress instances keep their original `workflow_definition` version; no rebinding |
### Concurrency Control (Separate Concerns)
| Concern | Mechanism | Scope |
|---------|-----------|-------|
| Workflow state transition | `version_no` optimistic lock (TypeORM `@VersionColumn`) | `workflow_instances` table |
| Document Numbering | Redis Redlock (ADR-002) | Number generation only |
> ❌ **Do not use Redis Redlock in the workflow transition layer.** Redlock is only for Document Numbering.
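
The `version_no` compare-and-set from the table above can be sketched without a database. This is a minimal illustration only: `InstanceRow`, `compareAndSetState`, and the in-memory `Map` are hypothetical stand-ins for the `workflow_instances` table and the QueryBuilder `UPDATE ... WHERE`, not the engine's actual code.

```typescript
// In-memory stand-in for the workflow_instances table (illustration only).
interface InstanceRow {
  id: string;
  currentState: string;
  versionNo: number;
}

// Hypothetical helper that mirrors the SQL compare-and-set:
//   UPDATE workflow_instances
//   SET current_state = :next, version_no = version_no + 1
//   WHERE id = :id AND current_state = :state AND version_no = :ver
// Returns the number of affected rows (0 or 1).
function compareAndSetState(
  table: Map<string, InstanceRow>,
  snapshot: InstanceRow,
  nextState: string,
): number {
  const row = table.get(snapshot.id);
  if (
    row === undefined ||
    row.currentState !== snapshot.currentState ||
    row.versionNo !== snapshot.versionNo
  ) {
    return 0; // another caller already moved the instance
  }
  row.currentState = nextState;
  row.versionNo += 1;
  return 1;
}

// Two callers read the same snapshot, then both try to transition.
const instances = new Map<string, InstanceRow>();
instances.set('wf-1', { id: 'wf-1', currentState: 'DRAFT', versionNo: 1 });

const snapshotA: InstanceRow = { ...instances.get('wf-1')! };
const snapshotB: InstanceRow = { ...instances.get('wf-1')! };

const affectedA = compareAndSetState(instances, snapshotA, 'IN_REVIEW');
const affectedB = compareAndSetState(instances, snapshotB, 'REJECTED'); // stale snapshot
```

Only the first caller's update applies. The second sees zero affected rows, which is the case the service maps to a `ConflictException` (HTTP 409) instead of retrying automatically.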
### Condition Engine
- **Library:** `json-logic-js` (npm), evaluated in-process in `WorkflowDslService`
- **Fallback:** migrate to a custom parser if performance/complexity problems appear in production
- **Forbidden:** arbitrary JS string evaluation (`eval`, `new Function`, string conditions)
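
To illustrate why a declarative rule is safer than a string expression, here is a toy evaluator for a very small JSON Logic subset (`var`, `>`, `>=`). It is a sketch under stated assumptions, not the `json-logic-js` implementation: `evaluateRule` and the `Json` type are hypothetical names, and number-only comparison is a simplification.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Toy evaluator for a tiny JSON Logic subset. Anything outside the
// whitelist throws, so no caller-supplied code is ever executed.
function evaluateRule(rule: Json, data: Record<string, Json>): Json {
  if (rule === null || typeof rule !== 'object' || Array.isArray(rule)) {
    return rule; // primitives (and arrays, in this sketch) are literals
  }
  const keys = Object.keys(rule);
  if (keys.length !== 1) {
    throw new Error('A rule object must have exactly one operator');
  }
  const op = keys[0];
  const raw = rule[op];
  const args: Json[] = Array.isArray(raw) ? raw : [raw];

  if (op === 'var') {
    const name = args[0];
    if (typeof name !== 'string') throw new Error('var expects a string key');
    // Missing variables resolve to null instead of throwing.
    return Object.prototype.hasOwnProperty.call(data, name) ? data[name] : null;
  }

  const [left, right] = args.map((arg) => evaluateRule(arg, data));
  const bothNumbers = typeof left === 'number' && typeof right === 'number';
  switch (op) {
    case '>':
      return bothNumbers ? (left as number) > (right as number) : false;
    case '>=':
      return bothNumbers ? (left as number) >= (right as number) : false;
    default:
      throw new Error(`Unsupported operator: ${op}`);
  }
}

// The rule from the DSL example: requiresLegal > 0
const legalRule: Json = { '>': [{ var: 'requiresLegal' }, 0] };
const needsLegal = evaluateRule(legalRule, { requiresLegal: 2 });
const skipsLegal = evaluateRule(legalRule, { requiresLegal: 0 });
```

Because the rule is plain data walked by a fixed set of operators, an attacker-controlled DSL can at worst produce an "unsupported operator" error, which is the property the Forbidden rule above is protecting.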
### Context Schema Validation
- `context_schema` is stored per `workflow_definition` version (supports schema evolution)
- **Phase 1 (Save-time):** validate the schema structure when an admin saves the DSL
- **Phase 2 (Transition-time):** validate context values against the schema before evaluating conditions
- **Required field enforcement:** the `required` array in the schema is **enforced strictly**: a missing required field throws `ValidationException` (ADR-007) → HTTP 422 + field-level errors
- **Failure response:** `{ field: "<context_field>", message: "required field missing" }`. This does not block the transition; the caller must fix the context and retry
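
The transition-time check can be sketched as a function that collects field-level errors in the shape shown above. This is an assumption-laden illustration: `collectContextErrors`, `ContextSchema`, and `FieldError` are hypothetical names, only primitive `type` checks are covered, and the real `validateContext()` throws `ValidationException` rather than returning a list.

```typescript
type PrimitiveType = 'number' | 'boolean' | 'string';

// Simplified view of context_schema (assumed shape; real JSON Schema is richer).
interface ContextSchema {
  type: 'object';
  properties: Record<string, { type: PrimitiveType }>;
  required: string[];
}

interface FieldError {
  field: string;
  message: string;
}

// Hypothetical helper: returns every violation instead of throwing on the first.
function collectContextErrors(
  context: Record<string, unknown>,
  schema: ContextSchema,
): FieldError[] {
  const errors: FieldError[] = [];
  for (const field of schema.required) {
    if (context[field] === undefined || context[field] === null) {
      errors.push({ field, message: 'required field missing' });
    }
  }
  for (const field of Object.keys(schema.properties)) {
    const value = context[field];
    const expected = schema.properties[field].type;
    if (value !== undefined && value !== null && typeof value !== expected) {
      errors.push({ field, message: `expected ${expected}` });
    }
  }
  return errors;
}

// Example schema; the non-empty `required` list is an assumption for illustration.
const exampleSchema: ContextSchema = {
  type: 'object',
  properties: {
    requiresLegal: { type: 'number' },
    hasRecipient: { type: 'boolean' },
  },
  required: ['requiresLegal'],
};

const contextErrors = collectContextErrors({ hasRecipient: 'yes' }, exampleSchema);
// A non-empty list is what the service would translate into HTTP 422.
const httpStatus = contextErrors.length > 0 ? 422 : 200;
```

Returning all errors at once lets the caller fix the whole context in a single retry rather than discovering missing fields one request at a time.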
### Event Queue
- Queue name: `workflow-events` (dedicated BullMQ queue, separate from `notification-queue`)
- Worker: `WorkflowEventsWorker`, configured with:
- **concurrency:** 5
- **attempts:** 3 (exponential backoff)
- **dead-letter queue:** `workflow-events-failed` once attempts are exhausted
- **n8n webhook URL:** `N8N_WEBHOOK_URL` env var (in `docker-compose.yml`), read through `ConfigService`; never hardcode it
- **Dead-letter ops:**
- When a job lands in `workflow-events-failed` → trigger the n8n webhook to alert the ops team
- Manual requeue through the **Bull Board UI** (admin panel)
- ❌ No auto-requeue, to prevent a retry loop if the failure is a permanent bug
- ❌ No Outbox Pattern (polling a DB table); BullMQ already provides retry/dead-letter/persistence
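
The settings above can be collected into one plain configuration object. The sketch below deliberately avoids importing BullMQ so it stays self-contained; the `attempts` and `backoff` fields follow the shape of BullMQ job options, while the 1000 ms base delay, the doubling schedule in `retryDelays`, and the `shouldMoveToDeadLetter` helper are assumptions made for illustration.

```typescript
// Plain-object sketch of the queue settings listed above.
const workflowEventsConfig = {
  queueName: 'workflow-events',
  deadLetterQueueName: 'workflow-events-failed',
  concurrency: 5,
  jobOptions: {
    attempts: 3,
    // Base delay is an assumed value; the ADR does not specify one.
    backoff: { type: 'exponential', delay: 1000 },
  },
};

// Assumed doubling schedule: the wait before retry n is baseDelay * 2^(n - 1).
// With 3 attempts there are 2 retries, hence 2 delays.
function retryDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

// Hypothetical helper: a job is handed to the dead-letter queue only after
// its final attempt has failed; it is never requeued automatically.
function shouldMoveToDeadLetter(attemptsMade: number, maxAttempts: number): boolean {
  return attemptsMade >= maxAttempts;
}

const delaysMs = retryDelays(
  workflowEventsConfig.jobOptions.attempts,
  workflowEventsConfig.jobOptions.backoff.delay,
);
```

Keeping the dead-letter decision in one small predicate makes the "no auto-requeue" rule easy to test in isolation from the worker.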
---
## Consequences
### Positive
@@ -388,7 +540,8 @@ export class WorkflowEngineService {
- **Complexity:** build a UI Builder for Workflow Design in the future
- **Learning Curve:** write clear documentation and examples
- **Performance:** use Redis Cache for Workflow Definitions
- **Performance:** Redis Cache for `workflow_definitions`; key: `wf:def:{workflow_code}:{version}`, TTL: 1h, invalidated immediately when an admin saves/activates a new DSL
- **Concurrency Conflict:** `ConflictException` is sent as a `BusinessException` (ADR-007) → 409 to the frontend; the user retries manually; no auto-retry
- **Debugging:** build a Workflow Visualization Tool
- **Testing:** write comprehensive unit tests for the Engine
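
The cache rule in the Performance item (key format, 1h TTL, invalidation on save/activate) can be sketched with an in-memory map standing in for Redis. `DefinitionCache`, `definitionCacheKey`, and the injected clock are hypothetical, and invalidating every cached version of a `workflow_code` at once is an assumption about how "invalidate" is scoped.

```typescript
const DEFINITION_TTL_MS = 60 * 60 * 1000; // TTL: 1h

// Key format from the ADR: wf:def:{workflow_code}:{version}
function definitionCacheKey(workflowCode: string, version: number): string {
  return `wf:def:${workflowCode}:${version}`;
}

// In-memory stand-in for the Redis cache (illustration only).
class DefinitionCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(private readonly now: () => number = () => Date.now()) {}

  set(workflowCode: string, version: number, value: T): void {
    this.entries.set(definitionCacheKey(workflowCode, version), {
      value,
      expiresAt: this.now() + DEFINITION_TTL_MS,
    });
  }

  get(workflowCode: string, version: number): T | undefined {
    const key = definitionCacheKey(workflowCode, version);
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  // Called when an admin saves/activates a DSL. Returns the number of keys removed.
  invalidate(workflowCode: string): number {
    const prefix = `wf:def:${workflowCode}:`;
    const stale = Array.from(this.entries.keys()).filter((key) => key.startsWith(prefix));
    stale.forEach((key) => this.entries.delete(key));
    return stale.length;
  }
}

let fakeNow = 0;
const definitionCache = new DefinitionCache<string>(() => fakeNow);
definitionCache.set('CORRESPONDENCE_ROUTING', 1, 'compiled-v1');
const cacheHit = definitionCache.get('CORRESPONDENCE_ROUTING', 1);

fakeNow = DEFINITION_TTL_MS; // one hour later
const afterTtl = definitionCache.get('CORRESPONDENCE_ROUTING', 1);

definitionCache.set('CORRESPONDENCE_ROUTING', 2, 'compiled-v2');
const removedKeys = definitionCache.invalidate('CORRESPONDENCE_ROUTING');
```

The TTL bounds staleness if an invalidation event is ever missed, while the explicit invalidation keeps the normal path immediate.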
@@ -400,6 +553,9 @@ export class WorkflowEngineService {
- [Backend Guidelines](../05-Engineering-Guidelines/05-02-backend-guidelines.md#workflow-engine-integration) - Unified Workflow Engine
- [Unified Workflow Requirements](../01-Requirements/01-03-modules/01-03-06-unified-workflow.md) - Unified Workflow Specification
- [ADR-007 Error Handling](./ADR-007-error-handling-strategy.md) - `BusinessException` + 409 conflict response pattern
- [ADR-008 Notifications](./ADR-008-email-notification-strategy.md) - BullMQ `workflow-events` queue pattern
- [ADR-016 Security](./ADR-016-security-authentication.md) - `system.manage_all` required for DSL authoring
---
@@ -409,6 +565,8 @@ export class WorkflowEngineService {
- The Admin UI for managing Workflows will be developed in Phase 2
- A Migration Tool is required for Workflow Definition changes
- Consider BPMN 2.0 Notation in the future (if a Visual Workflow Designer is needed)
- **Required env vars:** `N8N_WEBHOOK_URL` must be set in `docker-compose.yml` for every environment before deploy
- **Bull Board UI:** install `@bull-board/nestjs` for visibility into the `workflow-events` and `workflow-events-failed` queues
---
@@ -429,12 +587,16 @@ export class WorkflowEngineService {
| Version | Date | Changes | Status |
|---------|------|---------|--------|
| 1.0 | 2026-02-24 | Initial version - DSL-based Unified Workflow Engine | ✅ Active |
| 1.1 | 2026-05-02 | Production hardening: JSON Logic condition engine, optimistic lock concurrency, BullMQ dedicated queue, context schema two-phase validation, async-only auto-action rule | ✅ Active |
---
## Related ADRs
- [ADR-002: Document Numbering Strategy](./ADR-002-document-numbering-strategy.md) - uses the Workflow Engine to trigger Document Number Generation
- [ADR-002: Document Numbering Strategy](./ADR-002-document-numbering-strategy.md) - uses the Workflow Engine to trigger Document Number Generation; Redis Redlock is for numbering only
- [ADR-007: Error Handling Strategy](./ADR-007-error-handling-strategy.md) - `ConflictException` → `BusinessException` → 409 pattern
- [ADR-008: Email/Notification Strategy](./ADR-008-email-notification-strategy.md) - BullMQ `workflow-events` dedicated queue
- [ADR-016: Security & Authentication](./ADR-016-security-authentication.md) - `system.manage_all` RBAC guard for DSL authoring
- [RBAC Matrix](../01-Requirements/01-02-business-rules/01-02-01-rbac-matrix.md) - Permission Guards in Workflow Transitions
---