np-dms/lcbp3

Fork 0

Files

T

admin 2c24991f88

CI / CD Pipeline / build (push) Failing after 6m6s

Details

CI / CD Pipeline / deploy (push) Has been skipped

Details

690503:0135 Update workflow #01

2026-05-03 01:36:37 +07:00

8.2 KiB

Raw Blame History

Research: Unified Workflow Engine — Production Hardening Decisions

Phase 0 Output | Generated: 2026-05-02
Builds on: specs/08-Tasks/ADR-021-workflow-context/research.md (attachment strategy, FK structure, UUID type — all resolved previously)

Decision 1: Optimistic Lock Strategy for `processTransition()` (FR-002)

Question: processTransition() already uses pessimistic_write DB lock. ADR-001 v1.1 requires adding version_no optimistic lock. Should they co-exist or replace?

Option A: Replace pessimistic with optimistic (Selected ❌)

Remove lock: { mode: 'pessimistic_write' } and rely solely on version_no CAS.

Cons:

Two concurrent requests with different version_no values still cause a race window between the DB read and the UPDATE
Redlock already acquired before DB transaction — removing pessimistic adds no benefit to latency

Option B: Dual-layer defense-in-depth (Selected ✅)

Keep pessimistic_write inside the transaction. Add version_no check as a fast-fail before Redlock acquisition.

Flow:

Client sends { action, version_no: N }
    ↓
[Fast-fail] Read instance.version_no from DB (no lock)
    If N ≠ instance.version_no → HTTP 409 immediately (no Redlock acquired)
    ↓
[Acquire Redlock]
    ↓
[DB Transaction with pessimistic_write]
    Re-check version_no under lock (TOCTOU defense)
    If still mismatch → 409 and release lock
    Else: commit + increment version_no

Pros:

Fast-fail saves Redlock round-trip for stale clients (SC-001 — no double approvals)
Inner pessimistic lock prevents any residual race within the DB transaction
Defense-in-depth: two independent barriers

Decision: Option B — dual-layer.
Rationale: Zero latency regression for non-conflicting requests; stale-client 409 fired before lock acquisition; inner lock remains for cross-process correctness.

Decision 2: DSL `require.role` → CASL Ability Mapping (FR-002a)

Question: How should the guard resolve DSL require.role: ["Admin"] against the CASL permission model?

Option A: Static config map in guard (Selected ✅)

const DSL_ROLE_TO_CASL: Record<string, string> = {
  'Superadmin':      'system.manage_all',
  'OrgAdmin':        'organization.manage_users',
  'ContractMember':  'contract.view',
  'AssignedHandler': '__assigned__',
};

Guard resolves each DSL role string → CASL permission string → userPermissions.includes(mapped).

Pros:

No new DB tables or config entities
Testable in isolation (mock userPermissions)
Backward-compatible: unknown DSL roles fall through to __assigned__ check

Option B: Dynamic mapping table in DB (Rejected ❌)

Store DSL role → CASL ability mappings in a new workflow_role_mappings table.

Cons:

New table requires ADR-009 delta + entity + service
Over-engineering for a mapping that changes rarely
Adds DB query to every transition guard check

Decision: Option A — static config map.
Rationale: The mapping is stable (tied to ADR-016 RBAC levels); config-driven in code is sufficient and avoids over-engineering.

Decision 3: Per-Transition Prometheus Metrics (FR-023)

Question: The existing service has Redlock-specific metrics. Where should workflow-level transition metrics be registered?

Existing metrics (keep)

workflow_redlock_acquire_duration_ms (Histogram)
workflow_redlock_acquire_failures_total (Counter)

New metrics needed

workflow_transitions_total (Counter, labels: workflow_code, action, outcome)
workflow_transition_duration_ms (Histogram, labels: workflow_code)

Registration approach: Add makeCounterProvider and makeHistogramProvider in workflow-engine.module.ts via @willsoto/nestjs-prometheus. Inject with @InjectMetric('workflow_transitions_total').

Outcome label values:

success — transition committed
conflict — optimistic lock mismatch (409) or TOCTOU
forbidden — CASL guard rejection (403)
validation_error — DSL condition failed (422)
system_error — unexpected exception (500)

Decision: Register in WorkflowEngineModule; inject into WorkflowEngineService; record in processTransition() try/catch/finally block.

Decision 4: DSL Definition Redis Cache Pattern (FR-007)

Question: Cache key format, TTL, and invalidation strategy for workflow_definitions.

Cache key design

wf:def:{workflow_code}:{version}   → single definition
wf:def:{workflow_code}:active      → pointer to active version (for fast active lookup)

Invalidation triggers

Event	Action
`createDefinition()`	SET new key; leave old active pointer
`update()` — DSL change	DEL old key; SET updated key
`is_active = true`	SET `wf:def:{code}:active`; DEL previous active pointer
`is_active = false`	DEL `wf:def:{code}:active`

TTL: 3600 seconds (1 hour). Acceptable stale window for inactive definitions; active pointer is always invalidated on toggle.

Decision: Two-key pattern with a separate :active pointer key. Invalidate pointer immediately on is_active change → satisfies SC-005 (< 1s invalidation).

Decision 5: BullMQ Dead-Letter Queue Architecture (FR-005, FR-006)

Question: How to implement workflow-events-failed DLQ with n8n webhook notification?

Option A: Separate `workflow-events-failed` queue (Selected ✅)

// WorkflowEventProcessor
@OnWorkerEvent('failed')
async onFailed(job: Job, error: Error) {
  if (job.attemptsMade >= job.opts.attempts) {
    // All retries exhausted → DLQ + webhook
    await this.failedQueue.add('dead-letter', { jobId: job.id, ...job.data });
    await this.notifyOps(job, error);
  }
}

Pros:

Failed jobs visible in Bull Board under workflow-events-failed
Can be requeued manually via Bull Board UI
n8n only notified on final failure (not on intermediate retries)

Option B: BullMQ native `removeOnFail: false` only (Rejected ❌)

Keep jobs in workflow-events completed/failed states, no separate queue.

Cons:

Bull Board has no separate DLQ view
No ops notification mechanism
Harder to isolate and requeue failed jobs

Decision: Option A — separate workflow-events-failed queue.
n8n webhook: Send via fetch(process.env.N8N_WEBHOOK_URL, { method: 'POST', body: JSON.stringify(payload) }). Guard with if (!process.env.N8N_WEBHOOK_URL) to avoid hard failure in dev environment.

Decision 6: Admin DSL Editor — JSON Editor Library (FR-024, FR-025)

Question: Which JSON editor library for the DSL authoring UI?

Option A: Monaco Editor (`@monaco-editor/react`) (Selected ✅)

Full VS Code-like editor with JSON syntax highlighting, bracket matching, and inline error markers.

Pros:

Inline error decoration (squiggle underlines) — satisfies FR-025 inline validation feedback
JSON schema validation via monaco.languages.json.jsonDefaults.setDiagnosticsOptions()
Familiar to developers
Already potentially used elsewhere in admin UIs

Cons:

~2MB bundle (lazy loaded via dynamic(() => import('@monaco-editor/react'), { ssr: false }))

Option B: CodeMirror 6 (`@uiw/react-codemirror`) (Alternative)

Lighter (~400KB for JSON extension).

Cons:

No native JSON Schema validation; requires custom linting extension
Inline error decoration requires manual setup

Decision: Option A — Monaco Editor, lazy-loaded. The bundle cost is acceptable for an Admin-only page (not user-facing); inline validation is a critical FR-025 requirement.
DSL Schema: Provide the compiled DSL JSON Schema to Monaco for inline validation → errors shown before the user clicks Save.

Carry-Forward from Prior Research

The following decisions from specs/08-Tasks/ADR-021-workflow-context/research.md remain valid and are not re-litigated:

File attachment strategy: Upload-then-reference (Two-Phase, ADR-016) ✅
FK structure: Direct workflow_history_id on attachments table ✅
UUID type for workflow_histories.id: CHAR(36) UUID direct PK ✅
Redlock scope: Transition-level Redlock (not document-numbering Redlock) ✅
Preview endpoint: Use existing /api/files/{publicId} with Content-Disposition: inline ✅

8.2 KiB Raw Blame History

Research: Unified Workflow Engine — Production Hardening Decisions

Decision 1: Optimistic Lock Strategy for processTransition() (FR-002)

Option A: Replace pessimistic with optimistic (Selected ❌)

Option B: Dual-layer defense-in-depth (Selected ✅)

Decision 2: DSL require.role → CASL Ability Mapping (FR-002a)

Option A: Static config map in guard (Selected ✅)

Option B: Dynamic mapping table in DB (Rejected ❌)

Decision 3: Per-Transition Prometheus Metrics (FR-023)

Existing metrics (keep)

New metrics needed

Decision 4: DSL Definition Redis Cache Pattern (FR-007)

Cache key design

Invalidation triggers

TTL: 3600 seconds (1 hour). Acceptable stale window for inactive definitions; active pointer is always invalidated on toggle.

Decision 5: BullMQ Dead-Letter Queue Architecture (FR-005, FR-006)

Option A: Separate workflow-events-failed queue (Selected ✅)

Option B: BullMQ native removeOnFail: false only (Rejected ❌)

Decision 6: Admin DSL Editor — JSON Editor Library (FR-024, FR-025)

Option A: Monaco Editor (@monaco-editor/react) (Selected ✅)

Option B: CodeMirror 6 (@uiw/react-codemirror) (Alternative)

Carry-Forward from Prior Research

8.2 KiB

Raw Blame History

Decision 1: Optimistic Lock Strategy for `processTransition()` (FR-002)

Decision 2: DSL `require.role` → CASL Ability Mapping (FR-002a)

Option A: Separate `workflow-events-failed` queue (Selected ✅)

Option B: BullMQ native `removeOnFail: false` only (Rejected ❌)

Option A: Monaco Editor (`@monaco-editor/react`) (Selected ✅)

Option B: CodeMirror 6 (`@uiw/react-codemirror`) (Alternative)