690503:0135 Update workflow #01
CI / CD Pipeline / build (push) Failing after 6m6s
CI / CD Pipeline / deploy (push) Has been skipped

2026-05-03 01:35:05 +07:00
parent d239b58387
commit 2c24991f88
85 changed files with 6335 additions and 100 deletions
@@ -0,0 +1,117 @@
---
name: diagnose
description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
---
# Diagnose
A discipline for hard bugs. Skip phases only when explicitly justified.
When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching.
## Phase 1 — Build a feedback loop
**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.
Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
### Ways to construct one — try them in roughly this order
1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e.
2. **Curl / HTTP script** against a running dev server.
3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot.
4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network.
5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation.
6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call.
7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs.
10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you.
Build the right feedback loop, and the bug is 90% fixed.
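Item 8 is often the highest-leverage option. `git bisect run` needs only a script whose exit code encodes the pass/fail signal. A sketch, with the repro reduced to a stand-in predicate you would replace with your real loop:

```shell
#!/usr/bin/env bash
# check.sh — exit 0 when the bug is absent, non-zero when present.
# Run via: git bisect run ./check.sh
set -euo pipefail

# Stand-in predicate: replace with your real feedback loop
# (test run, curl against a dev server, CLI output diff, ...).
check() {
  case "$1" in
    *NaN*) return 1 ;;  # symptom present: bad commit
    *)     return 0 ;;  # symptom absent: good commit
  esac
}
```

Then `git bisect start`, mark one `good` and one `bad` commit, and `git bisect run ./check.sh` converges in logarithmically many steps.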
### Iterate on the loop itself
Treat the loop as a product. Once you have _a_ loop, ask:
- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
### Non-deterministic bugs
The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
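A sketch of measuring the rate before trying to raise it. `flaky` is a stand-in for the real trigger (bash only, since it leans on `$RANDOM`):

```shell
# Stand-in trigger: succeeds or fails at random, roughly 50/50.
flaky() { [ $(( RANDOM % 2 )) -eq 0 ]; }

# Run the trigger N times and report how often the bug appeared.
measure_rate() {
  local runs="$1" fails=0
  for _ in $(seq "$runs"); do
    flaky || fails=$((fails + 1))
  done
  printf '%d/%d\n' "$fails" "$runs"
}
```

If the printed rate is too low to debug against, change the loop (stress, parallelism, injected sleeps) and re-measure, rather than guessing.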
### When you genuinely cannot build a loop
Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop.
Do not proceed to Phase 2 until you have a loop you believe in.
## Phase 2 — Reproduce
Run the loop. Watch the bug appear.
Confirm:
- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.
Do not proceed until you reproduce the bug.
## Phase 3 — Hypothesise
Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
Each hypothesis must be **falsifiable**: state the prediction it makes.
> Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse."
If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
## Phase 4 — Instrument
Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
Tool preference:
1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
2. **Targeted logs** at the boundaries that distinguish hypotheses.
3. Never "log everything and grep".
**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
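A minimal illustration of the tagging discipline; the file and tag are hypothetical:

```shell
# Every debug log carries one unique tag; cleanup is then a single grep.
tag='DEBUG-a4f2'                  # hypothetical tag for this session
workdir="$(mktemp -d)"
cat > "$workdir/cart.js" <<EOF
console.log('[${tag}] cart size', cart.length);
console.log('checkout complete'); // real log, untouched by cleanup
EOF

# Before declaring done, list every instrumentation site still present:
grep -rn "\[$tag\]" "$workdir"
```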
**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second.
## Phase 5 — Fix + regression test
Write the regression test **before the fix** — but only if there is a **correct seam** for it.
A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
If a correct seam exists:
1. Turn the minimised repro into a failing test at that seam.
2. Watch it fail.
3. Apply the fix.
4. Watch it pass.
5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
## Phase 6 — Cleanup + post-mortem
Required before declaring done:
- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
- [ ] Regression test passes (or absence of seam is documented)
- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix)
- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started.
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
# Human-in-the-loop reproduction loop.
# Copy this file, edit the steps below, and run it.
# The agent runs the script; the user follows prompts in their terminal.
#
# Usage:
# bash hitl-loop.template.sh
#
# Two helpers:
# step "<instruction>" → show instruction, wait for Enter
# capture VAR "<question>" → show question, read response into VAR
#
# At the end, captured values are printed as KEY=VALUE for the agent to parse.
set -euo pipefail
step() {
printf '\n>>> %s\n' "$1"
read -r -p " [Enter when done] " _
}
capture() {
local var="$1" question="$2" answer
printf '\n>>> %s\n' "$question"
read -r -p " > " answer
printf -v "$var" '%s' "$answer"
}
# --- edit below ---------------------------------------------------------
step "Open the app at http://localhost:3000 and sign in."
capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"
capture ERROR_MSG "Paste the error message (or 'none'):"
# --- edit above ---------------------------------------------------------
printf '\n--- Captured ---\n'
printf 'ERRORED=%s\n' "$ERRORED"
printf 'ERROR_MSG=%s\n' "$ERROR_MSG"
@@ -0,0 +1,47 @@
# ADR Format
ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc.
Create the `docs/adr/` directory lazily — only when the first ADR is needed.
## Template
```md
# {Short title of the decision}
{1-3 sentences: what's the context, what did we decide, and why.}
```
That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections.
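A hypothetical filled-in example at that size, to show how little is needed:

```md
# Use Postgres advisory locks for job deduplication

Two workers occasionally picked up the same job. We considered a dedicated
lock service and a `SELECT ... FOR UPDATE` queue table; advisory locks need
no new infrastructure and release automatically on connection loss, so we
chose them.
```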
## Optional sections
Only include these when they add genuine value. Most ADRs won't need them.
- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited
- **Considered Options** — only when the rejected alternatives are worth remembering
- **Consequences** — only when non-obvious downstream effects need to be called out
## Numbering
Scan `docs/adr/` for the highest existing number and increment by one.
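That scan can be sketched as a small helper, assuming the `NNNN-slug.md` naming above:

```shell
# Next ADR number: highest existing 4-digit prefix in the directory, plus one.
next_adr_number() {
  local dir="$1" last
  last="$(ls "$dir" 2>/dev/null | grep -oE '^[0-9]{4}' | sort -n | tail -n 1)"
  # 10# forces base-10 so "0007" is not read as octal; empty dir yields 0001.
  printf '%04d' "$(( 10#${last:-0} + 1 ))"
}
```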
## When to offer an ADR
All three of these must be true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing."
### What qualifies
- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres."
- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP."
- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out.
- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit "no"s are as valuable as the "yes"es.
- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate.
- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract."
- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months.
@@ -0,0 +1,77 @@
# CONTEXT.md Format
## Structure
```md
# {Context Name}
{One or two sentence description of what this context is and why it exists.}
## Language
**Order**:
{A concise description of the term}
_Avoid_: Purchase, transaction
**Invoice**:
A request for payment sent to a customer after delivery.
_Avoid_: Bill, payment request
**Customer**:
A person or organization that places orders.
_Avoid_: Client, buyer, account
## Relationships
- An **Order** produces one or more **Invoices**
- An **Invoice** belongs to exactly one **Customer**
## Example dialogue
> **Dev:** "When a **Customer** places an **Order**, do we create the **Invoice** immediately?"
> **Domain expert:** "No — an **Invoice** is only generated once a **Fulfillment** is confirmed."
## Flagged ambiguities
- "account" was used to mean both **Customer** and **User** — resolved: these are distinct concepts.
```
## Rules
- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others as aliases to avoid.
- **Flag conflicts explicitly.** If a term is used ambiguously, call it out in "Flagged ambiguities" with a clear resolution.
- **Keep definitions tight.** One sentence max. Define what it IS, not what it does.
- **Show relationships.** Use bold term names and express cardinality where obvious.
- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs.
- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine.
- **Write an example dialogue.** A conversation between a dev and a domain expert that demonstrates how the terms interact naturally and clarifies boundaries between related concepts.
## Single vs multi-context repos
**Single context (most repos):** One `CONTEXT.md` at the repo root.
**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other:
```md
# Context Map
## Contexts
- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders
- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments
- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping
## Relationships
- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking
- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices
- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money`
```
The skill infers which structure applies:
- If `CONTEXT-MAP.md` exists, read it to find contexts
- If only a root `CONTEXT.md` exists, single context
- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved
When multiple contexts exist, infer which one the current topic relates to. If unclear, ask.
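The three inference rules can be sketched as a predicate; the directory argument exists only to make the sketch testable:

```shell
# Infer the repo's context layout from which files exist at its root.
detect_layout() {
  local dir="$1"
  if [ -f "$dir/CONTEXT-MAP.md" ]; then
    echo multi-context
  elif [ -f "$dir/CONTEXT.md" ]; then
    echo single-context
  else
    echo none   # create a root CONTEXT.md lazily when a term is resolved
  fi
}
```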
@@ -0,0 +1,88 @@
---
name: grill-with-docs
description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions.
---
<what-to-do>
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
Ask the questions one at a time, waiting for feedback on each question before continuing.
If a question can be answered by exploring the codebase, explore the codebase instead.
</what-to-do>
<supporting-info>
## Domain awareness
During codebase exploration, also look for existing documentation:
### File structure
Most repos have a single context:
```
/
├── CONTEXT.md
├── docs/
│ └── adr/
│ ├── 0001-event-sourced-orders.md
│ └── 0002-postgres-for-write-model.md
└── src/
```
If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives:
```
/
├── CONTEXT-MAP.md
├── docs/
│ └── adr/ ← system-wide decisions
├── src/
│ ├── ordering/
│ │ ├── CONTEXT.md
│ │ └── docs/adr/ ← context-specific decisions
│ └── billing/
│ ├── CONTEXT.md
│ └── docs/adr/
```
Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed.
## During the session
### Challenge against the glossary
When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?"
### Sharpen fuzzy language
When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things."
### Discuss concrete scenarios
When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts.
### Cross-reference with code
When the user states how something works, check whether the code agrees. If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?"
### Update CONTEXT.md inline
When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md).
Don't couple `CONTEXT.md` to implementation details. Only include terms that are meaningful to domain experts.
### Offer ADRs sparingly
Only offer to create an ADR when all three are true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will wonder "why did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md).
</supporting-info>
@@ -0,0 +1,121 @@
---
name: setup-matt-pocock-skills
description: Sets up an `## Agent skills` block in AGENTS.md/CLAUDE.md and `docs/agents/` so the engineering skills know this repo's issue tracker (GitHub or local markdown), triage label vocabulary, and domain doc layout. Run before first use of `to-issues`, `to-prd`, `triage`, `diagnose`, `tdd`, `improve-codebase-architecture`, or `zoom-out` — or if those skills appear to be missing context about the issue tracker, triage labels, or domain docs.
disable-model-invocation: true
---
# Setup Matt Pocock's Skills
Scaffold the per-repo configuration that the engineering skills assume:
- **Issue tracker** — where issues live (GitHub by default; local markdown is also supported out of the box)
- **Triage labels** — the strings used for the five canonical triage roles
- **Domain docs** — where `CONTEXT.md` and ADRs live, and the consumer rules for reading them
This is a prompt-driven skill, not a deterministic script. Explore, present what you found, confirm with the user, then write.
## Process
### 1. Explore
Look at the current repo to understand its starting state. Read whatever exists; don't assume:
- `git remote -v` and `.git/config` — is this a GitHub repo? Which one?
- `AGENTS.md` and `CLAUDE.md` at the repo root — does either exist? Is there already an `## Agent skills` section in either?
- `CONTEXT.md` and `CONTEXT-MAP.md` at the repo root
- `docs/adr/` and any `src/*/docs/adr/` directories
- `docs/agents/` — does this skill's prior output already exist?
- `.scratch/` — sign that a local-markdown issue tracker convention is already in use
### 2. Present findings and ask
Summarise what's present and what's missing. Then walk the user through the three decisions **one at a time** — present a section, get the user's answer, then move to the next. Don't dump all three at once.
Assume the user does not know what these terms mean. Each section starts with a short explainer (what it is, why these skills need it, what changes if they pick differently). Then show the choices and the default.
**Section A — Issue tracker.**
> Explainer: The "issue tracker" is where issues live for this repo. Skills like `to-issues`, `triage`, `to-prd`, and `qa` read from and write to it — they need to know whether to call `gh issue create`, write a markdown file under `.scratch/`, or follow some other workflow you describe. Pick the place you actually track work for this repo.
Default posture: these skills were designed for GitHub. If a `git remote` points at GitHub, propose that. If a `git remote` points at GitLab (`gitlab.com` or a self-hosted host), propose GitLab. Otherwise (or if the user prefers), offer:
- **GitHub** — issues live in the repo's GitHub Issues (uses the `gh` CLI)
- **GitLab** — issues live in the repo's GitLab Issues (uses the [`glab`](https://gitlab.com/gitlab-org/cli) CLI)
- **Local markdown** — issues live as files under `.scratch/<feature>/` in this repo (good for solo projects or repos without a remote)
- **Other** (Jira, Linear, etc.) — ask the user to describe the workflow in one paragraph; the skill will record it as freeform prose
**Section B — Triage label vocabulary.**
> Explainer: When the `triage` skill processes an incoming issue, it moves it through a state machine — needs evaluation, waiting on reporter, ready for an AFK agent to pick up, ready for a human, or won't fix. To do that, it needs to apply labels (or the equivalent in your issue tracker) that match strings *you've actually configured*. If your repo already uses different label names (e.g. `bug:triage` instead of `needs-triage`), map them here so the skill applies the right ones instead of creating duplicates.
The five canonical roles:
- `needs-triage` — maintainer needs to evaluate
- `needs-info` — waiting on reporter
- `ready-for-agent` — fully specified, AFK-ready (an agent can pick it up with no human context)
- `ready-for-human` — needs human implementation
- `wontfix` — will not be actioned
Default: each role's string equals its name. Ask the user if they want to override any. If their issue tracker has no existing labels, the defaults are fine.
**Section C — Domain docs.**
> Explainer: Some skills (`improve-codebase-architecture`, `diagnose`, `tdd`) read a `CONTEXT.md` file to learn the project's domain language, and `docs/adr/` for past architectural decisions. They need to know whether the repo has one global context or multiple (e.g. a monorepo with separate frontend/backend contexts) so they look in the right place.
Confirm the layout:
- **Single-context** — one `CONTEXT.md` + `docs/adr/` at the repo root. Most repos are this.
- **Multi-context** — `CONTEXT-MAP.md` at the root pointing to per-context `CONTEXT.md` files (typically a monorepo).
### 3. Confirm and edit
Show the user a draft of:
- The `## Agent skills` block to add to whichever of `CLAUDE.md` / `AGENTS.md` is being edited (see step 4 for selection rules)
- The contents of `docs/agents/issue-tracker.md`, `docs/agents/triage-labels.md`, `docs/agents/domain.md`
Let them edit before writing.
### 4. Write
**Pick the file to edit:**
- If `CLAUDE.md` exists, edit it.
- Else if `AGENTS.md` exists, edit it.
- If neither exists, ask the user which one to create — don't pick for them.
Never create `AGENTS.md` when `CLAUDE.md` already exists (or vice versa) — always edit the one that's already there.
If an `## Agent skills` block already exists in the chosen file, update its contents in-place rather than appending a duplicate. Don't overwrite user edits to the surrounding sections.
The block:
```markdown
## Agent skills
### Issue tracker
[one-line summary of where issues are tracked]. See `docs/agents/issue-tracker.md`.
### Triage labels
[one-line summary of the label vocabulary]. See `docs/agents/triage-labels.md`.
### Domain docs
[one-line summary of layout — "single-context" or "multi-context"]. See `docs/agents/domain.md`.
```
Then write the three docs files using the seed templates in this skill folder as a starting point:
- [issue-tracker-github.md](./issue-tracker-github.md) — GitHub issue tracker
- [issue-tracker-gitlab.md](./issue-tracker-gitlab.md) — GitLab issue tracker
- [issue-tracker-local.md](./issue-tracker-local.md) — local-markdown issue tracker
- [triage-labels.md](./triage-labels.md) — label mapping
- [domain.md](./domain.md) — domain doc consumer rules + layout
For "other" issue trackers, write `docs/agents/issue-tracker.md` from scratch using the user's description.
### 5. Done
Tell the user the setup is complete and which engineering skills will now read from these files. Mention they can edit `docs/agents/*.md` directly later — re-running this skill is only necessary if they want to switch issue trackers or restart from scratch.
@@ -0,0 +1,51 @@
# Domain Docs
How the engineering skills should consume this repo's domain documentation when exploring the codebase.
## Before exploring, read these
- **`CONTEXT.md`** at the repo root, or
- **`CONTEXT-MAP.md`** at the repo root if it exists — it points at one `CONTEXT.md` per context. Read each one relevant to the topic.
- **`docs/adr/`** — read ADRs that touch the area you're about to work in. In multi-context repos, also check `src/<context>/docs/adr/` for context-scoped decisions.
If any of these files don't exist, **proceed silently**. Don't flag their absence; don't suggest creating them upfront. The producer skill (`/grill-with-docs`) creates them lazily when terms or decisions actually get resolved.
## File structure
Single-context repo (most repos):
```
/
├── CONTEXT.md
├── docs/adr/
│ ├── 0001-event-sourced-orders.md
│ └── 0002-postgres-for-write-model.md
└── src/
```
Multi-context repo (presence of `CONTEXT-MAP.md` at the root):
```
/
├── CONTEXT-MAP.md
├── docs/adr/ ← system-wide decisions
└── src/
├── ordering/
│ ├── CONTEXT.md
│ └── docs/adr/ ← context-specific decisions
└── billing/
├── CONTEXT.md
└── docs/adr/
```
## Use the glossary's vocabulary
When your output names a domain concept (in an issue title, a refactor proposal, a hypothesis, a test name), use the term as defined in `CONTEXT.md`. Don't drift to synonyms the glossary explicitly avoids.
If the concept you need isn't in the glossary yet, that's a signal — either you're inventing language the project doesn't use (reconsider) or there's a real gap (note it for `/grill-with-docs`).
## Flag ADR conflicts
If your output contradicts an existing ADR, surface it explicitly rather than silently overriding:
> _Contradicts ADR-0007 (event-sourced orders) — but worth reopening because…_
@@ -0,0 +1,22 @@
# Issue tracker: GitHub
Issues and PRDs for this repo live as GitHub issues. Use the `gh` CLI for all operations.
## Conventions
- **Create an issue**: `gh issue create --title "..." --body "..."`. Use a heredoc for multi-line bodies.
- **Read an issue**: `gh issue view <number> --comments`; for machine-readable output, request `--json comments,labels` and filter with `--jq`.
- **List issues**: `gh issue list --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
- **Comment on an issue**: `gh issue comment <number> --body "..."`
- **Apply / remove labels**: `gh issue edit <number> --add-label "..."` / `--remove-label "..."`
- **Close**: `gh issue close <number> --comment "..."`
Infer the repo from `git remote -v`; `gh` does this automatically when run inside a clone.
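A sketch of the heredoc convention for multi-line bodies. The issue content is invented, and the `gh` call is left commented because it needs a real repo and auth:

```shell
# Build a multi-line issue body with a quoted heredoc (no shell expansion).
body="$(cat <<'EOF'
## Symptom
Clicking Export with an empty cart throws `TypeError`.

## Repro
1. Sign in at http://localhost:3000
2. Click Export with no items in the cart
EOF
)"

# Then (requires gh and a GitHub remote):
#   gh issue create --title "Export throws on empty cart" \
#     --label "needs-triage" --body "$body"
printf '%s\n' "$body"
```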
## When a skill says "publish to the issue tracker"
Create a GitHub issue.
## When a skill says "fetch the relevant ticket"
Run `gh issue view <number> --comments`.
@@ -0,0 +1,23 @@
# Issue tracker: GitLab
Issues and PRDs for this repo live as GitLab issues. Use the [`glab`](https://gitlab.com/gitlab-org/cli) CLI for all operations.
## Conventions
- **Create an issue**: `glab issue create --title "..." --description "..."`. Use a heredoc for multi-line descriptions. Pass `--description -` to open an editor.
- **Read an issue**: `glab issue view <number> --comments`. Use `-F json` for machine-readable output.
- **List issues**: `glab issue list --state opened -F json` with appropriate `--label` filters. Note that GitLab uses `opened` (not `open`) for the state value.
- **Comment on an issue**: `glab issue note <number> --message "..."`. GitLab calls comments "notes".
- **Apply / remove labels**: `glab issue update <number> --label "..."` / `--unlabel "..."`. Pass multiple labels comma-separated or by repeating the flag.
- **Close**: `glab issue close <number>`. `glab issue close` does not accept a closing comment, so post the explanation first with `glab issue note <number> --message "..."`, then close.
- **Merge requests**: GitLab calls PRs "merge requests". Use `glab mr create`, `glab mr view`, `glab mr note`, etc. — the same shape as `gh pr ...` with `mr` in place of `pr` and `note`/`--message` in place of `comment`/`--body`.
Infer the repo from `git remote -v`; `glab` does this automatically when run inside a clone.
## When a skill says "publish to the issue tracker"
Create a GitLab issue.
## When a skill says "fetch the relevant ticket"
Run `glab issue view <number> --comments`.
@@ -0,0 +1,19 @@
# Issue tracker: Local Markdown
Issues and PRDs for this repo live as markdown files in `.scratch/`.
## Conventions
- One feature per directory: `.scratch/<feature-slug>/`
- The PRD is `.scratch/<feature-slug>/PRD.md`
- Implementation issues are `.scratch/<feature-slug>/issues/<NN>-<slug>.md`, numbered from `01`
- Triage state is recorded as a `Status:` line near the top of each issue file (see `triage-labels.md` for the role strings)
- Comments and conversation history append to the bottom of the file under a `## Comments` heading
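These conventions can be sketched as a scaffolding helper; the feature and slug names in the usage are hypothetical:

```shell
# Create .scratch/<feature-slug>/issues/<NN>-<slug>.md with the next number.
new_issue() {
  local feature="$1" slug="$2" dir last n path
  dir=".scratch/$feature/issues"
  mkdir -p "$dir"
  last="$(ls "$dir" 2>/dev/null | grep -oE '^[0-9]{2}' | sort -n | tail -n 1)"
  n="$(printf '%02d' "$(( 10#${last:-0} + 1 ))")"
  path="$dir/$n-$slug.md"
  printf '# %s\n\nStatus: needs-triage\n\n## Comments\n' "$slug" > "$path"
  echo "$path"
}
```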
## When a skill says "publish to the issue tracker"
Create a new file under `.scratch/<feature-slug>/` (creating the directory if needed).
## When a skill says "fetch the relevant ticket"
Read the file at the referenced path. The user will normally pass the path or the issue number directly.
@@ -0,0 +1,15 @@
# Triage Labels
The skills speak in terms of five canonical triage roles. This file maps those roles to the actual label strings used in this repo's issue tracker.
| Label in mattpocock/skills | Label in our tracker | Meaning |
| -------------------------- | -------------------- | ---------------------------------------- |
| `needs-triage` | `needs-triage` | Maintainer needs to evaluate this issue |
| `needs-info` | `needs-info` | Waiting on reporter for more information |
| `ready-for-agent` | `ready-for-agent` | Fully specified, ready for an AFK agent |
| `ready-for-human` | `ready-for-human` | Requires human implementation |
| `wontfix` | `wontfix` | Will not be actioned |
When a skill mentions a role (e.g. "apply the AFK-ready triage label"), use the corresponding label string from this table.
Edit the right-hand column to match whatever vocabulary you actually use.
+117
View File
@@ -0,0 +1,117 @@
---
name: diagnose
description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
---
# Diagnose
A discipline for hard bugs. Skip phases only when explicitly justified.
When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching.
## Phase 1 — Build a feedback loop
**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.
Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
### Ways to construct one — try them in roughly this order
1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e.
2. **Curl / HTTP script** against a running dev server.
3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot.
4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network.
5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation.
6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call.
7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs.
10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you.
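Technique 8 can be sketched end-to-end with a throwaway repo — everything below is synthetic (the five-commit history, the `helo` typo, and `check.sh` are invented for illustration), but the shape is exactly what `git bisect run` consumes:

```bash
#!/usr/bin/env bash
# Bisection harness sketch: a disposable repo where commit 4 introduces a "bug".
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email agent@example.com
git config user.name agent

for i in 1 2 3 4 5; do
  if [ "$i" -ge 4 ]; then
    echo "helo $i" > greeting.txt    # the regression lands in commit 4
  else
    echo "hello $i" > greeting.txt
  fi
  git add greeting.txt
  git commit -qm "commit $i"
done

# The harness: exit 0 = good, non-zero = bad. This is the whole pass/fail signal.
cat > check.sh <<'EOF'
grep -q '^hello' greeting.txt
EOF

git bisect start HEAD HEAD~4 > /dev/null   # bad = HEAD, good = commit 1
git bisect run bash check.sh > /dev/null
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad: $first_bad"
```

In a real codebase, `check.sh` becomes your build-and-test command; the exit-code contract is all bisect cares about.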
Build the right feedback loop, and the bug is 90% fixed.
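Technique 9 in miniature — the two shell functions stand in for an old and a new build, and the "regression" (a dedupe sneaking into the new version) is invented so the sketch runs as-is:

```bash
#!/usr/bin/env bash
# Differential loop sketch: same input through two versions, diff the outputs.
old_version() { sort "$1"; }
new_version() { sort -u "$1"; }   # hypothetical regression: dedupe sneaked in

input=$(mktemp)
printf 'b\na\na\n' > "$input"

if diff <(old_version "$input") <(new_version "$input"); then
  verdict="identical"
else
  verdict="diverged"
fi
echo "$verdict"
```

Any divergence line that `diff` prints is itself a minimisation clue — it tells you which inputs to keep.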
### Iterate on the loop itself
Treat the loop as a product. Once you have _a_ loop, ask:
- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
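Two of those determinism knobs, sketched in bash (assigning to `RANDOM` seeds bash's generator; `TZ` pins the timezone) — a sketch of the idea, not a universal recipe:

```bash
#!/usr/bin/env bash
# Pinning sources of nondeterminism: fixed timezone, seeded RNG.
export TZ=UTC

RANDOM=42             # assigning to RANDOM seeds bash's RNG
a=$RANDOM
b=$RANDOM

RANDOM=42             # reseed with the same value...
c=$RANDOM
d=$RANDOM             # ...and the sequence repeats

[ "$a" -eq "$c" ] && [ "$b" -eq "$d" ] && echo "reproducible"
```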
### Non-deterministic bugs
The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
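The counting loop itself is trivial — what matters is running the trigger enough times to get a rate. A sketch with a simulated 50%-flaky `repro` stand-in so it runs as-is; swap in your real repro command:

```bash
#!/usr/bin/env bash
# Measure reproduction rate by looping the trigger.
repro() { [ $((RANDOM % 2)) -eq 0 ]; }   # stand-in: flips a coin

runs=200
fails=0
for _ in $(seq 1 "$runs"); do
  repro || fails=$((fails + 1))
done
echo "reproduction rate: ${fails}/${runs}"
```

Re-run this after each stress tweak (parallelism, injected sleeps) and keep the tweaks that push the rate up.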
### When you genuinely cannot build a loop
Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop.
Do not proceed to Phase 2 until you have a loop you believe in.
## Phase 2 — Reproduce
Run the loop. Watch the bug appear.
Confirm:
- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.
Do not proceed until you reproduce the bug.
## Phase 3 — Hypothesise
Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
Each hypothesis must be **falsifiable**: state the prediction it makes.
> Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse."
If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
## Phase 4 — Instrument
Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
Tool preference:
1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
2. **Targeted logs** at the boundaries that distinguish hypotheses.
3. Never "log everything and grep".
**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
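The cleanup sweep is a single grep. A runnable sketch against a temp directory standing in for your source tree, with one invented leftover log:

```bash
#!/usr/bin/env bash
# Find leftover tagged instrumentation. Zero matches = clean.
src=$(mktemp -d)   # stand-in for your real source tree
printf 'console.log("[DEBUG-a4f2] order state", order);\n' > "$src/checkout.js"

leftovers=$(grep -rl 'DEBUG-a4f2' "$src" | wc -l)
if [ "$leftovers" -gt 0 ]; then
  echo "instrumentation still present in $leftovers file(s)"
else
  echo "clean"
fi
```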
**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second.
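A minimal baseline harness in shell — median wall-clock time over five runs of a placeholder `slow_path` (here just a sleep), so before/after comparisons are numbers rather than impressions. Assumes GNU `date` with nanosecond support:

```bash
#!/usr/bin/env bash
# Timing baseline: median elapsed ms over five runs.
slow_path() { sleep 0.05; }   # placeholder for the code under measurement

times=()
for _ in 1 2 3 4 5; do
  start=$(date +%s%N)
  slow_path
  end=$(date +%s%N)
  times+=( $(( (end - start) / 1000000 )) )   # elapsed ms
done

median=$(printf '%s\n' "${times[@]}" | sort -n | sed -n '3p')
echo "median: ${median}ms"
```

For anything subtler than wall-clock, reach for a real profiler — this only establishes the number you bisect against.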
## Phase 5 — Fix + regression test
Write the regression test **before the fix** — but only if there is a **correct seam** for it.
A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
If a correct seam exists:
1. Turn the minimised repro into a failing test at that seam.
2. Watch it fail.
3. Apply the fix.
4. Watch it pass.
5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
## Phase 6 — Cleanup + post-mortem
Required before declaring done:
- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
- [ ] Regression test passes (or absence of seam is documented)
- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix)
- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started.
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
# Human-in-the-loop reproduction loop.
# Copy this file, edit the steps below, and run it.
# The agent runs the script; the user follows prompts in their terminal.
#
# Usage:
# bash hitl-loop.template.sh
#
# Two helpers:
# step "<instruction>" → show instruction, wait for Enter
# capture VAR "<question>" → show question, read response into VAR
#
# At the end, captured values are printed as KEY=VALUE for the agent to parse.
set -euo pipefail
step() {
printf '\n>>> %s\n' "$1"
read -r -p " [Enter when done] " _
}
capture() {
local var="$1" question="$2" answer
printf '\n>>> %s\n' "$question"
read -r -p " > " answer
printf -v "$var" '%s' "$answer"
}
# --- edit below ---------------------------------------------------------
step "Open the app at http://localhost:3000 and sign in."
capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"
capture ERROR_MSG "Paste the error message (or 'none'):"
# --- edit above ---------------------------------------------------------
printf '\n--- Captured ---\n'
printf 'ERRORED=%s\n' "$ERRORED"
printf 'ERROR_MSG=%s\n' "$ERROR_MSG"
@@ -0,0 +1,47 @@
# ADR Format
ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc.
Create the `docs/adr/` directory lazily — only when the first ADR is needed.
## Template
```md
# {Short title of the decision}
{1-3 sentences: what's the context, what did we decide, and why.}
```
That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections.
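For instance, a complete ADR under this template can be as small as (project details invented):

```md
# Use Postgres for the write model

Order history needs ad-hoc relational queries and we already operate Postgres
in production. We picked it over adding a document store; a second datastore
wasn't worth the operational cost for this write volume.
```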
## Optional sections
Only include these when they add genuine value. Most ADRs won't need them.
- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited
- **Considered Options** — only when the rejected alternatives are worth remembering
- **Consequences** — only when non-obvious downstream effects need to be called out
## Numbering
Scan `docs/adr/` for the highest existing number and increment by one.
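That scan is a one-liner — sketched here against a temp directory standing in for `docs/adr/`, with two invented ADR filenames:

```bash
#!/usr/bin/env bash
# Compute the next ADR number from the highest existing one.
adr_dir=$(mktemp -d)   # stand-in for docs/adr/
touch "$adr_dir/0001-monorepo.md" "$adr_dir/0007-postgres-write-model.md"

last=$(ls "$adr_dir" | grep -o '^[0-9]\{4\}' | sort -n | tail -1)
next=$(printf '%04d' $(( 10#$last + 1 )))   # 10# avoids octal on leading zeros
echo "next ADR: $next"
```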
## When to offer an ADR
All three of these must be true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing."
### What qualifies
- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres."
- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP."
- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out.
- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit noes are as valuable as the yeses.
- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate.
- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract."
- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months.
@@ -0,0 +1,77 @@
# CONTEXT.md Format
## Structure
```md
# {Context Name}
{One or two sentence description of what this context is and why it exists.}
## Language
**Order**:
A customer request to purchase specific goods, accepted by the business.
_Avoid_: Purchase, transaction
**Invoice**:
A request for payment sent to a customer after delivery.
_Avoid_: Bill, payment request
**Customer**:
A person or organization that places orders.
_Avoid_: Client, buyer, account
## Relationships
- An **Order** produces one or more **Invoices**
- An **Invoice** belongs to exactly one **Customer**
## Example dialogue
> **Dev:** "When a **Customer** places an **Order**, do we create the **Invoice** immediately?"
> **Domain expert:** "No — an **Invoice** is only generated once a **Fulfillment** is confirmed."
## Flagged ambiguities
- "account" was used to mean both **Customer** and **User** — resolved: these are distinct concepts.
```
## Rules
- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others as aliases to avoid.
- **Flag conflicts explicitly.** If a term is used ambiguously, call it out in "Flagged ambiguities" with a clear resolution.
- **Keep definitions tight.** One sentence max. Define what it IS, not what it does.
- **Show relationships.** Use bold term names and express cardinality where obvious.
- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs.
- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine.
- **Write an example dialogue.** A conversation between a dev and a domain expert that demonstrates how the terms interact naturally and clarifies boundaries between related concepts.
## Single vs multi-context repos
**Single context (most repos):** One `CONTEXT.md` at the repo root.
**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other:
```md
# Context Map
## Contexts
- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders
- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments
- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping
## Relationships
- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking
- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices
- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money`
```
The skill infers which structure applies:
- If `CONTEXT-MAP.md` exists, read it to find contexts
- If only a root `CONTEXT.md` exists, single context
- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved
When multiple contexts exist, infer which one the current topic relates to. If unclear, ask.
@@ -0,0 +1,88 @@
---
name: grill-with-docs
description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions.
---
<what-to-do>
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
Ask the questions one at a time, waiting for feedback on each question before continuing.
If a question can be answered by exploring the codebase, explore the codebase instead.
</what-to-do>
<supporting-info>
## Domain awareness
During codebase exploration, also look for existing documentation:
### File structure
Most repos have a single context:
```
/
├── CONTEXT.md
├── docs/
│   └── adr/
│       ├── 0001-event-sourced-orders.md
│       └── 0002-postgres-for-write-model.md
└── src/
```
If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives:
```
/
├── CONTEXT-MAP.md
├── docs/
│   └── adr/          ← system-wide decisions
├── src/
│   ├── ordering/
│   │   ├── CONTEXT.md
│   │   └── docs/adr/ ← context-specific decisions
│   └── billing/
│       ├── CONTEXT.md
│       └── docs/adr/
```
Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed.
## During the session
### Challenge against the glossary
When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?"
### Sharpen fuzzy language
When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things."
### Discuss concrete scenarios
When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts.
### Cross-reference with code
When the user states how something works, check whether the code agrees. If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?"
### Update CONTEXT.md inline
When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md).
Don't couple `CONTEXT.md` to implementation details. Only include terms that are meaningful to domain experts.
### Offer ADRs sparingly
Only offer to create an ADR when all three are true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will wonder "why did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md).
</supporting-info>
@@ -0,0 +1,121 @@
---
name: setup-matt-pocock-skills
description: Sets up an `## Agent skills` block in AGENTS.md/CLAUDE.md and `docs/agents/` so the engineering skills know this repo's issue tracker (GitHub or local markdown), triage label vocabulary, and domain doc layout. Run before first use of `to-issues`, `to-prd`, `triage`, `diagnose`, `tdd`, `improve-codebase-architecture`, or `zoom-out` — or if those skills appear to be missing context about the issue tracker, triage labels, or domain docs.
disable-model-invocation: true
---
# Setup Matt Pocock's Skills
Scaffold the per-repo configuration that the engineering skills assume:
- **Issue tracker** — where issues live (GitHub by default; local markdown is also supported out of the box)
- **Triage labels** — the strings used for the five canonical triage roles
- **Domain docs** — where `CONTEXT.md` and ADRs live, and the consumer rules for reading them
This is a prompt-driven skill, not a deterministic script. Explore, present what you found, confirm with the user, then write.
## Process
### 1. Explore
Look at the current repo to understand its starting state. Read whatever exists; don't assume:
- `git remote -v` and `.git/config` — is this a GitHub repo? Which one?
- `AGENTS.md` and `CLAUDE.md` at the repo root — does either exist? Is there already an `## Agent skills` section in either?
- `CONTEXT.md` and `CONTEXT-MAP.md` at the repo root
- `docs/adr/` and any `src/*/docs/adr/` directories
- `docs/agents/` — does this skill's prior output already exist?
- `.scratch/` — sign that a local-markdown issue tracker convention is already in use
### 2. Present findings and ask
Summarise what's present and what's missing. Then walk the user through the three decisions **one at a time** — present a section, get the user's answer, then move to the next. Don't dump all three at once.
Assume the user does not know what these terms mean. Each section starts with a short explainer (what it is, why these skills need it, what changes if they pick differently). Then show the choices and the default.
**Section A — Issue tracker.**
> Explainer: The "issue tracker" is where issues live for this repo. Skills like `to-issues`, `triage`, `to-prd`, and `qa` read from and write to it — they need to know whether to call `gh issue create`, write a markdown file under `.scratch/`, or follow some other workflow you describe. Pick the place you actually track work for this repo.
Default posture: these skills were designed for GitHub. If a `git remote` points at GitHub, propose that. If a `git remote` points at GitLab (`gitlab.com` or a self-hosted host), propose GitLab. Otherwise (or if the user prefers), offer:
- **GitHub** — issues live in the repo's GitHub Issues (uses the `gh` CLI)
- **GitLab** — issues live in the repo's GitLab Issues (uses the [`glab`](https://gitlab.com/gitlab-org/cli) CLI)
- **Local markdown** — issues live as files under `.scratch/<feature>/` in this repo (good for solo projects or repos without a remote)
- **Other** (Jira, Linear, etc.) — ask the user to describe the workflow in one paragraph; the skill will record it as freeform prose
**Section B — Triage label vocabulary.**
> Explainer: When the `triage` skill processes an incoming issue, it moves it through a state machine — needs evaluation, waiting on reporter, ready for an AFK agent to pick up, ready for a human, or won't fix. To do that, it needs to apply labels (or the equivalent in your issue tracker) that match strings *you've actually configured*. If your repo already uses different label names (e.g. `bug:triage` instead of `needs-triage`), map them here so the skill applies the right ones instead of creating duplicates.
The five canonical roles:
- `needs-triage` — maintainer needs to evaluate
- `needs-info` — waiting on reporter
- `ready-for-agent` — fully specified, AFK-ready (an agent can pick it up with no human context)
- `ready-for-human` — needs human implementation
- `wontfix` — will not be actioned
Default: each role's string equals its name. Ask the user if they want to override any. If their issue tracker has no existing labels, the defaults are fine.
**Section C — Domain docs.**
> Explainer: Some skills (`improve-codebase-architecture`, `diagnose`, `tdd`) read a `CONTEXT.md` file to learn the project's domain language, and `docs/adr/` for past architectural decisions. They need to know whether the repo has one global context or multiple (e.g. a monorepo with separate frontend/backend contexts) so they look in the right place.
Confirm the layout:
- **Single-context** — one `CONTEXT.md` + `docs/adr/` at the repo root. Most repos are this.
- **Multi-context** — `CONTEXT-MAP.md` at the root pointing to per-context `CONTEXT.md` files (typically a monorepo).
### 3. Confirm and edit
Show the user a draft of:
- The `## Agent skills` block to add to whichever of `CLAUDE.md` / `AGENTS.md` is being edited (see step 4 for selection rules)
- The contents of `docs/agents/issue-tracker.md`, `docs/agents/triage-labels.md`, `docs/agents/domain.md`
Let them edit before writing.
### 4. Write
**Pick the file to edit:**
- If `CLAUDE.md` exists, edit it.
- Else if `AGENTS.md` exists, edit it.
- If neither exists, ask the user which one to create — don't pick for them.
Never create `AGENTS.md` when `CLAUDE.md` already exists (or vice versa) — always edit the one that's already there.
If an `## Agent skills` block already exists in the chosen file, update its contents in-place rather than appending a duplicate. Don't overwrite user edits to the surrounding sections.
The block:
```markdown
## Agent skills
### Issue tracker
[one-line summary of where issues are tracked]. See `docs/agents/issue-tracker.md`.
### Triage labels
[one-line summary of the label vocabulary]. See `docs/agents/triage-labels.md`.
### Domain docs
[one-line summary of layout — "single-context" or "multi-context"]. See `docs/agents/domain.md`.
```
Then write the three docs files using the seed templates in this skill folder as a starting point:
- [issue-tracker-github.md](./issue-tracker-github.md) — GitHub issue tracker
- [issue-tracker-gitlab.md](./issue-tracker-gitlab.md) — GitLab issue tracker
- [issue-tracker-local.md](./issue-tracker-local.md) — local-markdown issue tracker
- [triage-labels.md](./triage-labels.md) — label mapping
- [domain.md](./domain.md) — domain doc consumer rules + layout
For "other" issue trackers, write `docs/agents/issue-tracker.md` from scratch using the user's description.
### 5. Done
Tell the user the setup is complete and which engineering skills will now read from these files. Mention they can edit `docs/agents/*.md` directly later — re-running this skill is only necessary if they want to switch issue trackers or restart from scratch.
@@ -0,0 +1,51 @@
# Domain Docs
How the engineering skills should consume this repo's domain documentation when exploring the codebase.
## Before exploring, read these
- **`CONTEXT.md`** at the repo root, or
- **`CONTEXT-MAP.md`** at the repo root if it exists — it points at one `CONTEXT.md` per context. Read each one relevant to the topic.
- **`docs/adr/`** — read ADRs that touch the area you're about to work in. In multi-context repos, also check `src/<context>/docs/adr/` for context-scoped decisions.
If any of these files don't exist, **proceed silently**. Don't flag their absence; don't suggest creating them upfront. The producer skill (`/grill-with-docs`) creates them lazily when terms or decisions actually get resolved.
## File structure
Single-context repo (most repos):
```
/
├── CONTEXT.md
├── docs/adr/
│   ├── 0001-event-sourced-orders.md
│   └── 0002-postgres-for-write-model.md
└── src/
```
Multi-context repo (presence of `CONTEXT-MAP.md` at the root):
```
/
├── CONTEXT-MAP.md
├── docs/adr/ ← system-wide decisions
└── src/
    ├── ordering/
    │   ├── CONTEXT.md
    │   └── docs/adr/ ← context-specific decisions
    └── billing/
        ├── CONTEXT.md
        └── docs/adr/
```
## Use the glossary's vocabulary
When your output names a domain concept (in an issue title, a refactor proposal, a hypothesis, a test name), use the term as defined in `CONTEXT.md`. Don't drift to synonyms the glossary explicitly avoids.
If the concept you need isn't in the glossary yet, that's a signal — either you're inventing language the project doesn't use (reconsider) or there's a real gap (note it for `/grill-with-docs`).
## Flag ADR conflicts
If your output contradicts an existing ADR, surface it explicitly rather than silently overriding:
> _Contradicts ADR-0007 (event-sourced orders) — but worth reopening because…_
@@ -0,0 +1,22 @@
# Issue tracker: GitHub
Issues and PRDs for this repo live as GitHub issues. Use the `gh` CLI for all operations.
## Conventions
- **Create an issue**: `gh issue create --title "..." --body "..."`. Use a heredoc for multi-line bodies.
- **Read an issue**: `gh issue view <number> --comments`. Add `--json labels,comments` with a `--jq` filter when you need labels or machine-readable comment bodies.
- **List issues**: `gh issue list --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
- **Comment on an issue**: `gh issue comment <number> --body "..."`
- **Apply / remove labels**: `gh issue edit <number> --add-label "..."` / `--remove-label "..."`
- **Close**: `gh issue close <number> --comment "..."`
Infer the repo from `git remote -v`; `gh` does this automatically when run inside a clone.
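The heredoc pattern the conventions above rely on, shown without actually calling `gh` (the final command is left commented out; the title and body are invented):

```bash
#!/usr/bin/env bash
# Capture a multi-line issue body via heredoc, then pass it as --body.
body=$(cat <<'EOF'
## Repro
1. Sign in
2. Click Export with an empty cart

## Expected
A disabled button, not a crash.
EOF
)

printf '%s\n' "$body"
# gh issue create --title "Export throws on empty cart" --body "$body"
```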
## When a skill says "publish to the issue tracker"
Create a GitHub issue.
## When a skill says "fetch the relevant ticket"
Run `gh issue view <number> --comments`.
@@ -0,0 +1,23 @@
# Issue tracker: GitLab
Issues and PRDs for this repo live as GitLab issues. Use the [`glab`](https://gitlab.com/gitlab-org/cli) CLI for all operations.
## Conventions
- **Create an issue**: `glab issue create --title "..." --description "..."`. Use a heredoc for multi-line descriptions. Pass `--description -` to open an editor.
- **Read an issue**: `glab issue view <number> --comments`. Use `-F json` for machine-readable output.
- **List issues**: `glab issue list --state opened -F json` with appropriate `--label` filters. Note that GitLab uses `opened` (not `open`) for the state value.
- **Comment on an issue**: `glab issue note <number> --message "..."`. GitLab calls comments "notes".
- **Apply / remove labels**: `glab issue update <number> --label "..."` / `--unlabel "..."`. Pass multiple labels comma-separated or by repeating the flag.
- **Close**: `glab issue close <number>`. `glab issue close` does not accept a closing comment, so post the explanation first with `glab issue note <number> --message "..."`, then close.
- **Merge requests**: GitLab calls PRs "merge requests". Use `glab mr create`, `glab mr view`, `glab mr note`, etc. — the same shape as `gh pr ...` with `mr` in place of `pr` and `note`/`--message` in place of `comment`/`--body`.
Infer the repo from `git remote -v`; `glab` does this automatically when run inside a clone.
## When a skill says "publish to the issue tracker"
Create a GitLab issue.
## When a skill says "fetch the relevant ticket"
Run `glab issue view <number> --comments`.
@@ -0,0 +1,19 @@
# Issue tracker: Local Markdown
Issues and PRDs for this repo live as markdown files in `.scratch/`.
## Conventions
- One feature per directory: `.scratch/<feature-slug>/`
- The PRD is `.scratch/<feature-slug>/PRD.md`
- Implementation issues are `.scratch/<feature-slug>/issues/<NN>-<slug>.md`, numbered from `01`
- Triage state is recorded as a `Status:` line near the top of each issue file (see `triage-labels.md` for the role strings)
- Comments and conversation history append to the bottom of the file under a `## Comments` heading
## When a skill says "publish to the issue tracker"
Create a new file under `.scratch/<feature-slug>/` (creating the directory if needed).
## When a skill says "fetch the relevant ticket"
Read the file at the referenced path. The user will normally pass the path or the issue number directly.
@@ -0,0 +1,15 @@
# Triage Labels
The skills speak in terms of five canonical triage roles. This file maps those roles to the actual label strings used in this repo's issue tracker.
| Label in mattpocock/skills | Label in our tracker | Meaning |
| -------------------------- | -------------------- | ---------------------------------------- |
| `needs-triage` | `needs-triage` | Maintainer needs to evaluate this issue |
| `needs-info` | `needs-info` | Waiting on reporter for more information |
| `ready-for-agent` | `ready-for-agent` | Fully specified, ready for an AFK agent |
| `ready-for-human` | `ready-for-human` | Requires human implementation |
| `wontfix` | `wontfix` | Will not be actioned |
When a skill mentions a role (e.g. "apply the AFK-ready triage label"), use the corresponding label string from this table.
Edit the right-hand column to match whatever vocabulary you actually use.
+117
View File
@@ -0,0 +1,117 @@
---
name: diagnose
description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
---
# Diagnose
A discipline for hard bugs. Skip phases only when explicitly justified.
When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching.
## Phase 1 — Build a feedback loop
**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.
Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
### Ways to construct one — try them in roughly this order
1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e.
2. **Curl / HTTP script** against a running dev server.
3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot.
4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network.
5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation.
6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call.
7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs.
10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you.
Build the right feedback loop, and the bug is 90% fixed.
### Iterate on the loop itself
Treat the loop as a product. Once you have _a_ loop, ask:
- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
### Non-deterministic bugs
The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
### When you genuinely cannot build a loop
Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop.
Do not proceed to Phase 2 until you have a loop you believe in.
## Phase 2 — Reproduce
Run the loop. Watch the bug appear.
Confirm:
- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.
Do not proceed until you reproduce the bug.
## Phase 3 — Hypothesise
Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.

Each hypothesis must be **falsifiable**: state the prediction it makes.
> Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse."
If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
## Phase 4 — Instrument
Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
Tool preference:
1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
2. **Targeted logs** at the boundaries that distinguish hypotheses.
3. Never "log everything and grep".
**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second.
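A crude wall-clock baseline is often enough to compare two commits; for anything subtler, reach for a real profiler or `hyperfine`. This sketch assumes GNU `date` with `%N`:

```shell
# Run a command N times; print min and mean wall time in ms.
time_cmd() {
  local cmd="$1" runs="${2:-5}" total=0 min=999999999 i start elapsed
  for ((i = 0; i < runs; i++)); do
    start=$(date +%s%N)
    $cmd >/dev/null 2>&1 || true
    elapsed=$(( ($(date +%s%N) - start) / 1000000 ))
    total=$((total + elapsed))
    if ((elapsed < min)); then min=$elapsed; fi
  done
  echo "min=${min}ms mean=$((total / runs))ms"
}
```

Record the number at the last-known-good commit first; every candidate fix is then judged against that baseline, not against vibes.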
## Phase 5 — Fix + regression test
Write the regression test **before the fix** — but only if there is a **correct seam** for it.
A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
If a correct seam exists:
1. Turn the minimised repro into a failing test at that seam.
2. Watch it fail.
3. Apply the fix.
4. Watch it pass.
5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
## Phase 6 — Cleanup + post-mortem
Required before declaring done:
- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
- [ ] Regression test passes (or absence of seam is documented)
- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix)
- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started.
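A minimal sketch of the instrumentation-cleanup gate from the checklist above (the directory and tag prefix are illustrative):

```shell
# Fail if any tagged debug instrumentation survived the cleanup.
check_clean() {
  local dir="$1"
  if grep -rn "DEBUG-" "$dir" 2>/dev/null; then
    echo "tagged debug logs remain" >&2
    return 1
  fi
  echo "clean"
}
```

Run it as the last step before declaring done, e.g. `check_clean src/`.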
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
# Human-in-the-loop reproduction loop.
# Copy this file, edit the steps below, and run it.
# The agent runs the script; the user follows prompts in their terminal.
#
# Usage:
# bash hitl-loop.template.sh
#
# Two helpers:
# step "<instruction>" → show instruction, wait for Enter
# capture VAR "<question>" → show question, read response into VAR
#
# At the end, captured values are printed as KEY=VALUE for the agent to parse.
set -euo pipefail
step() {
printf '\n>>> %s\n' "$1"
read -r -p " [Enter when done] " _
}
capture() {
local var="$1" question="$2" answer
printf '\n>>> %s\n' "$question"
read -r -p " > " answer
printf -v "$var" '%s' "$answer"
}
# --- edit below ---------------------------------------------------------
step "Open the app at http://localhost:3000 and sign in."
capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"
capture ERROR_MSG "Paste the error message (or 'none'):"
# --- edit above ---------------------------------------------------------
printf '\n--- Captured ---\n'
printf 'ERRORED=%s\n' "$ERRORED"
printf 'ERROR_MSG=%s\n' "$ERROR_MSG"
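On the agent side, the `KEY=VALUE` trailer is trivial to parse. A sketch (key names match the template above):

```shell
# Extract one captured value from the script's output.
capture_value() {
  sed -n "s/^$1=//p"
}
```

Usage: pipe the script's output in, e.g. `errored=$(bash hitl-loop.sh | tee run.log | capture_value ERRORED)`.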
@@ -0,0 +1,47 @@
# ADR Format
ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc.
Create the `docs/adr/` directory lazily — only when the first ADR is needed.
## Template
```md
# {Short title of the decision}
{1-3 sentences: what's the context, what did we decide, and why.}
```
That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections.
## Optional sections
Only include these when they add genuine value. Most ADRs won't need them.
- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited
- **Considered Options** — only when the rejected alternatives are worth remembering
- **Consequences** — only when non-obvious downstream effects need to be called out
## Numbering
Scan `docs/adr/` for the highest existing number and increment by one.
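A sketch of that scan, assuming the four-digit `NNNN-slug.md` convention; prints `0001` when the directory is empty or missing:

```shell
# Next ADR number, zero-padded. The 10# prefix avoids octal parsing of 0009.
next_adr() {
  local dir="${1:-docs/adr}" last
  last=$(ls "$dir" 2>/dev/null | grep -oE '^[0-9]{4}' | sort -n | tail -1)
  printf '%04d\n' "$(( 10#${last:-0} + 1 ))"
}
```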
## When to offer an ADR
All three of these must be true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing."
### What qualifies
- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres."
- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP."
- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out.
- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit nos are as valuable as the yeses.
- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate.
- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract."
- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months.
@@ -0,0 +1,77 @@
# CONTEXT.md Format
## Structure
```md
# {Context Name}
{One or two sentence description of what this context is and why it exists.}
## Language
**Order**:
{A concise description of the term}
_Avoid_: Purchase, transaction
**Invoice**:
A request for payment sent to a customer after delivery.
_Avoid_: Bill, payment request
**Customer**:
A person or organization that places orders.
_Avoid_: Client, buyer, account
## Relationships
- An **Order** produces one or more **Invoices**
- An **Invoice** belongs to exactly one **Customer**
## Example dialogue
> **Dev:** "When a **Customer** places an **Order**, do we create the **Invoice** immediately?"
> **Domain expert:** "No — an **Invoice** is only generated once a **Fulfillment** is confirmed."
## Flagged ambiguities
- "account" was used to mean both **Customer** and **User** — resolved: these are distinct concepts.
```
## Rules
- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others as aliases to avoid.
- **Flag conflicts explicitly.** If a term is used ambiguously, call it out in "Flagged ambiguities" with a clear resolution.
- **Keep definitions tight.** One sentence max. Define what it IS, not what it does.
- **Show relationships.** Use bold term names and express cardinality where obvious.
- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs.
- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine.
- **Write an example dialogue.** A conversation between a dev and a domain expert that demonstrates how the terms interact naturally and clarifies boundaries between related concepts.
## Single vs multi-context repos
**Single context (most repos):** One `CONTEXT.md` at the repo root.
**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other:
```md
# Context Map
## Contexts
- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders
- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments
- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping
## Relationships
- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking
- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices
- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money`
```
The skill infers which structure applies:
- If `CONTEXT-MAP.md` exists, read it to find contexts
- If only a root `CONTEXT.md` exists, single context
- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved
When multiple contexts exist, infer which one the current topic relates to. If unclear, ask.
@@ -0,0 +1,88 @@
---
name: grill-with-docs
description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions.
---
<what-to-do>
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
Ask the questions one at a time, waiting for feedback on each question before continuing.
If a question can be answered by exploring the codebase, explore the codebase instead.
</what-to-do>
<supporting-info>
## Domain awareness
During codebase exploration, also look for existing documentation:
### File structure
Most repos have a single context:
```
/
├── CONTEXT.md
├── docs/
│ └── adr/
│ ├── 0001-event-sourced-orders.md
│ └── 0002-postgres-for-write-model.md
└── src/
```
If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives:
```
/
├── CONTEXT-MAP.md
├── docs/
│ └── adr/ ← system-wide decisions
├── src/
│ ├── ordering/
│ │ ├── CONTEXT.md
│ │ └── docs/adr/ ← context-specific decisions
│ └── billing/
│ ├── CONTEXT.md
│ └── docs/adr/
```
Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed.
## During the session
### Challenge against the glossary
When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?"
### Sharpen fuzzy language
When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things."
### Discuss concrete scenarios
When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts.
### Cross-reference with code
When the user states how something works, check whether the code agrees. If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?"
### Update CONTEXT.md inline
When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md).
Don't couple `CONTEXT.md` to implementation details. Only include terms that are meaningful to domain experts.
### Offer ADRs sparingly
Only offer to create an ADR when all three are true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will wonder "why did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md).
</supporting-info>
@@ -0,0 +1,121 @@
---
name: setup-matt-pocock-skills
description: Sets up an `## Agent skills` block in AGENTS.md/CLAUDE.md and `docs/agents/` so the engineering skills know this repo's issue tracker (GitHub or local markdown), triage label vocabulary, and domain doc layout. Run before first use of `to-issues`, `to-prd`, `triage`, `diagnose`, `tdd`, `improve-codebase-architecture`, or `zoom-out` — or if those skills appear to be missing context about the issue tracker, triage labels, or domain docs.
disable-model-invocation: true
---
# Setup Matt Pocock's Skills
Scaffold the per-repo configuration that the engineering skills assume:
- **Issue tracker** — where issues live (GitHub by default; local markdown is also supported out of the box)
- **Triage labels** — the strings used for the five canonical triage roles
- **Domain docs** — where `CONTEXT.md` and ADRs live, and the consumer rules for reading them
This is a prompt-driven skill, not a deterministic script. Explore, present what you found, confirm with the user, then write.
## Process
### 1. Explore
Look at the current repo to understand its starting state. Read whatever exists; don't assume:
- `git remote -v` and `.git/config` — is this a GitHub repo? Which one?
- `AGENTS.md` and `CLAUDE.md` at the repo root — does either exist? Is there already an `## Agent skills` section in either?
- `CONTEXT.md` and `CONTEXT-MAP.md` at the repo root
- `docs/adr/` and any `src/*/docs/adr/` directories
- `docs/agents/` — does this skill's prior output already exist?
- `.scratch/` — sign that a local-markdown issue tracker convention is already in use
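A read-only probe covering the checklist above might look like this (only the files and directories listed are checked; nothing is written):

```shell
# Report the repo's starting state, one line per finding.
probe_repo() {
  git remote -v 2>/dev/null | head -2 || true
  local f d
  for f in AGENTS.md CLAUDE.md CONTEXT.md CONTEXT-MAP.md; do
    [ -e "$f" ] && echo "found: $f" || true
  done
  for d in docs/adr docs/agents .scratch; do
    [ -d "$d" ] && echo "found: $d/" || true
  done
}
```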
### 2. Present findings and ask
Summarise what's present and what's missing. Then walk the user through the three decisions **one at a time** — present a section, get the user's answer, then move to the next. Don't dump all three at once.
Assume the user does not know what these terms mean. Each section starts with a short explainer (what it is, why these skills need it, what changes if they pick differently). Then show the choices and the default.
**Section A — Issue tracker.**
> Explainer: The "issue tracker" is where issues live for this repo. Skills like `to-issues`, `triage`, `to-prd`, and `qa` read from and write to it — they need to know whether to call `gh issue create`, write a markdown file under `.scratch/`, or follow some other workflow you describe. Pick the place you actually track work for this repo.
Default posture: these skills were designed for GitHub. If a `git remote` points at GitHub, propose that. If a `git remote` points at GitLab (`gitlab.com` or a self-hosted host), propose GitLab. Otherwise (or if the user prefers), offer:
- **GitHub** — issues live in the repo's GitHub Issues (uses the `gh` CLI)
- **GitLab** — issues live in the repo's GitLab Issues (uses the [`glab`](https://gitlab.com/gitlab-org/cli) CLI)
- **Local markdown** — issues live as files under `.scratch/<feature>/` in this repo (good for solo projects or repos without a remote)
- **Other** (Jira, Linear, etc.) — ask the user to describe the workflow in one paragraph; the skill will record it as freeform prose
**Section B — Triage label vocabulary.**
> Explainer: When the `triage` skill processes an incoming issue, it moves it through a state machine — needs evaluation, waiting on reporter, ready for an AFK agent to pick up, ready for a human, or won't fix. To do that, it needs to apply labels (or the equivalent in your issue tracker) that match strings *you've actually configured*. If your repo already uses different label names (e.g. `bug:triage` instead of `needs-triage`), map them here so the skill applies the right ones instead of creating duplicates.
The five canonical roles:
- `needs-triage` — maintainer needs to evaluate
- `needs-info` — waiting on reporter
- `ready-for-agent` — fully specified, AFK-ready (an agent can pick it up with no human context)
- `ready-for-human` — needs human implementation
- `wontfix` — will not be actioned
Default: each role's string equals its name. Ask the user if they want to override any. If their issue tracker has no existing labels, the defaults are fine.
**Section C — Domain docs.**
> Explainer: Some skills (`improve-codebase-architecture`, `diagnose`, `tdd`) read a `CONTEXT.md` file to learn the project's domain language, and `docs/adr/` for past architectural decisions. They need to know whether the repo has one global context or multiple (e.g. a monorepo with separate frontend/backend contexts) so they look in the right place.
Confirm the layout:
- **Single-context** — one `CONTEXT.md` + `docs/adr/` at the repo root. Most repos are this.
- **Multi-context** — `CONTEXT-MAP.md` at the root pointing to per-context `CONTEXT.md` files (typically a monorepo).
### 3. Confirm and edit
Show the user a draft of:
- The `## Agent skills` block to add to whichever of `CLAUDE.md` / `AGENTS.md` is being edited (see step 4 for selection rules)
- The contents of `docs/agents/issue-tracker.md`, `docs/agents/triage-labels.md`, `docs/agents/domain.md`
Let them edit before writing.
### 4. Write
**Pick the file to edit:**
- If `CLAUDE.md` exists, edit it.
- Else if `AGENTS.md` exists, edit it.
- If neither exists, ask the user which one to create — don't pick for them.
Never create `AGENTS.md` when `CLAUDE.md` already exists (or vice versa) — always edit the one that's already there.
If an `## Agent skills` block already exists in the chosen file, update its contents in-place rather than appending a duplicate. Don't overwrite user edits to the surrounding sections.
The block:
```markdown
## Agent skills
### Issue tracker
[one-line summary of where issues are tracked]. See `docs/agents/issue-tracker.md`.
### Triage labels
[one-line summary of the label vocabulary]. See `docs/agents/triage-labels.md`.
### Domain docs
[one-line summary of layout — "single-context" or "multi-context"]. See `docs/agents/domain.md`.
```
Then write the three docs files using the seed templates in this skill folder as a starting point:
- [issue-tracker-github.md](./issue-tracker-github.md) — GitHub issue tracker
- [issue-tracker-gitlab.md](./issue-tracker-gitlab.md) — GitLab issue tracker
- [issue-tracker-local.md](./issue-tracker-local.md) — local-markdown issue tracker
- [triage-labels.md](./triage-labels.md) — label mapping
- [domain.md](./domain.md) — domain doc consumer rules + layout
For "other" issue trackers, write `docs/agents/issue-tracker.md` from scratch using the user's description.
### 5. Done
Tell the user the setup is complete and which engineering skills will now read from these files. Mention they can edit `docs/agents/*.md` directly later — re-running this skill is only necessary if they want to switch issue trackers or restart from scratch.
@@ -0,0 +1,51 @@
# Domain Docs
How the engineering skills should consume this repo's domain documentation when exploring the codebase.
## Before exploring, read these
- **`CONTEXT.md`** at the repo root, or
- **`CONTEXT-MAP.md`** at the repo root if it exists — it points at one `CONTEXT.md` per context. Read each one relevant to the topic.
- **`docs/adr/`** — read ADRs that touch the area you're about to work in. In multi-context repos, also check `src/<context>/docs/adr/` for context-scoped decisions.
If any of these files don't exist, **proceed silently**. Don't flag their absence; don't suggest creating them upfront. The producer skill (`/grill-with-docs`) creates them lazily when terms or decisions actually get resolved.
## File structure
Single-context repo (most repos):
```
/
├── CONTEXT.md
├── docs/adr/
│ ├── 0001-event-sourced-orders.md
│ └── 0002-postgres-for-write-model.md
└── src/
```
Multi-context repo (presence of `CONTEXT-MAP.md` at the root):
```
/
├── CONTEXT-MAP.md
├── docs/adr/ ← system-wide decisions
└── src/
├── ordering/
│ ├── CONTEXT.md
│ └── docs/adr/ ← context-specific decisions
└── billing/
├── CONTEXT.md
└── docs/adr/
```
## Use the glossary's vocabulary
When your output names a domain concept (in an issue title, a refactor proposal, a hypothesis, a test name), use the term as defined in `CONTEXT.md`. Don't drift to synonyms the glossary explicitly avoids.
If the concept you need isn't in the glossary yet, that's a signal — either you're inventing language the project doesn't use (reconsider) or there's a real gap (note it for `/grill-with-docs`).
## Flag ADR conflicts
If your output contradicts an existing ADR, surface it explicitly rather than silently overriding:
> _Contradicts ADR-0007 (event-sourced orders) — but worth reopening because…_
@@ -0,0 +1,22 @@
# Issue tracker: GitHub
Issues and PRDs for this repo live as GitHub issues. Use the `gh` CLI for all operations.
## Conventions
- **Create an issue**: `gh issue create --title "..." --body "..."`. Use a heredoc for multi-line bodies.
- **Read an issue**: `gh issue view <number> --comments`; add `--json comments,labels` with a `--jq` filter when you need machine-readable comments or labels.
- **List issues**: `gh issue list --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
- **Comment on an issue**: `gh issue comment <number> --body "..."`
- **Apply / remove labels**: `gh issue edit <number> --add-label "..."` / `--remove-label "..."`
- **Close**: `gh issue close <number> --comment "..."`
Infer the repo from `git remote -v`; `gh` does this automatically when run inside a clone.
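The heredoc pattern for multi-line bodies, as a minimal sketch (title and body text are illustrative; the live `gh` call is commented out so the sketch is side-effect-free):

```shell
# Quoted delimiter ('EOF') stops variable expansion inside the body.
body=$(cat <<'EOF'
## Symptom
Export fails with a TypeError.

## Repro
1. Sign in at /login
2. Click "Export"
EOF
)
# gh issue create --title "Export fails with TypeError" --body "$body"
echo "$body"
```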
## When a skill says "publish to the issue tracker"
Create a GitHub issue.
## When a skill says "fetch the relevant ticket"
Run `gh issue view <number> --comments`.
@@ -0,0 +1,23 @@
# Issue tracker: GitLab
Issues and PRDs for this repo live as GitLab issues. Use the [`glab`](https://gitlab.com/gitlab-org/cli) CLI for all operations.
## Conventions
- **Create an issue**: `glab issue create --title "..." --description "..."`. Use a heredoc for multi-line descriptions. Pass `--description -` to open an editor.
- **Read an issue**: `glab issue view <number> --comments`. Use `-F json` for machine-readable output.
- **List issues**: `glab issue list --state opened -F json` with appropriate `--label` filters. Note that GitLab uses `opened` (not `open`) for the state value.
- **Comment on an issue**: `glab issue note <number> --message "..."`. GitLab calls comments "notes".
- **Apply / remove labels**: `glab issue update <number> --label "..."` / `--unlabel "..."`. Multiple labels can be passed comma-separated or by repeating the flag.
- **Close**: `glab issue close <number>`. `glab issue close` does not accept a closing comment, so post the explanation first with `glab issue note <number> --message "..."`, then close.
- **Merge requests**: GitLab calls PRs "merge requests". Use `glab mr create`, `glab mr view`, `glab mr note`, etc. — the same shape as `gh pr ...` with `mr` in place of `pr` and `note`/`--message` in place of `comment`/`--body`.
Infer the repo from `git remote -v`; `glab` does this automatically when run inside a clone.
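The note-then-close sequence from the **Close** bullet, wrapped as a helper (issue number and message are illustrative):

```shell
# glab has no close-with-comment, so the note and the close are two calls.
close_with_note() {
  local issue="$1" msg="$2"
  glab issue note "$issue" --message "$msg"
  glab issue close "$issue"
}
```

Usage: `close_with_note 42 "Fixed by !17; root cause was a stale cache key."`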
## When a skill says "publish to the issue tracker"
Create a GitLab issue.
## When a skill says "fetch the relevant ticket"
Run `glab issue view <number> --comments`.
@@ -0,0 +1,19 @@
# Issue tracker: Local Markdown
Issues and PRDs for this repo live as markdown files in `.scratch/`.
## Conventions
- One feature per directory: `.scratch/<feature-slug>/`
- The PRD is `.scratch/<feature-slug>/PRD.md`
- Implementation issues are `.scratch/<feature-slug>/issues/<NN>-<slug>.md`, numbered from `01`
- Triage state is recorded as a `Status:` line near the top of each issue file (see `triage-labels.md` for the role strings)
- Comments and conversation history append to the bottom of the file under a `## Comments` heading
## When a skill says "publish to the issue tracker"
Create a new file under `.scratch/<feature-slug>/` (creating the directory if needed).
## When a skill says "fetch the relevant ticket"
Read the file at the referenced path. The user will normally pass the path or the issue number directly.
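A sketch of "publish to the issue tracker" under these conventions (feature slug, issue number, and title are illustrative):

```shell
# Scaffold one issue file with the Status line and Comments heading.
new_issue() {
  local feature="$1" num="$2" slug="$3" title="$4"
  local dir=".scratch/$feature/issues"
  mkdir -p "$dir"
  cat > "$dir/$num-$slug.md" <<EOF
# $title

Status: needs-triage

## Comments
EOF
  echo "$dir/$num-$slug.md"
}
```

Usage: `new_issue export-fix 01 fix-export "Fix the export button"` prints the path it created.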
@@ -0,0 +1,15 @@
# Triage Labels
The skills speak in terms of five canonical triage roles. This file maps those roles to the actual label strings used in this repo's issue tracker.
| Label in mattpocock/skills | Label in our tracker | Meaning |
| -------------------------- | -------------------- | ---------------------------------------- |
| `needs-triage` | `needs-triage` | Maintainer needs to evaluate this issue |
| `needs-info` | `needs-info` | Waiting on reporter for more information |
| `ready-for-agent` | `ready-for-agent` | Fully specified, ready for an AFK agent |
| `ready-for-human` | `ready-for-human` | Requires human implementation |
| `wontfix` | `wontfix` | Will not be actioned |
When a skill mentions a role (e.g. "apply the AFK-ready triage label"), use the corresponding label string from this table.
Edit the right-hand column to match whatever vocabulary you actually use.
+117
View File
@@ -0,0 +1,117 @@
---
name: diagnose
description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
---
# Diagnose
A discipline for hard bugs. Skip phases only when explicitly justified.
When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching.
## Phase 1 — Build a feedback loop
**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.
Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
### Ways to construct one — try them in roughly this order
1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e.
2. **Curl / HTTP script** against a running dev server.
3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot.
4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network.
5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation.
6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call.
7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs.
10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you.
Build the right feedback loop, and the bug is 90% fixed.
### Iterate on the loop itself
Treat the loop as a product. Once you have _a_ loop, ask:
- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
### Non-deterministic bugs
The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
### When you genuinely cannot build a loop
Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop.
Do not proceed to Phase 2 until you have a loop you believe in.
## Phase 2 — Reproduce
Run the loop. Watch the bug appear.
Confirm:
- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.
Do not proceed until you reproduce the bug.
## Phase 3 — Hypothesise
Generate **35 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
Each hypothesis must be **falsifiable**: state the prediction it makes.
> Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse."
If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
## Phase 4 — Instrument
Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
Tool preference:
1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
2. **Targeted logs** at the boundaries that distinguish hypotheses.
3. Never "log everything and grep".
**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
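A minimal sketch of why the tag matters: with a unique prefix, discovery and removal are each a single command. The file name and log line below are invented for illustration:

```shell
# set up a throwaway file containing one tagged debug probe
tmp=$(mktemp -d)
cat > "$tmp/service.js" <<'EOF'
function handler(req) {
  console.log('[DEBUG-a4f2] entering handler', req.id);
  return process(req);
}
EOF

grep -rn 'DEBUG-a4f2' "$tmp"                  # one grep finds every probe
sed -i.bak '/DEBUG-a4f2/d' "$tmp/service.js"  # one sed deletes them in place
rm "$tmp/service.js.bak"
```

(`-i.bak` is used so the in-place edit works on both GNU and BSD `sed`.)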
**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second.
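A rough shell sketch of the baseline-first discipline (assumes GNU `date` with `%N` nanosecond support; `sleep 0.2` stands in for the slow code path under investigation):

```shell
# time a command in milliseconds; record the baseline before touching anything
measure_ms() {
  start=$(date +%s%N)
  "$@" > /dev/null
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

baseline=$(measure_ms sleep 0.2)
echo "baseline: ${baseline}ms"
# ...apply the candidate fix, measure again, and compare against the baseline
```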
## Phase 5 — Fix + regression test
Write the regression test **before the fix** — but only if there is a **correct seam** for it.
A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
If a correct seam exists:
1. Turn the minimised repro into a failing test at that seam.
2. Watch it fail.
3. Apply the fix.
4. Watch it pass.
5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
## Phase 6 — Cleanup + post-mortem
Required before declaring done:
- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
- [ ] Regression test passes (or absence of seam is documented)
- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix)
- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started.
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
# Human-in-the-loop reproduction loop.
# Copy this file, edit the steps below, and run it.
# The agent runs the script; the user follows prompts in their terminal.
#
# Usage:
# bash hitl-loop.template.sh
#
# Two helpers:
# step "<instruction>" → show instruction, wait for Enter
# capture VAR "<question>" → show question, read response into VAR
#
# At the end, captured values are printed as KEY=VALUE for the agent to parse.
set -euo pipefail
step() {
printf '\n>>> %s\n' "$1"
read -r -p " [Enter when done] " _
}
capture() {
local var="$1" question="$2" answer
printf '\n>>> %s\n' "$question"
read -r -p " > " answer
printf -v "$var" '%s' "$answer"
}
# --- edit below ---------------------------------------------------------
step "Open the app at http://localhost:3000 and sign in."
capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"
capture ERROR_MSG "Paste the error message (or 'none'):"
# --- edit above ---------------------------------------------------------
printf '\n--- Captured ---\n'
printf 'ERRORED=%s\n' "$ERRORED"
printf 'ERROR_MSG=%s\n' "$ERROR_MSG"
@@ -0,0 +1,47 @@
# ADR Format
ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc.
Create the `docs/adr/` directory lazily — only when the first ADR is needed.
## Template
```md
# {Short title of the decision}
{1-3 sentences: what's the context, what did we decide, and why.}
```
That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections.
## Optional sections
Only include these when they add genuine value. Most ADRs won't need them.
- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited
- **Considered Options** — only when the rejected alternatives are worth remembering
- **Consequences** — only when non-obvious downstream effects need to be called out
## Numbering
Scan `docs/adr/` for the highest existing number and increment by one.
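That scan can be sketched in shell; the ADR filenames below are invented examples, and the leading zeros are stripped before the arithmetic so they are not read as octal:

```shell
# derive the next ADR number from the files already present
tmp=$(mktemp -d)
mkdir -p "$tmp/docs/adr"
touch "$tmp/docs/adr/0001-event-sourced-orders.md" \
      "$tmp/docs/adr/0002-postgres-for-write-model.md"

# extract the numeric prefixes, take the highest, drop leading zeros
highest=$(ls "$tmp/docs/adr" | sed -n 's/^\([0-9][0-9]*\)-.*/\1/p' \
          | sort -n | tail -n 1 | sed 's/^0*//')
next=$(printf '%04d' $(( ${highest:-0} + 1 )))
echo "next ADR: $next"
```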
## When to offer an ADR
All three of these must be true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing."
### What qualifies
- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres."
- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP."
- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out.
- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit "no"s are as valuable as the "yes"es.
- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate.
- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract."
- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months.
@@ -0,0 +1,77 @@
# CONTEXT.md Format
## Structure
```md
# {Context Name}
{One or two sentence description of what this context is and why it exists.}
## Language
**Order**:
{A concise description of the term}
_Avoid_: Purchase, transaction
**Invoice**:
A request for payment sent to a customer after delivery.
_Avoid_: Bill, payment request
**Customer**:
A person or organization that places orders.
_Avoid_: Client, buyer, account
## Relationships
- An **Order** produces one or more **Invoices**
- An **Invoice** belongs to exactly one **Customer**
## Example dialogue
> **Dev:** "When a **Customer** places an **Order**, do we create the **Invoice** immediately?"
> **Domain expert:** "No — an **Invoice** is only generated once a **Fulfillment** is confirmed."
## Flagged ambiguities
- "account" was used to mean both **Customer** and **User** — resolved: these are distinct concepts.
```
## Rules
- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others as aliases to avoid.
- **Flag conflicts explicitly.** If a term is used ambiguously, call it out in "Flagged ambiguities" with a clear resolution.
- **Keep definitions tight.** One sentence max. Define what it IS, not what it does.
- **Show relationships.** Use bold term names and express cardinality where obvious.
- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs.
- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine.
- **Write an example dialogue.** A conversation between a dev and a domain expert that demonstrates how the terms interact naturally and clarifies boundaries between related concepts.
## Single vs multi-context repos
**Single context (most repos):** One `CONTEXT.md` at the repo root.
**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other:
```md
# Context Map
## Contexts
- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders
- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments
- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping
## Relationships
- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking
- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices
- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money`
```
The skill infers which structure applies:
- If `CONTEXT-MAP.md` exists, read it to find contexts
- If only a root `CONTEXT.md` exists, single context
- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved
When multiple contexts exist, infer which one the current topic relates to. If unclear, ask.
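The inference rule above can be sketched as a small shell function (the `multi` / `single` / `none` values are illustrative, not a required interface):

```shell
# decide which documentation structure a repo uses
detect_layout() {
  if [ -f "$1/CONTEXT-MAP.md" ]; then
    echo multi     # map file wins: read it to find the contexts
  elif [ -f "$1/CONTEXT.md" ]; then
    echo single    # one root context
  else
    echo none      # create a root CONTEXT.md lazily later
  fi
}

tmp=$(mktemp -d)
detect_layout "$tmp"        # prints: none
touch "$tmp/CONTEXT.md"
detect_layout "$tmp"        # prints: single
```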
@@ -0,0 +1,88 @@
---
name: grill-with-docs
description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions.
---
<what-to-do>
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
Ask the questions one at a time, waiting for feedback on each question before continuing.
If a question can be answered by exploring the codebase, explore the codebase instead.
</what-to-do>
<supporting-info>
## Domain awareness
During codebase exploration, also look for existing documentation:
### File structure
Most repos have a single context:
```
/
├── CONTEXT.md
├── docs/
│ └── adr/
│ ├── 0001-event-sourced-orders.md
│ └── 0002-postgres-for-write-model.md
└── src/
```
If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives:
```
/
├── CONTEXT-MAP.md
├── docs/
│ └── adr/ ← system-wide decisions
├── src/
│ ├── ordering/
│ │ ├── CONTEXT.md
│ │ └── docs/adr/ ← context-specific decisions
│ └── billing/
│ ├── CONTEXT.md
│ └── docs/adr/
```
Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed.
## During the session
### Challenge against the glossary
When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?"
### Sharpen fuzzy language
When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things."
### Discuss concrete scenarios
When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts.
### Cross-reference with code
When the user states how something works, check whether the code agrees. If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?"
### Update CONTEXT.md inline
When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md).
Don't couple `CONTEXT.md` to implementation details. Only include terms that are meaningful to domain experts.
### Offer ADRs sparingly
Only offer to create an ADR when all three are true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will wonder "why did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md).
</supporting-info>
@@ -0,0 +1,121 @@
---
name: setup-matt-pocock-skills
description: Sets up an `## Agent skills` block in AGENTS.md/CLAUDE.md and `docs/agents/` so the engineering skills know this repo's issue tracker (GitHub or local markdown), triage label vocabulary, and domain doc layout. Run before first use of `to-issues`, `to-prd`, `triage`, `diagnose`, `tdd`, `improve-codebase-architecture`, or `zoom-out` — or if those skills appear to be missing context about the issue tracker, triage labels, or domain docs.
disable-model-invocation: true
---
# Setup Matt Pocock's Skills
Scaffold the per-repo configuration that the engineering skills assume:
- **Issue tracker** — where issues live (GitHub by default; local markdown is also supported out of the box)
- **Triage labels** — the strings used for the five canonical triage roles
- **Domain docs** — where `CONTEXT.md` and ADRs live, and the consumer rules for reading them
This is a prompt-driven skill, not a deterministic script. Explore, present what you found, confirm with the user, then write.
## Process
### 1. Explore
Look at the current repo to understand its starting state. Read whatever exists; don't assume:
- `git remote -v` and `.git/config` — is this a GitHub repo? Which one?
- `AGENTS.md` and `CLAUDE.md` at the repo root — does either exist? Is there already an `## Agent skills` section in either?
- `CONTEXT.md` and `CONTEXT-MAP.md` at the repo root
- `docs/adr/` and any `src/*/docs/adr/` directories
- `docs/agents/` — does this skill's prior output already exist?
- `.scratch/` — sign that a local-markdown issue tracker convention is already in use
### 2. Present findings and ask
Summarise what's present and what's missing. Then walk the user through the three decisions **one at a time** — present a section, get the user's answer, then move to the next. Don't dump all three at once.
Assume the user does not know what these terms mean. Each section starts with a short explainer (what it is, why these skills need it, what changes if they pick differently). Then show the choices and the default.
**Section A — Issue tracker.**
> Explainer: The "issue tracker" is where issues live for this repo. Skills like `to-issues`, `triage`, `to-prd`, and `qa` read from and write to it — they need to know whether to call `gh issue create`, write a markdown file under `.scratch/`, or follow some other workflow you describe. Pick the place you actually track work for this repo.
Default posture: these skills were designed for GitHub. If a `git remote` points at GitHub, propose that. If a `git remote` points at GitLab (`gitlab.com` or a self-hosted host), propose GitLab. Otherwise (or if the user prefers), offer:
- **GitHub** — issues live in the repo's GitHub Issues (uses the `gh` CLI)
- **GitLab** — issues live in the repo's GitLab Issues (uses the [`glab`](https://gitlab.com/gitlab-org/cli) CLI)
- **Local markdown** — issues live as files under `.scratch/<feature>/` in this repo (good for solo projects or repos without a remote)
- **Other** (Jira, Linear, etc.) — ask the user to describe the workflow in one paragraph; the skill will record it as freeform prose
**Section B — Triage label vocabulary.**
> Explainer: When the `triage` skill processes an incoming issue, it moves it through a state machine — needs evaluation, waiting on reporter, ready for an AFK agent to pick up, ready for a human, or won't fix. To do that, it needs to apply labels (or the equivalent in your issue tracker) that match strings *you've actually configured*. If your repo already uses different label names (e.g. `bug:triage` instead of `needs-triage`), map them here so the skill applies the right ones instead of creating duplicates.
The five canonical roles:
- `needs-triage` — maintainer needs to evaluate
- `needs-info` — waiting on reporter
- `ready-for-agent` — fully specified, AFK-ready (an agent can pick it up with no human context)
- `ready-for-human` — needs human implementation
- `wontfix` — will not be actioned
Default: each role's string equals its name. Ask the user if they want to override any. If their issue tracker has no existing labels, the defaults are fine.
**Section C — Domain docs.**
> Explainer: Some skills (`improve-codebase-architecture`, `diagnose`, `tdd`) read a `CONTEXT.md` file to learn the project's domain language, and `docs/adr/` for past architectural decisions. They need to know whether the repo has one global context or multiple (e.g. a monorepo with separate frontend/backend contexts) so they look in the right place.
Confirm the layout:
- **Single-context** — one `CONTEXT.md` + `docs/adr/` at the repo root. Most repos are this.
- **Multi-context** — `CONTEXT-MAP.md` at the root pointing to per-context `CONTEXT.md` files (typically a monorepo).
### 3. Confirm and edit
Show the user a draft of:
- The `## Agent skills` block to add to whichever of `CLAUDE.md` / `AGENTS.md` is being edited (see step 4 for selection rules)
- The contents of `docs/agents/issue-tracker.md`, `docs/agents/triage-labels.md`, `docs/agents/domain.md`
Let them edit before writing.
### 4. Write
**Pick the file to edit:**
- If `CLAUDE.md` exists, edit it.
- Else if `AGENTS.md` exists, edit it.
- If neither exists, ask the user which one to create — don't pick for them.
Never create `AGENTS.md` when `CLAUDE.md` already exists (or vice versa) — always edit the one that's already there.
If an `## Agent skills` block already exists in the chosen file, update its contents in-place rather than appending a duplicate. Don't overwrite user edits to the surrounding sections.
The block:
```markdown
## Agent skills
### Issue tracker
[one-line summary of where issues are tracked]. See `docs/agents/issue-tracker.md`.
### Triage labels
[one-line summary of the label vocabulary]. See `docs/agents/triage-labels.md`.
### Domain docs
[one-line summary of layout — "single-context" or "multi-context"]. See `docs/agents/domain.md`.
```
Then write the three docs files using the seed templates in this skill folder as a starting point:
- [issue-tracker-github.md](./issue-tracker-github.md) — GitHub issue tracker
- [issue-tracker-gitlab.md](./issue-tracker-gitlab.md) — GitLab issue tracker
- [issue-tracker-local.md](./issue-tracker-local.md) — local-markdown issue tracker
- [triage-labels.md](./triage-labels.md) — label mapping
- [domain.md](./domain.md) — domain doc consumer rules + layout
For "other" issue trackers, write `docs/agents/issue-tracker.md` from scratch using the user's description.
### 5. Done
Tell the user the setup is complete and which engineering skills will now read from these files. Mention they can edit `docs/agents/*.md` directly later — re-running this skill is only necessary if they want to switch issue trackers or restart from scratch.
@@ -0,0 +1,51 @@
# Domain Docs
How the engineering skills should consume this repo's domain documentation when exploring the codebase.
## Before exploring, read these
- **`CONTEXT.md`** at the repo root, or
- **`CONTEXT-MAP.md`** at the repo root if it exists — it points at one `CONTEXT.md` per context. Read each one relevant to the topic.
- **`docs/adr/`** — read ADRs that touch the area you're about to work in. In multi-context repos, also check `src/<context>/docs/adr/` for context-scoped decisions.
If any of these files don't exist, **proceed silently**. Don't flag their absence; don't suggest creating them upfront. The producer skill (`/grill-with-docs`) creates them lazily when terms or decisions actually get resolved.
## File structure
Single-context repo (most repos):
```
/
├── CONTEXT.md
├── docs/adr/
│ ├── 0001-event-sourced-orders.md
│ └── 0002-postgres-for-write-model.md
└── src/
```
Multi-context repo (presence of `CONTEXT-MAP.md` at the root):
```
/
├── CONTEXT-MAP.md
├── docs/adr/ ← system-wide decisions
└── src/
├── ordering/
│ ├── CONTEXT.md
│ └── docs/adr/ ← context-specific decisions
└── billing/
├── CONTEXT.md
└── docs/adr/
```
## Use the glossary's vocabulary
When your output names a domain concept (in an issue title, a refactor proposal, a hypothesis, a test name), use the term as defined in `CONTEXT.md`. Don't drift to synonyms the glossary explicitly avoids.
If the concept you need isn't in the glossary yet, that's a signal — either you're inventing language the project doesn't use (reconsider) or there's a real gap (note it for `/grill-with-docs`).
## Flag ADR conflicts
If your output contradicts an existing ADR, surface it explicitly rather than silently overriding:
> _Contradicts ADR-0007 (event-sourced orders) — but worth reopening because…_
@@ -0,0 +1,22 @@
# Issue tracker: GitHub
Issues and PRDs for this repo live as GitHub issues. Use the `gh` CLI for all operations.
## Conventions
- **Create an issue**: `gh issue create --title "..." --body "..."`. Use a heredoc for multi-line bodies.
- **Read an issue**: `gh issue view <number> --comments`; add `--json` fields and `jq` when you need labels or other structured output.
- **List issues**: `gh issue list --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
- **Comment on an issue**: `gh issue comment <number> --body "..."`
- **Apply / remove labels**: `gh issue edit <number> --add-label "..."` / `--remove-label "..."`
- **Close**: `gh issue close <number> --comment "..."`
Infer the repo from `git remote -v`; `gh` does this automatically when run inside a clone.
## When a skill says "publish to the issue tracker"
Create a GitHub issue.
## When a skill says "fetch the relevant ticket"
Run `gh issue view <number> --comments`.
@@ -0,0 +1,23 @@
# Issue tracker: GitLab
Issues and PRDs for this repo live as GitLab issues. Use the [`glab`](https://gitlab.com/gitlab-org/cli) CLI for all operations.
## Conventions
- **Create an issue**: `glab issue create --title "..." --description "..."`. Use a heredoc for multi-line descriptions. Pass `--description -` to open an editor.
- **Read an issue**: `glab issue view <number> --comments`. Use `-F json` for machine-readable output.
- **List issues**: `glab issue list --state opened -F json` with appropriate `--label` filters. Note that GitLab uses `opened` (not `open`) for the state value.
- **Comment on an issue**: `glab issue note <number> --message "..."`. GitLab calls comments "notes".
- **Apply / remove labels**: `glab issue update <number> --label "..."` / `--unlabel "..."`. Multiple labels can be comma-separated or by repeating the flag.
- **Close**: `glab issue close <number>`. `glab issue close` does not accept a closing comment, so post the explanation first with `glab issue note <number> --message "..."`, then close.
- **Merge requests**: GitLab calls PRs "merge requests". Use `glab mr create`, `glab mr view`, `glab mr note`, etc. — the same shape as `gh pr ...` with `mr` in place of `pr` and `note`/`--message` in place of `comment`/`--body`.
Infer the repo from `git remote -v`; `glab` does this automatically when run inside a clone.
## When a skill says "publish to the issue tracker"
Create a GitLab issue.
## When a skill says "fetch the relevant ticket"
Run `glab issue view <number> --comments`.
@@ -0,0 +1,19 @@
# Issue tracker: Local Markdown
Issues and PRDs for this repo live as markdown files in `.scratch/`.
## Conventions
- One feature per directory: `.scratch/<feature-slug>/`
- The PRD is `.scratch/<feature-slug>/PRD.md`
- Implementation issues are `.scratch/<feature-slug>/issues/<NN>-<slug>.md`, numbered from `01`
- Triage state is recorded as a `Status:` line near the top of each issue file (see `triage-labels.md` for the role strings)
- Comments and conversation history append to the bottom of the file under a `## Comments` heading
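A possible shell sketch of creating the next numbered issue file under these conventions (the feature slug and filenames are invented):

```shell
# scaffold the next numbered issue file for a feature
tmp=$(mktemp -d)
dir="$tmp/.scratch/export-crash/issues"
mkdir -p "$dir"
touch "$dir/01-reproduce.md" "$dir/02-fix-null-guard.md"

# find the highest existing number, then create <NN+1>-<slug>.md
last=$(ls "$dir" | sed -n 's/^\([0-9][0-9]*\)-.*/\1/p' \
       | sort -n | tail -n 1 | sed 's/^0*//')
next=$(printf '%02d' $(( ${last:-0} + 1 )))
printf '# Add regression test\n\nStatus: needs-triage\n\n## Comments\n' \
  > "$dir/$next-add-regression-test.md"
echo "created issues/$next-add-regression-test.md"
```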
## When a skill says "publish to the issue tracker"
Create a new file under `.scratch/<feature-slug>/` (creating the directory if needed).
## When a skill says "fetch the relevant ticket"
Read the file at the referenced path. The user will normally pass the path or the issue number directly.
@@ -0,0 +1,15 @@
# Triage Labels
The skills speak in terms of five canonical triage roles. This file maps those roles to the actual label strings used in this repo's issue tracker.
| Label in mattpocock/skills | Label in our tracker | Meaning |
| -------------------------- | -------------------- | ---------------------------------------- |
| `needs-triage` | `needs-triage` | Maintainer needs to evaluate this issue |
| `needs-info` | `needs-info` | Waiting on reporter for more information |
| `ready-for-agent` | `ready-for-agent` | Fully specified, ready for an AFK agent |
| `ready-for-human` | `ready-for-human` | Requires human implementation |
| `wontfix` | `wontfix` | Will not be actioned |
When a skill mentions a role (e.g. "apply the AFK-ready triage label"), use the corresponding label string from this table.
Edit the right-hand column to match whatever vocabulary you actually use.
@@ -375,6 +375,22 @@ When user asks about... check these files:
---
## Agent skills
### Issue tracker
Issues live in the self-hosted Gitea repo at git.np-dms.work:2222. See `docs/agents/issue-tracker.md`.
### Triage labels
Default label vocabulary (no custom mapping). See `docs/agents/triage-labels.md`.
### Domain docs
Single-context repo with domain documentation in `specs/`. See `docs/agents/domain.md`.
---
## 📚 Full Documentation
This file is a **quick reference**. For detailed information:
@@ -551,7 +551,19 @@ export class CorrespondenceService {
if (!correspondence) {
throw new NotFoundException('Correspondence', publicId);
}
return correspondence;
// ADR-021: expose live workflow state (null-safe: a Draft does not yet have a workflow instance)
const workflowInstance = await this.workflowEngine.getInstanceByEntity(
'correspondence',
correspondence.publicId
);
return {
...correspondence,
workflowInstanceId: workflowInstance?.id ?? null,
workflowState: workflowInstance?.currentState ?? null,
availableActions: workflowInstance?.availableActions ?? [],
};
}
async addReference(id: number, dto: AddReferenceDto) {
@@ -12,6 +12,8 @@ export class WorkflowHistoryItemDto {
toState!: string;
action!: string;
actionByUserId?: number;
// ADR-019: UUID of the acting user; exposed instead of the INT PK in every API response
actorUuid?: string;
comment?: string;
metadata?: Record<string, unknown>;
attachments!: AttachmentSummaryDto[];
@@ -4,11 +4,13 @@ import { ApiProperty, ApiPropertyOptional } from '@nestjs/swagger';
import {
ArrayMaxSize,
IsArray,
IsInt,
IsNotEmpty,
IsObject,
IsOptional,
IsString,
IsUUID,
Min,
} from 'class-validator';
export class WorkflowTransitionDto {
@@ -47,4 +49,15 @@ export class WorkflowTransitionDto {
@ArrayMaxSize(20)
@IsOptional()
attachmentPublicIds?: string[];
@ApiPropertyOptional({
description:
'Optimistic lock version. Send the value returned by GET /instances/:id to prevent double-approval (ADR-001 v1.1 FR-002). The server responds 409 if the value does not match',
example: 5,
minimum: 1,
})
@IsInt()
@Min(1)
@IsOptional()
versionNo?: number;
}
@@ -47,6 +47,17 @@ export class WorkflowHistory {
})
actionByUserId?: number;
// ADR-019: UUID of the acting user; exposed in API responses instead of the INT PK
// NULL = system action or pre-migration record (Delta 10)
@Column({
name: 'action_by_user_uuid',
length: 36,
nullable: true,
comment:
'UUID of the acting user; used in API responses per ADR-019. The INT FK action_by_user_id remains for internal use',
})
actionByUserUuid?: string;
@Column({ type: 'text', nullable: true, comment: 'Comment accompanying the approval' })
comment?: string;
@@ -85,4 +85,15 @@ export class WorkflowInstance {
@UpdateDateColumn({ name: 'updated_at' })
updatedAt!: Date;
// ADR-001 v1.1 FR-002: Optimistic lock — incremented on every successful transition
// The client sends this value with every transition; the server rejects with HTTP 409 if it does not match
@Column({
name: 'version_no',
type: 'int',
default: 1,
comment:
'Optimistic lock counter — incremented on each successful transition (ADR-001 v1.1 FR-002)',
})
versionNo!: number;
}
@@ -34,9 +34,11 @@ describe('WorkflowTransitionGuard', () => {
const mockRequest = (
params: Record<string, string> = {},
user: MockUserPayload = mockUser
user: MockUserPayload = mockUser,
action = 'APPROVE'
): Partial<RequestWithUser> => ({
params,
body: { action },
user: user as RequestWithUser['user'],
});
@@ -120,6 +122,7 @@ describe('WorkflowTransitionGuard', () => {
expect(userService.getUserPermissions).toHaveBeenCalledWith(123);
expect(instanceRepo.findOne).toHaveBeenCalledWith({
where: { id: 'instance-123' },
relations: ['definition'],
});
});
@@ -276,6 +279,130 @@ describe('WorkflowTransitionGuard', () => {
});
});
// T025: DSL require.role → CASL ability mapping tests
describe('DSL CASL Role Mapping (FR-002a)', () => {
it('should allow access when DSL requires OrgAdmin role and user has organization.manage_users', async () => {
userService.getUserPermissions.mockResolvedValue([
'organization.manage_users',
]);
const mockInstance = {
id: 'instance-dsl-1',
currentState: 'PENDING_REVIEW',
context: { organizationId: 99 }, // Different org — Level 2 would deny
contractId: null,
definition: {
compiled: {
states: {
PENDING_REVIEW: {
transitions: {
APPROVE: { requirements: { roles: ['OrgAdmin'] } },
},
},
},
},
},
};
instanceRepo.findOne.mockResolvedValue(mockInstance);
const context = mockContext(
mockRequest({ id: 'instance-dsl-1' }, mockUser, 'APPROVE')
);
const result = await guard.canActivate(context);
expect(result).toBe(true);
});
it('should allow access when DSL requires ContractMember and user has contract.view', async () => {
userService.getUserPermissions.mockResolvedValue(['contract.view']);
const mockInstance = {
id: 'instance-dsl-2',
currentState: 'REVIEW',
context: { organizationId: 99 },
contractId: null,
definition: {
compiled: {
states: {
REVIEW: {
transitions: {
SUBMIT: { requirements: { roles: ['ContractMember'] } },
},
},
},
},
},
};
instanceRepo.findOne.mockResolvedValue(mockInstance);
const context = mockContext(
mockRequest({ id: 'instance-dsl-2' }, mockUser, 'SUBMIT')
);
const result = await guard.canActivate(context);
expect(result).toBe(true);
});
it('should deny when DSL requires OrgAdmin but user only has contract.view', async () => {
userService.getUserPermissions.mockResolvedValue(['contract.view']);
const mockInstance = {
id: 'instance-dsl-3',
currentState: 'PENDING',
context: { organizationId: 99 },
contractId: null,
definition: {
compiled: {
states: {
PENDING: {
transitions: {
APPROVE: { requirements: { roles: ['OrgAdmin'] } },
},
},
},
},
},
};
instanceRepo.findOne.mockResolvedValue(mockInstance);
const context = mockContext(
mockRequest({ id: 'instance-dsl-3' }, mockUser, 'APPROVE')
);
await expect(guard.canActivate(context)).rejects.toThrow(
ForbiddenException
);
});
it('should fall through to Level 3 when DSL role is AssignedHandler', async () => {
userService.getUserPermissions.mockResolvedValue(['document.view']);
const mockInstance = {
id: 'instance-dsl-4',
currentState: 'ASSIGNED',
context: { organizationId: 99, assignedUserId: 123 }, // same as mockUser.user_id
contractId: null,
definition: {
compiled: {
states: {
ASSIGNED: {
transitions: {
COMPLETE: {
requirements: { roles: ['AssignedHandler'] },
},
},
},
},
},
},
};
instanceRepo.findOne.mockResolvedValue(mockInstance);
const context = mockContext(
mockRequest({ id: 'instance-dsl-4' }, mockUser, 'COMPLETE')
);
// AssignedHandler → falls to Level 3 check → passes because assignedUserId === user_id
const result = await guard.canActivate(context);
expect(result).toBe(true);
});
});
describe('Level 4: Unauthorized Users', () => {
it('should deny access for regular users without any special permissions', async () => {
// Arrange
@@ -12,7 +12,17 @@ import {
import { InjectRepository } from '@nestjs/typeorm';
import { DataSource, Repository } from 'typeorm';
import { WorkflowInstance } from '../entities/workflow-instance.entity';
import { CompiledWorkflow } from '../workflow-dsl.service';
import { UserService } from '../../../modules/user/user.service';
// FR-002a: static mapping from DSL require.role → CASL ability (research.md Decision 2)
// An unknown DSL role falls through to the Level 3 (assignedUserId) check
const DSL_ROLE_TO_CASL: Record<string, string> = {
Superadmin: 'system.manage_all',
OrgAdmin: 'organization.manage_users',
ContractMember: 'contract.view',
AssignedHandler: '__assigned__', // not mapped to CASL; handled by the Level 3 check
};
import type { RequestWithUser } from '../../../common/interfaces/request-with-user.interface';
/**
@@ -39,6 +49,8 @@ export class WorkflowTransitionGuard implements CanActivate {
async canActivate(context: ExecutionContext): Promise<boolean> {
const request = context.switchToHttp().getRequest<RequestWithUser>();
const instanceId = request.params['id'];
// FR-002a: action for the DSL role check (inspect requirements.roles of the requested transition)
const action = (request.body as { action?: string }).action ?? '';
const user = request.user;
// Fetch all of the user's permissions from the DB (same pattern as RbacGuard)
@@ -51,15 +63,37 @@ export class WorkflowTransitionGuard implements CanActivate {
return true;
}
// Fetch the Instance to check its Context
// Fetch the Instance + Definition to check the Context and the DSL require.role
const instance = await this.instanceRepo.findOne({
where: { id: instanceId },
relations: ['definition'],
});
if (!instance) {
throw new NotFoundException('Workflow Instance', instanceId);
}
// FR-002a: DSL require.role → CASL ability check
// Check the requirements.roles of the CompiledTransition matching the requested action
// (the Level 2.5 contract membership check must still pass)
const compiled = instance.definition?.compiled as
| CompiledWorkflow
| undefined;
const stateConfig = compiled?.states?.[instance.currentState];
// CompiledTransition.requirements.roles, not stateConfig.require (which does not exist)
const requiredDslRoles: string[] =
stateConfig?.transitions?.[action]?.requirements?.roles ?? [];
let dslRoleAuthorized = false;
for (const dslRole of requiredDslRoles) {
const caslAbility = DSL_ROLE_TO_CASL[dslRole];
if (caslAbility && caslAbility !== '__assigned__') {
if (userPermissions.includes(caslAbility)) {
dslRoleAuthorized = true;
break;
}
}
}
// Level 2: Org Admin (organization.manage_users + same organization as the document)
const docOrgId = instance.context?.organizationId as number | undefined;
if (
@@ -99,16 +133,21 @@ export class WorkflowTransitionGuard implements CanActivate {
}
}
// Level 3: Assigned Handler (this user is directly assigned to this step)
// Level 3: Assigned Handler or a DSL CASL-authorized role
// FR-002a: pass when the DSL require.role matches one of the user's CASL abilities
// (an AssignedHandler in the DSL is verified via assignedUserId in the context)
const assignedUserId = instance.context?.assignedUserId as
| number
| undefined;
if (assignedUserId !== undefined && user.user_id === assignedUserId) {
if (
dslRoleAuthorized ||
(assignedUserId !== undefined && user.user_id === assignedUserId)
) {
return true;
}
this.logger.warn(
`Unauthorized transition attempt: User ${user.user_id} on Instance ${instanceId}`
`Unauthorized transition attempt: User ${user.user_id} on Instance ${instanceId} (DSL roles: [${requiredDslRoles.join(', ')}])`
);
throw new ForbiddenException({
userMessage: 'คุณไม่มีสิทธิ์ดำเนินการในขั้นตอนนี้',
@@ -93,6 +93,19 @@ export class WorkflowEngineController {
return this.workflowService.evaluate(dto);
}
@Post('definitions/validate')
@ApiOperation({
summary: 'FR-025: ตรวจสอบความถูกต้องของ DSL โดยไม่บันทึกข้อมูล',
})
@ApiResponse({
status: 200,
description: '{ valid: true } หรือ { valid: false, errors: [...] }',
})
@RequirePermission('system.manage_all')
validateDefinition(@Body() body: { dsl: Record<string, unknown> }) {
return this.workflowService.validateDsl(body.dsl);
}
// =================================================================
// Runtime Engine (User Actions)
// =================================================================
@@ -117,6 +130,8 @@ export class WorkflowEngineController {
}
const userId = req.user.user_id;
// ADR-019: use publicId (UUID) instead of the INT PK for the History record
const userUuid = req.user.publicId;
// Check Redis whether this request was already sent (the key is bound to userId to prevent cross-user replay)
const cacheKey = `idempotency:transition:${idempotencyKey}:${userId}`;
@@ -131,7 +146,9 @@ export class WorkflowEngineController {
userId,
dto.comment,
dto.payload,
dto.attachmentPublicIds // ADR-021: step-specific attachments
dto.attachmentPublicIds, // ADR-021: step-specific attachments
userUuid, // ADR-019: UUID for the history record
dto.versionNo // ADR-001 v1.1 FR-002: Optimistic lock
);
// Keep in Redis for 24 hours (86400 seconds = 86400000 ms in cache-manager v7)
@@ -2,6 +2,7 @@
import { Module } from '@nestjs/common';
import { TypeOrmModule } from '@nestjs/typeorm';
import { BullModule } from '@nestjs/bullmq';
import {
makeCounterProvider,
makeHistogramProvider,
@@ -16,7 +17,8 @@ import { Attachment } from '../../common/file-storage/entities/attachment.entity
// Services
import { WorkflowDslService } from './workflow-dsl.service';
import { WorkflowEngineService } from './workflow-engine.service';
import { WorkflowEventService } from './workflow-event.service'; // [NEW]
import { WorkflowEventService } from './workflow-event.service';
import { WorkflowEventProcessor } from './workflow-event.processor';
// Guards
import { WorkflowTransitionGuard } from './guards/workflow-transition.guard';
@@ -33,6 +35,9 @@ import { WorkflowEngineController } from './workflow-engine.controller';
WorkflowHistory,
Attachment, // ADR-021: used to link per-step attachments
]),
// FR-005/006: BullMQ queues for workflow events + the Dead-Letter Queue
BullModule.registerQueue({ name: 'workflow-events' }),
BullModule.registerQueue({ name: 'workflow-events-failed' }),
UserModule,
],
controllers: [WorkflowEngineController],
@@ -40,6 +45,7 @@ import { WorkflowEngineController } from './workflow-engine.controller';
WorkflowEngineService,
WorkflowDslService,
WorkflowEventService,
WorkflowEventProcessor, // FR-005: BullMQ Processor + DLQ handler
WorkflowTransitionGuard,
// ADR-021 S1: Redlock observability — Prometheus metrics
makeHistogramProvider({
@@ -52,6 +58,18 @@ import { WorkflowEngineController } from './workflow-engine.controller';
name: 'workflow_redlock_acquire_failures_total',
help: 'จำนวนครั้งที่ Redlock acquire ล้มเหลวหลัง retry ครบ (Fail-closed HTTP 503)',
}),
// FR-023: Per-transition metrics — labelled by workflow_code, action, outcome
makeCounterProvider({
name: 'workflow_transitions_total',
help: 'จำนวน workflow transitions ทั้งหมด จำแนกตาม workflow_code, action และ outcome',
labelNames: ['workflow_code', 'action', 'outcome'],
}),
makeHistogramProvider({
name: 'workflow_transition_duration_ms',
help: 'เวลาที่ใช้ในการ process workflow transition ทั้งหมด (ms) รวม Redlock + DB transaction',
labelNames: ['workflow_code'],
buckets: [50, 100, 250, 500, 1000, 2500, 5000, 10000],
}),
],
exports: [WorkflowEngineService], // Export the service for other modules (Correspondence, RFA) to call
})
@@ -35,13 +35,22 @@ import { CreateWorkflowDefinitionDto } from './dto/create-workflow-definition.dt
const DEFAULT_REDIS_TOKEN = 'default_IORedisModuleConnectionToken';
describe('WorkflowEngineService', () => {
let compiledModule: TestingModule;
let service: WorkflowEngineService;
let defRepo: Repository<WorkflowDefinition>;
let instanceRepo: Repository<WorkflowInstance>;
let attachmentRepo: { find: jest.Mock; update: jest.Mock };
let dslService: WorkflowDslService;
let eventService: WorkflowEventService;
// Mock Objects
const mockCasQueryBuilder = {
update: jest.fn().mockReturnThis(),
set: jest.fn().mockReturnThis(),
where: jest.fn().mockReturnThis(),
execute: jest.fn().mockResolvedValue({ affected: 1 }),
};
const mockQueryRunner = {
connect: jest.fn(),
startTransaction: jest.fn(),
@@ -52,6 +61,8 @@ describe('WorkflowEngineService', () => {
findOne: jest.fn(),
save: jest.fn(),
update: jest.fn(),
// ADR-001 v1.1 FR-002: CAS version increment mock
createQueryBuilder: jest.fn().mockReturnValue(mockCasQueryBuilder),
},
};
@@ -85,7 +96,7 @@ describe('WorkflowEngineService', () => {
});
mockRedlockRelease.mockClear();
const module: TestingModule = await Test.createTestingModule({
compiledModule = await Test.createTestingModule({
providers: [
WorkflowEngineService,
{
@@ -151,14 +162,30 @@ describe('WorkflowEngineService', () => {
inc: jest.fn(),
},
},
// FR-023: Per-transition metrics mocks
{
provide: 'PROM_METRIC_WORKFLOW_TRANSITIONS_TOTAL',
useValue: {
labels: jest.fn().mockReturnThis(),
inc: jest.fn(),
},
},
{
provide: 'PROM_METRIC_WORKFLOW_TRANSITION_DURATION_MS',
useValue: {
labels: jest.fn().mockReturnThis(),
observe: jest.fn(),
},
},
],
}).compile();
service = module.get<WorkflowEngineService>(WorkflowEngineService);
defRepo = module.get(getRepositoryToken(WorkflowDefinition));
instanceRepo = module.get(getRepositoryToken(WorkflowInstance));
dslService = module.get(WorkflowDslService);
eventService = module.get(WorkflowEventService);
service = compiledModule.get<WorkflowEngineService>(WorkflowEngineService);
defRepo = compiledModule.get(getRepositoryToken(WorkflowDefinition));
instanceRepo = compiledModule.get(getRepositoryToken(WorkflowInstance));
attachmentRepo = compiledModule.get(getRepositoryToken(Attachment));
dslService = compiledModule.get(WorkflowDslService);
eventService = compiledModule.get(WorkflowEventService);
});
it('should be defined', () => {
@@ -563,11 +590,13 @@ describe('WorkflowEngineService', () => {
id: 'inst-1',
currentState: 'PENDING_REVIEW',
status: WorkflowStatus.ACTIVE,
definition: { compiled: mockCompiledWorkflow },
definition: { compiled: mockCompiledWorkflow, workflow_code: 'WF01' },
context: {},
versionNo: 1,
});
mockQueryRunner.manager.save.mockResolvedValue({ id: 'history-1' });
mockQueryRunner.manager.update.mockResolvedValue({ affected: 1 });
mockCasQueryBuilder.execute.mockResolvedValue({ affected: 1 });
mockDslService.evaluate.mockReturnValue({
nextState: 'APPROVED',
events: [],
@@ -585,4 +614,283 @@ describe('WorkflowEngineService', () => {
});
});
});
// ============================================================
// T024: ADR-001 v1.1 FR-002 — Optimistic Lock Tests
// ============================================================
describe('Optimistic Lock (FR-002)', () => {
const baseInstance = {
id: 'inst-opt-1',
currentState: 'PENDING_REVIEW',
status: WorkflowStatus.ACTIVE,
definition: { compiled: mockCompiledWorkflow, workflow_code: 'WF01' },
context: {},
versionNo: 5,
};
it('T024a: should throw ConflictException (409) when clientVersionNo does not match current versionNo (fast-fail)', async () => {
// Arrange: DB holds version_no=5, client sends version_no=3 (stale)
(instanceRepo.findOne as jest.Mock).mockResolvedValue({
id: 'inst-opt-1',
versionNo: 5,
});
// Act + Assert
await expect(
service.processTransition(
'inst-opt-1',
'APPROVE',
1,
undefined,
{},
undefined,
'user-uuid-123',
3 // stale clientVersionNo
)
).rejects.toThrow(ConflictException);
// Fast-fail: Redlock must not be acquired (the check runs before acquire)
expect(mockRedlockAcquire).not.toHaveBeenCalled();
});
it('T024b: should pass fast-fail and proceed when clientVersionNo matches current versionNo', async () => {
// Arrange: clientVersionNo matches the DB
(instanceRepo.findOne as jest.Mock).mockResolvedValue({
id: 'inst-opt-1',
currentState: 'PENDING_REVIEW',
versionNo: 5,
});
mockQueryRunner.manager.findOne.mockResolvedValue({
...baseInstance,
versionNo: 5,
});
mockQueryRunner.manager.save.mockResolvedValue({ id: 'history-1' });
mockCasQueryBuilder.execute.mockResolvedValue({ affected: 1 });
mockDslService.evaluate.mockReturnValue({
nextState: 'APPROVED',
events: [],
});
// Act
const result = await service.processTransition(
'inst-opt-1',
'APPROVE',
1,
undefined,
{},
undefined,
'user-uuid-123',
5 // clientVersionNo matches
);
// Assert: succeeds and returns the new versionNo
expect(result.success).toBe(true);
expect(result.versionNo).toBe(6); // 5 + 1
expect(mockRedlockAcquire).toHaveBeenCalled();
});
it('T024c: should throw ConflictException when CAS update returns affected=0 (TOCTOU edge case)', async () => {
// Arrange: fast-fail passes (no clientVersionNo sent), but CAS fails
(instanceRepo.findOne as jest.Mock).mockResolvedValue({
id: 'inst-opt-1',
currentState: 'PENDING_REVIEW',
versionNo: 5,
});
mockQueryRunner.manager.findOne.mockResolvedValue({
...baseInstance,
versionNo: 5,
});
mockQueryRunner.manager.save.mockResolvedValue({ id: 'history-1' });
// CAS: TOCTOU occurred; version_no changed between Redlock acquire and the CAS update
mockCasQueryBuilder.execute.mockResolvedValue({ affected: 0 });
mockDslService.evaluate.mockReturnValue({
nextState: 'APPROVED',
events: [],
});
// Act + Assert
await expect(
service.processTransition(
'inst-opt-1',
'APPROVE',
1,
undefined,
{},
undefined
// no clientVersionNo sent; the TOCTOU is caught by the CAS layer
)
).rejects.toThrow(ConflictException);
expect(mockQueryRunner.rollbackTransaction).toHaveBeenCalled();
expect(mockQueryRunner.commitTransaction).not.toHaveBeenCalled();
});
it('T024d: should rollback attachments to temp when DB transaction fails (FR-019)', async () => {
// Arrange: commit fails; attachments are expected to be reverted to temp
(instanceRepo.findOne as jest.Mock).mockResolvedValue(null); // no pre-check needed (no attachment state)
mockQueryRunner.manager.findOne.mockResolvedValue({
...baseInstance,
versionNo: 5,
});
mockQueryRunner.manager.save.mockResolvedValue({ id: 'history-1' });
// CAS succeeds
mockCasQueryBuilder.execute.mockResolvedValue({ affected: 1 });
// commitTransaction fails
mockQueryRunner.commitTransaction.mockRejectedValueOnce(
new Error('DB connection lost')
);
mockDslService.evaluate.mockReturnValue({
nextState: 'APPROVED',
events: [],
});
// Act + Assert
await expect(
service.processTransition(
'inst-opt-1',
'APPROVE',
1,
undefined,
{},
['att-rollback-1', 'att-rollback-2'] // two attached files
)
).rejects.toThrow(Error);
// FR-019: attachmentRepo.update must be called to revert the files back to temp
expect(attachmentRepo.update).toHaveBeenCalledWith(
expect.objectContaining({
publicId: ['att-rollback-1', 'att-rollback-2'],
}),
expect.objectContaining({ isTemporary: true })
);
});
});
// ============================================================
// T048: ADR-001 FR-007 — DSL Redis Cache Invalidation Tests
// ============================================================
describe('DSL Redis Cache Invalidation (FR-007, SC-005)', () => {
it('T048a: update() should invalidate cache when DSL changes', async () => {
// Arrange
const mockDef = {
id: 'def-cache-1',
workflow_code: 'RFA_V1',
version: 2,
is_active: false,
dsl: {},
compiled: {},
};
(defRepo.findOne as jest.Mock).mockResolvedValue(mockDef);
(defRepo.save as jest.Mock).mockResolvedValue({ ...mockDef, version: 2 });
mockDslService.compile.mockReturnValue(mockCompiledWorkflow);
const cacheManager = compiledModule.get<{
del: jest.Mock;
set: jest.Mock;
get: jest.Mock;
}>(CACHE_MANAGER);
// Act
await service.update('def-cache-1', {
dsl: {
workflow: 'RFA_V1',
states: [],
} as unknown as import('./dto/create-workflow-definition.dto').CreateWorkflowDefinitionDto['dsl'],
});
// Assert: cache del is called with the version key
expect(cacheManager.del).toHaveBeenCalledWith('wf:def:RFA_V1:2');
// Assert: re-cache happens after del
expect(cacheManager.set).toHaveBeenCalledWith(
'wf:def:RFA_V1:2',
expect.any(Object),
3_600_000
);
});
it('T048b: update() should invalidate active pointer when is_active toggles to true', async () => {
// Arrange: existing definition has is_active = false
const mockDef = {
id: 'def-cache-2',
workflow_code: 'TRANSMITTAL_V1',
version: 1,
is_active: false,
dsl: {},
compiled: {},
};
(defRepo.findOne as jest.Mock).mockResolvedValue(mockDef);
(defRepo.save as jest.Mock).mockResolvedValue({
...mockDef,
is_active: true,
});
const cacheManager = compiledModule.get<{
del: jest.Mock;
set: jest.Mock;
get: jest.Mock;
}>(CACHE_MANAGER);
// Act: activate definition
await service.update('def-cache-2', { is_active: true });
// Assert: the active pointer is removed from the cache
expect(cacheManager.del).toHaveBeenCalledWith(
'wf:def:TRANSMITTAL_V1:active'
);
});
it('T048c: createDefinition() should set cache with version key after save', async () => {
// Arrange
(defRepo.findOne as jest.Mock).mockResolvedValue({ version: 3 });
(defRepo.create as jest.Mock).mockReturnValue({
workflow_code: 'WF_CACHE',
version: 4,
});
(defRepo.save as jest.Mock).mockResolvedValue({
workflow_code: 'WF_CACHE',
version: 4,
});
mockDslService.compile.mockReturnValue(mockCompiledWorkflow);
const cacheManager = compiledModule.get<{
del: jest.Mock;
set: jest.Mock;
get: jest.Mock;
}>(CACHE_MANAGER);
// Act
await service.createDefinition({
workflow_code: 'WF_CACHE',
dsl: {},
} as import('./dto/create-workflow-definition.dto').CreateWorkflowDefinitionDto);
// Assert: cache set with the version key
expect(cacheManager.set).toHaveBeenCalledWith(
'wf:def:WF_CACHE:4',
expect.objectContaining({ workflow_code: 'WF_CACHE', version: 4 }),
3_600_000
);
});
it('T048d: getDefinitionById() should return from cache on cache hit', async () => {
// Arrange: the cache already holds the entry
const cachedDef = {
id: 'def-hit-1',
workflow_code: 'CACHED_WF',
version: 1,
};
const cacheManager = compiledModule.get<{
del: jest.Mock;
set: jest.Mock;
get: jest.Mock;
}>(CACHE_MANAGER);
cacheManager.get.mockResolvedValueOnce(cachedDef);
// Act
const result = await service.getDefinitionById('def-hit-1');
// Assert: no DB access needed
expect(result).toEqual(cachedDef);
expect(defRepo.findOne).not.toHaveBeenCalled();
});
});
});
@@ -32,7 +32,11 @@ import { CreateWorkflowDefinitionDto } from './dto/create-workflow-definition.dt
import { EvaluateWorkflowDto } from './dto/evaluate-workflow.dto';
import { UpdateWorkflowDefinitionDto } from './dto/update-workflow-definition.dto';
import { WorkflowHistoryItemDto } from './dto/workflow-history-item.dto';
import { CompiledWorkflow, WorkflowDslService } from './workflow-dsl.service';
import {
CompiledWorkflow,
RawWorkflowDSL,
WorkflowDslService,
} from './workflow-dsl.service';
import { WorkflowEventService } from './workflow-event.service'; // [NEW] Import Event Service
// Legacy Interface (Backward Compatibility)
@@ -79,7 +83,12 @@ export class WorkflowEngineService {
@InjectMetric('workflow_redlock_acquire_duration_ms')
private readonly redlockAcquireDuration: Histogram<string>,
@InjectMetric('workflow_redlock_acquire_failures_total')
private readonly redlockAcquireFailures: Counter<string>
private readonly redlockAcquireFailures: Counter<string>,
// FR-023: Per-transition metrics — labelled by workflow_code, action, outcome
@InjectMetric('workflow_transitions_total')
private readonly transitionsTotal: Counter<string>,
@InjectMetric('workflow_transition_duration_ms')
private readonly transitionDuration: Histogram<string>
) {
// ADR-021 Clarify Q2 (C1): Redlock Fail-closed
// Retry 3 times × 500ms with jitter; if still unacquired, throw HTTP 503
@@ -95,6 +104,30 @@ export class WorkflowEngineService {
// [PART 1] Definition Management (Phase 6A)
// =================================================================
/**
* FR-025: validate the DSL inline for the Admin Editor (nothing is persisted)
*/
validateDsl(
dsl: Record<string, unknown>
):
| { valid: true }
| { valid: false; errors: { path: string; message: string }[] } {
try {
this.dslService.compile(dsl as unknown as RawWorkflowDSL);
return { valid: true };
} catch (error: unknown) {
return {
valid: false,
errors: [
{
path: '',
message: error instanceof Error ? error.message : String(error),
},
],
};
}
}
/**
* Workflow Definition (Auto Versioning)
*/
@@ -122,6 +155,12 @@ export class WorkflowEngineService {
});
const saved = await this.workflowDefRepo.save(entity);
// T044: Cache definition per version (TTL 1h, SC-005)
await this.cacheManager.set(
`wf:def:${saved.workflow_code}:${saved.version}`,
saved,
3_600_000
);
this.logger.log(
`Created Workflow Definition: ${saved.workflow_code} v${saved.version}`
);
@@ -155,10 +194,30 @@ export class WorkflowEngineService {
}
}
const prevIsActive = definition.is_active;
if (dto.is_active !== undefined) definition.is_active = dto.is_active;
if (dto.workflow_code) definition.workflow_code = dto.workflow_code;
return this.workflowDefRepo.save(definition);
const updated = await this.workflowDefRepo.save(definition);
// T045: Invalidate the version cache when the DSL changes
if (dto.dsl) {
await this.cacheManager.del(
`wf:def:${updated.workflow_code}:${updated.version}`
);
}
// T045: Invalidate the active pointer when is_active changes
if (dto.is_active !== undefined && dto.is_active !== prevIsActive) {
await this.cacheManager.del(`wf:def:${updated.workflow_code}:active`);
}
// T045: Re-cache updated definition
await this.cacheManager.set(
`wf:def:${updated.workflow_code}:${updated.version}`,
updated,
3_600_000
);
return updated;
}
/**
@@ -181,10 +240,17 @@ export class WorkflowEngineService {
* Workflow Definition ID Code
*/
async getDefinitionById(id: string): Promise<WorkflowDefinition> {
// T046: Read-through cache (TTL 1h, SC-005)
const cacheKey = `wf:def:id:${id}`;
const cached = await this.cacheManager.get<WorkflowDefinition>(cacheKey);
if (cached) return cached;
const definition = await this.workflowDefRepo.findOne({ where: { id } });
if (!definition) {
throw new NotFoundException('Workflow Definition', id);
}
await this.cacheManager.set(cacheKey, definition, 3_600_000);
return definition;
}
@@ -317,7 +383,7 @@ export class WorkflowEngineService {
: [];
return {
id: instance.id,
id: instance.id, // publicId (UUID) of the workflow instance
currentState: instance.currentState,
availableActions,
};
@@ -333,11 +399,49 @@ export class WorkflowEngineService {
comment?: string,
payload: Record<string, unknown> = {},
// ADR-021: publicIds of the files attached to this step (two-phase upload happens first)
attachmentPublicIds?: string[]
attachmentPublicIds?: string[],
// ADR-019: the user's UUID for the history record (the INT PK is not exposed)
userUuid?: string,
// ADR-001 v1.1 FR-002: optimistic lock; the client sends it to prevent double-approval
clientVersionNo?: number
) {
// FR-022/023: start timing the whole method to record the latency metric
const startMs = Date.now();
let outcome:
| 'success'
| 'conflict'
| 'forbidden'
| 'validation_error'
| 'system_error' = 'system_error';
let workflowCode = 'unknown';
let fromState: string | undefined;
let toState: string | undefined;
const hasAttachments =
attachmentPublicIds !== undefined && attachmentPublicIds.length > 0;
// ==============================================================
// ADR-001 v1.1 FR-002: fast-fail optimistic-lock check (before Redlock)
// Reduces Redlock load for clients that send a stale version_no
// ==============================================================
if (clientVersionNo !== undefined) {
const current = await this.instanceRepo.findOne({
where: { id: instanceId },
select: ['id', 'versionNo'],
});
if (!current) {
throw new NotFoundException('Workflow Instance', instanceId);
}
if (current.versionNo !== clientVersionNo) {
outcome = 'conflict';
throw new ConflictException(
'WORKFLOW_VERSION_CONFLICT',
`Fast-fail: expected version_no=${clientVersionNo}, actual=${current.versionNo}`,
'เอกสารถูกอนุมัติโดยผู้อื่นแล้ว กรุณารีเฟรชและลองใหม่',
['รีเฟรชหน้าแล้วดูสถานะล่าสุดก่อนดำเนินการ']
);
}
}
// ==============================================================
// ADR-021 Clarify Q1 (C3): check the state before acquiring the Redlock
// Attachments are allowed only in the PENDING_REVIEW / PENDING_APPROVAL states
@@ -453,8 +557,10 @@ export class WorkflowEngineService {
context
);
const fromState = instance.currentState;
const toState = evaluation.nextState;
fromState = instance.currentState;
toState = evaluation.nextState;
// FR-023: record workflowCode for the metric labels
workflowCode = instance.definition?.workflow_code ?? 'unknown';
// 3. Update the Instance
instance.currentState = toState;
@@ -474,6 +580,8 @@ export class WorkflowEngineService {
toState,
action,
actionByUserId: userId,
// ADR-019 FR-003: the user's UUID for the API response (the INT PK is not exposed)
actionByUserUuid: userUuid,
comment,
metadata: {
events: evaluation.events,
@@ -516,6 +624,27 @@ export class WorkflowEngineService {
}
}
// ADR-001 v1.1 FR-002: CAS version increment inside the DB transaction, just before commit
// The UPDATE fails (affected=0) if version_no changed in the meantime (TOCTOU edge case)
const casResult = await queryRunner.manager
.createQueryBuilder()
.update(WorkflowInstance)
.set({ versionNo: () => 'version_no + 1' })
.where('id = :id AND version_no = :expected', {
id: instanceId,
expected: instance.versionNo,
})
.execute();
if ((casResult.affected ?? 0) === 0) {
throw new ConflictException(
'WORKFLOW_VERSION_CONFLICT',
'version_no changed between Redlock acquisition and CAS update (TOCTOU edge case)',
'เกิด Conflict กรุณารีเฟรชและลองใหม่',
['รีเฟรชหน้า', 'ลองดำเนินการอีกครั้ง']
);
}
await queryRunner.commitTransaction();
// ADR-021 T043: Invalidate the Workflow History cache after a successful transition
@@ -536,23 +665,85 @@ export class WorkflowEngineService {
void this.eventService.dispatchEvents(
instance.id,
evaluation.events,
context
context,
workflowCode // FR-005: the DLQ notification uses workflowCode to give Ops context
);
}
outcome = 'success';
// FR-014 T014: return the incremented versionNo for the client to keep for its next request
const newVersionNo = instance.versionNo + 1;
return {
success: true,
previousState: fromState,
nextState: toState,
events: evaluation.events,
isCompleted: instance.status === WorkflowStatus.COMPLETED,
versionNo: newVersionNo,
};
} catch (err) {
await queryRunner.rollbackTransaction();
// FR-019: Roll file attachments back to temporary when the DB transaction fails
// Files on disk remain in permanent storage; the cleanup job handles them after the 24h TTL
if (
hasAttachments &&
attachmentPublicIds &&
attachmentPublicIds.length > 0
) {
await this.attachmentRepo
.update(
{ publicId: In(attachmentPublicIds), uploadedByUserId: userId },
{
isTemporary: true,
expiresAt: new Date(Date.now() + 24 * 60 * 60 * 1000),
}
)
.catch((rollbackErr: unknown) =>
this.logger.error(
`FR-019 Attachment rollback failed for ${instanceId}: ${rollbackErr instanceof Error ? rollbackErr.message : String(rollbackErr)}`
)
);
this.logger.warn(
`FR-019: Reverted ${attachmentPublicIds.length} attachment(s) to temp for instance ${instanceId} after DB failure`
);
}
// Classify the outcome for the metric label
if (err instanceof ConflictException) outcome = 'conflict';
else if ((err as { status?: number }).status === 403)
outcome = 'forbidden';
else if (err instanceof WorkflowException) outcome = 'validation_error';
this.logger.error(
`Transition Failed for ${instanceId}: ${(err as Error).message}`
);
throw err;
} finally {
const durationMs = Date.now() - startMs;
// FR-023: record the transition-duration histogram
this.transitionDuration
.labels({ workflow_code: workflowCode })
.observe(durationMs);
// FR-023: increment the transition counter by outcome
this.transitionsTotal
.labels({ workflow_code: workflowCode, action, outcome })
.inc();
// FR-022: structured log entry for every transition (success/failure/conflict)
this.logger.log(
JSON.stringify({
instanceId,
action,
fromState,
toState,
userUuid,
durationMs,
outcome,
workflowCode,
})
);
await queryRunner.release();
// ADR-021 C1: always release the Redlock (non-blocking if the release fails)
lock.release().catch((e: unknown) => {
@@ -0,0 +1,165 @@
// File: src/modules/workflow-engine/workflow-event.processor.spec.ts
// T026: Unit tests for WorkflowEventProcessor DLQ + n8n webhook (FR-005, FR-006)
import { Test, TestingModule } from '@nestjs/testing';
import { getQueueToken } from '@nestjs/bullmq';
import { WorkflowEventProcessor } from './workflow-event.processor';
import type { WorkflowEventJobData } from './workflow-event.processor';
import type { Job } from 'bullmq';
// Mock global fetch for the n8n webhook
const mockFetch = jest.fn();
global.fetch = mockFetch;
describe('WorkflowEventProcessor', () => {
let processor: WorkflowEventProcessor;
let failedQueue: { add: jest.Mock };
const makeJob = (
overrides: Partial<{
id: string;
attemptsMade: number;
opts: { attempts: number };
data: Record<string, unknown>;
}> = {}
) =>
({
id: 'job-001',
attemptsMade: 3,
opts: { attempts: 3 },
data: {
instanceId: 'inst-wf-1',
events: [{ type: 'notify', target: 'admin', template: 'APPROVED' }],
context: {},
workflowCode: 'RFA_V1',
},
...overrides,
}) as unknown as Job<WorkflowEventJobData>;
beforeEach(async () => {
failedQueue = { add: jest.fn().mockResolvedValue(undefined) };
mockFetch.mockReset();
const module: TestingModule = await Test.createTestingModule({
providers: [
WorkflowEventProcessor,
{
provide: getQueueToken('workflow-events-failed'),
useValue: failedQueue,
},
],
}).compile();
processor = module.get<WorkflowEventProcessor>(WorkflowEventProcessor);
});
afterEach(() => {
delete process.env['N8N_WEBHOOK_URL'];
});
describe('onJobFailed()', () => {
it('T026a: should add dead-letter job to workflow-events-failed queue when attempts exhausted', async () => {
// Arrange: job.attemptsMade === job.opts.attempts (retries exhausted)
const job = makeJob({ attemptsMade: 3, opts: { attempts: 3 } });
const error = new Error('Notification service timeout');
// Act
await processor.onJobFailed(job, error);
// Assert: forwarded to the DLQ
expect(failedQueue.add).toHaveBeenCalledWith(
'dead-letter',
expect.objectContaining({
originalJobId: 'job-001',
queue: 'workflow-events',
error: 'Notification service timeout',
data: expect.objectContaining({ instanceId: 'inst-wf-1' }),
})
);
});
it('T026b: should NOT add to DLQ when job still has retry attempts remaining', async () => {
// Arrange: attempt 1 of 3; retries remain
const job = makeJob({ attemptsMade: 1, opts: { attempts: 3 } });
const error = new Error('Temporary error');
// Act
await processor.onJobFailed(job, error);
// Assert: not sent to the DLQ
expect(failedQueue.add).not.toHaveBeenCalled();
expect(mockFetch).not.toHaveBeenCalled();
});
it('T026c: should POST to n8n webhook when N8N_WEBHOOK_URL is configured', async () => {
// Arrange: configure the webhook URL
process.env['N8N_WEBHOOK_URL'] = 'https://n8n.example.com/webhook/dlq';
mockFetch.mockResolvedValue({ ok: true });
const job = makeJob({ attemptsMade: 3, opts: { attempts: 3 } });
const error = new Error('Service down');
// Act
await processor.onJobFailed(job, error);
// Assert: the n8n webhook is called
expect(mockFetch).toHaveBeenCalledWith(
'https://n8n.example.com/webhook/dlq',
expect.objectContaining({
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: expect.stringContaining('"event":"workflow_event_failed"'),
})
);
// The body must contain workflowCode + instanceId
const callArgs = mockFetch.mock.calls[0] as [string, RequestInit];
const body = JSON.parse(callArgs[1].body as string) as Record<
string,
unknown
>;
expect(body).toMatchObject({
event: 'workflow_event_failed',
jobId: 'job-001',
workflowCode: 'RFA_V1',
instanceId: 'inst-wf-1',
error: 'Service down',
});
});
it('T026d: should warn (not throw) when N8N_WEBHOOK_URL is not set', async () => {
// Arrange: env var not set
delete process.env['N8N_WEBHOOK_URL'];
const job = makeJob({ attemptsMade: 3, opts: { attempts: 3 } });
const error = new Error('Error');
// Act — ต้องไม่ throw
await expect(processor.onJobFailed(job, error)).resolves.toBeUndefined();
// The DLQ must still be called; only the webhook is skipped
expect(failedQueue.add).toHaveBeenCalled();
expect(mockFetch).not.toHaveBeenCalled();
});
it('T026e: should continue without throwing when DLQ add fails', async () => {
// Arrange: the DLQ queue fails; the handler must not throw
failedQueue.add.mockRejectedValueOnce(new Error('Redis DLQ down'));
const job = makeJob({ attemptsMade: 3, opts: { attempts: 3 } });
const error = new Error('Original error');
// Act: must resolve normally, without throwing
await expect(processor.onJobFailed(job, error)).resolves.toBeUndefined();
});
});
describe('process()', () => {
it('T026f: should process notify event without error', async () => {
const job = makeJob();
// Act: must resolve normally
await expect(processor.process(job)).resolves.toBeUndefined();
});
});
});
@@ -0,0 +1,133 @@
// File: src/modules/workflow-engine/workflow-event.processor.ts
// FR-005/FR-006: BullMQ processor for the workflow-events queue, with a Dead-Letter Queue
import {
Processor,
WorkerHost,
OnWorkerEvent,
InjectQueue,
} from '@nestjs/bullmq';
import { Logger } from '@nestjs/common';
import { Job, Queue } from 'bullmq';
import { RawEvent } from './workflow-dsl.service';
export interface WorkflowEventJobData {
instanceId: string;
events: RawEvent[];
context: Record<string, unknown>;
workflowCode?: string;
}
@Processor('workflow-events', {
concurrency: 5,
limiter: { max: 100, duration: 60_000 },
})
export class WorkflowEventProcessor extends WorkerHost {
private readonly logger = new Logger(WorkflowEventProcessor.name);
constructor(
// FR-006: queue for dead letters (jobs that exhausted their retries)
@InjectQueue('workflow-events-failed')
private readonly failedQueue: Queue
) {
super();
}
// ADR-008: process a workflow event job
process(job: Job<WorkflowEventJobData>): Promise<void> {
const { instanceId, events } = job.data;
this.logger.log(
`Processing ${events.length} event(s) for Instance ${instanceId} (Job: ${job.id})`
);
// Process each event (throw so that BullMQ retries automatically)
for (const event of events) {
this.processSingleEvent(instanceId, event, job.data.context);
}
return Promise.resolve();
}
// FR-006: Dead-Letter Queue handler, invoked when a job has exhausted all retries
@OnWorkerEvent('failed')
async onJobFailed(
job: Job<WorkflowEventJobData>,
error: Error
): Promise<void> {
const maxAttempts = job.opts.attempts ?? 3;
if ((job.attemptsMade ?? 0) < maxAttempts) {
// Retries remain; no DLQ handoff yet
return;
}
this.logger.error(
`Job ${job.id} exhausted all ${maxAttempts} retries for Instance ${job.data.instanceId}: ${error.message}`
);
// Forward to the Dead-Letter Queue
await this.failedQueue
.add('dead-letter', {
originalJobId: job.id,
queue: 'workflow-events',
data: job.data,
failedAt: new Date().toISOString(),
error: error.message,
})
.catch((dlqErr: unknown) =>
this.logger.error(
`Failed to add job ${job.id} to DLQ: ${dlqErr instanceof Error ? dlqErr.message : String(dlqErr)}`
)
);
// Notify Ops via the n8n webhook (if configured)
const webhookUrl = process.env['N8N_WEBHOOK_URL'];
if (webhookUrl) {
await fetch(webhookUrl, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
event: 'workflow_event_failed',
jobId: job.id,
workflowCode: job.data.workflowCode,
instanceId: job.data.instanceId,
error: error.message,
timestamp: new Date().toISOString(),
}),
}).catch((webhookErr: unknown) => {
// Warning only: don't throw, so the successful DLQ add is unaffected
this.logger.warn(
`n8n webhook failed for job ${job.id}: ${webhookErr instanceof Error ? webhookErr.message : String(webhookErr)}`
);
});
} else {
this.logger.warn(
`N8N_WEBHOOK_URL not configured — DLQ job created without ops notification (job: ${job.id})`
);
}
}
// --- Private Handlers ---
private processSingleEvent(
instanceId: string,
event: RawEvent,
_context: Record<string, unknown>
): void {
switch (event.type) {
case 'notify':
this.logger.log(
`[NOTIFY] Instance ${instanceId} → target: "${event.target}" | template: "${event.template}"`
);
break;
case 'webhook':
this.logger.log(
`[WEBHOOK] Instance ${instanceId} → url: "${event.target}"`
);
break;
case 'auto_action':
this.logger.log(`[AUTO_ACTION] Instance ${instanceId}`);
break;
default:
this.logger.warn(`Unknown event type: ${event.type} for ${instanceId}`);
}
}
}
@@ -1,6 +1,8 @@
// File: src/modules/workflow-engine/workflow-event.service.ts
import { Injectable, Logger } from '@nestjs/common';
import { InjectQueue } from '@nestjs/bullmq';
import { Queue } from 'bullmq';
import { RawEvent } from './workflow-dsl.service';
// Interface สำหรับ External Services ที่จะมารับ Event ต่อ
@@ -19,81 +21,47 @@ export interface WorkflowEventHandler {
export class WorkflowEventService {
private readonly logger = new Logger(WorkflowEventService.name);
// A NotificationService or HttpService could be injected here
// constructor(private readonly notificationService: NotificationService) {}
constructor(
// ADR-008: use a BullMQ queue instead of inline processing, for retry + DLQ (FR-005)
@InjectQueue('workflow-events')
private readonly workflowEventQueue: Queue
) {}
/**
 * Dispatch events asynchronously: enqueues a job on the workflow-events queue
 * (ADR-008: Async Block Response).
 * Processor: WorkflowEventProcessor (workflow-event.processor.ts)
 */
  dispatchEvents(
    instanceId: string,
    events: RawEvent[],
-   context: Record<string, unknown>
- ) {
+   context: Record<string, unknown>,
+   workflowCode?: string
+ ): void {
    if (!events || events.length === 0) return;
    this.logger.log(
-     `Dispatching ${events.length} events for Instance ${instanceId}`
+     `Enqueuing ${events.length} event(s) for Instance ${instanceId} → workflow-events queue`
    );
-   // Async fire-and-forget: don't wait for results, so user response time is unaffected
-   void Promise.allSettled(
-     events.map((event) => this.processSingleEvent(instanceId, event, context))
-   ).then((results) => {
-     // Log errors if any
-     results.forEach((res, idx) => {
-       if (res.status === 'rejected') {
-         this.logger.error(
-           `Failed to process event [${idx}]: ${String(res.reason)}`
-         );
-       }
-     });
-   });
- }
+   // ADR-008: Fire-and-forget; no await, so user response time is unaffected.
+   // WorkflowEventProcessor handles the job and retries automatically (3 retries, exponential backoff)
+   void this.workflowEventQueue
+     .add(
+       'process-events',
+       { instanceId, events, context, workflowCode },
+       {
+         attempts: 3,
+         backoff: { type: 'exponential', delay: 500 },
+         removeOnComplete: { age: 86_400 }, // keep for 24h
+         removeOnFail: false, // kept for Bull Board + DLQ
+       }
+     )
+     .catch((err: unknown) =>
+       this.logger.error(
+         `Failed to enqueue workflow events for ${instanceId}: ${
+           err instanceof Error ? err.message : String(err)
+         }`
+       )
+     );
+ }
- private async processSingleEvent(
-   instanceId: string,
-   event: RawEvent,
-   context: Record<string, unknown>
- ) {
-   await Promise.resolve();
-   try {
-     switch (event.type) {
-       case 'notify':
-         this.handleNotify(event, context);
-         break;
-       case 'webhook':
-         this.handleWebhook(event, context);
-         break;
-       case 'auto_action':
-         // Auto-transition logic (e.g. advance immediately when the condition passes)
-         this.logger.log(`Auto Action triggered for ${instanceId}`);
-         break;
-       default:
-         this.logger.warn(`Unknown event type: ${event.type}`);
-     }
-   } catch (error) {
-     this.logger.error(
-       `Error processing event ${event.type}: ${String(error)}`
-     );
-     throw error;
-   }
- }
// --- Handlers ---
private handleNotify(event: RawEvent, _context: Record<string, unknown>) {
// Mock-up: the real implementation will call NotificationService.send()
// const recipients = this.resolveRecipients(event.target, context);
this.logger.log(
`[EVENT] Notify target: "${event.target}" | Template: "${event.template}"`
);
}
private handleWebhook(event: RawEvent, _context: Record<string, unknown>) {
// Mock-up: calls HttpService.post()
this.logger.log(
`[EVENT] Webhook to: "${event.target}" | Payload: ${JSON.stringify(event.payload)}`
);
}
}
+41
@@ -0,0 +1,41 @@
# Domain Docs
How the engineering skills should consume this repo's domain documentation when exploring the codebase.
## Before exploring, read these
- **`CONTEXT.md`** at the repo root, or
- **`CONTEXT-MAP.md`** at the repo root if it exists — it points at one `CONTEXT.md` per context. Read each one relevant to the topic.
- **`specs/06-Decision-Records/`** — read ADRs that touch the area you're about to work in. This repo uses `specs/` instead of `docs/` for all documentation.
If any of these files don't exist, **proceed silently**. Don't flag their absence; don't suggest creating them upfront. The producer skill (`/grill-with-docs`) creates them lazily when terms or decisions actually get resolved.
## File structure
Single-context repo (this repo):
```
/
├── CONTEXT.md (if it exists)
├── specs/
│ ├── 00-overview/
│ ├── 01-requirements/
│ ├── 02-architecture/
│ ├── 03-Data-and-Storage/
│ ├── 04-Infrastructure-OPS/
│ ├── 05-Engineering-Guidelines/
│ └── 06-Decision-Records/ ← ADRs live here
└── src/
```
## Use the glossary's vocabulary
When your output names a domain concept (in an issue title, a refactor proposal, a hypothesis, a test name), use the term as defined in `specs/00-overview/00-02-glossary.md`. Don't drift to synonyms the glossary explicitly avoids.
If the concept you need isn't in the glossary yet, that's a signal — either you're inventing language the project doesn't use (reconsider) or there's a real gap (note it for `/grill-with-docs`).
## Flag ADR conflicts
If your output contradicts an existing ADR in `specs/06-Decision-Records/`, surface it explicitly rather than silently overriding:
> _Contradicts ADR-001 (unified workflow engine) — but worth reopening because…_
+23
@@ -0,0 +1,23 @@
# Issue tracker: Gitea
Issues and PRDs for this repo live in the self-hosted Gitea instance at git.np-dms.work:2222. Use the `gh` CLI with custom host configuration for all operations.
## Conventions
- **Configure `gh` for Gitea**: Run `gh auth login --hostname git.np-dms.work:2222` to authenticate
- **Create an issue**: `gh issue create --hostname git.np-dms.work:2222 --title "..." --body "..."`. Use a heredoc for multi-line bodies.
- **Read an issue**: `gh issue view <number> --hostname git.np-dms.work:2222 --comments`; filter the comments with `jq` and fetch labels as needed.
- **List issues**: `gh issue list --hostname git.np-dms.work:2222 --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
- **Comment on an issue**: `gh issue comment <number> --hostname git.np-dms.work:2222 --body "..."`
- **Apply / remove labels**: `gh issue edit <number> --hostname git.np-dms.work:2222 --add-label "..."` / `--remove-label "..."`
- **Close**: `gh issue close <number> --hostname git.np-dms.work:2222 --comment "..."`
Infer the repo from `git remote -v` — the origin is `ssh://git@git.np-dms.work:2222/np-dms/lcbp3.git`.
## When a skill says "publish to the issue tracker"
Create a Gitea issue using `gh issue create --hostname git.np-dms.work:2222`.
## When a skill says "fetch the relevant ticket"
Run `gh issue view <number> --hostname git.np-dms.work:2222 --comments`.
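The conventions above repeat the same `--hostname` flag on every call; a small wrapper keeps it in one place. A minimal sketch — the `gitea_issue` helper is hypothetical, and the `echo` makes it a dry run (remove it to actually execute):

```shell
# Hypothetical helper: every gh call in this repo targets the same Gitea host.
GITEA_HOST="git.np-dms.work:2222"

gitea_issue() {
  # Dry run: print the command instead of executing it; drop `echo` to run for real.
  echo gh issue "$@" --hostname "$GITEA_HOST"
}

gitea_issue view 42 --comments
gitea_issue edit 42 --add-label "ready-for-agent"
```

Sourcing the helper in a shell session then lets each skill invoke `gitea_issue …` without re-stating the host.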
+15
@@ -0,0 +1,15 @@
# Triage Labels
The skills speak in terms of five canonical triage roles. This file maps those roles to the actual label strings used in this repo's issue tracker.
| Label in mattpocock/skills | Label in our tracker | Meaning |
| -------------------------- | -------------------- | ---------------------------------------- |
| `needs-triage` | `needs-triage` | Maintainer needs to evaluate this issue |
| `needs-info` | `needs-info` | Waiting on reporter for more information |
| `ready-for-agent` | `ready-for-agent` | Fully specified, ready for an AFK agent |
| `ready-for-human` | `ready-for-human` | Requires human implementation |
| `wontfix` | `wontfix` | Will not be actioned |
When a skill mentions a role (e.g. "apply the AFK-ready triage label"), use the corresponding label string from this table.
Edit the right-hand column to match whatever vocabulary you actually use.
@@ -23,6 +23,7 @@ export default function WorkflowEditPage() {
const router = useRouter();
const id = params?.id === 'new' ? null : (params?.id as string);
const [hasValidationErrors, setHasValidationErrors] = useState(false);
const [workflowData, setWorkflowData] = useState<Partial<Workflow>>({
workflowName: '',
description: '',
@@ -102,7 +103,7 @@ export default function WorkflowEditPage() {
<Link href="/admin/doc-control/workflows">
<Button variant="outline">Cancel</Button>
</Link>
- <Button onClick={handleSave} disabled={saving}>
+ <Button onClick={handleSave} disabled={saving || hasValidationErrors}>
{saving && <Loader2 className="mr-2 h-4 w-4 animate-spin" />}
<Save className="mr-2 h-4 w-4" />
{id ? 'Save Changes' : 'Create Workflow'}
@@ -177,6 +178,7 @@ export default function WorkflowEditPage() {
<DSLEditor
initialValue={workflowData.dslDefinition}
onChange={(value) => setWorkflowData({ ...workflowData, dslDefinition: value })}
onValidationChange={setHasValidationErrors}
/>
</TabsContent>
@@ -0,0 +1,145 @@
// T043: Vitest component test for FilePreviewModal
// Verifies: PDF → iframe, image → img, unsupported → download link, onClose callback
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { render, screen, waitFor } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import apiClient from '@/lib/api/client';
import { FilePreviewModal } from '../file-preview-modal';
import type { WorkflowAttachmentSummary } from '@/types/workflow';
// Mock useTranslations: returns the key as a fallback for tests
vi.mock('@/hooks/use-translations', () => ({
useTranslations: () => (key: string) => key,
}));
// apiClient.get is already mocked in vitest.setup.ts
const mockApiGet = vi.mocked(apiClient.get);
// Mock URL.createObjectURL / revokeObjectURL
const mockObjectUrl = 'blob:http://localhost/mock-blob-url';
vi.stubGlobal('URL', {
createObjectURL: vi.fn().mockReturnValue(mockObjectUrl),
revokeObjectURL: vi.fn(),
});
const makeAttachment = (
overrides: Partial<WorkflowAttachmentSummary> = {}
): WorkflowAttachmentSummary => ({
publicId: 'att-preview-001',
originalFilename: 'test-file.pdf',
mimeType: 'application/pdf',
fileSize: 102400,
createdAt: '2026-01-01T00:00:00.000Z',
...overrides,
});
describe('FilePreviewModal', () => {
const onClose = vi.fn();
const onUnavailable = vi.fn();
const mockBlob = new Blob(['%PDF-1.4'], { type: 'application/pdf' });
beforeEach(() => {
vi.clearAllMocks();
mockApiGet.mockResolvedValue({ data: mockBlob });
});
it('renders iframe for PDF MIME type', async () => {
const attachment = makeAttachment({ mimeType: 'application/pdf' });
render(<FilePreviewModal attachment={attachment} onClose={onClose} />);
await waitFor(() => {
expect(screen.getByTitle('test-file.pdf')).toBeInTheDocument();
});
const iframe = screen.getByTitle('test-file.pdf') as HTMLIFrameElement;
expect(iframe.tagName).toBe('IFRAME');
expect(iframe.src).toContain('blob:');
});
it('renders img for image MIME type', async () => {
const imageBlob = new Blob(['fake-image'], { type: 'image/png' });
mockApiGet.mockResolvedValue({ data: imageBlob });
const attachment = makeAttachment({
mimeType: 'image/png',
originalFilename: 'photo.png',
});
render(<FilePreviewModal attachment={attachment} onClose={onClose} />);
await waitFor(() => {
expect(screen.getByAltText('photo.png')).toBeInTheDocument();
});
const img = screen.getByAltText('photo.png') as HTMLImageElement;
expect(img.tagName).toBe('IMG');
});
it('shows download link for unsupported MIME type (no iframe or img)', async () => {
const docxBlob = new Blob(['PK...'], {
type: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
});
mockApiGet.mockResolvedValue({ data: docxBlob });
const attachment = makeAttachment({
mimeType: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
originalFilename: 'report.docx',
});
render(<FilePreviewModal attachment={attachment} onClose={onClose} />);
await waitFor(() => {
// the download link must have href = blobUrl
const link = screen.getByRole('link');
expect(link).toHaveAttribute('href', mockObjectUrl);
expect(link).toHaveAttribute('download', 'report.docx');
});
// there must be no iframe or img
expect(screen.queryByTitle('report.docx')).not.toBeInTheDocument();
});
it('calls onClose when close button is clicked', async () => {
const attachment = makeAttachment();
render(<FilePreviewModal attachment={attachment} onClose={onClose} />);
await waitFor(() => {
expect(screen.getByRole('button', { name: /filepreview.close/i })).toBeInTheDocument();
});
await userEvent.click(screen.getByRole('button', { name: /filepreview.close/i }));
expect(onClose).toHaveBeenCalledTimes(1);
});
it('calls onUnavailable when API returns 404', async () => {
const notFoundError = Object.assign(new Error('Not Found'), {
response: { status: 404 },
});
mockApiGet.mockRejectedValue(notFoundError);
const attachment = makeAttachment({ publicId: 'missing-att-001' });
render(
<FilePreviewModal
attachment={attachment}
onClose={onClose}
onUnavailable={onUnavailable}
/>
);
await waitFor(() => {
expect(onUnavailable).toHaveBeenCalledWith('missing-att-001');
});
});
it('does not render when attachment is null (dialog closed)', () => {
render(<FilePreviewModal attachment={null} onClose={onClose} />);
// Dialog should not be visible
expect(screen.queryByRole('dialog')).not.toBeInTheDocument();
});
});
@@ -0,0 +1,113 @@
// T054: Vitest test for DSLEditor — validates onValidationChange callback and Save button disable logic
// Verifies: clicking Validate calls workflowApi.validateDSL; errors → onValidationChange(true); valid → onValidationChange(false)
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { render, screen, waitFor } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { DSLEditor } from '../dsl-editor';
import { workflowApi } from '@/lib/api/workflows';
// Mock the Monaco editor; no DOM environment for Monaco in tests
vi.mock('@monaco-editor/react', () => ({
default: ({ onChange }: { onChange?: (v: string) => void }) => (
<textarea
data-testid="monaco-editor"
onChange={(e) => onChange?.(e.target.value)}
/>
),
}));
// Mock next-themes
vi.mock('next-themes', () => ({
useTheme: () => ({ theme: 'light' }),
}));
// Mock workflowApi.validateDSL
vi.mock('@/lib/api/workflows', () => ({
workflowApi: {
validateDSL: vi.fn(),
},
}));
const mockValidateDSL = vi.mocked(workflowApi.validateDSL);
describe('DSLEditor (T054)', () => {
const onValidationChange = vi.fn();
beforeEach(() => {
vi.clearAllMocks();
});
it('calls workflowApi.validateDSL when Validate button is clicked', async () => {
mockValidateDSL.mockResolvedValue({ valid: true });
render(<DSLEditor initialValue="workflow: test" onValidationChange={onValidationChange} />);
await userEvent.click(screen.getByRole('button', { name: /validate/i }));
await waitFor(() => {
expect(mockValidateDSL).toHaveBeenCalledWith('workflow: test');
});
});
it('calls onValidationChange(true) when validation returns errors', async () => {
mockValidateDSL.mockResolvedValue({
valid: false,
errors: ['DSL must have at least one state'],
});
render(<DSLEditor initialValue="bad: dsl" onValidationChange={onValidationChange} />);
await userEvent.click(screen.getByRole('button', { name: /validate/i }));
await waitFor(() => {
expect(onValidationChange).toHaveBeenCalledWith(true);
});
// the error message is shown in the UI
expect(
screen.getByText('DSL must have at least one state')
).toBeInTheDocument();
});
it('calls onValidationChange(false) when validation returns valid', async () => {
mockValidateDSL.mockResolvedValue({ valid: true });
render(<DSLEditor initialValue="workflow: rfa" onValidationChange={onValidationChange} />);
await userEvent.click(screen.getByRole('button', { name: /validate/i }));
await waitFor(() => {
expect(onValidationChange).toHaveBeenCalledWith(false);
});
// the success message is shown
expect(screen.getByText(/valid and ready/i)).toBeInTheDocument();
});
it('calls onValidationChange(true) on server error', async () => {
mockValidateDSL.mockRejectedValue(new Error('Network error'));
render(<DSLEditor initialValue="workflow: test" onValidationChange={onValidationChange} />);
await userEvent.click(screen.getByRole('button', { name: /validate/i }));
await waitFor(() => {
expect(onValidationChange).toHaveBeenCalledWith(true);
});
});
it('does not call onValidationChange when prop is not provided', async () => {
mockValidateDSL.mockResolvedValue({ valid: true });
// onValidationChange not provided; must not throw
render(<DSLEditor initialValue="workflow: test" />);
await userEvent.click(screen.getByRole('button', { name: /validate/i }));
await waitFor(() => {
expect(mockValidateDSL).toHaveBeenCalled();
});
// no error thrown
});
});
+6 -1
@@ -14,9 +14,11 @@ interface DSLEditorProps {
initialValue?: string;
onChange?: (value: string) => void;
readOnly?: boolean;
// FR-025: callback fired when the validation result changes; the parent uses it to disable the Save button
onValidationChange?: (hasErrors: boolean) => void;
}
- export function DSLEditor({ initialValue = '', onChange, readOnly = false }: DSLEditorProps) {
+ export function DSLEditor({ initialValue = '', onChange, readOnly = false, onValidationChange }: DSLEditorProps) {
const [dsl, setDsl] = useState(initialValue);
const [validationResult, setValidationResult] = useState<ValidationResult | null>(null);
const [isValidating, setIsValidating] = useState(false);
@@ -47,9 +49,12 @@ export function DSLEditor({ initialValue = '', onChange, readOnly = false }: DSL
try {
const result = await workflowApi.validateDSL(dsl);
setValidationResult(result);
// FR-025: tell the parent whether validation errors exist
onValidationChange?.(!result.valid);
} catch (_error) {
// Validation failed - error state shown in UI
setValidationResult({ valid: false, errors: ['Validation failed due to server error'] });
onValidationChange?.(true);
} finally {
setIsValidating(false);
}
+9 -1
@@ -67,10 +67,18 @@ export function useWorkflowAction(instanceId: string | undefined) {
return;
}
- // Clarify Q1: 409 Conflict (not in a state that permits upload)
+ // Clarify Q1: 409 Conflict (state violation or optimistic lock conflict)
if (statusCode === 409) {
// M3: reset the idempotency key; the user's intent no longer matches the old state
setIdempotencyKey(uuidv4());
// FR-002: optimistic lock conflict; show a specific message telling the user to refresh
const isVersionConflict = error.error.code === 'WORKFLOW_VERSION_CONFLICT';
if (isVersionConflict) {
toast.error('เอกสารถูกอนุมัติโดยผู้อื่นแล้ว กรุณารีเฟรช', {
description: 'ข้อมูลที่คุณกำลังดูอาจล้าสมัย กรุณาโหลดหน้าใหม่แล้วลองอีกครั้ง',
});
return;
}
toast.error(message || 'ไม่สามารถดำเนินการในสถานะนี้ได้', {
description: recoveryActions?.[0],
});
+8
@@ -77,3 +77,11 @@ export const useGetAvailableActions = () => {
mutationFn: (data: GetAvailableActionsDto) => workflowEngineService.getAvailableActions(data),
});
};
// FR-025: Inline DSL validation (POST /workflow-engine/definitions/validate)
export const useValidateDsl = () => {
return useMutation({
mutationFn: (dsl: Record<string, unknown>) =>
workflowEngineService.validateDsl(dsl),
});
};
@@ -178,6 +178,23 @@ export const workflowEngineService = {
return response.data?.data || response.data;
},
/**
* FR-025: validate a DSL
* POST /workflow-engine/definitions/validate
*/
validateDsl: async (
dsl: Record<string, unknown>
): Promise<
| { valid: true }
| { valid: false; errors: { path: string; message: string }[] }
> => {
const response = await apiClient.post(
'/workflow-engine/definitions/validate',
{ dsl }
);
return (response.data as { data?: unknown })?.data ?? response.data;
},
/**
* Workflow Definition
* DELETE /workflow-engine/definitions/:id
@@ -75,4 +75,7 @@ export interface WorkflowTransitionWithAttachmentsDto {
/** publicIds of the attachments for this step (max 20, ADR-016 two-phase upload) */
attachmentPublicIds?: string[];
/** FR-002: client-side optimistic lock version, sent with every transition to detect conflicts (HTTP 409) */
versionNo?: number;
}
+18
@@ -1,6 +1,18 @@
{
"version": 1,
"skills": {
"diagnose": {
"source": "mattpocock/skills",
"sourceType": "github",
"skillPath": "skills/engineering/diagnose/SKILL.md",
"computedHash": "1c3c85517ac42116fe5f2bfb5150f7b3e38ad23808e40b33fbb01f1afb611983"
},
"grill-with-docs": {
"source": "mattpocock/skills",
"sourceType": "github",
"skillPath": "skills/engineering/grill-with-docs/SKILL.md",
"computedHash": "e95d83038cb68774469932969b060438bc457973657269a479571321c93a9140"
},
"nestjs-best-practices": {
"source": "Kadajett/agent-nestjs-skills",
"sourceType": "github",
@@ -10,6 +22,12 @@
"source": "vercel-labs/next-skills",
"sourceType": "github",
"computedHash": "c31ddd68ba5798a79a516c619b1e1fc4408cc4876ad600fefa4f5add4425719d"
},
"setup-matt-pocock-skills": {
"source": "mattpocock/skills",
"sourceType": "github",
"skillPath": "skills/engineering/setup-matt-pocock-skills/SKILL.md",
"computedHash": "ab6e8143f9237f970435d95e94a0f79703faf125a0b8c583b35ee7fe340eeefe"
}
}
}
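The `computedHash` fields look like SHA-256 digests of the referenced `SKILL.md` files. A sketch of how such a digest could be recomputed locally (the hashing scheme is an assumption — the lockfile doesn't document it):

```shell
# Assumption: computedHash is the SHA-256 of the raw SKILL.md bytes.
printf 'skill file content\n' > /tmp/SKILL.md  # stand-in for a fetched SKILL.md
sha256sum /tmp/SKILL.md | cut -d' ' -f1
```

Comparing this digest against the lockfile entry would detect upstream drift in a pinned skill.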
@@ -0,0 +1,45 @@
# Specification Quality Checklist: Unified Workflow Engine — Production Hardening & Integrated Context
**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-05-02
**Feature**: [spec.md](../spec.md)
---
## Content Quality
- [~] No implementation details (languages, frameworks, APIs) — *Note: Technology-specific terms (Redis, BullMQ, ClamAV, JSON Logic) are present in FRs as ADR-mandated architectural constraints (ADR-001/ADR-008/ADR-016), not spec-level implementation choices. Consistent with existing `001-transmittals-circulation/spec.md` pattern.*
- [x] Focused on user value and business needs
- [~] Written for non-technical stakeholders — *Note: Platform/infrastructure feature; technical Functional Requirements (FR-001 to FR-021) intentionally use ADR terminology. User Stories (P1-P3) and Success Criteria are non-technical.*
- [x] All mandatory sections completed
## Requirement Completeness
- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified
## Feature Readiness
- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification
## Notes
- Spec derived from ADR-001 (Unified Workflow Engine v1.1 — 2026-05-02 production hardening) and ADR-021 (Integrated Workflow Context & Step-specific Attachments)
- **Clarification session 2026-05-02 (5/5 questions resolved):**
- Q1: DSL `require.role` → CASL ability check (FR-002a)
- Q2: Observability = structured log + metrics (FR-022, FR-023, SC-009)
- Q3: File rollback on DB failure = move back to temp, 24h TTL (FR-019)
- Q4: Admin UI for DSL authoring is IN scope (FR-024, FR-025)
- Q5: All 4 modules (RFA/Transmittal/Circulation/Correspondence) need banner gap-filling (FR-011, Assumptions updated)
- ADR-001 clarifications fully captured in FR-001 through FR-010 and SC-001 through SC-005
- ADR-021 requirements (REQ-01 to REQ-06) fully captured in FR-011 through FR-025 and SC-006 through SC-009
- Visual workflow builder (drag-and-drop DSL editor) is explicitly **out of scope** (Phase 2)
@@ -0,0 +1,205 @@
openapi: "3.1.0"
info:
title: Workflow Engine — Definitions API
version: "1.1.0"
description: |
Endpoints for managing workflow DSL definitions.
Requires system.manage_all (Super Admin only) for all write operations (FR-009).
Includes DSL validation endpoint for Admin UI inline feedback (FR-025).
paths:
/workflow-engine/definitions:
get:
summary: List all workflow definitions (latest version per code)
tags: [WorkflowDefinitions]
security:
- BearerAuth: []
responses:
"200":
description: Array of latest definitions
content:
application/json:
schema:
type: array
items:
$ref: "#/components/schemas/WorkflowDefinitionDto"
post:
summary: Create a new workflow definition (auto-increments version)
description: |
Creates a new version for the given workflow_code.
DSL is compiled and validated (Phase 1 save-time check — FR-008).
Requires system.manage_all permission.
tags: [WorkflowDefinitions]
security:
- BearerAuth: []
requestBody:
required: true
content:
application/json:
schema:
$ref: "#/components/schemas/CreateWorkflowDefinitionDto"
responses:
"201":
description: Definition created
content:
application/json:
schema:
$ref: "#/components/schemas/WorkflowDefinitionDto"
"400":
description: DSL structure validation failed (Phase 1)
"403":
description: Requires system.manage_all
/workflow-engine/definitions/{id}:
get:
summary: Get a specific definition by UUID
tags: [WorkflowDefinitions]
security:
- BearerAuth: []
parameters:
- name: id
in: path
required: true
schema:
type: string
format: uuid
responses:
"200":
content:
application/json:
schema:
$ref: "#/components/schemas/WorkflowDefinitionDto"
patch:
summary: Update a workflow definition (DSL or is_active toggle)
description: |
Updating DSL re-compiles and re-validates (Phase 1).
Toggling is_active=true invalidates the Redis active pointer cache immediately (FR-007, SC-005).
In-progress instances are NOT rebound (FR-010).
Requires system.manage_all.
tags: [WorkflowDefinitions]
security:
- BearerAuth: []
parameters:
- name: id
in: path
required: true
schema:
type: string
format: uuid
requestBody:
required: true
content:
application/json:
schema:
$ref: "#/components/schemas/UpdateWorkflowDefinitionDto"
responses:
"200":
content:
application/json:
schema:
$ref: "#/components/schemas/WorkflowDefinitionDto"
"400":
description: DSL validation failed
"403":
description: Requires system.manage_all
/workflow-engine/definitions/validate:
post:
summary: Validate a DSL JSON without saving (for Admin UI inline feedback — FR-025)
description: |
Runs Phase 1 (structure) validation only. Returns errors per field.
Read-only and causes no state change, but still requires a JWT,
like the rest of the Admin UI endpoints.
tags: [WorkflowDefinitions]
security:
- BearerAuth: []
requestBody:
required: true
content:
application/json:
schema:
type: object
required: [dsl]
properties:
dsl:
type: object
description: DSL JSON to validate
responses:
"200":
description: Validation result
content:
application/json:
schema:
$ref: "#/components/schemas/DslValidationResultDto"
components:
schemas:
WorkflowDefinitionDto:
type: object
properties:
id:
type: string
format: uuid
workflowCode:
type: string
example: RFA_FLOW_V1
version:
type: integer
example: 2
isActive:
type: boolean
dsl:
type: object
description: Raw DSL JSON (JSON Logic conditions only — no eval/new Function)
createdAt:
type: string
format: date-time
CreateWorkflowDefinitionDto:
type: object
required: [workflow_code, dsl]
properties:
workflow_code:
type: string
example: RFA_FLOW_V2
dsl:
type: object
description: DSL JSON — must use JSON Logic format for conditions (FR-001)
is_active:
type: boolean
default: true
UpdateWorkflowDefinitionDto:
type: object
properties:
dsl:
type: object
is_active:
type: boolean
workflow_code:
type: string
DslValidationResultDto:
type: object
properties:
valid:
type: boolean
errors:
type: array
items:
type: object
properties:
path:
type: string
description: JSON path to the invalid field (e.g. "states.DRAFT.transitions")
message:
type: string
description: Human-readable error description
securitySchemes:
BearerAuth:
type: http
scheme: bearer
bearerFormat: JWT
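The validate endpoint above lends itself to a quick smoke test from a shell. A minimal sketch — the base URL, token, and DSL payload are placeholders, and the `echo` makes it a dry run (remove it to actually send the request):

```shell
# Placeholders: substitute your deployment's values.
BASE_URL="http://localhost:3000"
TOKEN="<jwt>"
BODY='{"dsl":{"workflow_code":"RFA_FLOW_V1","states":{}}}'

# Dry run: prints the curl invocation for POST /workflow-engine/definitions/validate (FR-025).
echo curl -s -X POST "$BASE_URL/workflow-engine/definitions/validate" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "$BODY"
```

A `{"valid": false, "errors": [...]}` response maps directly onto `DslValidationResultDto`, which the Admin UI renders inline.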
@@ -0,0 +1,276 @@
openapi: "3.1.0"
info:
title: Workflow Engine — Transition API
version: "1.1.0"
description: |
Endpoints for triggering workflow state transitions.
ADR-001 v1.1: Added version_no (optimistic lock) and action_by_user_uuid.
ADR-021: Step-specific attachment support via attachmentPublicIds.
paths:
/workflow-engine/instances/{id}/transition:
post:
summary: Trigger a workflow state transition
description: |
Transitions the workflow instance to the next state based on the DSL definition.
Requires Idempotency-Key header (ADR-016).
Optionally includes pre-uploaded attachment publicIds (ADR-021).
Supports optimistic concurrency control via versionNo (ADR-001 v1.1).
tags: [WorkflowEngine]
security:
- BearerAuth: []
parameters:
- name: id
in: path
required: true
schema:
type: string
format: uuid
description: Workflow Instance UUID
- name: Idempotency-Key
in: header
required: true
schema:
type: string
format: uuid
description: UUIDv7 idempotency key — duplicate requests return cached response
requestBody:
required: true
content:
application/json:
schema:
$ref: "#/components/schemas/WorkflowTransitionDto"
responses:
"200":
description: Transition successful
content:
application/json:
schema:
$ref: "#/components/schemas/WorkflowTransitionResponseDto"
"409":
description: |
Conflict — one of:
- version_no mismatch (optimistic lock) — refresh and retry
- Terminal state — cannot transition further
- Upload rejected (state not in PENDING_REVIEW/PENDING_APPROVAL)
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
"422":
description: DSL condition not met or required context field missing
content:
application/json:
schema:
$ref: "#/components/schemas/ValidationErrorResponse"
"403":
description: User lacks the required CASL ability for this transition
"503":
description: Redlock unavailable — retry after brief delay
/workflow-engine/instances/{id}:
get:
summary: Get workflow instance state
description: Returns current state, available actions, and versionNo for optimistic locking.
tags: [WorkflowEngine]
security:
- BearerAuth: []
parameters:
- name: id
in: path
required: true
schema:
type: string
format: uuid
responses:
"200":
description: Instance details
content:
application/json:
schema:
$ref: "#/components/schemas/WorkflowInstanceDto"
/workflow-engine/instances/{id}/history:
get:
summary: Get workflow history (timeline)
description: Returns all transition records for a workflow instance, including step-specific attachments.
tags: [WorkflowEngine]
security:
- BearerAuth: []
parameters:
- name: id
in: path
required: true
schema:
type: string
format: uuid
responses:
"200":
description: History items
content:
application/json:
schema:
type: array
items:
$ref: "#/components/schemas/WorkflowHistoryItemDto"
components:
schemas:
WorkflowTransitionDto:
type: object
required: [action]
properties:
action:
type: string
example: APPROVE
description: Action name matching a DSL transition key
comment:
type: string
maxLength: 2000
description: Optional decision comment
versionNo:
type: integer
minimum: 1
description: |
Current version_no from the client. If provided, triggers optimistic
lock check — returns 409 if mismatch (ADR-001 v1.1 FR-002).
example: 5
payload:
type: object
additionalProperties: true
description: Additional context fields required by DSL conditions
attachmentPublicIds:
type: array
items:
type: string
format: uuid
maxItems: 20
description: |
Pre-uploaded attachment UUIDs (ADR-021). Files must have been
uploaded via Two-Phase upload and passed ClamAV scan before
this request. Only valid in PENDING_REVIEW or PENDING_APPROVAL.
WorkflowTransitionResponseDto:
type: object
properties:
success:
type: boolean
example: true
previousState:
type: string
example: PENDING_REVIEW
nextState:
type: string
example: PENDING_APPROVAL
historyId:
type: string
format: uuid
description: UUID of the created WorkflowHistory record
isCompleted:
type: boolean
description: True if the transition reached a terminal state
versionNo:
type: integer
description: Updated versionNo after successful transition — client must store for next request
WorkflowInstanceDto:
type: object
properties:
id:
type: string
format: uuid
currentState:
type: string
example: PENDING_REVIEW
status:
type: string
enum: [ACTIVE, COMPLETED, CANCELLED, TERMINATED]
versionNo:
type: integer
description: Current optimistic lock version — include in next transition request
availableActions:
type: array
items:
type: string
example: [APPROVE, REJECT, RETURN]
workflowCode:
type: string
example: RFA_FLOW_V1
WorkflowHistoryItemDto:
type: object
properties:
id:
type: string
format: uuid
fromState:
type: string
toState:
type: string
action:
type: string
actorUuid:
type: string
format: uuid
description: UUID of the acting user (ADR-019 — INT FK excluded from API)
actorName:
type: string
description: Populated via user join for display
comment:
type: string
nullable: true
createdAt:
type: string
format: date-time
attachments:
type: array
items:
$ref: "#/components/schemas/AttachmentSummaryDto"
AttachmentSummaryDto:
type: object
properties:
publicId:
type: string
format: uuid
description: ADR-019 public identifier
originalFilename:
type: string
mimeType:
type: string
fileSize:
type: integer
createdAt:
type: string
format: date-time
ErrorResponse:
type: object
properties:
userMessage:
type: string
recoveryAction:
type: string
errorCode:
type: string
ValidationErrorResponse:
type: object
properties:
userMessage:
type: string
fields:
type: array
items:
type: object
properties:
field:
type: string
message:
type: string
securitySchemes:
BearerAuth:
type: http
scheme: bearer
bearerFormat: JWT
# Data Model: Unified Workflow Engine — Production Hardening
**Phase 1 Output** | Generated: 2026-05-02
**Extends**: `specs/08-Tasks/ADR-021-workflow-context/data-model.md` (deltas 01-08 already applied)
---
## 1. Schema Deltas
### Delta 09 — `version_no` on `workflow_instances`
**File**: `specs/03-Data-and-Storage/deltas/09-add-version-no-to-workflow-instances.sql`
```sql
-- ============================================================
-- Delta 09: ADR-001 v1.1 — Optimistic Lock
-- Adds version_no to workflow_instances for optimistic concurrency control
-- ============================================================
-- Caution: existing rows automatically receive DEFAULT 1, so there is no data loss
-- Rollback: ALTER TABLE workflow_instances DROP COLUMN version_no;
ALTER TABLE workflow_instances
ADD COLUMN version_no INT NOT NULL DEFAULT 1
COMMENT 'Optimistic lock counter — incremented on every successful transition (ADR-001 v1.1 FR-002)';
-- Index to support the CAS check: WHERE id = ? AND version_no = ?
CREATE INDEX idx_wf_inst_version
ON workflow_instances (id, version_no);
```
**Migration Notes (ADR-009):**
- Apply via MariaDB CLI or n8n delta workflow; no TypeORM migration file
- Existing instances get `version_no = 1` — no disruption to active workflows
- Rollback: `ALTER TABLE workflow_instances DROP INDEX idx_wf_inst_version; ALTER TABLE workflow_instances DROP COLUMN version_no;`
---
### Delta 10 — `action_by_user_uuid` on `workflow_histories`
**File**: `specs/03-Data-and-Storage/deltas/10-add-action-by-user-uuid-to-workflow-histories.sql`
```sql
-- ============================================================
-- Delta 10: ADR-001 v1.1 / ADR-019 UUID Compliance
-- Adds action_by_user_uuid to workflow_histories
-- to expose user identity via the API without revealing the INT PK (ADR-019)
-- ============================================================
-- Caution: NULL for historical records created before this delta (acceptable)
-- Rollback: ALTER TABLE workflow_histories DROP COLUMN action_by_user_uuid;
ALTER TABLE workflow_histories
ADD COLUMN action_by_user_uuid VARCHAR(36) NULL
COMMENT 'UUID of the acting user, used in API responses (ADR-019). The INT FK action_by_user_id remains for internal use';
```
**Migration Notes (ADR-009):**
- NULL for historical records is acceptable; API consumers treat NULL as "system action" or "pre-migration"
- Populate on all new transitions from this delta forward
---
## 2. Backend Entity Changes
### 2.1 `workflow-instance.entity.ts` — Add `versionNo`
**File**: `backend/src/modules/workflow-engine/entities/workflow-instance.entity.ts`
```typescript
// Add after the updatedAt column
@Column({
name: 'version_no',
type: 'int',
default: 1,
comment: 'Optimistic lock — incremented on each successful transition (ADR-001 v1.1)',
})
versionNo!: number;
```
**Import to add**: No new imports needed.
---
### 2.2 `workflow-history.entity.ts` — Add `actionByUserUuid`
**File**: `backend/src/modules/workflow-engine/entities/workflow-history.entity.ts`
```typescript
// Add after the actionByUserId column
@Column({
name: 'action_by_user_uuid',
length: 36,
nullable: true,
comment: 'UUID of the acting user, exposed in API responses per ADR-019',
})
actionByUserUuid?: string;
```
---
### 2.3 `workflow-history-item.dto.ts` — Add `actorUuid`
**File**: `backend/src/modules/workflow-engine/dto/workflow-history-item.dto.ts`
```typescript
// Add this field to WorkflowHistoryItemDto
@ApiPropertyOptional({
description: 'UUID of the acting user (ADR-019)',
example: '019505a1-7c3e-7000-8000-abc123def456',
})
actorUuid?: string;
```
---
## 3. `processTransition()` — Optimistic Lock Changes
### Updated signature
```typescript
async processTransition(
instanceId: string,
action: string,
userId: number,
userUuid: string, // NEW: ADR-019 UUID for history record
comment?: string,
payload: Record<string, unknown> = {},
attachmentPublicIds?: string[],
clientVersionNo?: number, // NEW: Optimistic lock — sent by client
)
```
### Fast-fail check (before Redlock)
```typescript
if (clientVersionNo !== undefined) {
const current = await this.instanceRepo.findOne({
where: { id: instanceId },
select: ['id', 'versionNo'],
});
if (!current) throw new NotFoundException('Workflow Instance', instanceId);
if (current.versionNo !== clientVersionNo) {
throw new ConflictException(
'WORKFLOW_VERSION_CONFLICT',
`Expected version_no=${clientVersionNo}, actual=${current.versionNo}`,
'เอกสารถูกอนุมัติโดยผู้อื่นแล้ว กรุณารีเฟรช',
['รีเฟรชหน้าแล้วลองใหม่']
);
}
}
```
### History creation — add `actionByUserUuid`
```typescript
const history = this.historyRepo.create({
instanceId: instance.id,
fromState,
toState,
action,
actionByUserId: userId,
actionByUserUuid: userUuid, // NEW
comment,
metadata: { events: evaluation.events },
});
```
### Version increment (inside DB transaction, after history save)
```typescript
// CAS update: if version_no changed in the meantime (TOCTOU), no rows are updated
const result = await queryRunner.manager
.createQueryBuilder()
.update(WorkflowInstance)
.set({ versionNo: () => 'version_no + 1' })
.where('id = :id AND version_no = :expected', {
id: instanceId,
expected: instance.versionNo,
})
.execute();
if (result.affected === 0) {
// TOCTOU: version changed under pessimistic lock (edge case — should not normally occur)
throw new ConflictException(
'WORKFLOW_VERSION_CONFLICT',
'version_no changed between lock acquisition and update',
'เกิด Conflict กรุณารีเฟรชและลองใหม่',
['รีเฟรชหน้า', 'ลองดำเนินการอีกครั้ง']
);
}
```
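The CAS semantics can be checked in isolation. A minimal sketch, with an in-memory row standing in for `workflow_instances` (all names here are illustrative, not the real entities): of two writers holding the same expected `version_no`, exactly one succeeds; the other sees 0 affected rows and must surface HTTP 409.

```typescript
// In-memory model of the CAS update above: mirrors
//   UPDATE ... SET version_no = version_no + 1
//   WHERE id = :id AND version_no = :expected
interface InstanceRow { id: string; versionNo: number; }

function casIncrement(row: InstanceRow, expected: number): number {
  if (row.versionNo !== expected) return 0; // stale expectation: 0 affected rows
  row.versionNo += 1;
  return 1; // affected rows
}

const row: InstanceRow = { id: 'wf-1', versionNo: 5 };
const first = casIncrement(row, 5);  // wins: affected = 1, versionNo becomes 6
const second = casIncrement(row, 5); // stale expected value: affected = 0
```

The guarantee holds regardless of which writer runs first, which is why the Redlock plus CAS combination is described as defense-in-depth.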
---
## 4. `processTransition()` — Structured Observability Changes
### New metric injections in constructor
```typescript
@InjectMetric('workflow_transitions_total')
private readonly transitionsTotal: Counter<string>,
@InjectMetric('workflow_transition_duration_ms')
private readonly transitionDuration: Histogram<string>,
```
### Wrap in timer + log
```typescript
const startMs = Date.now();
let outcome: 'success' | 'conflict' | 'forbidden' | 'validation_error' | 'system_error' = 'system_error';
let workflowCode = 'unknown';
try {
// ... existing processTransition logic ...
workflowCode = instance.definition.workflow_code;
outcome = 'success';
} catch (err) {
if (err instanceof ConflictException) outcome = 'conflict';
else if (err instanceof ForbiddenException) outcome = 'forbidden';
else if (err instanceof WorkflowException) outcome = 'validation_error';
throw err;
} finally {
const durationMs = Date.now() - startMs;
this.transitionDuration.labels({ workflow_code: workflowCode }).observe(durationMs);
this.transitionsTotal.labels({ workflow_code: workflowCode, action, outcome }).inc();
this.logger.log(JSON.stringify({
instanceId, action, fromState: instance?.currentState,
toState: outcome === 'success' ? toState : undefined,
userUuid, durationMs, outcome, workflowCode,
}));
}
```
### Module registration (in `workflow-engine.module.ts`)
```typescript
import { makeCounterProvider, makeHistogramProvider } from '@willsoto/nestjs-prometheus';
// Add to providers array:
makeCounterProvider({
name: 'workflow_transitions_total',
help: 'Total workflow transitions by code, action, and outcome',
labelNames: ['workflow_code', 'action', 'outcome'],
}),
makeHistogramProvider({
name: 'workflow_transition_duration_ms',
help: 'Workflow transition duration in milliseconds',
labelNames: ['workflow_code'],
buckets: [50, 100, 250, 500, 1000, 2500, 5000],
}),
```
---
## 5. DSL Cache Changes (FR-007)
### Cache methods in `workflow-engine.service.ts`
```typescript
// In createDefinition(), after save
await this.cacheManager.set(
`wf:def:${saved.workflow_code}:${saved.version}`,
saved,
3600 * 1000 // 1 hour in ms (cache-manager v5 uses ms)
);
// In update(), before save (if the DSL changed)
await this.cacheManager.del(`wf:def:${definition.workflow_code}:${definition.version}`);
// In activate/deactivate: invalidate the active pointer
await this.cacheManager.del(`wf:def:${definition.workflow_code}:active`);
if (dto.is_active === true) {
await this.cacheManager.set(
`wf:def:${definition.workflow_code}:active`,
saved,
3600 * 1000
);
}
```
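The set/del calls above imply a read-through lookup on the hot path. A hedged sketch, with a `Map` standing in for `cacheManager` (whose real calls are awaited and take a TTL in ms); the key scheme matches the one used above:

```typescript
// Read-through lookup for FR-007: try the cache, fall back to the loader,
// then populate the cache for subsequent readers.
function getDefinition(
  code: string,
  version: number,
  cache: Map<string, unknown>,
  loadFromDb: (code: string, version: number) => unknown,
): unknown {
  const key = `wf:def:${code}:${version}`;
  if (cache.has(key)) return cache.get(key); // hit: skip the DB entirely
  const fresh = loadFromDb(code, version);
  cache.set(key, fresh); // real code also passes the 1h TTL here
  return fresh;
}
```

Invalidation (the `del` calls) works because both writer and reader agree on this key scheme; a mismatch would silently serve stale definitions for up to the TTL.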
---
## 6. BullMQ DLQ + n8n Webhook Changes (FR-005, FR-006)
### `workflow-event.service.ts` additions
```typescript
// In WorkflowEventProcessor:
@OnWorkerEvent('failed')
async onJobFailed(job: Job, error: Error): Promise<void> {
// Check whether retries are exhausted
if ((job.attemptsMade ?? 0) >= (job.opts.attempts ?? 3)) {
// Send to the DLQ
await this.failedQueue.add('dead-letter', {
originalJobId: job.id,
queue: 'workflow-events',
data: job.data,
failedAt: new Date().toISOString(),
error: error.message,
});
// Notify ops via the n8n webhook (if configured)
const webhookUrl = process.env.N8N_WEBHOOK_URL;
if (webhookUrl) {
try {
await fetch(webhookUrl, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
event: 'workflow_event_failed',
jobId: job.id,
workflowCode: job.data?.workflowCode,
instanceId: job.data?.instanceId,
error: error.message,
timestamp: new Date().toISOString(),
}),
});
} catch (webhookErr) {
// Warn only; do not throw, so the DLQ add is unaffected
this.logger.warn(`n8n webhook failed: ${(webhookErr as Error).message}`);
}
} else {
this.logger.warn('N8N_WEBHOOK_URL not configured — DLQ job created without ops notification');
}
}
}
```
### Worker configuration (verify/update in `workflow-engine.module.ts`)
```typescript
// Worker options (applied via the @Processor decorator in @nestjs/bullmq):
@Processor('workflow-events', {
  connection: { ... },
  concurrency: 5,
  limiter: { max: 50, duration: 60000 },
})
// Job default options (queue registration):
BullModule.registerQueue({
  name: 'workflow-events',
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 500 },
    removeOnComplete: { age: 86400 },
    removeOnFail: false, // Keep in failed state for Bull Board visibility
  },
}),
```
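With `backoff: { type: 'exponential', delay: 500 }`, BullMQ doubles the delay between attempts. A quick sketch of the resulting schedule (formula per BullMQ's documented built-in exponential strategy; worth verifying against the pinned version):

```typescript
// Delay before attempt N (1-based) under BullMQ's exponential strategy:
// base * 2^(attemptsMade - 1). With attempts: 3, the two retries wait
// 500 ms and 1000 ms before the job lands in 'failed' (and the DLQ
// handler above fires).
function exponentialDelay(attemptsMade: number, base = 500): number {
  return base * 2 ** (attemptsMade - 1);
}
```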
---
## 7. Updated Entity Relationship Diagram
```
workflow_definitions
workflow_code + version (unique)
is_active: BOOLEAN
│ 1
▼ N
workflow_instances
version_no: INT DEFAULT 1 ← NEW (Delta 09)
current_state: VARCHAR(50)
context: JSON
contract_id: INT NULL
│ 1
▼ N
workflow_histories
action_by_user_id: INT NULL ← existing (internal FK)
action_by_user_uuid: VARCHAR(36) ← NEW (Delta 10, ADR-019)
from_state / to_state / action
metadata: JSON
│ 1
▼ N
attachments
workflow_history_id: CHAR(36) NULL ← Delta 04 (already applied)
uuid: VARCHAR(36) ← publicId (ADR-019)
```
---
## 8. Index Strategy (updated)
| Table | Index | Columns | Purpose | Status |
|-------|-------|---------|---------|--------|
| `workflow_instances` | `idx_wf_inst_version` | `(id, version_no)` | Optimistic lock CAS check | **NEW** |
| `workflow_instances` | `idx_wf_inst_entity` | `(entity_type, entity_id)` | Polymorphic lookup | Existing |
| `workflow_histories` | `idx_wf_hist_instance` | `(instance_id)` | History per instance | Existing |
| `attachments` | `idx_att_wfhist_created` | `(workflow_history_id, created_at)` | Step attachments | Delta 04 |
# Implementation Plan: Unified Workflow Engine — Production Hardening & Integrated Context
**Branch**: `003-unified-workflow-engine` | **Date**: 2026-05-02 | **Spec**: [spec.md](./spec.md)
**Input**: Feature specification from `specs/003-unified-workflow-engine/spec.md`
---
## Summary
The Workflow Engine backend infrastructure is substantially implemented (service, entities, guards, DSL, Redlock, Prometheus metrics). This plan closes the remaining production-hardening gaps from ADR-001 v1.1 (optimistic lock, user UUID in history, CASL-mapped DSL roles, per-transition metrics, DSL Redis cache, DLQ + n8n webhook) and completes ADR-021 (step-specific attachment data-wiring in all 4 modules, file preview modal, Admin DSL editor UI).
Clarification decisions from `spec.md`:
- **Q1**: DSL `require.role` → CASL ability check (FR-002a)
- **Q2**: Observability = structured log + counter + histogram (FR-022, FR-023)
- **Q3**: File rollback on DB failure = move back to temp, 24h TTL (FR-019)
- **Q4**: Admin DSL editor UI is in scope (FR-024, FR-025)
- **Q5**: All 4 modules need banner gap-filling (FR-011)
---
## Technical Context
**Language/Version**: TypeScript 5.4, Node.js 20 LTS
**Primary Dependencies**: NestJS 10, TypeORM 0.3, BullMQ 5, `@willsoto/nestjs-prometheus`, `json-logic-js`, `redlock`, `ioredis`
**Frontend**: Next.js 14 (App Router), TanStack Query v5, React Hook Form + Zod, shadcn/ui
**Storage**: MariaDB 10.11, Redis 7, StorageService (Two-Phase Upload per ADR-016)
**Testing**: Jest + `@nestjs/testing` (backend), Vitest (frontend)
**Target Platform**: QNAP NAS Docker Compose (backend), Next.js SSR (frontend)
**Performance Goals**: Transition P95 < 1s (no upload); upload+transition P95 < 5s; cache invalidation < 1s across all instances
**Constraints**: ADR-009 (no TypeORM migrations), ADR-019 (UUID strings, no parseInt), ADR-016 (Two-Phase Upload), ADR-008 (BullMQ async)
**Scale/Scope**: 4 document modules × ~50 active workflows concurrently; up to 20 history records per instance
---
## Constitution Check
_GATE: Must pass before Phase 0. Re-checked after Phase 1 design._
| Gate | Rule | Status | Notes |
|------|------|--------|-------|
| ADR-019 UUID | No `parseInt` on UUIDs; expose `publicId` strings only | ✅ PASS | `WorkflowInstance.id` and `WorkflowHistory.id` are UUID PKs (native CHAR(36)); `action_by_user_uuid` addition follows pattern |
| ADR-009 Schema | No TypeORM migrations; edit SQL directly | ✅ PASS | Two new delta files planned (delta-09, delta-10) |
| ADR-016 Security | Two-Phase upload; ClamAV; whitelist | ✅ PASS | Already implemented in `processTransition()`; file preview uses existing attachment endpoint |
| ADR-008 BullMQ | Async notifications; no inline dispatch | ✅ PASS | `WorkflowEventService` dispatches to `workflow-events` queue; DLQ is the gap |
| ADR-007 Errors | Layered exception hierarchy | ✅ PASS | `WorkflowException`, `ConflictException`, `ServiceUnavailableException` already in use |
| ADR-002 Numbering | Redlock for document numbering | ✅ N/A | Workflow engine does not generate document numbers |
| ADR-018/020 AI | No AI direct DB access | ✅ N/A | No AI integration in this feature |
| FR-002 Optimistic Lock | `version_no` column on `workflow_instances` | ⚠️ GAP | Column missing — delta-09 required |
| FR-003 User UUID | `action_by_user_uuid` on `workflow_histories` | ⚠️ GAP | Column missing — delta-10 required |
**Post-gate verdict**: PASS with two schema deltas required before implementation begins.
---
## Project Structure
### Documentation (this feature)
```text
specs/003-unified-workflow-engine/
├── plan.md ← This file
├── research.md ← Phase 0 output
├── data-model.md ← Phase 1 output
├── quickstart.md ← Phase 1 output
└── contracts/ ← Phase 1 output
├── workflow-transition.yaml
└── workflow-definitions.yaml
```
### Source Code Layout
```text
backend/src/modules/workflow-engine/
├── entities/
│ ├── workflow-instance.entity.ts ← ADD versionNo column
│ └── workflow-history.entity.ts ← ADD actionByUserUuid column
├── guards/
│ └── workflow-transition.guard.ts ← ADD DSL require.role → CASL mapping (FR-002a)
├── dto/
│ └── workflow-history-item.dto.ts ← ADD actorUuid field
├── workflow-engine.service.ts ← ADD version_no check, structured log, metrics, cache invalidation
├── workflow-event.service.ts ← ADD DLQ processor + n8n webhook (FR-005/006)
└── workflow-engine.module.ts ← Register new metrics providers
specs/03-Data-and-Storage/deltas/
├── 09-add-version-no-to-workflow-instances.sql ← NEW
└── 10-add-action-by-user-uuid-to-workflow-histories.sql ← NEW
frontend/components/workflow/
├── integrated-banner.tsx ← GAP-FILL: step-attachment upload zone
├── workflow-lifecycle.tsx ← GAP-FILL: history items with attachment list
└── file-preview-modal.tsx ← NEW component
frontend/app/(admin)/admin/workflows/
└── definitions/
├── page.tsx ← NEW: DSL list + activate/deactivate
└── [id]/
└── page.tsx ← NEW: DSL JSON editor + inline validation
frontend/app/(admin)/admin/doc-control/
├── rfa/[uuid]/page.tsx ← GAP-FILL: availableActions, step-attach
├── transmittals/[uuid]/page.tsx ← GAP-FILL: step-attach upload zone
├── circulation/[uuid]/page.tsx ← GAP-FILL: step-attach upload zone
└── correspondence/[uuid]/page.tsx ← GAP-FILL + new IntegratedBanner wiring
```
---
## Implementation Phases
### Phase B1: Schema Deltas (prerequisite)
Apply before any code changes.
| Delta | File | Change |
|-------|------|--------|
| 09 | `09-add-version-no-to-workflow-instances.sql` | `ALTER TABLE workflow_instances ADD COLUMN version_no INT NOT NULL DEFAULT 1` |
| 10 | `10-add-action-by-user-uuid-to-workflow-histories.sql` | `ALTER TABLE workflow_histories ADD COLUMN action_by_user_uuid VARCHAR(36) NULL` |
### Phase B2: Entity & DTO Updates
| Task | File | Change |
|------|------|--------|
| B2-1 | `workflow-instance.entity.ts` | Add `versionNo` via plain `@Column({ name: 'version_no' })` (manual CAS increment; TypeORM `@VersionColumn()` is not used, since the service increments explicitly) |
| B2-2 | `workflow-history.entity.ts` | Add `@Column() actionByUserUuid?: string` |
| B2-3 | `workflow-history-item.dto.ts` | Add `actorUuid: string` field (exposed in API per ADR-019) |
### Phase B3: Optimistic Lock in `processTransition()` (FR-002)
In `workflow-engine.service.ts`:
1. Accept `clientVersionNo?: number` parameter in `processTransition()`
2. If provided: compare against `instance.versionNo` BEFORE Redlock acquisition → throw `ConflictException` (HTTP 409) if mismatch
3. After DB transaction commit: increment `instance.versionNo + 1` via `UPDATE workflow_instances SET version_no = version_no + 1 WHERE id = :id AND version_no = :expected`
4. No separate pessimistic lock change needed — keep both as defense-in-depth
### Phase B4: CASL Role Mapping in Guard (FR-002a)
In `workflow-transition.guard.ts`:
1. After Level 1 (Superadmin) check, extract DSL `require.role` from the current step config
2. Map each DSL role string to a CASL ability string via `DSL_ROLE_TO_CASL` config map
3. Check `userPermissions.includes(mappedAbility)` for any match → pass
4. Fall through to existing Level 3 (assignedUserId) check for `"AssignedHandler"` role
```typescript
const DSL_ROLE_TO_CASL: Record<string, string> = {
'Superadmin': 'system.manage_all',
'OrgAdmin': 'organization.manage_users',
'ContractMember': 'contract.view',
'AssignedHandler': '__assigned__', // resolved by existing Level 3 check
};
```
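Given that map, steps 1-4 reduce to a pure predicate over the user's permission strings. A minimal self-contained sketch (map repeated for completeness; the `__assigned__` sentinel and unknown roles defer to the Level 3 check rather than passing here):

```typescript
const DSL_ROLE_TO_CASL: Record<string, string> = {
  Superadmin: 'system.manage_all',
  OrgAdmin: 'organization.manage_users',
  ContractMember: 'contract.view',
  AssignedHandler: '__assigned__',
};

// True when any required DSL role maps to a CASL ability the user holds.
function passesRoleCheck(requiredRoles: string[], userPermissions: string[]): boolean {
  return requiredRoles.some((role) => {
    const ability = DSL_ROLE_TO_CASL[role];
    return ability !== undefined
      && ability !== '__assigned__' // handled by the assignedUserId check
      && userPermissions.includes(ability);
  });
}
```

Keeping this as a pure function makes the per-role unit tests in the Test Plan straightforward.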
### Phase B5: Structured Observability (FR-022, FR-023)
In `workflow-engine.service.ts`:
1. Inject two new metrics via `@InjectMetric()`:
- `workflow_transitions_total` (Counter: `workflow_code`, `action`, `outcome`)
- `workflow_transition_duration_ms` (Histogram: `workflow_code`)
2. Wrap `processTransition()` in a `startTimer` → `observe(duration)` block
3. Emit structured log on every outcome:
```typescript
this.logger.log(JSON.stringify({
instanceId, action, fromState, toState, userUuid,
durationMs, outcome, workflowCode
}));
```
4. Register providers in `workflow-engine.module.ts`
### Phase B6: DSL Redis Cache Invalidation (FR-007)
In `workflow-engine.service.ts`:
1. In `createDefinition()`: after save, call `cacheManager.set` with key `wf:def:{code}:{version}` and TTL `3600 * 1000` (cache-manager v5 takes ms)
2. In `update()`: call `cacheManager.del` on key `wf:def:{code}:{oldVersion}` before save
3. In `getDefinitionById()` / cached lookup: read-through with `cacheManager.get()` → fallback to DB
4. On `is_active` toggle: invalidate ALL `wf:def:{code}:*` keys (scan with `redis.scan()` and delete the matches; avoid the blocking `KEYS` command on production Redis)
### Phase B7: BullMQ DLQ + n8n Webhook (FR-005, FR-006)
In `workflow-event.service.ts`:
1. Add `workflow-events-failed` queue registration
2. Add `@OnWorkerEvent('failed')` handler in the processor class
3. On `attempts === maxAttempts`: POST to `process.env.N8N_WEBHOOK_URL` with job payload (never hardcoded)
4. Verify existing `workflow-events` worker has `concurrency: 5, attempts: 3, backoff: { type: 'exponential', delay: 500 }`
### Phase B8: File Rollback on Transaction Failure (FR-019)
In `workflow-engine.service.ts` `processTransition()`:
1. After file linkage step inside transaction, if `queryRunner.commitTransaction()` throws:
- Call `storageService.moveToTemp(attachmentPublicIds)` in the `catch` block
- Log the rollback with attachment IDs for audit
2. The 24h TTL on temp files is handled by existing `FileCleanupService` cron
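The rollback flow above can be sketched as a small wrapper; `commit` and `moveToTemp` are stand-ins for `queryRunner.commitTransaction()` and `storageService.moveToTemp()` (the real calls are awaited, and the real logger records the attachment IDs):

```typescript
// FR-019 sketch: if the DB commit fails after files were linked, move the
// uploaded attachments back to temp storage (the 24h-TTL cleanup cron
// handles deletion) and re-throw the original error.
function commitWithFileRollback(
  commit: () => void,
  moveToTemp: (publicIds: string[]) => void,
  attachmentPublicIds: string[],
): boolean {
  try {
    commit();
    return true;
  } catch (err) {
    if (attachmentPublicIds.length > 0) {
      moveToTemp(attachmentPublicIds); // files survive in temp for audit
    }
    throw err; // surface the DB error to the caller unchanged
  }
}
```

The key property is that the rollback never masks the original exception, so the API response still reflects the real failure.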
### Phase F1: File Preview Modal (FR-020)
New component: `frontend/components/workflow/file-preview-modal.tsx`
- Props: `attachment: WorkflowAttachmentSummary | null`, `onClose: () => void`
- Renders PDF via `<iframe src="/api/files/{publicId}/preview" />` for PDFs
- Renders `<img>` for image MIME types
- Falls back to download link for unsupported types
- Uses shadcn/ui `Dialog` component
### Phase F2: Step-Attachment Upload Zone (FR-014–FR-019)
In `integrated-banner.tsx`:
1. Show upload zone only when `currentState ∈ {PENDING_REVIEW, PENDING_APPROVAL}` AND user is assigned handler/org-admin/superadmin
2. Upload zone calls existing Two-Phase upload endpoint, then appends `publicId` to pending list
3. On action button click, pass `attachmentPublicIds` array to `use-workflow-action.ts` hook
4. On success: invalidate TanStack Query cache for document + history
In `workflow-lifecycle.tsx`:
1. For each history item, render `attachments[]` as clickable file chips
2. On click: open `FilePreviewModal`
### Phase F3: Module Banner Gap-Fill (FR-011, all 4 modules)
For each detail page (`rfa`, `transmittals`, `circulation`, `correspondence`):
1. Ensure service `findOneByUuid()` exposes: `workflowInstanceId`, `workflowState`, `availableActions`, `workflowPriority`
2. Pass live values to `<IntegratedBanner>` and `<WorkflowLifecycle>`
3. Add step-attachment upload zone via Phase F2 components
4. Verify `WorkflowHistoryItemDto` includes `attachments[]` in the history endpoint
Correspondence is the only module requiring new backend wiring (Transmittal + Circulation already done per v1.8.7; RFA has partial wiring — needs `availableActions` + step-attach).
### Phase F4: Admin DSL Editor UI (FR-024, FR-025)
New pages under `frontend/app/(admin)/admin/workflows/definitions/`:
**List page** (`page.tsx`):
- Table of all workflow definitions with columns: `workflow_code`, `version`, `is_active`, actions (Edit / Activate / Deactivate)
- Uses TanStack Query `useWorkflowDefinitions()` hook
- Activate/Deactivate via `PATCH /workflow-engine/definitions/:id` with `{ is_active: true/false }`
**Editor page** (`[id]/page.tsx`):
- Load definition via `useWorkflowDefinition(id)`
- JSON editor (Monaco Editor or `@uiw/react-codemirror` in JSON mode)
- Inline validation: call `POST /workflow-engine/definitions/validate` with DSL JSON → display errors inline
- Save button disabled when validation errors present (FR-025)
- Form managed with React Hook Form + Zod (for wrapper metadata fields)
---
## Complexity Tracking
No constitution violations requiring justification.
---
## Risk Register
| Risk | Impact | Mitigation |
|------|--------|-----------|
| `version_no` delta on live DB with existing instances | Medium | Delta sets `DEFAULT 1`; existing rows auto-initialize; no data loss |
| `action_by_user_uuid` delta — NULL for historical records | Low | Column is NULLABLE; historical records remain valid |
| DSL role mapping gaps (unknown role strings) | Medium | `DSL_ROLE_TO_CASL` unknown keys default to `__assigned__` check — fail-safe |
| Monaco Editor bundle size (~2MB) | Low | Lazy-loaded only on Admin DSL editor page; no impact to user-facing pages |
| n8n webhook URL not configured in some environments | Medium | Guard with `if (!N8N_WEBHOOK_URL)` → warn log, don't throw; ops can configure later |
---
## Test Plan
| Area | Tests Required | Target |
|------|---------------|--------|
| `WorkflowEngineService.processTransition` | Concurrent optimistic lock (409), version increment, structured log emission | Unit (Jest) |
| `WorkflowTransitionGuard` | DSL role → CASL mapping for each level | Unit (Jest) |
| `WorkflowEventService` DLQ | Failed job triggers n8n webhook | Unit (Jest + mock) |
| Transition metrics | Counter/histogram incremented on success + failure | Unit (Jest) |
| DSL cache invalidation | Activate triggers cache del | Integration (Jest) |
| File rollback (FR-019) | DB failure → `moveToTemp()` called | Unit (Jest + mock) |
| `FilePreviewModal` | Renders PDF/image/fallback correctly | Frontend (Vitest) |
| Admin DSL editor | Validation errors shown inline; save blocked | Frontend (Vitest) |
| Module gap-fill E2E | Each module detail page renders live `availableActions` | Manual / Playwright |
# Quickstart: Unified Workflow Engine — Production Hardening
**Phase 1 Output** | Generated: 2026-05-02
**For**: Developers implementing tasks from `tasks.md` (generated by `/speckit-tasks`)
---
## Pre-flight Checklist
Before writing any code:
- [ ] Apply Delta 09: `specs/03-Data-and-Storage/deltas/09-add-version-no-to-workflow-instances.sql`
- [ ] Apply Delta 10: `specs/03-Data-and-Storage/deltas/10-add-action-by-user-uuid-to-workflow-histories.sql`
- [ ] Confirm `workflow_instances` has `version_no` column: `DESCRIBE workflow_instances;`
- [ ] Confirm `workflow_histories` has `action_by_user_uuid` column: `DESCRIBE workflow_histories;`
- [ ] Verify existing tests pass: `pnpm test --testPathPattern=workflow-engine`
---
## Implementation Order
Tasks MUST be implemented in this order to avoid breaking existing functionality:
```
[B1] Schema Deltas (DB)
[B2] Entity + DTO updates
[B3] processTransition() — optimistic lock
[B4] WorkflowTransitionGuard — CASL role mapping
[B5] Observability — metrics + structured log
[B6] DSL Redis cache invalidation
[B7] BullMQ DLQ + n8n webhook
[F1] FilePreviewModal component
[F2] Step-attachment upload zone in IntegratedBanner
[F3] Module gap-fill (all 4 modules)
[F4] Admin DSL editor UI
```
---
## Key Files Reference
| Task | File | Action |
|------|------|--------|
| B1 | `specs/03-Data-and-Storage/deltas/09-*.sql` | CREATE |
| B1 | `specs/03-Data-and-Storage/deltas/10-*.sql` | CREATE |
| B2 | `backend/src/modules/workflow-engine/entities/workflow-instance.entity.ts` | EDIT — add `versionNo` |
| B2 | `backend/src/modules/workflow-engine/entities/workflow-history.entity.ts` | EDIT — add `actionByUserUuid` |
| B2 | `backend/src/modules/workflow-engine/dto/workflow-history-item.dto.ts` | EDIT — add `actorUuid` |
| B3 | `backend/src/modules/workflow-engine/workflow-engine.service.ts` | EDIT — optimistic lock, rollback, metrics |
| B4 | `backend/src/modules/workflow-engine/guards/workflow-transition.guard.ts` | EDIT — DSL role → CASL |
| B5 | `backend/src/modules/workflow-engine/workflow-engine.module.ts` | EDIT — register metrics providers |
| B6 | `backend/src/modules/workflow-engine/workflow-engine.service.ts` | EDIT — cache set/del in createDefinition/update |
| B7 | `backend/src/modules/workflow-engine/workflow-event.service.ts` | EDIT — DLQ + n8n webhook |
| F1 | `frontend/components/workflow/file-preview-modal.tsx` | CREATE |
| F2 | `frontend/components/workflow/integrated-banner.tsx` | EDIT — upload zone |
| F2 | `frontend/components/workflow/workflow-lifecycle.tsx` | EDIT — attachment chips |
| F3 | `frontend/app/(admin)/admin/doc-control/correspondence/[uuid]/page.tsx` | EDIT — banner wiring |
| F3 | `frontend/app/(admin)/admin/doc-control/rfa/[uuid]/page.tsx` | EDIT — step-attach gap |
| F3 | `frontend/app/(admin)/admin/doc-control/transmittals/[uuid]/page.tsx` | EDIT — step-attach gap |
| F3 | `frontend/app/(admin)/admin/doc-control/circulation/[uuid]/page.tsx` | EDIT — step-attach gap |
| F4 | `frontend/app/(admin)/admin/workflows/definitions/page.tsx` | CREATE |
| F4 | `frontend/app/(admin)/admin/workflows/definitions/[id]/page.tsx` | CREATE |
---
## Critical Patterns
### Optimistic Lock — Client Side
```typescript
// Frontend: store versionNo from GET /workflow-engine/instances/:id
const { data: instance } = useWorkflowInstance(instanceId);
// On transition: pass versionNo in body
await triggerTransition({
action: 'APPROVE',
versionNo: instance.versionNo, // ← MUST include
attachmentPublicIds: pendingFiles,
comment,
});
// On 409 → show toast (via i18n key, per FR-021): "This document was already approved by another user. Please refresh."
// Invalidate query cache → user sees updated state
```
### DSL Role Mapping — Guard
```typescript
// backend/src/modules/workflow-engine/guards/workflow-transition.guard.ts
const DSL_ROLE_TO_CASL: Record<string, string> = {
'Superadmin': 'system.manage_all',
'OrgAdmin': 'organization.manage_users',
'ContractMember': 'contract.view',
'AssignedHandler': '__assigned__',
};
// In canActivate: extract require.role from DSL compiled state
const stepConfig = compiled?.states?.[instance.currentState];
const requiredRoles: string[] = stepConfig?.require?.role ?? [];
for (const dslRole of requiredRoles) {
const caslAbility = DSL_ROLE_TO_CASL[dslRole];
if (!caslAbility) continue;
if (caslAbility === '__assigned__') continue; // handled by Level 3 check
if (userPermissions.includes(caslAbility)) return true;
}
// Fall through to Level 3 (assignedUserId) check as before
```
### File Preview Modal — Usage
```tsx
// In workflow-lifecycle.tsx
import { FilePreviewModal } from './file-preview-modal';
const [preview, setPreview] = useState<WorkflowAttachmentSummary | null>(null);
// In attachment chip onClick:
<button onClick={() => setPreview(attachment)}>{attachment.originalFilename}</button>
<FilePreviewModal attachment={preview} onClose={() => setPreview(null)} />
```
### Admin DSL Editor — Monaco Setup
```tsx
// In definitions/[id]/page.tsx
import dynamic from 'next/dynamic';
import debounce from 'lodash/debounce';
const MonacoEditor = dynamic(() => import('@monaco-editor/react'), { ssr: false });
// Validate on change (debounced 800ms)
const handleEditorChange = useCallback(
debounce(async (value: string) => {
try {
const parsed = JSON.parse(value);
const result = await validateDsl(parsed);
setValidationErrors(result.errors);
} catch {
setValidationErrors([{ path: 'root', message: 'Invalid JSON' }]);
}
}, 800),
[]
);
```
---
## Testing Verification Commands
```bash
# Backend unit tests for workflow engine (run from repo root)
cd backend
pnpm test --testPathPattern=workflow-engine --coverage

# Frontend typecheck
cd ../frontend
pnpm tsc --noEmit

# Frontend component tests (still in frontend/)
pnpm vitest run components/workflow

# Full backend test suite
cd ../backend
pnpm test --coverage
```
---
## Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `N8N_WEBHOOK_URL` | Prod only | URL for dead-letter job ops notifications |
| `REDIS_URL` | All | Redis connection for BullMQ + cache |
Both must be set in `docker-compose.yml` — never hardcoded.
---
## Commit Message Convention
```
feat(workflow-engine): add optimistic lock version_no (FR-002, ADR-001 v1.1)
feat(workflow-engine): add CASL DSL role mapping to guard (FR-002a)
feat(workflow-engine): structured transition log + metrics (FR-022/023)
feat(workflow-engine): DSL cache invalidation on activate (FR-007)
feat(workflow-engine): BullMQ DLQ + n8n webhook (FR-005/006)
feat(workflow-ui): FilePreviewModal component (FR-020)
feat(workflow-ui): step-attachment upload zone in IntegratedBanner (FR-014-019)
feat(workflow-ui): Admin DSL editor page (FR-024/025)
feat(correspondence): IntegratedBanner gap-fill wiring (FR-011)
chore(schema): delta-09 version_no, delta-10 action_by_user_uuid (ADR-009)
```
# Research: Unified Workflow Engine — Production Hardening Decisions
**Phase 0 Output** | Generated: 2026-05-02
**Builds on**: `specs/08-Tasks/ADR-021-workflow-context/research.md` (attachment strategy, FK structure, UUID type — all resolved previously)
---
## Decision 1: Optimistic Lock Strategy for `processTransition()` (FR-002)
**Question:** `processTransition()` already uses `pessimistic_write` DB lock. ADR-001 v1.1 requires adding `version_no` optimistic lock. Should they co-exist or replace?
### Option A: Replace pessimistic with optimistic (Rejected ❌)
Remove `lock: { mode: 'pessimistic_write' }` and rely solely on `version_no` CAS.
**Cons:**
- Two concurrent requests with different `version_no` values still cause a race window between the DB read and the UPDATE
- Redlock already acquired before DB transaction — removing pessimistic adds no benefit to latency
### Option B: Dual-layer defense-in-depth (Selected ✅)
Keep `pessimistic_write` inside the transaction. Add `version_no` check as a **fast-fail before Redlock acquisition**.
**Flow:**
```
Client sends { action, version_no: N }
[Fast-fail] Read instance.version_no from DB (no lock)
If N ≠ instance.version_no → HTTP 409 immediately (no Redlock acquired)
[Acquire Redlock]
[DB Transaction with pessimistic_write]
Re-check version_no under lock (TOCTOU defense)
If still mismatch → 409 and release lock
Else: commit + increment version_no
```
**Pros:**
- Fast-fail saves Redlock round-trip for stale clients (SC-001 — no double approvals)
- Inner pessimistic lock prevents any residual race within the DB transaction
- Defense-in-depth: two independent barriers
**Decision:** Option B — dual-layer.
**Rationale:** Zero latency regression for non-conflicting requests; stale-client 409 fired before lock acquisition; inner lock remains for cross-process correctness.
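The flow above can be exercised as a compact simulation (in-memory structures stand in for the DB row and the Redlock mutex; the function name and shapes are illustrative, not the real service API):

```typescript
type Instance = { id: string; currentState: string; versionNo: number };

// In-memory stand-ins for the DB table and the Redlock mutex.
const db = new Map<string, Instance>();
const locks = new Set<string>();

// Returns an HTTP-like status: 200 on success, 409 on version conflict.
function processTransition(id: string, clientVersion: number, nextState: string): number {
  const snapshot = db.get(id)!;

  // [Fast-fail] stale client detected before any lock is taken
  if (clientVersion !== snapshot.versionNo) return 409;

  // [Acquire Redlock] (simulated as a simple mutex)
  if (locks.has(id)) return 409; // another transition in flight
  locks.add(id);
  try {
    // [Re-check under lock] TOCTOU defense: version may have moved meanwhile
    const locked = db.get(id)!;
    if (clientVersion !== locked.versionNo) return 409;

    // Commit: state change and version increment happen together
    db.set(id, { ...locked, currentState: nextState, versionNo: locked.versionNo + 1 });
    return 200;
  } finally {
    locks.delete(id);
  }
}
```

With an instance seeded at `versionNo: 5`, two callers both sending `versionNo: 5` see exactly one 200 and one 409, matching SC-001.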
---
## Decision 2: DSL `require.role` → CASL Ability Mapping (FR-002a)
**Question:** How should the guard resolve DSL `require.role: ["Admin"]` against the CASL permission model?
### Option A: Static config map in guard (Selected ✅)
```typescript
const DSL_ROLE_TO_CASL: Record<string, string> = {
'Superadmin': 'system.manage_all',
'OrgAdmin': 'organization.manage_users',
'ContractMember': 'contract.view',
'AssignedHandler': '__assigned__',
};
```
Guard resolves each DSL role string → CASL permission string → `userPermissions.includes(mapped)`.
**Pros:**
- No new DB tables or config entities
- Testable in isolation (mock `userPermissions`)
- Backward-compatible: unknown DSL roles fall through to `__assigned__` check
### Option B: Dynamic mapping table in DB (Rejected ❌)
Store DSL role → CASL ability mappings in a new `workflow_role_mappings` table.
**Cons:**
- New table requires ADR-009 delta + entity + service
- Over-engineering for a mapping that changes rarely
- Adds DB query to every transition guard check
**Decision:** Option A — static config map.
**Rationale:** The mapping is stable (tied to ADR-016 RBAC levels); config-driven in code is sufficient and avoids over-engineering.
---
## Decision 3: Per-Transition Prometheus Metrics (FR-023)
**Question:** The existing service has Redlock-specific metrics. Where should workflow-level transition metrics be registered?
### Existing metrics (keep)
- `workflow_redlock_acquire_duration_ms` (Histogram)
- `workflow_redlock_acquire_failures_total` (Counter)
### New metrics needed
- `workflow_transitions_total` (Counter, labels: `workflow_code`, `action`, `outcome`)
- `workflow_transition_duration_ms` (Histogram, labels: `workflow_code`)
**Registration approach:** Add `makeCounterProvider` and `makeHistogramProvider` in `workflow-engine.module.ts` via `@willsoto/nestjs-prometheus`. Inject with `@InjectMetric('workflow_transitions_total')`.
**Outcome label values:**
- `success` — transition committed
- `conflict` — optimistic lock mismatch (409) or TOCTOU
- `forbidden` — CASL guard rejection (403)
- `validation_error` — DSL condition failed (422)
- `system_error` — unexpected exception (500)
**Decision:** Register in `WorkflowEngineModule`; inject into `WorkflowEngineService`; record in `processTransition()` try/catch/finally block.
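The outcome label can be derived with a small pure helper, sketched here against the HTTP statuses listed above (`outcomeFromStatus` is an illustrative name, not existing code):

```typescript
type Outcome = 'success' | 'conflict' | 'forbidden' | 'validation_error' | 'system_error';

// Map the HTTP status produced by a transition to the metric's outcome label.
function outcomeFromStatus(status: number): Outcome {
  switch (status) {
    case 409: return 'conflict';          // optimistic lock mismatch or TOCTOU
    case 403: return 'forbidden';         // CASL guard rejection
    case 422: return 'validation_error';  // DSL condition / context schema failure
    default:  return status >= 200 && status < 300 ? 'success' : 'system_error';
  }
}

// Where the label is consumed (prom-client style, illustrative):
// transitionsCounter.inc({ workflow_code, action, outcome: outcomeFromStatus(status) });
```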
---
## Decision 4: DSL Definition Redis Cache Pattern (FR-007)
**Question:** Cache key format, TTL, and invalidation strategy for `workflow_definitions`.
### Cache key design
```
wf:def:{workflow_code}:{version} → single definition
wf:def:{workflow_code}:active → pointer to active version (for fast active lookup)
```
### Invalidation triggers
| Event | Action |
|-------|--------|
| `createDefinition()` | SET new key; leave old active pointer |
| `update()` — DSL change | DEL old key; SET updated key |
| `is_active = true` | SET `wf:def:{code}:active`; DEL previous active pointer |
| `is_active = false` | DEL `wf:def:{code}:active` |
**TTL**: 3600 seconds (1 hour). Acceptable stale window for inactive definitions; the active pointer is always invalidated on toggle.
**Decision:** Two-key pattern with a separate `:active` pointer key. Invalidate pointer immediately on `is_active` change → satisfies SC-005 (< 1s invalidation).
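A sketch of the key builders and the `is_active` invalidation rule (Redis commands are returned as data here so the rules are unit-testable; helper names are illustrative, real code would issue `redis.set`/`redis.del`):

```typescript
// Key builders for the two-key pattern.
const defKey = (code: string, version: number) => `wf:def:${code}:${version}`;
const activeKey = (code: string) => `wf:def:${code}:active`;

type RedisOp = ['SET', string, string] | ['DEL', string];

// Commands to run when a definition's is_active flag is toggled.
// SET overwrites the previous active pointer, so no separate DEL is needed on activate.
function onActiveToggle(code: string, version: number, isActive: boolean): RedisOp[] {
  return isActive
    ? [['SET', activeKey(code), String(version)]]
    : [['DEL', activeKey(code)]];
}
```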
---
## Decision 5: BullMQ Dead-Letter Queue Architecture (FR-005, FR-006)
**Question:** How to implement `workflow-events-failed` DLQ with n8n webhook notification?
### Option A: Separate `workflow-events-failed` queue (Selected ✅)
```typescript
// WorkflowEventProcessor
@OnWorkerEvent('failed')
async onFailed(job: Job, error: Error) {
if (job.attemptsMade >= (job.opts.attempts ?? 1)) {
// All retries exhausted → DLQ + webhook
await this.failedQueue.add('dead-letter', { jobId: job.id, ...job.data });
await this.notifyOps(job, error);
}
}
```
**Pros:**
- Failed jobs visible in Bull Board under `workflow-events-failed`
- Can be requeued manually via Bull Board UI
- n8n only notified on final failure (not on intermediate retries)
### Option B: BullMQ native `removeOnFail: false` only (Rejected ❌)
Keep jobs in `workflow-events` completed/failed states, no separate queue.
**Cons:**
- Bull Board has no separate DLQ view
- No ops notification mechanism
- Harder to isolate and requeue failed jobs
**Decision:** Option A — separate `workflow-events-failed` queue.
**n8n webhook:** Send via `fetch(process.env.N8N_WEBHOOK_URL, { method: 'POST', body: JSON.stringify(payload) })`. Guard with `if (!process.env.N8N_WEBHOOK_URL)` to avoid hard failure in dev environment.
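A minimal sketch of that guarded call (the payload shape and function name are assumptions; it returns `false` when the env var is unset so dev runs are a no-op):

```typescript
// Notify ops via the n8n webhook when a job lands in the dead-letter queue.
// Returns false without calling out when N8N_WEBHOOK_URL is not configured.
async function notifyOps(payload: { jobId?: string; error?: string }): Promise<boolean> {
  const url = process.env.N8N_WEBHOOK_URL;
  if (!url) return false; // dev / test environments: no hard failure
  await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return true;
}
```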
---
## Decision 6: Admin DSL Editor — JSON Editor Library (FR-024, FR-025)
**Question:** Which JSON editor library for the DSL authoring UI?
### Option A: Monaco Editor (`@monaco-editor/react`) (Selected ✅)
Full VS Code-like editor with JSON syntax highlighting, bracket matching, and inline error markers.
**Pros:**
- Inline error decoration (squiggle underlines) — satisfies FR-025 inline validation feedback
- JSON schema validation via `monaco.languages.json.jsonDefaults.setDiagnosticsOptions()`
- Familiar to developers
- Already potentially used elsewhere in admin UIs
**Cons:**
- ~2MB bundle (lazy loaded via `dynamic(() => import('@monaco-editor/react'), { ssr: false })`)
### Option B: CodeMirror 6 (`@uiw/react-codemirror`) (Alternative)
Lighter (~400KB for JSON extension).
**Cons:**
- No native JSON Schema validation; requires custom linting extension
- Inline error decoration requires manual setup
**Decision:** Option A — Monaco Editor, lazy-loaded. The bundle cost is acceptable for an Admin-only page (not user-facing); inline validation is a critical FR-025 requirement.
**DSL Schema**: Provide the compiled DSL JSON Schema to Monaco for inline validation → errors shown before the user clicks Save.
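The schema hookup is a one-time Monaco configuration at editor mount; a sketch, with `dslSchema` standing in for the real compiled DSL JSON Schema and the URI an arbitrary internal identifier (the `monaco` parameter is the namespace object Monaco passes to `onMount`, typed loosely here):

```typescript
// Wire the compiled DSL JSON Schema into Monaco's built-in JSON validation,
// so schema violations render as inline squiggles before the user clicks Save.
// Call once from <MonacoEditor onMount={(editor, monaco) => registerDslSchema(monaco, dslSchema)} />.
function registerDslSchema(monaco: any, dslSchema: object): void {
  monaco.languages.json.jsonDefaults.setDiagnosticsOptions({
    validate: true,
    schemas: [
      {
        uri: 'internal://workflow-dsl.schema.json', // arbitrary identifier, never fetched
        fileMatch: ['*'], // apply to every JSON model in this editor instance
        schema: dslSchema, // placeholder: the compiled DSL JSON Schema object
      },
    ],
  });
}
```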
---
## Carry-Forward from Prior Research
The following decisions from `specs/08-Tasks/ADR-021-workflow-context/research.md` remain valid and are not re-litigated:
- **File attachment strategy**: Upload-then-reference (Two-Phase, ADR-016) ✅
- **FK structure**: Direct `workflow_history_id` on `attachments` table ✅
- **UUID type for `workflow_histories.id`**: CHAR(36) UUID direct PK ✅
- **Redlock scope**: Transition-level Redlock (not document-numbering Redlock) ✅
- **Preview endpoint**: Use existing `/api/files/{publicId}` with `Content-Disposition: inline`
# Feature Specification: Unified Workflow Engine — Production Hardening & Integrated Context
**Feature Branch**: `003-unified-workflow-engine`
**Created**: 2026-05-02
**Status**: Draft
**References**: ADR-001 (Unified Workflow Engine v1.1), ADR-021 (Integrated Workflow Context & Step-specific Attachments)
---
## Clarifications
### Session 2026-05-02
- Q: How should the `WorkflowTransitionGuard` resolve DSL `require.role` values against the CASL permission system? → A: DSL `require.role` values map to **CASL ability checks** — each role string corresponds to a defined CASL `action:subject` permission pair (e.g., `"Admin"` → `workflow.manage`). The guard resolves permissions dynamically at transition time; it does NOT match DB role names directly.
- Q: What level of observability is required for workflow transition operations? → A: **Structured log + metrics** — one structured log entry per transition (instance ID, action, user UUID, duration ms, outcome: success/conflict/forbidden/error) plus a counter metric for transition throughput and a latency histogram. No distributed tracing required at this stage.
- Q: When a file has been moved to permanent storage but the DB transition subsequently fails, what is the recovery action? → A: **Move back to temp**: `StorageService` moves the file from permanent back to temp on DB failure; temp files expire after a 24-hour TTL, allowing the user to retry the transition without re-uploading or re-scanning.
- Q: Does this feature include a frontend Admin UI for DSL authoring, or is API-only sufficient? → A: **Full Admin UI in scope** — a frontend page for Super Admins to create, edit (JSON editor), activate, and deactivate workflow definitions with inline DSL validation feedback. Visual workflow builder (drag-and-drop) remains Phase 2 / out of scope.
- Q: Which modules still need new Integrated Banner + Workflow Lifecycle integration work? → A: **All four modules need gap-filling** — RFA, Transmittal, Circulation, and Correspondence all have the banner component mounted but have incomplete data wiring (e.g., missing `availableActions`, no step-attachment upload support). None are fully complete; all require targeted completion work.
---
## User Scenarios & Testing _(mandatory)_
### User Story 1 — Workflow Transition with State Integrity (Priority: P1)
A Reviewer or Approver assigned to an active workflow step transitions a document from one state to the next (e.g., `PENDING_REVIEW` → `APPROVED`). The system must guarantee that only one transition occurs even if two users click "Approve" simultaneously, that the workflow history records who acted and when, and that downstream notifications are dispatched asynchronously without slowing down the response.
**Why this priority**: Core correctness of the Workflow Engine — without reliable, race-condition-free transitions the entire approval chain is unreliable.
**Independent Test**: Can be fully tested by submitting two concurrent approval requests and verifying only one succeeds (the other returns 409), and that the history table contains exactly one new record.
**Acceptance Scenarios**:
1. **Given** a document in `PENDING_REVIEW` state with `version_no = 5`, **When** an assigned handler submits the `APPROVE` action, **Then** the state transitions to `APPROVED`, `version_no` increments to `6`, and a new `workflow_histories` record is written within the same DB transaction.
2. **Given** two concurrent `APPROVE` requests for the same instance at the same `version_no`, **When** both reach the server simultaneously, **Then** exactly one succeeds (200) and the other receives 409 "Concurrent transition detected — please retry" without any data corruption.
3. **Given** a successful transition, **When** the transition commits, **Then** a BullMQ job is enqueued on the `workflow-events` queue within the same request (no inline notification call).
4. **Given** a `PENDING_REVIEW` instance and a user who is NOT the assigned handler and does NOT have the required CASL ability (e.g., `workflow.manage`) mapped from the DSL `require.role` value, **When** they attempt to transition, **Then** they receive 403 Forbidden.
---
### User Story 2 — Condition-Gated Transitions via DSL (Priority: P1)
A workflow step requires a condition to be met (e.g., `requiresLegal > 0`) before a transition is allowed. The DSL defines this as a JSON Logic rule, and the engine evaluates it against the current `context` at transition time.
**Why this priority**: Without reliable condition evaluation, automated gating (legal review, approval thresholds) fails and documents could bypass required steps.
**Independent Test**: Can be fully tested by configuring a DSL with a JSON Logic condition, providing a context that both satisfies and fails the condition, and observing that transitions are allowed/blocked accordingly.
**Acceptance Scenarios**:
1. **Given** a DSL transition with `{ "type": "json-logic", "rule": { ">": [{ "var": "requiresLegal" }, 0] } }` and context `{ "requiresLegal": 1 }`, **When** the `SUBMIT` action is triggered, **Then** the transition proceeds.
2. **Given** the same DSL and context `{ "requiresLegal": 0 }`, **When** `SUBMIT` is triggered, **Then** the transition is blocked and the caller receives a `ValidationException` (HTTP 422) with a field-level error.
3. **Given** a DSL that uses a raw JS string expression (`"context.x === true"`) instead of JSON Logic format, **When** an Admin attempts to save the DSL, **Then** the save is rejected with a validation error explaining only JSON Logic format is permitted.
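The engine would evaluate such rules with a JSON Logic library; as a sketch of the semantics in scenario 1, here is a tiny evaluator covering only the `var` and `>` operators used above:

```typescript
// Minimal JSON Logic rule shape: a literal number, a variable lookup, or a ">" comparison.
type Rule = number | { var: string } | { '>': [Rule, Rule] };

// Tiny evaluator, illustration only (a real engine would use a full JSON Logic library).
function applyRule(rule: Rule, context: Record<string, unknown>): unknown {
  if (typeof rule === 'number') return rule;
  if ('var' in rule) return context[rule.var]; // variable lookup against the instance context
  const [lhs, rhs] = rule['>'];
  return Number(applyRule(lhs, context)) > Number(applyRule(rhs, context));
}

// The gating rule from scenario 1:
const rule: Rule = { '>': [{ var: 'requiresLegal' }, 0] };
// context { requiresLegal: 1 } satisfies the rule; { requiresLegal: 0 } blocks the transition (422)
```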
---
### User Story 3 — Integrated Contextual Banner & Workflow Lifecycle View (Priority: P1)
A Reviewer opens a document detail page (RFA, Transmittal, Circulation, or Correspondence). Instead of navigating to a separate Workflow panel, the document header immediately shows the document number, current status, priority badge, and Approve/Reject action buttons. A "Workflow Engine" tab below displays a vertical timeline of all workflow steps — active step highlighted in indigo with a pulse animation.
**Why this priority**: Without the Integrated Banner and Lifecycle View (ADR-021 REQ-01 to REQ-03), Reviewers must switch between screens to understand context, increasing approval time and error rate.
**Independent Test**: Can be fully tested by opening any document in `PENDING_REVIEW` or `PENDING_APPROVAL` state and visually confirming the banner shows correct status + action buttons, and the timeline tab shows the active step in indigo.
**Acceptance Scenarios**:
1. **Given** an RFA in `PENDING_APPROVAL` state with priority `URGENT`, **When** the detail page loads, **Then** the banner at the top displays the document number, `PENDING_APPROVAL` status badge, `URGENT` priority badge, and `Approve`/`Reject` action buttons — all before the document body content.
2. **Given** a workflow with 4 steps (DRAFT → PENDING_REVIEW → PENDING_APPROVAL → APPROVED), **When** the document is in `PENDING_REVIEW`, **Then** step 2 shows indigo color with CSS pulse animation; steps 1, 3, 4 show no animation.
3. **Given** a completed document (`APPROVED` or `CLOSED`), **When** the detail page loads, **Then** the action buttons are disabled/hidden and no upload controls are visible.
---
### User Story 4 — Step-specific Attachment Upload & Preview (Priority: P2)
While reviewing a document in an active workflow step, a handler uploads evidence files (PDF, DWG, DOCX, XLSX, ZIP) to be linked specifically to that step's history record. Later, any authorized user can click the file to preview it inline via a modal without navigating away.
**Why this priority**: Step-specific attachments provide the audit trail required for compliance — files are traceable to the exact decision step. Preview reduces time spent downloading/opening files.
**Independent Test**: Can be fully tested by uploading a PDF during `PENDING_REVIEW`, transitioning to `APPROVED`, and verifying the file is visible under the `PENDING_REVIEW` history entry with inline preview working.
**Acceptance Scenarios**:
1. **Given** a document in `PENDING_REVIEW` state, **When** the assigned handler drags and drops a valid PDF onto the upload zone, **Then** the file is scanned by ClamAV, stored in permanent storage after a successful transition, and linked to the `workflow_histories` record for that step.
2. **Given** a document in `APPROVED` (terminal) state, **When** any user attempts to upload a file, **Then** the upload zone is disabled and the system returns HTTP 409 "Cannot upload to terminal state".
3. **Given** a file linked to a step, **When** any authorized user clicks the file name, **Then** a preview modal opens in-browser without navigating away from the detail page.
4. **Given** a file infected with malware detected by ClamAV, **When** upload is attempted, **Then** the temp file is deleted immediately, the upload is rejected, and the user sees "File rejected: security scan failed".
5. **Given** a duplicate upload request with the same `Idempotency-Key`, **When** the duplicate request arrives, **Then** the system returns the cached 201 response without creating a second record.
---
### User Story 5 — Workflow Definition Authoring (Super Admin Only) (Priority: P2)
A Super Admin creates or updates a workflow DSL definition via an **Admin UI page** (JSON editor with inline validation feedback). The system validates the DSL structure and activates the new version. In-progress workflow instances continue using their bound version until completion.
**Why this priority**: Without safe DSL authoring, new document types cannot be onboarded and workflow changes cannot be deployed without code releases.
**Independent Test**: Can be fully tested by creating a new DSL definition, activating it, and verifying existing in-progress instances still use the old version while new instances use the new version.
**Acceptance Scenarios**:
1. **Given** a Super Admin submits a valid DSL JSON, **When** the definition is saved and activated, **Then** the Redis cache key `wf:def:{workflow_code}:{version}` is invalidated immediately and new instances start using the new version.
2. **Given** an in-progress `workflow_instances` record bound to version 1, **When** version 2 is activated, **Then** the in-progress instance continues using version 1's `definition_id` until it reaches a terminal state.
3. **Given** a non-Super-Admin user, **When** they attempt to create or activate a DSL definition, **Then** they receive 403 Forbidden (`system.manage_all` required).
4. **Given** a context_schema with a `required` field, **When** a transition is triggered with a context missing that field, **Then** HTTP 422 is returned with `{ "field": "<context_field>", "message": "required field missing" }`.
---
### User Story 6 — Dead-letter Queue & Ops Recovery (Priority: P3)
A BullMQ `workflow-events` job fails all 3 retry attempts and moves to `workflow-events-failed`. Ops team is notified via n8n webhook and can manually requeue the job via Bull Board UI.
**Why this priority**: Without dead-letter recovery, failed event dispatches (notifications, downstream triggers) are silently lost, breaking audit trail integrity.
**Independent Test**: Can be fully tested by causing a simulated worker failure and verifying the n8n webhook fires and the job appears in the Bull Board dead-letter queue.
**Acceptance Scenarios**:
1. **Given** a `workflow-events` job that fails 3 times with exponential backoff, **When** attempts are exhausted, **Then** the job moves to `workflow-events-failed` queue and a webhook call is sent to `N8N_WEBHOOK_URL`.
2. **Given** a job in `workflow-events-failed`, **When** an Ops admin clicks "Retry" in Bull Board UI, **Then** the job re-enters `workflow-events` queue for processing.
3. **Given** a failed job, **When** the system auto-retries, **Then** it uses exponential backoff: attempt 1 immediately, attempt 2 after 500ms, attempt 3 after 1000ms; once the job reaches the dead-letter queue it is NOT auto-requeued.
---
### Edge Cases
- What happens when Redis is down during a workflow transition (no Redlock available for state transition)? The optimistic lock (`version_no`) alone handles concurrency for transitions — Redis is NOT required for transitions (only for Document Numbering per ADR-002). Transition proceeds normally; only file-upload-plus-transition uses Redlock.
- What happens when a Redis Redlock fails during file-upload-plus-transition? Retry 3 times (500ms exponential backoff); if still failing, return HTTP 503 "Service temporarily unavailable" (Fail-closed — no partial state).
- What happens when a terminal-state workflow receives a transition request? The engine returns 409 `BusinessException` — "Workflow is already in a terminal state".
- What happens when `context_schema.required` field is missing at transition time? HTTP 422 `ValidationException` with field-level error — transition is blocked; caller must supply the missing context field and retry.
- What happens when a file is deleted from storage after being linked to a workflow step? The UI shows "File unavailable" for that attachment; the `workflow_histories` metadata record is preserved.
- What happens when two Admins concurrently activate different DSL versions for the same `workflow_code`? Last-write-wins on `is_active`; Redis cache is invalidated by both writes; existing instances are unaffected (already bound to a `definition_id`).
---
## Requirements _(mandatory)_
### Functional Requirements
**Workflow Engine Core (ADR-001)**
- **FR-001**: The system MUST evaluate workflow transition conditions using JSON Logic format (`{ "type": "json-logic", "rule": {...} }`) exclusively — no JavaScript string evaluation (`eval` / `new Function`).
- **FR-002**: The system MUST use optimistic locking (`version_no INT NOT NULL DEFAULT 1`) on `workflow_instances` to prevent concurrent double-transitions — only one transition per `(id, current_state, version_no)` tuple succeeds; the other receives HTTP 409.
- **FR-002a**: The `WorkflowTransitionGuard` MUST resolve DSL `require.role` values as **CASL ability checks** — each string value maps to a defined CASL `action:subject` pair (e.g., `"Admin"` → `workflow.manage`). Direct DB role-name matching is forbidden; permissions are evaluated dynamically at transition time via the CASL `AbilityFactory`.
- **FR-003**: The system MUST record every state transition in `workflow_histories`, including `action_by_user_id` (INT FK, internal, excluded from API) and `action_by_user_uuid` (VARCHAR 36, exposed in API per ADR-019).
- **FR-004**: All workflow events (notifications, side effects) MUST be dispatched via the dedicated BullMQ queue `workflow-events` — never inline within the request thread.
- **FR-005**: The `workflow-events` worker MUST be configured with concurrency 5, 3 retry attempts with exponential backoff, and a `workflow-events-failed` dead-letter queue.
- **FR-006**: When a job enters `workflow-events-failed`, the system MUST send a webhook to `N8N_WEBHOOK_URL` (env var, never hardcoded) to alert the ops team.
- **FR-007**: `workflow_definitions` MUST be cached in Redis with key `wf:def:{workflow_code}:{version}` (TTL: 1 hour), invalidated immediately when a Super Admin saves or activates a definition.
- **FR-008**: Context schema validation MUST occur in two phases: Phase 1 at definition save-time (structure), Phase 2 at transition-time (values against required fields) — missing required fields return HTTP 422 with field-level errors.
- **FR-009**: Only users with `system.manage_all` permission MAY create, update, activate, or deactivate workflow definitions.
- **FR-010**: In-progress `workflow_instances` MUST remain bound to the `definition_id` at time of creation — activating a new DSL version MUST NOT rebind in-progress instances.
**Integrated Banner & Lifecycle View (ADR-021 REQ-01 to REQ-03)**
- **FR-011**: Every document detail page (RFA, Transmittal, Circulation, Correspondence) MUST complete the Integrated Banner wiring — all four modules already have the component mounted but require gap-filling: live `workflowState`, `availableActions`, priority badge, and step-attachment upload support must be fully connected. No module is exempt.
- **FR-012**: The "Workflow Engine" tab on detail pages MUST display a vertical timeline of all workflow steps with: step role, handler name, description, and visual state (completed/active/pending).
- **FR-013**: The active step MUST be rendered with indigo color (`#6366f1`) and a CSS pulse animation; all other steps MUST NOT have the pulse animation.
**Step-specific Attachments (ADR-021 REQ-04 to REQ-05)**
- **FR-014**: The `attachments` table MUST have a nullable FK `workflow_history_id` — existing attachments without this FK are treated as main-document attachments.
- **FR-015**: Users MAY upload attachments only when the document is in an active-decision state (`PENDING_REVIEW` or `PENDING_APPROVAL`); uploads MUST be rejected with HTTP 409 when the document is in a terminal state (`APPROVED`, `REJECTED`, `CLOSED`).
- **FR-016**: Only the assigned step handler, organization admin, or Super Admin may upload step-specific attachments; unauthorized attempts return HTTP 403.
- **FR-017**: All uploaded files MUST be scanned by ClamAV before moving from temp to permanent storage; infected files MUST be deleted immediately and the user notified with "File rejected: security scan failed".
- **FR-018**: File uploads with a transition MUST require an `Idempotency-Key` header; duplicate requests with the same key return the cached result without re-processing.
- **FR-019**: Every step-specific attachment upload MUST be atomic with the workflow transition. Recovery on failure is: (1) if DB transition fails after file reaches permanent storage, `StorageService` MUST move the file back to temp storage; (2) temp files expire after a **24-hour TTL** and are automatically purged; (3) the user MAY retry the transition within the TTL window without re-uploading or re-scanning the file.
- **FR-020**: Any authorized user MAY preview PDF and image files inline via a modal without navigating away from the detail page.
**Admin UI — DSL Authoring (Super Admin)**
- **FR-024**: The system MUST provide an Admin UI page (accessible only to Super Admins) where DSL definitions can be created, edited (JSON editor), activated, and deactivated.
- **FR-025**: The DSL editor MUST display inline validation feedback — structure errors (Phase 1 save-time) are highlighted before the user saves; the page MUST NOT allow saving a DSL that fails Phase 1 validation.
**i18n (ADR-021 REQ-06)**
- **FR-021**: All UI text on new and updated components MUST use i18n keys — no hardcoded Thai or English strings.
**Observability**
- **FR-022**: The Workflow Engine MUST emit one structured log entry per transition containing: `instanceId`, `action`, `fromState`, `toState`, `userUuid`, `durationMs`, and `outcome` (`success` | `conflict` | `forbidden` | `validation_error` | `system_error`).
- **FR-023**: The Workflow Engine MUST record two metrics: (1) a **transition counter** labelled by `workflow_code`, `action`, and `outcome`; (2) a **transition latency histogram** (ms) labelled by `workflow_code`.
### Key Entities
- **WorkflowDefinition**: Versioned DSL template defining states, transitions, conditions, events, and context schema. Identified by `workflow_code` + `version`. One active version per code.
- **WorkflowInstance**: Running instance bound to a specific entity (RFA, Transmittal, Correspondence, Circulation). Tracks `current_state`, `context` (JSON), and `version_no` (optimistic lock).
- **WorkflowHistory**: Immutable record of every state transition. Linked to the acting user (both INT FK and UUID), comment, and metadata. Step-specific attachments link here.
- **Attachment**: File stored in permanent storage. May be a main-document attachment (`workflow_history_id = NULL`) or a step-specific attachment (`workflow_history_id` set).
---
## Success Criteria _(mandatory)_
### Measurable Outcomes
- **SC-001**: Zero concurrent double-approvals — a load test with 50 simultaneous `APPROVE` requests on the same workflow instance results in exactly 1 success and 49 responses with status 409.
- **SC-002**: Transition throughput — workflow state change (without file upload) completes in under 1 second (P95) for documents with up to 20 workflow history records under normal load.
- **SC-003**: Upload + transition SLA — `POST /workflow/:uuid/transition` with a file ≤ 10MB (including ClamAV scan, Redlock, and DB transaction) responds within 5 seconds (P95).
- **SC-004**: Event delivery reliability — less than 0.1% of `workflow-events` jobs reach the dead-letter queue under normal operating conditions.
- **SC-005**: DSL cache effectiveness — activating a new DSL version results in the stale cache entry being invalidated within 1 second on all app instances.
- **SC-006**: Integrated Banner adoption — 100% of document detail pages (RFA, Transmittal, Circulation, Correspondence) display the Integrated Banner and Workflow Engine tab after release.
- **SC-007**: No navigation required — reviewers complete document approval (view context + act) without leaving the detail page in 95%+ of sessions.
- **SC-008**: Audit completeness — every workflow transition has a corresponding `workflow_histories` record with user UUID, timestamp, action, and comment (if provided); zero orphaned transitions.
- **SC-009**: Observability coverage — 100% of workflow transitions (success, conflict, forbidden, error) produce a structured log entry and increment the transition counter metric; no silent failures.
---
## Assumptions
- ADR-001 Unified Workflow Engine backend infrastructure (`workflow_definitions`, `workflow_instances`, `workflow_histories` tables) is already partially implemented; this spec covers the production-hardening gaps (JSON Logic, `version_no`, dedicated BullMQ queue, context schema two-phase validation, ADR-019 UUID compliance for history records).
- ADR-021 Integrated Banner and Workflow Lifecycle components are **mounted but incompletely wired** across all four modules (RFA, Transmittal, Circulation, Correspondence). Common gaps include: missing live `availableActions`, no step-specific attachment upload zone, incomplete i18n. This spec closes all four modules to full completion.
- `json-logic-js` npm package is used for condition evaluation in `WorkflowDslService` (in-process, no external service).
- Redis and BullMQ infrastructure are available in all environments.
- ClamAV is available as a service and integrated via the existing `StorageService` two-phase upload pattern.
- `N8N_WEBHOOK_URL` environment variable will be set in `docker-compose.yml` for all environments before deploy.
- Bull Board UI (`@bull-board/nestjs`) will be installed for `workflow-events` and `workflow-events-failed` queue visibility.
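The `json-logic-js` assumption above means DSL conditions are plain JSON rules. A hand-rolled mini-evaluator (illustration only: the real code would call `jsonLogic.apply(rule, context)` from the `json-logic-js` package, which supports many more operators) shows the shape of a transition condition:

```typescript
// Hand-rolled illustration of JSON-Logic-style condition evaluation.
// Covers only "var", "==" and ">" to show how a DSL transition condition
// is evaluated against the instance context.
type Rule = { [op: string]: any };

function applyRule(rule: Rule | string | number | boolean, ctx: Record<string, any>): any {
  // Primitives evaluate to themselves.
  if (typeof rule !== "object" || rule === null) return rule;
  const op = Object.keys(rule)[0];
  // Normalize arguments to an array and evaluate them recursively.
  const args = ([] as any[]).concat(rule[op]).map((a) => applyRule(a, ctx));
  switch (op) {
    case "var": return ctx[args[0]];        // look up a context field
    case "==": return args[0] === args[1];  // equality test
    case ">": return args[0] > args[1];     // numeric comparison
    default: throw new Error(`unsupported operator: ${op}`);
  }
}

// Example condition: transition allowed only while the document is in review.
const condition = { "==": [{ var: "status" }, "PENDING_REVIEW"] };
```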
# Tasks: Unified Workflow Engine — Production Hardening & Integrated Context
**Input**: Design documents from `specs/003-unified-workflow-engine/`
**Prerequisites**: plan.md ✅ | spec.md ✅ | data-model.md ✅ | research.md ✅ | contracts/ ✅ | quickstart.md ✅
**Tests**: Included for business-critical paths (per plan.md Test Plan)
**Organization**: Tasks grouped by user story (US1–US5) enabling independent implementation and testing.
## Format: `[ID] [P?] [Story] Description`
- **[P]**: Can run in parallel (different files, no shared dependencies)
- **[Story]**: Which user story this task belongs to
- **Exact file paths** included in all descriptions
---
## Phase 1: Setup (Schema Deltas — DB Prerequisites)
**Purpose**: Create and apply schema changes that ALL subsequent code depends on. No code changes until Phase 1 is complete.
**⚠️ MUST apply to DB before writing any entity code**
- [ ] T001 Create `specs/03-Data-and-Storage/deltas/09-add-version-no-to-workflow-instances.sql` — `ALTER TABLE workflow_instances ADD COLUMN version_no INT NOT NULL DEFAULT 1` with `idx_wf_inst_version` index (per data-model.md §1 Delta 09)
- [ ] T002 Create `specs/03-Data-and-Storage/deltas/10-add-action-by-user-uuid-to-workflow-histories.sql` — `ALTER TABLE workflow_histories ADD COLUMN action_by_user_uuid VARCHAR(36) NULL` (per data-model.md §1 Delta 10)
- [ ] T003 Apply Delta 09 to MariaDB: `source specs/03-Data-and-Storage/deltas/09-add-version-no-to-workflow-instances.sql` — verify with `DESCRIBE workflow_instances`
- [ ] T004 Apply Delta 10 to MariaDB: `source specs/03-Data-and-Storage/deltas/10-add-action-by-user-uuid-to-workflow-histories.sql` — verify with `DESCRIBE workflow_histories`
**Checkpoint**: Run `DESCRIBE workflow_instances` and `DESCRIBE workflow_histories` — both new columns must be present before Phase 2 begins.
---
## Phase 2: Foundational (Entity & Module Setup — Blocking Prerequisites)
**Purpose**: Entity/DTO/module changes that ALL user story implementations depend on. No user story work until Phase 2 is complete.
**⚠️ CRITICAL — blocks all phases 3+**
- [ ] T005 [P] Add `versionNo: number` column to `backend/src/modules/workflow-engine/entities/workflow-instance.entity.ts` — `@Column({ name: 'version_no', type: 'int', default: 1 })` (per data-model.md §2.1)
- [ ] T006 [P] Add `actionByUserUuid?: string` column to `backend/src/modules/workflow-engine/entities/workflow-history.entity.ts` — `@Column({ name: 'action_by_user_uuid', length: 36, nullable: true })` (per data-model.md §2.2)
- [ ] T007 [P] Add `actorUuid?: string` field to `backend/src/modules/workflow-engine/dto/workflow-history-item.dto.ts` with `@ApiPropertyOptional` decorator (per data-model.md §2.3)
- [ ] T008 Register `workflow_transitions_total` Counter and `workflow_transition_duration_ms` Histogram in `backend/src/modules/workflow-engine/workflow-engine.module.ts` via `makeCounterProvider` / `makeHistogramProvider` from `@willsoto/nestjs-prometheus` (per data-model.md §4, plan.md Phase B5)
- [ ] T009 [P] Verify backend TypeScript compiles with no errors after T005–T008: `pnpm tsc --noEmit` in `backend/`
**Checkpoint**: `pnpm tsc --noEmit` passes in backend. Existing workflow-engine tests still pass: `pnpm test --testPathPattern=workflow-engine`.
---
## Phase 3: User Story 1 — Workflow Transition with State Integrity (P1) 🎯 MVP
**Goal**: Guarantee race-condition-free state transitions with optimistic lock, CASL-mapped DSL role checks, structured observability, BullMQ dead-letter queue, and file rollback on DB failure.
**Independent Test**: POST 50 concurrent APPROVE requests on one instance → exactly 1 success (200) + 49 conflicts (409). Transition log entry appears for each outcome. Redlock metric increments.
### Implementation — US1 Core: Optimistic Lock
- [ ] T010 [US1] Update `processTransition()` signature in `backend/src/modules/workflow-engine/workflow-engine.service.ts` — add `userUuid: string` and `clientVersionNo?: number` parameters (per data-model.md §3, quickstart.md)
- [ ] T011 [US1] Add fast-fail optimistic lock check in `processTransition()` BEFORE Redlock acquisition: read `instance.versionNo`, compare with `clientVersionNo`, throw `ConflictException('WORKFLOW_VERSION_CONFLICT')` HTTP 409 on mismatch (per data-model.md §3 "Fast-fail check")
- [ ] T012 [US1] Add CAS version increment inside DB transaction in `processTransition()`: `UPDATE workflow_instances SET version_no = version_no + 1 WHERE id = :id AND version_no = :expected` — throw `ConflictException` if `affected === 0` (per data-model.md §3 "Version increment")
- [ ] T013 [US1] Populate `actionByUserUuid: userUuid` when creating `WorkflowHistory` record inside `processTransition()` (per data-model.md §3 "History creation")
- [ ] T014 [US1] Return `versionNo` (post-increment value) in the transition response DTO so clients can update their local version
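The T010–T014 flow can be sketched against an in-memory row. This is illustrative only: in the real service the fast-fail check runs before Redlock acquisition, and the compare-and-swap is the SQL `UPDATE ... WHERE id = :id AND version_no = :expected` inside a transaction, rejected when `affected === 0`.

```typescript
// Minimal sketch of the optimistic-lock transition (T011 fast-fail + T012 CAS).
class ConflictException extends Error {
  constructor(public code: string) { super(code); }
}

interface InstanceRow { id: number; state: string; versionNo: number; }

function transition(row: InstanceRow, toState: string, clientVersionNo: number): number {
  // Fast-fail check before any lock acquisition (T011).
  if (row.versionNo !== clientVersionNo) {
    throw new ConflictException("WORKFLOW_VERSION_CONFLICT");
  }
  // CAS increment: re-read at commit time, since the row may have changed
  // between the fast-fail read and the transaction (T012).
  if (row.versionNo !== clientVersionNo) {
    throw new ConflictException("WORKFLOW_VERSION_CONFLICT");
  }
  row.state = toState;
  row.versionNo = clientVersionNo + 1;
  return row.versionNo; // post-increment value returned to the client (T014)
}
```

In-process the two checks collapse into one; in SQL they are distinct because other transactions can commit in between.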
### Implementation — US1: CASL DSL Role Mapping (FR-002a)
- [ ] T015 [US1] Add `DSL_ROLE_TO_CASL` config map constant in `backend/src/modules/workflow-engine/guards/workflow-transition.guard.ts`: map `Superadmin → system.manage_all`, `OrgAdmin → organization.manage_users`, `ContractMember → contract.view`, `AssignedHandler → __assigned__` (per research.md Decision 2, quickstart.md)
- [ ] T016 [US1] Add DSL role resolution step in `WorkflowTransitionGuard.canActivate()`: load compiled definition from instance, extract `require.role[]` for `currentState`, map each via `DSL_ROLE_TO_CASL`, check `userPermissions.includes(mapped)` — pass if any match; fall through to existing Level 3 check for `__assigned__` (per plan.md Phase B4, quickstart.md "DSL Role Mapping" pattern)
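The T015/T016 mapping can be sketched as a pure function. The permission strings come from the task descriptions above; the `isAssignedUser` flag stands in for the guard's existing Level 3 assigned-user check.

```typescript
// DSL role → CASL permission map (T015). "__assigned__" is a sentinel that
// falls through to the assigned-user check rather than a permission lookup.
const DSL_ROLE_TO_CASL: Record<string, string> = {
  Superadmin: "system.manage_all",
  OrgAdmin: "organization.manage_users",
  ContractMember: "contract.view",
  AssignedHandler: "__assigned__",
};

// Pass if ANY required role resolves to a permission the user holds (T016).
function dslRolesAllow(
  requiredRoles: string[],
  userPermissions: string[],
  isAssignedUser: boolean,
): boolean {
  return requiredRoles.some((role) => {
    const mapped = DSL_ROLE_TO_CASL[role];
    if (mapped === "__assigned__") return isAssignedUser; // Level-3 fallback
    return mapped !== undefined && userPermissions.includes(mapped);
  });
}
```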
### Implementation — US1: Structured Observability (FR-022, FR-023)
- [ ] T017 [US1] Inject `workflow_transitions_total` Counter and `workflow_transition_duration_ms` Histogram via `@InjectMetric()` in `WorkflowEngineService` constructor (per data-model.md §4)
- [ ] T018 [US1] Wrap `processTransition()` body in `startMs = Date.now()` timer; add `try/catch/finally` block that: labels `outcome` from exception type, calls `transitionDuration.labels({workflow_code}).observe(durationMs)`, calls `transitionsTotal.labels({workflow_code, action, outcome}).inc()`, emits structured `this.logger.log(JSON.stringify({instanceId, action, fromState, toState, userUuid, durationMs, outcome, workflowCode}))` (per data-model.md §4, FR-022/023)
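The outcome labelling in T018 can be sketched with plain functions. The exception class names stand in for the NestJS/ADR-007 exception types; the real code injects prom-client `Counter`/`Histogram` instances via `@InjectMetric()` rather than a `record` callback.

```typescript
// Illustrative outcome labelling and timing for the T018 wrapper.
type Outcome = "success" | "conflict" | "forbidden" | "validation_error" | "system_error";

function outcomeFromError(err: unknown): Outcome {
  const name = (err as Error)?.constructor?.name ?? "";
  if (name === "ConflictException") return "conflict";
  if (name === "ForbiddenException") return "forbidden";
  if (name === "ValidationException") return "validation_error";
  return "system_error";
}

// Times the transition body and reports exactly one (outcome, duration) pair,
// mirroring the try/catch structure the task describes.
async function timed<T>(
  fn: () => Promise<T>,
  record: (outcome: Outcome, durationMs: number) => void,
): Promise<T> {
  const startMs = Date.now();
  try {
    const result = await fn();
    record("success", Date.now() - startMs);
    return result;
  } catch (err) {
    record(outcomeFromError(err), Date.now() - startMs);
    throw err;
  }
}
```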
### Implementation — US1: BullMQ Dead-Letter Queue (FR-005, FR-006)
- [ ] T019 [US1] Register `workflow-events-failed` queue in `backend/src/modules/workflow-engine/workflow-engine.module.ts` — inject via `BullModule.registerQueue({ name: 'workflow-events-failed' })` (per plan.md Phase B7)
- [ ] T020 [US1] Add `@OnWorkerEvent('failed')` handler `onJobFailed(job, error)` in `backend/src/modules/workflow-engine/workflow-event.service.ts`: if `job.attemptsMade >= job.opts.attempts`, add job to `workflow-events-failed` queue; if `N8N_WEBHOOK_URL` env var set, POST JSON payload via `fetch`; else `logger.warn('N8N_WEBHOOK_URL not configured')` (per data-model.md §6, research.md Decision 5)
- [ ] T021 [US1] Verify worker default options in `workflow-engine.module.ts` have `concurrency: 5`, `attempts: 3`, `backoff: { type: 'exponential', delay: 500 }`, `removeOnFail: false` (per FR-005, plan.md Phase B7)
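The dead-letter decision in T020 reduces to a small predicate over the fields the handler reads (`attemptsMade`, `opts.attempts`). A sketch, with the side effects represented as action strings rather than real queue/`fetch` calls:

```typescript
// A job is dead-lettered only after its final attempt; earlier failures are
// left to BullMQ's retry/backoff (FR-005).
interface FailedJob { attemptsMade: number; opts: { attempts: number } }

function isFinalFailure(job: FailedJob): boolean {
  return job.attemptsMade >= job.opts.attempts;
}

// Returns the actions the handler would take, in order (illustration only;
// the real handler enqueues to workflow-events-failed and POSTs via fetch).
function deadLetterActions(job: FailedJob, n8nWebhookUrl?: string): string[] {
  if (!isFinalFailure(job)) return []; // BullMQ retries with backoff
  const actions = ["enqueue:workflow-events-failed"];
  actions.push(
    n8nWebhookUrl ? `webhook:${n8nWebhookUrl}` : "warn:N8N_WEBHOOK_URL not configured",
  );
  return actions;
}
```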
### Implementation — US1: File Rollback on DB Failure (FR-019)
- [ ] T022 [US1] In `processTransition()` `catch` block, after `queryRunner.rollbackTransaction()`, call `storageService.moveToTemp(attachmentPublicIds)` when `attachmentPublicIds` is non-empty — log rollback with attachment IDs for audit (per plan.md Phase B8, FR-019)
- [ ] T023 [US1] Inject `StorageService` (or `FileStorageService`) into `WorkflowEngineService` constructor for rollback call — add to `workflow-engine.module.ts` imports if not already present
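The T022 compensation path can be sketched with a stand-in for the injected storage service. `StorageLike` and `commitOrRollback` are illustrative names; the real code calls `storageService.moveToTemp()` after `queryRunner.rollbackTransaction()`.

```typescript
// If the DB commit fails after files reached permanent storage, move every
// pending attachment back to temp so the user can retry within the TTL (FR-019).
interface StorageLike { moveToTemp(publicIds: string[]): void }

function commitOrRollback(
  commit: () => void,
  attachmentPublicIds: string[],
  storage: StorageLike,
): "committed" | "rolled_back" {
  try {
    commit();
    return "committed";
  } catch {
    if (attachmentPublicIds.length > 0) storage.moveToTemp(attachmentPublicIds);
    return "rolled_back";
  }
}
```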
### Tests — US1
- [ ] T024 [P] [US1] Write unit test in `backend/src/modules/workflow-engine/workflow-engine.service.spec.ts` — concurrent optimistic lock: mock two simultaneous calls with same `clientVersionNo`, assert first resolves success and second throws `ConflictException` with code `WORKFLOW_VERSION_CONFLICT`
- [ ] T025 [P] [US1] Write unit test in `backend/src/modules/workflow-engine/guards/workflow-transition.guard.spec.ts` — DSL role CASL mapping: assert `Superadmin` maps to `system.manage_all` pass, `OrgAdmin` with matching org passes, unknown role falls through to assignedUserId check
- [ ] T026 [P] [US1] Write unit test for `onJobFailed` in `workflow-event.service.ts` — assert `workflow-events-failed` queue receives dead-letter job and `fetch` is called with correct payload when `N8N_WEBHOOK_URL` is set; assert `logger.warn` when unset
**Checkpoint**: `pnpm test --testPathPattern=workflow-engine --coverage` — T024/T025/T026 green. Concurrent lock test passes.
---
## Phase 4: User Story 2 — Integrated Banner & Workflow Lifecycle View (P1)
**Goal**: All four document detail pages (RFA, Transmittal, Circulation, Correspondence) display live `workflowState`, `availableActions`, and priority badge with no navigation required for approval.
**Independent Test**: Open each detail page while a workflow instance is in `PENDING_REVIEW` — banner shows correct state + action buttons; Workflow Engine tab shows step timeline with active step highlighted in indigo + pulse animation.
### Implementation — US2: Correspondence Backend Gap-Fill
- [ ] T027 [US2] Update `backend/src/modules/correspondence/correspondence.service.ts` `findOneByUuid()` — call `workflowEngineService.getInstanceByEntity('correspondence', correspondence.uuid)` and expose `workflowInstanceId`, `workflowState`, `availableActions` in the response (same pattern as Transmittal/Circulation per v1.8.7 memory)
- [ ] T028 [US2] Update `backend/src/modules/correspondence/correspondence.module.ts` — import `WorkflowEngineModule` if not already imported
### Implementation — US2: Frontend Module Gap-Fill (all 4 modules)
- [ ] T029 [P] [US2] Gap-fill `frontend/app/(admin)/admin/doc-control/correspondence/[uuid]/page.tsx` — wire live `workflowInstanceId`, `workflowState`, `availableActions`, `workflowPriority` into `<IntegratedBanner>` and `<WorkflowLifecycle>` components; update Correspondence type in `frontend/types/` to include workflow fields
- [ ] T030 [P] [US2] Gap-fill `frontend/app/(admin)/admin/doc-control/rfa/[uuid]/page.tsx` — connect missing `availableActions` and `workflowPriority` props to `<IntegratedBanner>`; ensure `<WorkflowLifecycle>` receives live `instanceId`
- [ ] T031 [P] [US2] Gap-fill `frontend/app/(admin)/admin/doc-control/transmittals/[uuid]/page.tsx` — add step-attachment upload zone props (`canUpload` flag computed from `currentState ∈ {PENDING_REVIEW, PENDING_APPROVAL}` AND user is assigned/org-admin/superadmin)
- [ ] T032 [P] [US2] Gap-fill `frontend/app/(admin)/admin/doc-control/circulation/[uuid]/page.tsx` — same step-attachment upload zone props as T031
- [ ] T033 [US2] Update `frontend/types/correspondence.ts` (or equivalent) — add `workflowInstanceId?: string`, `workflowState?: string`, `availableActions?: string[]`, `workflowPriority?: 'URGENT' | 'HIGH' | 'MEDIUM' | 'LOW'` (ADR-019: string UUIDs only, no parseInt)
### Tests — US2
- [ ] T034 [P] [US2] Verify `pnpm tsc --noEmit` in `frontend/` passes after T029–T033 — all four detail pages type-check correctly
**Checkpoint**: All four detail pages render `<IntegratedBanner>` with live data. Switch a document to `PENDING_REVIEW` — banner shows correct action buttons without page navigation.
---
## Phase 5: User Story 3 — Step-specific Attachments with Preview (P1)
**Goal**: Users in `PENDING_REVIEW` / `PENDING_APPROVAL` states can upload files via drag-and-drop, attached atomically to the workflow step. All users can preview PDFs/images inline without navigation.
**Independent Test**: Upload a PDF during `PENDING_REVIEW` → click Approve → history timeline shows the file chip → click chip → preview modal opens inline. Force-fail DB transaction → file appears back in temp, permanent storage unchanged.
### Implementation — US3: File Preview Modal (FR-020)
- [ ] T035 [P] [US3] Create `frontend/components/workflow/file-preview-modal.tsx` — shadcn/ui `Dialog` component; accepts `attachment: WorkflowAttachmentSummary | null` and `onClose: () => void` props; renders `<iframe src="/api/files/{publicId}/preview" />` for PDFs; `<img>` for image MIME types; download link fallback for other types (per plan.md Phase F1, quickstart.md "File Preview Modal")
- [ ] T036 [P] [US3] Add `WorkflowAttachmentSummary` interface to `frontend/types/workflow.ts` if not present: `{ publicId: string; originalFilename: string; mimeType: string; fileSize: number; createdAt: string }` (ADR-019: `publicId` only, no `id` or `uuid` alias)
### Implementation — US3: Step-Attachment Upload Zone (FR-014–FR-019)
- [ ] T037 [US3] Update `frontend/components/workflow/integrated-banner.tsx` — add conditional upload zone rendered only when `props.currentState ∈ {PENDING_REVIEW, PENDING_APPROVAL}` AND `props.canUpload === true`; upload calls existing Two-Phase upload endpoint; appends returned `publicId` to `pendingAttachmentIds` state; passes `pendingAttachmentIds` to action button handler (per plan.md Phase F2)
- [ ] T038 [US3] Update `frontend/components/workflow/workflow-lifecycle.tsx` — for each history item render `attachments[]` as clickable file chips; on chip click open `<FilePreviewModal>`; import and use `FilePreviewModal` from T035 (per plan.md Phase F2)
- [ ] T039 [US3] Update `frontend/hooks/use-workflow-action.ts` — accept `attachmentPublicIds: string[]` parameter; include in POST body to `/workflow-engine/instances/:id/transition`; include `versionNo` from current instance state; on HTTP 409 show toast "เอกสารถูกอนุมัติโดยผู้อื่นแล้ว กรุณารีเฟรช" ("This document was already approved by someone else; please refresh"); on 503 show toast "ระบบยุ่งชั่วคราว กรุณาลองใหม่" ("The system is temporarily busy; please try again") (per quickstart.md "Optimistic Lock — Client Side")
- [ ] T040 [US3] Update `backend/src/modules/workflow-engine/workflow-engine.controller.ts` — ensure `POST /instances/:id/transition` accepts `Idempotency-Key` header and passes `userUuid` (from JWT) and `clientVersionNo` to `processTransition()` (per contracts/workflow-transition.yaml)
- [ ] T041 [US3] Verify `WorkflowHistoryItemDto` exposes `attachments: AttachmentSummaryDto[]` in the history list endpoint response — update `getHistory()` method in `workflow-engine.service.ts` to eagerly load `attachments` relation per `workflow_history_id` (per data-model.md §3, FR-014)
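The T039 status-to-toast mapping can be sketched as a pure helper (illustrative; the real hook lives in `use-workflow-action.ts` and would feed these strings to a toast library). The Thai strings are the ones specified in the task; English glosses are in the comments.

```typescript
// Map a failed transition's HTTP status to the user-facing toast message.
function toastForTransitionError(status: number): string | null {
  if (status === 409) {
    // "This document was already approved by someone else. Please refresh."
    return "เอกสารถูกอนุมัติโดยผู้อื่นแล้ว กรุณารีเฟรช";
  }
  if (status === 503) {
    // "The system is temporarily busy. Please try again."
    return "ระบบยุ่งชั่วคราว กรุณาลองใหม่";
  }
  return null; // other statuses fall through to the generic error handler
}
```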
### Tests — US3
- [ ] T042 [P] [US3] Write unit test in `backend/src/modules/workflow-engine/workflow-engine.service.spec.ts` — file rollback: mock `queryRunner.commitTransaction()` to throw; assert `storageService.moveToTemp()` is called with the correct `attachmentPublicIds` (per plan.md Test Plan)
- [ ] T043 [P] [US3] Write Vitest component test in `frontend/components/workflow/__tests__/file-preview-modal.test.tsx` — assert PDF renders `<iframe>`, image MIME type renders `<img>`, unsupported type renders download link, `onClose` called on dialog dismiss
**Checkpoint**: Upload a PDF on a document in `PENDING_REVIEW` → approve → check `workflow_histories` record has matching `workflow_history_id` in `attachments` table. Click the file chip → modal opens inline.
---
## Phase 6: User Story 4 — DSL Versioning & Instance Binding (P2)
**Goal**: Super Admins can activate new DSL versions; in-progress workflow instances continue on their bound definition version; Redis cache invalidates within 1 second of activation (SC-005).
**Independent Test**: Activate DSL v2 while v1 has an in-progress instance → existing instance still uses v1 DSL transitions; new instance created after activation uses v2.
### Implementation — US4: DSL Redis Cache Invalidation (FR-007, SC-005)
- [ ] T044 [US4] In `workflow-engine.service.ts` `createDefinition()` — after `workflowDefRepo.save()`, call `cacheManager.set('wf:def:${code}:${version}', saved, 3600000)` (1h TTL in ms) (per data-model.md §5, research.md Decision 4)
- [ ] T045 [US4] In `workflow-engine.service.ts` `update()` — before save, call `cacheManager.del('wf:def:${code}:${oldVersion}')` when DSL changes; when `is_active` toggles to `true`, call `redis.del('wf:def:${code}:active')` then set updated pointer; when `is_active` toggles to `false`, call `redis.del('wf:def:${code}:active')` (per data-model.md §5 "Invalidation triggers")
- [ ] T046 [US4] Add read-through cache in `getDefinitionById()`: call `cacheManager.get('wf:def:${id}')` first; fall back to `workflowDefRepo.findOne()` on miss; store result in cache before returning (per research.md Decision 4)
- [ ] T047 [US4] Verify `createInstance()` always uses latest active definition from DB (not cache) to prevent stale binding — confirm `findOne({ where: { workflow_code, is_active: true }, order: { version: 'DESC' } })` pattern is authoritative (per FR-010)
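The T044/T045 key scheme can be sketched against a plain `Map` (the real code uses `cacheManager`/Redis with a 1-hour TTL; TTL handling is omitted here):

```typescript
// Cache keys: wf:def:{code}:{version} for compiled DSLs, plus an "active"
// pointer per workflow code that is dropped on every activation toggle.
const cache = new Map<string, unknown>();

const defKey = (code: string, version: number) => `wf:def:${code}:${version}`;
const activeKey = (code: string) => `wf:def:${code}:active`;

// T044: cache the compiled DSL after save.
function onDefinitionSaved(code: string, version: number, compiled: unknown): void {
  cache.set(defKey(code, version), compiled);
}

// T045: always drop the stale active pointer; repoint only when activating.
function onActivationToggled(code: string, version: number, isActive: boolean): void {
  cache.delete(activeKey(code));
  if (isActive) cache.set(activeKey(code), version);
}
```

Dropping the pointer before repointing is what bounds staleness to a single operation and supports the SC-005 one-second invalidation target.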
### Tests — US4
- [ ] T048 [P] [US4] Write unit test in `workflow-engine.service.spec.ts` — DSL activate cache invalidation: mock `cacheManager.del`, call `update({ is_active: true })`, assert `cacheManager.del` called with correct key within the same tick (per plan.md Test Plan)
**Checkpoint**: Activate DSL v2 via `PATCH /workflow-engine/definitions/:id` → Redis key `wf:def:{code}:active` updated immediately. In-progress v1 instance transitions still resolve against v1 compiled DSL.
---
## Phase 7: User Story 5 — Workflow Definition Authoring (Super Admin) (P2)
**Goal**: Super Admins can list, create, edit (JSON editor with inline validation), activate, and deactivate DSL definitions from an Admin UI page without touching the API directly.
**Independent Test**: Log in as Super Admin → navigate to `/admin/workflows/definitions` → create a new definition with an invalid DSL → see inline validation error before saving → fix → save → new definition appears in list.
### Implementation — US5: Backend `/validate` Endpoint (FR-025)
- [ ] T049 [US5] Add `POST /workflow-engine/definitions/validate` endpoint to `backend/src/modules/workflow-engine/workflow-engine.controller.ts` — accepts `{ dsl: object }`, calls `dslService.compile(dto.dsl)` in try/catch, returns `{ valid: true }` or `{ valid: false, errors: [{ path, message }] }` (per contracts/workflow-definitions.yaml, FR-025)
### Implementation — US5: TanStack Query Hooks
- [ ] T050 [P] [US5] Create `frontend/hooks/use-workflow-definitions.ts` — `useWorkflowDefinitions()` (GET list), `useWorkflowDefinition(id)` (GET single), `useCreateDefinition()` (POST mutation), `useUpdateDefinition()` (PATCH mutation), `useValidateDsl()` (POST validate mutation) — all using TanStack Query v5 patterns (per quickstart.md)
### Implementation — US5: Admin DSL List Page
- [ ] T051 [US5] Create `frontend/app/(admin)/admin/workflows/definitions/page.tsx` — Server Component shell + Client Component table; columns: `workflow_code`, `version`, `is_active` badge, created date, Actions (Edit link, Activate/Deactivate toggle button); uses `useWorkflowDefinitions()` hook; Activate/Deactivate calls `useUpdateDefinition()` mutation with `{ is_active: true/false }`; requires `system.manage_all` permission (CASL guard on page) (per plan.md Phase F4, FR-024)
### Implementation — US5: Admin DSL Editor Page
- [ ] T052 [US5] Create `frontend/app/(admin)/admin/workflows/definitions/[id]/page.tsx` — loads definition via `useWorkflowDefinition(id)`; renders Monaco Editor via `dynamic(() => import('@monaco-editor/react'), { ssr: false })`; `onChange` handler debounced 800ms calls `useValidateDsl()` mutation; displays validation errors as inline error list below editor; Save button disabled when `validationErrors.length > 0` (FR-025); on Save calls `useUpdateDefinition()` and shows success toast; i18n keys for all UI text (per research.md Decision 6, quickstart.md "Admin DSL Editor")
- [ ] T053 [US5] Create `frontend/app/(admin)/admin/workflows/definitions/new/page.tsx` — same editor as T052 but calls `useCreateDefinition()` mutation; `workflow_code` input field with validation; redirect to list page on success
### Tests — US5
- [ ] T054 [P] [US5] Write Vitest test for `frontend/app/(admin)/admin/workflows/definitions/[id]/page.tsx` — assert Save button is disabled when validation errors present; assert Save button enabled when `validationErrors` is empty; assert `useValidateDsl` is called on editor change (per plan.md Test Plan)
**Checkpoint**: Navigate to `/admin/workflows/definitions` — list renders all definitions. Click Edit → Monaco editor loads definition DSL. Paste invalid DSL → Save button disables and errors display inline. Fix DSL → Save enabled → save succeeds.
---
## Phase 8: Polish & Cross-Cutting Concerns
**Purpose**: i18n coverage, SC-009 verification, and spec compliance checks across all user stories.
- [ ] T055 [P] Audit all new UI text in `frontend/components/workflow/` and `frontend/app/(admin)/admin/workflows/` — replace any hardcoded Thai/English strings with i18n keys; add missing keys to `frontend/public/locales/th/` and `frontend/public/locales/en/` translation files (FR-021)
- [ ] T056 [P] Run full backend test suite: `pnpm test --coverage` in `backend/` — confirm no regressions; coverage ≥ 70% overall, ≥ 80% on `workflow-engine.service.ts` business logic (per plan.md Test Plan)
- [ ] T057 [P] Run full frontend typecheck: `pnpm tsc --noEmit` in `frontend/` — zero errors across all modified files
- [ ] T058 Verify SC-009 observability coverage: trigger one transition of each outcome type (success, conflict, forbidden, validation_error) and confirm structured log entries appear in the NestJS log output with all required fields (`instanceId`, `action`, `fromState`, `toState`, `userUuid`, `durationMs`, `outcome`, `workflowCode`)
- [ ] T059 Update `specs/003-unified-workflow-engine/spec.md` Status field from `Draft` to `Implemented` after all phases complete
---
## Dependencies & Execution Order
### Phase Dependencies
- **Phase 1 (Setup)**: No dependencies — start immediately
- **Phase 2 (Foundational)**: Depends on Phase 1 DB columns applied — **BLOCKS Phases 3–7**
- **Phase 3 (US1)**: Depends on Phase 2 — can start as soon as entities compile
- **Phase 4 (US2)**: Depends on Phase 2 — independent of Phase 3 (different files)
- **Phase 5 (US3)**: Depends on Phase 3 (uses updated `processTransition` + `use-workflow-action`) and Phase 4 (upload zone sits inside `IntegratedBanner`)
- **Phase 6 (US4)**: Depends on Phase 2 — independent of US1/US2/US3
- **Phase 7 (US5)**: Depends on Phase 6 (T049 validate endpoint, T044 cache) — `/validate` endpoint needed for editor inline feedback
- **Phase 8 (Polish)**: Depends on all phases complete
### User Story Dependencies
- **US1 (P1)**: Starts after Phase 2 — no US dependencies
- **US2 (P1)**: Starts after Phase 2 — no US dependencies (parallel with US1)
- **US3 (P1)**: Starts after US1 (T039 needs updated hook signature) and US2 (upload zone in banner)
- **US4 (P2)**: Starts after Phase 2 — independent (parallel with US1/US2)
- **US5 (P2)**: Starts after US4 (T049 validate endpoint depends on DSL cache from T044)
### Within Each Phase
- Schema before entities → entities before services → services before controllers → backend before frontend
- [P] tasks within a phase can run in parallel (different files)
---
## Parallel Execution Examples
### Phase 2 Parallel (T005–T007 run together)
```
T005: workflow-instance.entity.ts ← add versionNo
T006: workflow-history.entity.ts ← add actionByUserUuid
T007: workflow-history-item.dto.ts ← add actorUuid
```
### Phase 3 Parallel Groups
```
Group A (processTransition core): T010 → T011 → T012 → T013 → T014 (sequential)
Group B (guard): T015 → T016 (sequential, different file from Group A — parallel with Group A)
Group C (observability): T017 → T018 (different file — parallel with Groups A+B)
Group D (BullMQ): T019 → T020 → T021 (different service file — parallel with Groups A+B+C)
Tests: T024, T025, T026 (parallel with each other after Groups A+B+D complete)
```
### Phase 4 + Phase 6 Parallel (different feature areas)
```
Phase 4 (US2): T027–T034 — Correspondence backend + frontend gap-fill
Phase 6 (US4): T044–T048 — DSL cache invalidation
(Run simultaneously — no shared files)
```
---
## Implementation Strategy
### MVP Scope (US1 + US2 + US3 — all P1)
```
Phase 1 → Phase 2 → Phase 3 (US1) → Phase 4 (US2) → Phase 5 (US3) → Phase 8 Polish
```
Delivers: Race-condition-free transitions, live banner on all 4 modules, step-specific attachments with preview.
### Full Delivery (adds P2 stories)
```
MVP + Phase 6 (US4) + Phase 7 (US5)
```
Adds: Redis cache invalidation, Admin DSL editor.
### Suggested First Commit
After T001–T009 (schema + entities compile) → commit:
```
chore(schema): delta-09 version_no, delta-10 action_by_user_uuid (ADR-009)
feat(workflow-engine): add versionNo + actionByUserUuid entities + metrics registration (FR-002/003)
```
---
## Summary
| Phase | User Story | Tasks | Parallel Opportunities |
|-------|-----------|-------|----------------------|
| 1 — Setup | Schema | T001–T004 | T001+T002 parallel |
| 2 — Foundational | — | T005–T009 | T005+T006+T007 parallel |
| 3 — P1 US1 | Transition Integrity | T010–T026 | Guard + observability + BullMQ parallel; tests parallel |
| 4 — P1 US2 | Banner Gap-Fill | T027–T034 | T029+T030+T031+T032+T033 parallel |
| 5 — P1 US3 | Step Attachments | T035–T043 | T035+T036 parallel; tests parallel |
| 6 — P2 US4 | DSL Versioning | T044–T048 | T044+T046+T047 parallel |
| 7 — P2 US5 | Admin DSL Editor | T049–T054 | T050+T054 parallel |
| 8 — Polish | Cross-cutting | T055–T059 | T055+T056+T057 parallel |
| **Total** | | **59 tasks** | **~22 parallel opportunities** |
**MVP**: T001–T043 (43 tasks, Phases 1–5, all P1 stories)
**Full**: T001–T059 (59 tasks, all phases)
-- ============================================================
-- Delta 09: ADR-001 v1.1 — Optimistic Lock for Workflow Transitions
-- Adds version_no to workflow_instances for Optimistic Concurrency Control
-- ============================================================
-- Feature: 003-unified-workflow-engine (FR-002)
-- Date: 2026-05-03
-- Caution: Existing rows automatically receive DEFAULT 1 — no data loss
-- Rollback: ALTER TABLE workflow_instances DROP INDEX idx_wf_inst_version;
-- ALTER TABLE workflow_instances DROP COLUMN version_no;
ALTER TABLE workflow_instances
ADD COLUMN version_no INT NOT NULL DEFAULT 1
COMMENT 'Optimistic lock counter — incremented on every successful transition (ADR-001 v1.1 FR-002). Client sends current value; server rejects with 409 if mismatch.';
-- Index to support the CAS check: WHERE id = ? AND version_no = ?
CREATE INDEX idx_wf_inst_version
ON workflow_instances (id, version_no);
-- ============================================================
-- Delta 10: ADR-001 v1.1 / ADR-019 UUID Compliance
-- Adds action_by_user_uuid to workflow_histories
-- to expose the user's identity via the API without revealing the INT PK (ADR-019)
-- ============================================================
-- Feature: 003-unified-workflow-engine (FR-003)
-- Date: 2026-05-03
-- Caution: NULL for historical records created before this delta — acceptable
-- NULL in this context = "System Action" or "Pre-migration record"
-- Rollback: ALTER TABLE workflow_histories DROP COLUMN action_by_user_uuid;
ALTER TABLE workflow_histories
ADD COLUMN action_by_user_uuid VARCHAR(36) NULL
COMMENT 'UUID of the acting user — used in API responses instead of the INT FK (ADR-019). NULL = system action or pre-migration record';
**Status:** Accepted
**Date:** 2026-02-24
**Last Amended:** 2026-05-02
**Decision Makers:** Development Team, System Architect
**Related Documents:**
---
## Clarifications
### Session 2026-05-02 (Round 1 — ADR-001-add.md merge)
- Q: Event handling — Outbox Pattern or BullMQ (ADR-008)? → A: **BullMQ only** — WorkflowEngine enqueues the BullMQ job directly with no outbox table; consistent with ADR-008
- Q: Concurrency control — Optimistic Lock vs Redis Redlock vs separating the concerns? → A: **Separate concerns** — `version_no` optimistic lock for state transitions; Redis Redlock only for Document Numbering (ADR-002)
- Q: Context schema — where to validate, and at what scope? → A: **Two-phase validation** (save-time + transition-time); schema scoped **per `workflow_definition` version**
- Q: Condition Engine library? → A: **`json-logic-js` in-process** in `WorkflowDslService`; fallback to custom parser if production issues arise
- Q: Auto-action worker — extend the existing worker or use a dedicated queue? → A: **Dedicated `workflow-events` BullMQ queue**, separate from `notification-queue`
### Session 2026-05-02 (Round 2 — ADR-001 full review)
- Q: DDL gap — add `version_no` + `context_schema` to the DDL? → A: **yes** — `version_no INT NOT NULL DEFAULT 1` in `workflow_instances`; `context_schema JSON NULL` in `workflow_definitions`
- Q: ConflictException retry strategy? → A: **surface 409 to the frontend** via `BusinessException` (ADR-007); frontend shows a "Please try again" toast — no auto-retry
- Q: Redis cache TTL/invalidation strategy? → A: **TTL 1h + event invalidation** when an admin saves/activates a DSL; key `wf:def:{workflow_code}:{version}`
- Q: WorkflowEventsWorker concurrency/retry config? → A: **concurrency 5, retry 3 + exponential backoff + dead-letter queue**
- Q: RBAC for DSL authoring? → A: **Super Admins only** (`system.manage_all`) — create/update/activate/deactivate workflow definitions
### Session 2026-05-02 (Round 3 — ADR-019 compliance + ops)
- Q: `action_by_user_id INT NULL` in `workflow_histories` — ADR-019 compliance? → A: **Keep the INT FK + `@Exclude()`** on the entity; add `action_by_user_uuid VARCHAR(36) NULL` for API responses
- Q: `validateContext()` failure at transition time — which HTTP status? → A: **422 Unprocessable Entity** via `ValidationException` (ADR-007 validation tier), with field-level errors
- Q: Dead-letter queue `workflow-events-failed` — ops procedure? → A: **n8n webhook alert + Bull Board UI** for manual requeue
- Q: n8n webhook URL — where is it stored? → A: **`N8N_WEBHOOK_URL` environment variable** in `docker-compose.yml`; read via `ConfigService`
- Q: `context_schema.required` — actually enforced? → A: **Enforced strictly** — a missing required field → throw 422 `ValidationException`; the transition is rejected
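The ADR-019 answer above (keep the INT FK internal, expose only the UUID) can be sketched as a response serializer that drops the FK before anything leaves the API. The names below are illustrative stand-ins, not the project's actual entity code:

```typescript
// Hypothetical sketch of the ADR-019 intent: API responses carry the UUID,
// never the internal INT FK. Names are illustrative only.
interface WorkflowHistoryRow {
  id: number;
  action: string;
  actionByUserId: number | null;   // internal FK — must never leave the API
  actionByUserUuid: string | null; // NULL = system action or pre-migration record
}

function toApiResponse(row: WorkflowHistoryRow) {
  const { actionByUserId, ...rest } = row; // strip the INT FK
  return rest; // actionByUserUuid stays, including explicit null
}

const sample: WorkflowHistoryRow = {
  id: 1,
  action: 'APPROVE',
  actionByUserId: 42,
  actionByUserUuid: '550e8400-e29b-41d4-a716-446655440000',
};
const api = toApiResponse(sample);
```

In the real codebase the same effect comes from `@Exclude()` on the entity column; the destructure here just makes the contract visible.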
---
## Decision Drivers
- **DRY Principle:** Don't Repeat Yourself - reduce duplicated code
@@ -206,8 +235,9 @@ CREATE TABLE workflow_definitions (
workflow_code VARCHAR(50) NOT NULL,
version INT NOT NULL DEFAULT 1,
description TEXT NULL,
dsl JSON NOT NULL, -- Raw DSL from user
compiled JSON NOT NULL, -- Validated and optimized for Runtime
context_schema JSON NULL, -- JSON Schema for context validation (two-phase)
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
@@ -221,6 +251,7 @@ CREATE TABLE workflow_instances (
entity_type VARCHAR(50) NOT NULL, -- e.g. "correspondence", "rfa"
entity_id VARCHAR(50) NOT NULL,
current_state VARCHAR(50) NOT NULL,
version_no INT NOT NULL DEFAULT 1, -- Optimistic lock (@VersionColumn) — prevents race conditions
status ENUM('ACTIVE', 'COMPLETED', 'CANCELLED', 'TERMINATED') DEFAULT 'ACTIVE',
context JSON NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
@@ -235,7 +266,8 @@ CREATE TABLE workflow_histories (
from_state VARCHAR(50) NOT NULL,
to_state VARCHAR(50) NOT NULL,
action VARCHAR(50) NOT NULL,
action_by_user_id INT NULL, -- Internal FK (@Exclude() in Entity) — never expose in the API
action_by_user_uuid VARCHAR(36) NULL, -- UUID for API responses (ADR-019)
comment TEXT NULL,
metadata JSON NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
@@ -250,6 +282,14 @@ CREATE TABLE workflow_histories (
"workflow": "CORRESPONDENCE_ROUTING",
"version": 1,
"description": "Standard correspondence routing",
"context_schema": {
"type": "object",
"properties": {
"requiresLegal": { "type": "number" },
"hasRecipient": { "type": "boolean" }
},
"required": []
},
"states": [
{
"name": "DRAFT",
@@ -261,7 +301,10 @@ CREATE TABLE workflow_histories (
"role": ["Admin"],
"user": "123"
},
"condition": "context.requiresLegal > 0",
"condition": {
"type": "json-logic",
"rule": { ">": [{ "var": "requiresLegal" }, 0] }
},
"events": [
{
"type": "notify",
@@ -299,6 +342,8 @@ CREATE TABLE workflow_histories (
}
```
> **⚠️ Note:** `condition` must use the JSON Logic format (`{ "type": "json-logic", "rule": {...} }`) only — JS string expressions (`"context.x === true"`) are forbidden because they are a code-injection security risk
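The production engine evaluates such rules with `json-logic-js` (`jsonLogic.apply(rule, data)`). As a self-contained illustration of what the example rule means, here is a minimal hand-rolled evaluator covering only the two operators used above — a sketch, not the real library:

```typescript
// Minimal JSON Logic evaluation sketch for the "var" and ">" operators
// used in the example rule. The real engine is json-logic-js.
type Rule = { '>': [Rule, Rule] } | { var: string } | number | boolean;

function evaluate(rule: Rule, ctx: Record<string, unknown>): unknown {
  if (typeof rule === 'number' || typeof rule === 'boolean') return rule; // literal
  if ('var' in rule) return ctx[rule.var]; // variable lookup in the instance context
  const [a, b] = rule['>']; // the only other operator this sketch supports
  return Number(evaluate(a, ctx)) > Number(evaluate(b, ctx));
}

const rule: Rule = { '>': [{ var: 'requiresLegal' }, 0] };
```

With `{ requiresLegal: 2 }` the rule evaluates truthy, gating the transition exactly as the string condition used to — but without any string eval.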
### NestJS Module Structure
```typescript
@@ -325,6 +370,11 @@ export class WorkflowEngineService {
order: { version: 'DESC' },
});
// Validate initial context against context_schema (save-time phase 1)
if (definition.compiled.contextSchema) {
this.dslService.validateContext(initialContext, definition.compiled.contextSchema);
}
// Initial state directly from compiled DSL
const initialState = definition.compiled.initialState;
@@ -333,6 +383,7 @@ export class WorkflowEngineService {
entityType,
entityId,
currentState: initialState,
versionNo: 1, // TypeORM @VersionColumn — optimistic lock
status: WorkflowStatus.ACTIVE,
context: initialContext,
});
@@ -345,19 +396,46 @@ export class WorkflowEngineService {
comment?: string,
payload: Record<string, unknown> = {}
) {
// Validate context values against schema (transition-time phase 2)
if (definition.compiled.contextSchema) {
this.dslService.validateContext(instance.context, definition.compiled.contextSchema);
}
// Evaluation via WorkflowDslService (uses json-logic-js in-process)
const evaluation = this.dslService.evaluate(compiled, instance.currentState, action, context);
// Optimistic lock: update state only if current_state + version_no match
// ❌ No Redis Redlock in workflow transitions (Redlock is only for Document Numbering, ADR-002)
const updated = await this.instanceRepo
.createQueryBuilder()
.update(WorkflowInstance)
.set({
currentState: evaluation.nextState,
versionNo: () => 'version_no + 1',
})
.where('id = :id AND current_state = :state AND version_no = :ver', {
id: instance.id,
state: instance.currentState,
ver: instance.versionNo,
})
.execute();
if (updated.affected === 0) {
throw new ConflictException('Concurrent transition detected — please retry');
}
if (compiled.states[evaluation.nextState].terminal) {
instance.status = WorkflowStatus.COMPLETED;
}
// Dispatch events async via dedicated BullMQ queue 'workflow-events' (ADR-008)
// ❌ Never dispatch events synchronously on the request thread
if (evaluation.events && evaluation.events.length > 0) {
await this.workflowEventsQueue.add('dispatch', {
instanceId: instance.id,
events: evaluation.events,
context,
});
}
}
}
@@ -365,6 +443,80 @@ export class WorkflowEngineService {
---
## 🏭 Production Architecture
### Runtime Flow
```
[ API / Service Layer ]
[ WorkflowEngineService ]
- validate context (two-phase: save-time + transition-time)
- evaluate condition (json-logic-js in-process, WorkflowDslService)
- optimistic lock: UPDATE WHERE current_state = ? AND version_no = ?
- write workflow_histories
- enqueue BullMQ job → queue: 'workflow-events'
[ DB (workflow_instances + workflow_histories) ]
↓ (async, dedicated queue)
[ WorkflowEventsWorker (BullMQ: 'workflow-events') ]
┌───────────────┐
│ n8n │ (webhook / notification dispatch)
└───────────────┘
```
### Production Rules (Non-Negotiable)
| # | Rule | Detail |
|---|------|--------|
| 1 | **Source of Truth** | Workflow state lives in the DB only — never keep state in memory or cache |
| 2 | **Deterministic Execution** | Every transition MUST be declared in the DSL — no dynamic transitions |
| 3 | **No Inline Code Execution** | Conditions MUST use the JSON Logic format — no JS string eval |
| 4 | **Async Side Effects** | Every event MUST go through the BullMQ `workflow-events` queue — no sync dispatch |
| 5 | **Idempotency** | Transitions MUST be safe to retry — the optimistic lock prevents double-apply |
| 6 | **Instance Isolation** | In-progress instances keep their original `workflow_definition` version — never rebind |
### Concurrency Control (separated concerns)
| Concern | Mechanism | Scope |
|---------|-----------|-------|
| Workflow state transition | `version_no` optimistic lock (TypeORM `@VersionColumn`) | `workflow_instances` table |
| Document Numbering | Redis Redlock (ADR-002) | Number generation only |
> ❌ **Never use Redis Redlock in the workflow transition layer** — Redlock is only for Document Numbering
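The guarded `UPDATE … WHERE current_state = ? AND version_no = ?` can be sketched as a compare-and-swap over an in-memory row — illustrative only, since in production the check runs atomically inside the database:

```typescript
// Sketch of the optimistic-lock transition: succeed only when both the
// expected state and version still match, bumping version_no in the same step.
interface InstanceRow { id: number; currentState: string; versionNo: number }

function tryTransition(
  row: InstanceRow,
  expectedState: string,
  expectedVersion: number,
  nextState: string,
): number {
  // Mirrors: UPDATE ... SET current_state = :next, version_no = version_no + 1
  //          WHERE id = :id AND current_state = :state AND version_no = :ver
  if (row.currentState !== expectedState || row.versionNo !== expectedVersion) {
    return 0; // affected = 0 → caller throws ConflictException (409)
  }
  row.currentState = nextState;
  row.versionNo += 1;
  return 1; // affected = 1 → transition applied exactly once
}

const row: InstanceRow = { id: 7, currentState: 'DRAFT', versionNo: 1 };
```

A second caller holding the stale `(DRAFT, version 1)` snapshot gets `affected = 0` and surfaces the 409, which is exactly how double-apply is prevented.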
### Condition Engine
- **Library:** `json-logic-js` (npm) — evaluated in-process in `WorkflowDslService`
- **Fallback:** migrate to a custom parser if performance or complexity problems appear in production
- **Forbidden:** arbitrary JS string evaluation (`eval`, `new Function`, string conditions)
### Context Schema Validation
- `context_schema` is stored per `workflow_definition` version (supports schema evolution)
- **Phase 1 (save-time):** validate the schema structure when an admin saves the DSL
- **Phase 2 (transition-time):** validate context values against the schema before evaluating conditions
- **Required field enforcement:** the schema's `required` array is **enforced strictly** — a missing required field → throw `ValidationException` (ADR-007) → HTTP 422 + field-level errors
- **Failure response:** `{ field: "<context_field>", message: "required field missing" }` — the transition is rejected; the caller must fix the context and retry
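A minimal sketch of the phase-2 check, with a plain `ValidationError` class standing in for the project's `ValidationException` (the names and class are illustrative):

```typescript
// Sketch of transition-time (phase 2) context validation: strictly enforce
// the schema's "required" array and produce field-level 422 errors.
interface ContextSchema {
  type: 'object';
  properties: Record<string, { type: string }>;
  required: string[];
}

// Stand-in for the project's ValidationException (ADR-007)
class ValidationError extends Error {
  constructor(public errors: { field: string; message: string }[]) {
    super('422 Unprocessable Entity');
  }
}

function validateContext(ctx: Record<string, unknown>, schema: ContextSchema): void {
  const errors = schema.required
    .filter((field) => !(field in ctx))
    .map((field) => ({ field, message: 'required field missing' }));
  if (errors.length > 0) throw new ValidationError(errors); // transition rejected
}

const schema: ContextSchema = {
  type: 'object',
  properties: { requiresLegal: { type: 'number' } },
  required: ['requiresLegal'],
};
```

Calling `validateContext({}, schema)` throws with `[{ field: 'requiresLegal', message: 'required field missing' }]`, which maps directly to the 422 failure response shape above.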
### Event Queue
- Queue name: `workflow-events` (dedicated BullMQ queue, separate from `notification-queue`)
- Worker: `WorkflowEventsWorker` — config:
- **concurrency:** 5
- **attempts:** 3 (exponential backoff)
- **dead-letter queue:** `workflow-events-failed` once attempts are exhausted
- **n8n webhook URL:** `N8N_WEBHOOK_URL` env var (in `docker-compose.yml`) — read via `ConfigService`; never hardcode
- **Dead-letter ops:**
- When a job lands in `workflow-events-failed` → trigger the n8n webhook to alert the ops team
- Manual requeue via the **Bull Board UI** (admin panel)
- ❌ No auto-requeue — prevents retry loops when the failure is a permanent bug
- ❌ No Outbox Pattern (polling a DB table) — BullMQ already provides retry, dead-letter, and persistence
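The retry policy (3 attempts with exponential backoff, then dead-letter) implies a delay schedule like the one sketched below. The base delay is an assumption — the ADR fixes the attempt count and backoff type but not the base interval:

```typescript
// Sketch of the worker's retry schedule: attempts = 3 means the original run
// plus 2 retries, each delayed exponentially; after the final failure the job
// moves to the 'workflow-events-failed' dead-letter queue. baseMs is illustrative.
function backoffSchedule(attempts: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < attempts; attempt++) {
    delays.push(baseMs * 2 ** (attempt - 1)); // delay before retry #attempt
  }
  return delays;
}

// 3 attempts → 2 retries, delayed 1s then 2s (assuming baseMs = 1000)
const schedule = backoffSchedule(3, 1000);
```

In BullMQ itself this corresponds to job options of the form `{ attempts: 3, backoff: { type: 'exponential', delay: baseMs } }`.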
---
## Consequences
### Positive
@@ -388,7 +540,8 @@ export class WorkflowEngineService {
- **Complexity:** build a UI Builder for workflow design in the future
- **Learning Curve:** write clear documentation and examples
- **Performance:** Redis cache for `workflow_definitions` — key: `wf:def:{workflow_code}:{version}`, TTL: 1h, invalidated immediately when an admin saves/activates a new DSL
- **Concurrency Conflict:** `ConflictException` surfaces as `BusinessException` (ADR-007) → 409 to the frontend; the user retries manually — no auto-retry
- **Debugging:** build a workflow visualization tool
- **Testing:** write comprehensive unit tests for the engine
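The definition-cache mitigation (1 h TTL plus immediate invalidation on save/activate) can be sketched in-memory. Redis is the real backing store; the `Map` and the explicit `now` parameter are illustrative simplifications:

```typescript
// Sketch of TTL-plus-event invalidation for cached workflow definitions.
// Production uses Redis with keys of the form wf:def:{workflow_code}:{version}.
const TTL_MS = 60 * 60 * 1000; // 1 hour

class DefinitionCache {
  private store = new Map<string, { value: unknown; expiresAt: number }>();

  key(code: string, version: number): string {
    return `wf:def:${code}:${version}`;
  }
  set(code: string, version: number, value: unknown, now: number): void {
    this.store.set(this.key(code, version), { value, expiresAt: now + TTL_MS });
  }
  get(code: string, version: number, now: number): unknown {
    const hit = this.store.get(this.key(code, version));
    if (!hit || hit.expiresAt <= now) return undefined; // expired → cache miss
    return hit.value;
  }
  // Event invalidation: an admin saved/activated a new DSL for this definition
  invalidate(code: string, version: number): void {
    this.store.delete(this.key(code, version));
  }
}

const cache = new DefinitionCache();
cache.set('CORRESPONDENCE_ROUTING', 1, { description: 'cached def' }, 0);
```

The two eviction paths are independent: TTL bounds staleness even if the invalidation event is lost, while the event gives immediate consistency on admin changes.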
@@ -400,6 +553,9 @@ export class WorkflowEngineService {
- [Backend Guidelines](../05-Engineering-Guidelines/05-02-backend-guidelines.md#workflow-engine-integration) - Unified Workflow Engine
- [Unified Workflow Requirements](../01-Requirements/01-03-modules/01-03-06-unified-workflow.md) - Unified Workflow Specification
- [ADR-007 Error Handling](./ADR-007-error-handling-strategy.md) - `BusinessException` + 409 conflict response pattern
- [ADR-008 Notifications](./ADR-008-email-notification-strategy.md) - BullMQ `workflow-events` queue pattern
- [ADR-016 Security](./ADR-016-security-authentication.md) - `system.manage_all` required for DSL authoring
---
@@ -409,6 +565,8 @@ export class WorkflowEngineService {
- The admin UI for workflow management will be developed in Phase 2
- A migration tool is required for workflow definition changes
- Consider BPMN 2.0 notation in the future (if a visual workflow designer is needed)
- **Required env vars:** `N8N_WEBHOOK_URL` must be set in `docker-compose.yml` for every environment before deploying
- **Bull Board UI:** install `@bull-board/nestjs` for visibility into the `workflow-events` and `workflow-events-failed` queues
---
@@ -429,12 +587,16 @@ export class WorkflowEngineService {
| Version | Date | Changes | Status |
|---------|------|---------|--------|
| 1.0 | 2026-02-24 | Initial version - DSL-based Unified Workflow Engine | ✅ Active |
| 1.1 | 2026-05-02 | Production hardening: JSON Logic condition engine, optimistic lock concurrency, BullMQ dedicated queue, context schema two-phase validation, async-only auto-action rule | ✅ Active |
---
## Related ADRs
- [ADR-002: Document Numbering Strategy](./ADR-002-document-numbering-strategy.md) - the Workflow Engine triggers Document Number Generation; Redis Redlock is for numbering only
- [ADR-007: Error Handling Strategy](./ADR-007-error-handling-strategy.md) - `ConflictException``BusinessException` → 409 pattern
- [ADR-008: Email/Notification Strategy](./ADR-008-email-notification-strategy.md) - BullMQ `workflow-events` dedicated queue
- [ADR-016: Security & Authentication](./ADR-016-security-authentication.md) - `system.manage_all` RBAC guard for DSL authoring
- [RBAC Matrix](../01-Requirements/01-02-business-rules/01-02-01-rbac-matrix.md) - permission guards on workflow transitions
---