Dev Journal

2026-05-22 — Cross-Feature Intelligence Layer

This feature closes a gap in the pipeline’s memory: each /cspec run previously started from a near-blank slate despite the pipeline accumulating rich historical data across features. Six data sources existed (deferred review findings, Devil’s Advocate reports, override patterns, lens recommendations, debug investigations, and phase effectiveness history) but none were surfaced to the spec agent during brainstorm. The cross-feature intelligence layer aggregates them into a single JSON brief that /cspec reads as advisory context.

The core implementation is scripts/cross-feature-intel.sh (876 lines), a deterministic bash script that reads each source, extracts structured entries with per-source parsing logic, applies recency filtering (90-day exclusion), optional file-scope filtering (for debug investigations that have file-scoped data), and caps the output at 30 entries with a per-section minimum guarantee (each non-empty section retains at least 1 entry even at the cap). The six extraction routines each handle a different data format: deferred findings use jq to filter status: "open" entries from the JSON backlog; Devil’s Advocate reports parse ## DA-NNN: markdown headings with severity extracted from either inline bold or subsection formats; overrides are collapsed by reason hash (sha256 first 8 chars) across multiple run files; lens recommendations are collapsed by name with a promotion_candidate flag at count >= 3; debug investigations extract Root Cause text from markdown with file refs matched via intentionally narrow project-conventional path regex; phase effectiveness collapses post-merge bugs by the phase that should have caught them. The script follows PAT-003 (phase-transition script conventions): lives in scripts/, sources lib.sh, accepts CLI arguments, outputs to stdout, exits 0 always.

The /cspec integration adds Step 0a after the initial brainstorm exchange. Once the user has described their feature, /cspec invokes the script with --scope set to the feature’s likely affected files and presents a 3-5 entry summary framed as context. The design decision that required the most thought was DD-004: whether to use an <UNTRUSTED_RESEARCH_BRIEF> fence (like TB-007 for external web content) or a prose anti-anchoring directive. The intelligence brief contains internal project data that was human-reviewed at creation time – the risk is cognitive anchoring (over-weighting historical patterns) rather than prompt injection. The anti-anchoring directive includes calibration examples (weight-when: same files, 3+ recurrence, security-related; dismiss-when: different module, near 90-day boundary, irrelevant pattern) following the PMB-007 lesson that uncalibrated directives cause agents to default to lowest-friction interpretation. PRH-003 prevents the amplification vector where brief content gets interpolated into spec rules. This asymmetry with TB-007 is documented in ARCHITECTURE.md TB-003’s “Mitigation variant” subsection.

The /cstatus integration adds 3-state intelligence health reporting: no data (fresh projects with no pipeline history), stale (brief older than 7 days with remediation guidance), and current (brief age and section entry counts). The dormancy pattern (PAT-019) handles pre-upgrade projects where the script doesn’t exist yet – no intelligence health section appears at all. The feature introduced ABS-037 for the brief’s sole-writer contract, with /cspec as read-only consumer and /cstatus as metadata reader. SFG protection was intentionally omitted for the brief itself (advisory and regenerable from source artifacts), though the script is protected by the existing scripts/*.sh glob.

2026-05-18 — DA-002 Debt Sprint (Workflow-Advance Decomposition)

This feature addresses the single largest file in the project: hooks/workflow-advance.sh at 1,368 lines with 23 command functions. Every feature that adds a phase-transition gate touches this file, and it was growing at roughly 50 lines per feature. The Devil’s Advocate assessment (DA-002) flagged it as a maintainability risk for a single-maintainer project. The solution decomposes it into a thin dispatcher that sources three module files from scripts/wf/.

The decomposition follows the DD-002 design decision: modules are sourced, not executed. They share the dispatcher’s variable scope (REPO_ROOT, CONFIG_FILE, ARTIFACTS_DIR, etc.) and all helper functions. This avoids parameter passing and keeps the same runtime model – callers of workflow-advance.sh see identical behavior. The dispatcher sets SCRIPT_DIR (DD-006) before sourcing any module, and modules use $SCRIPT_DIR instead of BASH_SOURCE[0] for path resolution. This is critical because BASH_SOURCE[0] inside a sourced module resolves to the module file, not the dispatcher – every relative path in the original monolith would break without this fix.

The three modules are grouped by responsibility (DD-001): transitions.sh holds phase transition commands (review, tests, impl, qa, fix, done, verify, documented, audit-start, audit-done, etc.), utility.sh holds operational commands (init, reset, override, status, status-all, diagnose, help), and metadata.sh holds state modification commands (set-intensity, resolve-drift, spec-update). The grouping keeps each file under 500 lines while avoiding the overhead of one-file-per-command (23 files would be excessive). Helper functions used across modules remain in the dispatcher – moving them to lib.sh would require parameterizing dispatcher-local variables for no benefit (DD-003).

The second major change replaces the hardcoded test command in workflow-config.json. The old command was a 3,372-character string manually listing 86 test file names – each new test required editing this string (AP-024 class, the same bug class that PMB-003 caught in setup). The new command uses for f in tests/test-*.sh with an explicit test-helpers.sh exclusion and echoes each filename before execution (DD-004). This required renaming tests/test.sh to tests/test-core.sh (the old name didn’t match the test-*.sh glob) and removing its inline invocations of other test files (each test now runs independently via the glob, eliminating double-execution). The rename cascaded through 12+ test files that contained registration checks verifying their own inclusion in workflow-config.json, ci.yml, and the old test.sh – all updated to check glob discoverability instead. CI’s ci.yml was updated in parallel.

Infrastructure updates: setup now creates the scripts/wf/ subdirectory and installs module files to .correctless/scripts/wf/ with manifest tracking. sync.sh propagates scripts/wf/*.sh to correctless/scripts/wf/. hooks/sensitive-file-guard.sh DEFAULTS include scripts/wf/ to prevent LLM agents from modifying module files directly. Four drift debt items (DRIFT-001, 003, 004, 008) were triaged – two resolved (the underlying concerns are now structurally addressed by other features), two wont-fix (the original proposed fixes are superseded by phase separation and per-round diff review). A drift debt cadence check was added to /cspec Step 0: if 2+ items are open, it emits an advisory before brainstorm.

2026-05-15 — Dashboard Visual Redesign

This feature is a complete visual and UX overhaul of the project dashboard generated by scripts/build-dashboard.sh. The data collection pipeline (bash Steps 0-13) is unchanged – only the HTML/CSS/JS rendering layer (Step 15) was rewritten. Three files changed: the source script, the distribution copy, and the test file (46 new tests for redesign-specific assertions).

The visual identity moves away from the generic GitHub-like palette. Custom fonts (DM Sans for body text, DM Serif Display for headings) are loaded from Google Fonts via a <link> tag with an onerror handler that resets CSS variables --font-body and --font-display to system fonts if the CDN fails. A placeholder SRI hash is present on the font link but is not functional – Google Fonts returns different CSS per user-agent, making static SRI impractical. The onerror fallback is the real safety net, and this tradeoff is documented in QA-001. The accent color shifts from blue (#4361ee / #58a6ff) to warm amber/gold (#c8842d in light mode, #dba14a in dark). Both light and dark modes have distinct, polished color palettes defined through CSS variables – light mode uses warm off-white backgrounds (#faf8f5) while dark mode uses deep purple-tinted darks (#121018).

The layout system changes from flat <h2>-separated sections to a card-based hierarchy. Three card CSS classes (card, section-card, health-verdict) provide different levels of visual containment with box-shadow, border-radius, and border properties. A new .value-narrative section sits near the top of the Metrics view (before Quality Trajectory), prominently displaying the total findings caught pre-merge as a large stat number, escape metrics when available, and a pipeline phase distribution breakdown. This addresses R-002’s goal of making the dashboard’s value immediately obvious to a first-time viewer. The Artifact Browser retains its spec-centric structure with search, status indicators, content tabs, and right panel – the redesign updates its typography and card styling to match the new visual system.

2026-05-14 — Project Dashboard UI

This feature replaces scripts/generate-dashboard.sh with a proper skill (/cdashboard) backed by scripts/build-dashboard.sh. The old script generated a flat HTML dashboard with metrics sections; the new version adds a second view — an Artifact Browser that lets users browse specs, verifications, review findings, research briefs, architecture docs, QA findings, and audit history as rendered markdown directly in the dashboard.

The script collects artifact data by globbing .correctless/ directories (specs, verification, artifacts, findings) and inlines everything as a JSON block inside a <script type="application/json"> tag. The browser-side JavaScript renders markdown using marked.js v14.0.0 with DOMPurify v3.2.4 for sanitization — both loaded from CDN with SRI hashes. This addresses TB-003 (LLM-generated content rendered as HTML), since artifact markdown files contain prose written by LLM agents that could include script tags or event handlers. The </script> injection vector is closed by escaping all </ sequences as <\/ in the inlined JSON before embedding.

The R-007 migration was the most involved part: deleting the old script from source, distribution, and installed locations, then updating all references across ARCHITECTURE.md (ABS-026 consumer list), cmetrics SKILL.md, session-cost tests, sync.sh, FEATURES.md, CLAUDE.md, AGENT_CONTEXT.md, and six test files with hardcoded count assertions. The skill itself is minimal — a 34-line SKILL.md that invokes the bash script and handles the passthrough fallback when artifact reading fails. ABS-032 documents the sole-writer contract. The output directory .correctless/dashboard/ is gitignored since the dashboard is regenerated on demand.

2026-05-09 — UX Review Lens

4 of 9 post-merge bugs (PMB-004 path hallucination, PMB-006 fork stalling, PMB-008 lost findings, PMB-009 silent truncation) are fundamentally UX failures – silent breakage, missing recovery paths, lost output – that no existing review lens would have caught. QA checks correctness, Hacker checks security, Performance checks speed, but nothing asks “does this work from the user’s perspective?” This feature adds UX review agents to all four quality review integration points.

The implementation adds a UX agent to /creview-spec (6th adversarial agent, spawned at high+ intensity), /creview (first-ever parallel subagent in the single-pass review), /ctdd mini-audit (5th specialist agent alongside cross-component, hostile-input, resource-bounds, upgrade-compatibility), and /caudit (new UX preset with 5 specialized roles). The UX agent at each integration point evaluates through four sub-lenses: new-user (path discovery, zero-state behavior, first-run errors), upgrade (silent behavioral changes, migration path clarity, config backward compatibility), offboarding (residual state, orphaned artifacts, graceful degradation), and recovery (error messages on failure, resumption paths, state consistency, output persistence). The /caudit UX preset adds a fifth sub-lens – cross-session continuity – that checks for workflow state persistence across sessions, conversation context dependency, and fresh-session artifact path resolution. This fifth sub-lens is scoped to /caudit because cross-session continuity is only meaningfully testable through multi-session audit scenarios.

Each UX agent prompt includes PMB calibration examples (at least 3 of PMB-004, PMB-006, PMB-008, PMB-009) as concrete instances of what BLOCKING UX failures look like per AP-028 (uncalibrated severity gate). The fail-open design (R-008) means UX agent failures never gate progression – consistent with all other review lenses. Output format varies by integration point: UX-xxx IDs in /creview-spec and /creview, MA-xxx with ux-review LENS in /ctdd, confidence-tiered bounty format in /caudit.

2026-05-08 — Pipeline Completeness Verification

PMB-009 exposed a silent truncation bug in /cauto: the pipeline stopped after TDD+simplify (2 of 7 steps at high intensity) when the Skill tool’s forked execution exhausted context capacity. The Skill tool reported “completed” with no error – workflow state showed done instead of documented. The pipeline is resumable on re-invocation, but the silent truncation breaks the “run to completion” assumption.

The fix adds a two-layer verification mechanism. First, /cauto writes a pipeline manifest (.correctless/artifacts/pipeline-manifest-{branch_slug}.json) as its very first action after the phase gate. The manifest records expected_steps (canonical step list based on intensity: standard gets 6 steps, high+ gets 7 including cupdate-arch), completed_steps (updated after each step), and status (in_progress vs completed). On resumption, /cauto reads the existing manifest and reports which steps were missed. Second, /cstatus checks for incomplete manifests and reports them as a dormant check (per PAT-019 – skips silently when no manifest exists, fires only when one is found incomplete). The canonical step enum (ctdd, simplify, cverify, cupdate-arch, cdocs, consolidation, pr) is defined in ABS-031 and verified by structural tests.

2026-05-08 — Escape Metrics in Audit Pipeline

Added escape rate tracking to the /caudit convergence pipeline. After each audit round, the pipeline now computes and logs the escape rate – findings from round N+1 that should have been caught in round N. This feeds into /cmetrics as a quality signal: a declining escape rate across rounds means agents are getting better at catching issues on the first pass. The metric is advisory and never gates progression.

2026-05-08 — Autonomous Skill Contract

The /cauto pipeline previously stalled at every human decision point – doc approval, architecture entry triage, refactoring confirmation. Each pause broke the pipeline’s execution model and caused PMB-006-class stalls. This feature solves the stalling problem by adding a formal contract for how skills behave when dispatched autonomously.

The core mechanism is an interaction_mode field in every SKILL.md’s YAML frontmatter. The field has three values: autonomous (5 skills like chelp and cmetrics that already run to completion without input), interactive (2 skills – csetup and cspec – that inherently require Socratic human interaction), and hybrid (22 skills that have decision points but can provide sensible defaults). The field is documentation-only – it is NOT parsed by the Claude Code plugin loader (ENV-007 documents the loader only reads name, description, tools, model). Instead, /cauto reads it via the Read tool when planning dispatch, and structural tests verify every skill has it.

Each autonomous and hybrid skill gained a ## Autonomous Defaults section listing decision points with unique IDs (AD-001, AD-002, etc.) and rationale. The interesting design decision was the escalate: always marker for certain decisions that MUST get human input regardless of mode. For hybrid skills with context: fork (cdevadv, cpostmortem, credteam, cverify), these decisions cannot actually reach the human during execution – the fork prevents follow-up input. The deferred escalation mechanism (R-011) resolves this: the skill applies the default, flags escalation_deferred: true, and returns it in structured output. /cauto collects these and surfaces them at pipeline end as a confirmation gate before PR creation (R-013). This means fork+hybrid skills can participate in the autonomous pipeline without architectural changes to fork semantics.

The JSONL artifact (.correctless/artifacts/autonomous-decisions-{branch_slug}.jsonl) follows the ABS-029/audit-record.sh pattern: a dedicated writer script (scripts/autonomous-decision-writer.sh) with subcommands (append/read/path), SFG protection on both the script and the JSONL file, and /cauto as the sole invoker. Skills return decisions in a structured AUTONOMOUS_DECISIONS_START/AUTONOMOUS_DECISIONS_END block; /cauto parses and persists them. The AD-UNLISTED fallback (R-014) handles decision points not listed in a skill’s defaults – they use the first option and get flagged as deferred escalations, making incomplete defaults sections visible rather than silently wrong. The fail-open design (R-005) means that if mode: autonomous is absent from the prompt, skills run interactively – a stall is annoying but safe, while silently applying defaults when the user expects to be asked is worse.

2026-04-25 — Statusline Live Cost

The session cost analysis feature (compute-session-cost.sh) takes ~2 seconds to run – far too slow for the statusline’s 50ms budget. This feature bridges the gap with a background-refresh cache: the statusline reads a lightweight JSON cache file synchronously (<5ms), and when the cache is stale (>30 seconds), spawns compute-session-cost.sh in the background to regenerate it.

The background refresh mechanism uses three defenses against concurrency bugs. First, a lock file (.correctless/artifacts/cost-cache.lock) containing the PID of the background process prevents double spawns – if a second render fires while a computation is running, it sees the lock, checks kill -0, and skips the refresh. Second, a trap EXIT in the background subshell ensures the lock file is cleaned up even if the process dies abnormally. Third, atomic writes via mktemp + mv prevent the statusline from reading a half-written cache file. The lock file is written by the statusline before disown but after & – the $! PID is only available after the background spawn, creating a minimal TOCTOU gap that QA-001 acknowledged as inherent to bash semantics.

Two helpers were extracted to support the display: phase_display_name() converts raw workflow phases (tdd-impl -> GREEN) and was factored out of the existing phase display logic (which used inline case branches), and fmt_cost_nonzero() uses awk to format a decimal only when non-zero. The cost display format – $47.23 ($12.50 in GREEN) – appends to the existing Section 4 content after QA rounds and duration. When cost is zero or the cache doesn’t exist, the cost portion is omitted entirely, keeping the statusline clean during early workflow phases before any cost accrues.

The compute-session-cost.sh extensions (--cache and --phase flags) follow a clean separation: --cache changes the output format (lightweight JSON to stdout instead of full artifact to file), and --phase computes current_phase_cost_usd by filtering the by_phase array. The caller (statusline background subshell) handles file placement, maintaining the script’s single-responsibility as a computation engine per ABS-026 (cost artifact contract).

2026-04-22 — Skill Path Discovery

PMB-004 surfaced a class of bug where skills reference workflow artifacts by concept (“Read the spec artifact”) without specifying how to discover the file path. This works on the Correctless repo itself because conversation context from a preceding /cspec run carries the path forward. On other projects in fresh sessions, the agent hallucinates paths – /creview-spec tried three wrong locations before giving up. The fix is straightforward: each skill now calls workflow-advance.sh status and reads the Spec: line, matching the pattern already used by /creview and /ctdd.

Four skills were fixed: /creview-spec (step 2 replaced entirely), /cverify (removed the vague “from workflow state or .correctless/specs/” fallback), /cpostmortem (added workflow state lookup with a .correctless/specs/ fallback for post-merge postmortems), and /csummary (added workflow-advance.sh status call to replace state file reading). The changes are text-only – prompt edits, not code. The distribution copies were synced via sync.sh.

The structural guard in test-architecture-drift.sh is the more interesting contribution. It maintains two explicit lists – MUST_HAVE_DISCOVERY (8 skills that must have at least one path discovery token) and EXCLUDED_FROM_DISCOVERY (20 skills that don’t need single-spec discovery). Every skill directory that isn’t _shared must appear in exactly one list, or the test fails. This is the same list-based classification pattern used by REG-001 (test registration guard) – a new skill being added to skills/ will fail the drift test until the author decides whether it needs path discovery. The skill_body() helper was extracted to test-helpers.sh (shared harness) since both test-skill-path-discovery.sh and test-architecture-drift.sh need it to strip YAML frontmatter before checking skill content.

AP-025 was added to antipatterns.md documenting the bug class. The Correctless Learnings in CLAUDE.md got a PMB-004 entry. No new architecture patterns were introduced – this feature applies existing conventions (PAT-001 source-to-dist sync, structural guard classification) to a new context.

2026-04-19 — Test Harness Extraction

The 14 newest test files all had the same ~30-line boilerplate block: pass(), fail(), section(), skip(), counter variables, color definitions, preamble (set -uo pipefail, cd to repo root), and summary(). The duplication was a natural consequence of each test file being authored by a fresh TDD agent that couldn’t know about helpers that didn’t exist yet. Once the pattern stabilized across enough files, extraction became purely mechanical.

The interesting part was the variant classification. Not all 14 files duplicated the same subset. Variant A files (8 files like test-carchitect.sh and test-session-cost.sh) had the full boilerplate — complete pass/fail/section/skip functions, counter init, colors, and preamble. Variant B files (test-dev-journal.sh, test-qa-uncertain.sh) had minimal one-liner pass/fail and counters but no section/skip/colors. Variant C files (test-sensitive-file-guard.sh, test-auto-policy.sh, test-allowed-tools-check.sh) never defined pass/fail at all — they used file-specific assert helpers (assert_eq, file_contains) that directly incremented PASS/FAIL. These files only needed the harness for the preamble and counter initialization.

The variable normalization in test-architecture-drift.sh was a minor surprise — it used FAILED_INVS (invariants) instead of the standard FAILED_IDS, an artifact of being written before the naming convention settled. The harness uses FAILED_IDS, so the migration required updating references in both the fail() calls and the summary function.

The registration guard updates (QA-001) were the most consequential QA finding. test-ci-hook-wiring.sh and test-architecture-drift.sh both enumerate test-*.sh files and expect every match to be registered in CI and workflow-config.json. But test-helpers.sh is a sourced helper, not a standalone test — running it directly would just define functions and exit. Both guards now explicitly skip it. This is the same naming-convention tension noted in the QA class fix: test-helpers.sh matches test-*.sh but is semantically different from the test files it serves. A naming convention like helpers-test.sh would avoid this, but changing it now would break the 14 source lines that already reference the path.

2026-04-20 — Session Cost Analysis

The token-tracking PostToolUse hook has been writing zeros for total_cost_usd and token counts since it was introduced. The fields don’t exist in Claude Code’s PostToolUse contract (tracked as #11008). Every /cmetrics dashboard, every calibration entry, every “cost by phase” section has been showing zeros or deriving cost from nothing. This feature replaces the phantom data with real USD cost computed from Claude Code’s session transcripts.

The key insight is that Claude Code already records everything needed — session transcripts in ~/.claude/projects/ contain per-turn model, token counts (input, output, cache write, cache read), and branch context. The challenge is deduplication: streaming produces ~3.14x inflation with multiple JSONL entries per API call sharing the same message.id. Taking the last entry per unique ID (the final streaming response with complete token counts) eliminates the inflation cleanly.

Phase attribution was the most interesting design decision. The script reads the audit trail for phase transitions and assigns each transcript turn to the phase active at its timestamp. Subagents spawned during GREEN that complete during QA are attributed to QA (completion-time, not spawn-time) — spawn-time attribution would require correlating parent tool_use IDs with subagent transcript IDs, adding complexity for marginal accuracy gain. The script always undercounts by the invoking /cdocs session’s cost since it runs before the session ends.

The adversarial review (F-02) caught a significant design flaw: the original spec included a cross-project fallback scan that would search all ~/.claude/projects/ directories for matching cwd patterns. This creates information leakage between projects. The fix was clean — two discovery paths only: candidate derivation from repo root, plus a config override for non-standard layouts.

The pricing validation ($500/M ceiling) catches a likely confusion between per-token and per-million-token values. The 6-decimal precision invariant (total_cost_usd == sum(by_phase) == sum(by_subagent)) ensures the two orthogonal breakdowns account for 100% of cost without floating-point drift.

2026-04-20 — Dashboard Trend Insights

The dashboard started as a data dump — it showed what happened but not whether things are improving. This feature adds four trend sections that answer “is Correctless working?” by transforming raw counts into trajectory views.

The QA Rounds Trend reuses the same horizontal bar visual as Quality Trajectory but maps QA rounds: N from workflow-history.md entries. A declining bar length over time means the workflow is learning — fewer QA rounds needed per feature. The data was already parsed in Step 3; the new section just renders it differently.

Intensity Accuracy reads calibration entries (already parsed in Step 7) and compares recommended_intensity against actual_intensity using an ordinal map. The three buckets (agreed/raised/lowered) surface whether the system’s intensity recommendations match human judgment. With 11 calibration entries, the data is starting to be meaningful — alpha was raised from standard to high, everything else agreed or was lowered.

Override Rate shows per-feature override counts as bars with a one-line mean summary. The data comes from workflow-history.md’s Overrides: N field, which is only present when >0 (per the /cdocs convention). Features with 0 overrides show empty bars. The mean is a simple arithmetic average — useful as a monitoring signal for gate misclassification (AP-023).

Fix Rate reads findings with status fields and computes fixed/total with a percentage bar. The dual degradation (no findings at all vs findings without status fields) matches the spec’s R-006 requirement. The Fix status data not available message catches older qa-findings files that predate the status field.

Section ordering (R-005) was the most constraint-heavy rule — 7 ordering assertions verify the full narrative flow: Project Summary, Quality Trajectory, QA Rounds Trend, Pipeline Phase Distribution, Fix Rate, Antipattern Health, Intensity Accuracy, Override Rate, Cost by Phase, Drift Debt, Dev Journal. The test extracts line numbers via grep -n and compares numerically.

2026-04-19 — Project Dashboard

The dashboard is the first feature that reads across nearly every artifact Correctless produces — workflow history, QA findings, antipatterns, calibration entries, drift debt, token logs, dev journal, overrides, and project config. Building the parser surface in pure bash (awk for markdown, jq for JSON, grep for pattern matching) was the natural choice given the project’s zero-external-dependency stance, but it makes the implementation brittle to format changes. The spec explicitly accepted this risk: if the format changes, the parser breaks visibly (empty sections), not silently (wrong data).

The most interesting section is antipattern dormancy detection. It cross-references AP-xxx IDs against the last 5 qa-findings files to determine whether an antipattern is still firing. Antipatterns with Status: Structurally enforced are marked resolved. This closes the loop on the antipattern lifecycle — you can now see which antipatterns were caught early and stopped recurring because the workflow learned from them.

The HTML generation uses a small vanilla JS DOM builder (h(tag, attrs, ...children)) instead of template literals or string concatenation. This keeps the inline script readable and avoids the escaping nightmares that come with embedding data-derived content in HTML strings. The data is injected as a JSON blob in a <script type="application/json"> tag, parsed once, and rendered entirely client-side.

Dark/light mode via prefers-color-scheme CSS custom properties. Horizontal bars via inline <div> widths. No charting libraries, no CDN links, no fetch calls. The file opens correctly via file:// protocol, which matters because this is a local development tool, not a hosted dashboard.

2026-04-18 — Agent Hook for Internal Import Enforcement

This feature introduces the first agent hook in Correctless, establishing a new hook type alongside the existing bash script hooks. The key insight is that some enforcement checks require LLM reasoning — reading ARCHITECTURE.md, parsing YAML entrypoints, matching glob patterns against import paths — which cannot be done deterministically in a bash script. Claude Code’s agent hook type (type: "agent") solves this by spawning a lightweight sub-agent (Haiku by default) that can read files and reason about the result.

The implementation is a single JSON file at hooks/import-guard.json containing the hook configuration and an embedded prompt. The prompt decomposes the check into six sequential steps: (1) is this a test file? (2) do entrypoints exist in ARCHITECTURE.md? (3) read the test_helpers allow-list, (4) parse entrypoints YAML, (5) check imports against scope globs, (6) decide allow/deny. Each step has a clear early-exit path, and the prompt includes language-aware import patterns for Go, TypeScript/JavaScript, Python, and Rust with explicit allow for unsupported languages.

The most interesting design decision was making the deny reason unconditionally include escalation guidance (“ask the user for guidance”) rather than tracking retry counts. Agent hooks are stateless — they have no persistent state between invocations. The original spec (R-012) described retry counting, but review correctly identified that this is impossible in the agent hook model. The unconditional guidance means every deny is self-documenting without requiring the hook to track anything.

The setup script was extended with a second hook discovery loop for hooks/*.json files. This loop reads hook_type, type, matcher, prompt, and timeout from the JSON file and constructs the settings.json entry differently from command hooks — {type: "agent", prompt: ..., timeout: ...} instead of {type: "command", command: ..., timeout_ms: ...}. The idempotency logic checks for existing agent hooks and updates matcher/prompt/timeout on re-run without duplicating entries. Sync.sh was updated with JSON-specific propagation and bidirectional staleness detection (both source-has-but-dist-missing and dist-has-but-source-missing cases).

The workflow.test_helpers config field is the escape hatch for false positives. Test helper packages (e.g., pkg/handlers/testutil/) that live within an entrypoint’s scope but are legitimately imported in tests can be allow-listed via glob patterns. This was the primary risk mitigation from the spec — agent hooks return {ok: false} with no override mechanism, so false positives are hard walls. The allow-list plus the deny reason’s explicit guidance on how to add to it makes the hook self-correcting.

2026-04-19 — Upgrade Compatibility Lens

PMB-003 exposed a gap in the review pipeline: setup had a hardcoded 2-file script list that silently went stale across 5 PRs, leaving 16 of 18 scripts uninstalled on user projects. No pipeline phase ever asked “what happens to an existing user who upgrades?” This feature closes that gap by adding the upgrade compatibility question to both /creview-spec (spec-level) and /ctdd (implementation-level).

The implementation is entirely prompt-level. In /creview-spec (skills/creview-spec/SKILL.md), a 5th adversarial agent – the Upgrade Compatibility Auditor – was added to the high+ intensity agent roster. It receives the same self-assessment input as the other four agents but examines the spec through a 5-item upgrade checklist: (1) does the spec account for installation of new scripts/hooks, and is the mechanism complete (glob vs hardcoded list)? (2) do new config keys have defaults? (3) do schema changes address backward compatibility? (4) do removals include migration paths? (5) do features depending on new artifacts degrade gracefully? At standard intensity (which only spawns 3 agents), the upgrade agent is not spawned – upgrade issues are primarily implementation-level bugs, making the mini-audit’s code-level check more reliable at catching them.

In /ctdd (skills/ctdd/SKILL.md), a 4th mini-audit specialist was added alongside cross-component, hostile-input, and resource-bounds. Unlike the review agent, this one runs at all intensity levels because it examines the actual git diff, not the spec. The same 5-item checklist is used but reframed for implementation: “does the install/setup mechanism install all new files?” instead of “does the spec account for installation?” Both prompts reference AP-024 and PMB-003 as concrete examples of the bug class, giving the agent historical context for what upgrade failures look like in this project.

The count updates were the most mechanical part but the most error-prone. In creview-spec: “4 adversarial agents” became “5 adversarial agents” in 6 locations (progress announcement, agent spawning text, intensity tier description, task list, checkpoint phases, agent_role enum). In ctdd: “3 specialist agents” became “4 specialist agents” in the progress announcement, and upgrade-compatibility was added to the LENS enum, agent_role enum, and token tracking. The standard-intensity count in creview-spec stayed at 3 – the upgrade agent is intentionally gated behind high+ there. The full redundancy design (both review and mini-audit ask the same questions) means the upgrade compatibility check fires twice per feature at high intensity: once during spec review (catching design omissions) and once during mini-audit (catching implementation omissions).

2026-04-18 — /carchitect Phase 1: Entrypoint-Aware TDD

Phase 1 closes the gap between /carchitect’s machine-referenceable entrypoints (Phase 0, ABS-023) and /ctdd’s test-writing behavior. Before this feature, the RED phase test agent would read ARCHITECTURE.md but had no specific instruction to use the entrypoints section when writing integration tests. The test audit had no check for tests that bypass entrypoints by importing internal packages directly. PR #70 added Entry/Through/Exit contracts (ABS-024) that tell the test audit what shape a test should have — but nothing told the RED phase agent to write tests through entrypoints from the start.

The implementation is entirely prompt-level changes to skills/ctdd/SKILL.md. Three new paragraphs were added to the RED phase test agent’s blockquoted instructions: one instructing the agent to read entrypoints and match rule scope to entrypoint scope globs before writing integration tests (R-001), one instructing it to read Key Patterns/Layer Conventions/Trust Boundaries and respect layer access constraints (R-002), and one providing a graceful fallback with a No documented entrypoint comment marker when no entrypoints section exists (R-004). The Read context list was updated to emphasize “especially the Entrypoints section and Key Patterns” (R-003).

In the test audit section, a new check 10 (Internal import bypass detection) was added after the existing check 9 (Entry contract verification). Check 10 reads entrypoints from ARCHITECTURE.md, builds a map of scope globs to entrypoint names, and checks each [integration] test file for import statements referencing paths within an entrypoint’s scope. The check is language-aware with patterns for Go (import "pkg/..."), TypeScript/JavaScript (import ... from / require()), Python (from pkg import / import pkg), and Rust (use crate:: / mod). Unsupported languages get an ADVISORY skip. The check explicitly excludes self-imports of the entrypoint itself (R-007) and skips entirely when no entrypoints are documented (R-008). When check 10 and check 9 both fire on the same test, they consolidate into a single finding (R-005).

The test file (tests/test-carchitect-phase1.sh) is structural — it verifies prompt text presence in the SKILL.md file via grep patterns rather than testing LLM behavior. This is the same approach used for the integration-test-contracts tests (check 9). 33 assertions across 9 rules cover the mechanical envelope: required phrases, check descriptions, severity levels, consolidation logic, language patterns, and documentation updates. The ABS-023 consumer description was updated from “transitive consumer” to “direct consumer” to reflect that /ctdd now reads entrypoints directly in both the RED phase and test audit, not just through Entry/Through/Exit contracts.

2026-04-26 — Harness Fingerprint + Model Upgrade Detection

This feature exists to close a class of silent regression that the 4.6 → 4.7 audit (OPUS_4_7_MIGRATION.md) made painfully visible: when Anthropic ships a new model or tweaks harness defaults inside an existing model version, the workflow regresses silently. Three findings surfaced in one audit session, none caught by tests, none surfaced by metrics. The “uncontracted model defaults” antipattern (logged in MEMORY.md as a class) is the underlying issue — Correctless’s correctness model implicitly depends on a single Anthropic version’s behavioral defaults (length caps, parallel-tool-call preferences, anti-defensive code priors, in-context skill inlining), and there was no mechanism to notice when those changed.

The implementation is two bundled mechanisms. First, a deterministic fingerprint in scripts/harness-fingerprint.sh: it computes the literal string "{model_name}|{HARNESS_VERSION}" (no hashing — debuggable by reading the file directly) where HARNESS_VERSION is a manually-bumped integer constant at the top of the script. The maintainer increments this when a behavioral change is observed (heuristic OQ-006: >20% delta in any metric across consecutive same-model runs, or a manually-noticed shift like a 4.7-style audit pattern). The script runs at every /cspec Step -1 via a structural marker <!-- correctless:harness-fingerprint:invocation --> and emits a one-time version_bumped advisory per session, gated by a flag file at .correctless/artifacts/harness-notified-{session-id}.flag. Session-id derivation lives in lib.sh as get_current_session_id() (cross-platform: ps -o lstart=/proc/{pid}/stat → PID-only fallback) — single source of truth, no per-skill drift permitted. A new locked_update_file() helper in lib.sh mirrors locked_update_state for arbitrary file paths (BND-002 / ME-4 round-2).

Second, a /cmodelupgrade skill (the project’s 29th skill — Analysis category) is the sole writer of .correctless/meta/model-baselines.json. Given the current {model}+{HARNESS_VERSION} key, it reads four data sources per feature — intensity-calibration.json for qa_rounds + total_tokens, cost-*.json glob (via ABS-026 — never a hardcoded slug list, mitigates AP-024 and PMB-003), workflow-state-*.json for phase_count, and the baseline file itself. Aggregation uses an explicit three-tier bootstrap: exact-match pool (entries tagged with current harness_version) → pre-fingerprint pool (entries from before /cverify recorded the field, used with explicit “pre-fingerprint baseline” label) → no-baseline mode (clear message, exit 0 — never compares against zero, mitigates DA-004 self-referential metrics). The skill spawns no subagents (ME-12) — all aggregation, comparison, and report rendering happens inline in the orchestrator’s context.

Several design decisions are worth surfacing for future modifiers. The v1 spec proposed an LLM probe to introspect distinctive harness substrings; round-2 of /creview-spec rejected it (CR-2) as compounding-uncertain — undefined channel between agent and script, stability uncertainty, a new trust boundary, susceptibility to negation-spoofing. A literal version constant is testable, has no trust boundary, and aligns with how harness changes actually get noticed in practice (a human observes “things feel wrong” within one session). Hashing was dropped (HI-1 round-2): neither the model name nor the version is secret, and the literal key is debuggable. Sole-writer enforcement is structural via hooks/sensitive-file-guard.sh (PRH-002 — directly mitigates AP-022 dead-code-in-security-paths) — the hook blocks Edit/Write AND Bash redirects (>, >>, tee) for the fingerprint file, the baseline file, AND scripts/harness-fingerprint.sh itself. PRH-006 lifecycle scoping addressed CR-1 round-2: the script’s protection activates after the first commit lands so the implementation agent can create it during /ctdd GREEN, then the protection is permanent.

The verification surfaced one acceptable drift item (DRIFT-001): the live harness-fingerprint.json lacks schema_version because it was first written before MA-UC-001’s schema_version fix landed. The writer is now correct for all future writes (and any rewrite triggered by version bump or corruption recovery — both verified by test_ma_uc_001_schema_version); the live file will self-heal on next rewrite per BND-004’s fail-open posture. INV-007/INV-009/INV-014 are intentionally structural-only because their entry path is the Skill tool, which isn’t bash-testable end-to-end (per QA-002 — accepted limitation). All other 21 invariants and 11 prohibitions/boundaries have mechanical tests that would fail on regression. 110 tests pass in tests/test-harness-fingerprint.sh; cross-suite coverage in test-architecture-drift.sh, test-sensitive-file-guard.sh, test-allowed-tools-check.sh, test-scripts-namespace-migration.sh, and test-skill-path-discovery.sh covers the integration points.

2026-04-28 — Harness-Fingerprint R2 Hardening

This feature exists because the R2 audit of the harness-fingerprint R1 fix batch had a 71% defect rate. The R1 round patched specific instances (“this command leaks”, “this redirect bypasses”); each subsequent R2 specialist round found a route around the previous patch. The lesson, recorded as PMB-002 and the autonomous-fix-defect-rate feedback note in MEMORY.md, is that audit-fix rounds are themselves untested code — and when the underlying extractor is enumeration-based, every fix is a one-instance patch that the next round routes around. The remedy is to close the bug class structurally: the extractor must be incapable of “missing” a write command because it never enumerates them in the first place.

Three architectural pieces shipped together. The first is canonicalize_path in scripts/lib.sh — a pure-bash segment-stack walker. It is total over arbitrary byte sequences (INV-001), idempotent (INV-003), produces no //, no . segments, no .. segments on absolute paths, no trailing / (INV-002), recognizes ASCII . only as a path-segment dot (INV-002a — Unicode lookalikes U+2024, U+FF0E, U+2026 pass through as ordinary bytes), performs no shell expansion of glob characters (INV-004), and runs in <50ms on 1024-byte input (INV-012). The function lives in lib.sh because both hooks/sensitive-file-guard.sh and hooks/workflow-gate.sh consume it via ABS-001 — no per-hook reimplementation. INV-001a closes a subtle fail-open class: empty stdout on non-empty input would let the matcher receive an empty target and skip pattern comparison, so the function’s contract explicitly forbids that. Property-based tests (tests/test-canonicalize-path.sh) use a pinned seed (RANDOM=42), 1000 inputs, and a corpus where each of the dangerous bytes (*, ?, [, ], /, ., ` , \t, \n, $, backtick, (, {) appears in at least 50 inputs; failures hex-dump via xxd` for replay.

The second piece is the hooks/sensitive-file-guard.sh refactor. The old _extract_bash_targets did per-command dispatch — a chain of case branches (cp), mv), tee), dd), etc.) trying to enumerate “which Bash commands write.” Every R2 round found a missing branch. The new extractor has no per-command dispatch: the default branch over-extracts every non-flag token as a candidate (INV-006), and _check_file_against_patterns filters via canonical-form match. Redirect operators are detected first — >, >>, 1>, 2>, &> in both whitespace-separated (cmd > file) and inline-attached (cmd>file) forms (INV-007). Process substitution sub-tokenizes a single level (INV-007a). _has_write_pattern was extended to flag interpreter+eval-flag chains (bash -c ..., perl -e ..., python -c ..., /usr/bin/env perl ...) — INV-013, with a regression test in tests/test-workflow-gate.sh confirming workflow-gate.sh consumes the shared function via ABS-001 with no local redefinition (INV-013a). Both target and protected pattern flow through canonicalize_path before reaching the matcher (INV-005, INV-008, PRH-004). At hook source-time, a v1 sentinel probe verifies canonicalize_path is present and behaves correctly — if missing or wrong, the hook exits 2 fail-closed before any policy runs (INV-005a). This closes the partial-upgrade class where lib.sh and the guard could end up out-of-sync mid-deploy. PRH-002 makes the no-per-command-dispatch rule structural: 28 disallowed tokens are enumerated in INV-006a’s structural test as a permanent ban list.

The third piece is the --version flag and VERSION_OVERRIDE env-var removal from scripts/harness-fingerprint.sh. AUTH-R2-001 surfaced a confused-deputy class: the testability flag was the autonomous-bump escape hatch — anything that could pass --version N could also forge a fingerprint. The harden: strip both surfaces from production (PRH-003 / INV-009); HARNESS_VERSION=N becomes the sole production input. Tests now inject specific versions via a feature-specific helper at tests/harness-fingerprint-test-helpers.sh (make_test_harness_script <version> <workdir>) that copies the production script to $workdir/harness-fp-test-XXXXXX.sh (mktemp), substitutes the constant via POSIX sed, validates the substitution, and co-locates a copy of lib.sh so SCRIPT_DIR/lib.sh resolves to the under-test source. Critically, the destination filename does NOT match the protected pattern in DEFAULTS (BND-003 — */scripts/harness-fingerprint.sh is the protected glob, the helper writes to $workdir/harness-fp-test-...sh). Per Finding #8 amendment from /creview-spec, the helper lives in a feature-specific file rather than the shared tests/test-helpers.sh — keeps the test surface for one feature out of the global helper namespace.

The migration shipped as two commits per INV-011. The first removes --version from production (intentionally leaving the test suite red); the second migrates the tests to the helper (restoring green). The split makes the production-security decision and the test-infrastructure decision independently revertable — if the helper approach turns out wrong later, the tests can be reverted without re-adding --version to production (which would re-open AUTH-R2-001). Loud failure during the migration window (red tests with explicit fingerprint-mismatch messages) is the deliberate signal, not a silent degradation.

Supporting wiring includes a new path-scoped rule file .claude/rules/canonicalize-path.md (PAT-017), the second dogfood usage of ABS-009 after PAT-001’s hooks-pretooluse rule. Frontmatter declares paths: [scripts/lib.sh] so the body loads into editing context whenever an agent opens lib.sh; tests/test-architecture-drift.sh enforces the rule-file shape (existence, frontmatter, See-link from ARCHITECTURE.md, in-file pointer comment). The setup script now greps for VERSION_OVERRIDE in the existing scripts/harness-fingerprint.sh before installation; if found (pre-R2 install), it force-reinstalls with a clear notice referencing INV-009/PRH-003 (INV-014). This closes the upgrade-path break Finding #7 surfaced — without setup-side detection, an existing user’s pipeline could end up with the post-R2 hook (which probes the new lib.sh) and a pre-R2 script (which still carries VERSION_OVERRIDE) and silently misbehave. PAT-016 was promoted from AP-024 (PMB-003 — frequency 16 missing scripts across 5 features) per /cspec Step 8 approval as a side benefit of this work; the glob-over-directory rule with mandatory count-match drift test now lives full-body in ARCHITECTURE.md.

Verification ran 533 tests across the 7 directly-affected suites with 0 failures. Every spec rule has at least one targeted test; every property-based test has pinned seeds and explicit failure-replay. The verification report flagged one acceptable smell (.correctless/scripts/antipattern-scan.sh exits 1 when stdout is redirected — pre-existing, not a regression from R2) and no drift. The work merged via squash-commit 081f842 as PR #86; this dev-journal entry was written post-merge against main because the branch was already merged when /cdocs ran.

2026-04-30 — Audit Findings Persistence Contract

Corrective action for PMB-005. The original failure was simple in shape: /caudit’s persistence step was described in skill prose (“write per-round findings to audit-{preset}-{date}-round-{N}.json”), nothing enforced it, and on 2026-04-26 a hacker R1 audit transitioned audit-done cleanly with no round-JSON written. The findings existed only as commit-message prose on the squash-deleted audit branch. /cmetrics then derived “days since last Olympics” from history.md mtime — last touched 2026-04-04 — and reported the audit as 16 days stale when it had run the day before. Same shape as silent-telemetry-failure (token tracking 2026-04-14) and AP-022 (dead-code-in-security-paths 2026-04-26). The advisory step looked completed in the orchestrator’s mental model but was never structurally verified.

The feature lands three coupled mechanisms in a single PR. First, the cmd_audit_done precondition gate at hooks/workflow-advance.sh:788. Reads .audit.type and .started_at from the workflow state file, validates .audit.type against ^[a-z][a-z0-9-]{0,31}$ BEFORE glob expansion (so a corrupted state with .audit.type=* or ../etc cannot escape the findings dir — MA-003 instance fix), then iterates audit-{preset}-*-round-*.json and accepts the first file whose started_at field equals state’s started_at byte-for-byte. Content-based string equality, not mtime — robust to ENV-003 (filesystem mtime unreliable after git checkout/git clone/git rebase, exactly the operations a developer might run mid-audit). The gate honors the existing override sentinel for emergencies (INV-008) and writes an audit-specific log entry (gate: "audit-done", bypass_target: "cmd_audit_done") so /cmetrics’s separate audit-done override counter can flag routine bypasses on this gate without conflating them with generic audit-phase overrides. The remediation message names all three load-bearing facts — the literal string Audit findings missing, the expected glob with the actual preset substituted, and the started_at ISO timestamp from state (INV-001a) — so the user can fix the gap from the message alone without reading the source.

Second, scripts/audit-record.sh — the sole writer (ABS-029, INV-006). PAT-003 phase-transition CLI: lives in scripts/, sources lib.sh, accepts CLI positionals, exits 0 on success non-zero on failure with stderr error messages and stdout’s success format being a single line containing the absolute path of the written file (path=$(audit-record.sh write-round ...) consumable). Two subcommands: write-round <preset> <round> <findings-file>|- and append-history <preset> <summary-file>|-. The script’s _state_file helper reads branch_slug from lib.sh and locates .correctless/artifacts/workflow-state-{slug}.json — deliberately with NO ls -t mtime fallback (MA-001 — picking the most recently modified state file across branches would let the writer attribute one branch’s audit to another branch’s started_at, exactly the cross-branch contamination the gate’s content match exists to prevent). Path construction is isolated from external state per PRH-003: only CLI positional args and the hardcoded .correctless/artifacts/findings/ base directory contribute. Reading state’s started_at for the JSON content is permitted because that’s content, not path. The TTY-stdin guard ([ ! -t 0 ] check on - stdin form) emits a clear error rather than blocking forever in interactive testing. append-history uses >> append-only with flock -w 5 to serialize concurrent writers; on lock timeout it emits a warning to stderr and exits 0 (history append failure does NOT block round-JSON write or gate transition — PRH-004’s non-blocking advisory contract). A trap on EXIT/INT/TERM/HUP cleans up the tmp file used during the atomic write phase (QA-R4-005 fix).

Third, /cmetrics’s multi-signal staleness consumer (INV-005, PRH-005). Replaces the original PMB-005 single-mtime read with max(history.md mtime, latest round-JSON mtime) per preset, with an explicit “no data” label when both signals are absent — never silently zero or “infinite” without the label. The consumer side is intentionally mtime-based and fail-open: ENV-003 says mtime is unreliable post-git-op, but /cmetrics is advisory and a slightly-wrong staleness number does not corrupt workflow state. The gate is the authoritative content-based check; the consumer is the advisory mtime-based reading. Layer separation is intentional and documented in INV-005’s “acknowledged residual risk” note. The same /cmetrics change adds a separate audit-done override counter alongside the generic override counter — routine audit-done overrides are the AP-023 recurrence pattern for this gate specifically and warrant their own counter in the Override Health section.

Sensitive-file-guard protection of the writer script itself follows the 2026-04-26 sole-writer convention from CLAUDE.md (harness-fingerprint precedent). DEFAULTS in hooks/sensitive-file-guard.sh gain scripts/audit-record.sh and .correctless/scripts/audit-record.sh plus the bare basename — the test suite verifies blocks against Edit/Write/MultiEdit AND Bash redirects (>, >>, tee, cat | tee, 2>, &>) targeting both source and install-mirror paths (INV-009, four test functions). This is the AP-022 mitigation pattern applied identically: structural enforcement that the writer script cannot be silently replaced by an autonomous agent, which would make the contract unenforceable without anyone noticing.

The structural-test landscape includes some honest weakness markers. INV-006 (“audit-record.sh is the sole writer”) and PRH-005 (“/cmetrics never derives staleness from a single signal”) are both grep-based on skill prose and acknowledged AP-003-class — they catch the obvious case but produce false negatives on rephrased text and false positives on reads. The spec accepts the limit and pairs each with a load-bearing complementary test: PRH-001’s command-name grep (audit-record.sh write-round) is robust to rephrasing and catches the writer-fanout class for INV-006; the behavioral fixture test test_inv005_max_picks_newer_signal (creates a audit-qa-history.md 30 days old, a round-JSON today, asserts the staleness reading uses today) is the load-bearing complement for PRH-005. The pattern of pairing weak-but-cheap structural with strong behavioral is the same shape PAT-017’s tests use.

The rule cluster has 22 entries — 10 invariants, 5 prohibitions, 2 boundary conditions, 4 environment assumptions, 1 architectural addition. ABS-029 sits between ABS-028 and ## Patterns per the spec’s exact placement directive. AP-026 (advisory-prose artifact-write contract) was added to antipatterns.md with the 2026-04-26 incident as its frequency-1 case study; the “How to catch it” prescribes the four-step pattern this feature dogfooded — declare ABS, gate-enforce at phase transition, structural test, multi-signal consumer. PMB-005 is the postmortem entry. Override count of 3 reflects mid-feature stale .claude/hooks/workflow-gate.sh requiring manual resync from source (a known-class instance of the install-drift problem MA-009 / MA-016 flagged); the source hooks/workflow-gate.sh already had tdd-audit in its allowlist, so the syncs were mechanical resyncs, not policy changes. Class fix for the install-drift class itself (hash-pin / version-pin of installed hooks) is deferred to a follow-up — out of scope for a feature focused on findings persistence.

2026-05-06 — carchitect Phase 3: Architecture Adherence Auditor

Phase 3 of the /carchitect roadmap closes the loop between architecture documentation and auditing. Phase 0 reverse-engineered the codebase into a structured ARCHITECTURE.md. Phase 1 made the TDD agent read entrypoints from that document. Phase 2 made the spec agent architecture-aware. Phase 3 makes the auditor architecture-aware — /caudit now spawns an Architecture Adherence Checker agent in every preset that mechanically verifies the codebase against the documented PAT-xxx, ABS-xxx, and TB-xxx entries.

The implementation is entirely prompt-level — a new agent prompt template in skills/caudit/SKILL.md with a corresponding row added to each of the three preset agent tables (QA, Hacker, Performance). Each preset gives the agent a different hostile lens: QA gets “Every documented pattern is violated somewhere,” Hacker gets “Every trust boundary has an unguarded crossing,” and Performance gets “Every layer convention hides a performance shortcut.” The agent’s four check types map directly to the /carchitect roadmap’s four planned capabilities: pattern compliance (layer convention adherence), abstraction invariant checking (dependency direction violations), trust boundary enforcement (anti-pattern presence), and undocumented pattern detection (architecture drift). The last type is informational — it surfaces conventions appearing in 3+ files without a PAT-xxx entry, candidates for /cupdate-arch to formalize.

Three edge-case behaviors are handled by prompt instructions. The dormant-signal fallback (R-004) instructs the agent to emit zero findings and a “skipped” message when ARCHITECTURE.md is missing, has placeholder markers, or has no PAT/ABS/TB entries — it never infers architecture, that is /carchitect’s job. The staleness warning (R-005) uses git log -1 --format='%ai' to compare ARCHITECTURE.md’s last commit date against the most recent source commit; a 30-day gap triggers a single SUSPICIOUS-tier advisory. The exception handling (R-003) instructs the agent to recognize TB-xxx sub-entries (the TB-NNNx pattern where NNN matches the parent and x is a lowercase letter) as documented scoped exceptions, avoiding false positive submissions for intentional deviations like TB-001a or TB-004c.

The architecture_ref field (R-006) is the feature’s contribution to the findings data model. Each finding from this agent carries the specific PAT-xxx, ABS-xxx, or TB-xxx identifier that was violated (or null for undocumented-pattern findings). This field is additive — existing findings without it are valid. The triage agent uses it for deduplication (same entry violated in the same file = same finding regardless of description text). The Regression Hunter (R-010) was updated to read architecture_ref from prior round-JSON files for recurring architecture violation detection, with graceful absence handling for prior runs that predate the field.

The testing approach is keyword-presence (AP-003 class) because all rules are prompt-level instructions. The 48 tests in test-carchitect-phase3.sh verify that the instruction text is present in the SKILL.md prompt for each rule — including the hostile lens framing per preset, the dormant fallback text, the staleness warning mechanism, the TB sub-entry exception handling, the architecture_ref field in the JSON schema example, the four check type descriptions, and the read-only tool access constraint. The source-to-dist sync (PAT-001) is verified by the existing SYNC-001 assertion. This is the standard testing limitation for prompt-level skill modifications — the tests verify instruction presence, not that the LLM follows those instructions at runtime.

2026-05-22 — Review-Driven Mini-Audit Lenses

This feature closes a knowledge gap between Correctless’s review and TDD phases. Prior to this change, the mini-audit spawned six fixed adversarial lenses on every feature regardless of its risk profile — a payments feature got the same “resource bounds” lens as a documentation change. Meanwhile, /creview-spec and /creview deeply analyzed each feature’s specific risks, but that analysis evaporated between phases. The bridge is a structured artifact: review agents write lens recommendations, the mini-audit consumes them, and outcomes are tracked for auditability.

The implementation touches five skill files (skills/ctdd/SKILL.md, skills/creview-spec/SKILL.md, skills/creview/SKILL.md, skills/cmetrics/SKILL.md, skills/cwtf/SKILL.md) and one workflow module (scripts/wf/transitions.sh). The core mechanism is prompt-level: review skills are instructed to write a lens-recommendations-{branch_slug}.json artifact after synthesis, and /ctdd’s mini-audit section is extended to read that artifact, select up to 2 recommended lenses within an 8-agent budget, and instantiate them via a custom lens agent template. The template uses the UNTRUSTED_RECOMMENDATION fence pattern (same as fix-diff-reviewer’s UNTRUSTED_DIFF fence) to wrap the review-generated focus areas and severity guidance — ensuring the custom lens agent treats them as directional guidance, not authoritative instructions. This is the TB-003 / TB-005 mitigation pattern applied to a new trust boundary crossing: LLM-generated review findings flowing into mini-audit agent context.

The design has three key constraints. First, recommended lenses are additive — they never displace the 6 default lenses, especially the two core lenses (hostile-input, cross-component) that catch universal bug classes (PRH-001). Second, review agents write structured recommendations (name, focus areas, severity guidance), not full agent system prompts — the mini-audit owns prompt construction (PRH-002). This prevents review agents from bypassing the mini-audit’s severity calibration, output format contract, and fail-open behavior. Third, the recommendation artifact never gates any pipeline phase transition (PRH-003). This is essential because standard-intensity workflows may not run /creview-spec at all — gating on recommendations would break those workflows.

The artifact schema (ABS-036) follows established patterns: branch-scoped by filename (PAT-004), dormant degradation when absent (PAT-019), gitignored under .correctless/artifacts/. The cmd_done gate in scripts/wf/transitions.sh emits a non-blocking warning when the artifact exists but has no outcomes field — a warning, not a gate, consistent with PRH-003. The LENS field in qa-findings JSON is now an open enum, accepting both the 6 fixed lens values and any recommended lens name. This avoids the cascading test updates that would follow each time a review recommends a novel lens concept. The priority heuristic for selecting which 2 of N recommended lenses to run (CRITICAL/HIGH findings first, then source agent diversity) ensures the most important lenses are chosen when the budget is exceeded. Unselected lenses are logged with ran: false and failure_reason: "budget exceeded" in outcomes for full auditability.

Testing follows the standard keyword-presence approach (AP-003 class) for prompt-level rules, with 80 assertions across 19 spec rules in tests/test-review-driven-lenses.sh. The test file also verifies structural properties: the allowed-tools update in /creview (INV-011), the LENS enum extension in qa-findings schema (INV-012), the non-blocking warning in scripts/wf/transitions.sh (INV-006), and the ABS-036 entry in ARCHITECTURE.md. Two existing test files were updated to accommodate the wording change from “spawns 6 specialist agents” to “spawns the 6 default specialist agents” in the mini-audit progress announcement.

2026-05-23 — Review Intelligence Consumer

This feature completes the second consumer integration for the cross-feature intelligence brief (ABS-037). The parent feature (cross-feature-intelligence) built the aggregation script and wired /cspec as the first consumer; this feature extends /creview-spec and /creview to also consume the brief during their Historical Pattern Integration/Findings sections. The core design decision is the separation between reading and writing: review skills read the brief file directly via jq (no script invocation), preserving the invariant that only /cspec triggers regeneration and occurrence count increments. Without this separation, a single feature pipeline (/cspec -> /creview-spec -> /creview) would trigger three regeneration cycles, crossing the 3-occurrence threshold within one pipeline run and defeating the feedback loop dampener entirely.

The implementation touches both review skill SKILL.md files with identical Intelligence Brief Integration sections. Each section includes an anti-anchoring directive adapted for the review context (distinct from /cspec’s brainstorm examples), a jq command with client-side occurrences >= 3 filtering, and dormant degradation per PAT-019. The Bash(*cross-feature-intel*) allowed-tools pattern was added to both skills’ frontmatter to enable the jq read. The critical structural constraint is INV-003/PRH-001: the 6 adversarial agents in /creview-spec and the single-pass agent in /creview must never see brief data. Only the orchestrator reads the brief during synthesis, preserving the unanchored adversarial analysis that is the review’s primary value. Tests verify this by grepping agent definition files for cross-feature-intel references.

The script (scripts/cross-feature-intel.sh) gains occurrence tracking machinery. On each regeneration: existing entries get their occurrences field incremented by 1, new entries start at 1, and entries that leave the brief (filtered out by scope) have their count preserved in a _dormant_counts metadata section for future re-appearance. Pre-occurrence-tracking entries (without an occurrences field) are treated as 0, so the first run seeds them at 1 — a conservative default that means the dampener works correctly from day one. The _dormant_counts section is capped at 100 entries with alphabetical eviction (an approximation — age tracking was considered but deferred for v1). The atomic write uses an echo+tmp+mv pattern rather than locked_update_file() from lib.sh because the script writes complete JSON from scratch rather than applying a jq filter to existing content — a distinction acknowledged in QA-001.

The --min-occurrences N flag provides script-side filtering for stdout output only: entries below the threshold are excluded from stdout but their occurrence counts are always tracked in the on-disk file. This flag exists for potential future consumers that want filtered output without implementing their own jq filter, but the current review skill integration uses client-side jq filtering directly (INV-002). ABS-037 was updated from “idempotent” to “stateful” and its consumer list now includes both review skills with a note explaining they are pure consumers, not regeneration triggers. TB-003’s mitigation variant text was updated to list /creview-spec and /creview as anti-anchoring directive consumers alongside /cspec.

Two smaller additions round out the feature. /cstatus gains threshold proximity reporting — when the brief exists, it reports how many entries are at each occurrence count below the threshold (e.g., “5 entries at 2/3 occurrences, 3 entries at 1/3”), providing diagnostic visibility for why intelligence is not surfacing in reviews. Review findings artifacts gain an Intelligence brief: metadata line recording consumption status (“consumed” vs “dormant”), providing a persistent record distinguishing “intelligence was unavailable” from “intelligence found nothing relevant.” The 58 tests in tests/test-review-intel-consumer.sh cover all 16 spec rules with keyword-presence and behavioral tests, including boundary conditions for first-ever generation (BND-002), all-below-threshold (BND-001), and entry leave/re-enter via _dormant_counts (BND-003) with corruption handling.

2026-05-24 — Documentation and Artifact Pruning Skill

This feature adds /cprune, the first maintenance-oriented skill in the project. After 71 features and 57 days of development, documentation artifacts accumulate without any removal mechanism – ARCHITECTURE.md has 37 ABS entries, antipatterns.md has 31 AP entries, and there are 373 artifact files in .correctless/artifacts/. When referenced files are deleted (via refactoring, feature removal, or branch cleanup), the entries that reference them become context-token waste and anchor agents on outdated information. /cprune addresses this with a scanner-plus-orchestrator architecture that detects staleness candidates mechanically and handles disposition through two modes.

The core mechanism is scripts/prune-scan.sh (777 lines), a standalone bash scanner that accepts --category and --base flags and outputs a JSON array of staleness candidates. The scanner covers 9 categories: architecture entries (ABS/PAT/TB/ENV with all-dead file references), antipatterns (AP-xxx with all-dead test/script references), CLAUDE.md learnings (feature-specific entries with all-dead references), orphaned artifacts (files for branches that no longer exist), stale deferred findings (open findings whose source review artifact was deleted), AGENT_CONTEXT.md count drift (stated vs actual counts), cross-reference consistency (stale Enforced-at paths), completed specs (merged 30+ days ago), and drift debt (resolved/wont-fix entries older than 90 days). The scanner sources scripts/lib.sh (ABS-001) for branch_slug() and shared utilities, and it uses a deterministic extraction approach – backtick paths, Enforced at fields, Test fields, See-links, and path patterns in Violated when fields. The key design decision for architecture entry detection is the “all-dead” criterion: an entry is only a staleness candidate when ALL extracted file paths are dead. Entries with at least one live reference are never candidates (PRH-003). Class-level antipatterns and conventions/postmortems in CLAUDE.md are excluded from staleness detection regardless of file reference status – the class transcends the instance.

The skill definition at skills/cprune/SKILL.md orchestrates the scanner with two execution modes. Autonomous mode (invoked by /cauto via mode: autonomous in the prompt) auto-executes only low-risk actions: orphaned artifact cleanup, AGENT_CONTEXT.md count corrections, resolved drift-debt removal (90+ days), and spec archiving (90+ days post-merge). It skips categories where >50% of entries are flagged (BND-002 safety valve – this typically indicates a major refactor, not staleness) and entirely excludes CLAUDE.md (PRH-002 – too high-risk for autonomous editing). Interactive mode presents all candidates in a formatted report with per-category disposition options (execute all, review individually, skip). The archive-not-delete design (DD-001/INV-004) ensures documentation entries are moved to dedicated archive files rather than deleted: .correctless/ARCHITECTURE_DEPRECATED.md for architecture entries, .correctless/antipatterns-archived.md for antipatterns, .correctless/CLAUDE_LEARNINGS_ARCHIVED.md for CLAUDE.md learnings. Archived entries retain their original IDs, and the archive write must complete before the source removal – crash-safe ordering that prevents entry loss.

The /cauto integration uses intensity-aware placement (DD-005/INV-012). At high+ intensity, /cprune runs after /cupdate-arch – architecture docs are being updated anyway, so pruning alongside ensures they are both accurate and lean. At standard intensity, /cupdate-arch is skipped entirely, so /cprune runs after /cverify instead. In both cases, /cprune is an internal orchestration action excluded from the ABS-031 canonical step name enum (same pattern as the Step 7.5 backlog sweep). The /cstatus integration (INV-013) runs a lightweight threshold check via the scanner and surfaces a “pruning recommended” signal when orphaned artifacts exceed 10 or stale architecture entries exceed 3. This check is dormant (PAT-019) when scripts/prune-scan.sh is not installed, ensuring no errors on projects that have not adopted this skill.

Security enforcement follows established conventions. The scanner script and all three archive files are protected by hooks/sensitive-file-guard.sh (INV-016), preventing LLM agents from modifying staleness detection logic or injecting entries into archive files. ABS-038 declares the archive file contract with /cprune as the sole writer. The /cauto consolidation step (Step 8.1) staging allowlist includes all three archive files (INV-017), ensuring archive changes during the pipeline are committed. /cprune is explicitly read-only for deferred-findings.json (PRH-004) – it reports stale deferred findings but delegates status updates to /ctriage, avoiding a 5th writer on the ABS-033 multi-writer contract. The 116 tests in tests/test-cprune.sh cover all 19 invariants, 4 prohibitions, 4 boundary conditions, the ABS-038 architecture entry, determinism, and edge cases including empty archives, bulk warnings, no-remote fallback, and real ARCHITECTURE.md entry fixtures (per AP-031).

2026-06-03 — Disallowed-Tools Frontmatter

This feature adds disallowed-tools frontmatter to 12 skills that should never edit source files, applying PAT-018 (structural enforcement over prompt-level instruction) as a defense-in-depth layer alongside the existing allowed-tools whitelist. Claude Code v2.1.150 introduced disallowed-tools in skill YAML frontmatter, which structurally removes listed tools from the model while the skill is active – a blocklist complementing the allowed-tools allowlist.

The 12 skills are split into two groups based on their write requirements. Group A (chelp, cstatus, cdashboard) produces no file output via Write, so all five write-capable tools are disallowed: Edit, Write, MultiEdit, NotebookEdit, CreateFile. Group B (cexplain, cwtf, cmetrics, csummary, cpr-review, cmaintain, cmodel, cmodelupgrade, ctriage) writes artifacts via the Write tool (e.g., .correctless/artifacts/wtf-*), so only four tools are disallowed – Write is retained. The remaining 20 skills are exempt because they legitimately use Edit/Write for source file modifications (e.g., /ctdd writes tests, /creview edits specs).

The implementation touches 24 SKILL.md files (12 source + 12 distribution copies) with a single frontmatter line each. The test file (tests/test-disallowed-tools.sh, 339 lines, 117 assertions) covers 7 spec rules. R-005 is the most interesting structurally: it extracts tool basenames by stripping sub-pattern scoping (e.g., Write(.correctless/artifacts/wtf-*) yields Write) and checks that the disallowed set is disjoint from the allowed set. A sub-rule enforces that Group B skills specifically do not disallow Write. R-007 implements a full partition test – every skill in the skills/ directory must be classified as Group A, Group B, or Exempt. This structural drift test ensures that any new skill added to the project triggers a test failure until the developer classifies it, preventing silent omission of write-protection on read-only skills.

ENV-011 was added to ARCHITECTURE.md for the Claude Code v2.1.150 version dependency. On older versions, the disallowed-tools key is silently ignored – no crash, no enforcement. The allowed-tools whitelist handles protection alone. This graceful degradation means the feature never breaks backward compatibility. The defense-in-depth framing is deliberate: neither layer alone is sufficient (the allowed-tools list could be misconfigured; disallowed-tools could be unsupported), but together they provide both “only these tools” and “never these tools” constraints on the same skill.

2026-06-12 — AP-031 Fixture Divergence Prevention

This feature is the structural answer to two back-to-back postmortems with the same root cause. PMB-010: sync-deferred-backlog.sh parsed review findings with a heading regex expecting ## RS-001: while the real /creview-spec output writes ## Finding RS-001: — all 65 tests passed against hand-written fixtures encoding the wrong format, and the script silently imported 0 of 25 pending findings. PMB-011: the /cprune scanner shipped with three more instances of the same class (17 false positives from basename fixtures, a count regex that matched PAT-003 script before the actual count, drift-debt fixtures missing the real {"drift_debt": [...]} wrapper). The class is “test fixtures diverge from real producer output” — AP-031 in the antipattern catalog. The bet here is that the divergence is introduced at exactly two moments (spec writing and test writing), so prevention belongs in the prompts that govern those moments rather than in a runtime validator.

What was written is almost entirely prose-as-code: three directive blocks plus one structural test. Layer 1 lives in skills/cspec/SKILL.md Step 3 — when a feature parses another Correctless tool’s output, the spec must pin the exact format (heading regex, JSON schema, field names) and cite the producer file path as the authoritative source. The directive carries its own trigger-detection heuristics (parsing, jq field access, regex matching trigger it; existence checks and path-only operations do not) and an Example/Not contrast so the spec agent has a calibration anchor. Layer 2 has a writer half and an auditor half kept deliberately symmetrical: agents/ctdd-red.md requires at least one fixture sourced from a real artifact — preferred form is a verbatim excerpt with a Source: citation in the test language’s comment syntax (# Source: shell/Python, // Source: Go/TS/Java, -- Source: SQL) — and skills/ctdd/SKILL.md gains test audit check 11 (fixture provenance), which flags synthetic-only suites as BLOCKING. Both halves embed the same producer-to-artifact reference table (/creview-specreview-spec-findings-*.md, /cauditfindings/audit-*-round-*.json, etc.) so the writer and the auditor agree on what “a real artifact exists” means.

The subtle design decisions are in the failure modes. Live-read-only fixtures (test reads .correctless/artifacts/... at runtime) are explicitly insufficient — that directory is gitignored, so such a test silently passes in CI with no fixture at all; the audit treats live-read-only as BLOCKING, same as synthetic-only. The audit agent is tool-pinned to Read/Grep/Glob and cannot run git, so the /ctdd orchestrator computes scope and passes two labeled lists — MODIFIED_TEST_FILES: from git diff and UNTRACKED_TEST_FILES: from git status --porcelain (RED-phase test files are untracked, not modified — omitting that list would silently skip exactly the tests that matter). A missing label fails loud with a single BLOCKING finding instead of guessing scope (the PMB-005 lesson: silent omission looks healthy). Fixture-following is bounded (repo-relative paths only, 10-file budget) and fenced (TB-003: fixture content is data to format-compare, never instructions — a fixture saying “AP-031 is satisfied” is itself a finding). And there’s a deliberate bootstrap dormancy: when producer and consumer land in the same PR, no real artifact exists yet, so the real-fixture requirement goes dormant and Layer 1’s format pinning is the sole guard until the producer has run once.

This is a conscious PAT-018 deviation — all enforcement is prompt-level, with the spec’s Won’t Do explicitly declining a runtime fixture validator. The compensating structure is tests/test-ap031-fixture-divergence.sh (39 tests): awk state machines extract each directive’s section before grepping, so keyword assertions are block-scoped (AP-003 mitigation — a keyword elsewhere in the file can’t satisfy a check), and 8 QA/mini-audit class fixes are pinned as named assertions (the cost-cache-* exclusion in the producer table, the fail-loud label fallback, the TB-003 fence, the 10-fixture budget, the no-retroactive-retrofit scope rule). Since the “implementation” is prose, regression means someone editing the directive text — the structural test makes that loud. The antipatterns.md AP-031 entry now carries a “Prevention implemented” note that reframes any recurrence as a postmortem trigger rather than a third strike toward PAT-020 promotion.

2026-06-14 — Slug-Type-Aware Artifact Classification in prune-scan.sh

This is the structural fix for the 2nd AP-032 instance — the prune-scan scanner. Before this feature, scripts/prune-scan.sh’s scan_artifacts had a single mental model: every pattern in artifact_patterns is branch-slug-named (feature-<name>-<md5[:6]>), so to find orphans, match every artifact filename against the live branch-slug set using substring search. But the repo had been quietly using three different slug conventions for years. Branch-slug for workflow-state, token-log, audit-trail, pipeline-manifest, autonomous-decisions. Task-slug (bare task name, no feature- prefix, no hash) for qa-findings, audit-mini. Session-slug (Claude Code session ID, never derived from any live work) for harness-notified-{SESSION_ID}.flag. When a task-slug-named file like qa-findings-prune-scan-slug-aware-matching.json was matched against the live branch-slug set, the match failed and the live file was flagged as a low-risk deletion candidate. Autonomous /cprune would have deleted it. UX-R2-014 had patched the qa-findings instance specifically by removing it from the patterns list, but the bug class was untouched. Any future task-slug-named pattern would silently exhibit the same data-loss vector.

The fix is _classify_artifact_pattern — a bash function in scripts/prune-scan.sh that maps every artifact_patterns entry to exactly one of branch-slug, task-slug, session-slug, or unclassified. It’s total over artifact_patterns (every pattern has a case branch) and defined exactly once (structural test INV-001). When the safety belt checks whether an artifact protects live work, it consults the live-slug set that matches the pattern’s classification — branch-slug patterns against live_branch_slugs (computed via branch_slug() from scripts/lib.sh), task-slug patterns against live_task_slugs (derived from basename(.spec_file, ".md") for each workflow-state-*.json whose .branch is in the live branch set — no .task fallback per EA-003), session-slug patterns are never live-prunable, and unclassified patterns are skipped with an observable skipped_unclassified JSON entry plus stderr advisory (never silently dropped). The producer-pattern table in .correctless/specs/prune-scan-slug-aware.md is the source of truth — INV-008 parses it and the artifact_patterns= assignment line via sed (no prose-grep, no source-and-read) and asserts bidirectional coverage with an allowlist cap of 5.

The other half of the fix is the matching primitive itself. The pre-feature scanner used grep -F "$slug" and unquoted =~ $slug — both substring primitives that cannot distinguish feature-foo-abc from feature-foo-def, or qa-findings-foo from qa-findings-foo-2. The new primitive is bash [[ regex with [-.] delimited-token boundaries: [[ $f =~ ^(.+-)?$slug([-.]|$) ]]. Substring primitives are structurally banned by the new prune-scan-substring-match rule in scripts/antipattern-scan.sh check_shell(). Slug values pass through _slug_is_safe validation at extraction boundaries AND ERE metacharacters are escaped via _escape_ere_metachars before regex interpolation — dual defense ensures malformed slugs are rejected at the boundary AND that any slug slipping through cannot exploit ERE metachar interpretation. MA-001 was the round-2 mini-audit finding that surfaced the ERE escape requirement; the fix is the _escape_ere_metachars helper at lines 222-228 plus the _slug_is_safe gate at lines 251-256.

Six fail-closed paths were added at safety-belt boundaries that pre-feature silently collapsed: (1) empty live-branch-slug set, when git branch returns no live branches, fails non-zero with stderr advisory instead of proceeding with an empty set that would classify every artifact as orphaned (F-001 fix at scripts/prune-scan.sh:744); (2) empty live-task-slug set with the same fail-closed posture; (3) missing realpath — _realpath_tool_available probes for realpath/readlink -f at scan entry, and when neither is available, the scanner exits non-zero with stderr advisory; it never silently falls back to lexical canonicalize_path for symlink-equivalence decisions (this is PAT-020, the canonicalization-fallback antipattern); (4) workflow-state mid-write TOCTOU, where identity comparison uses content-based started_at string equality (primary) then composite task|branch (fallback) then sha256(file) (last resort), never mtime — extending the ABS-029 content-based-match convention to cross-worktree scenarios where the same logical workflow-state may be observed at different paths (MA2-002 fix); (5) non-git BASE_DIR aborts with stderr advisory rather than swallowing git errors silently; (6) lib.sh sourcing failure, where scan_artifacts aborts when branch_slug() isn’t defined after sourcing rather than calling an undefined function.

Schema migration was the other major design decision. The pre-feature scanner emitted a bare JSON array of candidates. The new scanner emits a wrapped object {candidates, skipped_unclassified, protection_set, protection_status}. Consumers (/cprune and /cstatus) read .candidates for the candidate list and have visibility into which patterns were skipped (and why), what protection set was applied (live_branch_slugs, live_task_slugs, session_id), and what protection status was achieved (branch_slug_set_populated, task_slug_set_populated, realpath_available). BND-001 verified the consumer migration — reading the top-level value as an array fails the test. Both skills/cprune/SKILL.md and skills/cstatus/SKILL.md were migrated in the same PR to maintain the consumer contract.

ABS-040 introduces the baseline manifest: a JSON file under .correctless/meta/ recording the operator-acknowledged pattern set. Sole writer is scripts/prune-scan.sh --update-baseline, never set as a side effect of scanning. Autonomous /cprune runs, /cstatus runs, and default-mode /cprune runs all leave the baseline untouched. Update happens only when /cprune SKILL.md invokes the scanner with --update-baseline after interactive human confirmation. For any pattern present in current artifact_patterns but absent from the baseline, candidates emitted via that pattern carry risk: "medium" regardless of safety-belt outcome — preventing auto-promotion of newly-added patterns to low risk without human review. When the baseline file is missing or corrupt, the scanner fails closed to all-medium (INV-011a); it does not proceed as if baseline equaled current set. The baseline file is SFG-protected.

The producer-pattern table approach (INV-008) is itself a deliberate AP-031 satisfaction. Rather than a runtime pattern registry that could drift from the implementation, the spec is the authoritative source. The structural test parses both the spec table and the bash artifact_patterns= assignment via sed and asserts bidirectional coverage at CI time. The real-fixture requirement is satisfied by tests/fixtures/prune-scan/wfstate-real-sample.json — a verbatim 17-line excerpt of a real workflow-state JSON cited via # Source: comment. The fixture exercises the _workflow_state_identity content-based fence — the same primary→fallback→last-resort chain as ABS-029 applied at a new context (cross-worktree state identity instead of audit findings persistence).

The QA round caught 4 BLOCKING findings (F-001 empty live_branch_slugs, F-002 INV-018 dead-code, F-003 silent branch_slug failure, F-004 unescaped task slug in bash ERE) and the mini-audit caught 6 more across rounds 1 and 2 (MA-001 metachar escape, MA-002 pattern_is_new default, MA-003 baseline shape validation, MA-005 parent symlink bypass, MA2-001 realpath fallback, MA2-002 workflow-state identity, MA2-004 set -f noglob). All 10 are fixed and traceable in code via F-NNN fix: / MA-NNN fix: / MA2-NNN fix: comments. 61 tests in tests/test-prune-scan-slug-aware.sh cover all 18 invariants, 2 prohibitions, 2 boundary conditions, the extended EA-001, and the antipattern-scan rule registration. The cverify pass found pre-existing test failures in tests/test-cprune.sh (INV-013-d AP-033 pipefail flake, INV-016-a/b cprune SFG gap) unrelated to this branch — flagged for next debt sprint.

AP-032 is now at frequency 2. Both instances share the same shape: extraction step correct, resolution step incomplete. The cprune-skill 2026-05-24 instance was basename resolution against literal paths (file_exists("lib.sh") returns false when the file lives at scripts/lib.sh). This instance is substring slug matching against delimited tokens. A 3rd instance promotes AP-032 to a PAT-xxx structural rule: “any tool that resolves named references (paths, slugs, identifiers) against on-disk artifacts must define explicit resolution semantics, not lift the comparison primitive from convenience.”

2026-06-15 — Fix-diff reviewer class-shaped bug lens + SFG lift-and-restore backstop

What & why. PMB-019 recurred a class-shaped bug: PR #124 fixed one ARG_MAX site in scripts/build-dashboard.sh and missed the sibling read_file_json helper using the same --arg "$content" pattern. This feature adds a class-shaped bug detection lens to the fix-diff reviewer (agents/fix-diff-reviewer.md): when a fix is scope-narrowed (one site of a multi-instance pattern), the reviewer greps same-directory same-extension sibling modules before approving and emits a HIGH finding unless a SIBLING-DEFERRED: marker enumerates the deferrals. Because the reviewer’s deliverable file is itself SFG-protected (AP-037), the feature also ships an SFG lift-and-restore backstop subsystem so future PRs can develop the guarded file.

What was written. 21 invariants CS-001..CS-021. New: scripts/build-caudit-prompt.sh (the /caudit Step 6a <UNTRUSTED_FINDING_DESCRIPTION> + <PRE_PR_BASE_MARKERS> fence producer), scripts/build-pre-pr-base-markers.sh, scripts/check-no-pending-sfg-lift.sh (CS-012a final-state backstop), a cmd_done sentinel gate in the workflow dispatcher, a dedicated sfg-lift-check CI job, .claude/rules/sfg-deliverable.md, ABS-041 in ARCHITECTURE.md, and ~1300 lines of structural+behavioral tests in tests/test-fix-diff-reviewer-agent.sh (the CS-007 cardinality checklist asserts membership-equality over the 20-ID set).

How it works / hard-won lessons. Three QA rounds and two mini-audit rounds each found real bugs, almost all of the same class the feature exists to prevent: (1) the done-gate sentinel was twice dead — first never written, then keyed its filename on HEAD with content==HEAD so the mismatch branch was unreachable; the fix is a single fixed-name .correctless/artifacts/test-success.sha holding the SHA the suite last passed at, with a behavioral test that constructs the mismatch and asserts refusal. (2) The CS-011 fence producer was prose-only (test-only helper), reopening AP-026 — promoted to a real coded invocation in Step 6a. (3) The producer itself reintroduced PMB-019: jq --arg on unbounded descriptions silently lost findings, and the ceiling fix bounded only the diff while pre-PR-base markers stayed unbounded — fixed class-wide by routing every artifact-sized value through a single stdin→file→cap chokepoint and reserving trusted close fences in a tail post-assembly truncation can’t reach. (4) Character-delimited fences were forgeable (inject a fake rules block / forge a pre-PR-base marker) — fixed with per-invocation nonce-delimited fences + content neutralization. Patterns used: ABS-029 (gate-enforced phase-transition artifact contract), ABS-035 (dispatcher keeps zero cmd_* definitions — the gate is a _-prefixed helper), ABS-010 (byte-equal distribution mirror), PAT-018 (the read-scope deny-list is a prompt-level fallback; the structural Read-guard is deferred to a /carchitect cycle per OQ-010).

The meta-lesson. Every fix round that scope-narrowed (patched one site) sprouted a sibling the next round caught — the feature’s own thesis applied to its own development. The final producer-hardening round explicitly inventoried every argv site and every unbounded body component before declaring the class closed.

2026-06-16 — Cross-Model Spec Review via codex

What was built and why. /creview-spec already had a dormant “external review” path — a config block and a Step 3 stub that never did anything. This feature makes it real: codex (GPT-5.5) becomes a first-class adversarial spec reviewer that runs alongside Claude’s six review agents. The motivation is the project’s founding principle pushed one step further — not just “a different agent grades the work,” but a different model from a different vendor with a different failure distribution. A spec that survives both Claude’s lenses and an independent GPT-5.5 read is materially harder to get wrong. The path stays off by default and silently dormant when no external model is configured, so existing users see no change until they opt in.

What code was written. Two new scripts plus skill wiring. scripts/external-review-run.sh (~660 LOC) is the producer: subcommands review (invoke codex, capture findings), record (append a run to history), set-disposition, pending, and findings-block. cmd_review builds a local -a argv array, injects --sandbox read-only and the --output-schema/--output-last-message flags itself, runs codex with the spec on stdin (never argv), then routes the result through _validate_invocation (closed allowlist), _sanitize_findings (parse-gate, caps, EXT- renamespace, severity coercion), and _within_size_ceiling (4 MiB). It reuses build-caudit-prompt.sh’s _gen_nonce/_neutralize_fences verbatim for the untrusted-output fence. scripts/config-update.sh (~190 LOC) is the sanctioned writer for the two config fields (set-external-model, set-require-external-review) using jq --arg/--argjson + atomic temp+mv, so config never transits a shell redirect. /csetup, /creview-spec (Step 3/3.5), and /cstatus got the wiring; both new scripts were added to the SFG DEFAULTS.

How it works — the trust model. codex is treated as doubly untrusted: its output is shaped exactly like the review findings the orchestrator acts on (TB-008), and its config is the first config input treated as untrusted-against-tampering rather than owner-trusted (TB-001c). The defenses compose: the invocation is a closed allowlist with bin-realpath + flag-shape + model-charset + clamped-timeout validation; the output is parse-gated, bounded, and nonce-fenced before it reaches Claude’s reasoning; and codex findings are advisory-only — renamespaced EXT-NNN, surfaced at the Step 4 human disposition gate, never auto-incorporated. The read-only sandbox bounds writes, not egress — the egress boundary is the opt-in config gate (INV-005 auto-off-when-absent) plus the INV-014 config-time and INV-022 per-run disclosures, deliberately not the sandbox.

The CRITICAL the mini-audit caught (and the design lesson). QA round 1 and the red-team probe both passed, but the mini-audit’s hostile-input lens found that --sandbox read-only lived in the config’s base_args — so a tampered or hand-edited config that dropped it would run codex unsandboxed while every test still passed. That is the AP-022 dead-code-in-security-paths shape exactly: the guard existed but a config could route around it. The fix moves sandbox injection into the producer unconditionally and strips any config attempt to set --sandbox, with a regression test that captures the real argv and asserts the flag is present regardless of config. A round-2 hostile-input re-attack confirmed the fix holds against every bypass vector tried. This is the load-bearing argument for narrow tool allowlists and producer-side enforcement: security-relevant flags belong in code the config cannot reach, not in config the user can edit. Patterns used: ABS-042 (sole-writer producer), ABS-003 (locked_update_file for the history append), INV-009 fence reuse from build-caudit-prompt.sh (don’t reinvent neutralization), PAT-018 (structural enforcement over prompt-level for the sandbox flag).