/cmodelupgrade — Harness Regression Report
Compare the current
{model}+HARNESS_VERSIONcombination’s pipeline metrics against a stored baseline and produce a per-feature regression report. Strictly advisory — never auto-applies anything.
When to Use
- After Anthropic ships a new model (e.g., 4.6 → 4.7) and you want to know whether the workflow’s behavior has shifted on your project
- When
/cspecor/cstatussurfaces aversion_bumpedadvisory (the maintainer has bumpedHARNESS_VERSIONinscripts/harness-fingerprint.shbecause a behavioral change was observed) - To capture an initial baseline for the current model+version with
--capture-baseline(requires at least 2 qualifying runs at the current key, plus human confirmation) - Not for: auto-migrating agents, modifying skill files, or making any change to the workflow. The report is informational; follow-up is human-driven.
How It Fits in the Workflow
This skill sits outside the standard feature pipeline. It’s invoked manually after a model upgrade or when the harness fingerprint mechanism flags a change. The fingerprint is computed at every /cspec Step -1 by scripts/harness-fingerprint.sh — when the literal {model_name}|{HARNESS_VERSION} string differs from the stored value, /cspec and /cstatus surface a one-time advisory pointing here.
What It Does
- Reads the current fingerprint via
bash .correctless/scripts/harness-fingerprint.sh check— extractsmodelandharness_version, constructs the lookup key{model}|{HARNESS_VERSION} - Reads four data sources per feature:
intensity-calibration.json(qa_rounds, total_tokens),cost-*.jsonglob (total_cost_usd from ABS-026 cost artifacts — never hardcoded list),workflow-state-*.json(phase_count),model-baselines.json(the stored baseline for the current key) - Three-tier bootstrap lookup: exact-match pool (entries tagged with current
harness_version) → pre-fingerprint pool (entries from before /cverify recordedharness_version, used with explicit"pre-fingerprint baseline"label) → no-baseline mode (clear message, exit 0 — never compares against zero) - Renders a per-feature regression table: up to N=5 features, columns for baseline/current/delta/percent across qa_rounds, total_tokens, total_cost_usd, phase_count
--capture-baselinemode: aggregates per-feature metrics from the most recent N=5 qualifying runs, requires ≥2 runs (incomplete runs excluded), surfaces source slugs to the human for explicit confirmation before writingmodel-baselines.json. The--auto-confirmflag is testing-only and emits abootstrap_auto_confirmedaudit-trail entry when used
Example
You ran /cspec for a new feature and saw:
Harness has changed (model=claude-opus-4-7 version=2). Run /cmodelupgrade to compare metrics against baseline.
You finish the feature and then run /cmodelupgrade:
Lookup key: claude-opus-4-7|2
Pool: exact-match (3 qualifying runs)
Baseline: claude-opus-4-7|1 (5 features, captured 2026-04-15)
| Feature | qa_rounds Δ | tokens Δ | cost Δ | phases Δ |
| feature/auth-redirect | +1 (+50%) | +180k (+22%) | +$1.10 | 0 |
| feature/email-queue | 0 | +95k (+12%) | +$0.40 | 0 |
| feature/rate-limit | 0 | +210k (+28%) | +$1.55 | 0 |
Token cost delta is consistent at +20-28% across all 3 features. Possible
harness change: increased verbosity in spec/review phases. Consider running
/caudit to look for systemic patterns, or inspect agents/*.md if regressions
appear in a specific agent type.
You decide whether the deltas warrant a deeper investigation. The skill never spawns subagents and never auto-applies any fix.
What It Reads / Writes
| Reads | Writes |
|---|---|
.correctless/meta/harness-fingerprint.json (read-only — sole writer is the script) | .correctless/meta/model-baselines.json (with --capture-baseline only, schema_version: 1 from creation) |
.correctless/meta/intensity-calibration.json (qa_rounds, total_tokens, harness_version) | .correctless/artifacts/cmodelupgrade-{slug}.md (the rendered report) |
.correctless/artifacts/cost-*.json (total_cost_usd glob) | (no other writes — TB-004 / PRH-003 forbid auto-application) |
.correctless/artifacts/workflow-state-*.json (phase_count) | |
.correctless/test-features/baseline.md (controlled-baseline reference feature) |
Common Issues
- “No baseline available”: Either no qualifying runs have been recorded for the current
{model}+{HARNESS_VERSION}key, or the baseline file is missing. Run/cmodelupgrade --capture-baselineafter running/cautoon.correctless/test-features/baseline.md(scaffolded by/csetup). - “need at least 2 qualifying runs”:
--capture-baselinerequires ≥2 complete runs at the current key. Pipeline-aborted runs markedincompletein their workflow state are excluded — degenerate runs cannot poison the baseline. pre-fingerprint baselinelabel: When only entries from before the fingerprint mechanism was added qualify, the report uses them with this explicit label. The exact-match pool (entries tagged withharness_version) is preferred whenever it exists. Pools are never mixed.schema_versionmismatch onmodel-baselines.json: BND-004’s fail-open path emits a one-line stderr warning, treats the baseline as missing, and prompts re-capture. Never errors, never blocks.
Constraints
- Strictly advisory (PRH-001/PRH-003). The exit-code contract is
0(success including no-baseline message),1(unexpected error),2(unrecoverable corruption with no migration). The report never auto-triggers any migration, agent file change, or skill modification. - Per-feature granularity only. Per-skill granularity is deferred until audit-trail records per-phase qa_rounds and token-tracking can backfill from cost artifacts (both upstream changes are out-of-scope here).
- Sole writer of
model-baselines.json. The fingerprint file itself (harness-fingerprint.json) is read-only here — sole writer isscripts/harness-fingerprint.sh. Both files are protected byhooks/sensitive-file-guard.shwhich blocks Edit/Write AND Bash redirects (PRH-002 structural enforcement, mitigates AP-022). - No subagents (ME-12 of harness-fingerprint spec). All aggregation, comparison, and report rendering happens inline. The
Tasktool is excluded fromallowed-tools. - Path discovery via
workflow-advance.sh statusfor every artifact (AP-025 / PMB-004). Never assumes the orchestrator already knows artifact paths from conversation context.