Pre-Registered Roadmap  ·  Year-One Window

The Evaluation Roadmap

A public clock. Ten dated milestones. Pre-registered thresholds. Named external witnesses. The cheapest forgery-resistant external signal a solo project can manufacture.

Version: v1.0 · Published May 2026
Git tag: roadmap-v1.0
Window: May 2026 → April 2027 (year one)
Companion: The Validation Paradox
Canonical source: docs/research/evaluation-roadmap/roadmap-v1.0.md
Author: Lee Wei Bin (Alton), VerifiMind · FLYWHEEL TEAM
License: CC BY 4.0

Companion document. The Validation Paradox page named the problem: X / Z / CS are prompt templates running on frontier models, with no labeled evaluation set, no calibration, no execution sandbox, and no inter-judge agreement statistics. It ended on a single line: the only available exit from a closed validation loop is an external signal.

This page is that external signal: a public clock, pre-registered before the data is collected, with failure conditions stated up front. Failure numbers will ship in the same font size as success numbers.

SECTION 0

What VerifiMind is not yet

Before the roadmap, the disclaimer block — written so an enterprise buyer who finds this page through search cannot misread the paradox page as proof of certification:

Honest scope statement, May 2026:
  • VerifiMind is not a certified safety auditor.
  • VerifiMind is not a validated production evaluation tool.
  • VerifiMind does not currently produce calibrated confidence intervals.
  • VerifiMind does not currently execute submitted code in an isolated sandbox.
  • VerifiMind has not yet established inter-judge agreement statistics against a held-out human-labeled set.

What VerifiMind is, today, is a structured hypothesis about how a three-agent critique pipeline (novelty / scoring / code-security) might improve on single-call frontier-model critique. This roadmap is the plan to make that hypothesis falsifiable.

SECTION 1

Pre-registered thresholds

These are stated before any data is collected. Failure to clear them does not kill the project. It triggers a public retrospective that names what changed and why.

| Threshold | Metric | Pre-registered bar |
| --- | --- | --- |
| Inter-rater reliability | Cohen’s κ (Landis–Koch scale) | κ ≥ 0.60 for usable claims; ≥ 0.80 for production-eligible language |
| Scoring calibration | Expected Calibration Error | ECE ≤ 0.05 on the held-out labeled set |
| Cross-family judge validity | κ_cross across Anthropic / OpenAI / Google judges | κ_cross ≥ 0.60 (lower 95% CI) |
| Perturbation invariance | Agreement under semantics-preserving perturbations | I > 0.90 |
| Human-anchor agreement | Judge vs. human gold on anchor set | AH ≥ 0.80 |
| “Better than baseline” | F1 lift vs. single-call frontier baseline | ΔF1 ≥ 0.10, lower 95% CI above 0 |
| CS execution success | Sandbox runs on labeled vulnerability corpus | ESR ≥ 0.95 on runnable tasks |
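As an illustration of how the κ bars might be checked, here is a minimal sketch of Cohen’s κ with a percentile-bootstrap lower bound. The function names, the bootstrap settings (2,000 resamples, fixed seed), and the handling of the degenerate single-label case are assumptions made for this example, not the project’s published method.

```python
import random
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Kappa is undefined here; returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def kappa_lower_bound(labels_a, labels_b, n_boot=2000, alpha=0.05, seed=0):
    """Lower end of a two-sided (1 - alpha) percentile-bootstrap CI for kappa."""
    rng = random.Random(seed)
    n = len(labels_a)
    samples = []
    for _ in range(n_boot):
        # Resample items with replacement, keeping each rater pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(cohens_kappa([labels_a[i] for i in idx],
                                    [labels_b[i] for i in idx]))
    samples.sort()
    return samples[int((alpha / 2) * n_boot)]


def clears_kappa_bar(lower_bound, bar=0.60):
    """True when the CI lower bound meets the pre-registered bar."""
    return lower_bound >= bar
```

Comparing the lower bound rather than the point estimate matches the wording of the cross-family row; whether the inter-rater row should be held to the point estimate or to the lower bound is not stated above, so the sketch leaves that choice to the caller.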

The labeled set is sized to detect a 10-point F1 improvement at α = 0.05 and power = 0.8: 800 items total, ≥ 400 positives per primary slice; 400 adversarial items (≥ 100 per attack family); 200 retest items for test-retest reliability; 200 human-anchor items.
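The “better than baseline” bar depends on a confidence interval for the F1 lift. One way to obtain it is a paired bootstrap over labeled items, sketched below. The binary 0/1 label encoding, the function names, and the reading of the bar as “point estimate ≥ 0.10 and lower bound above 0” are assumptions made for this example.

```python
import random


def f1(gold, pred):
    """Binary F1 with 1 as the positive class; returns 0.0 when undefined."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def delta_f1_ci(gold, pipeline, baseline, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for F1(pipeline) - F1(baseline)."""
    rng = random.Random(seed)
    n = len(gold)
    point = f1(gold, pipeline) - f1(gold, baseline)
    deltas = []
    for _ in range(n_boot):
        # Paired resampling: the same items are drawn for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        deltas.append(f1(g, [pipeline[i] for i in idx])
                      - f1(g, [baseline[i] for i in idx]))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


def clears_f1_bar(point, lo, min_lift=0.10):
    """Assumed reading of the bar: lift of at least min_lift, lower bound above 0."""
    return point >= min_lift and lo > 0
```

Resampling items in pairs keeps the comparison on identical inputs, which usually gives a tighter interval than bootstrapping the two systems independently.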

SECTION 2

The Public Roadmap — 10 milestones

Each milestone has (a) a concrete artifact, (b) a pre-registered numerical threshold for what “passes,” and (c) a named external witness identified before the work starts. Every milestone closes with a public retrospective. Silent edits to this page are blocked by a CI workflow that fails the build if roadmap-v*.md is modified without a new git tag.
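The tag-gating rule described above can be expressed as a small pure function that a CI job could call. This is a sketch of one possible check, not the project’s actual workflow: the function name, its three inputs, and the idea of comparing tags on the current commit against previously known tags are all assumptions.

```python
import fnmatch

ROADMAP_GLOB = "roadmap-v*.md"  # file pattern named in the roadmap text
TAG_GLOB = "roadmap-v*"         # assumed tag naming scheme, following roadmap-v1.0


def roadmap_edit_allowed(changed_files, tags_on_head, known_tags):
    """Return True if the build may pass, False if it should fail.

    changed_files: paths modified in the change under test
        (for example, from `git diff --name-only <base> HEAD`).
    tags_on_head: tags pointing at the commit under test
        (for example, from `git tag --points-at HEAD`).
    known_tags: roadmap tags that existed before this change.
    """
    touched = [path for path in changed_files
               if fnmatch.fnmatch(path.rsplit("/", 1)[-1], ROADMAP_GLOB)]
    if not touched:
        return True  # no roadmap file modified, nothing to gate
    new_tags = [tag for tag in tags_on_head
                if fnmatch.fnmatch(tag, TAG_GLOB) and tag not in known_tags]
    return bool(new_tags)
```

Keeping the decision separate from the git calls makes the rule easy to unit-test; the CI job would only need to collect the three inputs and exit non-zero when the function returns False.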

| # | Date | Milestone | Concrete artifact | External signal required |
| --- | --- | --- | --- | --- |
| M0 | May 2026 | Roadmap published | This page + git tag roadmap-v1.0 + CI workflow that fails on untagged roadmap edits | Roadmap referenced by ≥ 1 external blog, mailing list, or social post outside the VerifiMind / YSenseAI ecosystem |
| M1 | Jun 2026 | Governance fixes shipped | LLM personas renamed to (model, role-in-this-reflection) tuples; SECURITY.md (90-day disclosure window); MAINTAINERS.md with open second slot; GOVERNANCE.md (amendment process); co-maintainer post live | ≥ 1 inbound CV or “interested” reply to the co-maintainer post; security disclosure email reachable and monitored |
| M2 | Jul 2026 | Seed labeled eval set v0 | 100 items under intra-annotator test-retest protocol, expanded to 200 items by month-end. Stratified X / Z / CS. Published on Hugging Face Datasets with Zenodo DOI. License: MIT. | Dataset DOI minted; ≥ 1 independent download or fork logged; intra-annotator κ from 2-week washout protocol published |
| M3 | Sep 2026 | First Cohen’s κ report | Pre-registered three-way κ: (a) each agent vs. human gold, (b) same agent vs. itself across reruns with different seeds, (c) cross-family κ across Anthropic / OpenAI / Google judges. Bootstrap 95% CIs. Class-conditional confusion matrices. | Numbers published as found, including failure cases. Named external annotator (paid via Prolific) confirms label set independently. |
| M4 | Oct 2026 | Co-maintainer onboarded or publicly conceded | Either second named maintainer in MAINTAINERS.md with merged independent PR, or a published retrospective explaining why recruitment failed and what changes next | Co-maintainer’s first independent PR merged with passing CI, or timestamped + signed concession retrospective |
| M5 | Nov 2026 | CS execution sandbox v0 | gVisor or Firecracker isolation; no network egress by default; explicit allow-list per evaluation; published threat model covering MCP-specific attack classes (prompt injection via repo content, SSRF, plaintext credentials, response leakage); one-command reproduction harness | One named external security researcher invited to break it; their write-up published whether positive or negative |
| M6 | Jan 2027 | Z calibration & abstention | Z emits confidence + ECE + reliability diagram against the M2 labeled set. Mandatory abstention when cross-judge κ on item type < 0.40. Brier score reported. | Reliability diagram + ECE + Brier published with 95% bootstrap CIs and a list of high-confidence wrong answers |
| M7 | Feb 2027 | External benchmark (readiness-gated) | One named benchmark per agent, pre-registered in roadmap-v1.x before running. Candidates: HaluEval (X), MLCommons AILuminate (Z), SWE-Bench Lite or PaperBench (CS). Readiness memo published first. | Full prompt logs, seeds, model identifiers, and raw outputs published. Results acknowledged by ≥ 1 external party. |
| M8 | Mar 2027 | NIST AI RMF self-attestation | Mapping of VerifiMind’s practices to the NIST AI RMF Generative AI Profile Govern / Map / Measure / Manage functions, with honest gap statement | One external reviewer (academic, journalist, or practitioner) invited to critique; their critique published |
| M9 | Apr 2027 | Year-1 retrospective | Full retrospective covering every milestone hit, slipped, or abandoned. Decision: continue OSS-free, pivot, or sunset. Labeled set at 800 items. | Retrospective signed and dated; raw progress data and all dataset versions linked |
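The M6 quantities can be made concrete with a short sketch. It assumes Z emits, for each item, a confidence in [0, 1] that its verdict is correct, and that correctness is scored 0/1 against the labeled set. The equal-width 10-bin ECE, the function names, and the abstention helper are illustrative choices, not the definitions from the technical appendix.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (the bin count is an assumption)."""
    n = len(correct)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the top bin.
        index = min(int(c * n_bins), n_bins - 1)
        bins[index].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's calibration gap by its share of the items.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def must_abstain(cross_judge_kappa, floor=0.40):
    """M6 rule: abstain when cross-judge kappa for the item type is below the floor."""
    return cross_judge_kappa < floor
```

The per-bin average confidence and accuracy computed inside the loop are the same pairs a reliability diagram would plot, so one pass over the labeled set can feed both the diagram and the ECE figure.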

SECTION 3

Kill-conditions

The roadmap is real only if abandonment conditions are stated up front. VerifiMind-as-tool will be retired and the work continued as methodology research if any of the following observable conditions occurs, with a public retrospective explaining the trigger:

  1. No external human validation by day 180 — fewer than 3 independent non-friend users complete a structured evaluation task and consent to be cited.
  2. No ML co-maintainer by day 120 — no qualified collaborator signs a scoped RFC role or completes a substantive PR / review cycle despite a public role spec.
  3. Label reliability failure — after two label-protocol revisions, core label κ remains below 0.60 on a representative sample.
  4. Calibration failure — Z cannot show materially better-than-baseline calibrated confidence on the labeled set, with high-confidence false positives remaining frequent and unexplained.
  5. Security gating failure — CS requires code execution for its advertised claims but no sandbox + threat model are shipped by day 180.
  6. Benchmark falsification without learning — external benchmark shows weak performance and no credible error taxonomy is produced within 30 days.
  7. Cost-revenue inversion — frontier-model and hosting costs exceed project revenue or committed sponsor funding for 3 consecutive months with no credible unit-cost path.
  8. Trust harm — a buyer, vendor, or public reviewer reasonably interprets VerifiMind as claiming validated ML rigor this roadmap admits it does not yet possess.

A kill-condition firing produces a documented decision, not an automatic pivot. The decision is published, dated, and signed.

SECTION 4

Commitment mechanism

A roadmap that lives only on a webpage is editable. Four mechanisms operate together to convert it from aspirational to binding:

  1. Git tags as commitments. This document is tagged roadmap-v1.0. Any modification of a milestone date or definition requires a new tag with a diff and a public reason. git log --tags is the audit trail.
  2. Milestone-keyed retrospectives. Each milestone closes with a public retrospective post. The retrospective is the artifact; the milestone is not complete until the retrospective is published.
  3. Pre-named third-party witnesses. For M3, M5, M7, M8, the external party is named in advance, publicly. A milestone with a named external reviewer cannot be unilaterally marked complete.
  4. Pre-registered failure conditions. Every milestone has an explicit “this counts as a miss” definition stated now, not after the fact. Post-hoc rationalization is visible because the rationalization comes after the pre-registration.

Git tags make silent edits visible. Retrospectives make silent skips visible. Witnesses make false completions visible. Pre-registered failure conditions make rationalization visible. Together they form the cheapest available substitute for the institutional review structure VerifiMind cannot afford.

What this roadmap deliberately does not promise

  • It does not promise revenue, users, or adoption metrics.
  • It does not promise that κ will be high — only that κ will be measured and published.
  • It does not promise the co-maintainer search succeeds — only that the search happens publicly and its outcome is reported.
  • It does not promise VerifiMind becomes a credible eval framework by April 2027. It promises that by April 2027, any external reader can answer “is this project methodologically serious?” from public evidence, in either direction.

SECTION 5

Technical RFC appendix

The full technical appendix — metric definitions (Cohen’s κ, ECE, Brier, F1, ESR with formulas), evaluation dataset spec, inter-rater reliability plan, cross-family judge triangulation, external methodology mapping (RAGAS, ARES, G-Eval, Prometheus-Eval, SWE-Bench, HumanEval), sandbox & security plan, MCP version-pinning, reproducibility checklist (NeurIPS-style), co-maintainer role and compensation terms, and acceptance criteria for the co-maintainer’s first “done” — is published as the canonical markdown source on GitHub. It is the appendix an ML-literate co-maintainer needs answered before joining.

Section B: Technical RFC appendix. Math, dataset spec, metric definitions, co-maintainer terms, reproducibility checklist. GitHub renders LaTeX inline.

Signed and dated. This roadmap is inside the loop it describes. We publish it anyway, because the only available exit from a closed validation loop is an external signal, and a public clock is one of the few external signals a solo project can manufacture honestly.

— Lee Wei Bin (Alton), VerifiMind · May 2026 · Tagged roadmap-v1.0 · CC BY 4.0
See also: The Validation Paradox — the problem this roadmap is the response to. Both pages are companions and link bidirectionally.