The Evaluation Roadmap

A public clock. Ten dated milestones. Pre-registered thresholds. Named external witnesses. The cheapest forgery-resistant external signal a solo project can manufacture.

Version	v1.0 — Published May 2026
Git tag	`roadmap-v1.0`
Window	May 2026 → April 2027 (year one)
Companion	The Validation Paradox
Canonical source	docs/research/evaluation-roadmap/roadmap-v1.0.md
Author	Lee Wei Bin (Alton), VerifiMind · FLYWHEEL TEAM
License	CC BY 4.0

Companion document. The Validation Paradox page named the problem: X / Z / CS are prompt templates running on frontier models, with no labeled evaluation set, no calibration, no execution sandbox, and no inter-judge agreement statistics. It ended on a single line: the only available exit from a closed validation loop is an external signal.

This page is that external signal — a public clock, pre-registered before the data is collected, with failure conditions stated up front. Failure numbers will ship in the same font size as success numbers.

SECTION 0

What VerifiMind is not yet

Before the roadmap, the disclaimer block — written so an enterprise buyer who finds this page through search cannot misread the paradox page as proof of certification:

Honest scope statement, May 2026:

VerifiMind is not a certified safety auditor.
VerifiMind is not a validated production evaluation tool.
VerifiMind does not currently produce calibrated confidence intervals.
VerifiMind does not currently execute submitted code in an isolated sandbox.
VerifiMind has not yet established inter-judge agreement statistics against a held-out human-labeled set.

What VerifiMind is, today, is a structured hypothesis about how a three-agent critique pipeline (novelty / scoring / code-security) might improve on single-call frontier-model critique. This roadmap is the plan to make that hypothesis falsifiable.

SECTION 1

Pre-registered thresholds

These thresholds are stated before any data is collected. Failing to clear one does not kill the project; it triggers a public retrospective that names what changed and why.

Threshold	Metric	Pre-registered bar
Inter-rater reliability	Cohen’s κ (Landis–Koch scale)	κ ≥ 0.60 for usable claims; ≥ 0.80 for production-eligible language
Scoring calibration	Expected Calibration Error	ECE ≤ 0.05 on the held-out labeled set
Cross-family judge validity	κ_cross across Anthropic / OpenAI / Google judges	κ_cross ≥ 0.60 (lower 95% CI)
Perturbation invariance	Agreement under semantics-preserving perturbations	I > 0.90
Human-anchor agreement	Judge vs. human gold on anchor set	A_H ≥ 0.80
“Better than baseline”	F1 lift vs. single-call frontier baseline	ΔF1 ≥ 0.10, lower 95% CI above 0
CS execution success	Sandbox runs on labeled vulnerability corpus	ESR ≥ 0.95 on runnable tasks
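As an illustration of how the κ rows above might be checked, here is a minimal Python sketch of Cohen's κ for two raters, with a percentile-bootstrap interval for the "lower 95% CI" style of bar. The function names (`cohen_kappa`, `bootstrap_ci`, `claim_tier`), the percentile method, the resample count, and the seed are assumptions made for illustration; the canonical metric definitions live in the technical appendix, not here.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of the raters' marginals.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here. Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def claim_tier(kappa_value):
    """Map a kappa value onto the bars stated in the threshold table."""
    if kappa_value >= 0.80:
        return "production-eligible language"
    if kappa_value >= 0.60:
        return "usable claims"
    return "below bar"


# Toy labels for illustration only; these are not project data.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
kappa = cohen_kappa(human, judge)
ci_low, ci_high = bootstrap_ci(human, judge, n_boot=500)
```

For a bar phrased as "κ_cross ≥ 0.60 (lower 95% CI)", the value to compare against 0.60 would be `ci_low`, not the point estimate.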

The labeled set is sized to detect a 10-point F1 improvement at α = 0.05, power = 0.8: 800 items total, ≥ 400 positives per primary slice; 400 adversarial items (≥ 100 per attack family); 200 retest items for test-retest reliability; 200 human-anchor items.
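For readers who want a feel for where a figure of this size can come from, the sketch below applies the standard two-proportion sample-size approximation at α = 0.05 and power = 0.8. It is not the roadmap's own power analysis: treating F1 as a simple proportion, the 0.50 baseline, and the function name `n_per_group` are all assumptions made here for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(p_baseline, p_improved, alpha=0.05, power=0.80):
    """Approximate items per group for a two-sided two-proportion z-test.

    Assumption: the metric behaves like a binomial proportion. F1 does not,
    strictly speaking, so this is an order-of-magnitude guide only.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p_baseline * (1 - p_baseline) + p_improved * (1 - p_improved)
    effect = p_improved - p_baseline
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical baseline of 0.50 lifted by 10 points to 0.60.
needed = n_per_group(0.50, 0.60)
```

Under these assumptions `needed` comes out at 385 per group, the same order as the ≥ 400 floor stated above. A different baseline changes the answer, so the figure should be read as a sanity check rather than a derivation.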

SECTION 2

The Public Roadmap — 10 milestones

Each milestone has (a) a concrete artifact, (b) a pre-registered numerical threshold for what “passes,” and (c) a named external witness identified before the work starts. Every milestone closes with a public retrospective. Silent edits to this page are blocked by a CI workflow that fails the build if `roadmap-v*.md` is modified without a new git tag.
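One way such a CI guard could work is sketched below in Python. It is a hypothetical illustration, not the project's actual workflow: the function names, the `origin/main` base ref, and the rule "a changed roadmap file needs a new `roadmap-v*` tag on HEAD" are assumptions. The only git commands used are `git diff --name-only`, `git tag --points-at`, and `git tag --merged`.

```python
import fnmatch
import subprocess

ROADMAP_FILE_GLOB = "roadmap-v*.md"
ROADMAP_TAG_GLOB = "roadmap-v*"


def roadmap_files(changed_paths):
    """Return the changed paths whose file name matches roadmap-v*.md."""
    return [
        p for p in changed_paths
        if fnmatch.fnmatchcase(p.rsplit("/", 1)[-1], ROADMAP_FILE_GLOB)
    ]


def edit_is_allowed(changed_paths, tags_at_head, tags_before):
    """True if no roadmap file changed, or HEAD carries a roadmap tag that
    did not exist on the base branch."""
    if not roadmap_files(changed_paths):
        return True
    new_tags = {
        t for t in tags_at_head if fnmatch.fnmatchcase(t, ROADMAP_TAG_GLOB)
    } - set(tags_before)
    return bool(new_tags)


def git_lines(*args):
    """Run a git command and return its non-empty output lines."""
    out = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    ).stdout
    return [line for line in out.splitlines() if line]


def main(base_ref="origin/main"):
    """Exit code for a CI step: 0 passes the build, 1 fails it."""
    changed = git_lines("diff", "--name-only", f"{base_ref}...HEAD")
    at_head = git_lines("tag", "--points-at", "HEAD")
    before = git_lines("tag", "--merged", base_ref)
    return 0 if edit_is_allowed(changed, at_head, before) else 1
```

A CI job would call `sys.exit(main())` after fetching tags. Keeping the decision in `edit_is_allowed`, separate from the git calls, lets the rule itself be unit-tested without a repository.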

#	Date	Milestone	Concrete artifact	External signal required
M0	May 2026	Roadmap published	This page + git tag `roadmap-v1.0` + CI workflow that fails on untagged roadmap edits	Roadmap referenced by ≥ 1 external blog, mailing list, or social post outside the VerifiMind / YSenseAI ecosystem
M1	Jun 2026	Governance fixes shipped	LLM personas renamed to `(model, role-in-this-reflection)` tuples; `SECURITY.md` (90-day disclosure window); `MAINTAINERS.md` with open second slot; `GOVERNANCE.md` (amendment process); co-maintainer post live	≥ 1 inbound CV or “interested” reply to the co-maintainer post; security disclosure email reachable and monitored
M2	Jul 2026	Seed labeled eval set v0	100 items under intra-annotator test-retest protocol, expanded to 200 items by month-end. Stratified X / Z / CS. Published on Hugging Face Datasets with Zenodo DOI. License: MIT.	Dataset DOI minted; ≥ 1 independent download or fork logged; intra-annotator κ from 2-week washout protocol published
M3	Sep 2026	First Cohen’s κ report	Pre-registered three-way κ: (a) each agent vs. human gold, (b) same agent vs. itself across reruns with different seeds, (c) cross-family κ across Anthropic / OpenAI / Google judges. Bootstrap 95% CIs. Class-conditional confusion matrices.	Numbers published as found — including failure cases. Named external annotator (paid via Prolific) confirms label set independently.
M4	Oct 2026	Co-maintainer onboarded or publicly conceded	Either second named maintainer in `MAINTAINERS.md` with merged independent PR, or a published retrospective explaining why recruitment failed and what changes next	Co-maintainer’s first independent PR merged with passing CI, or timestamped + signed concession retrospective
M5	Nov 2026	CS execution sandbox v0	gVisor or Firecracker isolation; no network egress by default; explicit allow-list per evaluation; published threat model covering MCP-specific attack classes (prompt injection via repo content, SSRF, plaintext credentials, response leakage); one-command reproduction harness	One named external security researcher invited to break it; their write-up published whether positive or negative
M6	Jan 2027	Z calibration & abstention	Z emits confidence + ECE + reliability diagram against the M2 labeled set. Mandatory abstention when cross-judge κ on item type < 0.40. Brier score reported.	Reliability diagram + ECE + Brier published with 95% bootstrap CIs and a list of high-confidence wrong answers
M7	Feb 2027	External benchmark — readiness-gated	One named benchmark per agent, pre-registered in `roadmap-v1.x` before running. Candidates: HaluEval (X), MLCommons AILuminate (Z), SWE-Bench Lite or PaperBench (CS). Readiness memo published first.	Full prompt logs, seeds, model identifiers, and raw outputs published. Results acknowledged by ≥ 1 external party.
M8	Mar 2027	NIST AI RMF self-attestation	Mapping of VerifiMind’s practices to the NIST AI RMF Generative AI Profile Govern / Map / Measure / Manage functions, with honest gap statement	One external reviewer (academic, journalist, or practitioner) invited to critique; their critique published
M9	Apr 2027	Year-1 retrospective	Full retrospective covering every milestone hit, slipped, or abandoned. Decision: continue as free OSS, pivot, or sunset. Labeled set at 800 items.	Retrospective signed and dated; raw progress data and all dataset versions linked
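The calibration artifacts named in M6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: ECE follows the equal-width binning of Guo et al. (with left-closed bins for simplicity), the Brier score is computed on the event "the judgment was correct", and the function names, the 10-bin default, and the 0.90 "high confidence" cut-off are hypothetical choices, not the project's published settings.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and confidence over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared error between stated confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, correct))
    return sum((c - (1.0 if ok else 0.0)) ** 2 for c, ok in pairs) / len(pairs)


def high_confidence_errors(confidences, correct, threshold=0.90):
    """Indices of wrong answers given at or above the confidence threshold."""
    return [
        i for i, (c, ok) in enumerate(zip(confidences, correct))
        if c >= threshold and not ok
    ]


def must_abstain(cross_judge_kappa, floor=0.40):
    """M6 rule: abstain when cross-judge kappa for the item type is below 0.40."""
    return cross_judge_kappa < floor


# Toy values for illustration only; these are not project results.
conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
hit = [True, True, True, False, True, False]
ece = expected_calibration_error(conf, hit)
brier = brier_score(conf, hit)
```

The per-bin `mean_conf` and `accuracy` pairs computed inside `expected_calibration_error` are the same points a reliability diagram plots, so a fuller version would return them alongside the scalar.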

SECTION 3

Kill-conditions

The roadmap is real only if abandonment conditions are stated up front. VerifiMind-as-tool will be retired and the work continued as methodology research if any of the following observable conditions occurs, with a public retrospective explaining the trigger:

No external human validation by day 180 — fewer than 3 independent non-friend users complete a structured evaluation task and consent to be cited.
No ML co-maintainer by day 120 — no qualified collaborator signs a scoped RFC role or completes a substantive PR / review cycle despite a public role spec.
Label reliability failure — after two label-protocol revisions, core label κ remains below 0.60 on a representative sample.
Calibration failure — Z cannot show materially better-than-baseline calibrated confidence on the labeled set, with high-confidence false positives remaining frequent and unexplained.
Security gating failure — CS requires code execution for its advertised claims but no sandbox + threat model are shipped by day 180.
Benchmark falsification without learning — external benchmark shows weak performance and no credible error taxonomy is produced within 30 days.
Cost-revenue inversion — frontier-model and hosting costs exceed project revenue or committed sponsor funding for 3 consecutive months with no credible unit-cost path.
Trust harm — a buyer, vendor, or public reviewer reasonably interprets VerifiMind as claiming validated ML rigor this roadmap admits it does not yet possess.

A kill-condition firing produces a documented decision, not an automatic pivot. The decision is published, dated, and signed.

SECTION 4

Commitment mechanism

A roadmap that lives only on a webpage is editable. Four mechanisms operate together to convert it from aspirational to binding:

Git tags as commitments. This document is tagged `roadmap-v1.0`. Any modification of a milestone date or definition requires a new tag with a diff and a public reason. `git log --tags` is the audit trail.
Milestone-keyed retrospectives. Each milestone closes with a public retrospective post. The retrospective is the artifact; the milestone is not complete until the retrospective is published.
Pre-named third-party witnesses. For M3, M5, M7, M8, the external party is named in advance, publicly. A milestone with a named external reviewer cannot be unilaterally marked complete.
Pre-registered failure conditions. Every milestone has an explicit “this counts as a miss” definition stated now, not after the fact. Post-hoc rationalization is visible because any later reinterpretation can be compared against the definition registered beforehand.

Git tags make silent edits visible. Retrospectives make silent skips visible. Witnesses make false completions visible. Pre-registered failure conditions make rationalization visible. Together they form the cheapest available substitute for the institutional review structure VerifiMind cannot afford.

What this roadmap deliberately does not promise

It does not promise revenue, users, or adoption metrics.
It does not promise that κ will be high — only that κ will be measured and published.
It does not promise the co-maintainer search succeeds — only that the search happens publicly and its outcome is reported.
It does not promise VerifiMind becomes a credible eval framework by April 2027. It promises that by April 2027, any external reader can answer “is this project methodologically serious?” from public evidence — in either direction.

SECTION 5

Technical RFC appendix

The full technical appendix is published as the canonical markdown source on GitHub. It covers metric definitions (Cohen’s κ, ECE, Brier, F1, ESR with formulas), evaluation dataset spec, inter-rater reliability plan, cross-family judge triangulation, external methodology mapping (RAGAS, ARES, G-Eval, Prometheus-Eval, SWE-Bench, HumanEval), sandbox & security plan, MCP version-pinning, reproducibility checklist (NeurIPS-style), co-maintainer role and compensation terms, and acceptance criteria for the co-maintainer’s first “done.” It answers the questions an ML-literate co-maintainer needs answered before joining.

Section B — Technical RFC appendix. Math, dataset spec, metric definitions, co-maintainer terms, reproducibility checklist. GitHub renders LaTeX inline.

Signed and dated. This roadmap is inside the loop it describes. We publish it anyway, because the only available exit from a closed validation loop is an external signal — and a public clock is one of the few external signals a solo project can manufacture honestly.

— Lee Wei Bin (Alton), VerifiMind · May 2026 · Tagged roadmap-v1.0 · CC BY 4.0

See also: The Validation Paradox — the problem this roadmap is the response to. Both pages are companions and link bidirectionally.

SOURCES & METHODOLOGY

Selected references

The Validation Paradox — companion document
Anthropic, Responsible Scaling Policy v3 — public-roadmap-as-forcing-function model
Apollo Research — We Need A Science of Evals
Apollo Research — The Evals Gap
METR — Common Elements of Frontier AI Safety Policies
NIST AI Risk Management Framework
Guo et al., On Calibration of Modern Neural Networks (ICML 2017) — ECE definition
HaluEval, MLCommons AILuminate, SWE-Bench, HumanEval, PaperBench
Galileo — Why LLM Judges Disagree With Your Experts
Thakur et al., Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (arXiv:2406.12624)
MCP Specification (2025-06-18)
NeurIPS Paper Checklist — reproducibility template

Full reference list with all citations and methodology notes in the canonical markdown source.