The Evaluation Roadmap
A public clock. Ten dated milestones. Pre-registered thresholds. Named external witnesses. The cheapest forgery-resistant external signal a solo project can manufacture.
| Field | Value |
|---|---|
| Version | v1.0, published May 2026 |
| Git tag | roadmap-v1.0 |
| Window | May 2026 → April 2027 (year one) |
| Companion | The Validation Paradox |
| Canonical source | docs/research/evaluation-roadmap/roadmap-v1.0.md |
| Author | Lee Wei Bin (Alton), VerifiMind · FLYWHEEL TEAM |
| License | CC BY 4.0 |
This page is that external signal — a public clock, pre-registered before the data is collected, with failure conditions stated up front. Failure numbers will ship in the same font size as success numbers.
What VerifiMind is not yet
Before the roadmap, the disclaimer block — written so an enterprise buyer who finds this page through search cannot misread the paradox page as proof of certification:
- VerifiMind is not a certified safety auditor.
- VerifiMind is not a validated production evaluation tool.
- VerifiMind does not currently produce calibrated confidence intervals.
- VerifiMind does not currently execute submitted code in an isolated sandbox.
- VerifiMind has not yet established inter-judge agreement statistics against a held-out human-labeled set.
What VerifiMind is, today, is a structured hypothesis about how a three-agent critique pipeline (novelty / scoring / code-security) might improve on single-call frontier-model critique. This roadmap is the plan to make that hypothesis falsifiable.
Pre-registered thresholds
These are stated before any data is collected. Failure to clear them does not kill the project. It triggers a public retrospective that names what changed and why.
| Threshold | Metric | Pre-registered bar |
|---|---|---|
| Inter-rater reliability | Cohen’s κ (Landis–Koch scale) | κ ≥ 0.60 for usable claims; ≥ 0.80 for production-eligible language |
| Scoring calibration | Expected Calibration Error | ECE ≤ 0.05 on the held-out labeled set |
| Cross-family judge validity | κ_cross across Anthropic / OpenAI / Google judges | κ_cross ≥ 0.60 (lower 95% CI) |
| Perturbation invariance | Agreement under semantics-preserving perturbations | I > 0.90 |
| Human-anchor agreement | Judge vs. human gold on anchor set | AH ≥ 0.80 |
| “Better than baseline” | F1 lift vs. single-call frontier baseline | ΔF1 ≥ 0.10, lower 95% CI above 0 |
| CS execution success | Sandbox runs on labeled vulnerability corpus | ESR ≥ 0.95 on runnable tasks |
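As an illustration of how the κ bars in the table might be checked, here is a minimal sketch in plain Python. It computes Cohen's κ for two label sequences, adds a percentile bootstrap 95% interval, and maps a κ value to the two language tiers. The function names, the bootstrap settings, and the decision to pass the lower bound to `claim_tier` are assumptions made for this sketch; they are not taken from the project's code.

```python
import random
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label; treated as 1 here
    return (p_o - p_e) / (1 - p_e)

def kappa_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def claim_tier(kappa_value):
    """Map a kappa value to the pre-registered bars (0.60 usable, 0.80 production)."""
    if kappa_value >= 0.80:
        return "production-eligible"
    if kappa_value >= 0.60:
        return "usable"
    return "below bar"
```

A possible use: compute `lo, hi = kappa_bootstrap_ci(agent_labels, human_labels)` and report `claim_tier(lo)` next to the point estimate, so the published tier reflects the conservative end of the interval. Whether the 0.60 and 0.80 bars apply to the point estimate or to the lower bound is only stated for the cross-family row, so that choice should be fixed explicitly before running.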
The labeled set is sized to detect a 10-point F1 improvement at α = 0.05 with power 0.80: 800 items total, ≥ 400 positives per primary slice; 400 adversarial items (≥ 100 per attack family); 200 retest items for test-retest reliability; 200 human-anchor items.
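For readers who want to sanity-check the order of magnitude, the sketch below uses the standard two-proportion sample-size approximation. This is an assumption of the sketch, not the roadmap's stated derivation: F1 is not a simple proportion, and the baseline value of 0.50 is a hypothetical chosen because it is close to the worst case for variance.

```python
import math
from statistics import NormalDist

def n_per_group(p_baseline, p_treatment, alpha=0.05, power=0.80):
    """Approximate items per group to detect p_treatment vs p_baseline
    with a two-sided test at level alpha and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for power = 0.80
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_baseline
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Hypothetical baseline of 0.50 and a 10-point lift to 0.60.
print(n_per_group(0.50, 0.60))  # 385
```

Under these assumptions the approximation gives 385 items per group, which is in the same range as the ≥ 400 positives per slice stated above. A bootstrap-based power simulation on F1 itself would be a more faithful check.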
The Public Roadmap — 10 milestones
Each milestone has (a) a concrete artifact, (b) a pre-registered numerical threshold for what “passes,” and (c) a named external witness identified before the work starts.
Every milestone closes with a public retrospective. Silent edits to this page are blocked by a CI workflow that fails the build if roadmap-v*.md is modified without a new git tag.
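The CI rule described above could be implemented in several ways. The following is a minimal sketch, assuming the check runs as a Python script inside the CI job, that tags follow the roadmap-v* naming of roadmap-v1.0, and that the comparison base is a branch such as origin/main. The function names and the base ref are hypothetical. It treats any roadmap-v* tag pointing at the current commit as the "new tag"; it does not verify that the tag was created after the previous one.

```python
import fnmatch
import subprocess
import sys

ROADMAP_GLOB = "roadmap-v*.md"  # filename pattern named in the text
TAG_GLOB = "roadmap-v*"         # assumed tag naming, following roadmap-v1.0

def violations(changed_files, head_tags):
    """Return roadmap files that changed while no roadmap tag points at HEAD."""
    touched = [
        path for path in changed_files
        if fnmatch.fnmatch(path.rsplit("/", 1)[-1], ROADMAP_GLOB)
    ]
    tagged = any(fnmatch.fnmatch(tag, TAG_GLOB) for tag in head_tags)
    return [] if tagged else touched

def git_lines(*args):
    """Run a git command and return its non-empty output lines."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return [line for line in result.stdout.splitlines() if line]

def main(base_ref):
    """Exit code for CI: 1 if an untagged roadmap edit is found, else 0."""
    changed = git_lines("diff", "--name-only", f"{base_ref}...HEAD")
    tags = git_lines("tag", "--points-at", "HEAD")
    bad = violations(changed, tags)
    for path in bad:
        print(f"untagged roadmap edit: {path}", file=sys.stderr)
    return 1 if bad else 0

# In a CI step (assumed invocation): sys.exit(main("origin/main"))
```

The decision logic is kept in `violations` so it can be unit-tested without a repository. The CI job would need to fetch tags and enough history for the diff to resolve.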
| # | Date | Milestone | Concrete artifact | External signal required |
|---|---|---|---|---|
| M0 | May 2026 | Roadmap published | This page + git tag roadmap-v1.0 + CI workflow that fails on untagged roadmap edits | Roadmap referenced by ≥ 1 external blog, mailing list, or social post outside the VerifiMind / YSenseAI ecosystem |
| M1 | Jun 2026 | Governance fixes shipped | LLM personas renamed to (model, role-in-this-reflection) tuples; SECURITY.md (90-day disclosure window); MAINTAINERS.md with open second slot; GOVERNANCE.md (amendment process); co-maintainer post live | ≥ 1 inbound CV or “interested” reply to the co-maintainer post; security disclosure email reachable and monitored |
| M2 | Jul 2026 | Seed labeled eval set v0 | 100 items under intra-annotator test-retest protocol, expanded to 200 items by month-end. Stratified X / Z / CS. Published on Hugging Face Datasets with Zenodo DOI. License: MIT. | Dataset DOI minted; ≥ 1 independent download or fork logged; intra-annotator κ from 2-week washout protocol published |
| M3 | Sep 2026 | First Cohen’s κ report | Pre-registered three-way κ: (a) each agent vs. human gold, (b) same agent vs. itself across reruns with different seeds, (c) cross-family κ across Anthropic / OpenAI / Google judges. Bootstrap 95% CIs. Class-conditional confusion matrices. | Numbers published as found — including failure cases. Named external annotator (paid via Prolific) confirms label set independently. |
| M4 | Oct 2026 | Co-maintainer onboarded or publicly conceded | Either second named maintainer in MAINTAINERS.md with merged independent PR, or a published retrospective explaining why recruitment failed and what changes next | Co-maintainer’s first independent PR merged with passing CI, or timestamped + signed concession retrospective |
| M5 | Nov 2026 | CS execution sandbox v0 | gVisor or Firecracker isolation; no network egress by default; explicit allow-list per evaluation; published threat model covering MCP-specific attack classes (prompt injection via repo content, SSRF, plaintext credentials, response leakage); one-command reproduction harness | One named external security researcher invited to break it; their write-up published whether positive or negative |
| M6 | Jan 2027 | Z calibration & abstention | Z emits confidence + ECE + reliability diagram against the M2 labeled set. Mandatory abstention when cross-judge κ on item type < 0.40. Brier score reported. | Reliability diagram + ECE + Brier published with 95% bootstrap CIs and a list of high-confidence wrong answers |
| M7 | Feb 2027 | External benchmark, readiness-gated | One named benchmark per agent, pre-registered in roadmap-v1.x before running. Candidates: HaluEval (X), MLCommons AILuminate (Z), SWE-Bench Lite or PaperBench (CS). Readiness memo published first. | Full prompt logs, seeds, model identifiers, and raw outputs published. Results acknowledged by ≥ 1 external party. |
| M8 | Mar 2027 | NIST AI RMF self-attestation | Mapping of VerifiMind’s practices to the NIST AI RMF Generative AI Profile Govern / Map / Measure / Manage functions, with honest gap statement | One external reviewer (academic, journalist, or practitioner) invited to critique; their critique published |
| M9 | Apr 2027 | Year-1 retrospective | Full retrospective covering every milestone hit, slipped, or abandoned. Decision: continue OSS-free, pivot, or sunset. Labeled set at 800 items. | Retrospective signed and dated; raw progress data and all dataset versions linked |
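The M6 quantities (ECE, Brier score, and the κ-based abstention rule) can be sketched in a few lines of plain Python. The equal-width binning with 10 bins, the half-open bin edges, and the function names are assumptions of this sketch; the roadmap's appendix is the authority on the exact definitions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins: the item-weighted mean of
    |accuracy - average confidence| over non-empty bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = 0.0
    for members in bins:
        if members:
            avg_conf = sum(conf for conf, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            total += len(members) / n * abs(accuracy - avg_conf)
    return total

def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)

def should_abstain(cross_judge_kappa, floor=0.40):
    """Abstention rule from M6: abstain when cross-judge kappa is below 0.40."""
    return cross_judge_kappa < floor

def high_confidence_errors(confidences, correct, threshold=0.90):
    """Indices of wrong answers given with confidence at or above a threshold
    (the 0.90 threshold is a hypothetical choice)."""
    return [i for i, (conf, ok) in enumerate(zip(confidences, correct))
            if conf >= threshold and not ok]
```

For example, four answers all given at confidence 0.9 with three correct yield an ECE of about 0.15, three times the 0.05 bar in the thresholds table.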
Kill-conditions
The roadmap is real only if abandonment conditions are stated up front. VerifiMind-as-tool will be retired and the work continued as methodology research if any of the following observable conditions occurs, with a public retrospective explaining the trigger:
- No external human validation by day 180 — fewer than 3 independent non-friend users complete a structured evaluation task and consent to be cited.
- No ML co-maintainer by day 120 — no qualified collaborator signs a scoped RFC role or completes a substantive PR / review cycle despite a public role spec.
- Label reliability failure — after two label-protocol revisions, core label κ remains below 0.60 on a representative sample.
- Calibration failure — Z cannot show materially better-than-baseline calibrated confidence on the labeled set, with high-confidence false positives remaining frequent and unexplained.
- Security gating failure — CS requires code execution for its advertised claims but no sandbox + threat model are shipped by day 180.
- Benchmark falsification without learning — external benchmark shows weak performance and no credible error taxonomy is produced within 30 days.
- Cost-revenue inversion — frontier-model and hosting costs exceed project revenue or committed sponsor funding for 3 consecutive months with no credible unit-cost path.
- Trust harm — a buyer, vendor, or public reviewer reasonably interprets VerifiMind as claiming validated ML rigor this roadmap admits it does not yet possess.
A kill-condition firing produces a documented decision, not an automatic pivot. The decision is published, dated, and signed.
Commitment mechanism
A roadmap that lives only on a webpage is editable. Four mechanisms operate together to convert it from aspirational to binding:
- Git tags as commitments. This document is tagged roadmap-v1.0. Any modification of a milestone date or definition requires a new tag with a diff and a public reason. git log --tags is the audit trail.
- Milestone-keyed retrospectives. Each milestone closes with a public retrospective post. The retrospective is the artifact; the milestone is not complete until the retrospective is published.
- Pre-named third-party witnesses. For M3, M5, M7, M8, the external party is named in advance, publicly. A milestone with a named external reviewer cannot be unilaterally marked complete.
- Pre-registered failure conditions. Every milestone has an explicit “this counts as a miss” definition stated now, not after the fact. Post-hoc rationalization is visible because the rationalization comes after the pre-registration.
Git tags make silent edits visible. Retrospectives make silent skips visible. Witnesses make false completions visible. Pre-registered failure conditions make rationalization visible. Together they form the cheapest available substitute for the institutional review structure VerifiMind cannot afford.
What this roadmap deliberately does not promise
- It does not promise revenue, users, or adoption metrics.
- It does not promise that κ will be high — only that κ will be measured and published.
- It does not promise the co-maintainer search succeeds — only that the search happens publicly and its outcome is reported.
- It does not promise VerifiMind becomes a credible eval framework by April 2027. It promises that by April 2027, any external reader can answer “is this project methodologically serious?” from public evidence — in either direction.
Technical RFC appendix
The full technical appendix — metric definitions (Cohen’s κ, ECE, Brier, F1, ESR with formulas), evaluation dataset spec, inter-rater reliability plan, cross-family judge triangulation, external methodology mapping (RAGAS, ARES, G-Eval, Prometheus-Eval, SWE-Bench, HumanEval), sandbox & security plan, MCP version-pinning, reproducibility checklist (NeurIPS-style), co-maintainer role and compensation terms, and acceptance criteria for the co-maintainer’s first “done” — is published as the canonical markdown source on GitHub. It is the appendix an ML-literate co-maintainer needs answered before joining.
Read Section B on GitHub →
Lee Wei Bin (Alton), VerifiMind · May 2026 · Tagged roadmap-v1.0 · CC BY 4.0
Selected references
- The Validation Paradox — companion document
- Anthropic, Responsible Scaling Policy v3 — public-roadmap-as-forcing-function model
- Apollo Research — We Need A Science of Evals
- Apollo Research — The Evals Gap
- METR — Common Elements of Frontier AI Safety Policies
- NIST AI Risk Management Framework
- Guo et al., On Calibration of Modern Neural Networks (ICML 2017) — ECE definition
- HaluEval, MLCommons AILuminate, SWE-Bench, HumanEval, PaperBench
- Galileo — Why LLM Judges Disagree With Your Experts
- Thakur et al., Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (arXiv:2406.12624)
- MCP Specification (2025-06-18)
- NeurIPS Paper Checklist — reproducibility template
Full reference list with all citations and methodology notes in the canonical markdown source.