// prototype · public benchmark

EvalOps Workbench

Evaluation as a contract, not a dashboard.

An eval harness that treats a prompt change the way good teams treat a database migration: versioned, regression-tested, blockable. It ships a public benchmark anyone can reproduce in a minute, with no credentials.

// the failure mode

A prompt change improves one case and silently breaks twelve.

Most evaluation happens by vibe. Someone tweaks a prompt, eyeballs a handful of outputs, and ships. The change helps the cases they looked at and quietly degrades the ones they did not. The regression surfaces three weeks later as a support ticket.

The other failure modes rhyme: a harness that only tests happy paths while production hits adversarial inputs, and a black-box pass or fail that never tells the engineer which case actually regressed. Without per-case traceability, every fix is a guess.

// the harness

Score every variant. Pin a baseline. Surface every regression.

01

Dataset and variants

Load a labeled set, then run named strategies over it. Each variant is a versioned candidate, the way a prompt revision or a model swap is.

02

Rubric scoring

Score every answer against per-case rubrics: required phrases present, forbidden phrases absent, threshold met. One shared, inspectable notion of correctness.

03

Pinned baseline and gate

Pin a baseline as a contract. The gate blocks if the aggregate drops below it, and every per-case regression is surfaced with a reason.
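
Steps 02 and 03 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's actual code: the `Rubric` fields, the `score_case` and `gate` names, and the reading of "threshold" as a minimum fraction of required phrases are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    # Hypothetical rubric shape; field names are illustrative, not the repo's schema.
    required: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)
    threshold: float = 1.0  # assumed meaning: minimum fraction of required phrases


def score_case(answer: str, rubric: Rubric) -> tuple[float, bool, str]:
    """Return (score, passed, reason) for one answer."""
    text = answer.lower()
    missing = [p for p in rubric.required if p.lower() not in text]
    banned = [p for p in rubric.forbidden if p.lower() in text]
    total = len(rubric.required)
    score = (total - len(missing)) / total if total else 1.0
    if banned:
        return score, False, f"forbidden phrase present: {banned[0]!r}"
    if score < rubric.threshold:
        return score, False, (
            f"score {score:.2f} below threshold {rubric.threshold:.2f}; "
            f"missing: {missing}"
        )
    return score, True, "ok"


def gate(
    baseline: dict[str, bool],
    candidate: dict[str, tuple[bool, str]],
) -> tuple[bool, dict[str, str]]:
    """Block if the candidate pass rate drops below the pinned baseline's.

    Also returns every case that passed in the baseline and fails now,
    keyed by case id, with the reason it failed.
    """
    base_rate = sum(baseline.values()) / len(baseline)
    cand_rate = sum(passed for passed, _ in candidate.values()) / len(candidate)
    regressions = {
        case_id: reason
        for case_id, (passed, reason) in candidate.items()
        if baseline.get(case_id, False) and not passed
    }
    return cand_rate < base_rate, regressions
```

In this sketch the gate blocks only on the aggregate, while regressions are reported even when the aggregate holds, so a change that fixes one case and breaks another still shows up by case id.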

// the benchmark

Measured, and reproducible.

The public benchmark scores a baseline prompt variant against a grounded one over a labeled support-QA set, using per-case rubrics: required phrases present, forbidden phrases absent, threshold met. It is the workload behind this system's live telemetry, and it reproduces offline with no credentials and no cost.

The harness exists to make one risk visible: a prompt change can lift the aggregate while quietly breaking individual cases. So it pins a baseline, blocks when the aggregate drops below it, and reports the candidate against it case by case, with a reason for every regression. The figures below come straight from /api/benchmark-latest, never seeded.
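
To see why the aggregate alone is not enough, here is a small sketch of a case-by-case comparison. The `compare` function, the case ids, and the scores are invented for illustration; they are not taken from the benchmark fixture or the endpoint.

```python
def compare(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tol: float = 1e-9,
) -> dict:
    """Compare per-case scores for the cases both runs share."""
    shared = sorted(baseline.keys() & candidate.keys())
    base_mean = sum(baseline[c] for c in shared) / len(shared)
    cand_mean = sum(candidate[c] for c in shared) / len(shared)
    return {
        "aggregate_delta": cand_mean - base_mean,
        "improved": [c for c in shared if candidate[c] > baseline[c] + tol],
        "regressed": [c for c in shared if candidate[c] < baseline[c] - tol],
    }


# Made-up scores: two cases improve, one regresses, and the mean still rises.
report = compare(
    baseline={"q1": 0.5, "q2": 0.5, "q3": 1.0},
    candidate={"q1": 1.0, "q2": 1.0, "q3": 0.5},
)
```

Here `report["aggregate_delta"]` is positive while `report["regressed"]` still lists `q3`, which is the situation a pinned baseline with per-case reporting is meant to expose.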

// reproduce

Run it yourself in a minute.

git clone https://github.com/IgnazioDS/evalops-workbench
cd evalops-workbench && pip install -e .
python -m evalops_workbench.benchmark_runner

The run prints the same metrics the endpoint serves and rewrites the committed artifact. Every case, label, and score lives in the public fixture, dependency-free and offline.

// graduation

The path from prototype.

Today the harness is honest about its stage: a real engine, a public benchmark, and live telemetry, scored on a deterministic system-under-test. The next steps are a pluggable live-model target behind a flag, larger labeled fixtures, and a build gate teams can drop into their own CI. The bar it is working toward is NexusRAG: every claim backed by a public repo, a live deploy, and a number you can check in sixty seconds. AI infrastructure, not AI theater.

Want evaluation wired into your build?

Start a conversation