Do you write tests?

Yes. The evaluation harness and boundary tests are part of the deliverable, and your team keeps them. The point is that quality stays measurable after we leave.

Will you touch our infrastructure?

Only what is needed to make the deploy dependable, and everything we change is documented in the runbook.

What if the evaluation shows it needs a rebuild?

We tell you before spending the budget, not after. Running a diagnostic first removes most of this risk, which is why the diagnostic fee is credited toward a sprint.

Is the price fixed?

Yes, once the scope is clear. You get a fixed quote within the range after a diagnostic or a short scoping call.

// sprint

Prototype Hardening Sprint

Take one fragile AI prototype and make it survive real users, real data, and real failures.

For a working prototype that demos well but cannot yet be trusted in front of customers.

$4,000 – $10,000 · 2-4 weeks · Scoped by how many surfaces are in play and how deep the evaluation needs to go.

Brief us · 24h response

We take the prototype that breaks under load and add the layer that makes it hold.

// what you get

The deliverables, in writing.

Retrieval and agent quality, measured
An evaluation harness with a labeled question set, a baseline score, and the specific changes that move the number. Quality becomes a figure you can track, not a feeling.
Observability you can read
Structured logging, request traces, a stats endpoint, and visibility into cost and latency. When something regresses, you see where.
Failure handling at the edges
Graceful degradation, sensible retries, guardrails, and input validation at the system boundary, so a bad input returns a safe answer instead of a confident wrong one.
One deploy you can trust
Health checks, environment hygiene, and a short runbook, so the system comes back up the same way every time.
A handoff your team keeps
A written architecture note and the evaluation harness, both yours to keep running after the sprint ends.
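
To make "a labeled question set and a baseline score" concrete, here is a minimal sketch of the idea, not the harness we ship. Every name in it (`LabeledQuestion`, `score`, the two stand-in systems, the sample questions) is hypothetical, and the substring check stands in for whatever metric fits your system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LabeledQuestion:
    question: str
    expected: str  # text a correct answer must contain (assumed metric)


def score(answer_fn: Callable[[str], str], questions: list[LabeledQuestion]) -> float:
    """Return the fraction of questions whose answer contains the expected text."""
    if not questions:
        raise ValueError("question set is empty")
    hits = sum(
        1 for q in questions
        if q.expected.lower() in answer_fn(q.question).lower()
    )
    return hits / len(questions)


# Hypothetical labeled set, for illustration only.
QUESTIONS = [
    LabeledQuestion("What is the refund window?", "30 days"),
    LabeledQuestion("Which plan includes SSO?", "enterprise"),
    LabeledQuestion("How do I reset my password?", "settings"),
    LabeledQuestion("Is there an on-call rota?", "yes"),
]

# Stand-ins for the system before and after a change.
BASELINE_ANSWERS = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Which plan includes SSO?": "SSO is on the Pro plan.",
}
HARDENED_ANSWERS = {
    **BASELINE_ANSWERS,
    "Which plan includes SSO?": "SSO is included in the Enterprise plan.",
    "How do I reset my password?": "Open Settings and choose Reset password.",
}


def baseline_system(question: str) -> str:
    return BASELINE_ANSWERS.get(question, "I am not sure.")


def hardened_system(question: str) -> str:
    return HARDENED_ANSWERS.get(question, "I am not sure.")


if __name__ == "__main__":
    print(f"baseline: {score(baseline_system, QUESTIONS):.2f}")  # baseline: 0.25
    print(f"hardened: {score(hardened_system, QUESTIONS):.2f}")  # hardened: 0.75
```

The useful part is the shape: the same question set is run before and after every change, so the number moves for a reason you can point to.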

// what it costs

$4,000 – $10,000

Scoped by how many surfaces are in play and how deep the evaluation needs to go.

What the fee covers

Hardening of one system or surface end to end
The evaluation harness, baseline, and the fixes that move it
Observability, failure handling, and a trustworthy deploy
Architecture note and handoff

What it does not

Net-new feature build beyond the surface being hardened
Multi-system or platform work (that is the Agency Backend Build)
Ongoing retainer (available separately after handoff)

// how it runs

2-4 weeks.

Week 1
Instrument and measure. Stand up the evaluation harness, get a baseline, and confirm the real failure modes against your data.
Weeks 2-3
Harden. Move the quality numbers, add observability and failure handling, and make the deploy dependable.
Final week
Handoff. Architecture note, the eval harness in your hands, and a walkthrough so your team can keep it running.
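
As an illustration of the failure handling added during the hardening weeks, the sketch below validates input at the boundary, retries a timed-out call, and degrades to a safe answer. It is a sketch under assumptions, not our implementation: `answer_safely`, `validate`, `SAFE_FALLBACK`, the length limit, and the choice to retry only on `TimeoutError` are all hypothetical.

```python
import time
from typing import Callable

MAX_QUESTION_CHARS = 2000  # assumed limit; set it from your real traffic
SAFE_FALLBACK = "Sorry, I can't answer that right now."


class InvalidInput(ValueError):
    """Raised when a request fails validation at the system boundary."""


def validate(question: object) -> str:
    """Reject non-text, empty, or oversized input before it reaches the model."""
    if not isinstance(question, str):
        raise InvalidInput("question must be text")
    cleaned = question.strip()
    if not cleaned:
        raise InvalidInput("question is empty")
    if len(cleaned) > MAX_QUESTION_CHARS:
        raise InvalidInput("question is too long")
    return cleaned


def answer_safely(
    question: object,
    call_model: Callable[[str], str],
    retries: int = 2,
    backoff_seconds: float = 0.5,
) -> str:
    """Return the model's answer, or a safe fallback if input or upstream fails."""
    try:
        cleaned = validate(question)
    except InvalidInput:
        return SAFE_FALLBACK  # bad input gets a safe answer, not a guess

    for attempt in range(retries + 1):
        try:
            return call_model(cleaned)
        except TimeoutError:
            if attempt < retries:
                # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
                time.sleep(backoff_seconds * (2 ** attempt))
    return SAFE_FALLBACK  # retries exhausted: degrade gracefully
```

Here `call_model` is whatever function reaches your model or retrieval layer. In practice the fallback path would also be logged, so the degradation shows up in your traces.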

// what you bring

Four things, and we can start.

Write access to the repository, or a fork we can work in
Representative or sample data the system runs against
An engineer for roughly two hours a week
A clear definition of what "good enough to ship" means for you

// questions

Before you brief us.

One system or several?

One surface, end to end. If you need several systems hardened, we scope it up or move to an Agency Backend Build.

// past delivery

The work this is built on.

// what's real

A prototype that held up in a demo becomes a system that holds up with customers, with the numbers to prove it.

Ready when you are.

Brief us · 24h response

// we reply within 24 hours
