// engineering · how this is built and verified

AI engineering is contract engineering.

AI infrastructure, not AI theater.

This page is for the reader doing diligence. The manifesto below is not a slogan: each law is a concrete piece of engineering with a shape, a failure mode it defends against, and a public repo you can read. After the laws come the telemetry contract and a check you can run in sixty seconds.

// the method, in diagrams

Four laws.
Each one is a system you can draw.

The diagram on the right shows how each law runs; the link is where you can read the source.

request lifecycle · 6 stages

Build the control plane

A model call is one line of code. Everything that makes it survive production (auth, routing, retries, persistence, observability) is the system you actually have to build. Skip it and the demo ships; the product does not.

I build that layer first: the request lifecycle around the model, instrumented end to end, so the model becomes a swappable component instead of the whole architecture.

FastAPI
LangGraph
pgvector

// audit NexusRAG ↗ · live deploy ↗

// the model is one stage of six

regression gate · pinned baseline

Run layered evaluation

Quality checked by eyeballing outputs is not quality control. A prompt change lifts one case and silently breaks twelve; nobody notices until it surfaces three weeks later as a support ticket.

Evaluation is a contract the build must satisfy. Pin a baseline, score every candidate case by case, and block the merge on a regression. Every run reproduces from a committed fixture, offline.
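A minimal sketch of that gate, assuming scores are floats keyed by case id where higher is better. The names `GateResult` and `regression_gate` are hypothetical and are not taken from the linked repo; the scores are illustrative, not real results.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GateResult:
    """Outcome of comparing one candidate run against the pinned baseline."""
    regressions: dict = field(default_factory=dict)  # case id -> (baseline, candidate)
    missing: tuple = ()  # baseline cases the candidate never scored

    @property
    def passed(self) -> bool:
        return not self.regressions and not self.missing


def regression_gate(baseline, candidate, tolerance=0.0):
    """Compare case by case; an aggregate score is never consulted."""
    regressions = {}
    missing = []
    for case_id, base_score in baseline.items():
        if case_id not in candidate:
            missing.append(case_id)
            continue
        cand_score = candidate[case_id]
        if cand_score < base_score - tolerance:
            regressions[case_id] = (base_score, cand_score)
    return GateResult(regressions=regressions, missing=tuple(missing))


# Illustrative scores only. The candidate's average rises,
# yet case-02 got worse, so the gate still fails.
baseline = {"case-01": 0.5, "case-02": 1.0, "case-03": 0.5}
candidate = {"case-01": 1.0, "case-02": 0.5, "case-03": 1.0}
result = regression_gate(baseline, candidate)
```

A CI job could load both dictionaries from committed JSON fixtures and exit nonzero whenever `result.passed` is false, which is what turns the comparison into a merge block.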

pytest
committed fixtures
CI gate

// audit evalops-workbench ↗ · live deploy ↗

// a regression blocks the merge

GET /api/stats · schema v1

Instrument honestly

A dashboard that always looks busy is theater. Telemetry that throws a 500 when a dependency is cold is worse than the condition it was meant to report.

One JSON envelope per system: status, mode, workload, real metrics, a fresh timestamp. The page polls it directly. It never returns 5xx; a cold or stale system degrades to last-good or honest zeros, plainly marked.
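One way the degrade path could look, as a sketch rather than the shipped endpoint. The function `build_envelope`, the `collect` callback, and the `last_good` argument are assumptions for illustration; the field names and the two status values are borrowed from elsewhere on this page.

```python
import json
from datetime import datetime, timezone

SCHEMA_VERSION = 1
# Hypothetical metric names, borrowed from the sample output on this page.
ZERO_METRICS = {"queries_total": 0, "p95_latency_ms": 0}


def build_envelope(system, mode, workload, collect, last_good=None):
    """Always return a serialisable envelope; never let a failure escape."""
    try:
        metrics = collect()
        status = "operational"
    except Exception:
        # Cold or failing dependency: fall back to the last good metrics,
        # or to zeros, and say so in the status instead of raising.
        metrics = dict(last_good) if last_good is not None else dict(ZERO_METRICS)
        status = "degraded"
    return {
        "system": system,
        "mode": mode,
        "workload": workload,
        "status": status,
        "metrics": metrics,
        "schema_version": SCHEMA_VERSION,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


def cold_collector():
    # Stands in for a dependency that is not reachable yet.
    raise ConnectionError("dependency is cold")


envelope = build_envelope("nexusrag", "live", "production", cold_collector)
body = json.dumps(envelope)
```

Because the function cannot raise on a collection failure, an HTTP handler wrapping it can answer 200 with `body` every time and let the `status` field carry the bad news.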

stdlib endpoint
repo-committed JSON
cron

// audit TELEMETRY_SCHEMA.md ↗

// never 5xx · degrade, do not crash

durable execution · approval gate

Defend the workflow

A demo handles the happy path. Real users deviate, tool calls fail, steps lose state. Without idempotency and an audit trail, a retried step double-charges and nobody can say what happened.

The defense: durable state machines with idempotency keys, bounded retries, a human approval gate on irreversible actions, and a signed audit ledger. This is the substrate ops automation needs before anyone bolts an LLM onto a cron script.
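A compressed sketch of three of those pieces: the idempotency key, the bounded retry, and the approval gate. All names here are hypothetical. The in-memory `completed` dict and `audit` list stand in for durable storage, and the ledger is unsigned; a real version would need both.

```python
class ApprovalRequired(Exception):
    """Raised when an irreversible step has not been approved by a human."""


class StepFailed(Exception):
    """Raised when a step exhausts its retry budget."""


def run_step(key, action, completed, audit, *, max_attempts=3,
             irreversible=False, approved=False):
    # Idempotency: a key that already completed returns its stored result
    # and the action is not executed again.
    if key in completed:
        audit.append((key, "replayed"))
        return completed[key]
    # Approval gate: irreversible actions do not run without sign-off.
    if irreversible and not approved:
        audit.append((key, "blocked: awaiting approval"))
        raise ApprovalRequired(key)
    # Bounded retries: try a fixed number of times, then give up loudly.
    for attempt in range(1, max_attempts + 1):
        try:
            result = action()
        except Exception as exc:
            audit.append((key, f"attempt {attempt} failed: {exc}"))
        else:
            completed[key] = result
            audit.append((key, f"attempt {attempt} succeeded"))
            return result
    raise StepFailed(f"{key}: gave up after {max_attempts} attempts")


completed, audit, charges = {}, [], []


def charge():
    charges.append(25)  # the side effect we must not repeat
    return "charged"


# The same step delivered twice, as a retried workflow would do.
run_step("order-17:charge", charge, completed, audit, irreversible=True, approved=True)
run_step("order-17:charge", charge, completed, audit, irreversible=True, approved=True)
```

The second call finds the key and replays the stored result, so `charges` holds one entry. The sketch assumes an action either fails before its side effect or is itself safe to repeat; closing that gap is what recording the result in the same transaction as the side effect is for.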

LangGraph
PostgreSQL
Celery

// audit agent-runbook-orchestrator ↗ · live deploy ↗

// the loop is durable, the gate is human

// reading the telemetry

Three honest states, never one fake one.

Diagram 03, the telemetry endpoint under "Instrument honestly", is the contract. These are the three states a system can honestly report through it. Which one a system is in comes from the committed code, not from what would look best.

live · production

Real user traffic

A system serving a production workload. Every figure is computed from real requests, here and now.

mode: "live"
workload: "production"
metrics: real traffic

live · benchmark

A reproducible run

A prototype with a deterministic public benchmark. Live numbers, but from a fixture you can rerun, not user traffic. The two are never conflated.

mode: "live"
workload: "benchmark"
metrics: from a fixture

showcase

Honest repo signals

No workload yet, so the endpoint reports repository signals only: commits, stars, primary language. Labeled showcase, never dressed up as traffic.

mode: "showcase"
metrics: commits · stars
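The three states above can be treated as a closed set that a consumer checks before trusting any number. A small sketch, assuming a showcase envelope omits `workload` or sets it to null (the page does not say which); `ALLOWED_STATES` and `classify` are hypothetical names.

```python
# The only (mode, workload) pairs the contract on this page describes.
ALLOWED_STATES = {
    ("live", "production"),
    ("live", "benchmark"),
    ("showcase", None),  # assumption: showcase carries no workload
}


def classify(envelope):
    """Return a display label for the envelope, or reject an unknown state."""
    state = (envelope.get("mode"), envelope.get("workload"))
    if state not in ALLOWED_STATES:
        raise ValueError(f"not an honest state: {state!r}")
    mode, workload = state
    return mode if workload is None else f"{mode} · {workload}"


label = classify({"mode": "live", "workload": "benchmark"})
```

Rejecting anything outside the set means a system cannot, for example, report `showcase` with a `production` workload and have it rendered as traffic.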

// verify it

The sixty-second check.

verify — sixty seconds

$ curl -s https://nexusrag.eleventh.dev/api/stats | jq .
{
  "system": "nexusrag",
  "mode": "live",
  "workload": "production",
  "status": "operational",
  "metrics": { "queries_total": 358, "p95_latency_ms": 0 },
  "schema_version": 1,
  "generated_at": "2026-05-27T..."
}
✓ schema_version 1 · status operational · never 5xx

Every system on the work index has three things: a public repository, a live deploy, and a number you can recompute. The prototypes go further and ship a committed benchmark fixture, so you can clone the repo and regenerate the published run offline, with no credentials and no cost. A claim you cannot check is marketing, not engineering.

// decisions worth defending

Each trade-off, and the road not taken.

// persistence

Benchmarks persist as committed JSON

The prototype benchmarks write their latest run to a committed artifact, refreshed by a scheduled workflow. Anyone can clone the fixture and reproduce the number offline. A database would have traded that away for nothing.

chose repo-committed JSON over an external database

// failure

Telemetry degrades, it does not crash

When data is stale or a dependency is cold, the endpoint emits its last good cache or honest zeros with a degraded status. A widget that breaks loudly is worse than one that tells the truth quietly.

chose soft degrade to last-good over a 500 error

// retrieval

Embeddings are real or labeled

The flagship once shipped a hashed bag-of-words vector as a stand-in for a semantic embedding. It measured lexical overlap, not meaning. It now sits behind an explicit provider flag, with no silent fallback.
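To make the difference concrete, here is a sketch of a hashed bag-of-words vector next to an explicit provider lookup that refuses to fall back. It is an illustration, not the flagship's code: `hashed_bow_embed`, `PROVIDERS`, `get_embedder`, and the provider names are all hypothetical.

```python
import hashlib


def hashed_bow_embed(text, dim=64):
    """Count tokens into hashed buckets. Word order and meaning are ignored."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


# Only providers registered here can be selected. A real semantic
# provider would be added explicitly, never substituted silently.
PROVIDERS = {"hashed-bow": hashed_bow_embed}


def get_embedder(provider):
    """Return the named embedder, or fail loudly if it is not configured."""
    if provider not in PROVIDERS:
        raise ValueError(f"embedding provider {provider!r} is not configured")
    return PROVIDERS[provider]


embed = get_embedder("hashed-bow")
# Same words, opposite meaning, identical vector: lexical overlap only.
same = embed("dog bites man") == embed("man bites dog")
```

Here `same` is true because the two sentences share exactly the same tokens, which is the weakness the paragraph above describes. Asking `get_embedder` for a provider that is not registered raises instead of quietly returning the stand-in.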

chose real embeddings, flagged over a lexical stand-in

// scope

One flagship, not six

Only one system carries the production label. The other five are honest prototypes and showcases. Spreading the claim across all six would have been the easy lie; concentrating it where it is earned is the defensible truth.

chose one honest flagship over six 'production' badges

// dig in

Start anywhere.

Want this standard applied to your system?

Start a conversation