AI Systems That Execute Real Work
Production-grade RAG pipelines, autonomous agents, evaluation systems, and FastAPI backends. Built to run continuously, not to demo once.
Expertise
What we build
Production RAG Systems
Your retrieval works in demos but fails on real user queries. Precision drops, irrelevant results surface, and answer quality degrades under messy, unstructured input.
High-precision retrieval pipelines using pgvector, hybrid search (semantic + keyword), and reranking. Built for the gap between prototype retrieval and production-grade answer quality.
docs = await vector_store.similarity_search(
    query, k=20, filter=metadata_filter
)
ranked = reranker.compress_documents(docs, query)
return ranked[:top_k]
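The snippet above shows the retrieve-then-rerank stage. The hybrid (semantic + keyword) merge it feeds on can be sketched with reciprocal rank fusion; the function and inputs below are illustrative, not a specific library's API:

```python
# Reciprocal-rank-fusion (RRF) sketch: merge a semantic ranking and a
# keyword (BM25-style) ranking into one hybrid ordering. Each list holds
# doc IDs, best hit first; the inputs here are made up for illustration.

def rrf_merge(semantic_hits, keyword_hits, k=60):
    """Score each doc by the sum of reciprocal ranks across both lists."""
    scores = {}
    for hits in (semantic_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

hybrid = rrf_merge(["a", "b", "c"], ["b", "d", "a"])
```

Documents that rank well in both lists ("b" here) float to the top, which is the property that makes hybrid retrieval robust to queries where pure vector similarity misses exact-term matches.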
LLM Agent Infrastructure
Your agent works on the demo path but breaks when users deviate. State gets lost between steps, tool calls fail silently, and there is no way to audit what happened.
Multi-step agent workflows on LangGraph with persistent state, tool orchestration, approval gates, retry logic, and deterministic evaluation. Designed for reliability under real interaction patterns.
agent = create_react_agent(
    model=ChatAnthropic("claude-sonnet-4-20250514"),
    tools=[search, execute, evaluate],
    memory=PostgresMemory(conn)
)
result = await agent.ainvoke(task)
FastAPI Backends for AI Products
Your AI feature needs a real backend, not a notebook wrapped in an endpoint. You need async APIs, structured data access, migration discipline, and deployment automation from day one.
Production-grade API layers purpose-built for LLM-powered applications. Structured around pgvector, Alembic migrations, health checks, and operational readiness.
@app.post("/v1/inference")
async def inference(req: InferenceRequest):
    async with get_session() as db:
        result = await model.predict(
            req.input, timeout=req.sla_ms
        )
        await db.log(result, latency=timer())
        return result
AI Reliability Engineering
You are scaling prompts before you have evaluation discipline. Quality is checked by eyeballing outputs. Cost and latency are unmeasured. When something breaks in production, there is no way to tell what changed.
Deterministic evaluation frameworks, structured output validation, cost and latency observability, regression detection. The layer most teams skip between prototype and production.
harness = EvalHarness(
    model=production_model,
    fixtures=load_fixtures("qa_golden_set"),
    metrics=[accuracy, latency_p95, cost_per_query]
)
report = harness.run_regression()
assert report.pass_rate > 0.95
Clients
Who we build for
Startups moving past prototype
You have a working AI prototype and engineering capacity, but not a specialized AI infrastructure engineer. The system needs to handle real users and real failure modes before your next raise.
Agencies delivering AI products
You sell AI-powered solutions but need specialized backend delivery capacity. Reliable, documented, and handoff-ready.
Teams evaluating capability
The work is public. Every architectural decision and tradeoff is documented in open-source repositories. Inspect the code before the conversation.
Architecture
How the systems work
Most modern AI systems reduce to two core execution patterns: retrieval pipelines and autonomous agent loops. We design, combine, and harden these patterns to operate reliably under real-world conditions.
Production RAG Pipeline
Agent Execution Loop
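Stripped of any framework, the agent execution loop reduces to a few lines: a model proposes an action, the loop executes the matching tool, records an auditable trace, and stops on a finish signal or a step budget. A minimal sketch, with `propose` standing in for a model call and all names illustrative:

```python
# Minimal agent-loop sketch (no real LLM). `propose(state)` returns an
# (action, argument) pair; the loop runs tools, appends every step to an
# auditable history, and enforces a hard step budget.

def run_agent(propose, tools, task, max_steps=10):
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        action, arg = propose(state)
        if action == "finish":
            return arg, state["history"]
        result = tools[action](arg)                       # tool call
        state["history"].append((action, arg, result))    # audit trace
    raise RuntimeError("step budget exhausted")

# Illustrative tool set and a scripted "model" that doubles once, then finishes.
tools = {"double": lambda x: x * 2}

def scripted(state):
    if not state["history"]:
        return ("double", 21)
    return ("finish", state["history"][-1][2])

answer, trace = run_agent(scripted, tools, "compute 21 * 2")
```

The trace is the point: every tool call, argument, and result is recorded, which is what makes post-hoc auditing and deterministic replay possible.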
Process
How we build
Every engagement follows a structured execution pipeline, from system design to production monitoring, with reliability, evaluation, and observability built in from day one.
Portfolio
Systems in Production
A selection of systems designed, built, and deployed for real-world execution. Not prototypes, not demos.
Each system demonstrates a specific capability (retrieval, orchestration, evaluation, or data infrastructure), engineered for reliability under production conditions.
Select a node to explore each system. Every project is open-source with full documentation.
Why production AI is different
Demo-grade AI is easy to build. Production-grade AI is difficult to maintain.
The difference is not the model. It's the system around it.
Deterministic Evaluation
Replacing subjective "looks good" testing with measurable scoring and regression checks.
Cost Control
Preventing uncontrolled token usage through architecture, not afterthought optimization.
Latency Engineering
Designing systems that respond in real time, not seconds too late.
Operational Reliability
Building systems that continue working under load, failure, and scale.
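Cost control and latency engineering share one mechanism: per-request budgets enforced in the call path, with actuals recorded for observability. A minimal sketch under assumed names (nothing here is a specific library's API, and the token estimate is deliberately crude):

```python
# Per-request budget guard: cap estimated token spend before the call,
# measure wall-clock latency after it, and log actuals to a ledger.
import time

class BudgetExceeded(Exception):
    pass

def guarded_call(fn, prompt, max_tokens=1000, max_seconds=2.0, ledger=None):
    est_tokens = len(prompt.split()) * 2          # crude pre-call estimate
    if est_tokens > max_tokens:
        raise BudgetExceeded(f"estimated {est_tokens} tokens > {max_tokens}")
    start = time.monotonic()
    result = fn(prompt)                           # the model call
    elapsed = time.monotonic() - start
    if ledger is not None:
        ledger.append({"tokens_est": est_tokens, "latency_s": elapsed})
    if elapsed > max_seconds:
        raise BudgetExceeded(f"latency {elapsed:.2f}s > {max_seconds}s")
    return result

ledger = []
out = guarded_call(lambda p: p.upper(), "hello world", ledger=ledger)
```

Putting the guard in the call path, rather than reviewing invoices later, is what "architecture, not afterthought optimization" means in practice.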
If it cannot be measured, monitored, and trusted, it is not production-ready.
Contact
Let's build something production-grade.
Tell us about your system, your data, and what you need.
We'll respond within 24 hours with a specific technical plan.