// the pipeline · read left → right
01 INGEST (source) → 02 EMBED (vector) → 03 RETRIEVE (pgvector) → 04 RERANK (cohere) → 05 GENERATE (claude) → 06 EVAL (deterministic) → 07 SHIP (prod)
// RAG · agents · backends · eval

Your AI demo works. The system behind it doesn’t.

Eleventh turns fragile retrieval, agent, and AI-backend prototypes into systems that survive real users, real data, and real failures.

git · main · live · ↑ 6 ↓ 0 · last 4h
$ git log --oneline --since=2d
8e1b2af refactor(eval): split harness into stages · 6h ago
4d2cdf9 feat(agent): add approval gate retry · 1d ago
2a83b1c chore(deps): bump cohere → 4.2.1 · 2d ago
Deployed: 0 across 6 systems
Instrumented · live: 0 / 10 · ↑ NexusRAG hot
p95 retrieval: 0 ms · −18% wk
// refreshed 7s ago · // design preview · stub data
// the pipeline above is the same shape every system below follows
//capabilities

Four pipelines.
All shipped, all in public repos.

Each row is one engineering capability. The diagram beside each row shows how the system runs in production. Every row links to its source repo and, where applicable, the live deploy. Read the source before the first call.

hybrid retrieval + rerank

Retrieval Contracts for RAG Systems

Your retrieval works in demos but fails on real user queries. Precision drops, irrelevant results surface, and answer quality degrades under messy, unstructured input.

Hybrid retrieval (pgvector + BM25) with reranking, instrumented end-to-end. Built for the gap between a demo that retrieves and a system that survives the next 10,000 queries.

  • LangGraph
  • pgvector
  • FastAPI
// packet traces the live hybrid path · live
QUERY → EMBED → DENSE (pgvector) + BM25 (keyword) → RERANK (cohere) → TOP-K
P@5 0.94 · baseline 0.62 ↑
queries · 24h 14,217 ↑ 12%
hit · cache 0.83 ↑ 0.04
p99 retrieval 218ms
relevance dist · top-1k · mean 0.81
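
One common way to merge a dense ranking and a BM25 ranking before the reranker is reciprocal rank fusion. The sketch below illustrates that idea under assumptions; it is not taken from the repo. The function name `rrf_fuse`, the constant `k=60`, and the document IDs are all hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]


# Hypothetical candidate lists from the two retrievers.
dense_hits = ["d3", "d1", "d7", "d2"]  # e.g. pgvector nearest neighbours
bm25_hits = ["d1", "d9", "d3", "d4"]   # e.g. keyword matches

candidates = rrf_fuse([dense_hits, bm25_hits], top_k=3)
```

Documents that rank well in both lists (`d1`, `d3`) rise to the top, and the fused shortlist is what a reranker would then score.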

state graph + approval gate

LLM Agent Infrastructure

Your agent works on the demo path but breaks when users deviate. State gets lost between steps, tool calls fail silently, and there is no way to audit what happened.

Multi-step agent workflows on LangGraph with persistent state, tool orchestration, approval gates, retry logic, and deterministic evaluation. Designed for reliability under real interaction patterns.

  • LangGraph
  • Claude API
  • Celery
// the loop is durable, the gate is human · live
PLAN (decompose) → ACT (tool call) → OBSERVE (tool response) → EVAL (score + log) → APPROVE (// human gate) · retry · replan
iteration 07 / 12 max
retries · 5m 2 · 2 approvals · running
context · tokens 3.2k / 8k · 40% · headroom 4.8k
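
The loop above can be sketched in plain Python. This is a stand-in for illustration, not LangGraph code and not the repo's implementation: `RunState`, `run_agent`, and the three callables are hypothetical names, and the defaults (12 iterations, 2 retries) are assumptions that only echo the shape of the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """State carried across iterations so no step starts from scratch."""
    goal: str
    iteration: int = 0
    retries: int = 0
    log: list = field(default_factory=list)
    done: bool = False


def run_agent(state, act, evaluate, approve, max_iterations=12, max_retries=2):
    """Act/observe/eval loop with a human approval gate.

    act, evaluate and approve are caller-supplied callables (hypothetical):
      act(state) -> observation, may raise on tool failure
      evaluate(observation) -> float score
      approve(observation) -> bool, the human gate
    """
    while not state.done and state.iteration < max_iterations:
        state.iteration += 1
        try:
            observation = act(state)
        except Exception as exc:  # tool failures are logged, never swallowed
            state.retries += 1
            state.log.append(("tool_error", state.iteration, repr(exc)))
            if state.retries > max_retries:
                raise
            continue
        score = evaluate(observation)
        state.log.append(("eval", state.iteration, score))
        if approve(observation):
            state.log.append(("approved", state.iteration, observation))
            state.done = True
    return state


# Usage: a tool that times out once, then succeeds.
calls = {"n": 0}


def flaky_tool(state):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("tool timed out")
    return f"result for {state.goal}"


final = run_agent(
    RunState(goal="summarise ticket"),
    flaky_tool,
    evaluate=lambda obs: 1.0,
    approve=lambda obs: True,
)
```

Every failure, score, and approval lands in `final.log`, which is what makes the run auditable after the fact.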

request latency waterfall

FastAPI Backends for AI Products

Your AI feature needs a real backend, not a notebook wrapped in an endpoint. You need async APIs, structured data access, migration discipline, and deployment automation from day one.

Asynchronous FastAPI backends purpose-built for LLM-powered apps. pgvector, Alembic migrations, health checks, and operational readiness on day one.

  • FastAPI
  • PostgreSQL
  • Docker
// audit · langgraph-fastapi-starter · // production template, no demo deploy by design
// p95 · 280ms · measured prod, last 24h · reference
HTTP 12MS → VALIDATE 8MS → AUTH 18MS → QUERY 42MS → GENERATE 185MS → RESPOND 15MS
P95 TOTAL 280MS · p50 88MS · p99 412MS · measured prod · 24h
rps · now 1,247 ↑ 8%
concurrent 24 / 100 max
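
A per-request waterfall like the one above can be collected with a small stage timer. The sketch below is an assumption about how such timings might be gathered; the `Waterfall` class and its methods are hypothetical and use only the standard library.

```python
import time
from contextlib import contextmanager


class Waterfall:
    """Collect per-stage wall-clock latencies for one request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so tests can fake time
        self.stages_ms = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self):
        return sum(self.stages_ms.values())


# Stage timings quoted in the waterfall above; they add up to the quoted total.
reference_ms = {"http": 12, "validate": 8, "auth": 18,
                "query": 42, "generate": 185, "respond": 15}
reference_total = sum(reference_ms.values())
```

In a handler, each step would be wrapped as `with waterfall.stage("query"): ...`, and `stages_ms` exported to whatever metrics backend is in use.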

regression detection · 12 runs

AI Reliability Engineering

You are scaling prompts before you have evaluation discipline. Quality is checked by eyeballing outputs. Cost and latency are unmeasured. When something breaks in production, there is no way to tell what changed.

Deterministic evaluation frameworks, structured output validation, cost and latency observability, regression detection. The layer most teams skip between prototype and production.

  • Python
  • pytest
  • Prometheus
// deploy @ run 8, regression caught at run 9 · live
[chart: eval score (axis 0.6 to 1.0) over run 1 to run 12, ± 1σ band]
DEPLOY · 9F2A3C1 · REGRESSION DETECTED · run 9 · −0.19 delta
sample · evals 2,432 per run
mean score · 7d 0.91 ↓ 0.04 wk
cost · per eval $0.038 ↓ 12% · cohere + claude haiku
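
Run-over-run regression detection can be as simple as flagging any drop in mean score larger than a threshold. The sketch below is illustrative only: `detect_regressions` and the `max_drop` threshold are hypothetical, and the score series is made up, with only the −0.19 drop at run 9 mirroring the chart.

```python
def detect_regressions(run_scores, max_drop=0.05):
    """Return (run_number, delta) for every run whose mean score fell by more
    than max_drop relative to the previous run. Runs are numbered from 1."""
    flagged = []
    for index in range(1, len(run_scores)):
        delta = run_scores[index] - run_scores[index - 1]
        if delta < -max_drop:
            flagged.append((index + 1, round(delta, 2)))
    return flagged


# Made-up mean scores for 12 runs; only the drop at run 9 mirrors the chart.
scores = [0.90, 0.91, 0.90, 0.92, 0.91, 0.92, 0.91, 0.92, 0.73, 0.74, 0.90, 0.91]
flagged = detect_regressions(scores)
```

Small run-to-run noise stays below the threshold, so only the drop that follows the deploy is reported.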
//portfolio · 1 flagship + 5 systems

Six systems.
One flagship, five supporting.

One system is production-grade and carries the weight. The other five are public prototypes and showcases, labeled for exactly what they are. Nothing here pretends to be further along than it is.

flagship · Production

Multi-cloud RAG

NexusRAG

Multi-tenant RAG platform

Multi-tenant, multi-cloud RAG platform with SSO, SCIM, RBAC/ABAC, envelope encryption, multi-region failover, and SOC 2 automation. The shipped reference every other repo points to.

retrieve · rank · generate · stream
  • LangGraph
  • FastAPI
  • pgvector
  • Bedrock + Vertex

// multi-cloud RAG platform

TENANT (multi-tenant) → GATEWAY (SSO · RBAC/ABAC) → RETRIEVE (pgvector) → RANK (rerank) → GENERATE (grounded LLM) → BEDROCK / VERTEX (multi-region failover)

// supporting fleet · public repos, honest stages

See all six systems, by stage

//clients

Who we build for,
and what the shape looks like.

Three engagement profiles. Each starts with a measurable problem, ends with deployed code in your repo, and gets handed off with the eval harness, observability, and the docs we used to ship it.

FOR ·01

Startups moving past prototype

You demoed the AI feature. Now the team is asking for retrieval that doesn’t fall over, an eval harness with regressions caught, and an API your iOS engineer can actually call.

FOR ·02

Agencies delivering AI products

Your team builds the front end and the brand. We ship the AI backend underneath: RAG, agents, eval. Then we disappear before launch with the source in your client’s repo.

FOR ·03

Teams evaluating capability

You’re considering an AI investment and want a working reference architecture, an eval against your real data, and a number on what production reliability actually costs.

//method

Five phases,
one shipping commit.

Every engagement runs through the same sequence. Discovery sets the spec and the evaluation bars; the next four phases build and ship against them. Adjustable by week, not by order.

W1 · Day 1 | W2 · Day 6 | W3 · Day 11 | W4 · Day 16 | Ongoing · Day 21+
// 4-week standard · adjustable by week
Phase 01 / 05 · the spec

Discovery

Days 1–5 · 5 work-days

We sit with the brief, write the system spec, define the evaluation bars, and surface every risk before any code ships. The spec we write here is the contract every phase below is measured against.

// deliverables · 4 files · spec-only

spec/v1.md · System spec + architecture sketch
eval/bars.yaml · Evaluation criteria + scoring bars
risk/register.md · Risk register + open questions
ops/deploy.tf · Repo + deploy targets confirmed
//telemetry

What the systems
are actually doing right now.

Every system below ships its own public stats endpoint. These numbers are fetched live from those deployed endpoints, polled every 30 seconds. NexusRAG reports production workload; systems running a public benchmark report it live; the rest report repo-derived velocity. When a system is quiet or its endpoint is cold, the tile says so. Nothing here is simulated.

NexusRAG · live telemetry

hybrid retrieval · rerank · generate · evaluate

Tiles: Queries · total | p95 latency | Uptime · 30d | Indexed chunks
source · nexusrag.eleventh.dev/api/stats
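
The polling behaviour described above (fetch every 30 seconds, say so when an endpoint is cold) might look roughly like this. It is a sketch under assumptions: the `https` scheme, the JSON payload, and the names `fetch_stats`, `tile_state`, and `poll_forever` are hypothetical, and the real payload shape is not documented here.

```python
import json
import time
import urllib.request

# Host and path come from the page; the https scheme is an assumption.
STATS_URL = "https://nexusrag.eleventh.dev/api/stats"
POLL_SECONDS = 30


def fetch_stats(url=STATS_URL, timeout=5):
    """Fetch the stats endpoint and decode it as JSON (payload shape unknown)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def tile_state(fetch=fetch_stats):
    """Map one poll to a tile: live data, or an explicit 'cold' marker.

    Nothing is simulated on failure: the tile reports the endpoint as cold.
    """
    try:
        return {"status": "live", "stats": fetch()}
    except (OSError, ValueError) as exc:
        # Network errors and timeouts are OSError; bad JSON is ValueError.
        return {"status": "cold", "error": type(exc).__name__}


def poll_forever(render, fetch=fetch_stats, interval=POLL_SECONDS):
    """Render a fresh tile every `interval` seconds until interrupted."""
    while True:
        render(tile_state(fetch))
        time.sleep(interval)
```

Passing `fetch` in as a parameter keeps the cold-endpoint path testable without touching the network.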
//engineering charter

The model isn’t the system.
The system is the work.

The difference is not the model. It’s the system around it. Below are the four laws we won’t ship without. Each is paired with the anti-pattern that gives the principle its enforcement edge.

Law 01 · 01 / 04

Deterministic evaluation

Replacing subjective “looks good” testing with measurable scoring and regression checks. Every prompt change ships with a delta against a frozen baseline.

// won’t accept · “it worked when I tried it”

Law 02 · 02 / 04

Cost control

Preventing uncontrolled token usage through architecture, not afterthought optimization. Cost ceilings, caching, and routing live at the request layer.

// won’t accept · “we’ll optimize when it gets expensive”
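
A cost ceiling and a cache at the request layer can be sketched as a thin guard in front of the model call. This is a minimal illustration, not the shipped design: `CostGuard`, `BudgetExceeded`, and the word-count token estimate are hypothetical stand-ins, and routing is left out.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push usage past the ceiling."""


class CostGuard:
    """Request-layer guard: answer from cache when possible, otherwise
    check the token budget before the model is called."""

    def __init__(self, token_ceiling):
        self.token_ceiling = token_ceiling
        self.tokens_used = 0
        self._cache = {}

    def complete(self, prompt, call_model, estimate_tokens):
        if prompt in self._cache:  # a cache hit spends no tokens
            return self._cache[prompt]
        needed = estimate_tokens(prompt)
        if self.tokens_used + needed > self.token_ceiling:
            raise BudgetExceeded(
                f"{needed} tokens requested, "
                f"{self.token_ceiling - self.tokens_used} left"
            )
        answer = call_model(prompt)
        self.tokens_used += needed
        self._cache[prompt] = answer
        return answer


# Usage with stand-ins for the tokenizer and the model call.
guard = CostGuard(token_ceiling=10)
count_words = lambda text: len(text.split())  # crude token estimate
echo_model = lambda text: text.upper()        # placeholder for a real model

first = guard.complete("summarise this ticket", echo_model, count_words)
again = guard.complete("summarise this ticket", echo_model, count_words)
```

The second call is served from the cache, so `guard.tokens_used` stays at the cost of the first call.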

Law 03 · 03 / 04

Latency engineering

Designing systems that respond in real time, not seconds too late. p95 is a feature; warm paths and streaming are architectural decisions, not after-the-fact optimizations.

// won’t accept · “the model takes 8s, deal with it later”
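
Treating p95 as a feature means computing it and checking it against a budget. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; the function names, the sample latencies, and the 280 ms budget used in the example are assumptions for illustration.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_budget(samples_ms, budget_ms, pct=95):
    """True when the chosen percentile meets the latency budget."""
    return percentile(samples_ms, pct) <= budget_ms


# Made-up latencies: 19 fast requests and one slow outlier.
latencies = [100] * 19 + [900]
p95 = percentile(latencies, 95)
p100 = percentile(latencies, 100)
ok = within_budget(latencies, budget_ms=280)
```

Note that a single outlier in twenty requests does not move p95 here, which is why p99 and the maximum are worth tracking alongside it.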

Law 04 · 04 / 04

Operational reliability

Building systems that continue working under load, failure, and scale. Health checks, retries, idempotency, and circuit breakers ship on day one.

// won’t accept · “works in dev, ship it”
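
Of the mechanisms listed above, the circuit breaker is the least obvious, so here is a minimal sketch. It is an illustration under assumptions rather than the shipped code: `CircuitBreaker`, `CircuitOpen`, and the threshold and cool-down defaults are hypothetical.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is short-circuited instead of hitting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock             # injectable for tests
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpen("dependency marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpen` instead of queueing behind a dependency that is already struggling.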

//position

AI engineering is contract engineering.

  1. Build the control plane.
  2. Defend the workflow.
  3. Run layered evaluation.

AI infrastructure, not AI theater.

// from the engineering charter

//contact

Tell us the system.
We’ll tell you the bar.

Tell us about the system you need shipped. We respond in 24 hours with a specific technical plan, a measurable bar, and an honest estimate of weeks and cost.

First call inside one business day.

Brief us in two paragraphs. If we’re not the right fit, we’ll tell you fast and route you to a team that is.

// 24h response · specific technical plan · honest estimate