An AI infrastructure startup needed a retrieval system that didn't fall apart in production. Six months later: p95 retrieval of 200ms across a 10M-document corpus, a real eval harness, and a cost ceiling that doesn't break the business model.
The client had a working RAG demo. It worked because it was running over a small corpus, the queries were curated, and nobody was paying attention to latency. They needed to go from "works in a notebook" to "answers customer queries at 200ms p95 over 10 million documents," with a measurable accuracy bar and a cost model that didn't blow up the unit economics.
Three things were broken in the prototype: no eval harness, single-vector retrieval that wouldn't scale past the demo corpus, and no visibility into or control over per-query cost.
Before we touched retrieval, we built the eval harness. 800 curated query/answer pairs sourced from real customer logs (with PII redacted), categorized by intent, scored on retrieval precision@k and answer faithfulness. Every PR ran against the harness; merges blocked on regression beyond a small tolerance. This was non-negotiable from day one.
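The merge gate can be sketched in a few lines. This is a minimal illustration, not the client's harness: it assumes a gold set of (query, relevant-doc-IDs) pairs and a `retrieve(query, k)` function, and blocks the merge if mean precision@k regresses beyond the tolerance. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str
    relevant_ids: set[str]  # gold document IDs for this query


def precision_at_k(retrieved_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are in the gold set."""
    top = retrieved_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


def gate(cases, retrieve, baseline: float, tolerance: float = 0.01, k: int = 5) -> bool:
    """Return False (block the merge) if mean precision@k regresses past tolerance."""
    scores = [precision_at_k(retrieve(c.query, k), c.relevant_ids, k) for c in cases]
    mean = sum(scores) / len(scores)
    return mean >= baseline - tolerance
```

In practice the same loop would also score answer faithfulness per intent category; the structure of the gate stays the same.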
Single-vector dense retrieval doesn't survive 10M documents without help, so we layered a multi-stage hybrid pipeline on top.
Stages 1 and 2 run in parallel; stages 3 and 4 are sequential. The whole pipeline budget is 200ms p95.
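The control flow looks roughly like this. The write-up doesn't name the four stages, so this sketch assumes a common hybrid layout: dense and lexical retrieval in parallel (stages 1 and 2), then candidate fusion and reranking in sequence (stages 3 and 4). Stage internals are stubbed as function parameters; the point is the parallelism and the shared budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor

BUDGET_S = 0.200  # whole-pipeline p95 budget


def retrieve(query, dense, lexical, fuse, rerank):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_dense = pool.submit(dense, query)    # stage 1
        f_lexical = pool.submit(lexical, query)  # stage 2, concurrent with stage 1
        # stage 3: fuse the two candidate lists once both stages return
        candidates = fuse(f_dense.result(), f_lexical.result())
    remaining = BUDGET_S - (time.monotonic() - start)
    if remaining <= 0:
        return candidates  # degrade: skip the rerank rather than blow the budget
    return rerank(query, candidates)           # stage 4
```

Skipping the final stage under budget pressure is one plausible way to keep the 200ms p95 hard; the actual degradation policy is an assumption here.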
We treat OpenAI and Anthropic as interchangeable behind a router that picks based on (a) request type — long-context vs short, structured vs free-form — and (b) realtime health and rate-limit headroom. If one provider degrades, the router shifts traffic before users feel it. Each route has a recorded cost-per-call so the finance dashboard ties to reality, not estimates.
Every retrieval and every generation emits a structured trace. Per-tenant cost is measured live and gated by configurable ceilings — a tenant that hits the ceiling gets a controlled degradation, not a surprise five-figure bill. The team gets paged on quality regressions (drop in retrieval precision@k), not just latency.
"A RAG system without an eval harness is a chatbot that demos well. The harness is what makes the difference between a demo and a product."
200ms p95 retrieval over 10M docs in production. The eval harness catches retrieval regressions before they ship — three would-have-been-incidents prevented during the build. Cost per query landed within target after the router and cost ceilings went live. The team now ships RAG changes with confidence because the harness gives them a number.
Equally valuable: the engineering team got an opinion they didn't have before about how to evaluate RAG. We left them with a runbook on extending the eval set, prompts for generating synthetic eval pairs, and a process for triaging regressions.
Because the team had shipped RAG in production before — through the unglamorous middle phase where retrieval seems to work but the eval numbers haven't moved. We didn't sell the client on agentic frameworks or chain-of-thought theatre. We sold them on measurement.
Production RAG, evals, cost-capped LLM systems. NDA available on request.