CASE STUDY · AI INFRASTRUCTURE · 6-MONTH BUILD

Production RAG over 10 million documents.

An AI infrastructure startup needed a retrieval system that didn't fall apart in production. Six months later: 200ms p95 retrieval across a 10M-document corpus, a real eval harness, and per-query costs capped so the unit economics hold.

Industry: AI Infrastructure
Corpus: ~10M documents
Duration: 6 months
Outcome: 200ms p95 retrieval · production

Challenge

The client had a working RAG demo. It worked because it was running over a small corpus, the queries were curated, and nobody was paying attention to latency. They needed to go from "works in a notebook" to "answers customer queries at 200ms p95 over 10 million documents," with a measurable accuracy bar and a cost model that didn't blow up the unit economics.

Three things were broken in the prototype:

  • No eval harness. "It seems good" was the only quality signal.
  • Single-vendor coupling. The system would fall over if OpenAI had a degraded hour.
  • No cost ceiling. A bad query could trigger an unbounded fan-out.

Approach

Build the eval harness first

Before we touched retrieval, we built the eval harness. 800 curated query/answer pairs sourced from real customer logs (with PII redacted), categorized by intent, scored on retrieval precision@k and answer faithfulness. Every PR ran against the harness; merges blocked on regression beyond a small tolerance. This was non-negotiable from day one.
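
A minimal sketch of what that merge gate can look like, assuming a JSON eval set, a pluggable retrieve() callable, and a 0.02 tolerance (the file paths, names, and thresholds are illustrative, not the client's actual code):

    # Illustrative CI gate: score the graded pairs, compare against the recorded
    # baseline, and exit non-zero (blocking the merge) on regression beyond tolerance.
    # Answer-faithfulness scoring works the same way and is omitted for brevity.
    import json
    import sys
    from statistics import mean

    TOLERANCE = 0.02  # allowed absolute drop vs. the last recorded baseline

    def precision_at_k(retrieved_ids, relevant_ids, k=10):
        hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
        return hits / k

    def run_gate(retrieve):
        pairs = json.load(open("eval/pairs.json"))        # the graded query/answer pairs
        baseline = json.load(open("eval/baseline.json"))  # metrics from the last good run
        current = mean(
            precision_at_k(retrieve(p["query"], k=10), set(p["relevant_doc_ids"]))
            for p in pairs
        )
        if current < baseline["precision_at_10"] - TOLERANCE:
            print(f"REGRESSION: precision@10 {current:.3f} < baseline "
                  f"{baseline['precision_at_10']:.3f}")
            sys.exit(1)  # non-zero exit fails CI and blocks the merge
        print(f"eval gate passed: precision@10 = {current:.3f}")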

Multi-stage retrieval

Single-vector dense retrieval doesn't survive 10M docs without help. We layered:

  • Stage 1: BM25 over a sharded inverted index for lexical recall (catches rare terms dense retrieval misses).
  • Stage 2: Dense vector search via Pinecone, using OpenAI text-embedding-3-large.
  • Stage 3: Reciprocal rank fusion to merge.
  • Stage 4: Cross-encoder reranking on the top 50 candidates.

Stages 1 and 2 run in parallel; stages 3 and 4 are sequential. The whole pipeline budget is 200ms p95.
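
For reference, stage 3 is only a few lines; below is the textbook reciprocal rank fusion formulation (the k=60 constant is the conventional default, not necessarily the value tuned here):

    # Reciprocal rank fusion: merge the BM25 and dense result lists by summing
    # 1 / (k + rank) for each document across the rankings it appears in.
    from collections import defaultdict

    def reciprocal_rank_fusion(ranked_lists, k=60):
        """ranked_lists: lists of doc IDs, best first. Returns fused doc IDs, best first."""
        scores = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # fused = reciprocal_rank_fusion([bm25_ids, dense_ids])[:50]  -> stage-4 reranker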

Multi-LLM router

We treat OpenAI and Anthropic as interchangeable behind a router that picks based on (a) request type — long-context vs short, structured vs free-form — and (b) real-time health and rate-limit headroom. If one provider degrades, the router shifts traffic before users feel it. Each route has a recorded cost-per-call, so the finance dashboard ties to reality, not estimates.
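
A stripped-down sketch of that routing decision; the provider names, health fields, and thresholds are illustrative assumptions, not the production policy:

    # Filter to providers that can serve the request and are currently healthy,
    # then pick the cheapest remaining route. Cost-per-call is recorded per route.
    from dataclasses import dataclass

    @dataclass
    class Route:
        provider: str               # e.g. "openai" or "anthropic"
        model: str
        cost_per_1k_tokens: float
        max_context: int

    @dataclass
    class Health:
        error_rate: float           # rolling-window error rate from the trace stream
        rate_limit_headroom: float  # fraction of quota remaining, 0.0-1.0

    def pick_route(request_tokens: int, routes: list[Route],
                   health: dict[str, Health]) -> Route:
        candidates = [
            r for r in routes
            if r.max_context >= request_tokens
            and health[r.provider].error_rate < 0.05
            and health[r.provider].rate_limit_headroom > 0.10
        ]
        if not candidates:
            raise RuntimeError("no healthy route: trigger controlled degradation")
        return min(candidates, key=lambda r: r.cost_per_1k_tokens)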

Observability and cost ceilings

Every retrieval and every generation emits a structured trace. Per-tenant cost is measured live and gated by configurable ceilings — a tenant that hits the ceiling gets a controlled degradation, not a surprise five-figure bill. The team gets paged on quality regressions (drop in retrieval precision@k), not just latency.
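
A minimal sketch of the ceiling check, assuming a three-tier degradation policy (the tiers and the 80% soft threshold are illustrative):

    # Evaluated per tenant before each generation call, using live spend tracking.
    def choose_mode(spend_today: float, ceiling: float) -> str:
        if spend_today < 0.8 * ceiling:
            return "full"            # normal pipeline
        if spend_today < ceiling:
            return "economy"         # cheaper model, tighter context, smaller k
        return "retrieval_only"      # serve retrieved passages, skip generation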

"A RAG system without an eval harness is a chatbot that demos well. The harness is what makes the difference between a demo and a product."

What we built

  • Ingestion pipeline — chunking, embedding, deduplication, incremental re-indexing on document updates.
  • Hybrid retrieval — BM25 + dense + RRF + cross-encoder rerank, p95 200ms, configurable per tenant.
  • Eval harness — 800 query/answer pairs, run on every PR, blocking merges on regression.
  • LLM router — OpenAI + Anthropic + on-prem fallback, health-aware, cost-aware, latency-aware.
  • Per-tenant cost ceilings — measured live, enforced at the router, with controlled degradation modes.
  • Observability stack — traces, structured logs, retrieval-quality dashboards, alerting on quality regression not just errors.
  • Replay tooling — every customer query can be re-run against a new model or new retrieval config in CI.
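
A minimal sketch of the replay idea, assuming JSONL query logs and two retrieval callables to diff (the log format and function names are illustrative):

    # Re-run logged customer queries against a candidate retrieval config and
    # report where its results diverge from the current production config.
    import json

    def replay(log_path: str, prod_retrieve, candidate_retrieve, k: int = 10):
        changed = []
        for line in open(log_path):
            query = json.loads(line)["query"]
            before = prod_retrieve(query, k=k)
            after = candidate_retrieve(query, k=k)
            if before != after:
                changed.append({"query": query, "before": before, "after": after})
        print(f"{len(changed)} logged queries changed retrieval results")
        return changed  # fed into the eval harness and manual triage in CI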

Results

200ms p95 retrieval over 10M docs in production. The eval harness catches retrieval regressions before they ship; it caught three during the build that would otherwise have become incidents. Cost per query landed within target after the router and cost ceilings went live. The team now ships RAG changes with confidence because the harness gives them a number.

Equally valuable: the engineering team came away with an opinion about how to evaluate RAG that they didn't have before. We left them with a runbook for extending the eval set, prompts for generating synthetic eval pairs, and a process for triaging regressions.

Stack

VECTOR STORE: Pinecone (serverless tier)
LEXICAL INDEX: OpenSearch (sharded)
EMBEDDINGS: OpenAI text-embedding-3-large
RERANKER: Cohere Rerank + custom cross-encoder
LLMs: OpenAI GPT-4 family + Anthropic Claude family
ROUTER: Custom Python router with health-aware policy
EVAL: Custom harness, 800 graded pairs, run on every PR
OBSERVABILITY: OpenTelemetry + Datadog + custom retrieval dashboards
API: FastAPI + async
QUEUES: SQS for ingestion, Redis for hot cache

Why this engagement worked

Because the team had shipped RAG in production before — through the unglamorous middle phase where retrieval seems to work but the eval numbers haven't moved. We didn't sell the client on agentic frameworks or chain-of-thought theatre. We sold them on measurement.

◆ START A PROJECT

Want similar results?

Production RAG, evals, cost-capped LLM systems. NDA available on request.