LLM Architecture: Production Inference, RAG, Scaling, and Infrastructure Design
Large Language Model (LLM) architecture is how you turn a foundation model into a reliable production system: routing requests, serving inference at scale, grounding answers with retrieval, enforcing safety, and operating the stack with observability and cost control. This document is an infrastructure-oriented reference that follows the request path through RAG, agents, and deployment, and it complements the companion notes on agentic AI + MCP and context engineering.
What “LLM architecture” means in production
In research, an LLM is weights and a tokenizer. In production, it is a distributed system with:
| Layer | Responsibility |
|---|---|
| Experience | Chat UI, APIs, SDKs, streaming UX |
| Orchestration | Prompt assembly, tool loops, workflow state |
| Inference | Model serving, batching, GPU scheduling |
| Knowledge | RAG, vector DB, caches, connectors |
| Control | Auth, rate limits, routing, guardrails |
| Operations | Metrics, tracing, evals, cost accounting |
The design goal is predictable latency, quality, cost, and safety under real traffic—not maximum context window size alone.
High-level reference architecture
                     +------------------+
                     |  Clients / Apps  |
                     | (web, API, agent)|
                     +--------+---------+
                              |
                              v
                     +------------------+
                     |   API Gateway    |
                     | auth, rate limit |
                     | request routing  |
                     +--------+---------+
                              |
        +---------------------+---------------------+
        |                     |                     |
        v                     v                     v
+----------------+    +----------------+    +----------------+
| Orchestration  |    | RAG / Search   |    | Tool / MCP     |
| (workflows,    |    | (retrieval,    |    | servers        |
|  agent loop)   |    |  rerank)       |    |                |
+-------+--------+    +-------+--------+    +-------+--------+
        |                     |                     |
        +---------------------+---------------------+
                              |
                              v
                     +------------------+
                     | Inference Layer  |
                     | (model router,   |
                     |  load balancer)  |
                     +--------+---------+
                              |
            +-----------------+-----------------+
            |                 |                 |
            v                 v                 v
     +-------------+   +-------------+   +-------------+
     | LLM pool    |   | Embedding   |   | Reranker /  |
     | (chat, code)|   | model pool  |   | classifier  |
     +-------------+   +-------------+   +-------------+
            |                 |                 |
            v                 v                 v
    +--------------------------------------------------+
    |  GPU / accelerator fleet (K8s, VM, managed API)  |
    +--------------------------------------------------+
                              |
                              v
                     +------------------+
                     |  Observability   |
                     | logs, traces,    |
                     | evals, cost      |
                     +------------------+
| Component | Role |
|---|---|
| API gateway | Single entry: authentication, quotas, WAF, request IDs, model selection headers. |
| Orchestration | Builds prompts, runs agent loops, manages session state, calls tools. |
| RAG | Retrieves grounded context before or during generation. |
| Inference layer | Routes to the right model endpoint; handles retries, fallbacks, circuit breakers. |
| Model pools | Specialized models: chat, embeddings, reranking, moderation, vision. |
| Observability | End-to-end traces from user query → retrieval → tokens → response quality. |
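The "retries, fallbacks, circuit breakers" role of the inference layer can be sketched as follows. This is a minimal illustration, not a production client: `CircuitBreaker`, `call_with_fallback`, the threshold of 3 failures, and the 30-second cool-down are all assumptions made for the example, and each endpoint is modeled as a plain callable.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CircuitBreaker:
    """Opens after repeated failures, then allows trial calls after a cool-down."""
    failure_threshold: int = 3      # hypothetical default
    reset_after_s: float = 30.0     # hypothetical default
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow trial calls once the cool-down has elapsed.
        return now - self.opened_at >= self.reset_after_s

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now


def call_with_fallback(prompt, endpoints, breakers,
                       clock: Callable[[], float] = time.monotonic):
    """Try (name, callable) endpoints in order, skipping any with an open breaker."""
    for name, call in endpoints:
        breaker = breakers.setdefault(name, CircuitBreaker())
        if not breaker.allow(clock()):
            continue
        try:
            result = call(prompt)
        except Exception:
            breaker.record(False, clock())
            continue
        breaker.record(True, clock())
        return name, result
    raise RuntimeError("all model endpoints unavailable")
```

Once the primary pool has failed three times, later requests go straight to the fallback without paying the primary's timeout, until the cool-down allows a trial call.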
Inference pipeline: one request end-to-end
A single chat completion in production is rarely “send text to model.” Typical stages:
User message
|
v
[1] Auth + policy check (tenant, scope, PII rules)
|
v
[2] Session load (history, user prefs, durable notes)
|
v
[3] Intent / routing (which model? need RAG? need tools?)
|
v
[4] Context assembly (system prompt + history + retrieved docs)
|
v
[5] Token budget trim (summarize, drop noise, compact old turns)
|
v
[6] Inference call (streaming or batch)
|
v
[7] Post-process (citations, JSON parse, moderation, logging)
|
v
Response to client
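Stage [5] is the step most often left implicit, so here is a small sketch of one possible trimming policy: always keep the system prompt and retrieved documents, then keep as many of the newest turns as fit. The function names are hypothetical, and `count_tokens` is a crude whitespace stand-in; a real system would use the serving model's tokenizer and might summarize dropped turns instead of discarding them.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in (assumption): real systems use the model's tokenizer.
    return len(text.split())


def trim_to_budget(system_prompt, history, retrieved, budget):
    """Keep system prompt + retrieved docs, then the newest turns that fit."""
    used = count_tokens(system_prompt) + sum(count_tokens(d) for d in retrieved)
    if used > budget:
        raise ValueError("system prompt and retrieved docs exceed the budget")

    kept = []
    for turn in reversed(history):          # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                           # older turns are dropped
        kept.append(turn)
        used += cost
    kept.reverse()                          # restore chronological order
    return [system_prompt, *retrieved, *kept]
```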
Synchronous vs streaming
| Mode | Behavior | When to use |
|---|---|---|
| Streaming (SSE) | Tokens arrive incrementally; time to first token (TTFT) matters. | Chat UX, coding assistants, long answers. |
| Batch / async | Full response or job ID later. | Summarization pipelines, bulk eval, offline jobs. |
| Non-streaming sync | Wait for complete response. | Short structured outputs, tool-call planning steps. |
Infrastructure note: Streaming changes timeout, load-balancer, and retry design. Retrying a half-delivered stream is harder than retrying a single JSON response—use idempotent session keys and client-side resume policies.
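One way to picture the resume policy: the server buffers emitted chunks under an idempotency key, and a reconnecting client asks for everything after the offset it already holds. The sketch below is an in-memory toy under that assumption; `StreamBuffer`, `consume`, and the `fail_after` parameter (which only simulates a dropped connection) are hypothetical names, not any gateway's API.

```python
class StreamBuffer:
    """Server side: chunks stored per idempotency key so a stream can be resumed."""

    def __init__(self):
        self._chunks = {}

    def append(self, key, chunk):
        self._chunks.setdefault(key, []).append(chunk)

    def read_from(self, key, offset):
        return self._chunks.get(key, [])[offset:]


def consume(buffer, key, received, fail_after=None):
    """Client side: fetch only chunks not yet received.

    fail_after simulates the connection dropping after that many new chunks.
    """
    for i, chunk in enumerate(buffer.read_from(key, len(received))):
        if fail_after is not None and i >= fail_after:
            raise ConnectionError("stream dropped")
        received.append(chunk)
    return received
```

Because the client resumes from `len(received)`, a retry never duplicates text that was already rendered.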
Model roles in a multi-model stack
Production systems rarely use one model for everything.
| Model type | Typical use | Serving profile |
|---|---|---|
| Frontier chat | Complex reasoning, agents, coding | High VRAM, lower QPS, higher cost/token |
| Small / fast chat | Classification, routing, simple Q&A | High QPS, low latency, distillation target |
| Embedding | RAG indexing, semantic search | Batch-friendly, high throughput |
| Reranker | Precision on top-k retrieval | Small batches, CPU or GPU |
| Moderation / safety | Input/output policy | Low latency, often separate vendor or small model |
| Multimodal | Image, audio, document understanding | Heavier compute, different pre-processing |
Model router
A router picks the model (or path) per request:
          Incoming request
                  |
                  v
           +--------------+
           |    Router    |
           |   (rules +   |
           |  classifier) |
           +------+-------+
                  |
    +-------------+-------------+
    |             |             |
    v             v             v
Fast SLM      Main LLM     Specialist
 (cheap)      (quality)   (code, legal)
Routing signals:
- Explicit user/product tier (free vs pro model)
- Heuristics (token count, language, attachment type)
- Classifier model (“simple FAQ” vs “multi-step agent task”)
- Cost/latency SLO and current queue depth
- Fallback chain when primary pool is saturated or errors
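The signals above can be combined in a small rule-based router. This is a sketch under stated assumptions: the pool names (`fast-slm`, `main-llm`, `specialist-code`), the 200-token cutoff, the `max_queue` limit, and the `task` label (standing in for a classifier's output) are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tier: str                    # "free" or "pro"
    prompt_tokens: int
    task: str = "chat"           # label from a hypothetical upstream classifier
    has_attachment: bool = False


def route(req, queue_depth, max_queue=100):
    """Return an ordered fallback chain of model pools for this request."""
    if req.task == "code":
        chain = ["specialist-code", "main-llm"]
    elif req.tier == "free" or (
        req.task == "faq" and req.prompt_tokens < 200 and not req.has_attachment
    ):
        chain = ["fast-slm", "main-llm"]
    else:
        chain = ["main-llm", "fast-slm"]
    # Stable sort: saturated pools move to the back, preferred order otherwise kept.
    return sorted(chain, key=lambda pool: queue_depth.get(pool, 0) >= max_queue)
```

Returning a chain rather than a single pool lets the caller walk the list on errors, which covers the last signal (fallback when the primary pool is saturated or failing).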
Serving layer: GPUs, batching, and efficiency
Deployment patterns
| Pattern | Description | Trade-off |
|---|---|---|
| Managed API | OpenAI, Azure OpenAI, Anthropic, Bedrock | Fastest to ship; less control over weights and colocation |
| Self-hosted (vLLM, TGI, TensorRT-LLM) | Models on your K8s/VM GPU fleet | Control, data residency, ops burden |
| Hybrid | Sensitive workloads self-hosted; burst to managed API | Complexity in routing and eval parity |
Key serving concepts
Continuous batching — The scheduler groups in-flight requests dynamically so GPUs stay utilized instead of waiting for fixed batch boundaries.
KV cache — Attention key/value tensors for prior tokens are cached so each new token does not recompute the full prefix. Long contexts and multi-turn chats are memory-heavy; cache sizing drives max concurrency per GPU.
Speculative decoding — A small draft model proposes tokens; the large model verifies in parallel. Cuts latency for autoregressive generation when draft acceptance is high.
Quantization (INT8, FP8, GPTQ, AWQ) — Lower precision weights reduce VRAM and increase throughput with small quality trade-offs—common for embedding and mid-tier chat models.
Request queue
|
v
+------------------+
| Scheduler |
| (batch, priority,|
| preemption) |
+--------+---------+
|
v
+------------------+
| GPU worker(s) |
| model weights |
| KV cache pools |
+------------------+
Scaling dimensions
| Dimension | Knob |
|---|---|
| Throughput | More GPU replicas, better batching, smaller quantized models |
| Latency (p95) | Dedicated pools, shorter context, speculative decoding, regional edge |
| Concurrency | KV cache limits per replica; queue + backpressure at gateway |
| Cost | Model routing to SLMs, cache hits, prompt compression, batch offline work |
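To make the concurrency row concrete, KV cache size can be estimated from model shape: each token stores one key and one value vector per layer per KV head. The sketch below does that arithmetic for a hypothetical configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache); the numbers are illustrative assumptions, and the result is a worst case in which every sequence uses its full context.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2 = one key tensor + one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(cache_budget_bytes, tokens_per_seq, **model_shape):
    """Worst-case concurrency if every sequence fills tokens_per_seq."""
    per_seq = kv_bytes_per_token(**model_shape) * tokens_per_seq
    return cache_budget_bytes // per_seq


GIB = 2**30
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)   # hypothetical model
per_token = kv_bytes_per_token(**shape)                  # 131072 bytes = 128 KiB
slots = max_concurrent_sequences(40 * GIB, 8192, **shape)  # 40 sequences
```

With these assumptions an 8,192-token sequence needs 1 GiB of cache, so 40 GiB of free VRAM holds at most 40 such sequences. Halving context length or quantizing the cache to 8 bits doubles that figure.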
RAG architecture (retrieval-augmented generation)
RAG grounds the LLM on private or fresh data instead of relying only on training knowledge.
  User query
       |
       v
+-------------+     +------------------+
| Query       |---->| Embedding model  |
| transform   |     | (query vector)   |
+------+------+     +--------+---------+
       |                     |
       |                     v
       |            +------------------+
       |            | Vector DB /      |
       |            | hybrid search    |
       |            +--------+---------+
       |                     |
       v                     v
+-------------+        Top-k chunks
| Optional    |<-------------+
| reranker    |              |
+------+------+              |
       |                     |
       v                     v
+------------------------------------------+
| Context builder (citations, chunk trim)  |
+--------------------+---------------------+
                     |
                     v
              +-------------+
              |  Main LLM   |
              |  (generate) |
              +-------------+
Ingestion pipeline (offline / async)
Sources (docs, wiki, tickets, code)
|
v
Extract -> Chunk -> Embed -> Index (vector + metadata)
|
v
Optional: graph links, ACL tags, version stamps
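The Chunk step can be illustrated with the simplest policy, fixed-size windows with overlap. This is a minimal sketch that assumes word counts as the unit; `chunk_words` and its default sizes are hypothetical, and production pipelines usually count tokens and respect headings or tables.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                      # the last window reached the end
    return chunks
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of indexing some words twice.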
| Stage | Design choices |
|---|---|
| Chunking | Fixed size vs semantic splits; overlap; respect document structure (headings, tables). |
| Indexing | Vector only vs hybrid (BM25 + dense); metadata filters (team, product, date). |
| Retrieval | top-k, MMR diversity, multi-query expansion, HyDE-style synthetic queries. |
| Reranking | Cross-encoder reranker on top-50 → top-5 for precision. |
| Generation | Require citations; refuse when retrieval score is below threshold. |
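Of the retrieval options in the table, MMR (maximal marginal relevance) is the easiest to show in a few lines: each pick trades relevance to the query against similarity to chunks already selected. The sketch below uses plain Python lists as vectors; the `lam` weight of 0.7 is an assumed default, not a recommendation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k, lam=0.7):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this reduces to plain top-k by similarity; lower values push near-duplicate chunks out of the context window.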
RAG failure modes (architecture responses)
| Failure | Mitigation |
|---|---|
| Wrong chunks retrieved | Better embeddings, hybrid search, reranker, query rewriting |
| Context overflow | Chunk budget, summarization of retrieved set, context engineering |
| Stale index | Incremental ingest, version tags in prompts, TTL on sources |
| Hallucination despite RAG | Citation-only answers, lower temperature, “I don’t know” policy |
Fine-tuning, prompting, and distillation
| Approach | What changes | Best for |
|---|---|---|
| Prompting + RAG | No weight updates | Fast iteration, changing policies, private data via retrieval |
| Fine-tuning (SFT, LoRA) | Adapter or full weights | Tone, format, domain jargon, tool-call style |
| RLHF / DPO / preference tuning | Alignment to human preferences | Safety, helpfulness, ranking quality |
| Distillation | Small model mimics large | Cost reduction at scale for routing or simple tasks |
Architecture guidance:
- Start with routing + RAG + strong system prompts before fine-tuning.
- Fine-tune for stable, repetitive patterns (JSON schema, support tone, internal acronyms).
- Keep a golden eval set and regression tests before and after any weight change.
- Version models like microservices (for example `chat-v3`, `embed-v2`), with shadow traffic and rollback.
Agents and tools in the LLM stack
An agent is orchestration on top of inference: the LLM plans, calls tools, and loops until done. Infrastructure additions:
Orchestrator (your service)
|
+--> LLM inference (plan / act)
|
+--> Tool gateway (MCP, REST, SQL, code sandbox)
|
+--> Memory store (session, vector, key-value notes)
|
+--> Human approval gate (high-risk actions)
See AI Agentic Systems and MCP for ReAct, plan-and-execute, and MCP host/client/server design.
| Concern | Infrastructure pattern |
|---|---|
| Tool sprawl | Tool registry per role; dynamic tool loading; MCP servers by domain |
| Long sessions | Compaction, structured notes outside context window |
| Unsafe actions | Sandboxed code exec, allowlisted APIs, approval workflows |
| Cost explosion | Max steps per task, per-user budgets, cache tool results |
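The cost-explosion row can be sketched as a loop with a hard step cap and a cache for repeated tool calls. Everything here is hypothetical scaffolding: `plan_step` stands in for an LLM call that returns either a tool action or a final answer, and caching is only safe if the tools are read-only and idempotent.

```python
import json


def run_agent(plan_step, tools, task, max_steps=5):
    """Loop plan -> tool call until a final answer or the step cap is hit."""
    cache = {}        # (tool name, canonical args) -> result
    transcript = []   # list of (action, observation) pairs fed back to the planner
    for step in range(max_steps):
        action = plan_step(task, transcript)
        if action["type"] == "final":
            return {"answer": action["answer"], "steps": step, "stopped": "done"}
        key = (action["tool"], json.dumps(action["args"], sort_keys=True))
        if key not in cache:
            # Only call the tool on a cache miss (assumes read-only tools).
            cache[key] = tools[action["tool"]](**action["args"])
        transcript.append((action, cache[key]))
    return {"answer": None, "steps": max_steps, "stopped": "max_steps"}
```

A per-user token or dollar budget could be enforced at the same point as `max_steps`, by checking accumulated usage before each planner call.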
Safety, guardrails, and policy layer
Guardrails sit around inference, not inside the model only.
Input  --> [PII detect] --> [Prompt injection filter] --> LLM
                                                           |
Output <-- [moderation] <-- [policy / schema validate] <---+
| Control | Implementation |
|---|---|
| Input moderation | Block jailbreaks, toxic content, secrets in prompts |
| Output moderation | Category scores, block or rewrite |
| Structured output | JSON schema, grammar constraints (GBNF), function-call validators |
| Tenant isolation | Separate indexes, API keys, and rate limits per customer |
| Audit | Immutable logs of prompts, retrieval IDs, tool calls (with retention policy) |
Defense in depth: gateway rules + model moderation + application-level validation (especially for finance, healthcare, and code execution).
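The application-level layer of that sandwich might look like the sketch below: redact obvious PII and reject likely injection attempts on the way in, then require well-formed JSON with expected keys on the way out. The regex, the phrase list, and the function names are simplistic assumptions for illustration; real deployments typically rely on dedicated PII and moderation models.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Toy phrase list (assumption); real filters use classifiers, not substrings.
INJECTION_HINTS = ("ignore previous instructions", "reveal your system prompt")


def check_input(text):
    """Reject likely injections, then redact email addresses."""
    lowered = text.lower()
    if any(hint in lowered for hint in INJECTION_HINTS):
        raise ValueError("possible prompt injection")
    return EMAIL.sub("[EMAIL]", text)


def check_output(raw, required_keys):
    """Parse model output as JSON and require the expected top-level keys."""
    data = json.loads(raw)              # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = set(required_keys) - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def guarded_call(llm, user_text, required_keys):
    """Wrap any callable `llm(prompt) -> str` with input and output checks."""
    return check_output(llm(check_input(user_text)), required_keys)
```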
Observability and evaluation
LLM systems need product metrics and model metrics.
What to trace (one trace_id per user request)
| Span | Data |
|---|---|
gateway | user, model route, latency |
retrieval | query, chunk IDs, scores |
inference | model version, input/output tokens, finish reason |
tools | tool name, args hash, latency, success |
guardrails | policy hits, blocked reason |
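A dependency-free sketch of the span table: one `Trace` per user request, with each stage recording its fields and latency under the shared `trace_id`. The `Trace` class and `args_hash` helper are hypothetical; a real system would more likely emit these through a tracing library, with the same fields as span attributes.

```python
import hashlib
import json
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects span records that share one trace_id."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **data):
        start = time.perf_counter()
        record = {"trace_id": self.trace_id, "span": name, **data}
        try:
            yield record            # callers may add fields while the span is open
        finally:
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)


def args_hash(args):
    """Stable short hash so tool arguments can be correlated without logging them."""
    payload = json.dumps(args, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]
```

Hashing tool arguments rather than storing them keeps traces useful for deduplication and debugging while reducing the chance of leaking PII into logs.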
Core metrics
| Metric | Why it matters |
|---|---|
| TTFT / TTLT (time to first / last token) | Perceived speed when streaming |
| Tokens in/out | Direct cost driver |
| Error rate by model | Routing and fallback health |
| Retrieval hit rate / MRR | RAG quality |
| Task success rate | Agent completion (human or automated eval) |
| Hallucination / citation rate | Grounded answer quality |
Evaluation pipeline
Golden datasets (per use case)
|
v
Offline eval (CI on model / prompt changes)
|
v
Online shadow / A-B (small % traffic)
|
v
Production monitors + human review queue
Automate evals for: exact match on structured tasks, LLM-as-judge (with caution), retrieval recall@k, toxicity regression, and latency/cost budgets.
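Two of those automated checks, retrieval recall@k and MRR, are small enough to show in full. The sketch assumes each eval case is a ranked list of retrieved chunk IDs plus a set of IDs labeled relevant in the golden dataset; the function names are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant IDs that appear in the top k retrieved IDs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases):
    """cases: iterable of (retrieved_ids, relevant_ids) pairs."""
    cases = list(cases)
    return sum(reciprocal_rank(r, rel) for r, rel in cases) / len(cases)
```

Running these in CI against a fixed golden set makes an embedding or chunking change show up as a number before it reaches production traffic.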
Deployment topologies
Single region (baseline)
Users --> LB --> App tier --> Inference (GPU node pool) --> Vector DB
|
+--> Object storage (docs, artifacts)
Multi-region / hybrid
| Topology | Use case |
|---|---|
| Read replicas + central GPU | Global users, inference in one compliant region |
| Regional inference | Strict data residency (EU-only, etc.) |
| Edge gateway + central RAG | Low-latency first token; retrieval from regional cache |
Data plane vs control plane: Keep config, evals, and routing rules in a control plane; replicate indexes and model endpoints per region as required by compliance and latency.
Caching strategy
| Cache | What | Benefit |
|---|---|---|
| Prompt / prefix cache | Identical system prompts, tool schemas | Lower cost and latency (provider or vLLM prefix caching) |
| Semantic cache | Similar user queries → prior answers | Cost savings on FAQ-style traffic |
| Embedding cache | Chunk and query vectors | Faster RAG retrieval |
| Tool result cache | Idempotent read-only API responses | Fewer agent steps and external calls |
Invalidate on: document updates, model version change, policy change, or TTL expiry.
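A semantic cache that honors those invalidation rules can be sketched as a list of (query vector, answer) entries scoped by model version and TTL. This is a toy linear scan under assumed values (a 0.95 cosine threshold and a one-hour TTL); a real cache would store vectors in the vector DB and tune the threshold against false hits.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95, ttl_s=3600.0):
        self.threshold = threshold
        self.ttl_s = ttl_s
        self._entries = []   # (vector, answer, model_version, stored_at)

    def put(self, vector, answer, model_version, now):
        self._entries.append((vector, answer, model_version, now))

    def get(self, vector, model_version, now):
        """Return the most similar unexpired answer for this model version, or None."""
        best_answer, best_sim = None, self.threshold
        for vec, answer, version, stored_at in self._entries:
            if version != model_version or now - stored_at > self.ttl_s:
                continue     # model version change or TTL expiry
            sim = cosine(vector, vec)
            if sim >= best_sim:
                best_answer, best_sim = answer, sim
        return best_answer

    def invalidate(self):
        """Call on document or policy updates."""
        self._entries.clear()
```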
Cost and capacity planning
Rough planning variables:
Monthly cost ≈
(input_tokens × price_in + output_tokens × price_out) × requests
+ GPU_hours × $/GPUhr (if self-hosted)
+ vector_db + storage + egress
| Lever | Effect |
|---|---|
| Shorter prompts / compaction | ↓ input tokens |
| Smaller model for easy queries | ↓ $/request |
| Higher batch utilization | ↓ GPU_hours |
| RAG only when needed | ↓ retrieval + context tokens |
| Output length limits | ↓ output tokens |
Capacity: estimate peak QPS, p95 context length, and concurrent agent steps; load-test the inference pool with synthetic prompts that match production token distributions—not average-only tests.
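The planning formula above translates directly into a small calculator, which makes the levers easy to compare. The prices and volumes in the example are made-up placeholders, not any provider's rates, and prices are expressed per million tokens as an assumption about how they are quoted.

```python
def monthly_cost(requests, input_tokens, output_tokens,
                 price_in_per_mtok, price_out_per_mtok,
                 gpu_hours=0.0, gpu_hour_price=0.0, fixed=0.0):
    """Token spend + self-hosted GPU spend + fixed costs (vector DB, storage, egress)."""
    per_request = (input_tokens * price_in_per_mtok
                   + output_tokens * price_out_per_mtok)
    token_cost = requests * per_request / 1_000_000
    return token_cost + gpu_hours * gpu_hour_price + fixed


# Hypothetical workload: 1M requests/month, 2,000 input and 500 output tokens each.
baseline = monthly_cost(1_000_000, 2000, 500, 1.0, 4.0, fixed=500.0)   # 4500.0
compacted = monthly_cost(1_000_000, 1000, 500, 1.0, 4.0, fixed=500.0)  # 3500.0
```

In this made-up scenario, halving the prompt through compaction saves 1,000 per month, the same as cutting output length by a quarter, because output tokens are priced four times higher.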
Security and compliance checklist
- Secrets: API keys in vault/K8s secrets; never in prompts or logs.
- Network: Private endpoints for model and vector DB; no public GPU admin ports.
- Data residency: Model provider region, index location, and log storage aligned with policy.
- Retention: Configurable prompt/log retention; PII redaction in traces.
- Supply chain: Pin model artifacts, scan containers, sign OCI images for inference workers.
- Abuse: Per-IP and per-tenant rate limits; anomaly detection on token burn.
Design patterns summary
| Pattern | Summary |
|---|---|
| Gateway + router | One front door; route by task, tier, and SLO. |
| Multi-model pool | Right-size model per step (embed, rerank, chat, classify). |
| RAG with hybrid search | Dense + sparse retrieval, rerank, cite sources. |
| Agent orchestration | Loop over inference + tools; cap steps and cost. |
| Guardrail sandwich | Filter input and output; validate structured actions. |
| Observable by default | Trace retrieval, tokens, tools, and eval regressions. |
| Progressive complexity | Prompt + RAG first; fine-tune and self-host when economics justify ops. |
How this connects to your other notes
| Topic | Post |
|---|---|
| Agent loops, ReAct, MCP | AI Agentic Systems and MCP |
| Context budgets, compaction, long horizons | Context Engineering for AI Agents |
| Distributed trade-offs (availability vs consistency) | CAP Theorem |
Summary
| Area | Takeaway |
|---|---|
| Production LLM | A full stack: gateway, orchestration, inference, knowledge, guardrails, ops—not a single API call. |
| Inference | Streaming, batching, KV cache, and routing dominate latency and cost. |
| RAG | Ingest, retrieve, rerank, then generate with citations and refusal policies. |
| Multi-model | Embed, rerank, chat, moderate—specialized pools beat one giant model for everything. |
| Agents | Orchestration + tools on top of inference; control steps, memory, and risk. |
| Operations | Traces, evals, and token accounting are as important as model choice. |
A solid LLM architecture picks the right model for each step, grounds answers when facts matter, contains agents with policy and budgets, and runs on infrastructure you can measure, scale, and roll back like any other critical service.