LLM Architecture: Production Inference, RAG, Scaling, and Infrastructure Design

Large Language Model (LLM) architecture is how you turn a foundation model into a reliable production system: routing requests, serving inference at scale, grounding answers with retrieval, enforcing safety, and operating the stack with observability and cost control. This document is an infrastructure-oriented reference—from the request path through RAG, agents, and deployment—complementing agentic AI + MCP and context engineering.

What “LLM architecture” means in production

In research, an LLM is weights and a tokenizer. In production, it is a distributed system with:

Layer	Responsibility
Experience	Chat UI, APIs, SDKs, streaming UX
Orchestration	Prompt assembly, tool loops, workflow state
Inference	Model serving, batching, GPU scheduling
Knowledge	RAG, vector DB, caches, connectors
Control	Auth, rate limits, routing, guardrails
Operations	Metrics, tracing, evals, cost accounting

The design goal is predictable latency, quality, cost, and safety under real traffic—not maximum context window size alone.

High-level reference architecture

                         +------------------+
                         |  Clients / Apps  |
                         | (web, API, agent)|
                         +--------+---------+
                                  |
                                  v
                         +------------------+
                         |   API Gateway    |
                         | auth, rate limit |
                         | request routing  |
                         +--------+---------+
                                  |
            +---------------------+---------------------+
            |                     |                     |
            v                     v                     v
   +----------------+   +----------------+   +----------------+
   | Orchestration  |   |  RAG / Search  |   |  Tool / MCP    |
   | (workflows,    |   |  (retrieval,   |   |  servers       |
   |  agent loop)   |   |   rerank)      |   |                |
   +-------+--------+   +-------+--------+   +-------+--------+
           |                    |                    |
           +--------------------+--------------------+
                                |
                                v
                       +------------------+
                       |  Inference Layer |
                       |  (model router,  |
                       |   load balancer) |
                       +--------+---------+
                                |
              +-----------------+-----------------+
              |                 |                 |
              v                 v                 v
       +-------------+   +-------------+   +-------------+
       | LLM pool    |   | Embedding   |   | Reranker /  |
       | (chat, code)|   | model pool  |   | classifier  |
       +-------------+   +-------------+   +-------------+
              |                 |                 |
              v                 v                 v
       +--------------------------------------------------+
       | GPU / accelerator fleet (K8s, VM, managed API)   |
       +--------------------------------------------------+
                                |
                                v
                       +------------------+
                       | Observability    |
                       | logs, traces,    |
                       | evals, cost      |
                       +------------------+

Component	Role
API gateway	Single entry: authentication, quotas, WAF, request IDs, model selection headers.
Orchestration	Builds prompts, runs agent loops, manages session state, calls tools.
RAG	Retrieves grounded context before or during generation.
Inference layer	Routes to the right model endpoint; handles retries, fallbacks, circuit breakers.
Model pools	Specialized models: chat, embeddings, reranking, moderation, vision.
Observability	End-to-end traces from user query → retrieval → tokens → response quality.

Inference pipeline: one request end-to-end

A single chat completion in production is rarely “send text to model.” Typical stages:

    User message
         |
         v
    [1] Auth + policy check (tenant, scope, PII rules)
         |
         v
    [2] Session load (history, user prefs, durable notes)
         |
         v
    [3] Intent / routing (which model? need RAG? need tools?)
         |
         v
    [4] Context assembly (system prompt + history + retrieved docs)
         |
         v
    [5] Token budget trim (summarize, drop noise, compact old turns)
         |
         v
    [6] Inference call (streaming or batch)
         |
         v
    [7] Post-process (citations, JSON parse, moderation, logging)
         |
         v
    Response to client

Synchronous vs streaming

Mode	Behavior	When to use
Streaming (SSE)	Tokens arrive incrementally; TTFB matters.	Chat UX, coding assistants, long answers.
Batch / async	Full response or job ID later.	Summarization pipelines, bulk eval, offline jobs.
Non-streaming sync	Wait for complete response.	Short structured outputs, tool-call planning steps.

Infrastructure note: Streaming changes timeout, load-balancer, and retry design. Retrying a half-delivered stream is harder than retrying a single JSON response—use idempotent session keys and client-side resume policies.

Model roles in a multi-model stack

Production systems rarely use one model for everything.

Model type	Typical use	Serving profile
Frontier chat	Complex reasoning, agents, coding	High VRAM, lower QPS, higher cost/token
Small / fast chat	Classification, routing, simple Q&A	High QPS, low latency, distillation target
Embedding	RAG indexing, semantic search	Batch-friendly, high throughput
Reranker	Precision on top-k retrieval	Small batches, CPU or GPU
Moderation / safety	Input/output policy	Low latency, often separate vendor or small model
Multimodal	Image, audio, document understanding	Heavier compute, different pre-processing

Model router

A router picks the model (or path) per request:

    Incoming request
           |
           v
    +--------------+
    | Router       |
    | (rules +     |
    |  classifier) |
    +------+-------+
           |
     +-----+-----+-----+
     |           |     |
     v           v     v
  Fast SLM   Main LLM  Specialist
  (cheap)    (quality)  (code, legal)

Routing signals:

Explicit user/product tier (free vs pro model)
Heuristics (token count, language, attachment type)
Classifier model (“simple FAQ” vs “multi-step agent task”)
Cost/latency SLO and current queue depth
Fallback chain when primary pool is saturated or errors

Serving layer: GPUs, batching, and efficiency

Deployment patterns

Pattern	Description	Trade-off
Managed API	OpenAI, Azure OpenAI, Anthropic, Bedrock	Fastest to ship; less control over weights and colocation
Self-hosted (vLLM, TGI, TensorRT-LLM)	Models on your K8s/VM GPU fleet	Control, data residency, ops burden
Hybrid	Sensitive workloads self-hosted; burst to managed API	Complexity in routing and eval parity

Key serving concepts

Continuous batching — The scheduler groups in-flight requests dynamically so GPUs stay utilized instead of waiting for fixed batch boundaries.

KV cache — Attention key/value tensors for prior tokens are cached so each new token does not recompute the full prefix. Long contexts and multi-turn chats are memory-heavy; cache sizing drives max concurrency per GPU.

Speculative decoding — A small draft model proposes tokens; the large model verifies in parallel. Cuts latency for autoregressive generation when draft acceptance is high.

Quantization (INT8, FP8, GPTQ, AWQ) — Lower precision weights reduce VRAM and increase throughput with small quality trade-offs—common for embedding and mid-tier chat models.

    Request queue
         |
         v
    +------------------+
    | Scheduler        |
    | (batch, priority,|
    |  preemption)     |
    +--------+---------+
             |
             v
    +------------------+
    | GPU worker(s)    |
    | model weights    |
    | KV cache pools   |
    +------------------+

Scaling dimensions

Dimension	Knob
Throughput	More GPU replicas, better batching, smaller quantized models
Latency (p95)	Dedicated pools, shorter context, speculative decoding, regional edge
Concurrency	KV cache limits per replica; queue + backpressure at gateway
Cost	Model routing to SLMs, cache hits, prompt compression, batch offline work

RAG architecture (retrieval-augmented generation)

RAG grounds the LLM on private or fresh data instead of relying only on training knowledge.

    User query
         |
         v
    +-------------+     +------------------+
    | Query       |---->| Embedding model  |
    | transform   |     | (query vector)   |
    +-------------+     +--------+---------+
         |                       |
         |                       v
         |              +------------------+
         |              | Vector DB /      |
         |              | hybrid search    |
         |              +--------+---------+
         |                       |
         v                       v
    +-------------+     Top-k chunks
    | Optional    |<--------------------+
    | reranker    |                     |
    +------+------+                     |
           |                            |
           v                            v
    +------------------------------------------+
    | Context builder (citations, chunk trim)  |
    +--------------------+---------------------+
                         |
                         v
                  +-------------+
                  | Main LLM    |
                  | (generate)  |
                  +-------------+

Ingestion pipeline (offline / async)

    Sources (docs, wiki, tickets, code)
              |
              v
    Extract -> Chunk -> Embed -> Index (vector + metadata)
              |
              v
    Optional: graph links, ACL tags, version stamps

Stage	Design choices
Chunking	Fixed size vs semantic splits; overlap; respect document structure (headings, tables).
Indexing	Vector only vs hybrid (BM25 + dense); metadata filters (team, product, date).
Retrieval	top-k, MMR diversity, multi-query expansion, HyDE-style synthetic queries.
Reranking	Cross-encoder reranker on top-50 → top-5 for precision.
Generation	Require citations; refuse when retrieval score is below threshold.

RAG failure modes (architecture responses)

Failure	Mitigation
Wrong chunks retrieved	Better embeddings, hybrid search, reranker, query rewriting
Context overflow	Chunk budget, summarization of retrieved set, context engineering
Stale index	Incremental ingest, version tags in prompts, TTL on sources
Hallucination despite RAG	Citation-only answers, lower temperature, “I don’t know” policy

Fine-tuning, prompting, and distillation

Approach	What changes	Best for
Prompting + RAG	No weight updates	Fast iteration, changing policies, private data via retrieval
Fine-tuning (SFT, LoRA)	Adapter or full weights	Tone, format, domain jargon, tool-call style
RLHF / DPO / preference tuning	Alignment to human preferences	Safety, helpfulness, ranking quality
Distillation	Small model mimics large	Cost reduction at scale for routing or simple tasks

Architecture guidance:

Start with routing + RAG + strong system prompts before fine-tuning.
Fine-tune for stable, repetitive patterns (JSON schema, support tone, internal acronyms).
Keep a golden eval set and regression tests before and after any weight change.
Version models like microservices: chat-v3, embed-v2, with shadow traffic and rollback.

Agents and tools in the LLM stack

An agent is orchestration on top of inference: the LLM plans, calls tools, and loops until done. Infrastructure additions:

    Orchestrator (your service)
         |
         +--> LLM inference (plan / act)
         |
         +--> Tool gateway (MCP, REST, SQL, code sandbox)
         |
         +--> Memory store (session, vector, key-value notes)
         |
         +--> Human approval gate (high-risk actions)

See AI Agentic Systems and MCP for ReAct, plan-and-execute, and MCP host/client/server design.

Concern	Infrastructure pattern
Tool sprawl	Tool registry per role; dynamic tool loading; MCP servers by domain
Long sessions	Compaction, structured notes outside context window
Unsafe actions	Sandboxed code exec, allowlisted APIs, approval workflows
Cost explosion	Max steps per task, per-user budgets, cache tool results

Safety, guardrails, and policy layer

Guardrails sit around inference, not inside the model only.

    Input --> [PII detect] --> [Prompt injection filter] --> LLM
                                                                  |
    Output <-- [moderation] <-- [policy / schema validate] <-------+

Control	Implementation
Input moderation	Block jailbreaks, toxic content, secrets in prompts
Output moderation	Category scores, block or rewrite
Structured output	JSON schema, grammar constraints (GBNF), function-call validators
Tenant isolation	Separate indexes, API keys, and rate limits per customer
Audit	Immutable logs of prompts, retrieval IDs, tool calls (with retention policy)

Defense in depth: gateway rules + model moderation + application-level validation (especially for finance, healthcare, and code execution).

Observability and evaluation

LLM systems need product metrics and model metrics.

What to trace (one `trace_id` per user request)

Span	Data
`gateway`	user, model route, latency
`retrieval`	query, chunk IDs, scores
`inference`	model version, input/output tokens, finish reason
`tools`	tool name, args hash, latency, success
`guardrails`	policy hits, blocked reason

Core metrics

Metric	Why it matters
TTFB / TTLT	Streaming perceived speed
Tokens in/out	Direct cost driver
Error rate by model	Routing and fallback health
Retrieval hit rate / MRR	RAG quality
Task success rate	Agent completion (human or automated eval)
Hallucination / citation rate	Grounded answer quality

Evaluation pipeline

    Golden datasets (per use case)
              |
              v
    Offline eval (CI on model / prompt changes)
              |
              v
    Online shadow / A-B (small % traffic)
              |
              v
    Production monitors + human review queue

Automate evals for: exact match on structured tasks, LLM-as-judge (with caution), retrieval recall@k, toxicity regression, and latency/cost budgets.

Deployment topologies

Single region (baseline)

    Users --> LB --> App tier --> Inference (GPU node pool) --> Vector DB
                      |
                      +--> Object storage (docs, artifacts)

Multi-region / hybrid

Topology	Use case
Read replicas + central GPU	Global users, inference in one compliant region
Regional inference	Strict data residency (EU-only, etc.)
Edge gateway + central RAG	Low-latency first token; retrieval from regional cache

Data plane vs control plane: Keep config, evals, and routing rules in a control plane; replicate indexes and model endpoints per region as required by compliance and latency.

Caching strategy

Cache	What	Benefit
Prompt / prefix cache	Identical system prompts, tool schemas	Lower cost and latency (provider or vLLM prefix caching)
Semantic cache	Similar user queries → prior answers	Cost savings on FAQ-style traffic
Embedding cache	Chunk and query vectors	Faster RAG retrieval
Tool result cache	Idempotent read-only API responses	Fewer agent steps and external calls

Invalidate on: document updates, model version change, policy change, or TTL expiry.

Cost and capacity planning

Rough planning variables:

    Monthly cost ≈
      (input_tokens  × price_in  +  output_tokens × price_out) × requests
      + GPU_hours × $/GPUhr (if self-hosted)
      + vector_db + storage + egress

Lever	Effect
Shorter prompts / compaction	↓ input tokens
Smaller model for easy queries	↓ $/request
Higher batch utilization	↓ GPU_hours
RAG only when needed	↓ retrieval + context tokens
Output length limits	↓ output tokens

Capacity: estimate peak QPS, p95 context length, and concurrent agent steps; load-test the inference pool with synthetic prompts that match production token distributions—not average-only tests.

Security and compliance checklist

Secrets: API keys in vault/K8s secrets; never in prompts or logs.
Network: Private endpoints for model and vector DB; no public GPU admin ports.
Data residency: Model provider region, index location, and log storage aligned with policy.
Retention: Configurable prompt/log retention; PII redaction in traces.
Supply chain: Pin model artifacts, scan containers, sign OCI images for inference workers.
Abuse: Per-IP and per-tenant rate limits; anomaly detection on token burn.

Design patterns summary

Pattern	Summary
Gateway + router	One front door; route by task, tier, and SLO.
Multi-model pool	Right-size model per step (embed, rerank, chat, classify).
RAG with hybrid search	Dense + sparse retrieval, rerank, cite sources.
Agent orchestration	Loop over inference + tools; cap steps and cost.
Guardrail sandwich	Filter input and output; validate structured actions.
Observable by default	Trace retrieval, tokens, tools, and eval regressions.
Progressive complexity	Prompt + RAG first; fine-tune and self-host when economics justify ops.

How this connects to your other notes

Topic	Post
Agent loops, ReAct, MCP	AI Agentic Systems and MCP
Context budgets, compaction, long horizons	Context Engineering for AI Agents
Distributed trade-offs (availability vs consistency)	CAP Theorem

Summary

Area	Takeaway
Production LLM	A full stack: gateway, orchestration, inference, knowledge, guardrails, ops—not a single API call.
Inference	Streaming, batching, KV cache, and routing dominate latency and cost.
RAG	Ingest, retrieve, rerank, then generate with citations and refusal policies.
Multi-model	Embed, rerank, chat, moderate—specialized pools beat one giant model for everything.
Agents	Orchestration + tools on top of inference; control steps, memory, and risk.
Operations	Traces, evals, and token accounting are as important as model choice.

A solid LLM architecture picks the right model for each step, grounds answers when facts matter, contains agents with policy and budgets, and runs on infrastructure you can measure, scale, and rollback like any other critical service.

quyennv.com

LLM Architecture: Production Inference, RAG, Scaling, and Infrastructure Design

What “LLM architecture” means in production

High-level reference architecture

Inference pipeline: one request end-to-end

Synchronous vs streaming

Model roles in a multi-model stack

Model router

Serving layer: GPUs, batching, and efficiency

Deployment patterns

Key serving concepts

Scaling dimensions

RAG architecture (retrieval-augmented generation)

Ingestion pipeline (offline / async)

RAG failure modes (architecture responses)

Fine-tuning, prompting, and distillation

Agents and tools in the LLM stack

Safety, guardrails, and policy layer

Observability and evaluation

What to trace (one `trace_id` per user request)

Core metrics

Evaluation pipeline

Deployment topologies

Single region (baseline)

Multi-region / hybrid

Caching strategy

Cost and capacity planning

Security and compliance checklist

Design patterns summary

How this connects to your other notes

Summary

Comments

What “LLM architecture” means in production

High-level reference architecture

Inference pipeline: one request end-to-end

Synchronous vs streaming

Model roles in a multi-model stack

Model router

Serving layer: GPUs, batching, and efficiency

Deployment patterns

Key serving concepts

Scaling dimensions

RAG architecture (retrieval-augmented generation)

Ingestion pipeline (offline / async)

RAG failure modes (architecture responses)

Fine-tuning, prompting, and distillation

Agents and tools in the LLM stack

Safety, guardrails, and policy layer

Observability and evaluation

What to trace (one trace_id per user request)

Core metrics

Evaluation pipeline

Deployment topologies

Single region (baseline)

Multi-region / hybrid

Caching strategy

Cost and capacity planning

Security and compliance checklist

Design patterns summary

How this connects to your other notes

Summary

Comments

What to trace (one `trace_id` per user request)