quyennv.com

Senior DevOps Engineer · Healthcare, Finance

LLM Architecture: Production Inference, RAG, Scaling, and Infrastructure Design

#ai #LLM #architecture #infrastructure #RAG #MLOps #devops #system-design

Large Language Model (LLM) architecture is how you turn a foundation model into a reliable production system: routing requests, serving inference at scale, grounding answers with retrieval, enforcing safety, and operating the stack with observability and cost control. This document is an infrastructure-oriented reference that runs from the request path through RAG, agents, and deployment, and it complements the posts on agentic AI + MCP and context engineering.


What “LLM architecture” means in production

In research, an LLM is weights and a tokenizer. In production, it is a distributed system with:

| Layer | Responsibility |
| --- | --- |
| Experience | Chat UI, APIs, SDKs, streaming UX |
| Orchestration | Prompt assembly, tool loops, workflow state |
| Inference | Model serving, batching, GPU scheduling |
| Knowledge | RAG, vector DB, caches, connectors |
| Control | Auth, rate limits, routing, guardrails |
| Operations | Metrics, tracing, evals, cost accounting |

The design goal is predictable latency, quality, cost, and safety under real traffic—not maximum context window size alone.


High-level reference architecture

                         +------------------+
                         |  Clients / Apps  |
                         | (web, API, agent)|
                         +--------+---------+
                                  |
                                  v
                         +------------------+
                         |   API Gateway    |
                         | auth, rate limit |
                         | request routing  |
                         +--------+---------+
                                  |
            +---------------------+---------------------+
            |                     |                     |
            v                     v                     v
   +----------------+   +----------------+   +----------------+
   | Orchestration  |   |  RAG / Search  |   |  Tool / MCP    |
   | (workflows,    |   |  (retrieval,   |   |  servers       |
   |  agent loop)   |   |   rerank)      |   |                |
   +-------+--------+   +-------+--------+   +-------+--------+
           |                    |                    |
           +--------------------+--------------------+
                                |
                                v
                       +------------------+
                       |  Inference Layer |
                       |  (model router,  |
                       |   load balancer) |
                       +--------+---------+
                                |
              +-----------------+-----------------+
              |                 |                 |
              v                 v                 v
       +-------------+   +-------------+   +-------------+
       | LLM pool    |   | Embedding   |   | Reranker /  |
       | (chat, code)|   | model pool  |   | classifier  |
       +-------------+   +-------------+   +-------------+
              |                 |                 |
              v                 v                 v
       +--------------------------------------------------+
       | GPU / accelerator fleet (K8s, VM, managed API)   |
       +--------------------------------------------------+
                                |
                                v
                       +------------------+
                       | Observability    |
                       | logs, traces,    |
                       | evals, cost      |
                       +------------------+

| Component | Role |
| --- | --- |
| API gateway | Single entry: authentication, quotas, WAF, request IDs, model selection headers. |
| Orchestration | Builds prompts, runs agent loops, manages session state, calls tools. |
| RAG | Retrieves grounded context before or during generation. |
| Inference layer | Routes to the right model endpoint; handles retries, fallbacks, circuit breakers. |
| Model pools | Specialized models: chat, embeddings, reranking, moderation, vision. |
| Observability | End-to-end traces from user query → retrieval → tokens → response quality. |

Inference pipeline: one request end-to-end

A single chat completion in production is rarely “send text to model.” Typical stages:

    User message
         |
         v
    [1] Auth + policy check (tenant, scope, PII rules)
         |
         v
    [2] Session load (history, user prefs, durable notes)
         |
         v
    [3] Intent / routing (which model? need RAG? need tools?)
         |
         v
    [4] Context assembly (system prompt + history + retrieved docs)
         |
         v
    [5] Token budget trim (summarize, drop noise, compact old turns)
         |
         v
    [6] Inference call (streaming or batch)
         |
         v
    [7] Post-process (citations, JSON parse, moderation, logging)
         |
         v
    Response to client
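
Stages [4] and [5] can be sketched as a budget-aware context builder. This is a minimal illustration, not a definitive implementation: `Turn`, `count_tokens`, and `assemble_context` are hypothetical names, the whitespace word count stands in for a real tokenizer, and the priority order (system prompt, then retrieved docs in rank order, then the newest turns) is one assumed policy among several.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def assemble_context(system_prompt: str, history: list[Turn],
                     docs: list[str], budget: int) -> list[str]:
    """Keep the system prompt, then docs in rank order, then the newest turns that fit."""
    used = count_tokens(system_prompt)

    kept_docs: list[str] = []
    for doc in docs:  # docs are assumed to be sorted best-first
        cost = count_tokens(doc)
        if used + cost > budget:
            break
        kept_docs.append(doc)
        used += cost

    kept_turns: list[str] = []
    for turn in reversed(history):  # walk from newest to oldest
        line = f"{turn.role}: {turn.text}"
        cost = count_tokens(line)
        if used + cost > budget:
            break  # everything older than this turn is dropped
        kept_turns.append(line)
        used += cost

    # Restore chronological order for the turns that survived the trim.
    return [system_prompt] + kept_docs + list(reversed(kept_turns))
```

Dropping the oldest turns is the simplest trim policy; summarizing them instead would replace the second `break` with a call to a summarizer.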

Synchronous vs streaming

| Mode | Behavior | When to use |
| --- | --- | --- |
| Streaming (SSE) | Tokens arrive incrementally; TTFB matters. | Chat UX, coding assistants, long answers. |
| Batch / async | Full response or job ID later. | Summarization pipelines, bulk eval, offline jobs. |
| Non-streaming sync | Wait for complete response. | Short structured outputs, tool-call planning steps. |

Infrastructure note: Streaming changes timeout, load-balancer, and retry design. Retrying a half-delivered stream is harder than retrying a single JSON response—use idempotent session keys and client-side resume policies.
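
One way to make a half-delivered stream resumable is to buffer chunks under an idempotency key and let the client reconnect with the index of the next chunk it needs. The sketch below is an assumption-laden illustration: `StreamBuffer` and its method are hypothetical names, the buffer lives in process memory (a real deployment would likely use a shared store with expiry), and `generate` stands in for the inference call.

```python
from typing import Callable, Iterable, Iterator


class StreamBuffer:
    """Buffers generated chunks per idempotency key so a reconnecting client can resume."""

    def __init__(self) -> None:
        self._chunks: dict[str, list[str]] = {}
        self._sources: dict[str, Iterator[str]] = {}

    def stream(self, key: str, start: int,
               generate: Callable[[], Iterable[str]]) -> Iterator[tuple[int, str]]:
        """Yield (index, chunk) pairs from `start`; `generate` runs only once per key."""
        if key not in self._chunks:
            self._chunks[key] = []
            self._sources[key] = iter(generate())
        chunks = self._chunks[key]
        source = self._sources[key]
        index = start
        while True:
            if index < len(chunks):
                # Replay from the buffer instead of calling the model again.
                yield index, chunks[index]
                index += 1
                continue
            try:
                chunks.append(next(source))  # pull one more chunk from the model
            except StopIteration:
                return
```

A client that received chunks 0 and 1 before the connection dropped would reconnect with the same key and `start=2`, so the model is not invoked a second time.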


Model roles in a multi-model stack

Production systems rarely use one model for everything.

| Model type | Typical use | Serving profile |
| --- | --- | --- |
| Frontier chat | Complex reasoning, agents, coding | High VRAM, lower QPS, higher cost/token |
| Small / fast chat | Classification, routing, simple Q&A | High QPS, low latency, distillation target |
| Embedding | RAG indexing, semantic search | Batch-friendly, high throughput |
| Reranker | Precision on top-k retrieval | Small batches, CPU or GPU |
| Moderation / safety | Input/output policy | Low latency, often separate vendor or small model |
| Multimodal | Image, audio, document understanding | Heavier compute, different pre-processing |

Model router

A router picks the model (or path) per request:

    Incoming request
           |
           v
    +--------------+
    | Router       |
    | (rules +     |
    |  classifier) |
    +------+-------+
           |
     +-----+-----+-----+
     |           |     |
     v           v     v
  Fast SLM   Main LLM  Specialist
  (cheap)    (quality)  (code, legal)

Routing signals:

  • Explicit user/product tier (free vs pro model)
  • Heuristics (token count, language, attachment type)
  • Classifier model (“simple FAQ” vs “multi-step agent task”)
  • Cost/latency SLO and current queue depth
  • Fallback chain when primary pool is saturated or errors
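
The signals above can be combined in a small rule-based router with a fallback chain. This is a sketch under stated assumptions: the pool names (`fast-slm`, `main-llm`), the 2000-token threshold, and the queue-depth limit are all hypothetical, and a production router would typically add a classifier model on top of these rules.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tier: str            # "free" or "pro"
    prompt_tokens: int
    needs_tools: bool


# Hypothetical pools, each with its ordered fallback chain.
FALLBACKS = {
    "main-llm": ["main-llm", "fast-slm"],
    "fast-slm": ["fast-slm"],
}


def pick_primary(req: Request) -> str:
    """Rules only: product tier first, then simple heuristics."""
    if req.tier == "free":
        return "fast-slm"
    if req.needs_tools or req.prompt_tokens > 2000:
        return "main-llm"
    return "fast-slm"


def route(req: Request, queue_depth: dict[str, int], max_depth: int = 100) -> str:
    """Return the first pool in the fallback chain that is not saturated."""
    for pool in FALLBACKS[pick_primary(req)]:
        if queue_depth.get(pool, 0) < max_depth:
            return pool
    raise RuntimeError("all pools saturated; shed load at the gateway")
```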

Serving layer: GPUs, batching, and efficiency

Deployment patterns

| Pattern | Description | Trade-off |
| --- | --- | --- |
| Managed API | OpenAI, Azure OpenAI, Anthropic, Bedrock | Fastest to ship; less control over weights and colocation |
| Self-hosted (vLLM, TGI, TensorRT-LLM) | Models on your K8s/VM GPU fleet | Control, data residency, ops burden |
| Hybrid | Sensitive workloads self-hosted; burst to managed API | Complexity in routing and eval parity |

Key serving concepts

Continuous batching — The scheduler groups in-flight requests dynamically so GPUs stay utilized instead of waiting for fixed batch boundaries.

KV cache — Attention key/value tensors for prior tokens are cached so each new token does not recompute the full prefix. Long contexts and multi-turn chats are memory-heavy; cache sizing drives max concurrency per GPU.
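
To see why cache sizing drives concurrency, the arithmetic can be written out. The per-token size follows from storing one key and one value vector per layer and per KV head; the model shape and the 40 GiB of free VRAM below are illustrative assumptions, not measurements of any particular deployment, and the function names are hypothetical.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(free_vram_bytes: int, tokens_per_seq: int,
                             per_token_bytes: int) -> int:
    """How many sequences of a given length fit in the VRAM left after weights."""
    return free_vram_bytes // (tokens_per_seq * per_token_bytes)


# Assumed shape: 32 layers, 32 KV heads, head dimension 128, FP16 (2 bytes).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Assumed budget: 40 GiB free for cache, 4096-token sequences.
print(max_concurrent_sequences(40 * 2**30, 4096, per_token))  # 20
```

Halving the context length or the number of KV heads doubles the number of sequences that fit, which is why shorter contexts appear as a latency and concurrency knob.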

Speculative decoding — A small draft model proposes tokens; the large model verifies in parallel. Cuts latency for autoregressive generation when draft acceptance is high.

Quantization (INT8, FP8, GPTQ, AWQ) — Lower precision weights reduce VRAM and increase throughput with small quality trade-offs—common for embedding and mid-tier chat models.

    Request queue
         |
         v
    +------------------+
    | Scheduler        |
    | (batch, priority,|
    |  preemption)     |
    +--------+---------+
             |
             v
    +------------------+
    | GPU worker(s)    |
    | model weights    |
    | KV cache pools   |
    +------------------+
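
The benefit of continuous batching over fixed batch boundaries can be shown with a toy step counter. This is a deliberately simplified simulation, not how any real scheduler is implemented: each request is reduced to a positive number of decode steps, and prefill, KV memory limits, priorities, and preemption are ignored. Both function names are hypothetical.

```python
from collections import deque


def continuous_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Decode steps needed when freed slots are refilled after every step."""
    waiting = deque(lengths)
    running: list[int] = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step for every running sequence; finished ones leave.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


def static_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Decode steps needed when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


lengths = [8, 2, 2, 2]  # decode steps per request (illustrative)
print(static_batching_steps(lengths, max_batch=2))      # 10
print(continuous_batching_steps(lengths, max_batch=2))  # 8
```

In the static case the short request in the first batch leaves its slot idle for six steps; the continuous scheduler fills that slot with the next waiting request.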

Scaling dimensions

| Dimension | Knob |
| --- | --- |
| Throughput | More GPU replicas, better batching, smaller quantized models |
| Latency (p95) | Dedicated pools, shorter context, speculative decoding, regional edge |
| Concurrency | KV cache limits per replica; queue + backpressure at gateway |
| Cost | Model routing to SLMs, cache hits, prompt compression, batch offline work |

RAG architecture (retrieval-augmented generation)

RAG grounds the LLM on private or fresh data instead of relying only on training knowledge.

    User query
         |
         v
    +-------------+     +------------------+
    | Query       |---->| Embedding model  |
    | transform   |     | (query vector)   |
    +-------------+     +--------+---------+
         |                       |
         |                       v
         |              +------------------+
         |              | Vector DB /      |
         |              | hybrid search    |
         |              +--------+---------+
         |                       |
         v                       v
    +-------------+     Top-k chunks
    | Optional    |<--------------------+
    | reranker    |                     |
    +------+------+                     |
           |                            |
           v                            v
    +------------------------------------------+
    | Context builder (citations, chunk trim)  |
    +--------------------+---------------------+
                         |
                         v
                  +-------------+
                  | Main LLM    |
                  | (generate)  |
                  +-------------+

Ingestion pipeline (offline / async)

    Sources (docs, wiki, tickets, code)
              |
              v
    Extract -> Chunk -> Embed -> Index (vector + metadata)
              |
              v
    Optional: graph links, ACL tags, version stamps

| Stage | Design choices |
| --- | --- |
| Chunking | Fixed size vs semantic splits; overlap; respect document structure (headings, tables). |
| Indexing | Vector only vs hybrid (BM25 + dense); metadata filters (team, product, date). |
| Retrieval | top-k, MMR diversity, multi-query expansion, HyDE-style synthetic queries. |
| Reranking | Cross-encoder reranker on top-50 → top-5 for precision. |
| Generation | Require citations; refuse when retrieval score is below threshold. |
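
Two of these choices are easy to illustrate: fixed-size chunking with overlap, and merging a BM25 ranking with a dense ranking. The sketch below assumes word-based chunk sizes and uses reciprocal rank fusion as the merge step; that fusion method is an assumption chosen for illustration (the text only says "hybrid"), and `chunk_words` and `reciprocal_rank_fusion` are hypothetical names.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


print(chunk_words("a b c d e f g", size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g']

bm25 = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([bm25, dense]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both rankings rise to the top, which is the behavior hybrid search is meant to provide before a reranker narrows the set further.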

RAG failure modes (architecture responses)

| Failure | Mitigation |
| --- | --- |
| Wrong chunks retrieved | Better embeddings, hybrid search, reranker, query rewriting |
| Context overflow | Chunk budget, summarization of retrieved set, context engineering |
| Stale index | Incremental ingest, version tags in prompts, TTL on sources |
| Hallucination despite RAG | Citation-only answers, lower temperature, “I don’t know” policy |
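
The citation and refusal policies can be enforced before the model is called at all. The sketch below is illustrative only: `build_grounded_prompt` is a hypothetical helper, the 0.5 score threshold is an arbitrary assumption that would need tuning against the retriever in use, and the prompt wording is an example rather than a recommended template.

```python
from typing import Optional

Hit = tuple[str, float, str]  # (chunk_id, retrieval score, chunk text)


def build_grounded_prompt(question: str, hits: list[Hit],
                          min_score: float = 0.5,
                          max_chunks: int = 5) -> Optional[str]:
    """Return a citation-forcing prompt, or None to signal 'refuse, nothing relevant'."""
    usable = sorted(
        (hit for hit in hits if hit[1] >= min_score),
        key=lambda hit: hit[1],
        reverse=True,
    )[:max_chunks]
    if not usable:
        return None  # caller answers "I don't know" without calling the LLM
    sources = "\n".join(f"[{chunk_id}] {text}" for chunk_id, _, text in usable)
    return (
        "Answer using only the sources below and cite chunk IDs in brackets. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

Returning `None` keeps the refusal decision in application code, where it can be logged and evaluated, instead of relying on the model to decline.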

Fine-tuning, prompting, and distillation

| Approach | What changes | Best for |
| --- | --- | --- |
| Prompting + RAG | No weight updates | Fast iteration, changing policies, private data via retrieval |
| Fine-tuning (SFT, LoRA) | Adapter or full weights | Tone, format, domain jargon, tool-call style |
| RLHF / DPO / preference tuning | Alignment to human preferences | Safety, helpfulness, ranking quality |
| Distillation | Small model mimics large | Cost reduction at scale for routing or simple tasks |

Architecture guidance:

  • Start with routing + RAG + strong system prompts before fine-tuning.
  • Fine-tune for stable, repetitive patterns (JSON schema, support tone, internal acronyms).
  • Keep a golden eval set and regression tests before and after any weight change.
  • Version models like microservices: chat-v3, embed-v2, with shadow traffic and rollback.

Agents and tools in the LLM stack

An agent is orchestration on top of inference: the LLM plans, calls tools, and loops until done. Infrastructure additions:

    Orchestrator (your service)
         |
         +--> LLM inference (plan / act)
         |
         +--> Tool gateway (MCP, REST, SQL, code sandbox)
         |
         +--> Memory store (session, vector, key-value notes)
         |
         +--> Human approval gate (high-risk actions)

See AI Agentic Systems and MCP for ReAct, plan-and-execute, and MCP host/client/server design.

| Concern | Infrastructure pattern |
| --- | --- |
| Tool sprawl | Tool registry per role; dynamic tool loading; MCP servers by domain |
| Long sessions | Compaction, structured notes outside context window |
| Unsafe actions | Sandboxed code exec, allowlisted APIs, approval workflows |
| Cost explosion | Max steps per task, per-user budgets, cache tool results |
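
The cost controls in the last row can be sketched as a bounded agent loop. Everything here is a hypothetical illustration: `run_agent`, the `(action, argument, tokens_used)` step shape, and the `"finish"` action are assumptions made for the example, and `plan_step` stands in for an LLM call that decides the next action from prior observations.

```python
from typing import Callable

Step = tuple[str, str, int]  # (action, argument, tokens used by this planning call)


def run_agent(
    plan_step: Callable[[list[str]], Step],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
    token_budget: int = 1000,
) -> dict:
    """Loop plan -> act until 'finish', the step cap, or the token budget is hit."""
    observations: list[str] = []
    cache: dict[tuple[str, str], str] = {}
    spent = 0
    for _ in range(max_steps):
        action, arg, tokens = plan_step(observations)
        spent += tokens
        if spent > token_budget:
            return {"status": "budget_exceeded", "tokens": spent}
        if action == "finish":
            return {"status": "done", "answer": arg, "tokens": spent}
        if action not in tools:
            # Only registered tools may run; tell the planner and move on.
            observations.append(f"error: tool {action!r} is not allowlisted")
            continue
        key = (action, arg)
        if key not in cache:  # reuse results of identical read-only calls
            cache[key] = tools[action](arg)
        observations.append(cache[key])
    return {"status": "max_steps", "tokens": spent}
```

The tool cache here is only safe for idempotent, read-only tools; actions with side effects would bypass it and pass through an approval gate instead.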

Safety, guardrails, and policy layer

Guardrails sit around inference, not only inside the model.

    Input --> [PII detect] --> [Prompt injection filter] --> LLM
                                                              |
    Output <-- [moderation] <-- [policy / schema validate] <--+

| Control | Implementation |
| --- | --- |
| Input moderation | Block jailbreaks, toxic content, secrets in prompts |
| Output moderation | Category scores, block or rewrite |
| Structured output | JSON schema, grammar constraints (GBNF), function-call validators |
| Tenant isolation | Separate indexes, API keys, and rate limits per customer |
| Audit | Immutable logs of prompts, retrieval IDs, tool calls (with retention policy) |
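
A stripped-down version of this input/output sandwich might look like the following. It is a sketch, not a complete guardrail: the email regex covers only one kind of PII, the field-and-type check is a stand-in for a full JSON Schema validator, and `redact_input`, `validate_output`, and `guarded_call` are hypothetical names. `model` is any callable that takes a prompt string and returns raw text.

```python
import json
import re
from typing import Callable, Optional

# Simplified pattern; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_input(text: str) -> str:
    """Input side: mask email addresses before the prompt reaches the model."""
    return EMAIL.sub("[EMAIL]", text)


def validate_output(raw: str, required: dict[str, type]) -> Optional[dict]:
    """Output side: accept only a JSON object with the required typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in required.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data


def guarded_call(user_text: str, model: Callable[[str], str],
                 required: dict[str, type]) -> Optional[dict]:
    """Filter the input, call the model, validate the output; None means blocked."""
    return validate_output(model(redact_input(user_text)), required)
```

A `None` result gives the application a single place to retry, fall back, or return a safe error, and to record the policy hit for audit.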

Defense in depth: gateway rules + model moderation + application-level validation (especially for finance, healthcare, and code execution).


Observability and evaluation

LLM systems need product metrics and model metrics.

What to trace (one trace_id per user request)

| Span | Data |
| --- | --- |
| gateway | user, model route, latency |
| retrieval | query, chunk IDs, scores |
| inference | model version, input/output tokens, finish reason |
| tools | tool name, args hash, latency, success |
| guardrails | policy hits, blocked reason |
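
The span table maps naturally onto a small context manager that stamps every record with the same trace ID. This is a toy recorder for illustration only; `Trace` and its fields are hypothetical, and a production system would more likely emit spans through a tracing library than keep them in a list.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects span records that all share one trace_id for a user request."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list[dict] = []

    @contextmanager
    def span(self, name: str, **attrs):
        """Time a block of work and record it with any attributes set inside."""
        start = time.perf_counter()
        record = {"trace_id": self.trace_id, "span": name, **attrs}
        try:
            yield record  # the caller can add fields such as chunk IDs or token counts
        finally:
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)


trace = Trace()
with trace.span("retrieval", query="refund policy") as span:
    span["chunk_ids"] = ["c1", "c3"]
with trace.span("inference", model_version="chat-v3") as span:
    span["output_tokens"] = 42
```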

Core metrics

| Metric | Why it matters |
| --- | --- |
| TTFB / TTLT (time to first byte / time to last token) | Streaming perceived speed |
| Tokens in/out | Direct cost driver |
| Error rate by model | Routing and fallback health |
| Retrieval hit rate / MRR | RAG quality |
| Task success rate | Agent completion (human or automated eval) |
| Hallucination / citation rate | Grounded answer quality |
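
For the retrieval metrics, mean reciprocal rank (MRR) and recall@k can be computed directly from logged chunk IDs and a labeled set of relevant IDs. The function names and the sample data below are illustrative assumptions.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


runs = [
    (["a", "b", "c"], {"b"}),  # first relevant hit at rank 2 -> 0.5
    (["x", "y"], {"x"}),       # rank 1 -> 1.0
    (["p"], {"q"}),            # nothing relevant retrieved -> 0.0
]
print(mean_reciprocal_rank(runs))                 # 0.5
print(recall_at_k(["a", "b", "c"], {"b", "z"}, 2))  # 0.5
```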

Evaluation pipeline

    Golden datasets (per use case)
              |
              v
    Offline eval (CI on model / prompt changes)
              |
              v
    Online shadow / A-B (small % traffic)
              |
              v
    Production monitors + human review queue

Automate evals for: exact match on structured tasks, LLM-as-judge (with caution), retrieval recall@k, toxicity regression, and latency/cost budgets.


Deployment topologies

Single region (baseline)

    Users --> LB --> App tier --> Inference (GPU node pool) --> Vector DB
                      |
                      +--> Object storage (docs, artifacts)

Multi-region / hybrid

| Topology | Use case |
| --- | --- |
| Read replicas + central GPU | Global users, inference in one compliant region |
| Regional inference | Strict data residency (EU-only, etc.) |
| Edge gateway + central RAG | Low-latency first token; retrieval from regional cache |
Data plane vs control plane: Keep config, evals, and routing rules in a control plane; replicate indexes and model endpoints per region as required by compliance and latency.


Caching strategy

| Cache | What | Benefit |
| --- | --- | --- |
| Prompt / prefix cache | Identical system prompts, tool schemas | Lower cost and latency (provider or vLLM prefix caching) |
| Semantic cache | Similar user queries → prior answers | Cost savings on FAQ-style traffic |
| Embedding cache | Chunk and query vectors | Faster RAG retrieval |
| Tool result cache | Idempotent read-only API responses | Fewer agent steps and external calls |

Invalidate on: document updates, model version change, policy change, or TTL expiry.
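
Most of these invalidation triggers can be handled by putting version stamps into the cache key and adding a TTL. The sketch below is an exact-match response cache (not a semantic cache), with hypothetical names throughout: `cache_key`, `TTLCache`, and the version strings are assumptions, and the in-memory dict stands in for whatever shared cache a deployment uses.

```python
import hashlib
import time
from typing import Callable, Optional


def cache_key(prompt: str, model_version: str,
              index_version: str, policy_version: str) -> str:
    """Any change to model, index, or policy yields a new key, so old entries are never hit."""
    raw = "\x1f".join([model_version, index_version, policy_version, prompt])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._items[key] = (self._clock(), value)
```

Bumping `index_version` after a document update makes every affected entry unreachable without an explicit purge; the TTL then bounds how long the stale entries occupy memory.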


Cost and capacity planning

Rough planning variables:

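    Monthly cost ≈
      (avg_input_tokens × price_in + avg_output_tokens × price_out) × requests
      + GPU_hours × $/GPUhr (if self-hosted)
      + vector_db + storage + egress

Here avg_input_tokens and avg_output_tokens are per-request averages, price_in and price_out are per-token prices, and requests is the monthly request count.

| Lever | Effect |
| --- | --- |
| Shorter prompts / compaction | ↓ input tokens |
| Smaller model for easy queries | ↓ $/request |
| Higher batch utilization | ↓ GPU_hours |
| RAG only when needed | ↓ retrieval + context tokens |
| Output length limits | ↓ output tokens |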
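
The planning formula above translates into a small calculator, reading the token counts as per-request averages. All numbers in the example are made-up placeholders, not real vendor prices; prices are expressed per million tokens because that is how they are commonly quoted, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(
    requests: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    price_in_per_million: float,
    price_out_per_million: float,
    gpu_hours: float = 0.0,
    gpu_hourly_rate: float = 0.0,
    fixed_costs: float = 0.0,
) -> float:
    """Token spend plus self-hosted GPU time plus fixed items (vector DB, storage, egress)."""
    per_request = (
        avg_input_tokens * price_in_per_million
        + avg_output_tokens * price_out_per_million
    ) / 1_000_000
    return requests * per_request + gpu_hours * gpu_hourly_rate + fixed_costs


# Placeholder inputs: 1M requests, 1500 tokens in, 300 tokens out per request.
baseline = monthly_cost(1_000_000, 1500, 300, 1.0, 4.0, fixed_costs=500.0)
compacted = monthly_cost(1_000_000, 900, 300, 1.0, 4.0, fixed_costs=500.0)
print(baseline, compacted)  # roughly 3200 vs 2600 with these placeholder numbers
```

Running each lever through the same function makes it easier to compare, for example, prompt compaction against an output length limit before committing engineering time.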

Capacity: estimate peak QPS, p95 context length, and concurrent agent steps; load-test the inference pool with synthetic prompts that match production token distributions—not average-only tests.


Security and compliance checklist

  • Secrets: API keys in vault/K8s secrets; never in prompts or logs.
  • Network: Private endpoints for model and vector DB; no public GPU admin ports.
  • Data residency: Model provider region, index location, and log storage aligned with policy.
  • Retention: Configurable prompt/log retention; PII redaction in traces.
  • Supply chain: Pin model artifacts, scan containers, sign OCI images for inference workers.
  • Abuse: Per-IP and per-tenant rate limits; anomaly detection on token burn.
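
The per-tenant rate limit in the last item is often built on a token bucket. The version below is a single-process sketch with hypothetical names (`TokenBucket`, `TenantLimiter`); a gateway running several replicas would need shared state, which is out of scope here. Passing the request's LLM token count as `cost` turns the same mechanism into a limit on token burn.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantLimiter:
    """One independent bucket per tenant ID, created on first use."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow(cost)
```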

Design patterns summary

| Pattern | Summary |
| --- | --- |
| Gateway + router | One front door; route by task, tier, and SLO. |
| Multi-model pool | Right-size model per step (embed, rerank, chat, classify). |
| RAG with hybrid search | Dense + sparse retrieval, rerank, cite sources. |
| Agent orchestration | Loop over inference + tools; cap steps and cost. |
| Guardrail sandwich | Filter input and output; validate structured actions. |
| Observable by default | Trace retrieval, tokens, tools, and eval regressions. |
| Progressive complexity | Prompt + RAG first; fine-tune and self-host when economics justify ops. |

How this connects to other notes

| Topic | Post |
| --- | --- |
| Agent loops, ReAct, MCP | AI Agentic Systems and MCP |
| Context budgets, compaction, long horizons | Context Engineering for AI Agents |
| Distributed trade-offs (availability vs consistency) | CAP Theorem |

Summary

| Area | Takeaway |
| --- | --- |
| Production LLM | A full stack (gateway, orchestration, inference, knowledge, guardrails, ops), not a single API call. |
| Inference | Streaming, batching, KV cache, and routing dominate latency and cost. |
| RAG | Ingest, retrieve, rerank, then generate with citations and refusal policies. |
| Multi-model | Embed, rerank, chat, moderate: specialized pools beat one giant model for everything. |
| Agents | Orchestration + tools on top of inference; control steps, memory, and risk. |
| Operations | Traces, evals, and token accounting are as important as model choice. |

A solid LLM architecture picks the right model for each step, grounds answers when facts matter, contains agents with policy and budgets, and runs on infrastructure you can measure, scale, and roll back like any other critical service.
