Context Engineering for AI Agents: Summary and Deep Dive
#ai #agents #context-engineering #prompting #mcp #mcl #llm-systems
This post explains context engineering—a framework for curating the limited information budget (“context”) that an LLM receives during inference—so an AI agent behaves reliably over short and long horizons.
Reference: Anthropic’s article, Effective context engineering for AI agents (link).
TL;DR (quick summary)
- Context is finite. As the number of tokens in the context grows, effective recall and reasoning precision degrade (often described as context rot). Treat context as an attention budget rather than assuming “more is better.”
- Context engineering evolves prompt engineering. Prompt engineering focuses on instructions; context engineering optimizes what enters the model (instructions, tools, retrieved data, message history, examples, etc.).
- Curation beats stuffing. Use clear system prompts, small high-signal tool sets, and curated examples instead of exhaustive edge-case lists.
- Retrieve just-in-time. Keep lightweight identifiers and load only the needed details at runtime (“progressive disclosure”).
- For long tasks, use dedicated techniques:
  - Compaction: summarize old interactions without losing key decisions.
  - Structured note-taking: persist durable state outside the context window and re-inject when needed.
  - Multi-agent / sub-agent architectures: isolate deep exploration so the main agent stays focused.
1. What is context engineering?
In most agentic systems, each step (or “turn”) involves:
- system instructions,
- user goal and conversation history,
- tool descriptions and tool call results,
- retrieved external documents or code,
- any additional metadata used by the agent/host runtime.
Context engineering is the practice of selecting and shaping the smallest possible set of high-signal tokens that maximizes the probability of achieving the desired behavior at each step.
Anthropic frames it as: curate the configuration of context that is most likely to generate the desired model behavior (link).
2. Why “more context” can make agents worse
2.1 Context window vs attention budget
Even if your model supports a large context window, the model’s compute and attention patterns effectively impose a budget on what it can use reliably.
As context length grows:
- the model must allocate more attention across more tokens,
- retrieval quality for “needle-in-a-haystack” items tends to degrade,
- long-range dependencies can become less precise,
- the agent becomes more vulnerable to confusion and irrelevant reasoning.
This is commonly described as context rot: performance degrades as the amount of context increases, even if the context is “correct”. (link)
2.2 Practical consequence for system design
Instead of “stuff everything the agent touched,” aim for:
- minimal instructions needed to keep behavior aligned,
- minimal tool inventory needed to act effectively,
- minimal retrieved content needed for the next decision,
- selective retention of message history (recent + important decisions).
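One way to make the "attention budget" idea concrete is to assemble each turn's context under an explicit token cap. The sketch below is illustrative only: `ContextItem`, the priority scale, and the 4-characters-per-token estimate are assumptions made for this example, not something the Anthropic article prescribes.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    label: str     # e.g. "system", "decision", "raw_log"
    text: str
    priority: int  # lower number = more important (hypothetical scale)

def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def assemble_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Greedily keep the most important items that still fit in the budget."""
    kept, used = [], 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            kept.append(item)
            used += cost
    return kept

items = [
    ContextItem("system", "You are a code review agent.", 0),
    ContextItem("raw_log", "x" * 4000, 3),                     # large, low-signal
    ContextItem("decision", "We chose SQLite for storage.", 1),
]
selected = assemble_context(items, budget=50)
print([i.label for i in selected])  # the oversized raw log is left out
```

A real system would use the model's tokenizer and a smarter policy than greedy selection, but the shape stays the same: every item has to earn its place in the budget.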
3. Prompt engineering vs context engineering
Prompt engineering: how you write prompts/instructions (especially system prompts) to get good outputs in a single inference.
Context engineering: how you manage the evolving state sent into the model across a loop of reasoning + tool use + observations.
In an agent loop, the agent repeatedly decides:
- What should I do next?
- Which external information should I load now?
- Which past facts are still relevant?
- Which details are noise?
Context engineering is the discipline behind those decisions (link).
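The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `decide` stands in for a model call, the tool names are invented, and "keep the last three facts" is an arbitrary curation rule chosen for brevity.

```python
from typing import Callable

def run_agent(
    goal: str,
    decide: Callable[[list[str]], dict],
    tools: dict[str, Callable[[str], str]],
    max_turns: int = 5,
) -> list[str]:
    """Rebuild the context every turn from curated state, not the full trace."""
    facts: list[str] = []  # curated observations that may still matter
    for _ in range(max_turns):
        context = [f"GOAL: {goal}", *facts[-3:]]  # only recent facts are sent
        action = decide(context)                  # stand-in for a model call
        if action["tool"] == "finish":
            break
        observation = tools[action["tool"]](action["arg"])
        facts.append(f"{action['tool']}({action['arg']}) -> {observation}")
    return facts

def scripted_decide(context: list[str]) -> dict:
    """Hypothetical policy: list files, read one, then stop."""
    if len(context) == 1:
        return {"tool": "list_files", "arg": "src"}
    if len(context) == 2:
        return {"tool": "read_head", "arg": "src/app.py"}
    return {"tool": "finish", "arg": ""}

tools = {
    "list_files": lambda directory: "app.py, util.py",
    "read_head": lambda path: "import flask",
}
facts = run_agent("find the web framework", scripted_decide, tools)
print(facts)
```

The point of the sketch is the `context = ...` line: what the model sees is recomputed on every turn, which is where the curation decisions listed above actually take effect.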
4. The anatomy of effective context (what to curate)
Think of context as several layers, each with a different curation strategy.
4.1 System prompts: “right altitude”, not brittle micromanagement
Anthropic emphasizes a Goldilocks zone:
- too brittle: hard-coding complex if/else behavior in the prompt,
- too vague: high-level guidance that gives the model no concrete signals, or that wrongly assumes shared context.
The goal is a prompt that:
- is extremely clear,
- uses direct language,
- describes behavior “at the right altitude” (not too low-level, not too hand-wavy),
- is structured into sections (e.g. background, instructions, tool guidance, output format).
Practical pattern:
<background_information>
…what this agent is and what it can assume…
</background_information>
<instructions>
…how to decide and act…
</instructions>
## Tool guidance
…what tools exist and when to use which…
## Output description
…exact formatting, completeness rules, and failure modes…
Even if prompt formatting is flexible, the information shape matters: separation, clarity, and constrained outputs.
4.2 Tools: curate for efficiency and unambiguous selection
Tools define the agent’s action space. A bloated or overlapping tool set creates ambiguity:
- the model can’t confidently decide “which tool should be used,”
- the agent might call the wrong tool,
- tool results add more context noise.
Curate tools with these goals:
- self-contained and robust: each tool does one job well,
- descriptive inputs/outputs with a clear contract,
- minimal overlap in functionality,
- token-efficient results (return summaries first, raw data only when needed).
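As a rough illustration of these goals, here is a single-purpose tool with an explicit input contract and a summary-first result. Everything in it is hypothetical: the tool name `search_build_log`, the in-memory `LOGS` data, and the generic JSON-schema-style description are assumptions for the example, not any specific vendor's tool format.

```python
import json

# Hypothetical in-memory data standing in for a real log store.
LOGS = {
    "build-42": ["ok: compile", "ok: lint", "FAIL: test_login timeout", "ok: package"],
}

# Illustrative tool description; the field names are assumptions.
SEARCH_BUILD_LOG_TOOL = {
    "name": "search_build_log",
    "description": (
        "Return a short summary of failures in one build log. "
        "Set include_raw=true only when the summary is not enough."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "build_id": {"type": "string"},
            "include_raw": {"type": "boolean", "default": False},
        },
        "required": ["build_id"],
    },
}

def search_build_log(build_id: str, include_raw: bool = False) -> str:
    """Summary first; raw lines only when the caller explicitly asks."""
    lines = LOGS.get(build_id)
    if lines is None:
        return json.dumps({"error": f"unknown build_id: {build_id}"})
    failures = [line for line in lines if line.startswith("FAIL")]
    result = {"build_id": build_id, "total_lines": len(lines), "failures": failures}
    if include_raw:
        result["raw"] = lines
    return json.dumps(result)

print(search_build_log("build-42"))
```

The default call costs a handful of tokens regardless of log size; the raw payload enters the context only when the agent decides it needs it.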
4.3 Examples (few-shot): canonical examples beat laundry lists
Instead of trying to encode every edge case in a giant prompt:
- provide diverse, canonical examples that illustrate the expected behavior,
- let the model generalize rather than memorize exceptions.
For agents, examples are not only “how to answer”—they’re also “how to act” (how to choose tools, how to interpret tool outputs, how to handle partial failures).
4.4 Message history: selective retention
Long agent traces include:
- repeated intermediate steps,
- raw tool outputs,
- failed attempts and retries.
If you keep everything, you increase context rot risk and reduce decision quality.
Instead, keep:
- the latest step observations needed for immediate decisions,
- “decision anchors” (architecture choices, assumptions, discovered constraints),
- compact summaries of tool outcomes rather than raw logs.
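A minimal sketch of this retention policy follows. The `Message` type and the `is_anchor` flag are assumptions for the example, and keeping only the first line of an old tool result is a deliberately crude stand-in for a real summary.

```python
from dataclasses import dataclass, replace

@dataclass
class Message:
    role: str                # "user", "assistant", or "tool"
    content: str
    is_anchor: bool = False  # hypothetical flag for a decision worth keeping

def prune_history(history: list[Message], keep_recent: int = 3) -> list[Message]:
    """Keep recent messages verbatim, anchors always, old tool output as a stub."""
    recent_start = max(0, len(history) - keep_recent)
    pruned = []
    for idx, msg in enumerate(history):
        if idx >= recent_start or msg.is_anchor:
            pruned.append(msg)
        elif msg.role == "tool":
            # First line only: a placeholder for a proper summary.
            first_line = msg.content.splitlines()[0] if msg.content else ""
            pruned.append(replace(msg, content=first_line))
        # Older non-anchor user/assistant messages are dropped entirely.
    return pruned

history = [
    Message("user", "Migrate the service to Postgres.", is_anchor=True),
    Message("tool", "schema dump ok\n" + "row data\n" * 200),
    Message("assistant", "Trying approach A."),
    Message("assistant", "Decision: use pgloader for the migration.", is_anchor=True),
    Message("tool", "pgloader exit 0\nlog line\nlog line"),
    Message("assistant", "Next: verify row counts."),
]
pruned = prune_history(history, keep_recent=2)
for m in pruned:
    print(m.role, "|", m.content.splitlines()[0])
```

How anchors get flagged is left open here; in practice the agent or the host runtime would need some rule for marking them.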
5. Context retrieval and agentic search (just-in-time)
Many systems use an embedding or indexing step before inference:
- retrieve relevant documents/snippets,
- inject them into context,
- ask the model to answer.
Anthropic describes a shift towards just-in-time context:
- keep lightweight identifiers (file paths, stored query results, URLs),
- retrieve/load the needed details during the loop using tools,
- refine behavior progressively as new observations come in.
This supports:
- progressive disclosure: the agent discovers what matters at each step,
- smaller, fresher context slices,
- less reliance on pre-built indexes that can go stale.
5.1 Why just-in-time works well for agents
Agents behave like planners/explorers:
- they start with a high-level goal,
- they run actions to uncover new facts,
- those facts determine the next retrieval.
Just-in-time aligns retrieval with the agent’s evolving plan, avoiding “preload everything” failure modes.
5.2 Hybrid strategy
In practice, a hybrid approach often wins:
- preload a small amount of high-confidence context for speed,
- retrieve additional details on demand as the agent commits to a subtask.
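The just-in-time and hybrid ideas can be sketched together: preload one small file in full, expose everything else as lightweight references, and let the agent pull slices on demand. The function names and the file layout are invented for this example, and the sketch skips concerns a real tool would need, such as path validation.

```python
import tempfile
from pathlib import Path

def build_initial_context(root: Path, preload: set[str]) -> dict:
    """Preload a few named files in full; expose the rest as paths only."""
    context = {"preloaded": {}, "references": []}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel in preload:
            context["preloaded"][rel] = path.read_text()
        else:
            context["references"].append({"path": rel, "bytes": path.stat().st_size})
    return context

def read_file_slice(root: Path, rel_path: str, start: int = 0, max_lines: int = 20) -> str:
    """Tool the agent would call at runtime to load part of one file."""
    lines = (root / rel_path).read_text().splitlines()
    return "\n".join(lines[start:start + max_lines])

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "README.md").write_text("Project overview\n")
    (root / "big_module.py").write_text("\n".join(f"line {n}" for n in range(1000)))
    ctx = build_initial_context(root, preload={"README.md"})
    snippet = read_file_slice(root, "big_module.py", start=10, max_lines=3)

print(ctx["references"])  # path and size only, no file contents
print(snippet)
```

The 1000-line file costs one short reference in the initial context, and only the three requested lines enter the context later.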
6. Long-horizon tasks: keeping coherence over time
Long tasks (tens of minutes or hours) introduce context limitations:
- you can’t include the whole history,
- the agent needs coherent reasoning after context resets.
Anthropic outlines three techniques (link):
6.1 Compaction
Compaction takes an interaction trace that is nearing the context window limit and summarizes it into a shorter, faithful representation that seeds a fresh context:
- preserve architectural decisions,
- preserve unresolved issues / open questions,
- preserve important constraints and implementation details,
- discard redundant tool logs and repeated observations.
Compaction is best treated as a controlled transformation:
- maximize recall first (don’t lose important facts),
- then iterate towards precision (remove noise).
Key principle: compaction prompts must be tuned to the agent’s domain so the resulting summaries neither hallucinate details nor omit subtle constraints.
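A compaction step might look like the sketch below. It is an assumption-heavy illustration: the prompt wording is invented, `summarize` is a pluggable callable standing in for a model call, and `fake_summarize` exists only so the example runs without a model.

```python
from typing import Callable

# Illustrative prompt; a real one would be tuned to the agent's domain.
COMPACTION_PROMPT = (
    "Summarize the conversation so far for a coding agent. Preserve "
    "architectural decisions, unresolved issues, constraints, and key "
    "implementation details. Drop redundant tool logs and repeated observations."
)

def compact(
    history: list[str],
    summarize: Callable[[str, str], str],
    keep_recent: int = 2,
) -> list[str]:
    """Replace all but the most recent messages with one summary message."""
    if len(history) <= keep_recent:
        return list(history)
    cut = len(history) - keep_recent
    old, recent = history[:cut], history[cut:]
    summary = summarize(COMPACTION_PROMPT, "\n".join(old))
    return [f"[summary of {len(old)} earlier messages]\n{summary}"] + recent

def fake_summarize(prompt: str, transcript: str) -> str:
    """Stand-in for a model call: keeps lines tagged as decisions or open issues."""
    keep = [l for l in transcript.splitlines() if l.startswith(("DECISION", "OPEN"))]
    return "\n".join(keep)

history = [
    "DECISION: use SQLite",
    "tool: 500 lines of output",
    "OPEN: flaky test_login",
    "tool: retry output",
    "assistant: running tests again",
]
compacted = compact(history, fake_summarize, keep_recent=2)
print("\n---\n".join(compacted))
```

Keeping the last few messages verbatim is one possible design choice; it lets the agent resume from its most recent observations while older material survives only through the summary.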
6.2 Structured note-taking (agentic memory)
Structured note-taking externalizes stable state:
- the agent writes “NOTES” or updates a memory store,
- later steps pull only the relevant notes back into context.
This reduces context overhead while keeping durable continuity.
Implementation idea:
- store notes as small records: {task_id, status, key_decisions, assumptions, blockers, next_steps},
- re-inject only notes whose task_id matches the current subtask.
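The record shape above maps naturally onto a small persisted store. The sketch below assumes a single JSON file as the backing store and a hypothetical `NoteStore` class; any durable storage outside the context window would serve the same purpose.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional

@dataclass
class Note:
    task_id: str
    status: str
    key_decisions: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

class NoteStore:
    """Keeps notes in a JSON file, keyed by task_id (illustrative only)."""

    def __init__(self, path: Path):
        self.path = path

    def _load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def upsert(self, note: Note) -> None:
        notes = self._load()
        notes[note.task_id] = asdict(note)
        self.path.write_text(json.dumps(notes, indent=2))

    def for_task(self, task_id: str) -> Optional[Note]:
        """Return only the note for the current subtask, if one exists."""
        record = self._load().get(task_id)
        return Note(**record) if record else None

with tempfile.TemporaryDirectory() as tmp:
    store = NoteStore(Path(tmp) / "notes.json")
    store.upsert(Note("auth-refactor", "in_progress",
                      key_decisions=["keep JWT"], next_steps=["update middleware"]))
    store.upsert(Note("db-migration", "blocked", blockers=["waiting on schema review"]))
    note = store.for_task("auth-refactor")
    missing = store.for_task("unknown")

print(note)
```

Only the matching record is re-injected, so notes for unrelated subtasks never reach the context.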
6.3 Multi-agent / sub-agent architectures
Instead of one agent doing everything with one huge context:
- a main agent creates a plan and orchestrates,
- specialized sub-agents do deep work with isolated context,
- each sub-agent returns a distilled summary back to the main agent.
This approach compartmentalizes context rot:
- deep exploration context stays in the sub-agent,
- the main agent keeps a compact view of what matters.
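The isolation property can be shown with a toy example. `SubAgent.run` below fakes fifty steps of work so the example runs without a model; the class names and the summary text are placeholders, not a real orchestration API.

```python
from dataclasses import dataclass, field

@dataclass
class SubAgent:
    name: str
    scratch: list[str] = field(default_factory=list)  # private working context

    def run(self, task: str) -> str:
        # Stand-in for many model/tool turns; all detail stays in self.scratch.
        for step in range(1, 51):
            self.scratch.append(f"{self.name} step {step} on '{task}': raw observation")
        return f"{self.name}: finished '{task}' after {len(self.scratch)} steps."

@dataclass
class Orchestrator:
    context: list[str] = field(default_factory=list)  # compact view only

    def delegate(self, agent: SubAgent, task: str) -> None:
        # Only the distilled summary crosses the boundary.
        self.context.append(agent.run(task))

searcher = SubAgent("search-agent")
coder = SubAgent("coding-agent")
main = Orchestrator()
main.delegate(searcher, "find usages of parse_config")
main.delegate(coder, "rename parse_config")

print(len(searcher.scratch), len(coder.scratch), len(main.context))
print(main.context)
```

Each sub-agent accumulates fifty entries in its own scratch space, while the orchestrator's context grows by one line per delegated task.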
6.4 Architecture sketch: long-horizon agent system
                        +----------------------+
User/Goal ------------> |  Orchestrator Agent  |
                        | (keeps compact state)|
                        +-----------+----------+
                                    |
                                    | plan + task routing
                                    v
                    +--------------+-------------+
                    | Sub-agent A  | Sub-agent B |
                    | (deep search)| (coding...) |
                    +-------+------+------^------+
                            |             |
                   distilled summaries    |
                            v             |
                        +----------------------+
                        |  Compaction / Notes  |
                        |   (agentic memory)   |
                        +----------------------+
                                    |
                                    v
                        next actions / decisions
7. Design checklist: how to apply context engineering
7.1 At the system prompt level
- Keep system instructions clear and structured.
- Avoid fragile prompt logic that tries to enumerate everything.
- Specify output format and tool usage rules.
7.2 At the tool level
- Minimize tool inventory.
- Ensure tool names and input schemas make intent obvious.
- Return summaries first; provide raw data only when needed.
7.3 At the retrieval level
- Prefer just-in-time retrieval using tool calls.
- Use progressive disclosure: load details only for the next decision.
- Use a hybrid preload/retrieve strategy if you need speed.
7.4 At the long-horizon level
- Use compaction to reset context with high-fidelity summaries.
- Use structured notes for durable state.
- Use multi-agent/sub-agent design when deep exploration is required.
8. Common pitfalls
- Tool bloat: adding more tools increases ambiguity and token noise.
- Raw log injection: dumping full tool outputs into context reduces signal quality.
- Unbounded memory: keeping every message in context indefinitely makes context rot all but inevitable.
- Aggressive compaction: summarizing too hard can delete critical constraints.
- No failure-mode rules: agents can waste context by repeatedly exploring dead ends.
Conclusion
Context engineering is the discipline of managing the agent’s limited “attention budget.” Instead of focusing only on prompt wording, you continuously curate:
- system instructions,
- tools and tool outputs,
- examples and message history,
- and runtime-retrieved context.
For short tasks, curation and just-in-time retrieval improve accuracy. For long tasks, compaction, structured note-taking, and multi-agent architectures maintain coherence while avoiding context rot.
Reference: Anthropic – Effective context engineering for AI agents.