You ask your coding agent to add a new endpoint. It churns for a minute and produces beautiful, idiomatic code that compiles on the first try. You merge it. Then you deploy, and three services fall over. The agent ignored your authentication layer, bypassed the validation patterns every other handler uses, and pulled in a dependency that conflicts with your stack.
None of that was in your prompt. You assumed the agent would just know — it’s been trained on millions of repositories, after all. But it had never seen your repository, your conventions, or your production constraints. The prompt was fine. The context was the problem.
This is the realisation that reorganised how serious teams work with coding agents in 2026: the bottleneck is no longer the model’s ability to write code. Claude Code, Cursor, and Copilot can all generate code fluently. Their ceiling is the quality, relevance, and durability of the context they receive. Managing that context is an engineering discipline in its own right, and it has a name.
What context engineering is
Anthropic formalised the term in September 2025, and the definition that’s stuck is deliberately economical: context engineering is finding “the smallest possible set of high-signal tokens that maximise the likelihood of some desired outcome.” The academic framing (Mohsenimofidi et al., October 2025) is broader — “the deliberate process of designing, structuring, and providing relevant information to LLMs” — and by early 2026 Gartner had its own enterprise version. They all point at the same thing.
The cleanest contrast is with the discipline it replaced:
- Prompt engineering optimises the phrasing of a single request. It’s a writing skill.
- Context engineering optimises the entire information environment the model sees across a multi-step session — system instructions, tool definitions, retrieved code, memory, prior tool outputs, and the task itself. It’s a systems engineering discipline.
The punchline that makes the distinction land: a perfect prompt inside a bloated, irrelevant context still produces mediocre output, while a mediocre prompt inside a surgically curated context often produces great output. Once you’ve seen that a few times, you stop polishing prompts and start engineering context.
The mental model: context is RAM
Andrej Karpathy, who helped popularise the term in mid-2025, offers the analogy that makes it click: an LLM is like a new kind of operating system. The model is the CPU, and its context window is the RAM — the working memory. Just as an operating system carefully curates what fits into limited RAM, context engineering decides what fits into the limited context window on every single turn.
That framing reframes the whole job. You’re not writing instructions; you’re managing a scarce, shared resource — working memory — under contention from many sources, all competing for the same finite space.

Why more space doesn’t solve it: context rot
The natural objection is “context windows are huge now — just put everything in.” This is the most expensive mistake in the field, because of a phenomenon called context rot: model accuracy measurably degrades as the input grows, regardless of the window’s size.
Chroma’s 2025 study tested 18 frontier models and found all of them degrade as context lengthens — through “lost in the middle” (information buried mid-context gets overlooked), attention dilution, and distractor interference (irrelevant tokens actively pull the model off course). Bigger windows raise the ceiling; they do not repeal the rot.

The practical consequence is counterintuitive but firm: filling the window hurts you. A 15-step refactor — tool calls, file reads, shell output, sub-agent hand-offs — can balloon past 100,000 tokens before the model does any real thinking, and quality drops accordingly. Even with a 2-million-token window, you want the model to see only what’s useful for the current step, because latency, cost, and rot all scale with what you load. The job is curation, not accumulation.
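To make "curation, not accumulation" concrete, here is a minimal sketch of selecting context under a fixed token budget. Everything in it is an assumption for illustration: the `ContextItem` type, the caller-supplied relevance scores, and the rough four-characters-per-token estimate, which stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    text: str
    relevance: float  # hypothetical score in [0, 1], supplied by the caller


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def select_within_budget(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Greedily keep the most relevant items that fit inside the token budget."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen
```

A greedy pass like this is not optimal packing, but it captures the point: low-relevance bulk such as old shell output is the first thing to be left out.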
The four strategies (the map for this series)
LangChain’s Lance Martin distilled the community’s practices into a four-strategy taxonomy that has become the standard vocabulary. Every technique in this series is one of these four:
- Write — persist context outside the window so the agent can pull it back later. Context files (Claude Code’s CLAUDE.md, Cursor/Windsurf rules files), scratchpads (todo.md), and memory stores live here.
- Select — pull the right context in at the right moment: retrieval over your codebase, memory selection, choosing which files matter for this task.
- Compress — keep only the tokens a step actually needs: summarisation, compaction, returning references instead of blobs.
- Isolate — separate unrelated context to prevent interference: sub-agents with their own windows, per-task stores, the file system as external memory.

What you’re actually supplying
This series is organised around the four kinds of context the model can’t get from pretraining — the things that make code yours:
- Coding style & conventions — your house patterns, architecture invariants, the “never do X” rules. (Mostly Write)
- Internal libraries & code — your private APIs, utilities, and the right way to call them. (Mostly Select / retrieval)
- Institutional knowledge — why decisions were made, runbooks, tribal knowledge that lives in people’s heads. (Write + Select, plus memory)
- Production data — real schemas, configs, and live state, supplied safely. (Select + Isolate, under governance)
A coding agent that compiles but takes down production is one that received zero of these. The discipline is making sure it receives exactly the right slice of each, on each turn, without drowning in the rest.
Write: context files that actually change behavior
The simplest, highest-leverage context engineering you can do is also the most under-used: write down what the agent should know in a file it reads on every session. Claude Code calls this CLAUDE.md; Cursor and Windsurf use rules files; and a cross-tool convention, AGENTS.md, has emerged so the same context works across agents. This is procedural memory — the standing instructions that encode how your team builds software.
The mistake teams make is treating these files like documentation. They’re not docs; they’re a token budget you’re spending on the agent’s behavior, so every line has to earn its place (remember context rot from above — a 5,000-line CLAUDE.md is itself bloat). Good context files are short, specific, and imperative.
# AGENTS.md — payments-service
## Architecture invariants (never violate)
- All money is integer minor units (cents). Never use floats for currency.
- Every handler goes through `withAuth()` and `validate(schema)` — no exceptions.
- DB access only via `src/db/repo.ts`. Never write raw SQL in handlers.
## Conventions
- Errors: throw `AppError(code, msg)`, never return null on failure.
- Tests: co-located `*.test.ts`, use the `makeFixture()` factory.
- Logging: `logger.info({event, ...fields})` — structured, never `console.log`.
## Project facts
- Default branch `main`; CI is GitHub Actions; Node 22.
- Public API contract lives in `openapi.yaml` — update it when routes change.
## Never
- Add a dependency without checking `package.json` for an existing one.
- Touch `legacy/` — it's frozen pending migration.
Notice what this does: it front-loads the invariants (the auth and validation rules whose absence took down production in the opening story), states conventions as commands, and ends with explicit prohibitions. It’s the institutional knowledge that normally lives in a senior engineer’s head, made machine-actionable.
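The "every line has to earn its place" budget can also be checked mechanically. The sketch below is one possible guard, not an established tool: the function name, the 150-line and 2,000-token limits, and the four-characters-per-token estimate are all assumptions to tune for your own team.

```python
def check_context_file(text: str, max_lines: int = 150, max_tokens: int = 2000) -> list[str]:
    """Return human-readable warnings when a context file outgrows its budget."""
    warnings = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    approx_tokens = len(text) // 4  # crude estimate, not a real tokenizer
    if len(lines) > max_lines:
        warnings.append(f"{len(lines)} non-blank lines exceeds limit of {max_lines}")
    if approx_tokens > max_tokens:
        warnings.append(f"~{approx_tokens} tokens exceeds limit of {max_tokens}")
    return warnings
```

Run over AGENTS.md in CI, an empty list means the file is still within budget; any warning is a prompt to prune rather than append.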
Layer your context files. Context isn’t monolithic — it cascades from broad to specific, and the agent should load the relevant layers for wherever it’s working:

A practical loader concatenates the layers from general to specific, so the most local rules win:
from pathlib import Path


def build_context_files(target_file: Path, repo_root: Path) -> str:
    """Compose context from org → repo → directory, most-specific last."""
    layers = []
    org = Path.home() / ".agent" / "GLOBAL.md"
    if org.exists():
        layers.append(("organisation", org.read_text()))
    repo_md = repo_root / "AGENTS.md"
    if repo_md.exists():
        layers.append(("repository", repo_md.read_text()))
    # Walk from the repo root down to the target's directory, picking up local
    # rules. relative_to() raises ValueError if the target is outside the repo.
    current = repo_root
    for part in target_file.parent.relative_to(repo_root).parts:
        current = current / part
        local = current / "AGENTS.md"
        if local.exists():
            layers.append((f"local:{current.name}", local.read_text()))
    return "\n\n".join(f"<!-- {name} -->\n{body}" for name, body in layers)
Write also covers scratchpads. For state within a long task, have the agent persist a todo.md or plan file outside the window and reload only the summary. This is how agents track multi-step progress without carrying every intermediate thought in context: a Write strategy that directly fights the context bloat described earlier.
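A scratchpad can be as small as a Markdown checklist on disk. The following is a minimal sketch under assumed conventions: the `- [ ]` / `- [x]` format and the three helper names are illustrative, not something any particular agent prescribes.

```python
from pathlib import Path


def write_plan(path: Path, steps: list[str]) -> None:
    """Persist the plan as a Markdown checklist outside the context window."""
    path.write_text("\n".join(f"- [ ] {step}" for step in steps) + "\n")


def mark_done(path: Path, step: str) -> None:
    """Tick off one step, matching the whole line so similar steps are not confused."""
    lines = path.read_text().splitlines()
    updated = [f"- [x] {step}" if ln == f"- [ ] {step}" else ln for ln in lines]
    path.write_text("\n".join(updated) + "\n")


def plan_summary(path: Path) -> str:
    """Return only what the agent needs to reload: progress and open steps."""
    lines = path.read_text().splitlines()
    remaining = [ln[6:] for ln in lines if ln.startswith("- [ ] ")]
    done = sum(ln.startswith("- [x] ") for ln in lines)
    total = done + len(remaining)
    return f"{done}/{total} steps done. Remaining: " + "; ".join(remaining)
```

The full plan stays on disk; only the one-line summary re-enters the window at the start of each step.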
Select: retrieving your internal libraries
Context files handle the knowledge that fits in a page. They can’t hold your entire codebase — and they shouldn’t try. For “how do we call the internal billing client?” the agent needs retrieval: pulling the relevant code into context on demand. This is where teams reach for RAG, and where a naive implementation quietly fails.
The trap: embedding search alone doesn’t scale on code. As Windsurf’s team documented, indexing is not retrieval. Embedding-based similarity search becomes an unreliable heuristic as a codebase grows — semantically similar code isn’t necessarily the code you need, and chunking by line count shreds functions mid-body. The fix the strong code agents converged on is a hybrid pipeline:

The three retrievers cover each other’s blind spots: grep nails exact symbol names, embeddings catch “code that does something similar,” and the code graph follows real relationships (who calls this, what it imports). A re-ranking pass then orders the merged candidates by genuine relevance so only a tight top-k enters context — keeping the window small and high-signal.
def retrieve_code(query: str, k: int = 8) -> list[Snippet]:
    # grep_search, embedding_search, code_graph_neighbours, dedupe, rerank, the
    # Snippet type, and the chunks index are placeholders for your own code.
    # 1. fan out to complementary retrievers
    by_keyword = grep_search(query)  # exact symbols, fast, precise
    by_meaning = embedding_search(query, chunks)  # AST-aware chunks, semantic
    by_graph = code_graph_neighbours(query)  # callers/callees/imports
    # 2. merge and de-duplicate candidates
    candidates = dedupe(by_keyword + by_meaning + by_graph)
    # 3. re-rank by true relevance, then take a tight top-k
    ranked = rerank(query, candidates)  # cross-encoder or LLM judge
    return ranked[:k]
Two implementation notes that matter more than the retriever choice. Chunk along AST boundaries (whole functions/classes), never fixed line counts, so a retrieved snippet is self-contained. And return the smallest useful unit — a function plus its signature and docstring, not the whole 2,000-line file — because what you retrieve, you pay for in context-rot terms.
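For Python sources, chunking along AST boundaries can be sketched with the standard library's `ast` module. This is an illustration, not a production chunker: the `chunk_python_source` name and the dict layout are assumptions, it handles only top-level functions and classes, and other languages would need their own parser (tree-sitter is a common choice).

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split a module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "docstring": ast.get_docstring(node),
                # Exact source text of the node; decorators are not included.
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk is a whole definition with its signature and docstring, so a retrieved snippet never starts or ends mid-body.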
You can expose this pipeline to the agent as a tool it calls (search_code(query)), or wire an existing code-graph source through MCP. Either way, the agent now pulls your internal libraries on demand instead of hallucinating their interfaces.
Select: institutional knowledge and memory
The third kind of context — why things are the way they are, runbooks, decisions, tribal knowledge — is partly Write (put durable facts in context files) and partly memory: knowledge that accumulates across sessions and is selected back when relevant.
For semantic memory (a growing store of facts and relationships), you select with the same retrieval machinery as code, indexed by embeddings or a knowledge graph. A few frameworks exist — mem0 and Letta are the visible ones — but most production teams build a thin memory layer over a key-value store plus a summarisation pass, because requirements vary. The pattern:
def recall(task: str, store, k: int = 5) -> str:
    """Select only memories relevant to the current task."""
    hits = store.search(task, k=k)  # semantic match
    return "\n".join(f"- {m.text}" for m in hits)  # inject just these


def remember(fact: str, store):
    """Write a durable fact learned this session for future selection."""
    # store and embed are placeholders for your memory backend and embedding function.
    store.upsert(fact, embedding=embed(fact))
The discipline is selective recall. Don’t dump the whole memory store into context — that’s how you recreate context rot. Select the handful of memories relevant to the task, the same way you select code. (And beware over-eager memory: an agent that silently injects stale or irrelevant “facts” is worse than one with none.)
Putting Write and Select together
A well-contexted coding agent, at the start of a task, assembles: the layered context files for where it’s working (Write), the top-k retrieved code for the symbols it’ll touch (Select), and the relevant memories about prior decisions (Select) — and nothing else. That assembled context is small, specific, and high-signal. The same agent without this receives a generic prompt and guesses — which is exactly how you get idiomatic code that ignores your auth layer.
Compress: fighting context rot in long sessions
Everything above degrades over a long session. Each tool call, file read, and shell output adds tokens, and by step 15 you’re back in the context-rot zone, with accuracy sliding as the window fills with spent material. Compress keeps the window lean by retaining only what each step still needs.
Three techniques, in rough order of how often you’ll reach for them:
Compaction is the workhorse. When a session approaches the context limit, summarise the conversation so far at high fidelity and restart a fresh window seeded with that summary. Long-range coherence survives; the token-heavy raw history doesn’t. Most production agents do this automatically, but you control what the summary preserves — and that’s a context-engineering decision: keep decisions, open threads, and invariants; drop resolved sub-tasks and verbose tool dumps.
def maybe_compact(messages, model, limit=120_000):
    # count_tokens and model.summarize are placeholders for your tokenizer and model client.
    if count_tokens(messages) < limit:
        return messages
    summary = model.summarize(
        messages,
        keep="decisions made, current plan, open questions, files modified, "
             "invariants discovered; DROP resolved steps and raw tool output",
    )
    system = messages[0]  # assumes the first message is the system prompt
    return [system, {"role": "user", "content": f"[Compacted context]\n{summary}"}]
Offloading moves bulky content out of the window and keeps a reference. Compress a 50,000-token file or webpage to a 500-token summary plus a path/URL; the agent retrieves the full content only if a later step needs it. Teams report agents naturally adopting this with todo.md and scratch files — the file system as recoverable, near-infinite memory.
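Offloading can be sketched in a few lines of standard-library Python. The function names and the returned dict shape are assumptions, and the plain-text preview is only a stand-in for the model-written summary the paragraph above describes.

```python
import hashlib
from pathlib import Path


def offload(content: str, store_dir: Path, preview_chars: int = 200) -> dict:
    """Write bulky content to disk and return a compact reference to it."""
    store_dir.mkdir(parents=True, exist_ok=True)
    # Content-addressed file name, so identical content is stored once.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    path = store_dir / f"{digest}.txt"
    path.write_text(content, encoding="utf-8")
    return {
        "path": str(path),
        "chars": len(content),
        "preview": content[:preview_chars],  # stand-in for a real summary
    }


def load_offloaded(ref: dict) -> str:
    """Fetch the full content back only when a later step needs it."""
    return Path(ref["path"]).read_text(encoding="utf-8")
```

Only the small reference dict travels in the conversation; the full text stays recoverable on disk.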
Structured tool outputs. Design tools to return the smallest useful result — an ID, a count, a path — not a giant JSON blob the agent has to carry. A tool that returns {"matches": 412, "sample": [...3 items...]} is far kinder to the window than one that returns all 412 rows.
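A tool wrapper that produces the compact shape described above might look like the following sketch. The `summarise_matches` name and the extra `truncated` flag are assumptions added for illustration.

```python
def summarise_matches(rows: list[dict], sample_size: int = 3) -> dict:
    """Return a count and a small sample instead of the full result set."""
    return {
        "matches": len(rows),
        "sample": rows[:sample_size],
        # Tells the agent there is more to fetch if a later step needs it.
        "truncated": len(rows) > sample_size,
    }
```

The agent can still ask for more rows through a follow-up call, but it pays for them only when a step actually needs them.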
Isolate: sub-agents with their own windows
Some work doesn’t belong in the main agent’s context at all. Isolate gives a sub-task its own fresh window, so its intermediate tokens never pollute the parent’s working memory. The parent delegates a goal and gets back only the result.
The payoff is large and well-documented: Anthropic’s multi-agent research system divided work among sub-agents that each explored independently with their own context windows, then had a lead agent synthesise — a 90.2% performance improvement over the single-agent baseline on their evaluation, precisely because each sub-agent reasoned in a clean, focused window instead of one shared, crowded one. (It used more total tokens — isolation trades token volume for quality and parallelism, which is usually the right trade for hard tasks.)
Isolation shines for parallelisable, mostly read-only work: “find every call site of this deprecated API,” “research how three modules implement caching,” “review these twelve files for the same issue.” Each becomes a sub-agent with a narrow brief and a clean window; the parent stays uncluttered.
def isolated_subtask(brief: str, tools) -> str:
    """Run a sub-agent in its own fresh window; return only the distilled result."""
    # Agent and focused_system_prompt are placeholders for your agent framework.
    sub = Agent(system=focused_system_prompt(brief), tools=tools)  # clean context
    result = sub.run(brief)
    return result.summary  # only this crosses back into the parent's window
The judgement call: isolate when a sub-task’s process is verbose but its result is compact. Don’t isolate tightly-coupled work where the parent needs the intermediate reasoning — you’ll just pay coordination overhead.
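For the parallelisable, read-only cases, the fan-out itself is simple. This sketch assumes a caller-supplied `run_subagent` callable that takes a brief and returns a distilled summary (for example, a wrapper around a function like `isolated_subtask`); the `fan_out` name and the thread-pool choice are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out(
    briefs: list[str],
    run_subagent: Callable[[str], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run one sub-agent per brief in parallel; keep only each distilled result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_subagent, briefs))  # map() preserves input order
    return dict(zip(briefs, results))
```

Threads suit this shape because each sub-agent spends most of its time waiting on model and tool calls; the parent sees one short summary per brief and none of the intermediate tokens.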
Supplying production data — safely
The fourth kind of context listed earlier (real schemas, configs, live state) is the most valuable and the most dangerous. An agent that can see production data can be far more accurate; an agent that can see too much production data is a breach waiting to happen. Supply it under hard controls:
- Read-only, least-privilege access. The agent gets a scoped, read-only credential to exactly the tables/resources the task needs — never a broad production role, never write access by default. This is the same containment discipline a shell agent needs, applied to data.
- Redaction and PII handling at the boundary. Strip or tokenise secrets and personal data before it enters the context window. A retrieval tool over production should return schemas and shapes, sampled/synthetic rows, and aggregates — not raw customer records.
- Prefer metadata over payloads. Most of the time the agent needs the schema (“what columns exist, what types”), not the data. Supply the structure; withhold the contents unless a task genuinely requires them.
- Audit every access. Log what the agent read, when, and why. Production-data access through an agent should be as auditable as any other privileged access.
Route this through a controlled tool or an MCP server with these guarantees built in, rather than handing the agent a live connection string. Production data is exactly the kind of governed, cross-trust-boundary access where a structured, auditable interface earns its overhead.
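Two of the controls above, metadata over payloads and redaction at the boundary, can be sketched with the standard library. SQLite stands in for a real production database here, the email pattern is deliberately simplistic, and the function names are assumptions; real PII handling needs far broader coverage than one regex.

```python
import re
import sqlite3

# Illustrative pattern only; real PII detection covers many more data types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def describe_table(conn: sqlite3.Connection, table: str) -> list[dict]:
    """Return column names and declared types only, never row contents."""
    if not table.isidentifier():  # PRAGMA cannot take bound parameters
        raise ValueError(f"unsafe table name: {table!r}")
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [{"column": r[1], "type": r[2], "nullable": not r[3]} for r in rows]


def redact(text: str) -> str:
    """Mask email addresses before text enters the context window."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)
```

The agent learns that `customers.email` exists and is text without ever seeing an address. Access scoping and audit logging still have to be enforced by the surrounding tool or server.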
ContextOps: govern it like code
Here’s the part that separates teams that win with coding agents from teams that generate expensive technical debt. The context you’ve built — files, retrieval config, memory, data policies — is infrastructure, and infrastructure needs governance. The gap is real: ~91% of engineering organisations have adopted at least one AI coding tool, but few have governance matching that adoption, and roughly 48% of AI-generated code has been found to carry security vulnerabilities. Ungoverned context is where that risk compounds.
Treat context as a first-class, version-controlled asset:
- Version control your context files. AGENTS.md and rules files live in the repo, reviewed in PRs like any code. A convention change is a diff, not a Slack message.
- Enforce at pre-commit. The invariants you wrote into context files should also be checked mechanically — linters, schema validation, dependency rules — so the agent’s adherence is verified, not trusted. Context tells the agent the rule; the gate confirms it followed it.
- Monitor for drift. Codebases evolve; context files rot. Watch for context that’s gone stale (a convention the code no longer follows, a retrieval index that’s behind) and treat it as a bug.
- Assign ownership. Someone owns each layer of context, the way someone owns a service. Orphaned context decays.
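Drift monitoring can start very small. The sketch below flags backticked paths in a context file that no longer exist in the repository. The `find_stale_paths` name, the regular expression, and the list of file extensions are assumptions; the pattern is tuned to the kind of references in the sample AGENTS.md shown earlier and will miss other styles.

```python
import re
from pathlib import Path

# Matches backticked tokens that contain a slash, or that end in a known extension.
PATH_LIKE = re.compile(r"`([\w./-]+/[\w./-]*|[\w-]+\.(?:ts|py|md|yaml|yml|json))`")


def find_stale_paths(context_file: Path, repo_root: Path) -> list[str]:
    """List backticked paths in a context file that no longer exist in the repo."""
    text = context_file.read_text()
    referenced = {m.group(1) for m in PATH_LIKE.finditer(text)}
    return sorted(p for p in referenced if not (repo_root / p).exists())
```

Wired into a pre-commit hook or CI job that fails on a non-empty result, this turns one class of stale context into a build error instead of a silent misdirection.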
The reason this matters compounds over time: a well-governed context environment doesn’t just improve this quarter’s output. It accumulates your institutional knowledge in a form every agent on the team can act on — and as agents get more autonomous, that asset is what determines whether autonomy amplifies good patterns or replicates chaos across entire features with no human in the loop to catch it.
The whole system
Put all three parts together and a production coding agent’s context, on any given turn, is assembled like this:

Style and knowledge come in through Write and Select. Production data comes in through governed, least-privilege Select. Compress and Isolate keep the window small as work grows. And the governance loop keeps the whole thing honest over time. What reaches the model is a tight, high-signal context — your conventions, the relevant libraries, the pertinent decisions, the necessary data shape — and nothing else.
That’s the discipline. Not a clever prompt, but an engineered information environment: curated on every turn, governed like code, and compounding into the single most valuable asset a team building with agents can own.