Essay · 10 min read

One Call to Know. Not Ten to Guess.

Every turn an agent spends rediscovering what a function is, who owns it, and what it promises is a tax on your latency, your token bill, and your error rate. The fix is not more context. It is identity upstream of discovery.

Open the trace of any serious coding agent session and watch where the tokens actually go. The agent reads a file. Then another. It greps for a function name. It opens the test. It opens the test's helpers. It reads the package manifest. It re-reads the first file, because by the time it gets to turn fourteen it has forgotten what the signature was. Somewhere around turn twenty it edits two lines.

The edit is the job. Everything before it is the agent asking the codebase what it already is. Every turn of that asking is billable, lossy, and wrong just often enough to matter.

What a coding agent actually spends its tokens on

Concordia University researchers traced token consumption across thirty software development tasks inside the ChatDev framework using GPT-5 Reasoning, mapping the flow to standard SDLC phases. The result, summarized in RDEL #138, is blunt: input tokens account for 53.9% of total usage, and Code Review plus Code Completion alone consume over 86% of all tokens. Only 8.6% goes to the Coding phase itself. The remainder is the agent re-reading context, re-checking assumptions, and re-establishing state it had already established earlier in the conversation.

Vantage's 2026 analysis of agentic coding sessions walks through a representative 50-turn feature implementation. Turns 1-10 send about 5,000 input tokens each. Turns 11-30 send 20,000. Turns 31-50 send 35,000. By the end, the agent is paying to re-transmit everything it has touched, on every single call. Total session: roughly one million input tokens and forty thousand output tokens. A 25:1 ratio. The input side is the line item that scales with ambiguity.
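The session arithmetic can be checked directly from the per-turn figures quoted above (a sketch; the per-phase token counts are the article's illustrative numbers):

```python
# Reproduce the 50-turn cost trajectory described above.
# (turns in phase, approximate input tokens sent per turn)
phases = [
    (10, 5_000),   # turns 1-10: ~5k input tokens each
    (20, 20_000),  # turns 11-30: ~20k each
    (20, 35_000),  # turns 31-50: ~35k each
]

input_tokens = sum(turns * per_turn for turns, per_turn in phases)
print(input_tokens)  # 1,150,000 -- "roughly one million" against ~40k output tokens
```

The growth is the point: the last twenty turns cost more than the first thirty combined, because every turn re-transmits everything the agent has already touched.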

This is the real bill on agentic development, and it is not being paid for code. It is being paid for rediscovery.

Discovery is not overhead. It is the entire invoice.

The instinct is to file discovery under "setup cost": a fixed tax the agent pays to get oriented, amortized over the useful work that follows. The data says otherwise.

Taken together, the Concordia numbers and the Vantage trajectory describe a system where the thing the reader thinks they are paying for -- code generation -- is a rounding error next to the thing they did not know they were paying for: the agent reconstructing, every turn, facts that are already true about the codebase. The function's purpose. The service's contract. The owner. The policy envelope. The test that already exists. These are not discoveries. They are lookups being served as discoveries because nothing in the stack returns them in one call.

The gap between "what is this" and "rediscover what this is" is the discovery tax. It is invisible on the invoice because the invoice only shows tokens. But the tokens are the discovery.

The 39% that disappears between turn one and turn ten

The token cost is the visible half. The other half is that agents are measurably worse at their jobs once the conversation gets long.

Laban et al., in a May 2025 paper titled LLMs Get Lost In Multi-Turn Conversation, tested every top open- and closed-weight model available across six generation tasks, comparing single-turn performance against sharded multi-turn settings. The average performance drop: 39%. The authors attribute the gap not to a loss of raw capability (that piece is small) but to a sharp rise in unreliability: models commit to assumptions in early turns, generate premature final solutions, and then fail to recover when those early assumptions prove wrong.

This compounds a pattern Nelson Liu and collaborators documented in the canonical "Lost in the Middle" study. When relevant information sits in the middle of a long context, GPT-3.5-Turbo's accuracy on multi-document QA falls below its closed-book baseline. The model does worse with the right answer buried in its context than it does with no context at all. The finding holds even for models marketed as long-context.

Put the two studies next to each other and a picture sharpens. The more turns the agent takes to discover what it needs, the more its context fills with the byproducts of that discovery. The longer the context, the less reliably the model uses the parts that actually matter. The agent is not just paying more per turn. It is getting worse at the task with every turn it spends looking.

Anthropic's own guidance on agent evaluation (Demystifying evals for AI agents) is explicit that errors in multi-turn trajectories propagate and compound, and that number of turns is a first-class metric teams should track. That is not a detail for evaluation engineers. That is the business case for cutting discovery turns to the smallest number that still produces the right answer.

Why "let the agent figure it out" sounded right

There is a real argument for discovery-first agents, and it deserves to be engaged rather than dismissed.

A discovery-based agent generalizes. It does not need the codebase to be annotated, declared, or blessed with structure. Drop it into an unfamiliar repo and it will grep, read, and test its way to a working model. It is robust to drift, because it rebuilds its model every session from whatever the code actually looks like right now. And it does not require a platform team to have done work upstream before the agent becomes useful.

These are not small properties. They are why the pattern caught on in the first place and why Tobi Lütke, Andrej Karpathy, and Simon Willison converged in mid-2025 on context engineering as the name for the emerging skill. Gartner followed in short order. In a July 2025 research note titled Lead the Shift to Context Engineering as Prompt Engineering Fades, the firm declared that "context engineering is in, and prompt engineering is out."

The technique is real. The generalization is real. What the language obscures is that every session pays the discovery cost from scratch. The agent that grepped its way to the payment service's idempotency contract this morning will grep its way there again this afternoon, and again tomorrow, and again next week. Context engineering makes one session good. It does not make the thousandth session cheap.

Spotify's engineering team, writing about their Honk background coding agent in November 2025, documented the breakpoint directly. Their earlier homegrown system was capped at "10 turns per session, three session retries total." Agents forgot the original task after a few turns. Multi-file changes failed. The fix they landed on was not to extend the turn budget. It was to shrink discovery by pre-loading condensed context and intentionally restricting tool access. Fewer turns, narrower tools, more pre-declared structure.

That is not a rejection of discovery. It is an acknowledgment that discovery done once and persisted beats discovery done fresh every session.

Identity is the cache key for intent

The thing a platform team can give an agent that eliminates most discovery turns is not documentation. Documentation rots and the agent cannot trust it. It is not RAG over documentation, for the same reason. It is a declared, verifiable, queryable identity for every unit of the system the agent might need to reason about.

Identity, in the sense we use here, is the set of claims a piece of software makes about itself that are structurally true at the moment the agent asks. It is layered, because different layers answer different questions a coding agent needs to ask:

  • What is this? A function, a service, a module, a job. Named, typed, addressable.
  • What does it promise? A behavioral contract. Idempotency, latency, ordering, consistency guarantees. Not a comment. A declaration the system can check.
  • Who owns it? A team, a person, a policy domain. Resolvable without a Slack search.
  • What is it allowed to do? A policy envelope. What it can call, what can call it, under what conditions.
  • How do we know it works? The tests, benchmarks, and invariants that define "correct" for this unit.
  • What is it connected to? The graph of upstream and downstream dependents, at the level of intent, not just imports.
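As a concrete sketch, the six layers could be declared as one typed record per unit. Everything here (the `Identity` type, its fields, the example values) is hypothetical, not an existing API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Identity:
    """A declared, queryable identity for one unit of a system.
    Hypothetical schema illustrating the six layers above."""
    name: str        # what is this? named, typed, addressable
    kind: str        # function | service | module | job
    contract: dict   # what does it promise? machine-checkable claims
    owner: str       # who owns it? resolvable without a Slack search
    policy: dict     # what is it allowed to do? the policy envelope
    verified_by: list[str] = field(default_factory=list)  # tests defining "correct"
    depends_on: list[str] = field(default_factory=list)   # intent-level dependencies

# One lookup answers what would otherwise be a multi-turn discovery mission.
charge = Identity(
    name="payments.charge",
    kind="function",
    contract={"idempotent": True, "p99_latency_ms": 250},
    owner="team-payments",
    policy={"may_call": ["ledger.post"], "callable_by": ["checkout.*"]},
    verified_by=["tests/payments/test_charge_idempotency.py"],
    depends_on=["ledger.post"],
)
assert charge.contract["idempotent"]  # a declaration the system can check, not a comment
```

The shape matters less than the property: each field is a claim the system can verify at query time, not prose the agent has to trust.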

Each of those questions, in a codebase without an identity layer, is a multi-turn discovery mission. The agent has to read code, read tests, read owners files (if they exist), read the policy engine's rules (if those are even checked in), and then assemble an uncertain guess. In a codebase with an identity layer, each question is one lookup. Verifiable, fresh, and typed.

The economic shape of the change is straightforward. The work of establishing the identity is done once, by the people who actually know the answer, and stored in a form that survives them. Every subsequent agent session, across every tool and every developer, queries it in a single call.

What one call returns that ten turns cannot

There is a property of one-call identity retrieval that ten-turn discovery cannot match, even in principle. A single call to an identity layer returns a claim that is true at the moment of the call, backed by the system's current state, and signed by the team that owns it. A ten-turn discovery produces an agent's inference about what is probably true, assembled from artifacts of varying freshness, with no authority behind it.

That distinction matters when the agent is about to do something. The identity claim is a trust anchor. The inference is a guess. Your review process, your compliance posture, and your incident blast radius all depend on which one the agent acted on.

It also matters for reuse. A claim returned in one call can be cached, audited, versioned, and revoked. An inference assembled across ten turns exists only inside that session's context window, and the next session has to assemble it again from scratch. Every session is a new first day on the job.
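A minimal sketch of why the one-call claim is reusable where the inference is not: the claim carries a version, an owner, and a revocation bit, so a later session can validate it in one check instead of rebuilding it from scratch (all names here are hypothetical):

```python
from dataclasses import dataclass
import time

@dataclass
class Claim:
    """A one-call identity claim: versioned, attributable, revocable."""
    subject: str     # e.g. "payments.charge"
    statement: dict  # the promise, e.g. {"idempotent": True}
    owner: str       # the team with the authority to make it
    version: int
    issued_at: float
    revoked: bool = False

def is_usable(claim: Claim, max_age_s: float = 86_400) -> bool:
    """A later session audits the claim rather than re-deriving it."""
    fresh = (time.time() - claim.issued_at) < max_age_s
    return fresh and not claim.revoked

c = Claim("payments.charge", {"idempotent": True}, "team-payments",
          version=3, issued_at=time.time())
assert is_usable(c)       # cacheable and reusable across sessions
c.revoked = True
assert not is_usable(c)   # revocation invalidates it in a single check
```

A ten-turn inference has no equivalent of `is_usable`: it cannot be audited, versioned, or revoked, only re-derived.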

The instrumentation you probably don't have yet

Before adopting any of this, measure the problem. Most teams are flying without the dashboard.

Three numbers change the conversation:

Turns per completed task. Not average latency. Not tokens per session. The actual distribution of how many turns your agents take to get a correct result. If the 75th percentile is above ten, you are paying the compounding-error tax visible in the Laban paper. If the 95th percentile is above thirty, the tail is eating your bill.

Discovery-to-edit ratio. Of the tokens sent and received in a session, what fraction corresponds to the agent reading, grepping, and re-orienting versus the agent actually proposing, editing, or executing changes? Concordia's number puts Code Review and Code Completion at over 86% of tokens. If yours is in that range, the discovery tax is your dominant cost, not your agents' reasoning.

Repeated-lookup rate. How often, across sessions, does the agent perform the same discovery action -- reading the same files, grepping the same symbols, asking the same questions? This is the number that most directly quantifies what an identity layer would eliminate. Every repeated lookup is a cache miss on something that should have been a one-call answer.

You do not need a new tool to get these numbers. Agent traces already contain them. What most teams lack is the habit of aggregating them and treating the results as a budget line.
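From raw traces, all three numbers reduce to a few lines of aggregation. A sketch over a hypothetical trace format (one record per tool call; the field names and the "discover"/"edit" split are assumptions, not a standard):

```python
from collections import Counter

# Hypothetical trace records: session id, action kind, target, token cost.
# "discover" = read/grep/list; "edit" = propose/apply/execute a change.
traces = [
    {"session": "s1", "kind": "discover", "target": "src/pay.py",        "tokens": 4000},
    {"session": "s1", "kind": "discover", "target": "tests/test_pay.py", "tokens": 3000},
    {"session": "s1", "kind": "edit",     "target": "src/pay.py",        "tokens": 800},
    {"session": "s2", "kind": "discover", "target": "src/pay.py",        "tokens": 4100},
    {"session": "s2", "kind": "edit",     "target": "src/pay.py",        "tokens": 900},
]

# 1. Turns per completed task: the distribution, not the mean.
turns_per_task = sorted(Counter(t["session"] for t in traces).values())

# 2. Discovery-to-edit ratio: fraction of tokens spent re-orienting.
discover_tokens = sum(t["tokens"] for t in traces if t["kind"] == "discover")
total_tokens = sum(t["tokens"] for t in traces)
discovery_ratio = discover_tokens / total_tokens

# 3. Repeated-lookup rate: discovery targets hit in more than one call.
hits = Counter(t["target"] for t in traces if t["kind"] == "discover")
repeated_rate = sum(1 for n in hits.values() if n > 1) / len(hits)

print(turns_per_task, round(discovery_ratio, 2), repeated_rate)
# [2, 3] 0.87 0.5  -- src/pay.py was rediscovered: a cache miss on a known fact
```

In a real pipeline the records come out of your agent framework's trace export; the aggregation logic stays this small.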

What's still hard

Establishing identity upstream of agents is not a free win and it is not yet a solved problem. We are working through four real challenges in production with the teams adopting this:

Bootstrapping. The identity of an existing codebase is mostly undeclared. Inferring it, declaring it, and getting it reviewed by the humans who actually know the answers is work. The payoff is large but the first migration is not trivial.

Drift. Identity that falls out of sync with code is worse than no identity at all, because it is confidently wrong. The identity layer has to be structurally coupled to the code such that changes to one force review of the other. That is an engineering discipline as much as a tool choice.

Granularity. Too coarse and the identity returns are not specific enough to eliminate discovery. Too fine and the identity layer becomes a parallel codebase nobody maintains. Getting the unit of identity right, per domain, is judgment work.

Authority. A claim is only as strong as the process that produced it. An identity layer populated by an agent that inferred the claims from the code is a faster discovery cache, not a trust anchor. The claims that carry authority are the ones owned by humans with the scope to make them.

None of these are arguments to stay on discovery-first agents. They are the real work of moving off them. The payoff is a development economy where the marginal cost of the hundredth agent session is not the same as the first, and where agents act on declared truth rather than inferred guesses. That shift is what makes agentic development compound instead of plateau.

The agent that rediscovers your codebase every session is not learning. It is paying rent on what you already know. Stop renting. Declare it once. Return it in one call.


The ribo.dev Team

Building the identity layer for software.
