The hidden bill is in the loop
A single model call is usually easy to price. You can look at the input, look at the output, and do the math without much drama. The real bill shows up when an agent does what so many production systems do by default: read the current context, decide what to do, call a tool or another model, send the same context back in, then repeat the whole routine until the task is done.
That pattern sounds harmless in isolation. One extra pass to verify a field. One retry after a tool error. One more round to compare results. No big deal, right? Then the workflow reaches a document with ten pages of messy text, a chat history that keeps growing, and a couple of validation steps that resend the full state every time. The cost stops behaving like a flat line and starts behaving like a pile of receipts someone forgot to file.
The compounding is what catches teams off guard. A request that looks cheap on paper can turn into five or ten calls once the agent starts revisiting the same material. If each pass includes the full conversation, the extracted fields, the tool outputs, and the original document text, token usage grows fast. The model is paying again and again to read things it has already seen. The system is doing work, but it’s also redoing work. That second part is where the money goes.
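A rough sketch of that arithmetic, with made-up token counts (none of these numbers come from a real workload): if every pass resends the starting context plus whatever earlier passes appended, the total input the model reads grows much faster than the number of passes. The function name and parameters are illustrative only.

```python
def cumulative_input_tokens(base_context: int, added_per_pass: int, passes: int) -> int:
    """Total input tokens read across all passes when each pass resends the
    full context plus everything appended by earlier passes."""
    total = 0
    context = base_context
    for _ in range(passes):
        total += context            # the model rereads the whole context
        context += added_per_pass   # tool output or draft appended for the next pass
    return total


# Hypothetical workload: a 6,000-token document-plus-history context,
# with each pass appending 500 tokens of tool output.
one_pass = cumulative_input_tokens(6_000, 500, 1)
eight_passes = cumulative_input_tokens(6_000, 500, 8)
print(one_pass, eight_passes)
```

With these illustrative numbers, one pass reads 6,000 input tokens and eight passes read 62,000, even though only 3,500 tokens of new material were added along the way.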
This is especially easy to miss when the agent seems to make progress at every step. It reads an invoice, extracts totals, notices a mismatch, checks line items, asks for another pass, then rechecks the same lines with slightly different instructions. Each step feels reasonable. The total cost doesn’t feel reasonable anymore.
A lot of teams respond by tuning prompts. Sometimes that helps a bit. Sometimes it just makes the loop more polite. The deeper issue sits in the control flow. Which step needs the large model? Which step only needs extraction? Which step can be handled by a smaller classifier or a deterministic check? Those are routing questions, not prompt questions.
That’s where agent routing comes in. In practice, it’s a decision layer that picks the cheapest reliable path for each step instead of sending every step through the same expensive route. LLM routing should decide whether the agent needs reasoning, simple classification, or a validation pass. If the system keeps resending full context to the biggest model because no one drew those lines, cost will climb even when the prompts look tidy.
So the real accounting problem isn’t one inference. It’s repetition. Read, decide, call, resend, repeat. Once that loop exists, model choice matters, but the shape of the loop matters more.

Where repetition creeps into agent systems
The annoying part about agent costs is that they rarely arrive wearing a name tag. It usually starts with something innocent: a retry, a validation step, a second tool call, a longer document than expected. Then the system grows a few extra branches, and token usage starts acting like it’s trying to set a personal record.
A lot of the waste comes from the same pattern showing up in different costumes.
Retry loops are the obvious one. A tool fails, the model tries again, the same input is resent, and the model gets another full bite at the apple. If the failure was transient, fine. If the workflow is designed so that every hiccup triggers a full replay of the conversation, AI agent costs climb fast. The painful part is that retries often carry the entire prior state with them, not just the failed step.
Long conversation histories quietly become a tax. Once an agent has enough back-and-forth, every new turn has to read the old turns unless something trims or summarizes them. That means the system keeps paying to reprocess instructions, earlier outputs, and stale context that may no longer matter. In a chatty support agent, a planning agent, or a document assistant that revisits the same file repeatedly, the history can end up costing more than the actual new work.
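One way to stop paying that tax is to cap what gets resent. The sketch below is a minimal illustration, assuming chat messages are plain dicts with "role" and "content" keys and using a crude four-characters-per-token estimate in place of a real tokenizer. Both `estimate_tokens` and `trim_history` are hypothetical helpers, not part of any particular SDK.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in (~4 characters per token). Swap in a real tokenizer.
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep a leading system message, then as many of the most recent
    turns as fit inside the token budget."""
    if not messages:
        return []
    head = [messages[0]] if messages[0]["role"] == "system" else []
    tail = messages[len(head):]
    used = sum(estimate_tokens(m["content"]) for m in head)
    kept = []
    for msg in reversed(tail):      # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break                   # older turns are dropped
        kept.append(msg)
        used += cost
    return head + list(reversed(kept))
```

Dropping old turns outright is the bluntest option. A summary of the dropped turns could be slotted in after the system message, at the price of one extra (ideally cheap) call.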
Tool-call chains are another source of repetition. A single user request might trigger a classifier, then a retriever, then a parser, then a formatter, then a final response. Each step may look cheap on its own. The catch is that every tool call often asks the model to re-read the same task description, the same extracted data, or the same surrounding context. By the third or fourth hop, the system has not done four independent things. It has done one thing four times with slightly different accessories.
Verification passes are easy to justify and easy to overdo. A second model checks the first model. Then a third pass checks the check. This can be sensible when the output is high stakes, but it becomes wasteful when each verifier gets the full payload again. In practice, teams sometimes pay for a full reread when a narrow check would do the job. A regex, a schema validator, a rules engine, or a small model can often catch the obvious mistakes without dragging the whole conversation back through the oven.
Document-heavy workflows have their own brand of pain. Receipts, invoices, shipping labels, ID scans, archival PDFs, and contract pages all create the same problem in a slightly different outfit: the model sees a lot of text, a lot of layout noise, and a lot of repeated structure. If the workflow extracts fields, checks them, reformats them, then verifies them again, token usage starts rising even when each call looks modest in isolation. A two-page invoice might not seem expensive. Ten invoices, each with extraction plus validation plus retry logic, tell a different story. That’s where repeated work hides.
The same thing happens in multi-step tasks that feel simple from the outside. Suppose an agent receives a scanned receipt, extracts the merchant and total, checks whether the date is valid, compares it against a policy, and then drafts a note for accounting. None of those steps is huge. Together, they can force the system to read the same image-derived text over and over, plus the same instructions and edge cases. The bill doesn’t come from one dramatic request. It comes from the quiet accumulation of passes.
One-shot tasks are usually easy to price. Iterative agent flows are where the surprise lives.
That distinction matters. A one-shot task has a clean boundary. The model gets input, produces output, and the transaction ends. An iterative flow keeps the door open. It may revisit earlier state, re-run checks, branch into tools, or ask for confirmation before finishing. Each extra turn is defensible on its own. Put them together, though, and the system starts spending token budget on rereading rather than doing.
This is why teams often misread their own usage charts. A single prompt may look tame. The full path through the agent can be anything but tame. When you stack retries, history, tools, and verification on top of document-heavy inputs, the math changes quickly. The workflow, not the individual call, is what drives the bill.
Why defaulting to the biggest model is the wrong reflex
A bigger model feels like the safe choice. It usually isn’t.
When a workflow has already grown a few branches, the default instinct is to route everything to the strongest model in the stack and call it insurance. That sounds sensible until you look at the actual work being done. Some steps need planning across messy context, conflicting signals, and tool results that change the next move. Those steps can justify a larger model. Others are much simpler. Pull the invoice number. Decide whether the document is a receipt or a contract. Check whether a field matches a schema. Verify that a date is present. That kind of work often needs precision, not broad reasoning power.
If you send every one of those steps to the largest model, you pay for capability you never use. You also pay in latency. Each pass takes longer, and in an agent flow that delay gets multiplied by retries, tool calls, and verification loops. A single slow response might be tolerable. Five of them in one run start to feel like the app is thinking very hard about whether a PDF is upside down. Users notice that. So do queues.
Cost grows in a less obvious way too. Large models tend to be expensive per token, and agent systems love to feed them the same material over and over. The conversation history comes back. The extracted text comes back. The tool output comes back. Then the draft answer comes back with a few more instructions on top. Before long, the context window is packed with duplicate material, and the bill reflects all of it. Teams often blame the model itself when the real issue is model selection. The model may be fine. The routing is sloppy.

There’s also a habit of treating the strongest model as a universal guardrail. That can backfire. A larger model doesn’t magically fix a workflow that asks the wrong question at the wrong time. If the step is badly framed, the model can still drift, infer too much, or repeat an earlier mistake with more confidence than before. Give it a noisy OCR dump and a vague instruction to “clean this up,” and it may tidy the text while preserving the wrong field mapping. Ask it to verify something that should have been checked by deterministic code, and you may get a fluent answer where a simple yes or no would have been safer.
This shows up a lot in document workflows. Say the system receives an invoice. One step classifies the document type. Another extracts totals and tax values. A third checks whether the OCR output fits a known schema. A fourth decides whether the document needs human review. Only one of those steps might need heavier reasoning. The others can often run on a smaller model or a rule-based check. Sending the whole chain through the biggest model is like using a forklift to move a sticky note. Technically possible. Not a great habit.
The same pattern appears in support agents, intake flows, and search assistants. A large model may be the right choice when the system has to combine evidence, resolve ambiguity, and pick a next action. It’s usually overkill when the task is extraction, classification, or validation. If the job is to say “this is an ID card” or “this field is missing,” the better answer is often the cheaper one that gets it right faster.
The expensive part is often not the hardest step. It’s the step you keep handing to a model that never needed that much power in the first place.
That’s why model size alone is a poor decision rule. It treats all work as if it asks for the same kind of thinking. It doesn’t. Some steps need breadth, some need strictness, and some need almost no reasoning at all. When model selection ignores that difference, the system burns tokens, adds latency, and still misses easy wins. The next problem isn’t choosing a smarter model. It’s choosing a better path.
A routing layer: the practical fix
A routing layer gives the agent a smaller decision point before each expensive step. Instead of sending every request, every page, and every retry straight to the biggest model, the system asks a simpler question first: what kind of work is this?
That sounds almost too plain, which is usually a good sign. In practice, the router can look at task type, confidence, context size, and the current step in the pipeline. Is this extraction, reasoning, or verification? Does the model need the whole chat history, or just the last turn and a single document snippet? Is the input clean enough for a fast pass, or messy enough that it needs a heavier read? Once those decisions are explicit, the agent stops acting like a single black box and starts acting like a controlled sequence.
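As a minimal sketch of what such a decision point might look like, assuming each step can be described by a task type, whether a deterministic rule exists for it, and whether the input looks clean: the `Step` fields, the `Route` names, and the ordering of the checks are all illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    RULE_CHECK = "rule_check"    # regex, schema validator, lookup table
    SMALL_MODEL = "small_model"
    LARGE_MODEL = "large_model"


@dataclass
class Step:
    task_type: str        # "extraction", "classification", "verification", "reasoning"
    has_clear_rule: bool  # can deterministic code answer this step?
    input_is_clean: bool  # e.g. the OCR output looked usable


def choose_route(step: Step) -> Route:
    """Pick the cheapest path that is likely to be reliable for this step."""
    if step.task_type == "verification" and step.has_clear_rule:
        return Route.RULE_CHECK
    if step.task_type == "reasoning" or not step.input_is_clean:
        return Route.LARGE_MODEL
    return Route.SMALL_MODEL


print(choose_route(Step("classification", has_clear_rule=False, input_is_clean=True)))
```

The value is less in these particular rules than in having them written down in one place, where they can be logged, tested, and changed.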
For a lot of workflows, smaller models are perfectly fine for the first pass. A simple classifier can sort an incoming document into receipt, invoice, ID scan, or something that needs manual review. A compact model can often normalize OCR output, fix broken line breaks, and map fields into a schema. When the job is that narrow, there’s no prize for asking a giant model to think about it for a full page.
Specialized checks can do even more of the boring work. A regex validator can catch malformed dates. A schema parser can reject half-baked JSON. A checksum or lookup table can confirm whether an ID number fits the expected format. In OCR pipelines, that kind of deterministic validation is often faster and more reliable than asking another model to reread the same text and guess whether the answer feels right. If the step has a clear rule, use the rule.
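For illustration, here is a small deterministic validator built only on the Python standard library. The field names and types in `REQUIRED_FIELDS` are a hypothetical receipt schema invented for this example. A real pipeline would substitute its own schema, or a dedicated schema library.

```python
import json
import re
from datetime import date

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
# Hypothetical schema for an extracted receipt.
REQUIRED_FIELDS = {"merchant": str, "total": (int, float), "date": str}


def is_valid_iso_date(value: str) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not DATE_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)   # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True


def validate_extraction(raw_json: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["expected a JSON object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for field: {field}")
    if isinstance(data.get("date"), str) and not is_valid_iso_date(data["date"]):
        problems.append("date is not a valid YYYY-MM-DD date")
    return problems
```

A check like this costs no tokens at all, and the list of problems it returns can be handed to a later step in place of the full payload.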
If a step can be answered with a label, a rule, or a validator, it probably should be.
The larger model still has a job. It just shouldn’t be the default passenger on every trip. Use it when the task really needs synthesis across noisy inputs, cross-document comparison, or a judgment call that can’t be reduced to a tidy rule. A user says, “These two invoices don’t match.” That’s a better fit for deeper reasoning. So is a case where the model has to reconcile conflicting evidence from several pages, or interpret a document that OCR flattened into near-gibberish. The bigger model earns its keep there. On a good day, it saves you from a pile of awkward edge-case code. On a bad day, it still beats a smaller model that’s guessing with more confidence than sense.
The important part is fallback logic. A routing layer is only useful if it knows when to give up on the cheap path. If a small model returns low confidence, escalate. If validation fails, route the step to a larger model or a more specific checker. If a page is unreadable, send that page back through a stronger OCR or reasoning pass instead of resubmitting the whole document packet. That single change cuts down on unnecessary context repetition fast, because you stop replaying the entire history just to fix one bad field.
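A hedged sketch of that escalation rule, working on one page at a time: `cheap` and `strong` stand in for whatever small and large paths a system actually has, and the 0.8 confidence threshold is an arbitrary placeholder. The stub functions at the bottom exist only to make the example runnable.

```python
from typing import Callable, NamedTuple


class Result(NamedTuple):
    value: str
    confidence: float  # 0.0 to 1.0, as reported by whichever path produced it


def run_with_fallback(
    page_text: str,
    cheap: Callable[[str], Result],
    strong: Callable[[str], Result],
    is_valid: Callable[[str], bool],
    min_confidence: float = 0.8,
) -> tuple[Result, str]:
    """Try the cheap path on a single page; escalate only that page if the
    result is low-confidence or fails validation."""
    first = cheap(page_text)
    if first.confidence >= min_confidence and is_valid(first.value):
        return first, "cheap"
    return strong(page_text), "escalated"


# Stubs standing in for real model calls (purely illustrative).
def cheap_stub(text: str) -> Result:
    cleaned = text.strip()
    return Result(cleaned, 0.9 if cleaned.isdigit() else 0.3)


def strong_stub(text: str) -> Result:
    return Result("".join(ch for ch in text if ch.isdigit()), 0.95)


print(run_with_fallback(" 1042 ", cheap_stub, strong_stub, str.isdigit))
print(run_with_fallback("INV 1042", cheap_stub, strong_stub, str.isdigit))
```

Note that only `page_text` is passed to the stronger path, not the whole document packet or the conversation so far, which is the point of the fallback described above.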
This is where tool calling loops get expensive in a hurry. One model call becomes a tool call, which becomes another model call, which resends the same context because the orchestration layer was built to be safe rather than selective. The write-up on unrolling the Codex agent loop (com/index/unrolling-the-codex-agent-loop/) is a good reminder that the loop itself needs to be visible, not hidden in a pile of retries.
In LLM orchestration, that visibility matters more than people expect. If the router can prune context before each step, the system spends less on tokens and less on repeated reads of the same material. If it can choose a smaller model for classification, a deterministic check for validation, and a larger model only for hard reasoning, the pipeline gets cheaper without getting brittle. The agent becomes less of a “let’s ask the expensive thing again” machine and more of a decision flow with limits, which is a much nicer way to ship software.
The takeaway for production teams
Once the routing layer is in place, the economics get a lot less mysterious. The big savings usually come from removing repeated passes, trimming back unnecessary context, and stopping the system from asking the same question three times in slightly different clothing. A team can spend days shaving a few tokens off a prompt and still lose money because the agent keeps looping through the same document, the same chat history, and the same tool results. That’s the part that tends to bite.
For production agents, the useful habit is to treat each step as a separate choice. What does this step actually need? A simple extraction? A classification? A confidence check? A bit of planning? A full reasoning pass? If you answer that honestly, the model choice usually becomes obvious. The mistake is to start with the model and then force the workflow to fit it. That tends to produce bloated calls and a lot of polite overkill.
A practical rule of thumb looks like this:
- Use a small model when the task is narrow, repetitive, or easy to verify.
- Use a large model when the step really needs broader reasoning, ambiguity handling, or cross-document synthesis.
- Use a specialized check when a narrow validator can confirm the result faster and cheaper than another full model pass.
That last point gets overlooked a lot. Teams often send everything back to the strongest model because it feels safe. In practice, a dedicated validator, a regex check, a schema parser, or a document-specific OCR pass can be enough for many steps. You don’t need a heavyweight answer engine just to confirm that a date is formatted correctly or that a field is missing. That kind of job is for a smaller, tighter tool chain.
There’s also a broader operational lesson here: the cheapest reliable path through the task is usually better than the fanciest single model. A workflow that uses one medium model, one verification step, and one escalation path can outperform a setup that sends every turn straight to the biggest model in the account. Lower latency. Lower spend. Less token churn. Fewer cases where the agent rereads half the world because nobody told it not to.
That shift in thinking matters more than the model leaderboard. Start from the model and you’ll often get an expensive answer. Start from what each step actually needs and the design gets clearer fast. Sometimes the answer is a small model. Sometimes it’s a larger one. Sometimes it’s a specialized check and no second pass at all. The point is to choose deliberately, step by step, instead of letting repetition set the bill.




