The real problem isn’t model size
A lot of teams hit the same wall in the same way. An AI agent gives an answer that looks polished, the model is clearly capable, and yet the output still misses something obvious to the people in the room. The natural reaction is to blame the model. If a smaller one is shaky, surely a bigger one will be steadier, right?
That guess makes sense. It also shows up everywhere now. AI is already sitting inside code review tools, ticket triage, documentation helpers, support bots, test generators, and the odd internal assistant that helps nobody until Friday afternoon. Once a workflow depends on AI at all, upgrading to larger language models starts to feel like the cleanest fix available. More parameters, better reasoning, fewer embarrassing misses. Nice story. Usually wrong.
A stronger model can answer more fluently, but it still can’t infer the project rules you never gave it.
The catch is easy to miss because model capability and project knowledge get tangled together. A model may be excellent at language, planning, or code synthesis, yet still fail when the missing piece is local context: a repo convention, a permissions boundary, a naming rule, a release process, a product constraint, or a business exception that lives in someone’s head. If that information never made it into the prompt, the model has two choices. It can guess, or it can refuse to guess. In practice, agents often do the first one with great confidence and a tidy sentence structure.
That’s why bigger models only partly solve the problem. They may reduce some kinds of mistakes. They may catch a subtle contradiction sooner, or write a better patch when the instructions are complete. But they don’t magically acquire the specific, messy, project-local details that make software teams function. They don’t know that one service is owned by finance, that one flag must stay off for staging, or that a certain table is treated as read-only except during a manual migration window. If those rules aren’t available at runtime, the agent is improvising. Fancy improvising, maybe, but still improvising.
The practical lesson is simple enough to be annoying: if the output is unreliable, inspect the system before you buy more model. Look at what the agent was allowed to see, what state it could carry forward, what rules were enforced outside the prompt, and what history it could reuse. In other words, treat reliability as a design problem. The model matters, sure. But the surrounding workflow often matters more.
That point gets sharper as AI agents move from demos into actual engineering work. A demo can survive on generic competence. Production work can’t. The agent has to operate inside your repo, your policies, your release process, your naming conventions, and your team’s half-written documentation. If the setup leaves those things implicit, a larger model will still be guessing at the edges. It may guess better, which is nice, but guessing is still guessing.
So the core distinction in this article is between capability and context. Capability is what the model can do in the abstract. Context is the project knowledge it actually receives before acting. When the second part is thin, size helps less than people expect. When the second part is solid, even a modest model can behave far more predictably. That’s the argument in plain terms: reliability comes from better system design, not scale alone.

What the prompt leaves out
Once you stop blaming model size for every weird answer, the next question gets more interesting: what did the prompt never tell the agent in the first place? That’s usually where the real failure lives.
A prompt can carry a lot, but it can’t magically absorb the stuff your team never wrote down. Permissions live in one place, naming conventions in another, and business rules in someone’s head or an old Slack thread nobody wants to search. The model may still produce something that sounds polished. It can even sound confident. That doesn’t mean it knows the shape of the system it’s acting inside.
Take access boundaries. An agent asked to update a ticket might be perfectly capable of drafting the change, but it doesn’t know whether it’s allowed to touch production settings, edit customer data, or regenerate credentials. Without that context, it may suggest actions that are technically sensible and operationally wrong. Same with feature flags. One team might treat a flag as a short-lived rollout switch. Another might use the same pattern to gate region-specific behavior for months. A model that sees only the prompt can’t tell the difference unless you feed it the rule.
Repository conventions create the same kind of trap. Maybe one module owns tax calculations, but checkout mirrors some values for display only and should never write back. Maybe `getUser()` returns the currently authenticated user in one repo and the account owner in another. Those differences sound tiny to a human who’s lived in the codebase for six months. To an agent, they’re the difference between a decent answer and a mess with a neat haircut.
If a rule lives only in someone’s head, a model will treat it like optional lore.
Internal terminology causes trouble too. Teams love using the same English word to mean three different things. “Customer” might mean the paying company, the end user, or the legal entity on the invoice. “Approval” might mean legal sign-off in one workflow and manager approval in another. A larger model may guess one of those meanings from general language patterns, but guessing isn’t the same as knowing. In prompt context, ambiguity often gets smoothed over until it becomes a bug.
Business rules are even less forgiving. Suppose a refund above $500 needs finance approval only for enterprise accounts, except in the EU, where legal review also kicks in. Or a data export is allowed for internal staff, but only after the request is logged and only if the record owner matches the tenant. Or a feature flag can be turned on for staging at will, but production rollout requires a ticket reference and a named approver. These are the details that make an agent look “smart” right up until it takes the wrong branch. The output reads fine. The process doesn’t.
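To make the refund example concrete, here is a minimal sketch of what that rule looks like once it is written down as code instead of living in someone’s head. It assumes one reading of the sentence above: refunds over $500 on enterprise accounts need finance approval, and in the EU those same refunds also need legal review. The names `RefundRequest`, `required_approvals`, and the tier and region labels are hypothetical.

```python
from dataclasses import dataclass

REFUND_THRESHOLD_USD = 500  # threshold taken from the example rule


@dataclass(frozen=True)
class RefundRequest:
    amount_usd: float
    account_tier: str  # hypothetical labels, e.g. "enterprise" or "standard"
    region: str        # hypothetical labels, e.g. "EU" or "US"


def required_approvals(req: RefundRequest) -> set[str]:
    """Return the approvals a refund needs under the assumed reading of the rule."""
    approvals: set[str] = set()
    over_threshold = req.amount_usd > REFUND_THRESHOLD_USD
    if over_threshold and req.account_tier == "enterprise":
        approvals.add("finance")
        # Assumption: legal review stacks on top of finance approval in the EU.
        if req.region == "EU":
            approvals.add("legal")
    return approvals


print(required_approvals(RefundRequest(600.0, "enterprise", "EU")))
```

Notice how much interpretation it took to get from one English sentence to ten lines of logic. An agent reading only the sentence has to make the same choices, and it won’t tell you which ones it made.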
That’s why this failure mode is usually not about raw intelligence. The model can reason. It can summarize. It can pattern-match like a champ. What it can’t do is invent project-specific facts that were never supplied and somehow know which invisible rule matters today. If you ask it to act inside a system, incomplete information is the problem. Bigger weights don’t fix a missing policy doc, a stale access list, or a naming rule that only exists in three people’s heads.
This is also where teams get fooled by good-looking demos. Even when the context is missing, the output still sounds plausible. That’s the annoying part. Plausible is cheap. Correct is tied to context.
OpenAI’s prompt engineering guidance points in the right direction here: the model does better when it gets the right constraints and source material up front. But “up front” has a hard ceiling. You can’t cram a whole org chart, repo history, policy set, and operational memory into every prompt without turning the thing into a ransom note. And if you try to test your way out of it, evaluation best practices matter because they expose these misses faster than vibes ever will. A small eval that checks only obvious cases will miss the weird edge where your team’s local rules live.
If the model needs live docs, task history, or tool access to answer safely, that missing context has to come from somewhere other than a one-off prompt. The same issue shows up in systems that connect models to external data and tools, including approaches described in MCP documentation. That’s not a feature request so much as a reality check: if the agent can’t see the rule, it will guess at the rule.
And guessing is where the trouble starts.

Turn context into infrastructure
Once you accept that the problem is missing project context, the next move gets a lot less mystical. Don’t ask the model to remember everything. Build a workflow that supplies the facts it needs, at the moment it needs them.
That starts with explicit state. A chat transcript is a terrible place to keep the truth about a real system. By the third turn, the model may be working from stale assumptions, half-finished plans, or a user correction it only partially absorbed. A better setup keeps current state in structured fields the agent can read directly: active branch, target environment, approved scope, open blockers, current file paths, customer tier, whatever actually governs the next action. When the model has to infer those details from prose, it will occasionally guess wrong in a very confident voice, which is a fun personality trait in a comedian and a bad one in software.
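As a rough illustration, explicit state can be as plain as a typed record that gets serialized into the agent’s input on every turn. This is a sketch under assumptions, not a prescribed schema: `WorkflowState` and `render_state` are hypothetical names, and the fields simply mirror the examples listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class WorkflowState:
    """Current facts the agent should read directly instead of inferring from chat."""
    active_branch: str
    target_environment: str
    approved_scope: list[str]
    open_blockers: list[str] = field(default_factory=list)
    current_file_paths: list[str] = field(default_factory=list)
    customer_tier: str = "unknown"


def render_state(state: WorkflowState) -> str:
    """Serialize state to stable, sorted JSON for inclusion in the agent's input."""
    return json.dumps(asdict(state), indent=2, sort_keys=True)


state = WorkflowState(
    active_branch="fix/invoice-rounding",      # illustrative values only
    target_environment="staging",
    approved_scope=["billing/"],
    open_blockers=["waiting on finance sign-off"],
)
print(render_state(state))
```

The point of the sorted, structured output is that the same state always renders the same way, so a change in the agent’s behavior can be traced to a change in a field rather than to the phrasing of turn three.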
If the system knows the rule, the model does not have to invent it.
Retrieval fills in the rest. The practical trick is simple: don’t stuff every possible detail into the prompt and hope the model sorts it out. Pull source-of-truth material at runtime. That can include product docs, ticket threads, code comments, runbooks, incident notes, and the last few completed tasks in the same workflow. If the agent is updating a billing rule, it should read the billing spec and the related ticket before it writes anything. If it’s touching auth code, it should see the repo conventions and any guardrail notes attached to that area. This is where AI reliability starts to feel less like a model problem and more like a data plumbing problem.
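Here is one deliberately simple way to sketch that runtime lookup: documents are tagged with the code areas they govern, and the workflow pulls only the kinds it wants for the area being touched. Real systems often use search or embeddings instead; this tag-based version is a stand-in, and `SourceDoc`, `retrieve_for_task`, and `build_context` are hypothetical names.

```python
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDoc:
    doc_id: str
    kind: str               # e.g. "spec", "ticket", "runbook"
    areas: frozenset[str]   # code areas this document governs
    text: str


def retrieve_for_task(
    docs: Sequence[SourceDoc], area: str, kinds: Sequence[str], limit: int = 5
) -> list[SourceDoc]:
    """Pick documents for one area, ordered by the caller's preferred kinds."""
    wanted = [d for d in docs if area in d.areas and d.kind in kinds]
    wanted.sort(key=lambda d: kinds.index(d.kind))  # stable sort keeps input order within a kind
    return wanted[:limit]


def build_context(docs: Sequence[SourceDoc]) -> str:
    """Join retrieved documents into a labeled block the agent can cite."""
    return "\n\n".join(f"[{d.kind}:{d.doc_id}]\n{d.text}" for d in docs)


library = [
    SourceDoc("T-1", "ticket", frozenset({"billing"}), "Round tax per line item."),
    SourceDoc("S-1", "spec", frozenset({"billing"}), "Billing spec text."),
    SourceDoc("R-1", "runbook", frozenset({"auth"}), "Auth guardrail notes."),
]
picked = retrieve_for_task(library, "billing", kinds=("spec", "ticket"))
print(build_context(picked))
```

Labeling each chunk with its kind and id is a small design choice that pays off later: when the agent gets something wrong, you can check whether the right document was in the input at all.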
Long context helps, but only if the right material is in there. A bigger context window lets you carry more text, not more wisdom. If the system feeds the model a pile of irrelevant logs and three contradictory drafts, a larger window just gives it more room to be confused. Anthropic’s guidance on prompting with long context is useful here, mostly because it treats context as something you curate, not something you dump in and pray over. OpenAI’s model optimization guidance points in the same direction: make inputs cleaner, more structured, and more directly tied to the task. That advice sounds boring until you compare the output quality before and after. Then it feels like finding a loose cable that was somehow responsible for half the outage.
Policy should live outside the model. This part gets skipped more often than it should, usually because teams hope the assistant will “just know” what it may access or modify. It won’t, and even if it did, that knowledge would be a soft constraint. Permissions, approval gates, and safety rules belong in the system layer where they can be checked every time. If an agent can open a pull request but can’t merge it, the application should enforce that boundary directly. If a workflow forbids deleting files outside a sandbox, that check should happen before the action runs. The model can suggest. The system can refuse. That separation keeps the agent from becoming a very talkative security hole.
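A small sketch of that separation, using the two boundaries mentioned above: the agent may open a pull request but not merge one, and file deletion is only allowed inside a sandbox. Everything here is illustrative. The action names, the `/workspace/sandbox` path, and the functions `check_policy`, `is_allowed`, and `run_action` are assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable


class PolicyViolation(Exception):
    """Raised when a proposed action breaks a system-level rule."""


@dataclass(frozen=True)
class ProposedAction:
    name: str              # what the model suggested, e.g. "open_pull_request"
    target_path: str = ""  # only meaningful for file actions


ALLOWED_ACTIONS = frozenset({"open_pull_request", "delete_file"})  # no merge
SANDBOX_ROOT = PurePosixPath("/workspace/sandbox")  # hypothetical sandbox location


def check_policy(action: ProposedAction) -> None:
    """Refuse anything outside the allowlist or outside the sandbox."""
    if action.name not in ALLOWED_ACTIONS:
        raise PolicyViolation(f"action not permitted: {action.name}")
    if action.name == "delete_file":
        path = PurePosixPath(action.target_path)
        # PurePosixPath does not collapse "..", so reject it explicitly.
        inside = (
            path != SANDBOX_ROOT
            and path.is_relative_to(SANDBOX_ROOT)
            and ".." not in path.parts
        )
        if not inside:
            raise PolicyViolation(f"path outside sandbox: {action.target_path}")


def is_allowed(action: ProposedAction) -> bool:
    try:
        check_policy(action)
    except PolicyViolation:
        return False
    return True


def run_action(action: ProposedAction, handler: Callable[[ProposedAction], str]) -> str:
    """The check always runs before the handler, whatever the model said."""
    check_policy(action)
    return handler(action)
```

The model never sees this code and doesn’t need to. Whether it “understood” the rule or not, a merge request simply doesn’t execute.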
The same logic applies to task history. Keep the record close to the workflow, not buried in a random doc someone last edited during a Friday panic. When an agent resumes a task, it should see what was already tried, what failed, what was approved, and what remains open. Otherwise it will rediscover old mistakes with fresh enthusiasm. Structured inputs help here too. A clean task object with fields for goal, constraints, acceptance criteria, owner, and prior actions gives the model something stable to work from. A wall of chat messages doesn’t.
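One possible shape for that task object, sketched with the fields named above plus a helper that turns prior actions into a short briefing for a resumed run. `Task`, `PriorAction`, and `resume_briefing` are hypothetical names, and the outcome labels are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PriorAction:
    description: str
    outcome: str  # assumed labels: "succeeded", "failed", "approved"


@dataclass
class Task:
    goal: str
    owner: str
    constraints: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    prior_actions: list[PriorAction] = field(default_factory=list)

    def record(self, description: str, outcome: str) -> None:
        """Append to history so the next run sees what this one did."""
        self.prior_actions.append(PriorAction(description, outcome))

    def failed_attempts(self) -> list[str]:
        return [a.description for a in self.prior_actions if a.outcome == "failed"]


def resume_briefing(task: Task) -> str:
    """Summarize the task for an agent picking it up again."""
    lines = [f"Goal: {task.goal}", f"Owner: {task.owner}"]
    lines += [f"Constraint: {c}" for c in task.constraints]
    lines += [f"Must satisfy: {c}" for c in task.acceptance_criteria]
    lines += [f"Already tried ({a.outcome}): {a.description}" for a in task.prior_actions]
    return "\n".join(lines)


task = Task(goal="Fix invoice rounding", owner="billing-team",
            constraints=["do not change tax tables"])
task.record("patched rounding in display layer", "failed")
task.record("patched rounding in calculation module", "approved")
print(resume_briefing(task))
```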
This also makes audits much easier. If the agent took a bad step, you want to know whether the fault came from missing retrieval, stale state, a weak policy check, or a bad instruction. That diagnosis is hard when everything lives inside one giant prompt. It gets much simpler when state, retrieved documents, policy decisions, and task history are separate parts of the workflow. You can inspect each piece. You can replay it. You can patch the broken part without rewriting the whole system.
For teams building agent workflows, this is the more durable path: treat context as an application concern, not a prompt-writing exercise. The model still matters. It just stops being the only place where knowledge can live.

Before you buy another model, fix the workflow
Once you’ve put explicit state, retrieval, policy checks, and task history around the agent, the model question gets a lot less dramatic. You can still choose a larger model, sure. Sometimes that helps with harder reasoning, better code edits, or fewer weird tool calls. But it should come after the workflow can already tell the agent what it needs to know. If the system is feeding it the right repo facts, permissions, and history, then model choice turns into tuning. Without that, a bigger model is often just a fancier way to guess.
That’s the practical rule I’d use in software engineering teams: audit the failures before you shop for a new model. Look at the cases where the agent went sideways and ask a boring set of questions. Was the needed detail sitting in a ticket, a README, or a Slack thread the agent never saw? Did it miss a naming convention because the docs were stale? Did it try an action it shouldn’t have because the tool layer didn’t block it? Did retrieval bring back the wrong file, or nothing useful at all? Those answers usually point to the real gap.
If the agent only works when the missing context happens to be in the prompt, the prompt is doing too much work.
That sounds obvious when written out, but teams still burn time trying to solve a workflow problem with a model swap. I get why. A larger model feels clean. One purchase, one API change, one hopeful benchmark run. Reality is messier. Agents fail in specific places, and those failures tend to map to missing information or weak guardrails rather than a lack of raw capability.
A quick test helps. Take the worst output and remove the model from the story for a minute. If the agent made a bad call because the rule was never encoded anywhere, the fix is policy. If it guessed wrong because the schema or repo state wasn’t available at runtime, the fix is retrieval or explicit state. If it repeated an old assumption because the task history got dropped between runs, the fix is persistence. In retrieval-augmented generation systems, that last part matters a lot. The model can only use what the workflow surfaces, and it can’t respect a boundary it never receives.
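If you want to run that test across a batch of failures instead of one, a tiny tally is enough. The sketch below encodes the three cause-to-fix pairs from the paragraph above and counts which fix comes up most. The `Gap` enum and `tally_fixes` are hypothetical names, and labeling each failure with a gap is still a human judgment call.

```python
from collections import Counter
from enum import Enum


class Gap(Enum):
    RULE_NEVER_ENCODED = "rule never encoded anywhere"
    CONTEXT_UNAVAILABLE = "schema or repo state unavailable at runtime"
    HISTORY_DROPPED = "task history dropped between runs"


# Cause-to-fix mapping taken directly from the quick test described above.
FIX_FOR_GAP = {
    Gap.RULE_NEVER_ENCODED: "policy",
    Gap.CONTEXT_UNAVAILABLE: "retrieval or explicit state",
    Gap.HISTORY_DROPPED: "persistence",
}


def tally_fixes(gaps: list[Gap]) -> Counter:
    """Count how often each kind of fix is called for across reviewed failures."""
    return Counter(FIX_FOR_GAP[g] for g in gaps)


reviewed = [Gap.RULE_NEVER_ENCODED, Gap.HISTORY_DROPPED, Gap.RULE_NEVER_ENCODED]
for fix, count in tally_fixes(reviewed).most_common():
    print(f"{fix}: {count}")
```

If none of your failures fit any of these buckets, that is useful information too. It is the point at which a model upgrade becomes a reasonable thing to test.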
That’s why the sequence matters. First, make the project facts machine-readable. Then make the permissions enforceable outside the model. Then make retrieval return the right material instead of a random pile of vaguely related text. After that, test whether the current model still leaves enough on the table to justify an upgrade. At that point, buying a larger model is a sensible tradeoff discussion, not a rescue mission.
There’s a nice side effect here. Once the workflow carries the context, failures become easier to inspect. That gives teams something concrete to fix, which is usually a better use of time than arguing about benchmarks over coffee.
So the last move is simple: treat the model as one part of the system, not the system itself. Build the context layer first. Then, if you still need more capability, upgrade the model with a clear reason. That order saves money, reduces surprise, and usually produces agents that behave like part of the team instead of a very confident intern with a partial memory.




