the 95% failure rate is not about the models

May 17, 2026
#ai
#agents
#enterprise
#context-engineering
#architecture

5% of enterprise ai projects deliver value. the number has not moved through three generations of frontier models. the operational layer is the moat, and context engineering is the discipline that builds it.

five percent of enterprise ai projects deliver value. the number has not moved through three generations of frontier models. the model is not the problem. the model is already strong. what is missing is the layer underneath that gets the right context to it at the right moment.

this is the headline finding of the most quietly important data set in ai right now. it also explains why the conversation in every c-suite room has the same shape. they spent millions on ai. the ap team is still doing ap the same way. month-end close is still 22 days. reps still hit quota at 24%. the crm still has the same 30% data decay it had in 2022. every leader believes in "going ai-first" and almost none of them can name a metric that has moved.

@vasuman, who runs varick agents and has spent eighteen months deploying ai inside enterprise teams at companies doing $500m to $5b, just published the cleanest articulation of why. sierra's @neilrahilly published the cleanest articulation of how to fix it, naming the missing discipline directly: context engineering, the practice of deciding what information an agent has access to at each moment and when it should be used. this article restates vasuman's thesis, layers in rahilly's framework, brings in the supporting data the broader literature has produced, and then takes it one step further. the operational layer is the moat. context engineering is the discipline that builds it. the teams that build both now will own the next decade. the teams that wait for agi will keep flatlining.

the data is consistent across six independent studies

pick any methodology you like. they converge.

| study | year | headline number |
| --- | --- | --- |
| mit nanda, genai divide | august 2025 | 5% of integrated ai pilots pull millions in value, the other 95% have nothing to show |
| bcg | 2025 | 4% have hit ai value at scale |
| deloitte | 2025 | 6% achieve ai roi within a year |
| rand | 2025 | 80%+ of ai projects fail, twice the rate of normal it projects |
| ibm | 2025 | 75% of ai initiatives have not delivered expected roi |
| mckinsey | 2025 | 78% of orgs use ai regularly, 80%+ report zero ebit impact |

the numbers cluster. the methodologies do not. the cluster is not a measurement artifact, it is the actual rate. and critically, the number has been flat through every model generation. gpt-3, gpt-4, gpt-5. claude 2 through claude opus 4.7. gemini 1 through gemini 3. each generation arrived with claims that "this is the one that fixes enterprise." each generation arrived to the same 5%.

a flat number across model generations is the strongest possible evidence that the model is not the bottleneck. if it were, we would have seen at least one generation move the number. three generations have shipped, the number has not moved, the bottleneck is somewhere else.

what did move, ai for engineers

there is one group for whom ai is working. at scale. repeatedly measured. across multiple methodologies.

| source | finding |
| --- | --- |
| github copilot study, 2024 | 55% faster on real tasks. 1h 11m vs 2h 41m on the same work. |
| anthropic internal study, august 2025 | 132 engineers, 100k real claude conversations. ~80% reduction in task completion time. |
| sundar pichai, january 2026 | 75% of new code at google is ai-generated and engineer-approved. the number was 30% in april 2025. |

engineers got 2x to 5x more productive. sales reps did not. finance teams did not. marketing teams did not. ops teams did not.

the natural follow-up question carries the entire load-bearing argument of this article. why?

the four properties that explain everything

software engineering has four properties that almost no other enterprise function has. these properties are what makes ai useful inside engineering and useless outside it. naming them is more important than any tool, framework, or model.

1. bounded. a function takes inputs and returns outputs. the scope of "fix this bug" lives inside a file or a module. dependencies are explicit and importable. the agent can see the perimeter of the problem.

2. checkable. compilers tell you in milliseconds whether the code parses. tests tell you whether it works. type systems catch entire classes of error before runtime. feedback loop, seconds. the agent gets a verdict on every action.

3. structured substrate. code lives in files, in version control, with a deterministic build pipeline. same input, same output. any state can be replayed. the environment is reproducible.

4. verifiable output. a pull request is a discrete artifact. a reviewer can look at the diff in ten minutes and say yes or no. the work has a clean go/no-go gate.

now contrast that with a finance close.

a finance close involves ap, ar, intercompany reconciliations, fx, accruals, journal entries, and exception handling that spans netsuite, concur, three banks, two erps from acquisitions, a custom intake form, and a slack channel where the controller flags weird stuff. the "process" is documented in an sop that does not match what actually happens. the output is "the close was clean," which takes two senior accountants two days to verify.

none of the four properties hold. not bounded (the work touches twelve systems). not checkable (no millisecond feedback). not structured (state is scattered). not verifiable (the output requires expert judgment over two days). and the same is true for sales ops, marketing ops, customer success, recruiting, legal review, supply chain.

this is also why every "ai for sales / finance / marketing / ops" startup pitches the same demo and dies in the same way. the demo runs against a clean toy version of the workflow. the actual workflow is the messy one. generic ai pointed at a non-bounded, non-checkable, non-structured, non-verifiable workflow gives negative roi. the operator was doing the work in 30 minutes. now they are doing the work in 30 minutes plus another 30 minutes correcting the ai's mistakes.

the labs and the tool vendors around them (openai, anthropic, cursor) poured everything they had into engineering because that is where the four properties hold. they got engineering working. they have not solved the rest, because the rest does not have the substrate.

context engineering, the discipline that builds the substrate

the four-property test tells you whether the substrate exists. context engineering is the discipline of building the substrate when it does not. sierra's neil rahilly published the cleanest framing: three eras of customer interaction that map exactly onto three eras of enterprise ai.

era 1, ivr. no reasoning. a menu. press 1 for billing, press 2 for returns. if the customer's issue does not match the menu, they are stuck.

era 2, flow. most ai agents shipped today live here. a predefined flowchart, a digitized sop, an "if this, then that" tree. customers can speak naturally, but the system still operates on rigid branches. as more sops are added, the system becomes harder to manage. when a real-world problem falls outside the flow, the agent escalates or hallucinates.

era 3, context engineering. the agent is guided by goals, constrained by guardrails, and the model itself drives the conversation. the job of the platform is not to script the agent. the job is to deliver the right context, at the right time, so the model can reason effectively and act correctly.

two ideas do the heavy lifting.

progressive disclosure. as tokens grow, recall and accuracy decline. every irrelevant token competes for the model's attention with the tokens that actually matter. the fix is to provide only the minimum, most relevant information at each moment. a customer calls about an international shipment. the agent does not need rules for every country upfront. it needs germany-specific guidance only after it learns the destination is germany. until then, the rest is noise.

conditions. what makes progressive disclosure work is conditions, the rules that decide when a piece of context becomes relevant. state-based: the customer is authenticated, a subscription is loaded, a tool returned specific data. observation-based: the customer mentioned a topic, expressed a desire to cancel, asked about a specific product. once a condition is met, the relevant context is unlocked. the conversation starts minimal and grows precisely as needed. workflows do not disappear. a regulated intake process still needs one. the shift is that the workflow becomes just another piece of context made available when conditions are met, not the organizing paradigm for the entire system.
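
to make the two ideas concrete, here is a minimal sketch of condition-gated context blocks in python. every name here (ContextBlock, AgentState, assemble_context) is hypothetical, not sierra's api; the shape is the point. blocks carry their own unlock conditions, and the prompt is assembled fresh each turn from whichever conditions currently hold.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class AgentState:
    authenticated: bool = False
    destination_country: Optional[str] = None
    mentioned_topics: set = field(default_factory=set)

@dataclass
class ContextBlock:
    name: str
    content: str
    condition: Callable[[AgentState], bool]  # when this returns true, the block unlocks

BLOCKS = [
    # live at minute zero: tiny, always-on guidance
    ContextBlock("base", "you are a shipping support agent. be concise.",
                 lambda s: True),
    # state-based condition: unlocked by a tool result, not a scripted branch
    ContextBlock("germany_rules", "german shipments: customs form cn23, vat handling, ...",
                 lambda s: s.destination_country == "DE"),
    # observation-based condition: unlocked by what the customer said
    ContextBlock("cancellation_policy", "retention flow: offer a pause before a cancel, ...",
                 lambda s: "cancel" in s.mentioned_topics),
]

def assemble_context(state: AgentState) -> str:
    """progressive disclosure: only blocks whose condition holds enter the prompt."""
    return "\n\n".join(b.content for b in BLOCKS if b.condition(state))

# turn 1: nothing is known yet, so only the base block ships
state = AgentState()
print(assemble_context(state))

# turn 3: a tool call revealed the destination, so the germany block unlocks
state.destination_country = "DE"
print(assemble_context(state))
```

the base block is live at minute zero. the germany block only costs tokens once a tool call reveals the destination. everything else stays out of the window and out of the model's attention.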

this is the same pattern agents.md, design.md, vault-as-context, and progressive task disclosure all converge on. sierra named the discipline. the discipline is the answer to "how do you build the substrate when the work is finance, sales, ops, support, not engineering."

this is exactly what the flat 5% number is telling us. the model is not the bottleneck. the model is already strong. what is missing is the layer that gets the right context to it at the right moment. hardcode logic into your agent and you cap its capability at your foresight. build context engineering and your agent inherits every model improvement that ships. an agent that handles five journeys can survive with loose context management. one that handles fifty cannot. without the discipline, the model gets overwhelmed and the experience degrades, regardless of how smart the underlying weights become.

the teams hitting 95% failure are the ones still living in era 2, scripting flows around an era 3 model. the teams in the 5% are the ones who let the model drive and built the discipline of feeding it context on purpose.

the four failure modes

vasuman names four. each one is a real pattern across his eighteen months of enterprise deployments. i have seen each one in the codebases of every team i have audited. together they account for almost every failed ai initiative.

1. they skip the audit

the single biggest predictor of failure. the team starts building before they understand the workflow they are supposedly automating.

the actual workflow always includes things the sop does not mention. the "i always check this spreadsheet first" step. the "i email sarah directly because the system notification does not work" step. the seventeen exception types the team handles every month. the unwritten rule that anything over $5m loops in the controller, even though the threshold says $10m.

vasuman calls the gap between sop and reality the conformance gap. in typical engagements he sees a 30% gap. in exception-heavy workflows like ap exception handling or supply chain disruption, it routinely exceeds 70%.

when you build for the documented process, you automate 70% of the volume and break on 30%. the 30% that breaks creates more work than the team had before, because now they have to fix the ai's mistakes on top of doing the work. the audit takes four weeks minimum. it does not feel like ai work, which is why most teams skip it. it is also the medicine you have to take before surgery, and ai implementations are surgery.

2. they throw everything at the llm

once you have an llm, every problem looks llm-shaped. need to extract a value from a document? ask the model. compare two values? ask the model. route based on a number? ask the model.

the team builds an architecture that is 90% llm calls and 10% code. the system is slow, expensive, and hallucinates 10% of the time, which is fine for a chat interface and unacceptable for ap automation.

this compounds with the audit problem. teams that did not audit do not know which parts are pattern-matchable and which need real judgment. so they default to the llm for everything and ship something that breaks. a boring agent is a working agent.
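
a sketch of what the decomposed version looks like, with hypothetical ap-flavored helpers. three of the four steps are plain code with deterministic verdicts; the model is invoked exactly once, where judgment is genuinely required.

```python
import re

def extract_invoice_total(text: str):
    """deterministic: a regex, not a model call. same input, same output."""
    m = re.search(r"total[:\s]+\$?([\d,]+\.\d{2})", text, re.IGNORECASE)
    return float(m.group(1).replace(",", "")) if m else None

def totals_match(invoice_total: float, po_total: float, tolerance: float = 0.01) -> bool:
    """deterministic: comparing two values is arithmetic, not judgment."""
    return abs(invoice_total - po_total) <= tolerance

def route_by_amount(total: float) -> str:
    """deterministic: thresholds are code, including the unwritten
    $5m controller rule the audit surfaced."""
    return "controller_review" if total >= 5_000_000 else "standard_queue"

def classify_exception(invoice_text: str, llm) -> str:
    """the one judgment call: an ambiguous exception genuinely needs a model.
    llm is any callable that takes a prompt string and returns text."""
    return llm("classify this invoice exception as one of "
               "[duplicate, pricing_dispute, missing_po, other]:\n" + invoice_text)
```

the 90%-llm-calls version of this pipeline asks the model all four questions. the decomposed version asks it one, and the other three steps never hallucinate.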

3. agent sprawl

the quietest and deadliest failure mode. it only shows up after month six. by the time you notice, the budget is gone.

sarah in ap builds her own agent in lovable to classify invoices. the controller spins up a separate one for intercompany. the fp&a lead vibe-codes a variance reporter. the cro's chief of staff has a personal agent for qbr prep. the marketing manager built a content agent. the recruiting coordinator built a candidate screener.

multiply that across a 200-person operations org. fifty to a hundred separate ai workflows. each built by a different person. each with its own quirks. each solving similar problems in seven different ways. no common spine. no shared memory. no shared knowledge of how the company actually runs.

marketing's content agent has zero awareness that customer support is dealing with fifty tickets about the exact thing it is writing copy about. finance's invoice agent has no idea that procurement just blacklisted that vendor last week.

then the inevitable. a model gets deprecated. an api endpoint changes. an employee leaves. a vendor pushes a breaking change. fifty of the hundred agents break in production. nobody is on call (vibe-coders can build, they cannot fix). the cto's engineering team is suddenly playing janitor for ai workflows nobody owns.

worse than maintenance, the security and compliance shape is unbounded. every personal agent carries its own api keys, its own data access, its own potential for exfiltration. the legal team finds out three months later that someone's marketing agent has been blasting customer data into a third-party llm api that was not on the approved vendor list.

the fix is architectural and has to be planned from day one. a single orchestration layer that sits on top of the existing stack. shared infrastructure for ingestion, approvals, audit logging, model routing, knowledge. every new use case lands as configuration on top of one platform. no more bespoke vibe-coded side projects.

the economics of this compound. the first agent on the platform takes 12 weeks. the next takes 9. the third takes 4. without the platform, every agent costs roughly the same to build and the integration debt eventually consumes the entire ai budget.
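
a minimal sketch of what "every new use case lands as configuration" can look like. the registry, field names, and injected helpers here are all hypothetical; the shape is the point. one shared entry point owns audit logging, approvals, and model resolution, and a new agent is a dict entry, not a new codebase.

```python
# hypothetical config-driven orchestration layer: agents are declared as data.
AGENTS = {
    "invoice_classifier": {
        "owner": "finance",
        "task": "classify_exception",             # resolved by the shared routing layer
        "context_blocks": ["ap_sop", "vendor_blacklist"],  # shared knowledge, not a private copy
        "requires_approval": True,                # humans gate high-stakes actions
    },
    "qbr_prep": {
        "owner": "sales",
        "task": "draft_summary",
        "context_blocks": ["crm_snapshot", "support_ticket_digest"],
        "requires_approval": False,
    },
}

audit_log: list = []  # one log for every team's agents, not fifty private ones

def run_agent(name: str, request: str, call_model, load_block) -> str:
    """the single shared entry point: ingestion, approvals, and audit live here once.
    call_model and load_block are injected callables supplied by the platform."""
    cfg = AGENTS[name]
    context = "\n\n".join(load_block(b) for b in cfg["context_blocks"])
    audit_log.append({"agent": name, "owner": cfg["owner"], "request": request})
    output = call_model(cfg["task"], context, request)
    if cfg["requires_approval"]:
        return f"[queued for human approval: {name}]\n{output}"
    return output
```

the compounding economics above fall out of this shape. the second agent reuses the entry point, the log, the approval queue, and every context block it shares with the first, so most of its cost is the dict entry and whatever blocks it does not already inherit.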

4. they treat ai as a side-project instead of infrastructure

the slowest, hardest-to-spot, most expensive failure mode over time.

most companies budget ai like any other software project. plan, build, ship, declare victory, move on. that logic works for traditional software because once you build it, it stays built. ai is the opposite. every quarter, something underneath shifts. a new release is dramatically better at your workload, or worse, the model you depended on quietly gets distilled and degrades.

anthropic alone retired roughly 9 models in 18 months. openai retired even more. pricing changes quarterly. rate limits get tightened with no warning. in april 2026 anthropic publicly acknowledged that engineering errors had degraded claude code performance for over a month, and that paid max subscribers were hitting their quota in 19 minutes instead of the advertised 5 hours. production workloads built on those guarantees had already broken.

the deployments that actually pay off treat ai as continuously evolving infrastructure with a dedicated team that owns ongoing optimization. they monitor quality. they swap models when better ones ship. they retire agents that have stopped earning their keep. they renegotiate when something breaks underneath them. most importantly they avoid vendor lock-in.

a six-month project plan against a market that moves quarterly is a project plan that is wrong by month two.

the five things the 5% do

vasuman's pattern, distilled.

  1. audit before build. four weeks minimum of mapping the actual workflow before anyone touches a model. output is a digital twin, a live map of how work moves through the org, where the conformance gaps are, what is pattern-matchable, what genuinely needs human judgment. the document matters less than the alignment it forces between the ai team and the operators.

  2. decompose until most of the work is deterministic, then context-engineer the rest. llm goes only where judgment is required. plain code goes everywhere else. most production systems end up as 5 to 10 deterministic steps with maybe one or two model calls in specific places. where the llm does run, treat its inputs as a context engineering problem, not a prompt problem. build composable context blocks each guarded by a state or observation condition, and disclose them progressively as the conversation unfolds. the blocks that should be live at minute zero are tiny. the blocks for "international shipment to germany with a disputed charge" only load when those conditions hit. boring in production is the goal.

  3. build a single orchestration layer. finance, sales, ops, and engineering agents all live on the same platform. they share context. they can talk when they need to. every new use case lands as configuration (context blocks plus conditions) on top of the platform. sprawl is dead on arrival.

  4. stay model-agnostic. abstractions get built at the task level, not the model level. each step routes to the best-fit model at any given moment. when openai deprecates a model or anthropic ships something dramatically better, the routing layer absorbs the change. the workflow keeps running. a minimal routing sketch follows this list.

  5. treat the deployment as continuously evolving infrastructure. a real team is responsible for ongoing tuning. agents that are not earning their keep get retired. improvements ship every quarter, often every month. the deployments that pay off over five years are the ones tuned monthly. not the ones declared "done" at go-live.
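
the routing sketch referenced in item 4. the model ids and the DeprecatedModelError are made up, and client stands in for any generic completion client with a complete method; the load-bearing idea is that workflow code names tasks, and only the routing table names models.

```python
class DeprecatedModelError(Exception):
    """raised by the (hypothetical) client when a vendor retires a model."""

TASK_ROUTES = {
    # abstractions live at the task level; only this table knows model names
    "extract_field":      {"model": "cheap-fast-model", "fallback": "mid-tier-model"},
    "classify_exception": {"model": "mid-tier-model",   "fallback": "frontier-model"},
    "draft_escalation":   {"model": "frontier-model",   "fallback": "mid-tier-model"},
}

def call(task: str, prompt: str, client) -> str:
    """workflow code calls tasks. the routing layer absorbs the model market."""
    route = TASK_ROUTES[task]
    try:
        return client.complete(model=route["model"], prompt=prompt)
    except DeprecatedModelError:
        # a retired model is a one-line table edit, not a workflow rewrite
        return client.complete(model=route["fallback"], prompt=prompt)
```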

the same pattern, applied at the engineer's scale

vasuman is writing for the c-suite. the same pattern applies at the level of an individual engineer building an agent.

the four properties (bounded, checkable, structured, verifiable) are also the test for whether you are about to ship a working agent or a failing one.

| engineer-scale question | what to do if the answer is no |
| --- | --- |
| is the task bounded? can you draw a perimeter? | reduce scope until you can. ship a smaller agent first. |
| is the output checkable? can a script verify it? | build the verifier first (sketch below). without it, you have no signal. |
| is the substrate structured? files in git, replayable state, reproducible build? | invest in the substrate before the agent. the substrate is the agent's environment. credentials, versioned filesystem, orchestration. |
| is the output verifiable? can a reviewer say yes or no in ten minutes? | define the artifact and the gate before the agent. prs work because the diff is the artifact and the review is the gate. |
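
to make the "build the verifier first" row concrete, a minimal sketch, assuming an invoice-extraction task where the check is pure arithmetic. the agent callable and the error-feedback loop are illustrative; the load-bearing part is that the verifier is deterministic code, built before the agent, so every action gets a verdict.

```python
def check_output(result: dict) -> list:
    """deterministic verifier: every failure is a named, actionable error."""
    errors = []
    if result.get("total") is None:
        errors.append("missing total")
    elif result["total"] != sum(li["amount"] for li in result.get("line_items", [])):
        errors.append("line items do not sum to total")
    return errors

def run_with_verifier(task: str, agent, max_attempts: int = 3) -> dict:
    """no verdict, no agent: this loop only exists because the verifier exists."""
    errors = []
    for _ in range(max_attempts):
        result = agent(task)                 # agent is any callable returning a dict
        errors = check_output(result)
        if not errors:
            return result                    # the clean go/no-go gate
        task = task + "\nfix these errors: " + ", ".join(errors)  # feed the verdict back
    raise RuntimeError(f"failed verification after {max_attempts} attempts: {errors}")
```

this is the engineering feedback loop transplanted: the verifier plays the role the compiler and the test suite play in code.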

the four-property test is also the first audit step at engineer scale. before writing the prompt or picking the framework, ask whether the work itself has the four properties. if it does not, build the missing properties into the environment first. the environment is the project. the model is one component inside it.

this is the same conclusion the harness-engineering literature has reached from a different angle. same model, different harness, 52.8% to 66.5% on terminal bench 2.0, a 14-point swing from the harness alone. if you are not the model, you are the harness. the harness is the operational layer. the operational layer is the moat.

the labs are conceding the same point

the clearest signal that the operational layer is the bottleneck comes from the labs themselves.

openai shipped codex cloud, which is openai doing the operational fieldwork inside customer organizations. anthropic shipped managed agents and claude design, which are anthropic doing the same thing in different surfaces. cursor's arr moved past anthropic's api revenue because cursor is not selling the model, it is selling the harness around the model. sierra is doing the same thing on the customer-experience surface, the entire product is context engineering as a service, with ghostwriter generating context blocks and conditions from existing sops, journeys giving operators a no-code editor, and an agent sdk for teams that want to manage the layer as code. every model lab and every serious agent vendor has now figured out that selling the model alone is not enough. selling the runtime alone is not enough. you need the model plus the runtime plus the context engineering discipline plus a team that goes and embeds inside the enterprise and figures out what to actually build for it.

the labs are making the same bet from the supply side. the 5% of enterprises winning are making the same bet from the demand side. if you are an engineer building agents, you are now in the same business as the labs. the work is operational, not algorithmic. the leverage is in the harness, not the weights.

the bet i would take

if a ceo walked in tomorrow with $1m and six months and said "deploy ai in our enterprise, please do not fail," vasuman's path is correct.

| month | work |
| --- | --- |
| 1 | audit. embed with operators. map conformance gap. produce digital twin |
| 2 | architect highest-leverage workflow. decide pattern vs judgment. pick model per step. define hitl touchpoints and audit logs |
| 2.5–4 | build. soft launch with humans approving every action. track accuracy daily. fix what breaks before going wider |
| 4–6 | go live. humans still in the loop on high-stakes decisions. watch the metrics that matter (cycle time, error rate, throughput) for at least two weeks before declaring a win |
| 6 | stand up the continuous-tuning function. the model market will change next quarter. your workflow will change next quarter |

by the end of six months you have one workflow live in production with real roi on the books, a platform that absorbs the next workflow in 8 weeks instead of 24, and a tuning team that keeps the whole thing alive going forward.

the same shape, scaled to a personal project. spend the first week on the workflow audit, not the prompt. spend the second week on the orchestration layer that the next four agents will land on top of. build the first agent in week three, watch it for a week, ship the second agent in week five at half the cost.

closing

the "models got smart" chapter is already over. we had a glorious run. the model is already strong enough. it has been strong enough for at least two generations. the next decade belongs to the teams that build the operational layer underneath the models, the ones who let the model drive and feed it the right context at the right moment, not the ones who spend another five years pouring frontier ai onto a mess of systems and wondering why nothing changed.

the discipline has a name now. sierra called it. context engineering. progressive disclosure of relevant information, gated by state and observation conditions, built on a single orchestration layer, tuned continuously as the model market moves underneath it. the 5% are not lucky. they are operational, and operational means context-engineered. that is the whole game.

sources. vasuman, the operational layer is the moat. neil rahilly, context engineering, the key to great agents. 5% / 95% framing originates with mit nanda's genai divide report, august 2025. convergence across bcg, deloitte, rand, ibm, mckinsey gives the number its weight.