Ethan Kurzweil
Ivory Tang

The Selective Autonomy Imperative: The Trillion-Dollar Opportunity in Enterprise AI Belongs to Builders Who Master Trust

Every year since 2023 has been declared “the year enterprise agents finally take off.” And every year, the same question lingers: What’s taking so long?

The obvious answers are familiar: better memory, more reliable tool calling, smarter models. But they miss the main impediment.

We’ve had capable reasoning models for over a year now. The bottleneck isn’t intelligence. It’s trust and the engineering harness required to turn intelligence into a product experience an enterprise can actually bet on.

The question isn’t whether agents are possible. It’s who will master the craft, the obsessive attention to detail, required to build agents that endure in production.

Three shifts are converging to make this moment real:

  1. The control plane is being built. Enterprises can increasingly manage agents like they manage employees: observe what they’re doing, verify who they are, control what they’re allowed to do. Best practices around observability, identity, and authorization are making governance tractable.
  2. The “work environment” is emerging. A growing ecosystem of primitives is creating a standard surface where agents execute, outputs get reviewed, and workflows become reusable.
  3. Buyers are getting sharper. Spend is finally coupling to ROI. That means fewer experiments but much deeper adoption when a use case actually works.

Building Trust in Agents

Let’s be precise about what we mean by agents. Anthropic defines them as: “systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.”

Most production AI today doesn’t need that level of autonomy. A single, well-tuned model call with retrieval handles the majority of time-saving use cases: support resolution, PDF extraction, CRM field mapping. These aren’t agents by the definition above; they’re workflows with a language model in the middle.

And even here, the trust gap is real. LLMs are nondeterministic. They hallucinate. They drift. Vercel reportedly built an entire framework layer (zod + specialized prompting) just to get models to follow a fixed output schema 99% of the time, up from 97%. That 2% gap sounds small until you’re running 10,000 daily transactions and the errors compound.
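To make that concrete, here is a minimal sketch of the schema-enforcement pattern, not Vercel’s actual framework: a hypothetical `InvoiceSchema`, a generic `callModel` stand-in for whatever LLM client a team already uses, and a retry loop that feeds validation failures back into the prompt.

```typescript
import { z } from "zod";

// Hypothetical schema; the real one depends on the workflow being automated.
const InvoiceSchema = z.object({
  vendor: z.string(),
  amountUsd: z.number().nonnegative(),
  dueDate: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
});

// callModel is a stand-in for whatever model client the team already uses.
async function extractInvoice(
  document: string,
  callModel: (prompt: string) => Promise<string>,
  maxAttempts = 3
): Promise<z.infer<typeof InvoiceSchema>> {
  let lastError = "";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await callModel(
      "Extract vendor, amountUsd, and dueDate (YYYY-MM-DD) as a single JSON object.\n" +
        (lastError ? `Your previous output failed validation: ${lastError}\n` : "") +
        document
    );
    let candidate: unknown;
    try {
      candidate = JSON.parse(raw);
    } catch {
      lastError = "output was not valid JSON";
      continue;
    }
    const parsed = InvoiceSchema.safeParse(candidate);
    if (parsed.success) return parsed.data; // schema holds: safe to hand downstream
    lastError = parsed.error.message;       // feed the failure back into the next attempt
  }
  throw new Error("model output never matched the schema");
}
```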

Agents magnify this trust problem exponentially.

Every additional turn introduces another opportunity for hallucination, misalignment, or subtle drift, and unlike a single model call, errors cascade. A wrong decision in step 3 doesn’t stay in step 3. It contaminates steps 4 through 12. Multiply that across thousands of concurrent sessions and long-lived workflows, and trust stops being a product question.

It becomes the central engineering challenge. This is where the harness comes in.


The Agent Harness

An agent harness is the infrastructure that wraps around an AI model — the chassis and controls around the engine.

It’s what enables an agent to reliably manage and execute long-running tasks, not just respond to single-turn prompts.

The harness manages the full lifecycle:

  • breaking goals into steps
  • orchestrating tool use (file access, code execution, API calls)
  • maintaining memory and state across sessions
  • tracking progress over time
  • integrating human review where needed

It handles planning, decides which tools to use, manages context, and determines when to ask for help versus proceed autonomously. This is what moves agents from “cool demo” to “system that can run in production.”
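In code, the shape of a harness is roughly a loop around plan, execute, persist, and escalate. Here is a minimal sketch under assumed interfaces; the `Harness` methods below are illustrative, not any particular SDK’s API.

```typescript
type Step = { description: string; tool: string; input: unknown };
type StepResult = { step: Step; output: unknown; ok: boolean };

interface Harness {
  plan(goal: string, history: StepResult[]): Promise<Step | "done" | "ask-human">;
  execute(step: Step): Promise<StepResult>;       // tool use: file access, code execution, API calls
  persist(history: StepResult[]): Promise<void>;  // memory and state across sessions
  escalate(history: StepResult[]): Promise<void>; // human review where needed
}

async function run(goal: string, harness: Harness, maxSteps = 50): Promise<StepResult[]> {
  const history: StepResult[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const next = await harness.plan(goal, history); // break the goal into the next step
    if (next === "done") break;
    if (next === "ask-human") {
      await harness.escalate(history);              // ask for help instead of guessing
      break;
    }
    history.push(await harness.execute(next));
    await harness.persist(history);                 // progress survives restarts
  }
  return history;
}
```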


The Control Plane

A working harness isn’t a production-ready harness. True agents — the kind that handle airline booking across tens of thousands of sessions, or maintain coherent design decisions across a multi-month project — face dramatically higher trust requirements. These are long-lived workflows with branching paths, real-world consequences, and large user populations.

To make agents production-ready, you need three things:

1. Evals

Every run needs automatic checks:

  • Did the agent answer the right question?
  • Did it follow instructions?
  • Did it avoid disallowed topics?
  • Did it stay within constraints?

Model-based graders can assess both quality and goal completion, but only if you build the machinery and run it continuously.
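A minimal sketch of what those continuous, model-graded checks might look like, assuming a generic `grader` call standing in for an LLM-as-judge; the check phrasing and `RunRecord` fields are illustrative.

```typescript
type RunRecord = { task: string; instructions: string; output: string };

const CHECKS = [
  "Did the agent answer the right question?",
  "Did it follow the instructions?",
  "Did it avoid disallowed topics?",
  "Did it stay within the stated constraints?",
];

async function gradeRun(
  run: RunRecord,
  grader: (prompt: string) => Promise<string> // stand-in for an LLM-as-judge call
): Promise<{ check: string; pass: boolean }[]> {
  const results: { check: string; pass: boolean }[] = [];
  for (const check of CHECKS) {
    const verdict = await grader(
      `Task: ${run.task}\nInstructions: ${run.instructions}\nOutput: ${run.output}\n` +
        `Question: ${check}\nAnswer strictly PASS or FAIL.`
    );
    results.push({ check, pass: verdict.trim().toUpperCase().startsWith("PASS") });
  }
  return results; // run on every production run, not just in CI
}
```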

2. Observability

With traditional software, you can reason about behavior from inputs and outputs. With agents, the path matters as much as the destination, and the path is different every time.

Real observability means capturing the full trajectory of each run:

  • what tools were called, in what order
  • what context was retrieved (and what was discarded)
  • how the plan evolved as new information appeared
  • what branches were considered and rejected
  • where the agent spent its token budget

This isn’t logging. It’s a replayable narrative of decision-making — auditable, debuggable, learnable.
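A minimal sketch of what trajectory capture could look like. The event shapes are assumptions, not any specific tracing product’s schema; in practice these events would be emitted as spans to an observability backend.

```typescript
type TraceEvent =
  | { kind: "tool_call"; name: string; args: unknown; startedAt: number }
  | { kind: "retrieval"; kept: string[]; discarded: string[] }
  | { kind: "plan_update"; before: string; after: string; reason: string }
  | { kind: "branch_rejected"; option: string; why: string }
  | { kind: "tokens"; phase: string; count: number };

class Trajectory {
  private events: TraceEvent[] = [];

  record(event: TraceEvent) {
    this.events.push(event); // append-only, so the run can be replayed in order
  }

  replay(): readonly TraceEvent[] {
    return this.events;      // the auditable narrative of decision-making
  }

  tokensByPhase(): Record<string, number> {
    const totals: Record<string, number> = {};
    for (const e of this.events) {
      if (e.kind === "tokens") totals[e.phase] = (totals[e.phase] ?? 0) + e.count;
    }
    return totals;           // where the agent spent its token budget
  }
}
```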

At the aggregate level, you need dashboards that surface patterns across thousands of runs:

  • failure rates by task type
  • latency distributions
  • quality scores over time
  • cost per successful completion

This is how you catch that performance dropped 15% after last week’s prompt change or that one tool is timing out for 8% of users in a region.
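A minimal sketch of the aggregate side, assuming illustrative `CompletedRun` fields; real dashboards would sit on top of queries like these.

```typescript
type CompletedRun = {
  taskType: string;
  success: boolean;
  latencyMs: number;
  qualityScore: number; // e.g. produced by the graders above
  costUsd: number;
};

// Failure rate broken out by task type, so a regression in one workflow
// doesn't hide inside a healthy overall average.
function failureRateByTaskType(runs: CompletedRun[]): Record<string, number> {
  const totals: Record<string, { runs: number; failures: number }> = {};
  for (const r of runs) {
    const t = (totals[r.taskType] ??= { runs: 0, failures: 0 });
    t.runs++;
    if (!r.success) t.failures++;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([k, v]) => [k, v.failures / v.runs])
  );
}

// Cost per successful completion: total spend divided by successful runs.
function costPerSuccess(runs: CompletedRun[]): number {
  const spent = runs.reduce((sum, r) => sum + r.costUsd, 0);
  const successes = runs.filter((r) => r.success).length;
  return successes ? spent / successes : Infinity;
}
```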

The failure modes observability catches tend to fall into three buckets:

  • Silent failures: answers that look plausible but are subtly wrong (wrong quarter, stale data, missing caveats).
  • Cascading errors: an early mistake compounds through the workflow until the final output is garbage.
  • Drift and degradation: behavior shifts slowly over time as retrieved content, APIs, or prompts change.

This is why observability is non-negotiable: it’s how you catch that 100 generations went sideways last night and, more importantly, how you trace the root cause before tomorrow night.

3. Authorization

The scariest failure mode (say, an agent deleting a production database) usually isn’t a model failure. It’s an authorization failure.

Agents need to be treated like users:

  • authenticate through enterprise identity systems
  • receive an identity
  • get the narrowest permissions possible

But with one critical addition: you must be able to see every action the agent took or even considered.
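A minimal sketch of what that looks like in practice: scope-checked tool execution with an audit trail that records denied attempts as well as allowed ones. The identity and scope types here are assumptions, not a specific vendor’s API.

```typescript
type AgentIdentity = { id: string; grantedScopes: Set<string> };
type AuditEntry = { agentId: string; action: string; allowed: boolean; at: number };

const auditLog: AuditEntry[] = [];

// Record every attempt, including denials, so you can see what the agent
// did or even considered doing.
function authorize(agent: AgentIdentity, requiredScope: string, action: string): boolean {
  const allowed = agent.grantedScopes.has(requiredScope); // narrowest permissions win
  auditLog.push({ agentId: agent.id, action, allowed, at: Date.now() });
  return allowed;
}

async function runTool<T>(
  agent: AgentIdentity,
  requiredScope: string,
  action: string,
  tool: () => Promise<T>
): Promise<T> {
  if (!authorize(agent, requiredScope, action)) {
    throw new Error(`agent ${agent.id} is not permitted to ${action}`);
  }
  return tool();
}
```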

The key insight: trust is now an engineering problem, not an unsolved research problem. The primitives exist. The patterns are emerging. Evals, observability, and authorization are the bridge from prototype to production — the difference between a weekend project and a system you’d bet your company on.


The General-Purpose Agent Work Environment Emerges

Something important is happening in how teams build and deploy AI capabilities.

Claude’s Skills framework offers a glimpse: a standard format that lets teams author, share, and install reusable capabilities. But instead of static libraries, these are behavioral modules. They encode how work gets done, what steps matter, what “good” looks like, and how context should be interpreted. They’re closer to organizational muscle memory than software dependencies.

Meanwhile, Claude Code has evolved from a coding tool into a multi-purpose work environment. Teams aren’t just prototyping; they’re operating inside their filesystem and the web: drafting deliverables, managing billing, handling communications, planning and shipping production code.

But this does not mean general-purpose environments will own all work.

They’ll push downward into execution, but they won’t fully absorb domain-specific deployment surfaces. They’re becoming the place where intent is expressed and experiments begin, not the place where high-stakes production work ultimately lives.

The model ecosystem is also solving a coordination problem that has plagued AI adoption. When individuals use AI privately, gains are scattered and incremental. When workflows become shared (skills, formats, CLAUDE.md), leverage compounds institutionally. The ecosystem tightens across desktop, CLI, web. The organization standardizes around the fastest path. The “right way” becomes whatever the environment makes easiest.

This is why the biggest winners won’t simply have the most capable agents. They’ll capture the most real work inside production systems with clear ownership, permissions, and measurable outcomes: closing books, resolving incidents, completing high-value workflows with ROI. And contrary to what some believe, general-purpose agents don’t eliminate startup opportunity. They clarify it.

They expose the gaps: brittle connectors, unsafe defaults, noisy interfaces, and too much babysitting. Those gaps define the startup surface, and they’re where serious value will be captured.


Opinionated Buyers Emerge

Buyer behavior is changing.

The last wave of enterprise AI adoption was driven by FOMO: teams felt pressure to “have agents,” experimental budgets ballooned, proofs-of-concept multiplied, and very little made it to production.

What’s different now is simple: when models, workflows, and execution surfaces converge into a cohesive environment, complexity and weak ROI become impossible to hide.

The question is shifting from: “How do we add agents everywhere?” to: “Where does autonomy actually pay for itself?”

The result: less experimentation and clearer build-versus-buy decisions, but dramatically higher commitment when an agent proves a real business case.

Teams will spend an order of magnitude more when they see an agent replacing real work. That’s where stickiness gets earned and where enterprise budgets unlock.


What This Means for Builders

Here’s what the consensus gets wrong: agents will explode, but not indiscriminately.

They’ll explode where trust infrastructure exists. The bottleneck isn’t the model. It’s deployment.

Claude Code, MCP, agent SDKs, Skills — the barrier to spinning up task-specific agents has collapsed. The cost to deploy “good enough” automations is approaching zero, and we’re already seeing the Cambrian explosion.

But that explosion commoditizes capability. When anyone can build, or even just prompt, an agent, the agent itself isn’t the moat.

The opportunity is in owning the deployment layer: the surface where agents execute, earn trust, and become repeatable infrastructure. Where work happens. Where outcomes get approved, shared, and reused.

That shifts what the best target use cases look like. They’ll be:

  • Complex enough to justify autonomy (not solvable with one model call)
  • Close to ROI, with value realized quickly after deployment
  • Last-mile dependent, where internal builds are too risky or too annoying
  • Complementary to the model ecosystem, not in direct competition with it

The goal for founders is to become the best substrate inside the ecosystem: the layer that enterprises rely on to turn intelligence into trusted execution.

Every successful workflow compounds value. Every production run deepens the moat. Every trust decision becomes a defensible advantage.

Agents are hard to build well. There is real craft here: the craft of sweating the details, designing with intention, and selectively shaping autonomy through evals, observability, and authorization.

The winners won’t build the most agents. They’ll own the surface where agents earn trust. And there has never been a bigger, or better-timed, opportunity.

January 20, 2026