
I’m Moving My Enterprise Agents From OpenAI Agents SDK to Claude Managed Agents

4 features that changed my mind after shipping enterprise agents for a $200M IT company

This year, I built enterprise agents for a $200M IT company. The agents help thousands of their employees work with internal data, systems, and workflows, through a chat interface.

When I started, OpenAI’s Agents SDK was a great choice. It had the primitives I needed to build powerful agent backends, and it was already battle-tested in production by other teams.

But after living inside that backend, I recently changed my mind. This article is about the 4 specific reasons why.


1. The fastest sandbox is the one you don’t start

The very first version of the agents I built ran on OpenAI’s Agents SDK with its Hosted Shell tool. To be useful, the agents needed a sandboxed workspace to inspect files and data, run scripts, process larger outputs, and create artifacts, and I wanted that managed instead of running my own container orchestration.

But OpenAI’s hosted shell containers expire after 20 minutes of user inactivity. For chat-based enterprise agents, that’s painful. Employees ask something, get pulled into a meeting, return later, or reopen an investigation from last week. If the agents had generated useful working state, such as files holding intermediate outputs, all of it disappeared when the container expired.

When Anthropic shipped Claude Managed Agents and OpenAI shipped Sandbox Agents, both looked like the fix. They targeted the same problem: production agents need durable workspaces whose state can be paused and later resumed from the same checkpoint. At first glance, the two offerings looked similar. But after digging in, I realised they make different architectural bets about when an agent’s sandbox spins up.

OpenAI’s Sandbox Agents require a sandbox for every agent run. Each run has to resolve a sandbox session before it can start: reuse a live one if available, resume a serialized one if needed, or cold-start a fresh one if not. This is great when the task actually requires a sandbox. But if a user just says “Hi” or asks a simple follow-up that doesn’t need sandbox execution, the request still has to go through the sandbox path and pay its setup cost.

In comparison, Claude Managed Agents are lazy by default. The sandbox is a tool the agent calls only when it needs one. If the task is lightweight, the agent responds without waiting for a sandbox at all.
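To make the difference concrete, here is a minimal sketch of the two policies in plain Python. It uses neither vendor's SDK: `Sandbox`, `LazySandbox`, `needs_workspace`, `eager_turn`, and `lazy_turn` are hypothetical names, and the `setups` counter stands in for whatever setup cost the sandbox path carries (reuse, resume, or cold start).

```python
from typing import Optional


class Sandbox:
    """Stand-in for a managed sandbox. `setups` counts how often setup cost is paid."""

    setups = 0

    def __init__(self) -> None:
        Sandbox.setups += 1  # a real sandbox would spend time here

    def run(self, command: str) -> str:
        return f"ran: {command}"


class LazySandbox:
    """Sets up the underlying sandbox only when a command is first run."""

    def __init__(self) -> None:
        self._sandbox: Optional[Sandbox] = None

    def run(self, command: str) -> str:
        if self._sandbox is None:
            self._sandbox = Sandbox()
        return self._sandbox.run(command)


def needs_workspace(message: str) -> bool:
    # Hypothetical rule for the demo; in a real agent the model decides by calling a tool.
    return message.startswith("run:")


def eager_turn(message: str) -> str:
    sandbox = Sandbox()  # resolved before the turn starts, even for "Hi"
    if needs_workspace(message):
        return sandbox.run(message[len("run:"):].strip())
    return "plain reply"


def lazy_turn(message: str, sandbox: LazySandbox) -> str:
    if needs_workspace(message):
        return sandbox.run(message[len("run:"):].strip())  # first call pays setup
    return "plain reply"
```

In this toy model, a session of `lazy_turn("Hi", ...)` followed by two `run:` turns pays setup once, while every `eager_turn` pays it regardless of what the user asked.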

For enterprise agents, most conversations are mixed. One turn might require a full workspace for actions like inspecting data, running scripts, and writing an Excel file. The next turn might just be “Summarize that”. I don’t want every lightweight turn to pay sandbox cost just because some turns need it. The agents should feel snappy on simple questions, and still be able to do deep workspace-heavy work when the task calls for it.

Anthropic reports a 90%+ drop in p95 time-to-first-token (TTFT) from deferring sandbox startup until a turn actually needs it. Since TTFT is the dead air before the agent starts responding, cutting it down matters a lot for user experience.

To be clear, you can recover lazy-sandbox behavior on OpenAI’s Sandbox Agents stack, but it requires slightly more architecture. You’ll need a non-sandbox main agent that delegates to a sandbox agent only when workspace execution is needed, either as a subagent or handoff agent. Those are valid designs, but they come with downsides that don’t exist with Claude’s lazy sandbox design.

  1. The subagent workaround. Say you make the sandbox agent a subagent that your main agent calls. The problem is that the user’s request is now an extra LLM translation away from where the sandbox work happens. With Claude Managed Agents, the main agent that receives the user’s request is the same agent that performs sandbox execution directly when needed and skips it when not. In a subagent design, the main agent has to translate the user’s intent into instructions for the sandbox agent, which then does the sandbox work and returns its output back to the main agent, which then has to interpret that and reply to the user. You can engineer that boundary carefully, but it’s still an extra LLM surface where intent can get compressed, assumptions can sneak in, and findings can get summarized away. The further the actual work sits behind chains of agents, the more the system pays in latency and fidelity.

  2. The handoff agent workaround. Say you instead make the sandbox agent a handoff target the main agent transfers to when sandbox work is needed. This dodges the subagent translation problem, but creates a different one. After a handoff, the sandbox agent becomes the active main agent for the subsequent turns. So when the user inevitably follows up with something lightweight in the same session, like “Show those results in a table instead of in prose”, there are 2 possible outcomes. You either keep that simple turn on the sandbox-backed path and pay in overhead, or you pay another handoff tool call to swap back to the non-sandbox agent. Either way, this handoff policy is something to design, test, and debug.
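The handoff trade-off above can be sketched as a tiny state machine. This is an illustration only, not the OpenAI Agents SDK's handoff API: `Session`, `hand_back_on_light_turns`, and the log strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Session:
    """Tracks which agent is active and what each turn had to do."""

    active_agent: str = "main"  # "main" has no sandbox; "sandbox" does
    hand_back_on_light_turns: bool = False  # the policy you have to choose
    log: List[str] = field(default_factory=list)

    def turn(self, needs_sandbox: bool) -> None:
        if needs_sandbox and self.active_agent == "main":
            self.log.append("handoff main->sandbox")
            self.active_agent = "sandbox"
        elif (
            not needs_sandbox
            and self.active_agent == "sandbox"
            and self.hand_back_on_light_turns
        ):
            self.log.append("handoff sandbox->main")  # extra tool call
            self.active_agent = "main"
        if self.active_agent == "sandbox":
            self.log.append("resolve sandbox session")  # overhead on this turn
        self.log.append("reply")
```

Running a heavy turn followed by a light one shows both outcomes: with the flag off, the light turn still logs "resolve sandbox session"; with it on, the light turn logs an extra "handoff sandbox->main" instead.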

Personally, these are the real differences I care about. As you might’ve heard before, “Agent harness design is more of an art than a science”: many different architectures can all be valid. But for me, those extra layers are exactly the kind of things I’m trying to cut out when building enterprise agents that are already complex enough on their own.


2. An agent’s work log should outlive its context window

In the context of enterprise agents, imagine an employee comes back to an old thread from last week and says:

“Update your client analysis with this new info: XXX”

To continue working properly, the agents need to know not just the final analysis they produced last week, but also the work trail of how they got there. For example: which systems were queried, which exact API endpoints were used, what assumptions were made, which files were created, and what failed or was ruled out.

Claude Managed Agents ship with an agent-queryable event log per session. Events in each agent session get stored durably in Anthropic’s managed storage. As agents work longer and inevitably hit context window limits, Claude’s agent harness triggers compaction to free up space in the active context. Afterwards, the agent’s context window only contains a summary of its session history. But the full event log still lives outside, and the agent can fetch specific slices of it on demand with a built-in getEvents() tool. If it needs to know exactly which API endpoints worked last week to fetch X data so that it can use them again, it can use this tool to retrieve the exact tool calls it made before.

The OpenAI Agents SDK has no built-in equivalent. Of course, you can build the same thing yourself. For each agent session, you could store all events in your own database, then expose custom tool(s) that let your agent search and retrieve over them. But the point is that you’ll have to build and maintain both the storage solution and the retrieval tools yourself.
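As a rough idea of what "build it yourself" involves, here is a minimal sketch using Python's standard `sqlite3` module. The schema and the function names (`open_event_store`, `append_event`, `search_events`) are assumptions made for illustration; a production version would also need retention, access control, and a way to expose `search_events` to the agent as a tool.

```python
import json
import sqlite3
from typing import Any, Dict, List, Optional


def open_event_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the event store. Defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               session_id TEXT NOT NULL,
               kind TEXT NOT NULL,
               payload TEXT NOT NULL
           )"""
    )
    return conn


def append_event(
    conn: sqlite3.Connection, session_id: str, kind: str, payload: Dict[str, Any]
) -> int:
    """Store one event (message, tool call, tool result, ...) as JSON."""
    cur = conn.execute(
        "INSERT INTO events (session_id, kind, payload) VALUES (?, ?, ?)",
        (session_id, kind, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid


def search_events(
    conn: sqlite3.Connection,
    session_id: str,
    kind: Optional[str] = None,
    contains: Optional[str] = None,
    limit: int = 20,
) -> List[Dict[str, Any]]:
    """Return a slice of a session's events, oldest first, with optional filters."""
    query = "SELECT id, kind, payload FROM events WHERE session_id = ?"
    params: List[Any] = [session_id]
    if kind is not None:
        query += " AND kind = ?"
        params.append(kind)
    if contains is not None:
        query += " AND instr(payload, ?) > 0"  # plain substring match on the JSON text
        params.append(contains)
    query += " ORDER BY id LIMIT ?"
    params.append(limit)
    rows = conn.execute(query, params).fetchall()
    return [{"id": r[0], "kind": r[1], "payload": json.loads(r[2])} for r in rows]
```

Even this toy version leaves real design questions open, such as how much of each payload to store and how the agent should page through long histories.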

We talked about compaction earlier, and I think it’s noteworthy that both OpenAI and Anthropic offer native compaction features but approach them very differently. Anthropic’s compaction produces a human-readable text summary, while OpenAI’s produces an opaque encrypted item. I’ve had genuinely fantastic results with OpenAI’s compaction in long Codex threads, and one might even say that OpenAI has less of a need to offer a native feature for an agent-queryable event log because their compaction is so good at retaining information.

But compaction is lossy by definition. However great OpenAI’s compaction is, there’ll still be edge cases where the agent benefits from being able to query its full event history on demand. For example, if an agent’s debugging why a workflow failed last week, it might need to see the exact error messages it got back from an API call.

As such, I prefer how Claude Managed Agents pair compaction with a built-in, agent-queryable, durable event log. I expect OpenAI to add this as a feature eventually, since Codex already does it under the hood. But as of today, this stands as a Claude product differentiator.


3. Agent memory needs housekeeping, not just storage

Agent memory is a critical part of production enterprise agents. They need to remember what they learned across different sessions, and importantly use these lessons to compound useful knowledge over time. Valuable memory looks like:

  • Local jargon for internal systems and workflows
  • Which workflows recur, where, and under what rules
  • Which API patterns work for which workflows
  • Which workflows need special handling
  • Which assumptions or mistakes were corrected, and what the corrected versions are

Both Claude Managed Agents and OpenAI Sandbox Agents offer primitives for persistent memory stores that agents can read and write to across sessions, so memory on its own is not unique to either offering.

What Claude Managed Agents offers that OpenAI doesn’t is Dreams. It addresses a real need: a long-running memory store gets messy. It accumulates duplicates, stale notes, scattered lessons, and contradictions across different user sessions. Without curation, the memory store becomes a junk drawer that agents can’t make use of, and every lesson the memory was supposed to compound gets diluted instead.

Dreams is Claude’s async memory curation feature: it processes an existing memory store alongside past session history to produce a cleaner memory store. In short, memory stores let the agent remember; Dreams cleans up what it remembers.
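To give a feel for what a curation pass does, here is a deliberately simple sketch. It is not how Dreams works internally (I have no visibility into that, and a real pass would likely use a model to merge and rewrite notes); it only shows the shape of the problem. `Note`, `curate`, and the topic-key convention are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Note:
    topic: str       # hypothetical key, e.g. "billing-api/auth"
    text: str
    updated_at: int  # larger means more recent


def curate(notes: List[Note]) -> List[Note]:
    """Collapse duplicates and stale notes: keep the newest note per topic."""
    latest: Dict[str, Note] = {}
    for note in notes:
        key = note.topic.strip().lower()  # treat "Billing-API/Auth " and "billing-api/auth" as one topic
        current = latest.get(key)
        if current is None or note.updated_at > current.updated_at:
            latest[key] = note
    return sorted(latest.values(), key=lambda n: n.topic.strip().lower())
```

Keeping only the newest note per topic handles exact duplicates and superseded facts, but it cannot merge lessons scattered across topics or spot contradictions phrased differently, which is where a managed curation feature earns its keep.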

This combination lets agents compound what they learn. They can gradually retain recurring workflow knowledge, user preferences, system-specific lessons, prior investigation patterns, and so on. That memory stays usable instead of degrading into noise. And the best part is that I don’t have to build my own custom memory curation pipeline, or even design it. I just have to turn this feature on.


4. Persistence shouldn’t be an agent problem

On this point, I’d like to note that both Claude Managed Agents and OpenAI Sandbox Agents solve this problem, so this isn’t a differentiator between the two. But I want to spend some time talking about this anyway, because it’s the most painful workaround I needed when initially using OpenAI’s Hosted Shell tool, and it’s a great example of the kind of infrastructure work that the new offerings eliminate.

As I mentioned earlier, OpenAI’s hosted shell containers expire after 20 minutes of user inactivity. As such, the agents needed an escape hatch: persist important work files elsewhere, and restore them later into a fresh container once the old one had expired. To enable this, I gave the agents 3 custom tools:

  • save_files() to a durable storage that I managed
  • list_saved_files() in the durable storage
  • restore_files() from the durable storage

But this meant the agents had to learn extra infrastructure chores on top of their actual work. If they created important files, they had to save them immediately. If the user came back later asking a follow-up question, they had to find and restore the relevant files into the new workspace.
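For a sense of scale, the three tools might look roughly like this. It is a simplified sketch, not my production code: the durable storage here is just a local directory, the `workspace`, `store`, and `session_id` parameters are assumptions, and it only handles flat file names.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def save_files(workspace: Path, store: Path, session_id: str, names: List[str]) -> List[str]:
    """Copy the named workspace files into durable storage; return what was saved."""
    target = store / session_id
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in names:
        source = workspace / name
        if source.is_file():  # silently skips missing files, one way work gets lost
            shutil.copy2(source, target / source.name)
            saved.append(source.name)
    return saved


def list_saved_files(store: Path, session_id: str) -> List[str]:
    """List the file names previously saved for this session."""
    target = store / session_id
    if not target.is_dir():
        return []
    return sorted(p.name for p in target.iterdir() if p.is_file())


def restore_files(
    store: Path, session_id: str, workspace: Path, names: Optional[List[str]] = None
) -> List[str]:
    """Copy saved files into a fresh workspace; restores everything if names is None."""
    workspace.mkdir(parents=True, exist_ok=True)
    wanted = list_saved_files(store, session_id) if names is None else names
    restored = []
    for name in wanted:
        source = store / session_id / name
        if source.is_file():
            shutil.copy2(source, workspace / name)
            restored.append(name)
    return restored
```

The code itself is small. The cost sits in everything around it: the agent has to decide when to call each tool, and nothing in the code stops it from forgetting.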

This turned the agents into their own backup daemon. The approach is tempting because it ships fast, but it pushes container-management duties onto the agent’s tool surface: its instructions now have to teach it when to save files, what to save, when to restore, and how not to lose work. None of this is the work the user came for, and on top of that, I watched agents in production forget to save key files under load.

The new offerings by both Anthropic and OpenAI make this anti-pattern go away. Both enable deterministic checkpointing and snapshotting of the agent’s sandbox in the backend, so its working state can be resumed later without the agent running its own backup loop.

Migrating to these new offerings meant I could remove an entire category of agent persistence tools and the prompt instructions that explained how to use them. This gives the agents fewer irrelevant choices, fewer infrastructure details to juggle, and more attention left over for their actual work.

• • •

You should own the needle-movers, not the plumbing

Across my client work, my philosophy on building production agents is that you should spend the most time and effort on the business-specific features that actually move the needle for your users.

With Claude Managed Agents, I get to stop owning the generic plumbing every agent runtime needs: the agent loop, sandbox management, event history storage, compaction strategy, memory curation loop, and so on. These are now all backend features I no longer have to build or maintain myself.

Instead, I can focus on the domain work that has the greatest impact: the workflow logic, the data and system connectors, how the agent sits among the various systems of work, and how it delivers value to users.

To be clear, this isn’t “OpenAI bad”. OpenAI’s Agents SDK with Sandbox Agents is excellent if you want maximum code-level control over orchestration, handoffs, tracing, and sandbox provider choice. Some teams need and care about this a lot.

But in my experience, that’s not where most production agents get stuck. The bottleneck is rarely “How much can we customize?” and more often “How much agent runtime do we no longer have to build and maintain?”

And on that axis, Claude Managed Agents is the more important shift.
