Long Running Agent Engineering
17 min read
What does it take for an agent to keep working after you leave?
Not "answer a long question." Not "use a big context window." I mean actually keep working. Hours. Days. Maybe weeks. Wake up in a fresh session, understand what happened before, choose the next useful thing, make progress, verify it, leave the workspace cleaner than it found it, and do it again.
For the last few years we have mostly talked about agents as if the hard thing was autonomy inside one conversation. Give the model tools. Put it in a loop. Let it call bash, edit files, search the web, open a browser, run tests. That loop is real, and it is already enough to change how software gets built.
But long running agents expose a different problem. The agent loop is not the product. The harness is.
The model does not naturally persist across turns, context windows, sandboxes, process crashes, or days of work. A fresh session is born with amnesia. It has no idea what the last session tried, which tests failed, which files were half edited, which plan is stale, which shortcut was tempting but wrong, or whether the thing it is about to mark done was already marked done three runs ago and later discovered broken.
That is the real long running agent problem: handoff across amnesia.
The answer emerging across Anthropic, Cursor, OpenAI, Claude Code, Addy Osmani's survey of long running agents, and the Ralph Wiggum community is surprisingly consistent. It is not one magical always awake model. It is not stuffing the whole history into a bigger window. It is a harness that externalizes state into the workspace, restarts agents with fresh context, uses machine verifiable checks as backpressure, and assigns completion judgment to something other than the worker that wants to be done.
Here is the punchline up front:
Long running agents are not long conversations. They are recoverable workflows.
The model is one worker inside that workflow. The durable artifacts are the real continuity layer.
It also helps to separate three ideas people collapse into one phrase: long horizon reasoning, long running execution, and persistent agency. A model can reason through a deep task without running for days. A process can run for days without remembering anything useful. An agent can remember the user without owning one large task. Production systems blur the three, but the engineering problems are different.
Here's what I'll cover:
- Why Long Sessions Fail - Context windows rot, agents declare victory early, and half finished work becomes invisible
- The Architecture That Won - Fresh worker sessions plus durable workspace artifacts
- The Ralph Loop - Why a dumb restart loop beats a single heroic conversation
- Initializer, Worker, Judge - The three roles that keep showing up
- State Outside the Model - Feature lists, progress logs, plans, git history, tests, and notes
- Verification As Backpressure - Why test oracles matter more than better pep talks
- Multi Agent Coordination - Why peer to peer locks break and planner worker hierarchies survive
- Sandboxing and Rehydration - Why long running execution needs disposable compute and durable state
- The Product Shape - How practitioner loops and productized loops converge on the same mechanics
- The Failure Modes - How long running work drifts, and the harness features that answer each failure
- What This Means For Agent Design - The checklist every long running harness has to answer
Why Long Sessions Fail
The naive version of a long running agent is a single agent in a single conversation with a very large context window.
ONE LONG SESSION
================
user gives goal
v
agent reads repo
v
agent plans
v
agent edits
v
agent runs tests
v
agent edits more
v
context fills
v
compaction summarizes
v
agent keeps going
v
quality degrades
v
agent loses track of what it knew
v
agent declares victory too early
This works for small tasks. It fails exactly where long running agents are supposed to matter.
The failure is not just that the context window fills. A 200K or 1M token window still becomes a junk drawer if you keep pushing tool outputs, diffs, plans, screenshots, stack traces, and half obsolete reasoning into it. The model does not get a clean working memory. It gets an archaeological site.
Anthropic's effective harnesses post frames this cleanly: complex tasks span multiple context windows, but each new agent session begins with no memory unless the environment itself tells the story. They describe two predictable failures. First, the agent tries to one shot too much, runs out of context, and leaves a half implemented mess. Second, a later session looks around, sees progress, and decides the whole project is done.
That second failure is the one I keep seeing. The agent is not lazy. It is locally rational. It sees a repo with code, some tests, maybe a UI that loads, maybe a checklist with many items checked. In the absence of a crisp external completion contract, "looks basically done" becomes an attractive stopping point.
Long running work makes this worse because every session inherits ambiguity from the previous one.
WHAT THE NEW SESSION SEES
=========================
files on disk yes
git history maybe
last failing test maybe, if logged
why a shortcut was rejected no, unless written down
which feature is next maybe, if structured
definition of done often fuzzy
human intent compressed into a prompt
Compaction helps, but compaction is not continuity. A summary can preserve some facts, but it cannot replace a workspace that is structured for recovery.
This is the same lesson as agent memory engineering, just at task scale. Memory that lives only in the context window dies when the window dies. Work that lives only in the agent's chain of thought dies when the session dies. If you want continuity, put it somewhere the next worker can read.
The Architecture That Won
The architecture that keeps recurring looks like this:
LONG RUNNING HARNESS
====================
initializer session
v
creates durable task state:
- feature list / spec
- implementation plan
- progress log
- init script
- test oracle
- git baseline
v
worker session 1 starts fresh
v
reads durable state
v
does one bounded unit of work
v
runs verification
v
commits / logs / updates plan
v
exits
v
worker session 2 starts fresh
v
reads durable state
v
continues
v
...
v
judge / evaluator decides whether the goal is actually met
There are variations, but the spine is stable.
Anthropic uses an initializer agent plus repeated coding agents. The initializer creates the environment future agents need: an init.sh, a progress file, a feature list, and a first git commit. Subsequent agents read the state, pick one not yet passing feature, implement it, test it end to end, update the progress log, and commit.
The community Ralph Wiggum pattern is the minimal version:
while true:
run agent with PROMPT.md
agent reads IMPLEMENTATION_PLAN.md
agent picks next task
agent edits files
agent runs tests
agent updates plan
agent commits
agent exits
The important thing is not the loop. The important thing is what the loop forces. Every iteration starts with fresh context. Every iteration rehydrates from disk. Every iteration must leave disk in a state the next iteration can understand.
Blake Crosley's Ralph Loop writeup describes the same pattern through stop hooks: intercept exit attempts, persist state to the filesystem, and restart with a fresh context window until machine verifiable completion criteria are met. Geoffrey Huntley's community guide reduces it to a beautiful primitive: a shell loop feeding a prompt file to the agent, with the implementation plan on disk acting as shared state between otherwise isolated runs.
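To make the shape concrete, here is a minimal sketch of that outer loop in Python. The `agent` invocation and `check_done.sh` are hypothetical stand-ins, not any particular product's interface; the only load bearing ideas are a fresh process per iteration, a budget guard, and a completion check that lives outside the worker.

# loop.py - minimal restart loop sketch. The `agent` command and
# check_done.sh are hypothetical stand-ins for your worker and your oracle.
import subprocess
import sys

MAX_ITERATIONS = 50  # budget guard: stop even if the check never passes

for i in range(MAX_ITERATIONS):
    # Fresh process, fresh context. All continuity lives on disk.
    prompt = open("PROMPT.md").read()
    subprocess.run(["agent", "--prompt", prompt], check=False)

    # The loop, not the worker, decides whether the run is done.
    if subprocess.run(["./check_done.sh"]).returncode == 0:
        print(f"completion check passed after {i + 1} iterations")
        sys.exit(0)

print("budget exhausted before the completion check passed")
sys.exit(1)

Everything interesting happens in what the worker reads and writes between iterations, not in the loop itself.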
That is the thing people keep underestimating. The loop can be dumb if the workspace is smart.
DUMB LOOP, SMART WORKSPACE
==========================
loop.sh dumb
PROMPT.md stable contract
AGENTS.md operating manual
SPEC.md user intent
IMPLEMENTATION_PLAN.md task queue
PROGRESS.md memory of work
git log recovery trail
tests backpressure
No blackboard server. No bespoke orchestration database. No vector store. No "agent society" with vibes based coordination. Markdown files, git, tests, and a process supervisor.
Annoyingly simple. Annoyingly effective.
The Ralph Loop
The Ralph loop works because it replaces one degrading conversation with many clean attempts.
SINGLE LONG CONVERSATION
========================
context starts clean
v
tool output accumulates
v
attention diffuses
v
old mistakes stay in context
v
summaries compress away details
v
agent gets tunnel vision
RALPH LOOP
==========
iteration 1: fresh context, reads disk, makes progress, writes disk
iteration 2: fresh context, reads disk, makes progress, writes disk
iteration 3: fresh context, reads disk, makes progress, writes disk
...
The agent is not continuous. The workspace is.
This flips the unit of autonomy. You stop asking, "Can this one conversation survive for ten hours?" You ask, "Can each session leave enough evidence that the next session can continue without asking me?"
That means the agent's job is not only to build. It has to maintain the run state.
A good Ralph prompt usually contains four contracts:
RALPH PROMPT CONTRACT
=====================
1. Orient:
Read AGENTS.md, SPEC.md, IMPLEMENTATION_PLAN.md, recent git log.
2. Choose:
Pick one high value unfinished task. Keep scope bounded.
3. Execute:
Edit files, run the relevant checks, fix failures.
4. Handoff:
Update the plan, write progress notes, commit clean work, exit.
This is not glamorous. It is project management for an amnesiac coworker.
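For concreteness, a PROMPT.md built around those four contracts might read something like this. The wording is illustrative, not a template from any of the sources; the file names and commands have to match your repo.

You are one worker session in a long running loop. Previous sessions exist
only through the files they left behind.

1. Read AGENTS.md, SPEC.md, IMPLEMENTATION_PLAN.md, and the recent git log.
2. Run the smoke test before changing anything. If it fails, fixing it is your task.
3. Otherwise pick exactly one unfinished task from IMPLEMENTATION_PLAN.md.
4. Implement it, run the relevant checks, and fix any failures you introduced.
5. Update the plan, append a dated entry to PROGRESS.md, commit, and exit.
Never weaken or delete tests to make them pass.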
The loop also gives you a natural escape hatch. If the agent goes off track, you edit the plan. If the prompt is too loose, you add a guardrail. If the tests are weak, you strengthen the oracle. If the agent keeps duplicating work, you make completed work more visible. If it keeps touching unrelated files, you narrow the write scope.
The prompts you start with are never the prompts you end with. Long running harnesses are tuned by watching failure patterns.
That is why Ralph is more than a meme. It is the first pattern that made the correct abstraction obvious: the human sits outside the loop and engineers the environment, not inside the loop approving every step.
Initializer, Worker, Judge
The roles keep converging:
THREE ROLES
===========
Initializer:
turns fuzzy user intent into durable workspace structure
Worker:
makes bounded progress against that structure
Judge:
decides whether the stated completion condition is actually met
Sometimes these are separate prompts. Sometimes separate models. Sometimes separate processes. Sometimes the judge is a test suite. Sometimes it is a small evaluator model. But the roles are conceptually different, and mixing them is where harnesses get mushy.
Initializer
The initializer is the first agent that touches the task. Its job is not to implement the product. Its job is to make implementation possible across many future sessions.
Anthropic's initializer writes a comprehensive feature list. In their claude.ai clone example, the feature list expanded the user's high level prompt into hundreds of end to end feature requirements, all initially marked failing. This prevents the later worker from inventing a tiny definition of done.
A good initializer creates:
INITIALIZER OUTPUTS
===================
SPEC.md
expanded user intent, constraints, non goals
FEATURES.json
machine readable feature checklist
each item has steps and pass/fail status
IMPLEMENTATION_PLAN.md
ordered work queue
small tasks, dependencies, current status
PROGRESS.md
append only lab notebook
init.sh
how to boot the environment
how to run smoke tests
EVALS.md or tests/
completion checks
known oracle commands
The initializer is where you spend tokens to save tokens later. Every future worker starts faster because the workspace already has a map.
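The exact schema matters less than the checklist being machine readable, with a verification step attached to every item. A sketch of what one FEATURES.json entry might contain; the field names are my assumption, not a standard format.

# features.py - illustrative FEATURES.json schema. Field names and the
# example verification command are assumptions, not any vendor's format.
import json

features = [
    {
        "id": "auth-login-happy-path",
        "description": "User can log in with valid credentials and reach the dashboard",
        "verify": "npx playwright test tests/e2e/login.spec.ts",
        "status": "failing",   # failing | passing, flipped only with evidence
        "evidence": None,      # e.g. path to a test report or screenshot
    },
]

with open("FEATURES.json", "w") as f:
    json.dump(features, f, indent=2)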
Worker
The worker should not be asked to "finish the project." That is how you get giant diffs, brittle code, and fake completion.
The worker should be asked to make one bounded unit of progress.
GOOD WORKER LOOP
================
1. Get oriented
2. Verify the repo is not already broken
3. Pick one failing feature or plan item
4. Implement it
5. Test it like a user would
6. Mark it passing only after evidence
7. Commit and update progress
8. Stop
The stop matters. A worker that never stops slowly turns into the bad single session architecture. Fresh starts are not overhead. Fresh starts are the mechanism that keeps drift from compounding.
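If the harness rather than the prompt does the task selection, one worker turn might be wrapped roughly like this. The smoke test script, the FEATURES.json schema, and the `agent` CLI are the same hypothetical stand-ins as before.

# worker_turn.py - one bounded unit of work, sketched. Assumes the
# FEATURES.json from the initializer and a hypothetical `agent` CLI.
import json
import subprocess
import sys

# Verify the repo is not already broken before taking on new work.
if subprocess.run(["./smoke_test.sh"]).returncode != 0:
    task = "The smoke test is failing. Fix it before doing anything else."
else:
    # Pick one failing feature. Do not try to finish the project.
    features = json.load(open("FEATURES.json"))
    failing = [f for f in features if f["status"] == "failing"]
    if not failing:
        sys.exit(0)  # nothing left; let the judge confirm completion
    task = (
        f"Implement exactly one feature: {failing[0]['id']}. "
        f"Verify with: {failing[0]['verify']}. "
        "Then update FEATURES.json, append to PROGRESS.md, commit, and stop."
    )

# Hand the bounded task to a fresh worker session, then exit.
subprocess.run(["agent", "--prompt", task], check=False)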
Judge
The worker should not be the final judge of completion.
Workers want to be done. Not emotionally, obviously, but statistically. The completion token is attractive. The model has a strong prior toward wrapping up once the output looks coherent. On long horizon tasks this creates false positives.
Claude Code's /goal productizes this separation. You give Claude a completion condition. After each turn, a separate evaluator model checks whether the condition has been met. If the answer is no, the evaluator's reason becomes guidance for the next turn. The worker model is not the only judge of its own success.
That one design detail is huge.
WITHOUT JUDGE
=============
worker: I think all tests pass and the feature is done.
loop: exits
WITH JUDGE
==========
worker: I think all tests pass and the feature is done.
evaluator: The transcript shows lint passed, but no browser test was run.
loop: continue with reason
worker: runs browser test, finds bug, fixes it
OpenAI's harness engineering post describes a similar review loop: Codex writes code, reviews its own changes, requests additional agent reviews locally and in the cloud, responds to feedback, and iterates until reviewers are satisfied. They explicitly call this a Ralph Wiggum loop.
The pattern generalizes:
WORKER/JUDGE SEPARATION
=======================
Worker:
produce artifact
surface evidence
propose completion
Judge:
inspect evidence
compare against condition
say yes/no
provide next constraint if no
The judge does not have to be smarter than the worker. It just has to be fresh, narrower, and less invested in the worker's local narrative.
State Outside The Model
Long running agents need durable state, but not all state is the same.
STATE TYPES
===========
Intent state:
What are we building?
What constraints matter?
What counts as done?
Progress state:
What has already been tried?
What worked?
What failed?
What is next?
Environment state:
How do you run this thing?
Which commands verify it?
Which services or env vars are required?
Evidence state:
Which tests passed?
Which screenshots prove the UI works?
Which metrics show accuracy improved?
Recovery state:
What changed in git?
Which commit is known good?
How do you revert a bad attempt?
If this state lives only in the transcript, the next session has to reconstruct it. If it lives on disk, the next session can read it.
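Progress state in particular only helps if the next session can skim it, which is an argument for structure over prose. A sketch of an append only entry writer; the format itself is just an assumption.

# progress.py - append a structured, skimmable entry to PROGRESS.md.
from datetime import datetime, timezone

def log_progress(task_id: str, outcome: str, evidence: str, next_step: str) -> None:
    """Append one dated entry; never rewrite history."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="minutes")
    entry = (
        f"\n## {stamp} - {task_id}\n"
        f"- outcome: {outcome}\n"
        f"- evidence: {evidence}\n"
        f"- next: {next_step}\n"
    )
    with open("PROGRESS.md", "a") as f:
        f.write(entry)

log_progress(
    task_id="auth-login-happy-path",
    outcome="passing",
    evidence="playwright report at artifacts/login-run-12.html",
    next_step="wire up password reset flow",
)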
Anthropic's scientific computing post is the cleanest non web app example. Claude worked over multiple days on a differentiable cosmological Boltzmann solver and reached sub percent agreement with the reference CLASS implementation. The interesting part is not that the model wrote numerical code. The interesting part is the harness discipline around it: reference implementation, test oracles, persistent notes, git history, and quantifiable progress.
Scientific computing makes the verification problem unusually crisp. You can compare your solver to CLASS or CAMB. You can plot error over time. You can watch the agent get closer to a reference implementation. That gives the run a real gradient.
Most coding tasks have weaker oracles, so you have to build them.
WEAK ORACLE
===========
"The dashboard looks good"
"The migration is complete"
"The API is fast enough"
STRONGER ORACLE
===============
Playwright test creates a dashboard, filters by date, exports CSV
and verifies row count equals seeded fixture.
Migration script runs against staging snapshot and leaves zero rows
matching old schema.
Benchmark p95 stays under 200ms for 1,000 representative requests.
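The dashboard example, sketched with Playwright's Python API. The URL, selectors, and seeded row count are assumptions about a hypothetical app; the point is that the oracle exercises the product surface and checks a number.

# test_dashboard_export.py - end to end oracle, not an implementation-shaped
# unit test. App URL, selectors, and the seeded row count are assumptions.
import csv
from playwright.sync_api import sync_playwright

SEEDED_ROWS = 42  # rows the fixture script inserted for January

def test_dashboard_csv_export():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/dashboard")
        page.fill("#filter-start", "2026-01-01")
        page.fill("#filter-end", "2026-01-31")
        with page.expect_download() as download_info:
            page.click("text=Export CSV")
        path = download_info.value.path()
        with open(path) as f:
            rows = list(csv.reader(f))
        assert len(rows) - 1 == SEEDED_ROWS  # minus the header row
        browser.close()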
Long running agents magnify weak specs. A human can carry fuzzy intent across a week because humans have common sense, memory, and the ability to ask clarifying questions. An unattended agent will happily optimize the wrong proxy for hours.
The more autonomy you grant, the more literal the state layer has to become.
Verification As Backpressure
A long running agent without verification is just a text generator with file permissions.
Verification is what turns motion into progress.
NO BACKPRESSURE
===============
agent edits code
agent says it is done
loop continues or stops based on vibes
WITH BACKPRESSURE
=================
agent edits code
test fails
agent must fix
browser flow fails
agent must fix
reference error too high
agent must improve
judge says evidence incomplete
agent must gather evidence
This is why end to end tests matter so much. Anthropic observed that Claude would often mark features complete after shallow checks. Once it was explicitly prompted to use browser automation and to test as a human user would, performance improved. That matches my experience. Unit tests are useful, but they are often too close to the implementation. Browser tests force the agent to confront the product surface.
The right verification depends on the domain:
VERIFICATION BY DOMAIN
======================
Web app:
Playwright / Puppeteer user flows
screenshots
accessibility checks
console/network errors
Backend service:
integration tests
contract tests
seeded database checks
load tests
Migration:
dry run on snapshot
row count invariants
reversible rollback
before/after schema diff
Research:
source citations
contradiction search
reproduce key numbers
independent second pass
Scientific code:
reference implementation
numerical tolerances
convergence plots
parameter sweeps
The best verification is machine checkable and hard to game. The worst verification is asking the same model, in the same context, "are you sure?"
That does not mean model judges are useless. They are useful when they judge surfaced evidence against a narrow condition. Claude Code's /goal docs are careful about this: the evaluator does not run commands or read files independently. It judges what Claude has surfaced in the conversation. So the completion condition has to include how the worker should prove it.
Bad:
/goal finish the auth refactor
Good:
/goal npm test -- test/auth exits 0, npm run lint exits 0,
git status is clean, and no files outside src/auth and test/auth changed
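A goal written that way is easy to turn into a machine checkable gate that the loop or a judge can run. A sketch mirroring the example above; the commands and path scopes are the ones named in the goal, and the baseline ref is an assumption.

# check_done.py - machine checkable completion gate for the auth refactor goal.
# Commands and allowed paths mirror the example goal above; adjust per repo.
import subprocess
import sys

def run(cmd: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True)

checks = [
    run(["npm", "test", "--", "test/auth"]).returncode == 0,
    run(["npm", "run", "lint"]).returncode == 0,
    run(["git", "status", "--porcelain"]).stdout.strip() == "",
]

# No files outside src/auth and test/auth changed since the baseline
# (assumed here to be origin/main).
changed = run(["git", "diff", "--name-only", "origin/main...HEAD"]).stdout.split()
checks.append(all(p.startswith(("src/auth", "test/auth")) for p in changed))

sys.exit(0 if all(checks) else 1)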
The judge cannot save you from a vague goal. It can enforce a crisp one.
Multi Agent Coordination
Single worker loops are enough for many tasks. But the moment you want to run hundreds of agents on one codebase for weeks, coordination becomes the whole game.
Cursor's scaling agents post is useful because it talks about what failed. Their first approach let agents coordinate as peers through a shared file. Agents would check what others were doing, claim a task, update status, and use locks to prevent duplicate claims.
This sounds reasonable. It is also exactly the kind of distributed system that gets weird fast.
PEER TO PEER COORDINATION
=========================
agent A reads task file
agent B reads task file
agent C reads task file
agent A claims task
agent B sees stale view
agent C waits on lock
agent A forgets to release lock
agent B duplicates work
agent C blocks
shared file becomes a bottleneck
The problem is not that agents cannot coordinate. The problem is that peer to peer coordination asks every worker to think about the global project while also doing local implementation. That is too much.
Cursor moved toward a planner worker judge hierarchy:
PLANNER / WORKER / JUDGE
========================
planner:
owns global decomposition
assigns tasks
updates project plan
workers:
own local execution
do not coordinate with peers
push completed changes
judge:
decides whether to continue
triggers next cycle
This is the same role separation again, just scaled out.
Workers should not coordinate with other workers if you can avoid it. They should receive a task with a bounded write scope, complete it, and report back. The planner should own the global dependency graph. The judge should decide whether the current state is good enough to continue, merge, or stop.
This has a strong human engineering analogue. You do not ask every engineer on a large project to constantly negotiate the whole roadmap with every other engineer. You create ownership boundaries. You run reviews. You integrate. You keep the shared state legible.
The hard part is choosing the grain size.
TASK TOO SMALL
==============
coordination overhead dominates
agents spend more time reading state than changing it
planner becomes bottleneck
TASK TOO LARGE
==============
worker drifts
diff becomes impossible to review
merge conflicts spike
judge cannot evaluate cleanly
RIGHT SIZE
==========
one coherent behavior change
clear write scope
clear verification command
clear integration path
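At that grain size, a planner assigned task is small enough to be a plain record. A sketch, with illustrative field names; the write scope and the verification command are the two fields that keep workers from colliding.

# task.py - what a planner might hand a worker. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class Task:
    id: str
    description: str          # one coherent behavior change
    write_scope: list[str]    # paths the worker may modify
    verify: str               # command that must exit 0 before reporting done
    integration: str          # how the result lands: branch, PR, merge queue

task = Task(
    id="billing-retry-webhooks",
    description="Retry failed billing webhooks with exponential backoff",
    write_scope=["services/billing/", "test/billing/"],
    verify="pytest test/billing -q",
    integration="open PR against main; planner merges after judge approval",
)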
Cursor's product follow up, Expanding our long running agents research preview, says long running agents produced substantially larger PRs while keeping merge rates comparable to other agents. That is the product significance. The harness lets agents take on work that previously exceeded the practical size of a single agent session.
But "larger PRs with comparable merge rates" is not magic model dust. It is the result of better state, better delegation, better judges, and better recovery.
Sandboxing And Rehydration
Long running agents need a computer. That computer should be disposable.
An agent that can run commands, install packages, edit files, open browsers, and call APIs is powerful enough to be useful and powerful enough to be dangerous. If you run it on your laptop with all your cookies, SSH keys, cloud credentials, and private files, the blast radius is ugly.
The long running version makes this worse. A five minute agent can do damage. A five day agent can do creative damage.
So the production architecture increasingly separates durable harness state from disposable compute.
OpenAI's Agents SDK update points in this direction: model native harnesses, sandbox execution, filesystem tools, memory, manifests, and state rehydration. The key idea is that the agent gets a controlled workspace with the files, tools, and dependencies it needs, while credentials and durable orchestration live outside the sandbox.
SEPARATED ARCHITECTURE
======================
harness / control plane
- goal
- run state
- memory
- evaluator
- credentials broker
- logs
- checkpoints
sandbox / compute plane
- repo checkout
- tools
- dependencies
- browser
- generated code
- test execution
If the sandbox dies, the run should not die. The harness should rehydrate a fresh sandbox from the last checkpoint, mount the workspace, hand the worker the current state, and continue.
This is the same principle again: state must outlive the worker.
BAD
===
agent process holds state in memory
sandbox dies
run dies
GOOD
====
agent writes state to workspace/checkpoint
sandbox dies
harness starts new sandbox
worker reads state
run continues
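Concretely, the harness rather than the agent owns a checkpoint it can rehydrate from. A sketch of the idea; the checkpoint format is an assumption, and the sandbox provisioning is left as a commented placeholder because that part is product specific.

# rehydrate.py - the harness owns durable run state; sandboxes are disposable.
import json
import os

CHECKPOINT = "run_state/checkpoint.json"

def save_checkpoint(state: dict) -> None:
    os.makedirs("run_state", exist_ok=True)
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f, indent=2)

def load_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT):
        return json.load(open(CHECKPOINT))
    return {"iteration": 0, "last_good_commit": None, "goal": "see SPEC.md"}

# On every (re)start: read state, provision fresh compute, mount the workspace.
state = load_checkpoint()
# sandbox = start_sandbox(workspace="./repo", network="deny-by-default")  # hypothetical
# run_worker(sandbox, state)                                              # hypothetical
state["iteration"] += 1
save_checkpoint(state)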
Sandboxing also changes how you think about tools. In a local interactive agent, giving bash broad access is convenient. In a long running cloud agent, every tool is a capability grant. Network, filesystem, credentials, browser profile, package installation, deploy keys, issue tracker access, email access. Each one needs scope.
The Ralph community guide makes this point bluntly: assume the agent environment will be popped at some point, then ask what the blast radius is. That is the right mental model.
The best long running harnesses will feel boring operationally:
BORING PRODUCTION HARNESS
=========================
least privilege sandbox
ephemeral compute
durable checkpoints
append only logs
credential broker
network policy
artifact capture
human interrupt
budget limit
rollback path
Boring is good. Boring means the agent can be weird without the system becoming weird.
The Product Shape
There are two product directions converging.
The first is the practitioner loop: prompt files, plans, hooks, shell scripts, git commits. This is how power users run agents overnight today. It is messy, flexible, and close to the metal.
The second is the productized loop: /goal, cloud agents, background tasks, research previews, SDK harnesses, managed sandboxes. This turns the same patterns into a UX that normal teams can use.
PRACTITIONER LOOP
=================
PROMPT.md
AGENTS.md
IMPLEMENTATION_PLAN.md
loop.sh
stop hooks
git commits
local tests
PRODUCTIZED LOOP
================
goal condition
background agent
managed sandbox
fresh evaluator
artifact viewer
PR integration
budget controls
resume / cancel
The underlying mechanics are more similar than they look.
Claude Code's /goal is basically a session scoped Ralph loop with a model judge. Cursor's long running agents are a cloud product built from planner worker judge orchestration. OpenAI's Agents SDK is standardizing the sandbox and filesystem substrate. Anthropic's harness posts are turning the workflow into repeatable environment design.
The abstraction is moving up the stack.
In 2024, you wrote your own while loop.
In 2025, you wrote prompt files and hooks.
In 2026, the loop is becoming a product primitive.
But the product primitive still has to answer the same questions:
- Where does state live?
- What does a new worker read first?
- How does it choose work?
- How does it prove progress?
- Who decides it is done?
- How do you recover from a bad turn?
- What happens when the sandbox dies?
- What is the budget?
- What is the blast radius?
The UI can hide the loop. It cannot remove the harness.
The Failure Modes
Long running agents fail differently from short running agents.
Short running agents fail by making a bad tool call, hallucinating an answer, editing the wrong file, or stopping too soon.
Long running agents fail by accumulating drift.
LONG RUNNING FAILURE MODES
==========================
Premature victory:
agent sees partial progress and declares the whole goal done
Plan rot:
implementation plan no longer matches the repo
State pollution:
progress files become verbose, stale, or contradictory
Test gaming:
agent changes tests to pass instead of fixing behavior
Tunnel vision:
worker over focuses on one subsystem and forgets global constraints
Coordination collisions:
parallel workers duplicate work or fight over files
Filesystem pollution:
scratch files, debug scripts, generated artifacts accumulate
Budget runaway:
loop continues because stop criteria are fuzzy
Silent regression:
later worker breaks earlier feature because no smoke test runs first
Each failure suggests a harness feature.
FAILURE TO HARNESS RESPONSE
===========================
Premature victory
external judge, crisp goal condition
Plan rot
planning refresh pass, disposable plan, git diff against reality
State pollution
structured progress log, summarization, pruning
Test gaming
explicit rule: do not weaken tests
protected eval files
diff review
Tunnel vision
fresh starts
smoke tests before new work
bounded tasks
Coordination collisions
planner owned task assignment
disjoint write scopes
Filesystem pollution
cleanup step
ignored scratch directory
clean git status requirement
Budget runaway
max turns
max spend
spawn budget
time budget
Silent regression
invariant smoke test at start of every session
This is why long running agent engineering looks less like prompt hacking and more like operating a tiny software organization. You need task intake, planning, execution, QA, review, release, rollback, observability, and security. The agent is the worker. The harness is the company.
What This Means For Agent Design
Here are the questions every long running agent system has to answer.
THE TEN QUESTIONS
=================
1. What is the durable state layer?
Markdown files, JSON checklists, git history, database rows, event log?
2. What is the unit of work?
One feature, one test, one PR, one research section, one benchmark target?
3. How does a fresh worker orient?
Which files does it read first? Which commands does it run before editing?
4. What is the completion condition?
Is it measurable? Can the worker surface evidence for it?
5. Who judges completion?
Worker, separate model, test suite, human, or some combination?
6. What is the backpressure?
Tests, browser automation, reference implementation, lint, benchmarks, review?
7. How does the system recover?
Git commit, checkpoint, rollback script, sandbox snapshot, progress log?
8. How is parallelism coordinated?
Planner assigned tasks, disjoint write scopes, queue, locks, merge queue?
9. What are the budgets?
Tokens, wall clock, turns, dollars, spawned agents, files changed?
10. What is the blast radius?
Sandbox permissions, credentials, network, data access, deploy authority?
My current bias:
Fresh sessions beat giant sessions. A fresh context window that reads good state from disk is better than a stale context window carrying ten hours of tool output. Restarting is not giving up. Restarting is garbage collection.
The workspace is the memory bus. Plans, progress logs, feature lists, tests, screenshots, git commits, and benchmark outputs are not side effects. They are the continuity layer. If the next worker cannot understand the run from disk, the harness is broken.
Judges should be separate from workers. The worker can propose done. Something else should decide done. Ideally tests. Sometimes a model evaluator. Often both. The judge should inspect evidence, not vibes.
External verification matters more than longer reasoning. A mediocre plan with a strong oracle will often beat an elegant plan with no backpressure. The agent needs reality to push back.
Keep worker scope small. A long running system does not require each worker to do a long task. It requires the whole system to sustain progress across many bounded tasks.
Make state disposable and regenerable. Plans rot. Progress logs bloat. Specs change. A good harness can regenerate the plan from the current repo and goal. Treat planning artifacts as useful scaffolding, not sacred truth.
Sandbox by default. Long running agents should assume hostile inputs, accidental exfiltration, bad generated code, and runaway loops. Least privilege is not paranoia. It is table stakes.
The human's job moves up a level. You stop micromanaging tool calls and start designing the environment: better specs, better evals, better prompts, better ownership boundaries, better recovery points.
That last point is the real mindset shift.
When code was scarce, the human wrote code.
When code became cheap, the human reviewed code.
Now that agents are becoming persistent, the human designs the system in which code keeps getting written after they leave.
The New Job: Harness Engineering
OpenAI calls this harness engineering, and I think that phrase is going to stick.
Harness engineering is the work around the model that makes the model useful over time:
HARNESS ENGINEERING
===================
scaffold the workspace
write the operating manual
shape the prompt
choose the tools
define the state files
build the evals
create the judge loop
capture artifacts
bound the budget
secure the sandbox
observe the traces
tune from failure
This is different from traditional software engineering. You are not only writing deterministic code paths. You are designing an environment that a non deterministic worker can repeatedly enter, understand, act inside, and leave in a better state.
That is why the best long running agent harnesses feel weirdly old fashioned. Git. Markdown. Shell scripts. JSON checklists. Test suites. Logs. Small commits. Clear ownership. These are not legacy habits. They are the primitives that survive context death.
The future of long running agents is not one immortal session thinking forever. It is many mortal sessions, each with a clean context window, waking up inside a workspace that remembers.
So back to the original question: what does it take for an agent to keep working after you leave?
Not a bigger prompt.
Not just a better model.
A durable state layer. A crisp goal. A fresh worker loop. A judge that is not the worker. Tests that push back. Git history that tells the story. Sandboxes that can die without killing the run. Logs that let the human tune the system when it fails.
The model is the engine. The harness is the vehicle.
And the companies that get this right will not merely have "agents that run longer." They will have agents that can be trusted with larger units of work because the work is recoverable, inspectable, and verifiable.
That is the threshold that matters.
Not autonomy as theater.
Autonomy with a receipt.