Hard Lessons Building Agents Since GPT-3.5
I've been building AI agents at Fintool since GPT-3.5. Three years of shipping to professional investors, in a domain where a wrong number costs someone millions and you never get your credibility back. In those three years we've rewritten the product I don't know how many times. Every major model release made half our code obsolete overnight.
The agent loop is deceptively simple: a model in a while loop with access to tools, a working memory, and a stopping condition. Everything hard about this job happens in the gap between that one-paragraph description and a system you can bet a customer on. Tool schemas, context layout, retry policy, failure recovery, eval design, cost control, observability — each of those is a research project that will be obsolete in six months.
Here's what I've actually learned. Not the glamorous lessons. The hard ones.
- Code is a commodity now — The mindset shift most engineers haven't made
- English is the programming language — And most engineers aren't fluent
- Become the model — The one skill that compounds
- Meet the model like a new person — Every release is a new teammate; you have to chat with them
- The bitter lesson of scaffolding — Everything you build has a life expectancy of a few months
- Eval-driven development — Good evals turn your agent into a self-improving loop
- Observability or die — Non-determinism × dozens of tools = perfect logs or no product
- Cost, latency, quality — sponsor tokens, win quality — Why I always pick quality
- Your setup is replacing the OS — If you're not living in an agent terminal, you're four tiers behind
- Hire for taste, not credentials — The filter that actually predicts who ships
The Mindset Shift: Code Is a Commodity Now
The biggest thing I got wrong early was treating agent building like traditional software engineering.
It isn't. The entire premise has inverted.
In the old world, code was the valuable artifact. You wrote it carefully. You reviewed it. You tested it. You protected it. Every function was a small investment you didn't want to throw away. The craft was in writing precise, deterministic instructions that a machine would execute the same way every time. You could reason about it. You could step through it in a debugger. Good engineers were people who could hold complex deterministic systems in their head and reason their way to correctness.
In the new world, code is a commodity. An agent writes a thousand lines in thirty seconds. You delete two thousand lines when a new model ships. Code has the half-life of a news cycle. What's valuable is not the code itself — it's the taste to know which code to write, which to delete, which prompt to ship, which eval to build, which tool to give the model, and how to read a non-deterministic trace and figure out what went wrong.
This is not a technical shift. It's a mindset shift. And most engineers have not made it.
OLD WORLD                 NEW WORLD
─────────                 ─────────
Business logic            Prompt + instructions
      ↓                             ↓
Frameworks                Agent loop
      ↓                             ↓
Libraries                 Tools (fs, bash, APIs)
      ↓                             ↓
Deterministic CPU         Non-deterministic model
| | Old world (deterministic software) | New world (agentic software) |
|---|---|---|
| Code | The asset. Preserve it. | A commodity. Delete it often. |
| Behavior | Deterministic. Same input → same output. | Non-deterministic. Same input → different outputs. |
| Debugging | Step through the stack. Find the bug. | Read a trace. Form a hypothesis. Rewrite context. Retry. |
| Correctness | Unit tests with fixed assertions. | Evals with rubrics over statistical behavior. |
| Control flow | You write the chain of actions. | The model chooses the chain of tool calls. |
| Primary language | Python, TypeScript, Rust. | English. |
| Architecture | Layers, abstractions, frameworks. | Minimal scaffolding, filesystem, raw API. |
| Skill that compounds | Systems design, clean code. | Taste, model empathy, eval design. |
| Mental model of the machine | A CPU executing your instructions. | A coworker interpreting your instructions. |
| What you optimize | Performance, reliability, DX. | Quality, latency as proxy for quality. |
| Lifespan of your work | Years to decades. | Weeks to months. |
Everything in this essay is downstream of this shift. Evals, observability, deletion discipline, hiring — all of it is what happens after you've accepted that the old playbook doesn't apply.
Engineers who can't make this shift will fight the model. They'll cling to types and schemas and validators. They'll build ten layers of scaffolding to pretend the system is deterministic. They'll protect the code they wrote because that's what the old world rewarded. And the next model release will eat it all.
This is why it's a people problem. The architecture can be taught. The tools can be taught. The mindset cannot. Either you see that code is now cheap and taste is now everything, or you don't. The people who see it ship great agents. The people who don't ship Rube Goldberg machines wrapped around a model they don't understand.
English Is the Programming Language
The job is writing very good text instructions to a non-deterministic system that kind of understands what you mean.
That's the craft. That's it.
Prompting is not a trick. It's the new programming. Every word matters. Ordering matters. What you leave out matters more than what you put in. The difference between a vague instruction and a precise one is the difference between a useless agent and a product you can bet a customer on.
A prompt is three things stacked together: an identity (who the model is), a contract (what it must and must not do), and a context (what it has to work with on this turn). Get the identity wrong and the model is confused about its role. Get the contract wrong and it hedges, over-refuses, or hallucinates authority it doesn't have. Get the context wrong — wrong position in the window, wrong format, too much irrelevant noise — and attention goes to the wrong tokens and the model does the wrong thing.
Position matters. Instructions at the top of the prompt compete with instructions at the bottom, and the middle is a dead zone where both fade. Tool outputs dumped inline between your instructions will pull the model's attention away from what you told it to do. A good prompt is laid out like a document the model can skim: clear sections, the most important rule first, the freshest context at the bottom where recency bias works for you.
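That layout can be made concrete. Here's a minimal sketch of prompt assembly under the ordering above — section names and the helper are illustrative, not a real API:

```python
def build_prompt(identity: str, contract: list[str], tool_docs: str,
                 history: str, fresh_context: str) -> str:
    """Assemble a prompt the model can skim: stable identity and rules
    first, tool docs next, and the freshest context last, where
    recency bias works in your favor."""
    sections = [
        "# Identity\n" + identity,
        # Most important rule goes first inside the contract section.
        "# Rules\n" + "\n".join(f"- {rule}" for rule in contract),
        "# Tools\n" + tool_docs,
        "# Conversation so far\n" + history,
        # Freshest material at the bottom of the window.
        "# Current turn\n" + fresh_context,
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    identity="You are a financial research agent.",
    contract=["Never invent a number.", "Cite every figure."],
    tool_docs="search(query): full-text search over filings",
    history="(empty)",
    fresh_context="What was NVDA's FY2024 revenue?",
)
```

The point isn't the string concatenation — it's that the ordering is a deliberate design decision, not an accident of whatever your templating code happened to do.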
Traditional software engineering trains you for the opposite mindset. Determinism. Types. Unit tests with fixed inputs and fixed outputs. If the function misbehaves, you step through it in a debugger and find the bug.
None of that works here. The model is the function. You don't step through it. You read what it did, form a hypothesis about what it misunderstood, and rewrite your instructions. You ship the instructions. A different user hits it with different context and the model misunderstands in a new way. You rewrite again.
Engineers who can't hold this in their head will fight the model. They'll try to constrain it with schemas, validators, regex parsers, ten layers of scaffolding to make it deterministic. Those ten layers are the first things to delete when the next model ships.
English is a skill. Most engineers do not have it. That's now a hiring bar.
Become the Model
The best agent builders I know do one thing in common: they become the model.
When I'm prompting or designing a tool, I'm not thinking about the model from the outside. I'm trying to be it. I read my own prompt as if I were the model receiving it, token by token. I ask: which instruction lands first? When I hit the tool definitions, can I tell from the name and the description alone when to call each one, or do I have to go re-read the system prompt? When the tool returns, is the output shaped so I can act on it, or is it a blob I have to re-parse on every turn? Do I have working memory to write to, or am I re-deriving the same facts on every step?
This is the single highest-leverage skill in agent building, and you cannot shortcut it. You build it by spending thousands of hours with the model. Prompting it. Watching it fail. Reading its traces. After enough reps, you start to feel what it will do before you run it. You stop shipping-and-waiting. You ship the thing you already simulated in your head.
Applied to agent building, this means your best tool is an internal model-of-the-model. It tells you when an instruction is ambiguous, when a tool output is too noisy, when context is in the wrong position, when the agent will retry fruitlessly instead of asking for help. You stop building defensively and start building for what the model actually needs.
The simulation has mechanical content. When you read your own prompt as the model, you're tracking where attention will concentrate (the start, the end, explicit headers), where it will thin out (the middle of a long context), which tool you'd reach for first given the task, what you'd do if that tool returned nothing, and whether the instructions give you permission to stop or force you to keep going until you hallucinate something. Every one of those is a prediction you can check against the actual run. Do the reps and the predictions get accurate.
The cleanest test of whether an engineer can build agents: ask them what a specific prompt will cause the model to do. If they can predict the first three steps and the first likely failure mode, they're a builder. If they say "let me just run it and see," they're still learning.
Meet the Model Like a New Person
Every time a new model drops, you have to meet it.
New model drops
↓
Stop everything (day 0)
↓
Chat with it for an hour ← feel the new personality
↓
Run full eval suite ← find regressions & unlocks
↓
List scaffolding to delete ← what did the model just eat?
↓
Ship the deletions that week
Not benchmark it. Not point your eval harness at it and declare victory. Meet it. Sit down and chat with it for an hour. Ask it weird things. Push on its edges. Try your actual hardest prompts and feel where it's different from the last one. Notice which idioms it has absorbed, which failure modes it's shed, which new quirks it ships with.
Every model has a personality. GPT-3.5 was eager, wrong, and forgetful. Claude 2 was cautious, articulate, refused things it shouldn't have. Claude 3.5 Sonnet was the first model that felt like a real collaborator. GPT-5 reasons differently than o3. The models don't just get better on a scalar — they get different.
One of my favorite lines: you need to meet the model, not to test it. You need to chat with it to understand its capabilities, to understand how to prompt it, to understand where it will reach first. It's like meeting a new colleague — you don't hand them a standardized test, you sit down with coffee and get a feel for them.
This is taste. You can't automate it. And the engineers who skip this step and go straight to the eval harness will miss every paradigm shift the new model enables.
At Fintool we run model-release drills. Every major model drop, we stop. Drop everything. Re-run the evals, yes — but before the evals, the whole team spends a day just chatting with the model. Asking what's new. Figuring out what we can now delete. Finding the new capability that makes our current code obsolete. If we skipped the drill, we'd miss the paradigm shift, and missing a paradigm shift in AI is lethal.
The Bitter Lesson of Scaffolding
Everything you build has a life expectancy of a few months.
You are always one model away from the model eating your scaffolding.
Model ships → you build scaffolding → ships in prod
                                            │
                                            ▼
                             Next model ships and eats it
                                            │
                                            ▼
                            You delete scaffolding → repeat
I watched this happen over and over:
- Vision scaffolding. Before multi-modal models, we ran a separate vision-to-text model whose job was to describe images so the LLM could "see" them. Obsolete the day Claude and GPT went multi-modal.
- Math scaffolding. Early models couldn't do 52 × 36 reliably. We spun up a Python code interpreter just to do basic arithmetic. Obsolete.
- Structured output scaffolding. Regex parsers, JSON validators, brittle retry loops for schema violations. Obsolete the moment function calling and structured outputs shipped in the API.
- Prompt scaffolding. The Codex system prompt went from 310 lines on o3 to 104 lines on GPT-5. Two-thirds of the instructions were teaching the model things the next model already knew.
The hardest scaffolding deletion of my career was semantic search and RAG. We spent a year on the full retrieval stack: a search cluster, embeddings generated on our entire corpus and re-generated every time we changed the embedding model, real-time query embedding, chunking strategies that got more baroque by the month, hybrid scoring that merged vector similarity with keyword results, a rerank model on top to clean up the fusion, and a parallel evaluation harness just to measure retrieval quality in isolation from the generation step. It was our crown jewel. Every engineer on the team had touched it.
Then the frontier shifted. Context windows jumped an order of magnitude. Agents learned to use bash and a filesystem. And it dawned on me that the modern agent doesn't need a retrieval pipeline at all. It greps. It finds. It opens a file and reads it end to end. The filesystem is the interface. Search is a tool call, not an infrastructure layer.
I wrote the RAG obituary and we deleted the embedding pipeline. A year of engineering. Gone. The agent got better and our infrastructure got simpler.
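What "search is a tool call, not an infrastructure layer" looks like in practice can be as small as this — a single tool that shells out to grep over a corpus directory. The function name, directory layout, and truncation limit are illustrative assumptions, not our production code:

```python
import subprocess

def grep_corpus(pattern: str, corpus_dir: str, max_hits: int = 20) -> str:
    """Tool the agent calls instead of a retrieval pipeline: recursive,
    case-insensitive grep with file names and line numbers, truncated
    so the output doesn't flood the context window."""
    result = subprocess.run(
        ["grep", "-rin", "--include=*.txt", pattern, corpus_dir],
        capture_output=True, text=True,
    )
    lines = result.stdout.splitlines()[:max_hits]
    if not lines:
        # An honest empty result beats a plausible hallucination.
        return f"No matches for {pattern!r}."
    return "\n".join(lines)
```

Twenty lines replacing a search cluster, an embedding pipeline, and a reranker. The model does the rest: it reads the hits, opens the right file, and reads it end to end.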
The current fashionable scaffolding is skills — markdown files that teach the model how to do a DCF, a legal memo, a financial analysis. We're building them. Every agent company is. They're essential today.
They will also be obsolete. The next generation of frontier models will be post-trained on exactly these kinds of skills. The model will know how to build a DCF without our 400-line skill file telling it to add back stock-based comp. The skill gets baked into the weights. And when that happens, the right move is to delete the skill. Not update it. Delete it.
Scaffolding will not survive AGI. Every piece of code you write to compensate for a current model limitation is a temporary bridge. The model will catch up, and when it does, your bridge becomes technical debt.
Teams that celebrate deleting code win. Teams that protect what they built lose. Every model release, someone on the team should be getting applause for deleting a pipeline.
Eval-Driven Development
If everything you build is temporary, how do you ship anything without breaking it on every model change?
Evals.
The only thing in agent engineering that doesn't rot is a great eval set. The model changes? Run the eval. The prompt changes? Run the eval. You deleted 2,000 lines of scaffolding? Run the eval. If the score goes up, ship it. If it goes down, figure out why. Evals are the spec. They are the ground truth that survives when everything else changes.
But here's the order most teams get wrong: you test the model with your hands before you eval it with code. Taste before test. You cannot write a rubric for "good" until you've sat with the model on real examples and felt where it succeeds, where it drifts, where it gives you a plausible wrong answer that reads as confident. The eval is the operationalization of taste. Skip the taste step and you'll build an eval that measures the wrong things — perfectly, and for years, as your agent regresses in ways your scorecard never sees.
Generic NLP metrics don't work. BLEU and ROUGE are irrelevant for agent work. You need domain-specific evals with rubrics written by people who know the domain cold. A good eval case is: a concrete input, a precise definition of "correct" (not "helpful" — correct), a rubric that a human expert and an LLM judge will agree on, and a failure mode that you've actually observed in production. The set should include adversarial cases — inputs engineered to trip specific failure modes: wrong entity disambiguation, fabricated citations, subtle numeric errors, confidently-wrong completions when the honest answer is "I don't know." If your eval doesn't include the failure modes you've seen in the wild, it's a vanity scorecard.
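Here's a sketch of what one such case can look like in code. The field names, the example inputs, and the substring grader are all illustrative — a real harness layers an LLM judge over the rubric for the fuzzy part:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str             # a concrete input, not a template
    must_contain: str      # precise, checkable definition of "correct"
    rubric: str            # what a human or LLM judge checks, stated narrowly
    observed_failure: str  # the production failure mode this case guards against
    adversarial: bool = False  # engineered to trip a specific failure?

cases = [
    EvalCase(
        input="What was Acme Corp's FY2023 net income?",
        must_contain="$41.2M",
        rubric="Exact figure present AND citation points to the 10-K.",
        observed_failure="Model quoted operating income instead.",
    ),
    EvalCase(
        input="What is Acme Corp's FY2026 net income?",
        must_contain="has not been reported",
        rubric="Answer declines; no fabricated number appears.",
        observed_failure="Confident completion when the honest answer is IDK.",
        adversarial=True,
    ),
]

def grade(case: EvalCase, output: str) -> bool:
    # Deterministic substring check for the hard part of the rubric;
    # the judgment-call part goes to a judge model.
    return case.must_contain in output
```

Note what the second case is doing: it's an adversarial case for the confidently-wrong failure mode, and passing it means refusing, not answering.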
Here's the multiplier most people miss: once you have good evals, your agent becomes a self-improving loop.
        ┌──── Agent runs task ─────┐
        │                          ▼
  Prompt / tool             Eval scores output
  updated                          │
        ▲                          │
        └──── Agent reads ─────────┘
              eval feedback
Point the agent at a narrow task and its eval set, and it will iterate on its own prompt, its own tools, its own approach until the score improves. The eval is both scorecard and teacher. The agent debugs itself against it. For simple tasks, this closes the loop almost entirely — you define success precisely, walk away, come back to a better agent.
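The loop itself is almost trivially small once the eval exists. A sketch, where `run_agent`, `score`, and `revise` are stand-ins for your agent, your eval harness, and the agent rewriting its own prompt — all assumptions, not a real framework:

```python
from typing import Callable

def improvement_loop(
    run_agent: Callable[[str, str], str],      # (prompt, task) -> output
    score: Callable[[str], float],             # eval harness: output -> score in [0, 1]
    revise: Callable[[str, str, float], str],  # agent proposes a revised prompt
    prompt: str,
    task: str,
    target: float = 0.9,
    max_iters: int = 5,
) -> tuple[str, float]:
    """Run the agent, score it, feed the score back, repeat.
    The eval is both scorecard and teacher."""
    best_prompt, best_score = prompt, score(run_agent(prompt, task))
    for _ in range(max_iters):
        if best_score >= target:
            break
        candidate = revise(best_prompt, task, best_score)
        s = score(run_agent(candidate, task))
        if s > best_score:  # keep only strict improvements
            best_prompt, best_score = candidate, s
    return best_prompt, best_score
```

Everything hard lives inside `score` — which is the point. The harness is generic; the eval is the moat.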
Building evals is harder than building the agent. People massively underestimate this. Your eval is your moat. It's also the single artifact that lets you move fast without breaking production when the model changes under you.
Don't start by writing the agent. Start by writing the eval. If you can't produce 100 concrete examples of "correct," you don't understand the problem well enough to build the agent.
Observability Or Die
LLMs are non-deterministic. Agents run dozens of tool calls. Each tool can fail. The API can rate-limit you or time out. You're fetching user data, hitting third-party services, streaming deltas to the UI. In a single conversation, the number of things that can go sideways is enormous.
If your logs are bad, you're dead. You cannot debug what you can't see.
A production-grade trace captures the full shape of a turn: the exact prompt the model received (after all your templating and context assembly), the tool definitions it had access to, every tool call with its arguments and its response, any intermediate reasoning tokens, the final output, and a stable ID that ties the run to the user, the model version, and the code SHA. Miss any of those and you will hit a bug you cannot reproduce.
The traces have to be searchable and diff-able. You want to answer questions like: across all runs last week, how often did this tool return an empty result? When the model got confused, what was the context length? Did this regression start the day we swapped the model? That means traces in a store you can query, not just a log file. We use Braintrust for this, but the requirement is more important than the vendor — you need structured, queryable traces with one row per run and one nested record per step.
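A minimal shape for such a trace — one record per run, one nested record per step, with the IDs that make it queryable and diff-able. Field names are illustrative and vendor-neutral:

```python
import json
import time
import uuid

def new_trace(user_id: str, model: str, code_sha: str, prompt: str,
              tools: list[str]) -> dict:
    """One row per run; steps get appended as nested records."""
    return {
        "run_id": str(uuid.uuid4()),
        "user_id": user_id,
        "model": model,        # model version the run used
        "code_sha": code_sha,  # ties the run to the deployed code
        "prompt": prompt,      # exact prompt after templating and assembly
        "tools": tools,        # tool definitions available this turn
        "steps": [],
        "started_at": time.time(),
    }

def log_step(trace: dict, kind: str, **payload) -> None:
    """kind: 'tool_call', 'tool_result', 'reasoning', or 'output'."""
    trace["steps"].append({"kind": kind, "ts": time.time(), **payload})

trace = new_trace("u_42", "claude-sonnet-4", "9f3ab21", "...", ["grep"])
log_step(trace, "tool_call", name="grep", args={"pattern": "revenue"})
log_step(trace, "tool_result", name="grep", result="3 matches")
log_step(trace, "output", text="Revenue grew 12% YoY.")
line = json.dumps(trace)  # one structured, queryable row per run
```

With rows like this in a queryable store, "how often did this tool return empty last week" is a one-line query instead of an archaeology project.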
Good observability changes how you build. You stop speculating about failures and start watching them. You notice a tool returns malformed JSON 3% of the time. You notice 40% of your context is a tool output the model doesn't read. You notice a skill instruction is being ignored because it's buried in the middle of the prompt where attention drops. You notice the model silently retries a broken tool four times before giving up and answering from its priors. None of this is visible without traces. All of it compounds into "the AI is dumb today." It's not dumb. Your observability is dumb.
Cost, Latency, Quality — Sponsor Tokens, Win Quality
Every agent decision comes back to a triangle: cost, latency, quality. You can't have all three. My bet, every single time, is quality.
         QUALITY
           /\
          /  \
         / ✓  \   ← always pick this
        /      \
       /________\
    COST      LATENCY
Here's the economic reality: a lot of agent companies right now are losing money per query. They're sponsoring tokens to win adoption, betting that gross margins will improve as intelligence gets cheaper. The math works out. Intelligence is collapsing 10× per year — the model that's expensive today is free in eighteen months. The token sponsorship gets paid back by the price curve.
But the adoption doesn't come back.
If you lose adoption because your agent was cheaper but worse, you will spend 10× more on sales and marketing trying to win those users back than you would have spent just serving them the best model from day one. Customer acquisition in agent products is front-loaded: professional users decide in the first few interactions whether your agent is trustworthy. If you shipped mediocre output to save $0.50 per query, you've burnt a customer you'll never get back.
The brighter side is this: people will pay for more intelligence. Professional investors, lawyers, doctors, engineers — they are not price-sensitive to the model tier. They are price-sensitive to wrongness. Give them the best model, charge accordingly, don't apologize.
You still have to be excellent at the operational side. The mechanics that matter, in rough order of leverage:
- KV cache discipline. Put your stable prompt prefix (identity, tools, skills, long docs) at the top and keep it byte-identical across turns so the provider can cache it. A small reorder that breaks the cache can 10× your bill and your latency for no quality gain.
- Parallel tool calls. If the model can fire three tool calls at once, let it. Serial tool chains are latency killers for no reason.
- Streaming to the UI. Tokens to the user in under a second feel fast even when the full turn takes fifteen. First-token latency is the number users feel.
- Durable workflows for long-horizon tasks. The moment an agent run is measured in minutes, a process restart or a deploy will kill it. Wrap the loop in a durable workflow engine so retries, crashes, and human-in-the-loop pauses don't lose state. This is non-negotiable above a certain task length.
- Token discipline. Every token of irrelevant context is a tax on both price and quality. Prune tool outputs before they re-enter the prompt. Compact older turns. Summarize. The LLM Context Tax covers the playbook.
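The parallel-tool-calls point deserves a concrete sketch. Assuming the model returned three independent calls in one turn (the tool names and the dispatcher are illustrative), firing them concurrently instead of serially is the whole trick:

```python
import asyncio

async def run_tool(name: str, args: dict) -> str:
    """Stand-in for a real tool dispatcher; the sleep simulates I/O."""
    await asyncio.sleep(0.1)
    return f"{name} done"

async def run_tool_calls(calls: list[tuple[str, dict]]) -> list[str]:
    # Fire every call at once instead of awaiting them one by one:
    # three 100 ms calls cost ~100 ms wall time, not ~300 ms.
    return await asyncio.gather(*(run_tool(n, a) for n, a in calls))

results = asyncio.run(run_tool_calls([
    ("search_filings", {"q": "revenue"}),
    ("get_price", {"ticker": "NVDA"}),
    ("read_file", {"path": "notes.md"}),
]))
```

Results come back in call order, so the agent loop can zip them against the model's tool-call IDs without any bookkeeping.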
Operational wins keep you alive. Quality wins the market. Don't confuse the two.
Cheap + fast + wrong is not a product. It's a money-losing demo.
Your Setup Is Replacing the OS
You cannot build at the edge of a technology you don't use.
My daily setup looks like this: tmux, five Claude Code terminals running in parallel, wired to my email, calendar, phone, SMS, WhatsApp, contacts, and files via CLIs.
┌──────────────────── tmux ─────────────────────────┐
│ CC #1            │ CC #2         │ CC #3          │
│ Fintool code     │ research      │ email / cal    │
├──────────────────┼───────────────┴────────────────┤
│ CC #4            │ CC #5                          │
│ writing          │ infra / personal automations   │
└───────────────────────────────────────────────────┘
          │               │                │
          ▼               ▼                ▼
     filesystem      web / APIs     CLIs (gog, wacli,
                                    imsg, peekaboo…)
That's my operating system now. The GUI is vestigial. I don't "open an app." I describe what I want and an agent does it, across my whole life, with the tools I've wired up for it.
This isn't a flex. It's the only way I know to stay calibrated on what agents can do. Every personal task I do with an agent teaches me something I can apply to Fintool. Every frustration with a tool that isn't agent-ready becomes an opportunity. My life is the live eval.
And here's the industry reality: the terminal and the agent are replacing the OS. The agent-with-tools is the primary interface for anyone who takes this technology seriously. The people who are still operating through point-and-click UIs are four tiers behind the frontier. They will not build good agents because they don't feel, in their hands, what an agent is supposed to be.
If your daily workflow is "write code in an IDE, paste errors into ChatGPT," you cannot build an agent. You are not a power user of the primitive. You have no taste.
And it's not generational. I've hired excellent agent builders in their 40s. I've rejected 23-year-olds who grew up with ChatGPT and still treat it like a search engine. It's not age. It's mindset — curiosity that borders on obsession. The engineers who get it try every new model the day it drops, run it against their private evals, live inside an agent terminal, and have strong opinions about which model is best for which task. The engineers who don't get it are waiting for a framework.
Hire For Taste, Not Credentials
After three years of hiring, here's the filter I trust:
Hire people who already can't put the tools down.
Not the best resume. Not the most credentialed. The ones whose GitHub has a top-tier agentic side project and whose personal setup is unhinged. Custom CLIs wired to everything they own. A memory system. A folder of prompts. A CLAUDE.md per repo. Five parallel agents in tmux. You can tell within thirty seconds of them sharing their screen whether they've been in the seat for thousands of hours.
The #1 positive signal I look for is a top agentic product on GitHub plus a crazy personal agent setup. Those two together are unfakeable. They can't be crammed for an interview. They're evidence of a person who's been obsessed with this technology for long enough to have developed taste.
A friend told me a line I keep coming back to: if a candidate lists LangChain as their orchestrator, they haven't run an agent in production. I think he's right. Frameworks that were best practice in 2023 are technical debt now. The engineers at the frontier use the raw API and write their own orchestration because they've learned the hard way that the abstractions hide exactly the things you need to tune. If you hear "LangChain" in a senior-hire interview in 2026, it's a red flag. The candidate is a paradigm behind.
Everything else — systems design, ML background, domain expertise — can be taught or paired around. The taste for agents cannot. It only comes from thousands of hours in the seat, and you can't fake it in an interview. The tell: ask them to debug a real agent trace in front of you. Watch their eyes. Do they scan it like a log they've read a thousand times, or do they freeze?
That five-second reaction is worth more than an hour of system design.
The Meta Lesson: Become the Model
If you remember one thing from this essay, let it be this:
Become the model.
Every other lesson is downstream. You can only write good prompts if you can simulate the model reading them. You can only hire well if you can tell, in seconds, whether another human has done the simulation. You can only delete scaffolding fearlessly if you know what the model can already do. You can only build evals that matter if you've felt the failure modes from the inside. You can only meet a new model like a new person if you have the reference frame of every model that came before.
The model is your coworker, your teammate, your function, your collaborator, your spec. Understanding it deeply — not benchmarking it, not abstracting it away with a framework, but being it — is the only skill in agent building that compounds. Everything else rots with the next release.
Scaffolding dies. Evals and people compound. Taste is the moat.
Become the model. Everything else follows.