Nicolas Bustamante

Hard Lessons Building Agents Since GPT-3.5

I've been building AI agents at Fintool since GPT-3.5. Three years of shipping to professional investors, in a domain where a wrong number costs someone millions and you never get your credibility back. In those three years we've rewritten the product I don't know how many times. Every major model release made half our code obsolete overnight.

Here's what I've actually learned. Not the glamorous lessons. The hard ones.

  • Code is a commodity now — The mindset shift most engineers haven't made
  • English is the programming language — And most engineers aren't fluent
  • Become the model — The one skill that compounds
  • Meet the model like a new person — Every release is a new teammate; you have to chat with them
  • The bitter lesson of scaffolding — Everything you build has a life expectancy of a few months
  • Eval-driven development — Good evals turn your agent into a self-improving loop
  • Observability or die — Non-determinism × dozens of tools = perfect logs or no product
  • Cost, latency, quality — sponsor tokens, win quality — Why I always pick quality
  • Your setup is replacing the OS — If you're not living in an agent terminal, you're four tiers behind
  • Hire for taste, not credentials — The filter that actually predicts who ships

The Mindset Shift: Code Is a Commodity Now

The biggest thing I got wrong early was treating agent building like traditional software engineering.

It isn't. The entire premise has inverted.

In the old world, code was the valuable artifact. You wrote it carefully. You reviewed it. You tested it. You protected it. Every function was a small investment you didn't want to throw away. The craft was in writing precise, deterministic instructions that a machine would execute the same way every time. You could reason about it. You could step through it in a debugger. Good engineers were people who could hold complex deterministic systems in their head and reason their way to correctness.

In the new world, code is a commodity. An agent writes a thousand lines in thirty seconds. You delete two thousand lines when a new model ships. Code has the half-life of a news cycle. What's valuable is not the code itself — it's the taste to know which code to write, which to delete, which prompt to ship, which eval to build, which tool to give the model, and how to read a non-deterministic trace and figure out what went wrong.

This is not a technical shift. It's a mindset shift. And most engineers have not made it.

OLD WORLD                          NEW WORLD
─────────                          ─────────
  Business logic                     Prompt + instructions
       ↓                                   ↓
  Frameworks                         Agent loop
       ↓                                   ↓
  Libraries                          Tools (fs, bash, APIs)
       ↓                                   ↓
  Deterministic CPU                  Non-deterministic model

                            Old world (deterministic software)          New world (agentic software)
                            ──────────────────────────────────          ────────────────────────────
Code                        The asset. Preserve it.                     A commodity. Delete it often.
Behavior                    Deterministic. Same input → same output.    Non-deterministic. Same input → different outputs.
Debugging                   Step through the stack. Find the bug.       Read a trace. Form a hypothesis. Rewrite context. Retry.
Correctness                 Unit tests with fixed assertions.           Evals with rubrics over statistical behavior.
Control flow                You write the chain of actions.             The model chooses the chain of tool calls.
Primary language            Python, TypeScript, Rust.                   English.
Architecture                Layers, abstractions, frameworks.           Minimal scaffolding, filesystem, raw API.
Skill that compounds        Systems design, clean code.                 Taste, model empathy, eval design.
Mental model of the machine A CPU executing your instructions.          A coworker interpreting your instructions.
What you optimize           Performance, reliability, DX.               Quality, latency as proxy for quality.
Lifespan of your work       Years to decades.                           Weeks to months.

Everything in this essay is downstream of this shift. Evals, observability, deletion discipline, hiring — all of it is what happens after you've accepted that the old playbook doesn't apply.

Engineers who can't make this shift will fight the model. They'll cling to types and schemas and validators. They'll build ten layers of scaffolding to pretend the system is deterministic. They'll protect the code they wrote because that's what the old world rewarded. And the next model release will eat it all.

This is why it's a people problem. The architecture can be taught. The tools can be taught. The mindset cannot. Either you see that code is now cheap and taste is now everything, or you don't. The people who see it ship great agents. The people who don't see it ship Rube Goldberg machines wrapped around a model they don't understand.

English Is the Programming Language

The job is writing very good text instructions to a non-deterministic system that kind of understands what you mean.

That's the craft. That's it.

Prompting is not a trick. It's the new programming. Every word matters. Ordering matters. What you leave out matters more than what you put in. The difference between "analyze this filing" and "read this 10-K and flag any disclosure that contradicts the guidance on the prior earnings call, with the exact quote and page number" is the difference between a useless agent and a $1,000/month product.

Traditional software engineering trains you for the opposite mindset. Determinism. Types. Unit tests with fixed inputs and fixed outputs. If the function misbehaves, you step through it in a debugger and find the bug.

None of that works here. The model is the function. You don't step through it. You read what it did, form a hypothesis about what it misunderstood, and rewrite your instructions. You ship the instructions. A different user hits it with different context and the model misunderstands in a new way. You rewrite again.

Engineers who can't hold this in their head will fight the model. They'll try to constrain it with schemas, validators, regex parsers, ten layers of scaffolding to make it deterministic. Those ten layers are the first things to delete when the next model ships.

English is a skill. Most engineers do not have it. That's now a hiring bar.

Become the Model

The best agent builders I know all have one thing in common: they become the model.

When I'm prompting or designing a tool, I'm not thinking about the model from the outside. I'm trying to be it. I read my own prompt as if I were the model receiving it. I ask: where will I need to load a skill to get additional instructions? Will I need to explore the filesystem to retrieve this data? Which tool do I need to use to accomplish this prompt? How much context do I have? Where's the ambiguity that will trip me up?

This is the single highest-leverage skill in agent building, and you cannot shortcut it. You build it by spending thousands of hours with the model. Prompting it. Watching it fail. Reading its traces. After enough reps, you start to feel what it will do before you run it. You stop shipping-and-waiting. You ship the thing you already simulated in your head.

Geoffrey Hinton talks about this kind of mental simulation for understanding neural networks. Applied to agent building, it means your best tool is an internal model-of-the-model. It tells you when an instruction is ambiguous, when a tool output is too noisy, when context is in the wrong position, when the agent will retry fruitlessly instead of asking for help. You stop building defensively and start building for what the model actually needs.

The cleanest test of whether an engineer can build agents: ask them what a specific prompt will cause the model to do. If they can predict the first three steps, they're a builder. If they say "let me just run it and see," they're still learning.

Meet the Model Like a New Person

Every time a new model drops, you have to meet it.

New model drops
      ↓
Stop everything (day 0)
      ↓
Chat with it for an hour      ← feel the new personality
      ↓
Run full eval suite           ← find regressions & unlocks
      ↓
List scaffolding to delete    ← what did the model just eat?
      ↓
Ship the deletions that week

Not benchmark it. Not point your eval harness at it and declare victory. Meet it. Sit down and chat with it for an hour. Ask it weird things. Push on its edges. Try your actual hardest prompts and feel where it's different from the last one. Notice which idioms it has absorbed, which failure modes it's shed, which new quirks it ships with.

Every model has a personality. GPT-3.5 was eager, wrong, and forgetful. Claude 2 was cautious, articulate, refused things it shouldn't have. Claude 3.5 Sonnet was the first model that felt like a real collaborator. GPT-5 reasons differently than o3. The models don't just get better on a scalar — they get different.

One of my favorite lines: you need to meet the model, not test it. You chat with it to understand its capabilities, how to prompt it, what it will reach for first. It's like meeting a new colleague — you don't hand them a standardized test, you sit down with coffee and get a feel for them.

This is taste. You can't automate it. And the engineers who skip this step and go straight to the eval harness will miss every paradigm shift the new model enables.

At Fintool we run model-release drills. Every major model drop, we stop. Drop everything. Re-run the evals, yes — but before the evals, the whole team spends a day just chatting with the model. Asking what's new. Figuring out what we can now delete. Finding the new capability that makes our current code obsolete. If we skipped the drill, we'd miss the paradigm shift, and missing a paradigm shift in AI is lethal.

The Bitter Lesson of Scaffolding

Everything you build has a life expectancy of a few months.

You are always one model away from the model eating your scaffolding.

Model ships → you build scaffolding → ships in prod
                                            │
                                            ▼
                          Next model ships and eats it
                                            │
                                            ▼
                    You delete scaffolding → repeat

I watched this happen over and over:

  • Vision scaffolding. Before multi-modal models, we ran a separate vision-to-text model whose job was to describe images so the LLM could "see" them. Obsolete the day Claude and GPT went multi-modal.
  • Math scaffolding. Early models couldn't do 52 × 36 reliably. We spun up a Python code interpreter just to do basic arithmetic. Obsolete.
  • Structured output scaffolding. Regex parsers, JSON validators, brittle retry loops for schema violations. Obsolete the moment function calling and structured outputs shipped in the API.
  • Prompt scaffolding. The Codex system prompt went from 310 lines on o3 to 104 lines on GPT-5. Two-thirds of the instructions were teaching the model things the next model already knew.

The hardest scaffolding deletion of my career was semantic search and RAG. We spent a year building an embedding pipeline. Vector DB, reranker, chunking strategies, evaluation harnesses for retrieval quality — the full stack. It was our crown jewel. Then Claude Code shipped with a filesystem and bash tools, and it dawned on me that the modern agent doesn't do semantic search. It greps. It finds. It reads files. The filesystem is the interface.

I wrote the RAG obituary and we deleted the embedding pipeline. A year of engineering. Gone. The agent got better and our infrastructure got simpler.

The current fashionable scaffolding is skills — markdown files that teach the model how to do a DCF, a legal memo, a financial analysis. We're building them. Every agent company is. They're essential today.

They will also be obsolete. The next generation of frontier models will be post-trained on exactly these kinds of skills. The model will know how to build a DCF without our 400-line skill file telling it to add back stock-based comp. The skill gets baked into the weights. And when that happens, the right move is to delete the skill. Not update it. Delete it.

Scaffolding will not survive AGI. Every piece of code you write to compensate for a current model limitation is a temporary bridge. The model will catch up, and when it does, your bridge becomes technical debt.

Teams that celebrate deleting code win. Teams that protect what they built lose. Every model release, someone on the team should be getting applause for deleting a pipeline.

Eval-Driven Development

If everything you build is temporary, how do you ship anything without breaking it on every model change?

Evals.

The only thing in agent engineering that doesn't rot is a great eval set. The model changes? Run the eval. The prompt changes? Run the eval. You deleted 2,000 lines of scaffolding? Run the eval. If the score goes up, ship it. If it goes down, figure out why. Evals are the spec. They are the ground truth that survives when everything else changes.

Generic NLP metrics don't work. BLEU and ROUGE are irrelevant for agent work. You need domain-specific evals with rubrics written by actual experts. At Fintool we maintain thousands of test cases across ticker disambiguation, fiscal period normalization, numeric precision, adversarial grounding (we plant fake numbers to check the model cites the real source), and every skill we ship. Every PR runs the eval. Drop more than 5% and the PR is blocked.
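The gating mechanic itself is small enough to sketch. A hypothetical version in Python: the case schema, the rubric in `score_case`, and the adversarial-grounding check are illustrative stand-ins, not our actual harness.

```python
# Hypothetical sketch of a PR-gating eval runner. The case format and
# rubric are illustrative; real rubrics are written by domain experts.
REGRESSION_THRESHOLD = 0.05  # block the PR on a >5% score drop

def score_case(answer: str, case: dict) -> float:
    """Rubric stub: fail outright if the answer cites a planted fake
    number (adversarial grounding), else score by required facts hit."""
    if any(fake in answer for fake in case.get("planted_fakes", [])):
        return 0.0
    hits = sum(1 for fact in case["required_facts"] if fact in answer)
    return hits / len(case["required_facts"])

def run_suite(agent, cases: list[dict]) -> float:
    """Mean rubric score across the whole eval set."""
    return sum(score_case(agent(c["question"]), c) for c in cases) / len(cases)

def gate_pr(agent, cases: list[dict], baseline: float) -> bool:
    """True if the PR may merge: at most a 5% regression vs baseline."""
    return run_suite(agent, cases) >= baseline * (1 - REGRESSION_THRESHOLD)
```

The same `run_suite` runs on every model change, every prompt change, every scaffolding deletion; only the baseline moves.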

Here's the multiplier most people miss: once you have good evals, your agent becomes a self-improving loop.

         ┌──── Agent runs task ────┐
         │                         ▼
   Prompt / tool           Eval scores output
    updated                        │
         ▲                         ▼
         └───── Agent reads ───────┘
                eval feedback

Point the agent at a narrow task and its eval set, and it will iterate on its own prompt, its own tools, its own approach until the score improves. The eval is both scorecard and teacher. The agent debugs itself against it. For simple tasks, this closes the loop almost entirely — you define success precisely, walk away, come back to a better agent.
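In code, the loop is only a few lines. A sketch under stated assumptions: `run_task`, `evaluate`, and `revise` are hypothetical callables, and in practice `revise` is itself an LLM call that reads the eval feedback.

```python
def self_improve(run_task, evaluate, revise, prompt,
                 target: float = 0.95, max_iters: int = 8):
    """Close the loop: run the task, score it, feed the eval feedback
    back into a prompt revision, keep only revisions that score higher."""
    score, feedback = evaluate(run_task(prompt))
    for _ in range(max_iters):
        if score >= target:
            break
        candidate = revise(prompt, feedback)
        cand_score, cand_feedback = evaluate(run_task(candidate))
        if cand_score > score:  # the eval is both scorecard and teacher
            prompt, score, feedback = candidate, cand_score, cand_feedback
    return prompt, score
```

The precision of `evaluate` is the whole game: a sloppy rubric teaches the agent to game the score, a sharp one teaches it the task.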

Building evals is harder than building the agent. People massively underestimate this. Your eval is your moat. It's also the single artifact that lets you move fast without breaking production when the model changes under you.

Don't start by writing the agent. Start by writing the eval. If you can't produce 100 concrete examples of "correct," you don't understand the problem well enough to build the agent.

Observability Or Die

LLMs are non-deterministic. Agents run dozens of tool calls. Each tool can fail. The API can rate-limit or timeout. You're fetching user data, hitting third-party services, streaming deltas to the UI. In a single conversation, the number of things that can go sideways is enormous.

If your logs are bad, you're dead. You cannot debug what you can't see.

We use Braintrust for production traces and evals, and I can't recommend it strongly enough. Every LLM call, every tool call, every intermediate state is captured. When a user reports a weird answer, I pull the exact trace, see which tool returned what, where the model got confused, what context it had at each step.

Good observability changes how you build. You stop speculating about failures and start watching them. You notice a tool returns malformed JSON 3% of the time. You notice 40% of your context is a tool output the model doesn't read. You notice a skill instruction is being ignored because it's buried in the middle of the prompt where attention drops. None of this is visible without traces. All of it compounds into "the AI is dumb today." It's not dumb. Your observability is dumb.
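Mechanically, "capture everything" means every tool call leaves a span: inputs, output, error, timing, tied to a conversation id. A hand-rolled sketch (the span fields here are my own illustration, not Braintrust's schema):

```python
import time

class Trace:
    """Minimal trace recorder: one span per tool call, queryable later."""
    def __init__(self, conversation_id: str):
        self.conversation_id = conversation_id
        self.spans: list[dict] = []

    def wrap_tool(self, fn):
        """Wrap a tool so every call, including every failure, leaves a span."""
        def wrapped(**kwargs):
            start = time.perf_counter()
            try:
                out = fn(**kwargs)
                self._record(fn.__name__, kwargs, out, None, start)
                return out
            except Exception as exc:
                self._record(fn.__name__, kwargs, None, repr(exc), start)
                raise
        return wrapped

    def _record(self, name, inputs, output, error, start):
        self.spans.append({
            "conversation_id": self.conversation_id,
            "tool": name,
            "inputs": inputs,
            "output": output,
            "error": error,
            "ms": (time.perf_counter() - start) * 1000,
        })
```

When a user reports a weird answer, `trace.spans` is what you pull: which tool returned what, in what order, with what context, and how long each step took.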

Cost, Latency, Quality — Sponsor Tokens, Win Quality

Every agent decision comes back to a triangle: cost, latency, quality. You can't have all three. My bet, every single time, is quality.

              QUALITY
                /\
               /  \
              / ✓  \      ← always pick this
             /      \
            /________\
          COST     LATENCY

Here's the economic reality: a lot of agent companies right now are losing money per query. They're sponsoring tokens to win adoption, betting that gross margins will improve as intelligence gets cheaper. The math works out. Intelligence is collapsing 10× per year — the model that's expensive today is free in eighteen months. The token sponsorship gets paid back by the price curve.

But the adoption doesn't come back.

If you lose adoption because your agent was cheaper but worse, you will spend 10× more on sales and marketing trying to win those users back than you would have spent just serving them the best model from day one. Customer acquisition in agent products is front-loaded: professional users decide in the first few interactions whether your agent is trustworthy. If you shipped mediocre output to save $0.50 per query, you've burnt a customer you'll never get back.

The brighter side is this: people will pay for more intelligence. Professional investors, lawyers, doctors, engineers — they are not price-sensitive to the model tier. They are price-sensitive to wrongness. Give them the best model, charge accordingly, don't apologize.

You still have to be excellent at the operational side — KV cache hits, sensible architecture, token discipline, parallel tool calls. The LLM Context Tax covers the playbook. But don't confuse operational excellence with strategic positioning. Operational wins keep you alive. Quality wins the market.
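Of that operational list, parallel tool calls are the cheapest win: when the model requests several independent tools in one turn, await them together instead of one by one. A minimal asyncio sketch, with a fake tool standing in for real ones:

```python
import asyncio

async def call_tool(name: str, args: dict) -> str:
    """Stand-in for a real tool call (search, file read, API fetch)."""
    await asyncio.sleep(0.05)  # pretend network latency
    return f"{name}:ok"

async def run_turn(tool_calls: list[tuple[str, dict]]) -> list[str]:
    # Independent calls from one model turn run concurrently, so the
    # turn's latency is the max of the calls, not their sum.
    return list(await asyncio.gather(
        *(call_tool(name, args) for name, args in tool_calls)
    ))
```

Three 50 ms calls cost ~50 ms instead of ~150 ms; across a dozen-tool agent turn, that is the difference users feel.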

Cheap + fast + wrong is not a product. It's a money-losing demo.

Your Setup Is Replacing the OS

You cannot build at the edge of a technology you don't use.

My daily setup looks like this: tmux, five Claude Code terminals running in parallel, wired to my email, calendar, phone, SMS, WhatsApp, contacts, and files via CLIs.

┌──────────────── tmux ─────────────────────────────┐
│  CC #1           │  CC #2        │  CC #3         │
│  Fintool code    │  research     │  email / cal   │
├──────────────────┼───────────────┴────────────────┤
│  CC #4           │  CC #5                         │
│  writing         │  infra / personal automations  │
└───────────────────────────────────────────────────┘
        │                │                │
        ▼                ▼                ▼
   filesystem       web / APIs      CLIs (gog, wacli,
                                    imsg, peekaboo…)

That's my operating system now. The GUI is vestigial. I don't "open an app." I describe what I want and an agent does it, across my whole life, with the tools I've wired up for it.

This isn't a flex. It's the only way I know to stay calibrated on what agents can do. Every personal task I do with an agent teaches me something I can apply to Fintool. Every frustration with a tool that isn't agent-ready becomes an opportunity. My life is the live eval.

And here's the industry reality: the terminal and the agent are replacing the OS. The agent-with-tools is the primary interface for anyone who takes this technology seriously. The people who are still operating through point-and-click UIs are four tiers behind the frontier. They will not build good agents because they don't feel, in their hands, what an agent is supposed to be.

If your daily workflow is "write code in an IDE, paste errors into ChatGPT," you cannot build an agent. You are not a power user of the primitive. You have no taste.

And it's not generational. I've hired excellent agent builders in their 40s. I've rejected 23-year-olds who grew up with ChatGPT and still treat it like a search engine. It's not age. It's mindset — curiosity that borders on obsession. The engineers who get it try every new model the day it drops, run it against their private evals, live inside an agent terminal, and have strong opinions about which model is best for which task. The engineers who don't get it are waiting for a framework.

Hire For Taste, Not Credentials

After three years of hiring, here's the filter I trust:

Hire people who already can't put the tools down.

Not the best resume. Not the most credentialed. The ones whose GitHub has a top-tier agentic side project and whose personal setup is unhinged. Custom CLIs wired to everything they own. A memory system. A folder of prompts. A CLAUDE.md per repo. Five parallel agents in tmux. You can tell within thirty seconds of them sharing their screen whether they've been in the seat for thousands of hours.

The #1 positive signal I look for is a top agentic product on GitHub plus a crazy personal agent setup. Those two together are unfakeable. They can't be crammed for an interview. They're evidence of a person who's been obsessed with this technology for long enough to have developed taste.

A friend told me a line I keep coming back to: if a candidate lists LangChain as their orchestrator, they haven't run an agent in production. I think he's right. Frameworks that were best practice in 2023 are technical debt now. The engineers at the frontier use the raw API and write their own orchestration because they've learned the hard way that the abstractions hide exactly the things you need to tune. If you hear "LangChain" in a senior-hire interview in 2026, it's a red flag. The candidate is a paradigm behind.

Everything else — systems design, ML background, domain expertise — can be taught or paired around. The taste for agents cannot. It only comes from thousands of hours in the seat, and you can't fake it in an interview. The tell: ask them to debug a real agent trace in front of you. Watch their eyes. Do they scan it like a log they've read a thousand times, or do they freeze?

That five-second reaction is worth more than an hour of system design.

The Meta Lesson: Become the Model

If you remember one thing from this essay, let it be this:

Become the model.

Every other lesson is downstream. You can only write good prompts if you can simulate the model reading them. You can only hire well if you can tell, in seconds, whether another human has done the simulation. You can only delete scaffolding fearlessly if you know what the model can already do. You can only build evals that matter if you've felt the failure modes from the inside. You can only meet a new model like a new person if you have the reference frame of every model that came before.

The model is your coworker, your teammate, your function, your collaborator, your spec. Understanding it deeply — not benchmarking it, not abstracting it away with a framework, but being it — is the only skill in agent building that compounds. Everything else rots with the next release.

Scaffolding dies. Evals and people compound. Taste is the moat.

Become the model. Everything else follows.