For Founders, Operators, AI engineers · Advanced · Commercial · Solves: Agents breaking in production, No observability on AI pipelines, Demos that don't ship
Key takeaways
- AI gets you 60% of the way; the operational layer gets you to 100%.
- Spend 80% of your time on error handling, retries, and observability.
- The client pays for the output, not for the AI.
- The ratio of model-work to ops-work is what separates product from demo.
Everyone is shipping AI demos. Almost no one is shipping AI workflows that survive contact with real clients. The gap between the two is the entire job.
The 60/40 rule for agentic workflows
The pattern I use is straightforward: AI gets you 60% of the way on the repetitive work, then a human operator (me) makes sure the last 40% is correct. The output you ship is the output you stake your name on. Not the output the model gave you.
A real example: content automation engagement
Concretely, on a recent content automation engagement, the workflow was: an agent crawls the site, an agent drafts content updates, an agent runs an SEO audit. Three agents, about 20 hours of human work compressed into 4 to 6 hours. The client pays for the output, not for the AI. That is the framing that matters.
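The shape of that workflow can be sketched in a few lines. This is an illustration only, not the code from the engagement: `crawl_site`, `draft_update`, `seo_audit`, and the `Draft` type are hypothetical stand-ins for the three agents, and `review` represents the human operator who decides what ships.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Draft:
    url: str
    text: str
    audit_notes: list[str] = field(default_factory=list)
    approved: bool = False


# Hypothetical stubs: real agents would call a crawler and an LLM here.
def crawl_site(root_url: str) -> list[str]:
    return [f"{root_url}/pricing", f"{root_url}/about"]


def draft_update(url: str) -> Draft:
    return Draft(url=url, text=f"Proposed copy for {url}")


def seo_audit(draft: Draft) -> Draft:
    if len(draft.text) < 40:
        draft.audit_notes.append("copy is short; consider expanding")
    return draft


def run_pipeline(root_url: str, review: Callable[[Draft], bool]) -> list[Draft]:
    """Run the three agents in order, then gate every draft on human review."""
    shipped = []
    for url in crawl_site(root_url):
        draft = seo_audit(draft_update(url))
        draft.approved = review(draft)  # the human decision, not the model's
        if draft.approved:
            shipped.append(draft)
    return shipped
```

Nothing reaches the client unless `review` returns `True`, which is the point of the sketch: the agents produce candidates, and the operator owns the output.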
The operational layer you cannot skip
The parts that fail are always in the operational layer: error handling when an API changes, retries when an LLM returns garbage, observability so you know which step broke. Those are the parts you cannot skip if you want this to be a product, not a parlour trick.
Spend 20% of your time on the model and 80% on the operational layer around it. That ratio is what separates shipped product from demo.
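As a minimal sketch of the retry piece, assuming a synchronous call and using only the standard library: `call_with_retries`, `is_valid`, and the default delays below are illustrative names and values, not taken from any specific framework. It retries both on exceptions and on output that fails validation, with exponential backoff and full jitter.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    is_valid: Callable[[T], bool] = lambda _: True,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it returns valid output or attempts run out."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            result = fn()
            if is_valid(result):
                return result
            # The call succeeded but the output is unusable: treat as a failure.
            last_error = ValueError(f"invalid output on attempt {attempt + 1}")
        except Exception as exc:  # in real code, catch only retryable errors
            last_error = exc
        if attempt < max_attempts - 1:
            # Exponential backoff capped at max_delay, with full jitter.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

A model call would be wrapped as `call_with_retries(lambda: client_call(prompt), is_valid=looks_like_json)`, where `client_call` and `looks_like_json` are whatever your stack provides. Passing `sleep` as a parameter keeps the helper testable without real waiting.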
Implementation table
| Fix | Problem | What to change | Metric | Tool |
|---|---|---|---|---|
| Add retries with exponential backoff | LLM API calls fail intermittently and break the chain | Wrap every model call in a retry with jittered backoff | Workflow success rate | A job queue or workflow tool (e.g. BullMQ, Inngest, Make) |
| Log every step with input and output | When a workflow breaks, you cannot tell which step failed | Add structured logging at every agent boundary | Mean time to recovery | LangSmith, Datadog, or simple structured logs |
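For the logging row, "simple structured logs" can be as small as the sketch below. It assumes each agent is a plain function and uses only Python's standard `logging` and `json` modules; `run_step`, the `workflow` logger name, and the record fields are illustrative choices, not a fixed schema.

```python
import json
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("workflow")


def run_step(run_id: str, name: str, fn: Callable[[Any], Any], payload: Any) -> Any:
    """Run one agent step and emit a single JSON log line with input and output."""
    started = time.monotonic()
    record = {"run_id": run_id, "step": name, "input": payload}
    level = logging.INFO
    try:
        output = fn(payload)
    except Exception as exc:
        level = logging.ERROR
        record.update(status="error", error=repr(exc))
        raise  # the failure still propagates; the log line is written in finally
    else:
        record.update(status="ok", output=output)
        return output
    finally:
        record["duration_ms"] = round((time.monotonic() - started) * 1000, 1)
        # default=str keeps the log call from failing on non-JSON-serializable values
        logger.log(level, json.dumps(record, default=str))
```

With one line per step keyed by `run_id`, finding the step that broke becomes a filter on `status == "error"` rather than a guess. In production you would likely truncate or redact large inputs before logging them.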








