What Most Tutorials Get Wrong
Most "AI agent" tutorials show you how to make GPT-4o call a function. That's not an agent. That's a function call with extra steps.
A production agent needs to handle things no tutorial shows you: API rate limits at 3am, context windows that fill up after 200 messages, tool calls that return garbage, models that hallucinate tool parameters, and cost overruns that can blow your API budget in an hour if something loops.
We've built and deployed dozens of production agents across lead qualification, customer support, research automation, and content generation. This is the guide we wish existed when we started.
The gap between "demo agent" and "production agent" is 80% of the work. Plan for it from day one.
The Core Components of a Production Agent
Before writing a line of code, understand what you're actually building. A production agent has five layers:
- LLM backbone — the model making decisions
- Tool library — functions the agent can call
- Memory system — how context is stored and retrieved
- Orchestration layer — the loop that drives the agent
- Monitoring layer — logging, alerting, cost tracking
Most projects build the first two and ignore the last three. That's how you end up with an agent that works in demos but falls apart in production.
Step 1: LLM Selection — Not GPT-4o by Default
The reflex answer to "which model?" is GPT-4o. It's often wrong. Model selection should be driven by the task profile, not brand recognition.
For complex reasoning and multi-step planning, Claude 3.7 Sonnet or GPT-4o are the right choices. They handle ambiguity well and produce reliable structured output. For high-volume tool calling with strict JSON schemas, GPT-4o-mini or Gemini 2.5 Flash are significantly cheaper with nearly identical accuracy. For classification tasks at scale, Gemini 2.5 Flash or Qwen Plus can cost 10x less per token with no quality loss.
Our production agents use a tiered approach: expensive model for planning and decision-making, cheap model for execution steps, and a fallback chain when any model fails. This alone typically cuts API costs by 40–60% compared to using GPT-4o throughout.
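A tiered router can be sketched in a few lines. Everything here is illustrative: the model names come from the text above, and `call_model` is a stub standing in for your real API clients.

```python
# Tiered model routing sketch: expensive models for planning, cheap ones
# for execution, with a fallback chain per tier. Model names and the
# call_model stub are assumptions -- wire in your actual provider SDKs.

MODEL_TIERS = {
    "planning": ["gpt-4o", "claude-3-7-sonnet"],          # expensive, capable
    "execution": ["gemini-2.5-flash", "gpt-4o-mini"],     # cheap, fast
    "classification": ["gemini-2.5-flash", "qwen-plus"],  # cheapest tier
}

def call_model(model: str, prompt: str) -> str:
    """Stub for a real API call; in production this raises on failure."""
    return f"[{model}] response to: {prompt}"

def route(task_type: str, prompt: str) -> str:
    """Try each model in the tier's fallback chain until one succeeds."""
    chain = MODEL_TIERS.get(task_type, MODEL_TIERS["execution"])
    last_err = None
    for model in chain:
        try:
            return call_model(model, prompt)
        except Exception as err:  # real code: catch provider-specific errors
            last_err = err
    raise RuntimeError(f"all models in chain failed: {last_err}")
```

The point of the structure is that the fallback chain lives in data, not in scattered try/except blocks, so adding or reordering models is a one-line change.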
Step 2: Tool Design — Atomic and Composable
Tools should be atomic. Each tool does one thing, does it reliably, and returns structured output. The agent composes them — not the tools.
A common mistake: search_and_summarize(query). This couples two concerns. If the search fails, the summary never runs. If you want to summarize a different source, you can't reuse the summary logic. Better: search(query) returns results, summarize(text) takes any text. The agent decides how to combine them.
Every tool needs explicit error states. A tool that returns None on failure is a bomb waiting to go off. Return a structured error object with a code and message. Let the agent decide whether to retry, skip, or escalate.
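A minimal sketch of the structured-error pattern, assuming a hypothetical `search` tool (the `ToolResult` shape and error codes are ours, not a library API):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ToolResult:
    """Every tool returns this shape -- never None, never a raw exception."""
    ok: bool
    data: Any = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None

def search(query: str) -> ToolResult:
    """Atomic search tool: one job, structured output, explicit error states."""
    if not query.strip():
        return ToolResult(ok=False, error_code="EMPTY_QUERY",
                          error_message="query must be non-empty")
    # A real implementation would call a search API here.
    return ToolResult(ok=True, data=[{"title": "Example result",
                                      "url": "https://example.com"}])
```

Because every tool returns the same envelope, the orchestration loop can inspect `ok` and `error_code` uniformly and decide to retry, skip, or escalate without tool-specific handling.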
Step 3: Memory and Context Management
Context windows fill up. Most agents ignore this until they hit a 128k token limit mid-conversation and crash. Plan your memory architecture before you hit production.
Three layers of memory matter in practice. Working memory is the current context window — what the model can see right now. Episodic memory is the conversation history: you keep the last N turns and truncate older ones intelligently, preserving key facts. Semantic memory is vector-embedded long-term storage — useful when an agent needs to recall something from weeks ago without stuffing every historical message into context.
For most use cases, a well-managed episodic memory is sufficient. Summarize old turns rather than deleting them. Keep a "facts extracted so far" section that persists across context resets. This alone handles 90% of production memory challenges.
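The episodic layer can be sketched as a class that keeps the last N turns verbatim, folds older turns into a running summary, and carries a persistent facts store. The `summarize` method here is a naive placeholder for what would be a cheap-model summarization call.

```python
class EpisodicMemory:
    """Keep the last N turns verbatim; fold older turns into a summary.
    A sketch -- summarize() stands in for a cheap-model LLM call."""

    def __init__(self, max_turns: int = 6):
        self.max_turns = max_turns
        self.turns: list[str] = []
        self.summary = ""
        self.facts: dict[str, str] = {}  # persists across context resets

    def summarize(self, old_turns: list[str]) -> str:
        # Placeholder: in production, send summary + old turns to a cheap model.
        parts = [self.summary] if self.summary else []
        return " | ".join(parts + old_turns)

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.max_turns:
            overflow = self.turns[: -self.max_turns]
            self.turns = self.turns[-self.max_turns:]
            self.summary = self.summarize(overflow)

    def context(self) -> str:
        """Assemble what the model sees: facts, summary, recent turns."""
        facts = "; ".join(f"{k}={v}" for k, v in self.facts.items())
        return f"FACTS: {facts}\nSUMMARY: {self.summary}\nRECENT: {self.turns}"
```

The facts dict is the key detail: it survives truncation and summarization, so hard-won information ("customer's account ID is X") never disappears mid-session.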
Step 4: The Orchestration Layer
Don't use LangChain or LlamaIndex for production orchestration. They add abstraction debt, make debugging harder, and change APIs fast enough to break your production system every two months. Build a simple, transparent router.
The loop looks like this: parse the current state and intent, select the appropriate tool or decide the task is done, execute with timeout and retry logic, parse the output into a structured format, and decide whether to loop or return a final response. That's it. You can implement this in 100 lines of Python with no framework.
The agent should have a maximum iteration count. An agent that can loop infinitely will loop infinitely when something goes wrong. Hard limit: 10 iterations for most tasks, 25 for complex research tasks. Log every iteration.
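The whole loop, including the hard iteration limit, fits in a short function. This is a sketch: `plan_step` is assumed to wrap your LLM call and return either a tool invocation or a final answer, and the timeout/retry logic is elided to a comment.

```python
MAX_ITERATIONS = 10

def run_agent(task: str, plan_step, tools: dict, log=print):
    """Minimal orchestration loop, no framework.
    plan_step(state) -> ("call", tool_name, args) or ("done", answer).
    All names here are illustrative; plan_step wraps your LLM call."""
    state = {"task": task, "history": []}
    for i in range(MAX_ITERATIONS):
        action = plan_step(state)
        log(f"iteration {i}: {action[0]}")      # log every iteration
        if action[0] == "done":
            return action[1]
        _, tool_name, args = action
        result = tools[tool_name](**args)        # real code: timeout + retries
        state["history"].append((tool_name, args, result))
    raise RuntimeError("max iterations reached without a final answer")
```

With the loop this explicit, debugging a misbehaving agent means reading a dozen lines and a log file, not stepping through framework internals.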
Step 5: Error Handling and Recovery
Every tool call can fail. Every model call can fail. Plan for both.
For API errors: retry with exponential backoff on 429 (rate limit) and 5xx (server errors). Three retries maximum. After the third failure, log the error with full context and return a graceful degradation response — never a raw stack trace.
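A sketch of that retry policy, assuming the wrapped call raises an exception carrying a `.status` attribute (adapt the error handling to whatever your HTTP client actually raises):

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Retry fn() on retryable HTTP statuses with exponential backoff + jitter.
    Assumes fn raises an exception with a .status attribute on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as err:
            status = getattr(err, "status", None)
            if status not in RETRYABLE or attempt == max_retries:
                raise  # non-retryable, or out of retries: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.25)
            time.sleep(delay)
```

The jitter term matters in production: without it, many agents rate-limited at the same moment all retry at the same moment and get rate-limited again.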
Circuit breakers prevent cascade failures. If a tool fails three times in a row within a single session, mark it as unavailable for that session and route around it. This stops a single API outage from taking down your entire agent.
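The per-session breaker is a small amount of state. This is one possible shape, not a library API:

```python
class CircuitBreaker:
    """Per-session breaker: after `threshold` consecutive failures,
    mark a tool unavailable for the rest of the session."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures: dict[str, int] = {}  # consecutive failures per tool
        self.open: set[str] = set()         # tools tripped for this session

    def available(self, tool: str) -> bool:
        return tool not in self.open

    def record(self, tool: str, success: bool) -> None:
        if success:
            self.failures[tool] = 0  # any success resets the streak
            return
        self.failures[tool] = self.failures.get(tool, 0) + 1
        if self.failures[tool] >= self.threshold:
            self.open.add(tool)
```

The orchestration loop checks `available()` before each call and, when a tool is tripped, tells the model that the tool is unavailable so it can route around it instead of retrying into a dead API.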
Validate all model outputs before using them as tool inputs. JSON schema validation, length checks, and type checks catch 90% of hallucination-induced failures before they cause downstream damage.
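A stdlib-only sketch of that validation step (a library such as jsonschema gives you richer checks; the function name and the required-keys shape here are ours):

```python
import json

def validate_tool_args(raw: str, required: dict) -> dict:
    """Parse a model's JSON output and check required keys and types
    before it ever reaches a tool. `required` maps key -> expected type."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"model output is not valid JSON: {err}")
    if not isinstance(args, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in required.items():
        if key not in args:
            raise ValueError(f"missing required key: {key}")
        if not isinstance(args[key], expected_type):
            raise ValueError(f"{key} must be {expected_type.__name__}")
    return args
```

When validation fails, feed the error message back to the model and ask it to regenerate; a malformed tool call is usually fixed in one retry.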
Step 6: Deployment and Monitoring
Log everything. Every input, every tool call, every output, every token count, every latency measurement, every error. You will need this data when something breaks at 3am — and it will break at 3am.
The minimum viable monitoring stack: structured JSON logs to a file or logging service, a real-time cost tracker per agent session, and an alert when error rate exceeds 5% in a 10-minute window. Telegram works well for alerts — a simple HTTP POST to a bot webhook.
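The structured-log and cost-tracking pieces can be sketched together. The prices here are placeholders, not current rates; check your provider's pricing page.

```python
import json
import time

# Illustrative $/1M input tokens -- NOT current prices, verify before use.
PRICES = {"gpt-4o": 5.00, "gemini-2.5-flash": 0.15}

class SessionTracker:
    """Accumulate per-session cost and emit structured JSON log lines."""

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.cost_usd = 0.0

    def log_call(self, model: str, tokens: int, latency_s: float) -> str:
        self.cost_usd += PRICES.get(model, 0.0) * tokens / 1_000_000
        return json.dumps({
            "ts": time.time(), "session": self.session_id, "model": model,
            "tokens": tokens, "latency_s": latency_s,
            "cost_usd": round(self.cost_usd, 6),
        })

# Alerting via Telegram is one HTTP POST to the Bot API, e.g. with requests:
#   requests.post(f"https://api.telegram.org/bot{TOKEN}/sendMessage",
#                 json={"chat_id": CHAT_ID, "text": "error rate > 5%"})
```

Structured JSON lines mean your 3am debugging session is a `jq` query over a log file rather than grepping free-form text.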
Deploy in Docker with a process manager that restarts on crash. Use environment variables for all API keys — never hardcode them. Set per-agent token budgets and hard-stop any session that exceeds them.
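The per-session token budget is the simplest of these safeguards; one possible sketch:

```python
class TokenBudget:
    """Hard-stop any session that exceeds its token budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def spend(self, tokens: int) -> None:
        """Call after every model response; raises once the budget is blown."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise RuntimeError(
                f"token budget exceeded: {self.used}/{self.max_tokens}")
```

Raising an exception here is deliberate: a blown budget should abort the session loudly, not silently degrade, because the usual cause is an agent stuck in a loop.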
The Stack We Actually Use in Production
- Orchestration: Plain Python, no frameworks. ~200 lines of reusable agent core.
- LLMs: GPT-4o for planning, Gemini 2.5 Flash for execution, Claude as fallback for complex reasoning.
- Memory: Redis for working/episodic memory, pgvector for semantic search on large knowledge bases.
- Tool execution: Python functions, each with timeout, retry logic, and structured error returns.
- Monitoring: Python logging to JSON files, custom cost tracker, Telegram bot for alerts.
- Deployment: Docker + GitHub Actions CI/CD, process restart on crash.
This stack handles everything from 50 agent calls per day to 50,000. It's boring, transparent, and it works. That's the point.
The best production stack is the one you understand completely. Boring infrastructure beats clever frameworks every time.

