Building AI Agents from Scratch: What Really Makes Agents Work
Learn how to build AI agents from scratch by defining goals, designing tools, and orchestrating LLM decisions to solve real problems.
Quick Summary: Building AI agents from scratch involves defining the agent's purpose, designing its architecture with core components (LLM, tools, memory, planning logic), implementing function calling patterns, and testing iteratively. Modern approaches use Python with OpenAI or Anthropic APIs, structured tool definitions, and orchestration logic rather than heavyweight frameworks, giving developers full control over agent behavior and capabilities.
The hype around AI agents has reached a fever pitch. Every tech company seems to be announcing their agent strategy, and developers are scrambling to understand how these systems actually work under the hood.
But here's the thing: agents aren't complicated magic boxes. Strip away the buzzwords and you'll find something surprisingly straightforward—a loop that calls an LLM, interprets what it wants to do, executes those actions, and feeds results back.
Building agents from scratch teaches you the fundamentals that frameworks hide. And once you understand those fundamentals, you can build agents that actually solve real problems instead of just impressing people in demos.
An agent is fundamentally different from a standard chatbot or LLM wrapper. The distinction matters.
A chatbot receives input, generates text, and stops. An agent receives a goal, makes decisions about how to achieve it, uses tools to gather information or take actions, and iterates until the goal is met or it determines completion isn't possible.
According to research from Google DeepMind (Kenton et al., published August 18, 2022), agency emerges when a system can be modeled as having preferences over outcomes and taking actions that systematically increase the probability of preferred outcomes. That's the formal definition, but practically speaking: agents decide, act, observe results, and adjust.
The core components that enable this behavior include a language model for reasoning, tools the agent can invoke, memory for maintaining context, and orchestration logic that ties decisions to actions.
Frameworks like LangChain and AutoGen package these components, but building from scratch reveals how they actually interact. And that knowledge becomes invaluable when debugging production agents or optimizing for specific use cases.
Before writing code, define what the agent needs to accomplish. Vague goals produce vague agents.
Start with the target user and deliverable. Who will interact with this agent, and what specific output should they receive? A sales assistant that qualifies leads produces structured prospect data and draft emails. A research agent generates annotated summaries with source citations. A code review agent outputs formatted feedback with severity levels.
Specificity at this stage prevents scope creep. Many agent projects fail because they try to be too general-purpose from day one.
Next, identify the tools the agent actually needs. Tools are functions the agent can invoke—API calls, database queries, calculators, web scrapers, anything executable. List the minimum set required for the core task. Adding tools later is easier than debugging an overcomplicated initial system.
According to the LangChain team's guidance (published July 10, 2025), successful agent projects define realistic task examples during planning. Write 3-5 specific scenarios the agent should handle, including edge cases. This forces clarity about required capabilities.
Separating concerns matters for maintainability. The model layer handles LLM interactions—prompting, parsing responses, managing tokens. The application layer handles business logic—tool execution, data validation, result formatting.
This separation means you can swap models (GPT-4 to Claude, for example) without rewriting application logic. It also means you can test tool functions independently from LLM behavior.
Think of the model layer as the brain and the application layer as the hands. The brain decides what to do, the hands do it. Keep them loosely coupled.
Understanding each component individually makes implementation clearer. Let's break down the essential building blocks.
The language model serves as the reasoning core. It receives the current state (conversation history, available tools, task objective) and outputs decisions about what to do next.
Modern agents typically use function calling capabilities built into models like GPT-4, GPT-4o, Claude 3.5, or similar. Function calling lets you describe tools in a structured format, and the model returns structured JSON indicating which tool to use with what parameters.
This is fundamentally different from asking the model to write code. Function calling produces parseable output that your orchestration layer can execute reliably. The model says "call the search_knowledge_base function with query='Q2 revenue data'" rather than generating Python code that might have syntax errors.
Prompt engineering becomes critical here. The system prompt should define the agent's role, behavioral guidelines, when to use tools versus answering directly, and how to handle ambiguity. Testing different instruction formats often reveals significant performance differences.
Tools need formal schemas that describe their purpose, parameters, and return types. Using libraries like Pydantic in Python enforces structure and prevents runtime errors.
A tool definition includes the function name, description (which the LLM reads to decide when to use it), parameter specifications with types and descriptions, and the actual executable code.
Here's what makes tool design tricky: the LLM only sees the description and parameters, not the implementation. So descriptions must be clear enough that the model knows exactly when the tool applies and what parameters are required. Vague descriptions produce hallucinated parameters or inappropriate tool selection.
Tool execution happens in the application layer. The orchestrator receives the LLM's function call request, validates parameters, executes the function, captures the result, and formats it for the next LLM call. Error handling at this stage prevents agent crashes when APIs fail or return unexpected data.
Agents need memory to maintain coherent multi-turn interactions. At minimum, this means conversation history—the full thread of user messages, agent responses, and tool results.
But context windows have limits. GPT-4 Turbo and GPT-4o support up to 128k tokens, but that fills quickly with lengthy tool outputs. Summarization strategies help—keeping recent messages in full detail while summarizing older exchanges.
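A minimal sketch of that trimming strategy, with the summarizer left injectable (in production it would be an LLM call; all names here are illustrative):

```python
def trim_history(messages, max_recent=6, summarize=None):
    """Keep the system prompt and the most recent messages in full,
    compressing older exchanges into a single summary message."""
    if len(messages) <= max_recent + 1:
        return messages
    system, rest = messages[0], messages[1:]
    old, recent = rest[:-max_recent], rest[-max_recent:]
    # In production the summarizer would be an LLM call; here it is injectable.
    summary_text = summarize(old) if summarize else f"[{len(old)} earlier messages omitted]"
    summary = {"role": "system",
               "content": f"Summary of earlier conversation: {summary_text}"}
    return [system, summary] + recent
```

Calling this before each LLM request keeps token usage bounded while preserving the system prompt and recent detail.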
More sophisticated agents use vector databases for retrieval-augmented generation. Relevant past interactions or knowledge base articles get embedded, then similar content gets retrieved when needed. This extends effective memory far beyond the context window.
According to documentation from Hugging Face on building agents, memory patterns significantly impact agent reliability. Agents with no memory repeat questions. Agents with unstructured memory hallucinate past events. Proper memory architecture stores structured data with timestamps and relevance scores.
The orchestration loop ties everything together. The standard pattern follows the ReAct framework: Reasoning, Acting, Observing, repeat.
Each iteration, the agent reasons about the current situation, decides on an action (including "task complete" as an action), executes that action, observes the result, and loops. The loop continues until the agent determines the task is finished or a maximum iteration limit prevents infinite loops.
Implementation typically looks like:
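A minimal sketch of the reason-act-observe loop, with the LLM call and tools passed in as plain callables (the response shape is a simplified stand-in for a real API client, not OpenAI's actual response object):

```python
def run_agent(call_llm, tools, user_message, max_iterations=10):
    """Minimal ReAct-style loop: reason (LLM call), act (tool execution),
    observe (append result), repeat until a final answer or the limit."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):
        response = call_llm(messages)  # returns a dict in a simplified format
        if response.get("tool_call"):
            name = response["tool_call"]["name"]
            args = response["tool_call"]["arguments"]
            result = tools[name](**args)  # act
            messages.append({"role": "assistant", "content": None,
                             "tool_call": response["tool_call"]})
            messages.append({"role": "tool", "name": name,
                             "content": str(result)})  # observe
        else:
            return response["content"]  # task complete
    return "Stopped: iteration limit reached."
```

The iteration cap doubles as the infinite-loop guard described later; everything else is bookkeeping around the message list.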
This loop is where frameworks add the most value—they handle edge cases like malformed responses, timeout management, and retry logic. But implementing it yourself for a simple agent requires maybe 50 lines of Python.
Theory is nice. Code is better. Let's build a minimal but functional agent using Python and the OpenAI API.
The goal: create a chatbot that can use a calculator tool and a knowledge base lookup tool. No frameworks, just the API and standard libraries.
Tools need three things: a schema the LLM can read, validation logic, and executable code. Start with simple examples and expand.
A calculator tool might handle basic arithmetic. The schema describes a function that takes a mathematical expression as a string and returns the result. The implementation uses Python's eval (with safety constraints) or a dedicated math parser.
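One way to constrain evaluation without raw eval is to walk the expression's AST and only evaluate whitelisted arithmetic nodes. A sketch:

```python
import ast
import operator

# Whitelist of arithmetic operators the tool will evaluate.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def calculate(expression: str) -> float:
    """Evaluate basic arithmetic safely by walking the AST,
    rejecting anything that is not a number or a whitelisted operator."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression element")
    return _eval(ast.parse(expression, mode="eval").body)
```

Anything outside the whitelist (function calls, attribute access, names) raises an error instead of executing, which is exactly the failure mode you want from a tool that receives LLM-generated input.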
A knowledge base tool searches a predefined set of documents. The schema describes a function that takes a query string and optional parameters like max_results. The implementation might do simple string matching initially, then upgrade to vector similarity later.
OpenAI's function calling format uses JSON schema. Each tool gets a name, description, and parameters object with type definitions. The description is what the model uses to decide when to invoke the tool, so clarity matters more than brevity.
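For the calculator, a tool definition in OpenAI's format might look like this (the description wording is illustrative, and it is the part the model actually reads):

```python
calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculate",
        "description": ("Evaluate a basic arithmetic expression. Use this for any "
                        "math the user asks for; do not compute numbers yourself."),
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "A pure arithmetic expression, e.g. '(17 * 4) + 2'",
                }
            },
            "required": ["expression"],
        },
    },
}
```

A list of such dicts is passed as the `tools` argument on each chat completion request; the schema never contains the implementation, only the contract.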
The orchestrator manages conversation flow. It maintains the message history, formats prompts, calls the LLM, parses responses, executes tools, and handles completion.
Start by initializing an empty message list. When the user sends a message, append it to the list and call the OpenAI API with the full history plus tool definitions. The API returns either a text response or a function call request.
If the response is text, return it to the user. If it's a function call, extract the function name and arguments, validate them, execute the corresponding tool, capture the result, and append both the function call and result to the message history as the API expects. Then loop—call the API again with the updated history so the model can use the tool result to continue reasoning.
Iteration limits prevent infinite loops. After 5-10 iterations, force the agent to either provide a final answer or admit it can't complete the task. Real talk: agents can get stuck in reasoning loops, especially with ambiguous tasks or unreliable tools.
Tools fail. APIs timeout. Databases return empty results. The orchestrator needs error handling that degrades gracefully.
When a tool execution fails, don't crash the agent. Instead, return an error message to the LLM formatted as a tool result. The model can then reason about the failure and try an alternative approach or ask the user for clarification.
Parameter validation happens before execution. If the LLM requests a function with malformed arguments (wrong types, missing required fields), catch it early and return a structured error. This prevents exceptions deep in tool code from breaking the agent loop.
Logging every tool call, result, and LLM decision creates an audit trail for debugging. When agents produce unexpected outputs, the logs reveal which tool returned bad data or which decision point went wrong.
Basic agents work for demos. Production agents need robustness, observability, and performance optimization.
Complex tasks benefit from specialized agents working together. One agent handles research, another handles writing, a third handles validation. An orchestrator coordinates between them.
Research published on arXiv (submitted August 5, 2025) introduced Agent Lightning, demonstrating that hierarchical agent systems can automate even reinforcement learning tasks previously requiring manual configuration. The key insight: specialized agents with narrow expertise outperform general-purpose agents on complex workflows.
Implementation varies. Some systems use explicit coordination logic—Agent A finishes, triggers Agent B with specific inputs. Others use message-passing architectures where agents communicate through shared channels. The orchestrator monitors progress and handles failures.
Multi-agent systems increase complexity significantly. Start with a single agent, identify bottlenecks or capability gaps, then add specialized agents only when the benefit justifies the added coordination overhead.
Unstructured text responses cause integration headaches. Structured outputs using JSON schemas or tools like Pydantic ensure consistency and enable downstream automation.
Define the exact format the agent should produce—not just "a summary" but a JSON object with specific fields: title, key_points array, sources array with URLs, confidence score. Pass this schema to the model and validate outputs against it.
OpenAI and Anthropic both support structured output modes that constrain the model to produce valid JSON matching a provided schema. This eliminates parsing errors and makes agent outputs directly usable by other systems.
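Even without a library like Pydantic or a provider's structured-output mode, a minimal stdlib validator captures the idea (the expected fields here are illustrative):

```python
import json

# Required fields and their expected Python types (illustrative schema).
EXPECTED = {"title": str, "key_points": list, "confidence": float}

def parse_structured(raw: str):
    """Parse model output as JSON and check required fields and types.
    Returns (data, None) on success or (None, error_message) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Invalid JSON: {exc}"
    for field, ftype in EXPECTED.items():
        if field not in data:
            return None, f"Missing field: {field}"
        if not isinstance(data[field], ftype):
            return None, f"Field '{field}' should be {ftype.__name__}"
    return data, None
```

On failure, the error message can be fed back to the model as a correction prompt rather than surfacing a parse exception to the user.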
For agents that feed into APIs or databases, structured output isn't optional. It's the difference between a demo that impresses in meetings and a system that runs in production handling real workloads.
Context windows limit what agents can remember. For tasks requiring access to large knowledge bases or long conversation histories, retrieval-augmented generation extends effective memory.
The pattern: embed documents or past interactions into a vector database (Pinecone, Weaviate, Chroma). When the agent needs information, semantically search the database for relevant chunks, retrieve them, and inject them into the prompt. The model answers using retrieved context rather than relying solely on training data or conversation history.
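To make the retrieval step concrete without a real embedding model or vector database, here is a toy sketch that substitutes bag-of-words vectors for learned embeddings (a production system would call an embedding API and query Pinecone, Weaviate, or Chroma instead):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'. A real system would call an
    embedding model; this just makes cosine similarity demonstrable."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    scored = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:top_k]
```

The shape of the system is identical once real embeddings replace the word counts: embed, score by similarity, inject the top results into the prompt.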
According to LangChain's RAG agent documentation, combining agents with retrieval creates systems that can answer questions grounded in specific documents while still using tools for real-time data or actions. The agent decides when to retrieve versus when to call other tools.
Implementation adds a retrieval tool to the agent's toolkit. The tool description explains it searches the knowledge base for information relevant to a query. The LLM calls this tool when it needs domain-specific facts or past context beyond the conversation window.
Modern agents increasingly handle more than text. Image analysis, document processing, and voice interactions expand use cases significantly.
GPT-4 Vision enables agents to process images. An agent analyzing receipts can extract line items visually rather than requiring pre-parsed text. An agent reviewing UI mockups can identify accessibility issues by examining screenshots.
Audio processing with tools like Whisper allows voice-based agent interactions. The user speaks a request, Whisper transcribes it, the agent processes it normally, and text-to-speech converts the response. The agent code stays largely the same—multimodality happens at the interface layer.
Document understanding combines vision and text processing. PDF parsing, table extraction, and layout analysis let agents work with real-world documents without manual preprocessing. Tools like PyMuPDF or Unstructured handle format conversion so the agent receives clean text or structured data.
Agents are nondeterministic by nature. The same input produces different outputs across runs. Testing requires different strategies than traditional software.
Define success criteria before building. What percentage of test cases should the agent handle correctly? What's acceptable latency? How often can it invoke tools unnecessarily?
Create a test suite with representative examples spanning common cases and edge cases. Run the agent against each example multiple times (LLM outputs vary) and measure success rates. Track metrics like task completion percentage, average tool calls per task, and incorrect tool invocations.
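A minimal harness for that kind of repeated-run evaluation might look like this (the agent and the check functions are stand-ins for your real system):

```python
def evaluate(agent, test_cases, runs=3):
    """Run each test case multiple times (LLM output varies) and report
    the overall success rate. Each case is (input, check_fn)."""
    passed = total = 0
    for prompt, check in test_cases:
        for _ in range(runs):
            total += 1
            try:
                if check(agent(prompt)):
                    passed += 1
            except Exception:
                pass  # an exception counts as a failure
    return passed / total
```

Because checks are plain functions, they can assert on structure ("did it cite a source?") rather than exact strings, which is what nondeterministic outputs require.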
LangSmith (mentioned in LangChain documentation from 2025) provides tracing for agent runs. Every LLM call, tool execution, and decision gets logged with timing and token counts. Analyzing traces reveals where agents get stuck or make poor decisions.
Human evaluation still matters. Automated metrics miss nuances like response tone, clarity, or helpfulness. Regular human review of agent outputs catches issues that metrics don't detect.
Agents fail in predictable ways. Knowing the patterns helps debugging.
Hallucinated tool calls: the agent invents functions that don't exist or uses wrong parameter names. Fix this with clearer tool descriptions and examples in the system prompt.
Infinite loops: the agent repeatedly tries the same failing tool without recognizing futility. Add iteration limits and teach the agent to recognize when to give up or ask for help.
Premature completion: the agent returns an answer before gathering necessary information. Improve the system prompt to emphasize thoroughness or add validation that checks for required information before allowing completion.
Tool misuse: calling the right tool with wrong parameters or calling tools in illogical sequences. Better parameter descriptions and few-shot examples demonstrating correct usage help significantly.
Getting an agent to production requires infrastructure, monitoring, and safety controls that demos don't need.
Agents need interfaces. Web UIs for human interaction, REST APIs for system integration, or webhook endpoints for event-driven triggers all work depending on the use case.
For web interfaces, frameworks like Streamlit or Gradio enable rapid prototyping. Production systems typically use React or similar for frontend with FastAPI or Flask for backend. The agent runs as a service that the API calls with user input and returns responses.
Streaming responses improve user experience for slow agents. Rather than waiting 30 seconds for a complete answer, stream partial results as the agent progresses. OpenAI's API supports streaming, and implementing it requires handling chunks and updating the UI incrementally.
Authentication and rate limiting prevent abuse. Agents consume API tokens and computational resources. Without controls, a malicious user could drain credits or overload the system. Standard API security patterns apply—keys, quotas, throttling.
You can't fix what you can't see. Agent monitoring tracks performance, identifies failures, and reveals usage patterns.
Essential metrics include task success rate, average completion time, token consumption per task, tool error rates, and user satisfaction ratings. Dashboard these metrics to spot degradation quickly.
Distributed tracing shows the full execution path of each agent run. When a task fails, traces reveal which tool returned bad data or where the reasoning went wrong. Tools like LangSmith or custom logging solutions capture this data.
Alert on anomalies. If success rates drop 20% or average latency doubles, something broke. Automated alerts enable rapid response before users complain.
Agents can do things. That means they can do wrong things. Safety mechanisms prevent harm.
Tool restrictions limit what agents can access. Don't give an agent database deletion capabilities unless absolutely necessary. Principle of least privilege applies—grant only tools required for the specific task.
Human-in-the-loop patterns add approval steps for high-risk actions. The agent generates the plan, a human reviews it, and only approved actions execute. This works for agents handling financial transactions, data modifications, or external communications.
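A sketch of such an approval gate, with the reviewer modeled as a callback and the high-risk tool names purely illustrative:

```python
HIGH_RISK = {"send_email", "delete_record", "issue_refund"}  # illustrative names

def gated_execute(tools, name, args, approve):
    """Run low-risk tools directly; route high-risk calls through an
    approval callback (a human review step in production)."""
    if name in HIGH_RISK and not approve(name, args):
        return {"ok": False, "error": f"Action '{name}' rejected by reviewer"}
    return {"ok": True, "result": tools[name](**args)}
```

In a real deployment the callback would pause the agent, surface the proposed action in a review UI, and resume only on explicit approval.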
Content filtering prevents agents from generating harmful outputs. Both input filtering (block malicious prompts) and output filtering (detect and block harmful responses) create layers of protection. OpenAI's moderation API and similar tools automate this.
According to NIST's AI Risk Management Framework (published on January 26, 2023), trustworthy AI systems require ongoing monitoring, clear documentation of capabilities and limitations, and mechanisms for human oversight. These principles apply directly to agent deployment.
Building from scratch teaches fundamentals. Frameworks provide production-ready infrastructure. When should you use each?
Learning projects benefit from scratch implementations. Understanding how the pieces fit together makes debugging framework-based agents easier later.
Highly specialized use cases sometimes outgrow frameworks. If your agent needs unique orchestration logic or custom tool execution patterns, building from scratch provides complete control.
Performance-critical applications might avoid framework overhead. Direct API calls eliminate abstraction layers and reduce latency. For agents where milliseconds matter, custom implementation can optimize the hot path.
Small, focused agents don't justify framework complexity. A single-purpose tool-calling agent might be 100 lines of Python. Adding a framework turns it into 10 files and configuration overhead.
Production systems at scale benefit enormously from frameworks. LangChain, LangGraph, CrewAI, and similar tools provide battle-tested solutions for common problems—retry logic, error handling, observability, multi-agent coordination.
Frameworks handle edge cases you haven't thought of yet. Rate limiting, token counting, malformed response handling, concurrent tool execution—these problems have known solutions that frameworks implement correctly.
Team development favors frameworks. Shared patterns and abstractions make codebases more maintainable when multiple developers contribute. Documentation and community support accelerate onboarding.
According to documentation from Hugging Face on building good agents, frameworks shine when projects need features like conversation memory management, document loading pipelines, or pre-built integrations with vector databases and APIs. Building these from scratch consumes time better spent on application-specific logic.
Abstract explanations only go so far. Concrete examples show how these patterns apply to actual problems.
An agent that researches companies might use tools for web search, financial data APIs, news article retrieval, and structured report generation. Given a company name, it searches for recent news, pulls revenue data, analyzes competitor positioning, and outputs a formatted brief.
The key challenge: synthesizing information from multiple sources into coherent analysis. The agent needs strong prompt engineering to maintain objectivity and cite sources accurately. Memory management becomes critical—research agents accumulate lots of context quickly.
Support agents handle ticket classification, knowledge base search, and response generation. Given a customer inquiry, the agent searches internal documentation, checks order history, and drafts responses.
Safety is paramount here. The agent must never promise refunds or changes it can't deliver. Tool restrictions ensure it can only read customer data and suggest responses—actual account modifications require human approval.
Integration with ticketing systems (Zendesk, Freshdesk) happens through custom tools that call their APIs. The agent becomes a copilot for human agents rather than fully autonomous support.
Development agents analyze codebases for bugs, style violations, or security issues. They receive code as input, run static analysis tools, check against best practices, and output formatted feedback.
These agents combine LLM reasoning with traditional tooling. Linters and type checkers provide objective data the agent contextualizes and explains. The value comes from translating tool output into actionable developer guidance.
According to PyTorch documentation on reinforcement learning agents (from competitive multi-agent RL tutorials), training agents to optimize specific tasks requires careful reward design and evaluation. For code review agents, reward might be "percentage of agent-flagged issues that developers agree are valid problems."
Sales agents qualify leads by researching prospects, scoring them against criteria, and drafting personalized outreach. Tools include LinkedIn scraping, company database access, email template generation, and CRM integration.
The workflow: receive lead list, research each company, score based on industry/size/recent activity, generate customized emails, log results in CRM. The agent handles high-volume repetitive research while humans focus on relationship building.
Privacy and compliance matter significantly here. Ensure tools respect data usage policies and the agent doesn't generate spammy or misleading content.

Most agents use pre-trained LLMs with prompt engineering. But reinforcement learning can optimize agent behavior for specific tasks.
Research published on arXiv (submitted August 5, 2025) introduced Agent Lightning, a framework for training agents with RL. The key insight: treat agent decision-making as an RL problem where states are conversation contexts, actions are tool calls or responses, and rewards are task success metrics.
Training agents with RL requires defining reward functions that capture desired behavior. For a customer support agent, reward might combine response accuracy, customer satisfaction ratings, and efficiency (fewer tool calls to resolution).
The practical challenge: RL training requires significant compute and data. Most teams start with prompt-engineered agents and only invest in RL when they have production data showing exactly what behaviors need optimization.
According to the Agent² framework research (submitted September 16, 2025, revised September 30, 2025), automated agent generation can handle even complex RL tasks. The system takes task descriptions and automatically generates agent configurations, tests them, and iteratively improves performance without manual tuning.
For most developers building agents from scratch, RL remains aspirational. Focus on getting a prompt-based agent working reliably before considering RL optimization.
The agent landscape evolves rapidly. Design choices that support adaptation matter for longevity.
Model-agnostic interfaces let you swap LLM providers without rewriting application logic. Abstract the model layer behind a consistent interface—whether you use OpenAI, Anthropic, or open-source models, the orchestrator calls the same functions.
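One way to sketch that abstraction in Python is a Protocol with thin provider adapters (the adapters here are stubs, not real client wrappers):

```python
from typing import Protocol

class ChatModel(Protocol):
    """The one interface the orchestrator depends on."""
    def complete(self, messages: list) -> str: ...

class OpenAIModel:
    """Adapter stub -- in production this would wrap the OpenAI client."""
    def complete(self, messages):
        return f"openai:{messages[-1]['content']}"

class AnthropicModel:
    """Adapter stub -- would wrap the Anthropic client."""
    def complete(self, messages):
        return f"anthropic:{messages[-1]['content']}"

def answer(model: ChatModel, question: str) -> str:
    # Orchestration logic never knows which provider it is talking to.
    return model.complete([{"role": "user", "content": question}])
```

Swapping providers becomes a one-line change at construction time; the orchestrator and tools never touch provider-specific code.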
Modular tool design makes adding capabilities straightforward. Each tool should be self-contained with clear inputs and outputs. Adding a new tool means writing the function and schema, not modifying core agent logic.
Configuration-driven behavior keeps the codebase flexible. System prompts, tool lists, iteration limits, and other parameters should live in config files, not hardcoded. Changing agent behavior becomes editing config rather than code changes.
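A minimal sketch of config-over-defaults loading (the field names are illustrative):

```python
import json

# Illustrative defaults; a real agent would list its actual prompt and tools.
DEFAULTS = {"system_prompt": "You are a helpful assistant.",
            "max_iterations": 8,
            "tools": ["calculator"]}

def load_config(path=None):
    """Merge a JSON config file over defaults, so behavior changes
    are config edits rather than code changes."""
    config = dict(DEFAULTS)
    if path:
        with open(path) as f:
            config.update(json.load(f))
    return config
```

Editing `max_iterations` or the tool list in the JSON file then changes agent behavior without a redeploy of the code itself.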
Version control for prompts and schemas enables experimentation. Track what prompts produced what results, A/B test variations, and roll back changes that degrade performance.

If you are reading about how to build AI agents from scratch, Extuitive is a more practical example of a focused AI product. It is built for one task: predicting ad performance before launch. That gives teams a way to assess creative earlier and make campaign decisions with more context.
👉 Book a demo with Extuitive to see how AI can support ad decisions.
Building AI agents from scratch demystifies the technology. Behind the hype sits a comprehensible pattern: define goals, create tools, orchestrate decisions, handle results, iterate.
The fundamentals covered here—LLM integration, function calling, tool execution, memory management, and orchestration loops—form the foundation every agent needs. Whether you build on these basics directly or use frameworks that implement them, understanding the underlying mechanics makes you a better agent developer.
Start simple. Pick a focused use case, implement a single-tool agent, test it thoroughly, and iterate. Complexity comes later, once core patterns work reliably.
The agent landscape will evolve. Models will improve, new patterns will emerge, and best practices will shift. But the core concept—systems that reason, act, and adapt to achieve goals—remains constant. Understanding how to build them from first principles gives you the foundation to adapt as the field advances.
Now the real question: what will you build? The tools are accessible, the patterns are proven, and the use cases are everywhere. Pick a problem, define an agent that could solve it, and start coding. The gap between concept and working prototype has never been smaller.