Relevant

As multi-agent systems become increasingly prominent, mastering the design and implementation of AI-driven architectures is more critical than ever.

In this article, we delve into the core principles of AI-driven system design, emphasizing key concepts and practices in building robust multi-agent systems.

What is Large Language Model (LLM)?

How to connect to an LLM?

Key Metrics to Measure LLM Performance

Measure Response Generation time (Latency)

Token Usage

print(response.usage)

# Output:
{
  "input_tokens": 10,
  "output_tokens": 20,
  "total_tokens": 30
}

Response Size

Which LLM to use?

Different models produce different results. This brings us to a decision point when creating any AI application: choosing the right model is not always about picking the most powerful one. It's an engineering trade-off between three factors: capability, cost, and latency (speed).

# Brief descriptions of gpt-4o and gpt-4.1 models

class OpenAIModel(str, Enum):
        # GPT-4o ("Omni") - flagship multimodal model released May 13, 2024.
        # Handles text, audio, images, and video with native voice/vision support;
        # ~128 K token context window.
        GPT_4O = "gpt-4o"

        # GPT-4o-mini - smaller, cost-effective variant released July 18, 2024.
        # Multimodal like GPT-4o, same context window, optimized for affordability.
        GPT_4O_MINI = "gpt-4o-mini"

        # GPT-4.1 - developer-focused flagship released April 14, 2025.
        # Massive 1 M token context window; excels at coding, reasoning, instruction-following.
        GPT_41 = "gpt-4.1"

        # GPT-4.1-mini - compact variant in the 4.1 family.
        # Same 1 M context, much lower latency and cost; matches or outperforms GPT-4o on many benchmarks.
        GPT_41_MINI = "gpt-4.1-mini"

        # GPT-4.1-nano - smallest, fastest 4.1 variant.
        # 1 M token context, highly efficient and cost-effective for lightweight tasks.
        GPT_41_NANO = "gpt-4.1-nano"

NOTE: you can try to solve a complex problem by giving a vague instruction to a very large, expensive model (like OpenAI o3) and hoping it correctly infers all the necessary steps. This approach is slow and costly to run for every user request. The alternative is what this course teaches: you, the engineer, use prompting techniques like Chain-of-Thought and Prompt Chaining to explicitly break the problem down into a series of logical steps. Because each individual step is simpler and more focused, you can often use a smaller, faster, and much cheaper model (like GPT-4.1 mini) to successfully execute each one.

Q. Your team is building a new AI-powered customer support chatbot for an e-commerce website. 
The agent is designed with a clear, multi-step prompt chain: it first identifies the user's 
product category, then retrieves FAQs for that category, and finally, drafts an answer 
based on the retrieved information.

The chatbot needs to handle thousands of common customer questions per day, and user 
experience (getting a fast answer) is a top priority.

Question: Which model selection strategy is best for this solution?

Answer: Because the complex task is broken down into simpler, well-defined steps, a smaller, 
faster, and more cost-effective model is the ideal choice. This strategy balances 
capability with the business needs of low latency and manageable costs.

What is Prompting?

Prompting is essentially the means by which LLMs are programmed or guided. A prompt is a set of instructions provided to an LLM that customizes, enhances, or refines its capabilities.

plain_system_prompt = "You are a helpful assistant."  # A generic system prompt
user_prompt = "Give me a simple plan to declutter and organize my workspace."

print(f"Sending prompt to {MODEL} model...")
baseline_response = get_completion(plain_system_prompt, user_prompt)
print("Response received!\n")

display_responses(
    {
        "system_prompt": plain_system_prompt,
        "user_prompt": user_prompt,
        "response": baseline_response,
    }
)

Principles of Effective Prompting

Assign a Professional Persona: Giving the LLM a role changes its tone, suggestions, and the perceived expertise of its advice.

role_system_prompt = "You are an expert professional organizer and productivity coach."

print("Sending prompt with professional role...")
role_response = get_completion(role_system_prompt, user_prompt)
print("Response received!\n")

# Show last two prompts and responses
display_responses(
    {
        "system_prompt": plain_system_prompt,
        "user_prompt": user_prompt,
        "response": baseline_response,
    },
    {
        "system_prompt": role_system_prompt,
        "user_prompt": user_prompt,
        "response": role_response,
    },
)

Introduce Concrete Constraints: Challenge the LLM to work within realistic limits, like time and budget, forcing it to prioritize and make trade-offs.

# TODO: Write a constraints system prompt replacing the ***********
constraints_system_prompt = f""" {role_system_prompt} The plan must be achievable in one hour and require no purchases, using only existing household items."""

print("Sending prompt with constraints...")
constraints_response = get_completion(constraints_system_prompt, user_prompt)
print("Response received!\n")

# Show last two prompts and responses
display_responses(
    {
        "system_prompt": role_system_prompt,
        "user_prompt": user_prompt,
        "response": role_response,
    },
    {
        "system_prompt": constraints_system_prompt,
        "user_prompt": user_prompt,
        "response": constraints_response,
    },
)

Request Step-by-Step Reasoning: Ask the LLM to "show its work," making its decision-making process transparent and revealing how it interprets your refined prompts.

# TODO: Ask the LLM to explain its reasoning step by step, replacing the ***********
reasoning_system_prompt = (
    f"{constraints_system_prompt} Explain your reasoning for each step of the plan in a thoughtful way before presenting the final checklist"
)

print("Sending prompt with reasoning request...")
reasoning_response = get_completion(reasoning_system_prompt, user_prompt)
print("Response received!\n")

# Display the last two prompts and responses
display_responses(
    {
        "system_prompt": constraints_system_prompt,
        "user_prompt": user_prompt,
        "response": constraints_response,
    },
    {
        "system_prompt": reasoning_system_prompt,
        "user_prompt": user_prompt,
        "response": reasoning_response,
    },
)

What is Streaming?

How Streaming Works?

Server Sent Events

Open SSE streaming connection.
Stream tokens in loop.

What is an AI Agent?

An agent is an intelligent system capable of perceiving its environment, making decisions, and taking actions to achieve specific goals.

They take the power of generative AI, particularly Large Language Models (or LLMs), a step further.

Why Build with Agents?

While powerful, standalone LLMs often fall short when faced with complex, broader tasks requiring multiple steps, planning, and sophisticated reasoning. These tasks often demand accessing external, real-time information or tools outside the LLM's training data.

Agents bridge this gap by integrating LLM reasoning with tools and planning capabilities. They enable the execution of workflows and allow actions in the "real world," whether digital or physical via APIs, to complete tasks.

Components of an AI Agent

The foundation of a modern AI agent typically involves several key components:

Large Language Model (LLM): This serves as the "brain" of the agent, providing the ability to understand, reason, and act. LLMs process and generate language, enabling the agent's cognitive functions.
Tools: These are external functions, APIs, or resources that the agent can access and utilize to interact with its environment and enhance its capabilities. Tools allow agents to perform specific tasks beyond text generation, such as searching the web, querying databases, making calculations, or controlling external systems.
Instructions: Explicit guidelines, often provided through a system prompt, define how the agent should behave and guide its actions.
Memory: Agents can possess various forms of memory, including short-term memory (context from the current conversation) and long-term memory (from past historical interactions), enabling them to learn from past experiences and maintain context.
Runtime/Orchestration Layer: This environment allows the agent or LLM to control its execution flow, decide when to use tools, and process observations. In fact, the orchestration layer is what actually runs the tools on the LLM's behalf, since by itself it only generates text.

What is a Multi-Agent System?

A multi-agent system is a system that consists of multiple agents that interact with each other and can make decisions to handle complicated natural language requests from users.

To make multi-agent systems powerful, we need to make sure that the agents themselves can integrate with our bussiness systems. We also need to learn how to route data and memories between agents as well as how to use agents to carefully augment requests to AI models with relevant context.

NOTE: The main role of an LLM within an AI agent is to provide reasoning and language processing capabilities. It can be used to understand the user's request and figure out which tools (APIs) can serve the request.

Multi-Agent System Architecture

Orchestrator Pattern: A centralized architecture where a single agent (the orchestrator) manages and delegates tasks to other worker agents.
Peer-to-Peer Pattern: A decentralized architecture where requests immediately goes to the most qualified agent for handling. Agents often commnunicate directly with each other without a central orchestrator.

Regardless of the the pattern, we need to define roles, communication protocols, state management, and data flow strategies for our multi-agent system.

Agent Tools

Tools give agents the ability to do things beyond just processing language, such as accessing databases, calling APIs, or performing complex calculations.

Tools names and descriptions are critical metadata that the LLM uses to understand what the tool does and how to use it.

NOTE: It's important to handle errors within your tools to prevent the agent from crashing.

Orchestration

Orchestration is all about defining the techniques that direct and manage agent interactions to achieve complex goals. Key orchestration patterns include:

Sequential Orchestration

Each step builds on the previous result. For example, a customer support agent might first use a diagnostic agent, then a solution agent, and finally a verification agent.

Parallel Orchestration

Tasks that don't depend on each other can be run simultaneously to save time. For example, a market analysis report can be generated by simultaneously running a news analysis agent, a stock data agent and a competitor agent.

Conditional Branching

The orchestrator makes decisions based on the output of an agent. For example, in an e-commerce return system, if a policy agent returns "Eligible," the orchestrator calls the inventory and refund agents. Otherwise, it calls a denial notification agent.

Routing & Data Flow Management

Routing patterns:

Content-Based Routing

A routing strategy that directs a message based on its content or metadata (e.g., keywords, data type).

Inspects the message itself to decide where it goes. This is useful for sorting diverse incoming requests.

Round-Robin Routing

A routing strategy that distributes tasks evenly across a pool of similar agents, one after another.

Spreads the workload evenly across similar agents. This is useful for scaling.

Priority-Based Routing

A routing strategy that processes messages based on a predefined priority level.

Treats urgent tasks differently from routine ones.

State Management & Coordination

State management is about giving your agent team a collective memory.

State Coordination represents the set of mechanisms that ensure multiple agents have a consistent and synchronized view of a shared state, especially during concurrent operations.

We need mechanisms to detect when one agent's knowledge contradicts another and clear rules for resolving those conflicts.

Synchronization Strategies

Lean on Your Database: For state stored in a database, the database itself is your primary source of truth. Your agent tools can leverage features like locking and transactions.
State Broadcasting / Eventing: When a tool successfully updates critical state, it can publish an event. Interested agents can subscribe to these events and react accordingly.
Optimistic Concurrency Control: Assume conflicts are rare, but check for them before committing changes.

Conflict Resolution Strategies

Predefined Rules/Policies: Embed conflict resolution logic directly.
Negotiation/Consensus: Agents actively communicate to find a resolution.
Rollback & Retry: If an update fails due to a conflict, roll back the operation and retry.
Human Escalation: For critical or ambiguous conflicts, escalate to a human.

AI-Driven System Design: Fundamentals