When One LLM Cannot Do the Job: Shipping Agent Teams

A mid-size accounting firm wanted to automate its tax return review process. Each review involved checking 47 items across federal and state requirements, cross-referencing prior-year filings, validating calculations, identifying optimization opportunities, and generating a summary for the CPA. A single LLM could not handle the full process: it could not access the firm's tax databases, perform complex calculations reliably, or maintain context across the 47-item checklist.

An AI agency built a multi-agent system in which specialized agents handled different aspects of the review: one for data extraction and validation, one for federal compliance checking, one for state compliance checking, one for calculation verification, one for optimization identification, and an orchestrating agent that coordinated the workflow and compiled the final report. The system reduced average review time from 4.2 hours to 38 minutes. Error detection improved by 23 percent because the agents consistently checked every item, unlike human reviewers who sometimes skipped steps under time pressure. The engagement cost $320,000 and generated $1.8 million in annual labor savings for the firm.

Multi-agent AI systems are moving from research curiosity to production reality. For your agency, building agent platforms is a high-complexity, high-value service that commands premium pricing and creates deep client relationships.

What Multi-Agent Systems Are

A multi-agent AI system uses multiple specialized AI agents that collaborate to accomplish complex tasks. Each agent has defined capabilities, tools, and responsibilities. An orchestrator coordinates the agents, managing the workflow, handling failures, and assembling outputs.

Key characteristics:

Specialization. Instead of one general-purpose agent trying to do everything, specialized agents each do one thing well. A research agent is good at finding information. A coding agent is good at writing code. An analysis agent is good at interpreting data. Specialization improves quality because each agent's prompt and tools are optimized for its specific task.

Tool use. Agents interact with external tools — databases, APIs, calculators, search engines, file systems, code interpreters. This extends their capabilities beyond pure text generation.

Planning and reasoning. Agent systems can decompose complex tasks into subtasks, plan an execution strategy, and adapt the plan as they learn from intermediate results.

Memory and state. Agent systems maintain state across interactions — conversation history, task progress, discovered information, and decisions made. This enables them to work on tasks that span many steps.

Platform Architecture

Orchestration Layer

The orchestrator is the brain of the multi-agent system. It decomposes tasks, assigns subtasks to agents, manages the flow of information between agents, and handles failures.

Orchestration patterns:

Sequential pipeline. Agents execute in a fixed sequence — Agent A's output becomes Agent B's input. Simplest pattern, suitable for well-defined workflows where the steps are always the same.

Hierarchical delegation. A supervisor agent receives a task, decomposes it into subtasks, and delegates to specialized worker agents. The supervisor reviews outputs and requests revisions if needed. Most flexible pattern for complex tasks.

Collaborative discussion. Multiple agents discuss a problem, each contributing their perspective. A moderator agent synthesizes the discussion into a final output. Best for tasks requiring multiple viewpoints (risk assessment, strategy analysis).

Parallel execution. Independent subtasks are assigned to agents that execute simultaneously. An aggregator combines the results. Best for tasks with many independent components (multi-document analysis, parallel data collection).
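As a rough illustration of the two simplest patterns above, the sketch below treats an agent as any function from text to text. The names `run_sequential`, `run_parallel`, and the stand-in agents are hypothetical; a real agent would call an LLM and tools where these functions only manipulate strings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# Hypothetical agent type: any callable mapping input text to output text.
Agent = Callable[[str], str]


def run_sequential(agents: Sequence[Agent], task: str) -> str:
    """Sequential pipeline: each agent's output becomes the next agent's input."""
    result = task
    for agent in agents:
        result = agent(result)
    return result


def run_parallel(agents: Sequence[Agent], task: str,
                 aggregate: Callable[[List[str]], str]) -> str:
    """Parallel execution: agents work on the same task at once; an aggregator merges results."""
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        # pool.map returns results in the order the agents were given.
        outputs = list(pool.map(lambda agent: agent(task), agents))
    return aggregate(outputs)


# Stand-in agents for illustration only.
def extract(text: str) -> str:
    return text.strip().lower()


def check_federal(text: str) -> str:
    return f"federal-ok({text})"


def check_state(text: str) -> str:
    return f"state-ok({text})"


pipeline_result = run_sequential([extract, check_federal], "  Form 1040  ")
parallel_result = run_parallel([check_federal, check_state], "form 1040", " | ".join)
```

Hierarchical delegation and collaborative discussion can be layered on the same idea by making the supervisor or moderator itself an agent that decides which of these helpers to call.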

Agent Definition Layer

Each agent needs a clear definition that includes:

Role and persona. A description of the agent's expertise and perspective. Example: "You are a financial compliance specialist with deep expertise in SEC regulations and corporate financial reporting."

Capabilities. What the agent can do, described as the types of tasks it can handle.

Tools. The specific tools the agent can invoke — database queries, API calls, calculations, file operations, web searches, code execution.

Constraints. What the agent must not do — access restrictions, compliance requirements, safety boundaries.

Output format. The expected format of the agent's output — structured JSON, natural language, specific templates.
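One way to keep these five parts together is a small record type that can render itself as a system prompt. This is a minimal sketch; `AgentDefinition`, its field names, and the prompt wording are assumptions for illustration, not a format required by any framework.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AgentDefinition:
    """Hypothetical container for the five parts of an agent definition."""
    role: str                      # role and persona
    capabilities: Tuple[str, ...]  # task types the agent can handle
    tools: Tuple[str, ...]         # names of tools it may invoke
    constraints: Tuple[str, ...]   # things it must not do
    output_format: str             # e.g. "structured JSON"

    def system_prompt(self) -> str:
        """Render the definition as a system prompt for the agent's model."""
        sections = [
            self.role,
            "You can: " + "; ".join(self.capabilities) + ".",
            "Available tools: " + ", ".join(self.tools) + ".",
            "You must not: " + "; ".join(self.constraints) + ".",
            "Respond in this format: " + self.output_format + ".",
        ]
        return "\n".join(sections)


compliance_agent = AgentDefinition(
    role=("You are a financial compliance specialist with deep expertise in "
          "SEC regulations and corporate financial reporting."),
    capabilities=("review filings for compliance issues",),
    tools=("document_search", "database_query"),
    constraints=("modify records", "send emails"),
    output_format="structured JSON",
)
```

Keeping the definition as data rather than a hand-written prompt makes it easier to review constraints and tool lists per agent.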

Tool Integration Layer

Tools are what give agents their power. Without tools, agents are limited to text generation.

Common tool categories:

Data tools: Database queries, API calls, file reading, data processing
Computation tools: Calculators, code interpreters, statistical functions
Search tools: Web search, document search, knowledge base search
Communication tools: Email sending, message posting, notification triggering
Action tools: System configuration, workflow triggering, record creation/update

Tool implementation principles:

Every tool should have a clear description that the agent can understand
Tools should validate inputs and return structured outputs
Tools should handle errors gracefully and return informative error messages
Tool access should be controlled per agent (not every agent should have access to every tool)
Tool invocations should be logged for audit and debugging
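The principles above can be sketched as a thin wrapper around each tool function. The `Tool` class, its fields, and the `divide` example are hypothetical; the wrapper only shows one possible shape for input validation, structured results, per-agent access, and logging.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet

logger = logging.getLogger("agent_tools")


@dataclass
class Tool:
    name: str
    description: str                # shown to the agent so it knows when to use the tool
    param_types: Dict[str, type]    # required parameters and their expected types
    func: Callable[..., Any]
    allowed_agents: FrozenSet[str]  # per-agent access control

    def invoke(self, agent: str, /, **params: Any) -> Dict[str, Any]:
        """Run the tool and always return a structured result instead of raising."""
        logger.info("tool=%s agent=%s params=%r", self.name, agent, params)
        if agent not in self.allowed_agents:
            return {"ok": False, "error": f"agent '{agent}' may not use tool '{self.name}'"}
        for key, expected in self.param_types.items():
            if key not in params:
                return {"ok": False, "error": f"missing parameter '{key}'"}
            if not isinstance(params[key], expected):
                return {"ok": False, "error": f"parameter '{key}' must be {expected.__name__}"}
        try:
            return {"ok": True, "result": self.func(**params)}
        except Exception as exc:  # report the failure back to the agent
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


divide = Tool(
    name="divide",
    description="Divide numerator by denominator.",
    param_types={"numerator": float, "denominator": float},
    func=lambda numerator, denominator: numerator / denominator,
    allowed_agents=frozenset({"calculation_agent"}),
)
```

Returning errors as data lets the calling agent read the message and adjust its next call, rather than crashing the task.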

Memory and State Layer

Short-term memory. The conversation context and intermediate results for the current task. Stored in the orchestrator's working memory during task execution.

Long-term memory. Persistent knowledge that agents build over time — user preferences, discovered patterns, past decisions and their outcomes. Stored in a database and retrieved contextually.

Shared state. Information that multiple agents need to access during a task — task progress, discovered data, decisions made, artifacts produced. Stored in a shared workspace accessible to all agents in the task.
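A shared workspace can start as a keyed store that records which agent wrote what. The sketch below is an in-memory stand-in under that assumption; `SharedWorkspace` and its methods are hypothetical names, and a production system would likely back this with a database.

```python
import threading
from typing import Any, Dict, List, Tuple


class SharedWorkspace:
    """Hypothetical in-memory shared state for the agents working on one task."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self._data: Dict[str, Any] = {}
        self._log: List[Tuple[str, str]] = []  # (agent, key) for every write
        self._lock = threading.Lock()          # agents may run in parallel

    def write(self, agent: str, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value
            self._log.append((agent, key))

    def read(self, key: str, default: Any = None) -> Any:
        with self._lock:
            return self._data.get(key, default)

    def keys_written_by(self, agent: str) -> List[str]:
        with self._lock:
            return [key for writer, key in self._log if writer == agent]


workspace = SharedWorkspace(task_id="review-001")
workspace.write("extraction_agent", "income", 85000.0)
workspace.write("federal_agent", "federal_status", "compliant")
```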

Safety and Governance Layer

Agent systems can take actions with real-world consequences. Safety governance is not optional.

Guardrails:

Action approval: High-impact actions (sending emails, modifying records, triggering workflows) require human approval before execution
Budget limits: Cap the total compute, API calls, and tool invocations per task to prevent runaway agents
Output validation: Validate agent outputs against expected schemas and quality criteria before passing to downstream agents or users
Boundary enforcement: Agents must stay within their defined scope. A research agent should not attempt to modify data. A data agent should not attempt to send emails.
Human-in-the-loop: For critical decisions, route to a human for review before the agent proceeds
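Boundary enforcement and action approval can be combined into a single check that runs before any action executes. This is a sketch under assumed names: `HIGH_IMPACT`, `AGENT_SCOPE`, and `authorize` are hypothetical, and `ask_human` stands in for whatever approval channel the client uses.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Set

# Assumed high-impact actions, taken from the examples in the guardrails above.
HIGH_IMPACT = {"send_email", "modify_record", "trigger_workflow"}

# Hypothetical scopes: which actions each agent may request at all.
AGENT_SCOPE: Dict[str, Set[str]] = {
    "research_agent": {"web_search", "document_search"},
    "outreach_agent": {"send_email"},
}


@dataclass
class ActionRequest:
    agent: str
    action: str
    payload: Dict[str, Any] = field(default_factory=dict)


def authorize(request: ActionRequest,
              ask_human: Callable[[ActionRequest], bool]) -> str:
    """Return 'allowed', 'out_of_scope', or 'denied' for a requested action."""
    if request.action not in AGENT_SCOPE.get(request.agent, set()):
        return "out_of_scope"  # boundary enforcement
    if request.action in HIGH_IMPACT and not ask_human(request):
        return "denied"        # action approval
    return "allowed"
```

Low-impact, in-scope actions pass without calling `ask_human`, so reviewers only see the requests that matter.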

Delivery Process

Phase 1: Use Case Analysis and Design (Weeks 1-4)

Map the target workflow in detail (every step, decision point, data source, and output)
Identify which steps are candidates for agent automation
Define the agent roles and their required capabilities
Design the orchestration pattern
Identify required tools and integrations
Define safety requirements and governance policies
Design the evaluation strategy

Phase 2: Core Platform Build (Weeks 5-10)

Build the orchestration engine
Implement the agent framework (agent definition, prompt management, tool invocation)
Build the tool integration layer with required tool implementations
Implement the memory and state management system
Build the safety and governance layer

Phase 3: Agent Development (Weeks 11-16)

Develop and optimize prompts for each agent role
Implement and test each tool integration
Build end-to-end evaluation suites for the agent system
Iterate on agent behavior based on evaluation results
Tune orchestration logic for reliability and efficiency

Phase 4: Production Deployment (Weeks 17-22)

Deploy with human-in-the-loop for all actions initially
Gradually increase agent autonomy as confidence builds
Implement production monitoring and alerting
Build operational dashboards showing agent performance, cost, and quality
Train users on interacting with the agent system

Measuring Agent System Performance

Task completion metrics:

Success rate: Percentage of tasks completed successfully without human intervention. Target varies by use case — 80 percent for complex tasks, 95 percent for well-defined tasks.
Quality score: Human evaluation of agent output quality. Target: comparable to or better than human performance for routine tasks.
Time to completion: Average time to complete a task end-to-end. Track the improvement versus manual process.

Reliability metrics:

Error rate: Percentage of tasks where the agent makes a significant error. Target: under 5 percent.
Hallucination rate: Percentage of agent outputs containing fabricated information. Target: under 2 percent.
Recovery rate: Percentage of failed tasks that the system automatically recovers from. Target: 70 percent or higher.

Cost metrics:

Cost per task: Total cost (LLM tokens, tool invocations, compute) per completed task.
Cost versus manual: Comparison of agent system cost to the cost of manual task completion.
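The metrics above can be computed from per-task records. The sketch below assumes a hypothetical `TaskRecord` with one boolean per outcome; how each flag gets set (human review, automated checks) is left to the evaluation process.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskRecord:
    completed: bool          # task finished successfully
    human_intervened: bool   # a person had to step in
    significant_error: bool  # reviewer found a significant error
    hallucinated: bool       # output contained fabricated information
    failed: bool             # at least one failure occurred during the task
    auto_recovered: bool     # the system recovered from that failure on its own
    cost_usd: float          # LLM tokens + tool invocations + compute


def summarize(records: List[TaskRecord]) -> Dict[str, Optional[float]]:
    """Compute the task completion, reliability, and cost metrics as fractions."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one task record")
    completed = [r for r in records if r.completed]
    failed = [r for r in records if r.failed]
    return {
        "success_rate": sum(r.completed and not r.human_intervened for r in records) / n,
        "error_rate": sum(r.significant_error for r in records) / n,
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
        "recovery_rate": (sum(r.auto_recovered for r in failed) / len(failed)) if failed else None,
        "cost_per_completed_task": (sum(r.cost_usd for r in records) / len(completed)) if completed else None,
    }
```

Cost per completed task divides total spend, including spend on failed tasks, by the number of completed tasks, so wasted runs show up in the number.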

Common Agent System Failure Modes

Infinite loops. An agent encounters an error, retries, encounters the same error, retries again, and loops indefinitely. This consumes tokens and compute without producing results. The fix: implement retry limits and exponential backoff for every tool call. After a defined number of retries, the agent should escalate to the orchestrator or a human rather than continuing to retry.
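A minimal sketch of a bounded retry with exponential backoff follows. The names `call_with_retries` and `EscalationRequired` are hypothetical, and the default limits are placeholders to tune per tool. The `sleep` parameter exists so the delay can be replaced in tests.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EscalationRequired(Exception):
    """Raised when retries are exhausted and the orchestrator or a human must decide."""


def call_with_retries(tool_call: Callable[[], T],
                      max_retries: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call a tool, retrying with exponential backoff up to max_retries times."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            return tool_call()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                # Delays grow as base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    raise EscalationRequired(f"gave up after {max_retries} retries") from last_error
```

The original exception is chained onto `EscalationRequired`, so whoever handles the escalation can still see why the tool kept failing.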

Hallucinated tool calls. The agent attempts to invoke a tool that does not exist or passes invalid parameters to a real tool. This is especially common when the agent's tool descriptions are imprecise or when the task requires a tool the agent does not have. The fix: validate every tool call against the tool registry before execution. Return informative error messages when a tool call is invalid so the agent can adjust.
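Checking a model-proposed call against a registry might look like the sketch below. The registry contents and the call shape (`name` plus `arguments`) are assumptions for illustration; the tool names are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical registry: tool name -> required and optional argument names.
TOOL_REGISTRY: Dict[str, Dict[str, set]] = {
    "lookup_filing": {"required": {"taxpayer_id", "year"}, "optional": set()},
    "compute_tax": {"required": {"income"}, "optional": {"filing_status"}},
}


def validate_tool_call(call: Dict[str, Any]) -> List[str]:
    """Return a list of problems with the proposed call; an empty list means it is valid."""
    name = call.get("name")
    spec = TOOL_REGISTRY.get(name)
    if spec is None:
        return [f"unknown tool '{name}'; available tools: {sorted(TOOL_REGISTRY)}"]
    given = set(call.get("arguments", {}))
    errors = []
    missing = spec["required"] - given
    unexpected = given - spec["required"] - spec["optional"]
    if missing:
        errors.append(f"missing arguments: {sorted(missing)}")
    if unexpected:
        errors.append(f"unexpected arguments: {sorted(unexpected)}")
    return errors
```

The error strings are written for the agent to read: naming the available tools and the exact missing arguments gives it what it needs to correct the call on the next turn.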

Context window overflow. Complex multi-agent tasks can generate enormous amounts of intermediate text — research results, tool outputs, agent discussions. When this exceeds the context window, the agent loses track of earlier information and makes decisions based on incomplete context. The fix: implement context management — summarize intermediate results, maintain a structured state object rather than relying on raw conversation history, and use memory retrieval for information that does not need to be in the immediate context.
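One simple form of context management keeps the most recent entries verbatim and folds everything older into a single summary. The sketch below uses character counts as a stand-in for tokens, which is an assumption; `compact_history` is a hypothetical helper, and `summarize` would normally be an LLM call.

```python
from typing import Callable, List


def compact_history(entries: List[str], budget_chars: int,
                    summarize: Callable[[List[str]], str]) -> List[str]:
    """Keep the newest entries that fit the budget; replace older ones with one summary."""
    kept: List[str] = []
    used = 0
    for entry in reversed(entries):
        if used + len(entry) > budget_chars:
            break
        kept.append(entry)
        used += len(entry)
    kept.reverse()
    older = entries[: len(entries) - len(kept)]
    if not older:
        return kept
    # The summary is added on top of the budget, so leave headroom for it.
    return [summarize(older)] + kept
```

For example, with a budget of 5 characters the last two of `["aaaa", "bbbb", "cc", "dd"]` are kept and the first two are summarized.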

Cascading failures. One agent fails, producing bad output. The downstream agent receives this bad output as input and produces worse output. The orchestrator does not detect the quality degradation and the cascade continues until the final output is unusable. The fix: implement output validation at every agent boundary. Each agent's output should be validated against expected schemas and quality criteria before being passed to the next agent.
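Validation at agent boundaries can be as small as a type check on required fields between stages. This sketch assumes agents exchange dictionaries; `validate_output`, `run_pipeline`, and the flat 25 percent rate in the example are hypothetical and purely illustrative.

```python
from typing import Any, Callable, Dict, List, Tuple

Schema = Dict[str, type]
Stage = Tuple[str, Callable[[Dict[str, Any]], Dict[str, Any]], Schema]


class BoundaryValidationError(ValueError):
    """Raised when an agent's output does not match the schema for its stage."""


def validate_output(stage: str, output: Dict[str, Any], schema: Schema) -> Dict[str, Any]:
    problems = []
    for key, expected in schema.items():
        if key not in output:
            problems.append(f"missing '{key}'")
        elif not isinstance(output[key], expected):
            problems.append(f"'{key}' should be {expected.__name__}")
    if problems:
        raise BoundaryValidationError(f"{stage}: " + ", ".join(problems))
    return output


def run_pipeline(stages: List[Stage], task: Dict[str, Any]) -> Dict[str, Any]:
    """Run agents in order, validating each output before the next agent sees it."""
    data = task
    for name, agent, schema in stages:
        data = validate_output(name, agent(data), schema)
    return data


# Stand-in agents; the 25 percent rate is an arbitrary illustrative number.
def extract(task: Dict[str, Any]) -> Dict[str, Any]:
    return {"income": task["raw_income"]}


def compute(data: Dict[str, Any]) -> Dict[str, Any]:
    return {"tax_due": data["income"] * 0.25}


STAGES: List[Stage] = [
    ("extraction", extract, {"income": float}),
    ("calculation", compute, {"tax_due": float}),
]
```

If the extraction stage emits a string where a number is expected, the pipeline stops at that boundary and names the stage, instead of letting the calculation agent work from bad input.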

Cost explosions. An agent enters a research loop, making dozens of API calls and processing enormous documents, spending hundreds of dollars in token costs on a single task that should cost cents. The fix: implement per-task cost budgets. When cost exceeds the budget, the task is paused and escalated for human review.
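A per-task budget can be a small accumulator that every LLM and tool call reports to. The sketch below uses hypothetical names (`TaskBudget`, `BudgetExceeded`); how the cost of each call is estimated is left open.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Signals that a task should be paused and escalated for human review."""


@dataclass
class TaskBudget:
    limit_usd: float
    spent_usd: float = 0.0

    def record(self, cost_usd: float) -> None:
        """Add the cost of one LLM or tool call; raise once the budget is exceeded."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f} budget")
```

Because the cost is recorded after the call has happened, the task can overshoot by at most one call. A stricter variant would estimate the cost first and refuse the call.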

Agent Platform Technology Choices

Orchestration frameworks. LangGraph provides a graph-based orchestration framework with built-in state management and human-in-the-loop support. CrewAI focuses on role-based agent collaboration with a simpler API. AutoGen from Microsoft enables multi-agent conversations with code execution capabilities. For production enterprise systems, custom orchestration built on these frameworks (or from scratch) often provides the control and reliability that off-the-shelf frameworks lack.

LLM selection for agents. Not every agent needs the most capable model. Use a tiered approach — the orchestrator and complex reasoning agents use the most capable model (GPT-4, Claude, Gemini Pro), while simple extraction and formatting agents use smaller, faster, cheaper models. This can reduce total cost by 60 to 80 percent without meaningful quality loss.
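The arithmetic behind a tiered approach can be sketched as follows. The tier names, prices, and token volumes are invented for illustration and do not reflect any vendor's pricing; the actual saving depends on how much of the workload the smaller model can absorb.

```python
from typing import Dict

# Hypothetical prices in USD per million tokens; not real vendor pricing.
PRICE_PER_MTOK: Dict[str, float] = {"large": 10.0, "small": 1.0}

# Hypothetical assignment of agents to model tiers.
AGENT_TIER: Dict[str, str] = {
    "orchestrator": "large",
    "compliance": "large",
    "extraction": "small",
    "formatting": "small",
}


def model_tier(agent: str) -> str:
    """Unknown agents default to the most capable tier."""
    return AGENT_TIER.get(agent, "large")


def total_cost(tokens_by_agent: Dict[str, int], tiered: bool) -> float:
    """Cost of a workload, either tiered or with every agent on the large model."""
    total = 0.0
    for agent, tokens in tokens_by_agent.items():
        tier = model_tier(agent) if tiered else "large"
        total += tokens / 1_000_000 * PRICE_PER_MTOK[tier]
    return total


workload = {"orchestrator": 100_000, "compliance": 100_000,
            "extraction": 500_000, "formatting": 300_000}
all_large = total_cost(workload, tiered=False)
tiered = total_cost(workload, tiered=True)
saving = 1 - tiered / all_large
```

Under these assumed numbers the all-large cost is about $10.00, the tiered cost about $2.80, and the saving about 72 percent.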

Tool execution environments. Agents that execute code need sandboxed execution environments to prevent security issues. Use isolated execution for code agents, such as Docker containers or Firecracker microVMs. For agents that access databases or APIs, implement fine-grained access controls: each agent should have access only to the specific resources it needs, following the principle of least privilege.

Evaluation and testing. Agent systems are harder to test than traditional AI because they involve multiple interacting components with non-deterministic behavior. Build evaluation suites that test end-to-end task completion (not just individual agent outputs), include adversarial test cases (what happens when a tool fails? when an agent receives unexpected input?), and measure both quality and cost.

Building Agent Systems Incrementally

The biggest mistake agencies make with agent platforms is trying to build the complete multi-agent system from day one. Instead, build incrementally.

Start with a single agent. Build one agent that handles one well-defined task with a few tools. Get it to production, measure its performance, and learn from the operational experience.

Add agents one at a time. Once the single agent is working, add a second agent for a related task. Implement the orchestration between them. This incremental approach surfaces integration challenges early when they are small.

Expand tool access gradually. Start with read-only tools (database queries, API lookups, document retrieval). Add write tools (record creation, email sending, workflow triggering) only after the agent system has proven reliable with read-only operations.

Increase autonomy gradually. Start with human-in-the-loop for every action. As confidence builds, allow the agent to execute low-risk actions autonomously while maintaining human approval for high-risk actions. Eventually, most routine actions are autonomous while high-risk actions retain human oversight.

This incremental approach typically takes 3 to 6 months to reach full production capability, but it dramatically reduces the risk of building a complex system that fails in unexpected ways.

Agent System Use Cases by Industry

Legal. Contract review agents that extract key terms, identify risk clauses, compare against standard templates, and generate summary reports. Document discovery agents that search across thousands of documents to find relevant evidence for litigation. Compliance monitoring agents that track regulatory changes and assess impact on the firm's practices.

Finance. Financial analysis agents that gather data from multiple sources, compute metrics, identify trends, and generate investment reports. Compliance checking agents that review transactions against regulatory requirements. Customer service agents that handle account inquiries by accessing multiple banking systems.

Healthcare. Clinical documentation agents that assist physicians with note-taking, coding, and billing. Research agents that search medical literature to find relevant studies for clinical questions. Administrative agents that handle appointment scheduling, insurance verification, and referral management.

Software engineering. Code review agents that analyze pull requests for bugs, security vulnerabilities, and style issues. Incident response agents that gather logs, identify root causes, and suggest remediation steps. Documentation agents that generate and update API documentation from code changes.

Customer service. Tier-1 support agents that handle common inquiries by accessing CRM data, knowledge bases, and order systems. Escalation agents that gather context and prepare summaries for human agents when issues exceed the AI agent's capability. Quality assurance agents that review support interactions for compliance and customer satisfaction.

Pricing Multi-Agent Platform Engagements

Agent system design and proof of concept: $30,000 to $80,000
Single-workflow agent system: $100,000 to $250,000
Enterprise multi-agent platform: $250,000 to $700,000
Ongoing operations and optimization: $10,000 to $30,000 per month

Your Next Step

This week: Identify client workflows that involve multiple steps, multiple data sources, and repetitive cognitive tasks. These are your highest-potential agent automation candidates.

This month: Build a simple two-agent system on a real client workflow. Experience the challenges firsthand — orchestration complexity, tool reliability, prompt brittleness, and safety governance.

This quarter: Deliver your first production agent system. Start with a well-defined workflow, implement strong safety guardrails, and demonstrate measurable productivity gains.

Agency Script Editorial
