Delivery

One Giant LLM Failed Claims Processing; Seven Small Agents Didn't

Agency Script Editorial

Editorial Team

March 20, 2026 · 13 min read
Tags: ai agents, orchestration, multi-agent systems, workflow automation

An automation-focused AI agency in Atlanta was hired by a mid-size insurance company to automate their claims processing workflow. The existing process involved 11 manual steps across 4 departments, averaging 4.2 days from claim filing to resolution. The agency's initial approach was a single monolithic AI system: one large language model handling the entire workflow end-to-end. It failed spectacularly. The LLM could not reliably maintain context across all 11 steps, made inconsistent decisions when juggling multiple document types simultaneously, and had no way to recover when it made an error in an early step.

The agency pivoted to a multi-agent architecture: 7 specialized agents, each handling a specific phase of the claims process. A document intake agent extracted structured data from claim forms. A coverage verification agent checked policy terms. A damage assessment agent analyzed photos and repair estimates. A fraud detection agent flagged suspicious patterns. A settlement calculation agent computed payouts. An approval routing agent determined authorization requirements. And a communication agent generated claimant correspondence.

An orchestration layer coordinated the agents, managed handoffs, handled failures, and maintained the overall workflow state. Average claim handling time dropped to 6 hours, with 73% of claims fully automated and 27% requiring human intervention at specific decision points.

Multi-agent AI systems decompose complex workflows into specialized components, each powered by an AI agent optimized for a specific task. For AI agencies, multi-agent orchestration is how you deliver enterprise-grade automation for workflows that are too complex for a single model or a single prompt. But orchestrating multiple agents introduces coordination challenges that require deliberate architectural decisions. This guide covers the full spectrum of multi-agent orchestration, from architecture patterns to failure handling to production monitoring.

When Multi-Agent Systems Make Sense

Single Agent vs. Multi-Agent

Not every problem needs multiple agents. Use a single agent when the task is straightforward, requires minimal context switching, and has a linear workflow. Use multiple agents when:

Complexity requires specialization: The workflow involves distinct cognitive tasks that benefit from different prompts, different tools, different models, or different knowledge bases. A claims processing workflow that requires document understanding, policy knowledge, fraud detection, and communication skills benefits from agents specialized in each area.

Reliability requires isolation: Errors in one part of the workflow should not corrupt other parts. In a single-agent system, an error in document parsing can cascade to incorrect coverage decisions. In a multi-agent system, the document parsing agent can fail and retry without affecting the coverage verification agent.

Scale requires parallelism: Parts of the workflow can run simultaneously. In claims processing, fraud detection and coverage verification can run in parallel because they are independent tasks. A multi-agent system naturally supports parallel execution.

Maintainability requires modularity: Each agent can be updated, tested, and improved independently. Updating the fraud detection model does not require retesting the document intake agent.

Agent Types

Reactive agents respond to inputs without maintaining state. They process a request and produce a response. Examples: classification agents, extraction agents, summarization agents. These are the simplest agents to build and test.

Deliberative agents reason about their actions, plan multi-step approaches, and maintain working memory. They can break a complex task into subtasks and execute them sequentially. Examples: research agents, analysis agents, planning agents.

Tool-using agents have access to external tools (databases, APIs, calculators, search engines) and decide when and how to use them. They extend the agent's capabilities beyond what the language model alone can provide.

Supervisory agents coordinate other agents rather than performing tasks directly. They assign tasks, collect results, resolve conflicts, and make routing decisions. The orchestrator in a multi-agent system is typically a supervisory agent.

Architecture Patterns

Sequential Pipeline

Agents execute in a fixed order, each passing its output as input to the next agent.

When to use: The workflow has a natural sequential order where each step depends on the output of the previous step. Document processing pipelines (ingest, extract, validate, classify, route) are classic sequential pipelines.

Advantages:

  • Simple to implement and debug
  • Clear data flow and error attribution
  • Easy to add logging and monitoring at each step

Disadvantages:

  • No parallelism: total latency is the sum of all agent latencies
  • A failure in any agent blocks the entire pipeline
  • Rigid: cannot adapt the execution order based on intermediate results

Implementation:

  • Define a pipeline specification that lists agents in execution order
  • Each agent receives the accumulated context from all previous agents
  • Implement timeouts at each stage. If an agent does not respond within the timeout, trigger the failure handler
  • Log the input and output of each agent for debugging and audit
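
The implementation steps above can be sketched as a small runner. This is a minimal sketch under stated assumptions, not the agency's implementation: `run_pipeline`, `StageFailed`, and the two toy agents are hypothetical names, and each agent is assumed to be a plain Python callable that receives the accumulated context and returns a dict.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable

log = logging.getLogger("pipeline")  # configure handlers as needed

# Assumption: an agent is any callable taking the accumulated context.
Agent = Callable[[dict[str, Any]], dict[str, Any]]


class StageFailed(Exception):
    """Raised when a stage times out or raises; the caller's failure handler decides what next."""


def run_pipeline(stages: list[tuple[str, Agent]], initial: dict[str, Any],
                 timeout_s: float = 30.0) -> dict[str, Any]:
    context = dict(initial)
    with ThreadPoolExecutor(max_workers=1) as pool:
        for name, agent in stages:
            log.info("stage=%s input=%s", name, context)
            future = pool.submit(agent, dict(context))  # pass a copy of everything so far
            try:
                output = future.result(timeout=timeout_s)
            except FutureTimeout as exc:
                # Note: the worker thread keeps running; real agents need cancellable calls.
                raise StageFailed(f"{name} timed out after {timeout_s}s") from exc
            except Exception as exc:
                raise StageFailed(f"{name} raised {exc!r}") from exc
            log.info("stage=%s output=%s", name, output)
            context[name] = output  # later stages see all earlier outputs
    return context


def intake(ctx: dict[str, Any]) -> dict[str, Any]:
    return {"claim_id": ctx["raw"]["id"], "amount": ctx["raw"]["amount"]}


def verify(ctx: dict[str, Any]) -> dict[str, Any]:
    return {"covered": ctx["intake"]["amount"] <= 10_000}


result = run_pipeline([("intake", intake), ("verify", verify)],
                      {"raw": {"id": "C-1", "amount": 2500}})
```

Keying each output by stage name keeps error attribution simple: when `verify` produces a bad result, its exact input is already in the log.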

Parallel Fan-Out / Fan-In

Multiple agents execute simultaneously on the same input, and their outputs are aggregated.

When to use: Multiple independent analyses need to be performed on the same data. Fraud detection, compliance checking, and risk assessment can all run in parallel on the same transaction data.

Advantages:

  • Reduced total latency (limited by the slowest agent, not the sum of all agents)
  • Natural isolation: one agent's failure does not block others
  • Easy to add new parallel agents without affecting existing ones

Disadvantages:

  • Requires a fan-in step to aggregate and reconcile results
  • Conflicting results from different agents need a resolution strategy
  • Harder to debug when the aggregated result is incorrect (which agent's output was wrong?)

Implementation:

  • Dispatch the input to all parallel agents simultaneously
  • Set a timeout for the parallel phase. If any agent does not respond within the timeout, proceed with the results from agents that did respond
  • Implement an aggregation function that combines agent outputs. For classification tasks, use majority voting or weighted voting. For extraction tasks, use union with conflict resolution. For scoring tasks, use weighted averaging.
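
A fan-out with a shared deadline and a majority-vote fan-in might look like the following. It is a sketch only: `fan_out`, `majority_vote`, and the toy agents in `demo` are hypothetical, and the agents are assumed to be `async` callables that return a classification label.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, Optional

AsyncAgent = Callable[[dict], Awaitable[str]]  # assumption: each agent returns a label


async def fan_out(agents: dict[str, AsyncAgent], payload: dict,
                  timeout_s: float) -> dict[str, str]:
    """Run all agents concurrently; keep only results that arrive before the deadline."""
    tasks = {name: asyncio.create_task(agent(payload)) for name, agent in agents.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_s)
    for task in pending:
        task.cancel()  # stragglers are dropped, not awaited
    return {name: task.result() for name, task in tasks.items()
            if task in done and task.exception() is None}


def majority_vote(labels: dict[str, str]) -> Optional[str]:
    """Return the most common label, or None on an empty input or a tie."""
    if not labels:
        return None
    (top, count), *rest = Counter(labels.values()).most_common()
    if rest and rest[0][1] == count:
        return None  # tie: let the caller escalate
    return top


async def demo() -> tuple[dict[str, str], Optional[str]]:
    async def rules(p): return "legit"
    async def model_a(p): return "legit"
    async def model_b(p): return "suspicious"
    async def slow(p):
        await asyncio.sleep(1)  # misses the 0.1 s deadline below
        return "suspicious"
    agents = {"rules": rules, "model_a": model_a, "model_b": model_b, "slow": slow}
    results = await fan_out(agents, {"claim_id": "C-1"}, timeout_s=0.1)
    return results, majority_vote(results)
```

Returning `None` on a tie keeps the resolution strategy explicit: the caller decides whether a tie means human review or a weighted re-vote.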

Hierarchical Orchestration

A supervisory agent delegates tasks to worker agents, reviews their outputs, and makes routing decisions based on intermediate results.

When to use: The workflow requires dynamic decision-making about which steps to execute, in what order, and with what parameters. Complex customer support workflows where the response depends on the nature of the inquiry, the customer's history, and the severity of the issue benefit from hierarchical orchestration.

Advantages:

  • Adaptive execution: the workflow changes based on intermediate results
  • The supervisor can recover from worker failures by reassigning tasks or trying alternative approaches
  • Natural escalation path: the supervisor can escalate to a human when agent capabilities are insufficient

Disadvantages:

  • The supervisor agent is a single point of failure and a potential bottleneck
  • Supervisor reasoning adds latency and cost
  • Harder to predict execution paths, making testing more complex

Implementation:

  • Define the supervisor agent with clear instructions about available worker agents, their capabilities, and when to use each one
  • Give the supervisor access to workflow state so it can make informed routing decisions
  • Implement guardrails on the supervisor: maximum number of steps, maximum budget, mandatory human review triggers
  • Log the supervisor's reasoning and decisions for debugging and audit
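
The guardrail idea can be illustrated with a supervisor loop that stops on a step or budget limit. This is a sketch with hypothetical names (`Guardrails`, `RunState`, `supervise`); the `route` function stands in for the supervisor's LLM call, and the limits shown are placeholder values, not recommendations.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Worker = Callable[[dict], tuple[dict, float]]  # returns (output, cost in USD)
Router = Callable[[dict], Optional[str]]       # next worker name, or None when done


@dataclass
class Guardrails:
    max_steps: int = 10         # placeholder limit
    max_cost_usd: float = 1.00  # placeholder limit


@dataclass
class RunState:
    data: dict = field(default_factory=dict)
    steps: int = 0
    cost_usd: float = 0.0
    decisions: list[str] = field(default_factory=list)  # audit trail of routing choices


def supervise(route: Router, workers: dict[str, Worker],
              state: RunState, rails: Guardrails) -> str:
    while True:
        if state.steps >= rails.max_steps or state.cost_usd >= rails.max_cost_usd:
            state.decisions.append("escalate: guardrail hit")
            return "needs_human_review"
        choice = route(state.data)  # stand-in for the supervisor model's decision
        state.decisions.append(f"route -> {choice}")
        if choice is None:
            return "completed"
        if choice not in workers:
            return "needs_human_review"  # supervisor named a worker that does not exist
        output, cost = workers[choice](state.data)
        state.data[choice] = output
        state.steps += 1
        state.cost_usd += cost
```

Because every routing choice is appended to `decisions`, the execution path can be reconstructed afterwards even though it was not known in advance.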

Event-Driven Architecture

Agents communicate through events on a message bus. Each agent subscribes to events it can handle and publishes events for other agents to consume.

When to use: The workflow is complex, dynamic, and may involve multiple concurrent processes that interact asynchronously. Enterprise workflows where actions trigger cascading processes across multiple systems benefit from event-driven orchestration.

Advantages:

  • Highly decoupled: agents can be added, removed, or updated without affecting other agents
  • Natural support for asynchronous, long-running workflows
  • Scales horizontally โ€” add more instances of busy agents to handle load

Disadvantages:

  • Harder to reason about the overall workflow, because the execution path is determined by events, not a predefined plan
  • Event ordering and exactly-once processing require careful engineering
  • Debugging requires tracing events across multiple agents and services
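
To make the publish/subscribe flow concrete, here is a minimal in-memory sketch. It is an illustration only: `EventBus` and the topic names are hypothetical, handlers are assumed to return the events they want to publish, and a production system would use a real message broker with durable delivery.

```python
from collections import defaultdict, deque
from typing import Callable

# Assumption: a handler returns a list of (topic, payload) events to publish next.
Handler = Callable[[dict], list[tuple[str, dict]]]


class EventBus:
    def __init__(self) -> None:
        self._subs: dict[str, list[Handler]] = defaultdict(list)
        self._queue: deque[tuple[str, dict]] = deque()
        self.log: list[str] = []  # topics in processing order, for tracing

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        self._queue.append((topic, payload))

    def run(self, max_events: int = 1000) -> int:
        """Drain the queue; max_events guards against runaway event cascades."""
        processed = 0
        while self._queue and processed < max_events:
            topic, payload = self._queue.popleft()
            self.log.append(topic)
            for handler in self._subs[topic]:
                for new_topic, new_payload in handler(payload):
                    self.publish(new_topic, new_payload)
            processed += 1
        return processed


bus = EventBus()
bus.subscribe("claim.filed", lambda e: [("claim.extracted", {**e, "amount": 1200})])
bus.subscribe("claim.extracted", lambda e: [("fraud.checked", {"id": e["id"], "flag": False})])
bus.subscribe("claim.extracted", lambda e: [("coverage.checked", {"id": e["id"], "covered": True})])
bus.publish("claim.filed", {"id": "C-1"})
processed = bus.run()
```

Adding a new agent here means adding one `subscribe` call; no existing handler changes, which is the decoupling the pattern is chosen for.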

Building Reliable Agent Systems

Prompt Engineering for Agents

Agent prompts are fundamentally different from chat prompts. They need to be precise, structured, and defensive.

Agent prompt structure:

  • Role definition: Clearly define the agent's role, expertise, and scope. What the agent is responsible for and what it is NOT responsible for.
  • Input specification: Exactly what data the agent will receive and in what format.
  • Output specification: Exactly what the agent must produce and in what format. Use structured output formats (JSON) to enable programmatic parsing.
  • Tool instructions: If the agent has tools, describe each tool, when to use it, and what to do with the results.
  • Error handling instructions: What to do when input data is missing, malformed, or ambiguous. What to do when a tool call fails.
  • Boundary conditions: Explicit instructions about when to flag uncertainty, when to escalate to a human, and when to refuse to act.

Defensive prompting techniques:

  • Include validation steps in the prompt: "Before producing your final output, verify that all required fields are present and that values are within expected ranges."
  • Include self-correction instructions: "If your initial analysis seems inconsistent, re-examine the evidence and revise your assessment."
  • Include scope enforcement: "Do not attempt to answer questions outside your defined scope. If the input requires expertise you do not have, indicate this in your output."
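
One way to keep agent prompts consistent is to render them from a structured spec that mirrors the sections listed above. The sketch below is an assumption about how that could be organized; `AgentPromptSpec` and its section headings are hypothetical, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPromptSpec:
    role: str
    input_spec: str
    output_spec: str
    tools: dict[str, str] = field(default_factory=dict)  # tool name -> when to use it
    error_handling: str = ""
    boundaries: str = ""

    def render(self) -> str:
        sections = [
            ("ROLE", self.role),
            ("INPUT", self.input_spec),
            ("OUTPUT", self.output_spec),
        ]
        if self.tools:
            tool_lines = "\n".join(f"- {name}: {use}" for name, use in self.tools.items())
            sections.append(("TOOLS", tool_lines))
        if self.error_handling:
            sections.append(("ERROR HANDLING", self.error_handling))
        if self.boundaries:
            sections.append(("BOUNDARIES", self.boundaries))
        # Defensive validation step, quoted from the techniques above.
        sections.append(("FINAL CHECK",
                         "Before producing your final output, verify that all required "
                         "fields are present and that values are within expected ranges."))
        return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


spec = AgentPromptSpec(
    role="You verify insurance coverage. You do NOT calculate settlements.",
    input_spec="A JSON object with claim_id, policy_id, and loss_type.",
    output_spec='A JSON object: {"claim_id": str, "covered": bool, "reason": str}.',
    tools={"policy_lookup": "Use to fetch policy terms by policy_id."},
    error_handling="If policy_lookup fails, set covered to null and explain in reason.",
    boundaries="If the loss type is ambiguous, flag the claim for human review.",
)
prompt = spec.render()
```

Rendering from a spec also makes prompts diffable and versionable alongside the output schema.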

Structured Output and Validation

Every agent should produce structured output that can be validated programmatically.

Output schema definition:

  • Define a JSON schema for each agent's output
  • Include required fields, data types, allowed values, and value constraints
  • Validate every agent output against its schema before passing it to the next agent
  • If validation fails, retry the agent with an error message indicating what was wrong with the output
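
The validate-then-retry loop described above can be sketched with the standard library alone. The field names, allowed values, and `call_with_validation` are hypothetical; a real system would more likely validate against a full JSON Schema with a dedicated library.

```python
import json
from typing import Callable, Optional

# Hypothetical output contract for a settlement agent.
REQUIRED_FIELDS = {"claim_id": str, "amount": (int, float), "decision": str}
ALLOWED_DECISIONS = {"approve", "deny", "review"}


def validate(raw: str) -> tuple[Optional[dict], list[str]]:
    """Return (parsed, []) when valid, otherwise (None, list of problems)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            errors.append(f"{name} has the wrong type")
    if not errors:
        if data["decision"] not in ALLOWED_DECISIONS:
            errors.append(f"decision must be one of {sorted(ALLOWED_DECISIONS)}")
        if data["amount"] < 0:
            errors.append("amount must be non-negative")
    return (None, errors) if errors else (data, [])


def call_with_validation(agent: Callable[[str], str], prompt: str,
                         max_retries: int = 2) -> dict:
    errors: list[str] = []
    for _ in range(max_retries + 1):
        full_prompt = prompt
        if errors:  # tell the agent exactly what was wrong last time
            full_prompt += f"\n\nYour previous output was invalid: {'; '.join(errors)}"
        parsed, errors = validate(agent(full_prompt))
        if parsed is not None:
            return parsed
    raise ValueError(f"output still invalid after {max_retries + 1} attempts: {errors}")
```

Capping retries matters here: an agent that cannot satisfy the schema should surface as an error, not loop.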

Schema versioning:

  • Version output schemas alongside agent prompts and code
  • When an agent's output schema changes, update all downstream agents that consume that output
  • Maintain backward compatibility during transitions by temporarily supporting both old and new schema versions

Failure Handling and Recovery

Multi-agent systems have more failure modes than single-agent systems. Every failure mode needs an explicit handling strategy.

Agent-level failures:

  • Timeout: The agent does not respond within the expected time. Strategy: retry once with the same input, then escalate to a fallback agent or human review.
  • Invalid output: The agent produces output that does not match the expected schema. Strategy: retry with an explicit error message, then use a default output or escalate.
  • Hallucination: The agent produces output that is syntactically valid but factually incorrect. Strategy: implement validation checks using external data sources, cross-reference with other agents' outputs, or route to human review.
  • Refusal: The agent refuses to perform the task (common with safety-trained LLMs on edge cases). Strategy: rephrase the prompt, use an alternative agent, or escalate to a human.

Orchestration-level failures:

  • Deadlock: Two agents are waiting for each other's output. Strategy: implement timeout-based deadlock detection and resolution.
  • Infinite loop: The orchestrator keeps retrying a failed step. Strategy: implement a maximum retry count and a circuit breaker that escalates after a threshold.
  • State corruption: The workflow state becomes inconsistent due to a partial failure. Strategy: use transactional state updates โ€” either all state changes from an agent succeed or none do.
  • Resource exhaustion: Too many concurrent workflows overwhelm the system. Strategy: implement queue-based admission control with a maximum number of concurrent workflows.
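
The "maximum retry count plus circuit breaker" strategy for the infinite-loop case can be sketched as follows. `CircuitBreaker`, `CircuitOpen`, and `run_step` are hypothetical names, and the threshold and reset interval are placeholder values.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpen(Exception):
    """The step has failed too often; stop calling it and escalate."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 60.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise CircuitOpen("circuit open: escalate to human review")
            self.opened_at = None  # cool-down elapsed: allow a trial call
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the count
        return result


def run_step(breaker: CircuitBreaker, fn: Callable[[dict], Any],
             payload: dict, max_retries: int = 2) -> Any:
    last_error: Optional[Exception] = None
    for _ in range(max_retries + 1):
        try:
            return breaker.call(fn, payload)
        except CircuitOpen:
            raise  # never retry through an open circuit
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"step failed after {max_retries + 1} attempts") from last_error
```

The retry count bounds a single workflow's attempts, while the breaker protects the system as a whole: once a step is clearly broken, later workflows fail fast instead of each burning their own retries.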

Human-in-the-Loop Integration

Production multi-agent systems almost always include human review points for high-stakes decisions or agent uncertainty.

Designing human review triggers:

  • Agent confidence below a threshold (the agent is uncertain)
  • Decision involves financial amounts above a threshold
  • The case matches known edge case patterns
  • Multiple agents disagree on the assessment
  • The workflow has exceeded the maximum number of retries
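
These triggers translate directly into a rule function. The sketch below uses hypothetical names (`Assessment`, `review_reasons`) and placeholder thresholds; real values would come from the client's risk policy.

```python
from dataclasses import dataclass, field


@dataclass
class Assessment:
    confidence: float                  # agent's self-reported confidence, 0.0 to 1.0
    amount_usd: float
    agent_verdicts: dict[str, str]     # agent name -> verdict
    retries: int = 0
    edge_case_tags: set[str] = field(default_factory=set)  # matched known edge-case patterns


def review_reasons(a: Assessment, *, min_confidence: float = 0.80,
                   max_auto_amount_usd: float = 25_000.0,
                   max_retries: int = 3) -> list[str]:
    """Return every trigger that fired; an empty list means no human review is needed."""
    reasons = []
    if a.confidence < min_confidence:
        reasons.append("low_confidence")
    if a.amount_usd > max_auto_amount_usd:
        reasons.append("high_amount")
    if a.edge_case_tags:
        reasons.append("known_edge_case")
    if len(set(a.agent_verdicts.values())) > 1:
        reasons.append("agents_disagree")
    if a.retries > max_retries:
        reasons.append("retries_exceeded")
    return reasons
```

Returning all reasons instead of a single boolean gives the reviewer context and lets you measure which trigger drives the most escalations.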

Human review interface requirements:

  • Show the full workflow context: all agent inputs, outputs, and reasoning
  • Allow the human to approve, modify, or reject the agent's decision
  • Capture the human's correction as feedback for improving the agents
  • Track review turnaround time and reviewer consistency

State Management

Workflow State Architecture

Multi-agent workflows need a centralized state store that all agents can read from and write to.

State store requirements:

  • Durability: State survives system restarts and agent failures
  • Consistency: All agents see the same state at any given time
  • Concurrency control: Multiple agents can read and write state without conflicts
  • Auditability: Every state change is logged with timestamp, agent ID, and change details

State store options:

  • PostgreSQL: Reliable, well-understood, supports transactions. Best for workflows with moderate concurrency.
  • Redis with persistence: Low-latency read/write, supports atomic operations. Best for workflows that need high-speed state access.
  • DynamoDB or similar: Managed, scalable, supports conditional writes. Best for high-volume workflows on AWS.
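
The consistency, concurrency-control, and auditability requirements can be illustrated with a version-checked update. This sketch uses Python's built-in `sqlite3` purely as a local stand-in for PostgreSQL; the table and function names are hypothetical, and the same conditional-update idea maps to DynamoDB conditional writes.

```python
import json
import sqlite3
import time


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE workflow_state ("
               "workflow_id TEXT PRIMARY KEY, version INTEGER NOT NULL, data TEXT NOT NULL)")
    db.execute("CREATE TABLE audit_log ("
               "workflow_id TEXT, agent_id TEXT, ts REAL, change TEXT)")
    return db


def create_workflow(db: sqlite3.Connection, workflow_id: str) -> None:
    with db:  # commit on success, roll back on exception
        db.execute("INSERT INTO workflow_state VALUES (?, 0, '{}')", (workflow_id,))


def update_state(db: sqlite3.Connection, workflow_id: str, agent_id: str,
                 expected_version: int, changes: dict) -> bool:
    """Apply changes only if nobody else has written since expected_version."""
    row = db.execute("SELECT data FROM workflow_state WHERE workflow_id = ?",
                     (workflow_id,)).fetchone()
    if row is None:
        return False
    merged = {**json.loads(row[0]), **changes}
    with db:  # state change and audit entry succeed or fail together
        cur = db.execute(
            "UPDATE workflow_state SET version = version + 1, data = ? "
            "WHERE workflow_id = ? AND version = ?",
            (json.dumps(merged), workflow_id, expected_version))
        if cur.rowcount != 1:
            return False  # stale version: caller should re-read and retry
        db.execute("INSERT INTO audit_log VALUES (?, ?, ?, ?)",
                   (workflow_id, agent_id, time.time(), json.dumps(changes)))
    return True
```

The `WHERE ... AND version = ?` clause is what prevents two agents from silently overwriting each other; the loser gets `False` and must re-read.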

Context Window Management

LLM-based agents have limited context windows. As workflows progress and context accumulates, you need strategies to manage context size.

Context management strategies:

  • Summarization: At each stage, summarize the previous context into a compact representation. The current agent receives the summary plus the specific data it needs.
  • Selective context: Each agent receives only the context relevant to its task, not the entire workflow history. The document intake agent's raw output does not need to be in the fraud detection agent's context.
  • Context retrieval: Store the full context in the state store and have agents retrieve specific pieces as needed using structured queries.
  • Sliding window: Keep only the most recent N exchanges in the context and retrieve older context on demand.
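
Two of these strategies, selective context and the sliding window, are simple enough to sketch directly. The agent names and state keys below are hypothetical examples based on the claims workflow.

```python
# Hypothetical mapping: which state keys each agent actually needs.
CONTEXT_NEEDS = {
    "fraud_detection": ["claim_summary", "claimant_history"],
    "settlement": ["claim_summary", "coverage", "damage_estimate"],
}


def select_context(agent_name: str, state: dict) -> dict:
    """Selective context: pass only the keys this agent is declared to need."""
    return {key: state[key] for key in CONTEXT_NEEDS[agent_name] if key in state}


def sliding_window(exchanges: list[dict], n: int) -> list[dict]:
    """Keep the most recent n exchanges; older ones stay in the state store."""
    return exchanges[-n:] if n > 0 else []


state = {
    "raw_intake_output": "...many pages of OCR text...",
    "claim_summary": "Rear-end collision, 2019 sedan, no injuries.",
    "claimant_history": {"prior_claims": 1},
    "coverage": {"collision": True, "deductible": 500},
}
fraud_context = select_context("fraud_detection", state)
recent = sliding_window([{"turn": i} for i in range(10)], n=3)
```

Declaring context needs in one table also documents the data dependencies between agents, which helps when an upstream output schema changes.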

Monitoring and Observability

Trace-Based Monitoring

Multi-agent systems need distributed tracing: the ability to follow a single workflow request across all agents and see timing, inputs, outputs, and decisions at each step.

Tracing implementation:

  • Assign a unique trace ID to each workflow instance
  • Each agent logs its input, output, latency, token usage, and any tool calls with the trace ID
  • Use a tracing platform (Jaeger, Langfuse, LangSmith, or Arize) to visualize traces
  • Alert on traces that exceed latency thresholds, error thresholds, or cost thresholds
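
Before adopting a tracing platform, the core idea fits in a few lines. This is a sketch only: the in-memory `SPANS` list stands in for a tracing backend, and `span` and `slow_spans` are hypothetical helpers, not the API of any of the platforms named above.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Any, Iterator

SPANS: list[dict[str, Any]] = []  # stand-in for a real tracing backend


def new_trace_id() -> str:
    return uuid.uuid4().hex


@contextmanager
def span(trace_id: str, agent: str, payload: Any) -> Iterator[dict[str, Any]]:
    """Record one agent invocation; the caller fills in output and token usage."""
    record = {"trace_id": trace_id, "agent": agent, "input": payload,
              "output": None, "tokens": 0, "error": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)  # recorded even when the agent fails


def slow_spans(spans: list[dict[str, Any]], max_latency_ms: float) -> list[dict[str, Any]]:
    return [s for s in spans if s["latency_ms"] > max_latency_ms]


trace_id = new_trace_id()
with span(trace_id, "intake", {"doc": "claim.pdf"}) as s:
    s["output"] = {"claim_id": "C-1"}
    s["tokens"] = 412
try:
    with span(trace_id, "fraud", {"claim_id": "C-1"}):
        raise RuntimeError("model unavailable")
except RuntimeError:
    pass  # the failed span is still in SPANS with its error recorded
```

Filtering `SPANS` by `trace_id` reconstructs the whole workflow, including the step that failed.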

Agent-Level Metrics

For each agent, track:

  • Success rate: Percentage of invocations that produce valid output
  • Latency (p50, p95, p99): Time from input to output
  • Token usage: Input tokens, output tokens, and total cost per invocation
  • Retry rate: Percentage of invocations that require retries
  • Escalation rate: Percentage of invocations that escalate to human review
  • Accuracy (from human review feedback): Percentage of outputs confirmed as correct
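
Most of these metrics reduce to simple aggregates over invocation records. The sketch below assumes a hypothetical `Invocation` record and uses the standard library's `statistics.quantiles` for percentiles; token usage is omitted for brevity, and at least two records are required.

```python
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class Invocation:
    valid: bool                               # output passed schema validation
    latency_ms: float
    retried: bool = False
    escalated: bool = False
    confirmed_correct: Optional[bool] = None  # set only when a human reviewed it


def agent_metrics(invocations: list[Invocation]) -> dict[str, Optional[float]]:
    n = len(invocations)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles([i.latency_ms for i in invocations], n=100)
    reviewed = [i for i in invocations if i.confirmed_correct is not None]
    return {
        "success_rate": sum(i.valid for i in invocations) / n,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "retry_rate": sum(i.retried for i in invocations) / n,
        "escalation_rate": sum(i.escalated for i in invocations) / n,
        # Accuracy is only defined over outputs a human actually reviewed.
        "accuracy": (sum(bool(i.confirmed_correct) for i in reviewed) / len(reviewed)
                     if reviewed else None),
    }
```

Keeping accuracy separate from success rate matters: an output can pass schema validation and still be wrong, which only review feedback reveals.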

System-Level Metrics

For the overall system, track:

  • End-to-end latency: Total time from workflow initiation to completion
  • Throughput: Workflows completed per unit time
  • Automation rate: Percentage of workflows completed without human intervention
  • Total cost per workflow: Sum of all agent invocation costs plus infrastructure costs
  • Error rate: Percentage of workflows that fail or require manual intervention

Cost Management

Token Cost Optimization

LLM-based agents consume tokens, and token costs add up quickly in high-volume multi-agent systems.

Cost optimization strategies:

  • Model tiering: Use cheaper, smaller models for simple agents (classification, extraction) and expensive, larger models only for agents that require complex reasoning. Not every agent needs GPT-4.
  • Prompt optimization: Minimize prompt length while maintaining output quality. Remove redundant instructions, use concise formatting, and avoid examples in production prompts (use few-shot examples during development, then distill into clear instructions).
  • Caching: Cache agent outputs for identical inputs. Many workflows process similar cases, and caching can reduce LLM calls by 20-40%.
  • Batch processing: When latency requirements allow, batch multiple workflow items into a single agent call to amortize the prompt overhead.
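
Model tiering and exact-match caching combine naturally in a thin wrapper around the model call. This is a sketch under assumptions: the model names in `MODEL_TIERS` are placeholders, `call_model` is a hypothetical function you supply, and the cache only helps when inputs are byte-for-byte identical after JSON serialization.

```python
import hashlib
import json
from typing import Any, Callable

# Placeholder tier names; substitute the models you actually use.
MODEL_TIERS = {
    "classification": "small-model",
    "extraction": "small-model",
    "reasoning": "large-model",
}


class CachedAgent:
    def __init__(self, task_type: str, call_model: Callable[[str, dict], Any]) -> None:
        self.model = MODEL_TIERS.get(task_type, "large-model")  # unknown tasks get the large tier
        self.call_model = call_model
        self.cache: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def run(self, payload: dict) -> Any:
        # Key on model plus payload so a tier change never serves stale answers.
        key = hashlib.sha256(
            json.dumps([self.model, payload], sort_keys=True).encode()).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        result = self.call_model(self.model, payload)
        self.cache[key] = result
        return result
```

Tracking `hits` and `misses` lets you measure the real cache hit rate for a given workflow before relying on any savings estimate.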

Cost Tracking Per Client

For agencies running multi-agent systems for multiple clients, track costs per client to ensure project profitability.

  • Tag every LLM call with the client identifier and workflow type
  • Compute per-client, per-workflow cost including LLM tokens, infrastructure, and human review time
  • Compare actual costs to project budgets
  • Identify workflows where cost exceeds expectations and optimize
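
Tagging and aggregation can be sketched as below. The prices in `PRICE_PER_1K_TOKENS` are made-up placeholders, not real vendor pricing, and `LlmCall`, `cost_by_client`, and `over_budget` are hypothetical names. Only token cost is covered; infrastructure and human review time would be added as further line items.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder prices in USD per 1,000 tokens: (input, output). Not real vendor pricing.
PRICE_PER_1K_TOKENS = {
    "small-model": (0.001, 0.002),
    "large-model": (0.01, 0.03),
}


@dataclass
class LlmCall:
    client_id: str   # tag attached to every call
    workflow: str    # tag attached to every call
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(call: LlmCall) -> float:
    price_in, price_out = PRICE_PER_1K_TOKENS[call.model]
    return call.input_tokens / 1000 * price_in + call.output_tokens / 1000 * price_out


def cost_by_client(calls: list[LlmCall]) -> dict[tuple[str, str], float]:
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for call in calls:
        totals[(call.client_id, call.workflow)] += call_cost(call)
    return dict(totals)


def over_budget(totals: dict[tuple[str, str], float],
                budgets: dict[tuple[str, str], float]) -> list[tuple[str, str]]:
    """Return (client, workflow) pairs whose token cost exceeds the budget."""
    return [key for key, cost in totals.items()
            if cost > budgets.get(key, float("inf"))]
```

Workflows without a budget entry are never flagged, so a missing budget shows up as a data gap to fix, not a false alarm.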

Your Next Step

Take one complex workflow that your agency is automating or considering automating. Map every step of the current manual process, identifying the inputs, outputs, decisions, and handoffs at each step. Group the steps into 3-7 logical agent roles, where each agent handles a coherent subset of the workflow. Build a prototype of one of those agents, the simplest one. Define its prompt, input schema, output schema, and validation rules. Test it on 50 real examples and measure its success rate. If that one-agent prototype achieves an acceptable success rate, you have early evidence that the decomposition is workable. If it fails, you have learned which assumptions need to change before you build the full system. Start small, validate early, and scale the architecture only after the individual agents prove reliable.
