Building Custom Text-to-Speech Solutions: Delivering Natural Voice Synthesis for Enterprise Applications

Agency Script Editorial

Editorial Team

March 20, 2026 · 11 min read
Tags: text to speech, voice synthesis, audio generation, enterprise ai

A media-focused AI agency in Los Angeles was hired by a financial services firm to create personalized audio summaries of portfolio performance reports. The firm sent 340,000 monthly reports to clients, and research showed that 62% of clients preferred audio content over written reports. Human narration would cost $4.50 per report, or $1.53 million monthly, which made it economically unworkable. The agency built a custom text-to-speech system that cloned the voice of the firm's senior market analyst (with consent and contractual agreements), handled financial terminology and numbers naturally (reading "the S&P 500 rose 2.3% to 5,847" with the index level spoken as "five thousand eight hundred forty-seven" rather than "five eight four seven"), and generated personalized audio reports in under 8 seconds each. The system cost $180,000 to build and $12,000 per month to operate, a 99% reduction in monthly cost compared to human narration. Client engagement with the audio reports reached 73%, significantly higher than the 41% engagement rate with written reports.

Text-to-speech (TTS) converts written text into natural-sounding spoken audio. Enterprise TTS has evolved from robotic-sounding rule-based systems to neural models whose output can be difficult to distinguish from human recordings. For AI agencies, custom TTS solutions are increasingly in demand for applications ranging from customer communication to content creation to accessibility. But delivering TTS that sounds professional, handles domain-specific content correctly, and operates at enterprise scale requires careful engineering across voice selection, text processing, model customization, and production infrastructure.

Scoping TTS Projects

Use Case Assessment

TTS applications vary widely in their requirements. The use case determines every technical decision downstream.

Informational content (reports, news summaries, documentation):

  • Priority: Natural prosody, correct pronunciation of domain terms, clear articulation
  • Latency tolerance: Batch generation acceptable (seconds to minutes per audio file)
  • Voice requirements: Professional, authoritative, consistent
  • Volume: High (thousands to millions of audio files)

Conversational AI (virtual assistants, IVR systems, chatbots):

  • Priority: Low latency, natural conversational rhythm, emotional expressiveness
  • Latency tolerance: Real-time (under 500ms from text to audio start)
  • Voice requirements: Warm, approachable, responsive
  • Volume: Real-time streaming, thousands of concurrent sessions

Accessibility (screen readers, audio descriptions, assistive technology):

  • Priority: Clarity, adjustable speed, correct pronunciation of all content types
  • Latency tolerance: Near real-time (under 1 second)
  • Voice requirements: Clear, neutral, configurable
  • Volume: On-demand, moderate

Creative content (audiobooks, podcasts, marketing materials):

  • Priority: Expressiveness, emotional range, character consistency
  • Latency tolerance: Batch generation acceptable
  • Voice requirements: Engaging, expressive, potentially multiple characters
  • Volume: Moderate, high quality per item

Voice Requirements

Define the voice characteristics before selecting a model or service.

Voice characteristics to specify:

  • Gender and perceived age
  • Accent and dialect
  • Speaking pace (words per minute)
  • Emotional tone (professional, warm, energetic, calm)
  • Domain pronunciation requirements (medical terms, financial terminology, brand names)

Custom voice vs. stock voice:

  • Stock voices from TTS providers (ElevenLabs, Google, AWS, Azure) are immediately available and require no training. Sufficient for many enterprise applications.
  • Custom voices cloned from a specific speaker's recordings provide brand consistency and uniqueness. Required when the voice is a recognizable part of the brand or when exact voice matching is needed.

Model and Provider Selection

Commercial TTS Providers

ElevenLabs: Widely regarded as one of the highest-quality commercial TTS services at the time of writing. Excellent voice cloning, emotional expressiveness, and multilingual support. API-based with streaming support. Cost: approximately $0.03-0.10 per 1,000 characters depending on plan.

OpenAI TTS: Strong quality, simple API, competitive pricing. Limited voice customization: choose from a set of pre-built voices. Good for applications that do not need voice cloning. Cost: $0.015-0.030 per 1,000 characters.

Google Cloud TTS: Strong multilingual support, good quality with WaveNet and Neural2 voices. Supports SSML for fine-grained pronunciation and prosody control. Cost: $0.004-0.016 per 1,000 characters.

Amazon Polly: Cost-effective for high-volume applications. Neural voices are good quality but not as expressive as ElevenLabs. Strong AWS integration. Cost: $0.004-0.016 per 1,000 characters.

Azure Speech Service: Good quality, strong enterprise features (custom neural voice training, pronunciation dictionaries). Integrates with the Microsoft ecosystem. Cost: $0.016 per 1,000 characters for neural voices.

Open-Source TTS Models

Coqui TTS / XTTS: Open-source, supports voice cloning with as little as 6 seconds of reference audio. Good quality for self-hosted deployments. Supports multiple languages.

Bark (Suno): Open-source, highly expressive with support for laughter, hesitations, and emotional speech. Less controllable than other options but good for creative applications.

StyleTTS 2: Open-source, state-of-the-art quality approaching human parity on some benchmarks. Supports style control and speaker adaptation.

VALL-E X and derivatives: Zero-shot voice cloning models that can synthesize speech in a target voice from a few seconds of reference audio. Research-stage but rapidly improving.

Selection Framework

Choose a commercial API when:

  • Time to market is critical
  • You need managed infrastructure with SLA guarantees
  • Volume is moderate (under 10 million characters per month)
  • Voice cloning is needed but custom model training is not justified

Choose self-hosted open-source when:

  • Data privacy prohibits sending text to external services
  • Volume is high enough that API costs exceed infrastructure costs
  • You need maximum customization of voice characteristics and pronunciation
  • The client requires on-premises deployment
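The cost criterion above can be checked with simple arithmetic. The sketch below is a rough model only: the $3,000 monthly infrastructure figure is a hypothetical placeholder, and it treats self-hosting as a flat monthly cost with no per-character expense, which a real estimate would need to refine.

```python
def monthly_api_cost(chars_per_month: int, price_per_1k_chars: float) -> float:
    """Dollar cost of synthesizing a month of text through a paid API."""
    return chars_per_month / 1000 * price_per_1k_chars


def break_even_chars(fixed_infra_cost: float, price_per_1k_chars: float) -> float:
    """Monthly character volume above which a flat-cost self-hosted
    deployment becomes cheaper than the API."""
    return fixed_infra_cost / price_per_1k_chars * 1000


# Hypothetical inputs: $0.016 per 1,000 characters for the API,
# $3,000 per month for GPUs and operations when self-hosting.
api_price = 0.016
infra_cost = 3000.0

print(f"API cost at 10M chars: ${monthly_api_cost(10_000_000, api_price):,.2f}")
print(f"Break-even volume: {break_even_chars(infra_cost, api_price):,.0f} chars/month")
```

Under these assumed numbers the API stays far cheaper at 10 million characters per month, which is consistent with the volume guideline in the list above.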

Text Processing Pipeline

Text Normalization

The most underappreciated component of a TTS system is text normalization: converting written text into a form that produces natural-sounding speech.

Number handling:

  • Cardinal numbers: "42" becomes "forty-two"
  • Ordinal numbers: "3rd" becomes "third"
  • Decimal numbers: "3.14" becomes "three point one four"
  • Phone numbers: "555-0123" becomes "five five five, zero one two three"
  • Currency: "$1,234.56" becomes "one thousand two hundred thirty-four dollars and fifty-six cents"
  • Percentages: "15.7%" becomes "fifteen point seven percent"
  • Years: "2026" becomes "twenty twenty-six" (not "two thousand twenty-six")
  • Ranges: "100-200" becomes "one hundred to two hundred"

Abbreviation expansion:

  • Common abbreviations: "Dr." becomes "Doctor," "Inc." becomes "Incorporated"
  • Domain-specific abbreviations: "BP" becomes "blood pressure" in medical context, "basis points" in financial context
  • Acronyms vs. initialisms: "NASA" is spoken as a word, "FBI" is spelled out. Build a lookup table for your domain.
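A domain-aware lookup table might look like the following sketch. The table contents and the function name are illustrative assumptions; a real project would build these tables from the client's own content.

```python
import re

# Hypothetical lookup tables for illustration only.
COMMON = {"Dr.": "Doctor", "Inc.": "Incorporated"}
DOMAIN = {
    "medical": {"BP": "blood pressure"},
    "financial": {"BP": "basis points"},
}
# Initialisms are read letter by letter. Acronyms such as NASA are left
# untouched so the TTS model reads them as words.
SPELL_OUT = {"FBI"}


def expand_abbreviations(text: str, domain: str) -> str:
    """Expand abbreviations using the common table plus one domain table."""
    table = {**COMMON, **DOMAIN.get(domain, {})}
    # Longest keys first so a short key never shadows a longer one.
    for abbr in sorted(table, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, table[abbr], text)
    for term in SPELL_OUT:
        text = re.sub(r"\b" + re.escape(term) + r"\b", " ".join(term), text)
    return text


print(expand_abbreviations("Dr. Lee checked BP.", "medical"))
print(expand_abbreviations("Spreads widened 25 BP.", "financial"))
print(expand_abbreviations("The FBI briefed NASA.", "financial"))
```

Keeping the domain as an explicit argument makes the "BP" ambiguity a configuration decision per client rather than something the code has to guess.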

Special content handling:

  • URLs: Either spell out or describe ("link to company dot com")
  • Email addresses: "john at company dot com"
  • Dates: "03/20/2026" becomes "March twentieth, twenty twenty-six"
  • Times: "14:30" becomes "two thirty PM"
  • Mathematical expressions: Context-dependent, often need manual handling
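The date, time, and email rules above can be sketched as follows. The function names are hypothetical, the date format is assumed to be US-style MM/DD/YYYY, and the year reader only covers years whose last two digits are 10 or greater.

```python
from datetime import datetime

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
IRREGULAR = {"one": "first", "two": "second", "three": "third", "five": "fifth",
             "eight": "eighth", "nine": "ninth", "twelve": "twelfth"}


def two_digit(n: int) -> str:
    """Spell out 0 to 99."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")


def ordinal(n: int) -> str:
    """Spell out an ordinal from 1 to 99 by converting the last word."""
    head, sep, last = two_digit(n).rpartition("-")
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return head + sep + last


def speak_date(text: str) -> str:
    """Assumes MM/DD/YYYY input."""
    d = datetime.strptime(text, "%m/%d/%Y")
    if d.year % 100 < 10:
        raise ValueError("year form not covered by this sketch")
    year_words = two_digit(d.year // 100) + " " + two_digit(d.year % 100)
    return f"{MONTHS[d.month - 1]} {ordinal(d.day)}, {year_words}"


def speak_time(text: str) -> str:
    """Convert 24-hour HH:MM into a spoken 12-hour time."""
    t = datetime.strptime(text, "%H:%M")
    hour = two_digit(t.hour % 12 or 12)
    suffix = "AM" if t.hour < 12 else "PM"
    if t.minute == 0:
        return f"{hour} {suffix}"
    minutes = "oh " + ONES[t.minute] if t.minute < 10 else two_digit(t.minute)
    return f"{hour} {minutes} {suffix}"


def speak_email(text: str) -> str:
    return text.replace("@", " at ").replace(".", " dot ")


print(speak_date("03/20/2026"))
print(speak_time("14:30"))
print(speak_email("john@company.com"))
```

Date formats are locale-dependent, so the input format should be confirmed with the client before it is hard-coded.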

SSML for Pronunciation Control

Speech Synthesis Markup Language (SSML) provides fine-grained control over pronunciation, pausing, emphasis, and prosody.

Essential SSML tags for enterprise TTS:

  • Phoneme: Specify exact pronunciation for terms the model mispronounces
  • Break: Insert pauses of specific durations between sentences or after key terms
  • Emphasis: Add stress to important words
  • Prosody: Control rate, pitch, and volume for specific passages
  • Say-as: Specify how to interpret content: as a date, telephone number, cardinal number, ordinal, or characters
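As an illustration of the tags listed above, the sketch below assembles an SSML document from small helper functions. The helper names are hypothetical, and the supported tags and attribute values differ between providers, so the output should be checked against the target provider's SSML documentation.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


def phoneme(text: str, ipa: str) -> str:
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def pause(milliseconds: int) -> str:
    return f'<break time="{milliseconds}ms"/>'


def emphasis(text: str, level: str = "moderate") -> str:
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def prosody(text: str, rate: str = "medium") -> str:
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def speak(*parts: str) -> str:
    """Wrap already-escaped parts in the SSML root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("The S&P 500 report for "),   # plain text must be XML-escaped
    say_as("03/20/2026", "date", fmt="mdy"),
    pause(300),
    escape(" shows the index "),
    emphasis("up"),
    escape(". Call "),
    say_as("555-0123", "telephone"),
    prosody(" for details.", rate="slow"),
)
print(ssml)
```

Escaping every text fragment matters in financial content, where characters such as "&" and "<" appear regularly and would otherwise produce invalid markup.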

Building a pronunciation dictionary:

  • Start with common domain terms and proper nouns
  • Test each term with the TTS model and identify mispronunciations
  • Create phonetic overrides for mispronounced terms
  • Maintain the dictionary as a living document that grows as new terms are encountered
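One way to apply such a dictionary is sketched below. To avoid inventing phonetic transcriptions, this example swaps in a simpler technique: respelling overrides emitted through the SSML sub element with an alias attribute, rather than phoneme tags. The dictionary entry and function name are hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical respelling overrides; real entries come from testing each
# term against the chosen TTS model.
OVERRIDES = {"EBITDA": "ee bit dah"}


def apply_overrides(text: str, overrides: dict) -> str:
    """Wrap known terms in <sub alias="..."> and XML-escape everything else."""
    if not overrides:
        return escape(text)
    terms = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[pos:match.start()]))
        term = match.group(1)
        out.append(f"<sub alias={quoteattr(overrides[term])}>{escape(term)}</sub>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


print(apply_overrides("EBITDA rose at S&P 500 firms.", OVERRIDES))
```

Storing the overrides as data rather than code lets reviewers add entries as new mispronunciations are found, without a deployment.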

Content Segmentation

Long texts need to be segmented intelligently for natural-sounding speech.

Segmentation strategies:

  • Split at sentence boundaries for the most natural pausing
  • Add longer pauses between paragraphs and sections
  • For very long content (reports, articles), split into sections with natural transition phrases
  • Ensure that segment boundaries do not break mid-thought; a clause split across segments sounds unnatural
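A minimal segmentation sketch along these lines is shown below. The `max_chars` limit is a hypothetical parameter, and the regex-based sentence splitter is deliberately naive: it should run after text normalization, because unexpanded abbreviations such as "Dr." would otherwise be treated as sentence ends.

```python
import re


def split_sentences(text: str) -> list:
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def segment(text: str, max_chars: int = 300) -> list:
    """Group whole sentences into segments without crossing paragraph breaks.

    A sentence is never split, so a single sentence longer than max_chars
    becomes its own oversized segment.
    """
    segments = []
    for paragraph in text.split("\n\n"):
        current = ""
        for sentence in split_sentences(paragraph):
            if current and len(current) + 1 + len(sentence) > max_chars:
                segments.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            segments.append(current)
    return segments


report = "Markets rose today. Bonds were flat.\n\nYour portfolio gained two percent."
for piece in segment(report, max_chars=60):
    print(repr(piece))
```

Because segments never cross a paragraph break, the caller can insert a longer pause between paragraphs than between sentences within one.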

Voice Cloning and Customization

Ethical and Legal Framework

Voice cloning raises significant ethical and legal considerations that must be addressed before starting the project.

Required consents and agreements:

  • Written consent from the person whose voice is being cloned, specifying the scope of use
  • Contractual agreement covering voice ownership, usage rights, and termination conditions
  • Disclosure to end users that the audio is AI-generated (required by law in many jurisdictions)
  • Content restrictions: what the cloned voice can and cannot be used to say

Voice cloning best practices:

  • Only clone voices with explicit, informed consent from the voice owner
  • Implement content filters to prevent misuse (do not allow the cloned voice to say things the voice owner would not approve)
  • Watermark generated audio with inaudible markers that identify it as AI-generated
  • Maintain an audit trail of all content generated with each cloned voice
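The scope check and audit trail could be sketched as follows. The field names, the `allowed_uses` set, and the function name are assumptions for illustration; the actual consent scope comes from the signed agreement, and a real system would write records to durable, append-only storage.

```python
import hashlib
from datetime import datetime, timezone


def audit_record(voice_id: str, text: str, requested_by: str,
                 use: str, allowed_uses: set) -> dict:
    """Refuse out-of-scope uses, otherwise return a record to append to the log."""
    if use not in allowed_uses:
        raise PermissionError(f"use {use!r} is outside the consented scope")
    return {
        "voice_id": voice_id,
        "use": use,
        "requested_by": requested_by,
        # Store a hash so the log can prove what was generated
        # without holding client text in plain form.
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


record = audit_record(
    voice_id="analyst-voice-01",             # hypothetical identifier
    text="Your portfolio gained two percent this quarter.",
    requested_by="report-batch-job",
    use="portfolio_reports",
    allowed_uses={"portfolio_reports"},
)
print(record["text_sha256"][:12], record["created_at"])
```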

Recording Requirements for Custom Voices

If building a custom voice from scratch, the quality of reference recordings directly determines the quality of the synthetic voice.

Recording specifications:

  • Minimum 30 minutes of clean speech for commercial-quality voice cloning (1-3 hours is ideal)
  • Professional recording environment (sound-treated room, minimal background noise)
  • High-quality microphone (condenser microphone, not a laptop built-in mic)
  • 44.1kHz or 48kHz sample rate, 16-bit or 24-bit depth, WAV format
  • Consistent microphone distance and positioning across all recordings
  • Read a diverse set of texts covering the full phonetic range of the target language
  • Include domain-specific terminology in the recording script
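The sample-rate and bit-depth specifications above can be checked automatically before recordings are accepted. The sketch below uses Python's standard `wave` module and assumes plain PCM WAV files; the function names are hypothetical, and acoustic qualities such as background noise still need a human listen.

```python
import wave

ALLOWED_RATES = {44100, 48000}   # Hz
ALLOWED_WIDTHS = {2, 3}          # bytes per sample: 16-bit or 24-bit


def check_recording(path: str) -> list:
    """Return a list of problems found in a WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
    if rate not in ALLOWED_RATES:
        problems.append(f"sample rate {rate} Hz is not 44.1 kHz or 48 kHz")
    if width not in ALLOWED_WIDTHS:
        problems.append(f"bit depth {width * 8} is not 16 or 24")
    return problems


def duration_seconds(path: str) -> float:
    """Length of the recording, useful for totaling minutes of clean speech."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Summing `duration_seconds` across the accepted files gives a quick check against the 30-minute minimum.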

Recording script design:

  • Include phonetically balanced sentences that cover all sounds in the target language
  • Include domain-specific sentences that contain terminology the voice will need to pronounce
  • Include varied sentence structures: questions, statements, lists, exclamations
  • Include number-heavy content if the voice will read reports with numerical data

Production Architecture

Batch Generation Pipeline

For applications that generate audio content in advance (reports, notifications, content creation):

  1. Text content arrives in a processing queue
  2. Text normalization pipeline prepares the content for synthesis
  3. SSML markup is applied for pronunciation control
  4. Audio is generated using the TTS model or API
  5. Audio post-processing: volume normalization, silence trimming, format conversion
  6. Audio quality validation: check duration, volume levels, detect artifacts
  7. Delivery to the client's systems (storage, CDN, notification service)
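The seven steps above can be sketched as a single function that takes each stage as a callable. Everything here is a structural illustration: the `Job` class, the stage names, and the stub stages in the usage example are hypothetical, and a real pipeline would add retries and persistent queue handling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Job:
    job_id: str
    text: str
    audio: bytes = b""
    errors: List[str] = field(default_factory=list)


def run_pipeline(job: Job,
                 normalize: Callable[[str], str],
                 to_ssml: Callable[[str], str],
                 synthesize: Callable[[str], bytes],
                 post_process: Callable[[bytes], bytes],
                 validate: Callable[[bytes], List[str]],
                 deliver: Callable[[Job], None]) -> Job:
    """Run one queued job through steps 2 to 7; deliver only if validation passes."""
    prepared = to_ssml(normalize(job.text))
    job.audio = post_process(synthesize(prepared))
    job.errors = validate(job.audio)
    if not job.errors:
        deliver(job)
    return job


# Usage with stub stages standing in for the real components.
delivered = []
result = run_pipeline(
    Job("report-001", "Your balance is $42."),
    normalize=lambda t: t.replace("$42", "forty-two dollars"),
    to_ssml=lambda t: f"<speak>{t}</speak>",
    synthesize=lambda ssml: ssml.encode("utf-8"),   # placeholder for a TTS call
    post_process=lambda audio: audio,
    validate=lambda audio: [] if audio else ["empty audio"],
    deliver=delivered.append,
)
print(result.job_id, "delivered" if delivered else f"failed: {result.errors}")
```

Passing stages in as callables keeps the orchestration independent of whether synthesis is an API call or a self-hosted model.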

Throughput considerations:

  • API-based TTS: rate limits vary by provider (typically 10-100 concurrent requests)
  • Self-hosted TTS on GPU: 10-50x faster than real time (1 minute of audio generated in roughly 1.2-6 seconds)
  • Plan batch generation during off-peak hours to minimize infrastructure costs
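To stay under a provider's concurrency limit during batch generation, requests can be gated with a semaphore. The sketch below uses `asyncio`; the `synthesize` coroutine is a stand-in for a real API call, and the limit of 10 is an assumed value to be replaced with the provider's documented limit.

```python
import asyncio
from typing import Awaitable, Callable, List


async def synthesize_all(texts: List[str],
                         synthesize: Callable[[str], Awaitable[bytes]],
                         max_concurrent: int = 10) -> List[bytes]:
    """Run synthesis for every text with at most max_concurrent in flight.

    Results are returned in the same order as the input texts.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(text: str) -> bytes:
        async with semaphore:
            return await synthesize(text)

    return await asyncio.gather(*(worker(t) for t in texts))


async def fake_synthesize(text: str) -> bytes:
    """Placeholder that simulates network latency instead of calling an API."""
    await asyncio.sleep(0.01)
    return text.encode("utf-8")


clips = asyncio.run(synthesize_all(["first", "second", "third"], fake_synthesize,
                                   max_concurrent=2))
print(len(clips), "clips generated")
```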

Streaming Architecture

For real-time applications (virtual assistants, IVR systems):

  • Receive text input via WebSocket or gRPC stream
  • Chunk text at natural break points (sentence or clause boundaries)
  • Generate audio for each chunk as soon as it is available
  • Stream audio chunks to the client with minimal buffering
  • Target time-to-first-byte under 300ms for conversational applications
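The chunking step can be sketched as a generator that emits a chunk as soon as a clause boundary arrives, so synthesis can start before the full text is available. The `min_chars` threshold and the boundary pattern are assumptions; very short chunks tend to sound choppy, so the threshold would need tuning per voice.

```python
import re
from typing import Iterable, Iterator

# Sentence or clause punctuation followed by whitespace.
CLAUSE_END = re.compile(r"[.!?;:,]\s")


def chunk_stream(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield speakable chunks from incrementally arriving text."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # First boundary that leaves a chunk of at least min_chars.
            boundary = next((m for m in CLAUSE_END.finditer(buffer)
                             if m.end() >= min_chars), None)
            if boundary is None:
                break
            yield buffer[:boundary.end()].strip()
            buffer = buffer[boundary.end():]
    if buffer.strip():
        yield buffer.strip()   # flush whatever remains at end of input


incoming = ["Your balance ", "is up, and ", "your next payment ",
            "is due Friday. ", "Anything else?"]
for chunk in chunk_stream(incoming, min_chars=10):
    print(repr(chunk))
```

Each yielded chunk would be sent to the synthesizer immediately, with the resulting audio streamed to the client while later chunks are still arriving.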

Audio Post-Processing

Volume normalization: Normalize all generated audio to a consistent loudness level (typically -16 to -14 LUFS for spoken content). This ensures consistent volume across different generated clips.
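The gain calculation behind level normalization can be sketched as below. Note the swap: this uses plain RMS level in dBFS as a rough stand-in, not true LUFS, which requires the frequency weighting and gating defined in ITU-R BS.1770. Samples are assumed to be floats between -1.0 and 1.0, and the function names are hypothetical.

```python
import math


def rms_dbfs(samples: list) -> float:
    """RMS level in dB relative to full scale."""
    if not samples:
        raise ValueError("no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square)


def normalize_level(samples: list, target_dbfs: float = -16.0) -> list:
    """Scale samples toward the target RMS level without exceeding full scale."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)          # pure silence: nothing to scale
    gain = 10 ** ((target_dbfs - current) / 20)
    peak = max(abs(s) for s in samples)
    gain = min(gain, 1.0 / peak)      # cap the gain so no sample clips
    return [s * gain for s in samples]


# One second of a quiet 440 Hz tone at a 16 kHz sample rate.
tone = [0.1 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
louder = normalize_level(tone)
print(f"before: {rms_dbfs(tone):.1f} dBFS, after: {rms_dbfs(louder):.1f} dBFS")
```

For delivery-grade loudness targets, a BS.1770-compliant meter should replace the RMS measurement while the gain logic stays the same.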

Silence management: Trim excessive silence from the beginning and end of clips. Normalize inter-sentence pauses to consistent durations.
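Trimming leading and trailing silence can be sketched with a simple amplitude threshold. The threshold and `keep` padding values are assumptions to tune per voice, and samples are assumed to be floats between -1.0 and 1.0.

```python
def trim_silence(samples: list, threshold: float = 0.01, keep: int = 0) -> list:
    """Drop samples before the first and after the last one above threshold.

    keep is the number of quiet samples to retain on each side, which
    avoids cutting off soft consonants at the edges.
    """
    voiced = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not voiced:
        return []
    start = max(voiced[0] - keep, 0)
    end = min(voiced[-1] + 1 + keep, len(samples))
    return samples[start:end]


clip = [0.0, 0.0, 0.5, -0.4, 0.0, 0.3, 0.0]
print(trim_silence(clip))
print(trim_silence(clip, keep=1))
```

Silence inside the clip is left alone here; normalizing inter-sentence pauses needs a separate pass that measures the length of each quiet run.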

Format conversion: Convert from the model's native output format to the required delivery format (MP3 for web, WAV for broadcast, AAC for mobile).

Quality detection: Automatically detect audio artifacts (clicks, pops, unnatural pauses, pitch glitches) and flag affected clips for regeneration.

Monitoring and Quality Assurance

Automated Quality Metrics

  • Mean Opinion Score (MOS) prediction: Use a neural MOS prediction model to estimate the perceived quality of generated audio without human listeners
  • Pronunciation accuracy: Compare generated audio against expected phonetic transcription using forced alignment
  • Duration consistency: Monitor the duration of generated audio relative to text length โ€” sudden changes may indicate model degradation
  • Artifact detection: Monitor for clipping, silence gaps, and spectral anomalies
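Of the metrics above, duration consistency is the simplest to sketch: track characters spoken per second and flag clips that deviate sharply from a baseline. The z-score threshold of 3 and the function names are assumptions, and the baseline would come from clips already confirmed as good.

```python
from statistics import mean, stdev


def speaking_rate(text: str, duration_seconds: float) -> float:
    """Characters of input text per second of generated audio."""
    return len(text) / duration_seconds


def is_rate_outlier(rate: float, baseline_rates: list,
                    z_threshold: float = 3.0) -> bool:
    """Flag a clip whose rate is far from the baseline mean.

    baseline_rates needs at least two values for a standard deviation.
    """
    mu = mean(baseline_rates)
    sigma = stdev(baseline_rates)
    if sigma == 0:
        return rate != mu
    return abs(rate - mu) / sigma > z_threshold


baseline = [14.8, 15.1, 15.0, 14.9, 15.2]   # hypothetical known-good clips
print(is_rate_outlier(speaking_rate("a" * 150, 10.0), baseline))   # typical pace
print(is_rate_outlier(speaking_rate("a" * 150, 18.75), baseline))  # much slower
```

A clip that is far too short for its text often indicates dropped words, while one that is far too long often indicates inserted pauses or repeated audio.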

Human Quality Review

  • Sample 1-2% of generated audio for human review
  • Have reviewers rate naturalness, pronunciation accuracy, and appropriateness
  • Track quality scores over time to detect gradual degradation
  • Focus reviews on content types that are most likely to have issues (heavy numerical content, domain terminology, proper nouns)
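Review sampling can be made deterministic by hashing each clip identifier, so the same clip is always in or out of the sample regardless of when the job runs. The per-content-type rates below are hypothetical values that oversample the riskier categories named above.

```python
import hashlib

# Hypothetical review rates; riskier content types are sampled more heavily.
REVIEW_RATES = {"numerical": 0.05, "proper_nouns": 0.05, "general": 0.02}


def selected_for_review(clip_id: str, rate: float = 0.02) -> bool:
    """Map the clip ID to a stable value in [0, 1) and compare with the rate."""
    digest = hashlib.sha256(clip_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < rate


def needs_review(clip_id: str, content_type: str) -> bool:
    rate = REVIEW_RATES.get(content_type, REVIEW_RATES["general"])
    return selected_for_review(clip_id, rate)


sampled = sum(selected_for_review(f"clip-{i}") for i in range(10_000))
print(f"{sampled} of 10,000 clips selected for review")
```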

Your Next Step

Take 20 representative text samples from your client's actual content, the exact text that the TTS system will need to speak. Run them through three TTS providers (ElevenLabs, Google Cloud TTS, and one open-source option like XTTS). Have the client listen to all three versions of each sample and rate them on naturalness, pronunciation accuracy, and appropriateness for their brand. This 2-hour evaluation gives you the data to make an informed provider selection and identify the text normalization challenges specific to the client's content. Do this before committing to a provider or estimating a delivery timeline.
