A healthcare-focused AI agency in Boston was contracted by a hospital network to build an automated transcription system for clinical dictations. Physicians dictated patient notes using a mix of medical terminology, abbreviations, drug names, and conversational asides. The existing transcription service used human transcriptionists at $1.20 per minute of audio, costing the hospital $2.8 million annually with a 24-48 hour turnaround time. The agency's initial approach, feeding audio directly to a general-purpose speech-to-text API, produced transcripts with 82% word accuracy. That sounds acceptable until you realize that the 18% of words it got wrong were concentrated in medical terminology, drug names, and dosages: exactly the words that matter most. "Metformin 500mg twice daily" transcribed as "met for men 500 MG twice daily" is not just an error; it is a patient safety risk. After fine-tuning a Whisper model on 4,000 hours of clinical dictation data, building a medical vocabulary correction layer, and implementing speaker diarization for multi-speaker clinical encounters, the system achieved 97% accuracy on medical terminology, processed 8,000 dictations daily with a median turnaround of 3 minutes, and reduced the hospital's transcription costs by 78%.
Speech-to-text (STT), also called automatic speech recognition (ASR), converts audio into text. For AI agencies, enterprise STT systems are high-value deliverables because nearly every organization generates significant volumes of spoken content that needs to be transcribed, analyzed, and integrated into business workflows. But the gap between a generic STT API and a production system that handles domain-specific vocabulary, noisy environments, multiple speakers, and enterprise integration requirements is where the real engineering happens.
Scoping STT Projects
Audio Environment Assessment
The acoustic environment determines the difficulty of the transcription task and the expected accuracy level.
Clean audio (studio recording, professional microphone, quiet environment):
- Expected word error rate (WER): 3-8% with a general model, 1-5% with a fine-tuned model
- Examples: podcast recordings, professional voiceovers, quiet office dictation
Moderate noise (office environment, standard microphone, some background noise):
- Expected WER: 8-15% with a general model, 4-8% with a fine-tuned model
- Examples: conference room meetings, phone calls, video conferences
High noise (industrial environments, crowd noise, poor microphone, multiple overlapping speakers):
- Expected WER: 15-30% with a general model, 8-15% with a fine-tuned model
- Examples: factory floor communication, field recordings, public event transcription
Telephony audio (8kHz sample rate, compression artifacts, variable connection quality):
- Expected WER: 10-20% with a general model, 5-10% with a fine-tuned model
- Examples: call center recordings, phone consultations, emergency dispatch
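The WER figures above are easier to reason about with the calculation in hand. The following is a minimal sketch of the standard definition (substitutions, deletions, and insertions divided by the number of reference words), assuming simple lowercase whitespace tokenization; the function name `word_error_rate` is illustrative, and production evaluations usually apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# The drug-name error from the opening example: 2 substitutions + 3 insertions
# against a 4-word reference.
print(word_error_rate("Metformin 500mg twice daily",
                      "met for men 500 MG twice daily"))  # 1.25
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, which is exactly what happens when one drug name is split into several common words.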
Vocabulary Assessment
Domain-specific vocabulary is the primary driver of customization needs. General STT models are trained on common English (or other language) vocabulary and struggle with specialized terms.
High-specialization domains:
- Medical: Drug names, anatomical terms, procedure codes, abbreviations
- Legal: Latin terms, case citations, legal terminology
- Financial: Ticker symbols, financial instruments, regulatory terms
- Technical: Software names, technical specifications, industry jargon
For each domain, create a vocabulary inventory:
- List all domain-specific terms the system must transcribe correctly
- Include common abbreviations and their expansions
- Include proper nouns (company names, product names, person names)
- Note any terms that sound like common English words but have domain-specific meanings
Requirements Definition
Functional requirements:
- Supported languages: Which languages and accents must the system handle?
- Speaker diarization: Does the system need to identify who is speaking?
- Punctuation and formatting: Does the output need punctuation, capitalization, and paragraph breaks?
- Timestamps: Does the output need word-level or segment-level timestamps?
- Real-time vs. batch: Does the system need to transcribe in real-time (live captions) or can it process audio after recording?
- Downstream integration: What systems consume the transcript? (search index, NLP pipeline, EHR system, CRM)
Non-functional requirements:
- Latency: For real-time transcription, what is the acceptable delay from speech to text? (200ms-2s typical)
- Throughput: How many hours of audio per day need to be transcribed?
- Accuracy target: What WER is acceptable? Define targets both for general vocabulary and for domain-specific terms.
- Availability: What uptime is required? (99.9% for clinical transcription, 99% for meeting transcription)
Architecture and Model Selection
Model Options
OpenAI Whisper: Open-source, multilingual, strong general-purpose accuracy. Available in multiple sizes (tiny to large-v3). Can be fine-tuned on domain-specific data. The default starting point for most agency STT projects.
Whisper variants and derivatives:
- Faster-Whisper: CTranslate2-based reimplementation that runs up to 4x faster than standard Whisper at the same accuracy
- WhisperX: Adds word-level timestamps and speaker diarization to Whisper
- Distil-Whisper: Distilled version that is 6x faster with minimal accuracy loss
Cloud STT APIs:
- Google Cloud Speech-to-Text: Strong accuracy, supports real-time streaming, medical and phone-call models available
- AWS Transcribe: Good for AWS-native architectures, supports medical transcription, custom vocabulary
- Azure Speech Service: Strong for Microsoft ecosystem integration, custom model training
Specialized models:
- NVIDIA NeMo: Open-source toolkit with pre-trained ASR models that can be fine-tuned for specific domains (medical, financial)
- AssemblyAI: API with strong speaker diarization and content safety detection
- Deepgram: Optimized for speed and real-time transcription
Model Selection Framework
Choose Whisper (self-hosted) when:
- You need to fine-tune on domain-specific data for maximum accuracy
- Data privacy requirements prohibit sending audio to external APIs
- You need to control costs at high volume (self-hosted inference is cheaper than API pricing above approximately 10,000 hours per month)
- You want maximum flexibility in preprocessing and post-processing
Choose a cloud API when:
- Speed to deployment is the priority
- The domain vocabulary is not highly specialized
- Volume is moderate (under 10,000 hours per month)
- You want managed infrastructure with SLA guarantees
Choose a specialized model when:
- The domain has very specific accuracy requirements (medical, legal)
- The audio conditions require specialized handling (telephony, real-time streaming)
- The vendor offers features that would be expensive to build (speaker diarization, content moderation)
Fine-Tuning for Domain Accuracy
Fine-tuning a general STT model on domain-specific audio-transcript pairs is the most effective way to improve accuracy on specialized vocabulary.
Training data requirements:
- Minimum: 50 hours of domain-specific audio with accurate transcripts
- Recommended: 200-1,000 hours for production-quality fine-tuning
- Optimal: 1,000-5,000 hours for maximum domain accuracy
Training data sources:
- Existing transcribed recordings from the client (the gold standard if transcripts are accurate)
- Human transcription of a sample of the client's audio (expensive but high quality)
- Synthetic audio generated by TTS systems reading domain-specific text (useful for vocabulary coverage but less effective for acoustic domain adaptation)
Fine-tuning process:
- Prepare audio-transcript pairs in the required format
- Split into training (80%), validation (10%), and test (10%) sets
- Fine-tune the base Whisper model using the Hugging Face Transformers library or similar framework
- Monitor validation WER during training, and stop when validation WER plateaus
- Evaluate on the held-out test set and compare to the base model
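Two of the steps above, the 80/10/10 split and the "stop when validation WER plateaus" rule, can be sketched in plain Python. This is a sketch under stated assumptions, not a training recipe: the helper names `split_dataset` and `has_plateaued`, and the `patience` and `min_delta` defaults, are hypothetical choices for illustration. In practice you may also want to split by speaker so the same voice does not appear in both training and test sets.

```python
import random


def split_dataset(pairs, seed=0, train_frac=0.8, val_frac=0.1):
    """Shuffle audio-transcript pairs deterministically and split 80/10/10."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains is the held-out test set
    return train, val, test


def has_plateaued(val_wers, patience=3, min_delta=0.001):
    """True when the last `patience` evaluations did not beat the earlier
    best validation WER by at least `min_delta`."""
    if len(val_wers) <= patience:
        return False
    best_before = min(val_wers[:-patience])
    best_recent = min(val_wers[-patience:])
    return best_recent > best_before - min_delta


pairs = [(f"clip_{i}.wav", f"transcript {i}") for i in range(100)]
train, val, test = split_dataset(pairs)
print(len(train), len(val), len(test))  # 80 10 10
```

A call such as `has_plateaued([0.20, 0.15, 0.12, 0.12, 0.121, 0.12])` returns True, because the three most recent evaluations never improved on the earlier best of 0.12.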
Expected improvement from fine-tuning:
- General vocabulary: 10-30% relative WER reduction
- Domain-specific terms: 30-60% relative WER reduction
- The improvement is largest for terms that the general model has never seen
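These improvements are relative, not absolute, which is easy to misread when setting client expectations. A small worked sketch, with illustrative function names and example numbers:

```python
def relative_wer_reduction(base_wer: float, tuned_wer: float) -> float:
    """Fraction of the baseline error rate removed by fine-tuning."""
    if base_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (base_wer - tuned_wer) / base_wer


def wer_after_reduction(base_wer: float, relative_reduction: float) -> float:
    """WER you would expect after applying a relative reduction."""
    return base_wer * (1 - relative_reduction)


# An 18% baseline WER with a 50% relative reduction lands at 9%, not at -32%.
print(round(wer_after_reduction(0.18, 0.50), 4))     # 0.09
# Going from 10% WER to 6% WER is a 40% relative reduction.
print(round(relative_wer_reduction(0.10, 0.06), 4))  # 0.4
```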
Audio Processing Pipeline
Preprocessing
Audio format normalization:
- Convert all input audio to a standard format (16kHz sample rate, mono channel, 16-bit PCM)
- Handle diverse input formats (MP3, WAV, FLAC, M4A, OGG, raw PCM)
- Validate audio quality: reject corrupted files, empty files, or files below minimum duration
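As an illustration of the validation step, the sketch below checks a WAV file against the target format using only Python's standard `wave` module. It assumes the input is already a WAV container (other formats would need decoding first, for example with ffmpeg), and the `MIN_DURATION_S` threshold and the `check_wav` name are hypothetical.

```python
import wave

TARGET_RATE = 16_000     # 16kHz sample rate
TARGET_CHANNELS = 1      # mono
TARGET_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit PCM
MIN_DURATION_S = 0.5     # hypothetical minimum; tune per project


def check_wav(path):
    """Return a list of problems; an empty list means the file is ready."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            width = wav.getsampwidth()
            frames = wav.getnframes()
    except (wave.Error, EOFError, OSError) as exc:
        # Corrupted, truncated, missing, or non-WAV input.
        return [f"unreadable file: {exc}"]

    problems = []
    duration = frames / rate if rate else 0.0
    if frames == 0:
        problems.append("empty file")
    elif duration < MIN_DURATION_S:
        problems.append(f"too short: {duration:.2f}s")
    if rate != TARGET_RATE:
        problems.append(f"sample rate {rate} needs resampling to {TARGET_RATE}")
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels need downmixing to mono")
    if width != TARGET_SAMPLE_WIDTH:
        problems.append(f"{8 * width}-bit samples need conversion to 16-bit")
    return problems
```

Returning a list of problems rather than a boolean lets the pipeline decide which issues are fixable (resampling, downmixing) and which should cause a rejection.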
Noise reduction:
- Apply noise reduction for audio with significant background noise
- Use a neural noise reduction model (RNNoise, DeepFilterNet) for general noise
- Apply spectral gating for stationary noise (fan noise, hum)
- Be cautious: aggressive noise reduction can distort speech and reduce accuracy
Voice Activity Detection (VAD):
- Detect segments of audio that contain speech vs. silence or noise
- Skip non-speech segments to reduce processing time and cost
- Use a lightweight VAD model (Silero VAD is fast and accurate) to segment audio before transcription
- VAD is essential for long recordings with significant non-speech portions (meetings with breaks, phone calls on hold)
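To make the idea concrete, here is a deliberately simple energy-based VAD. It is not Silero VAD or any neural model; it is a stand-in that thresholds per-frame RMS energy, and it will be fooled by loud non-speech noise. The function name, the threshold, and the frame sizes are assumptions chosen for illustration.

```python
import math


def detect_speech(samples, sample_rate=16_000, frame_ms=30,
                  threshold=0.02, min_speech_frames=3):
    """Return (start_s, end_s) spans whose frame RMS energy meets `threshold`.

    `samples` is a sequence of floats in [-1.0, 1.0].
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    run_start = None

    # One extra iteration acts as a silent sentinel that closes a trailing run.
    for idx in range(n_frames + 1):
        if idx < n_frames:
            frame = samples[idx * frame_len:(idx + 1) * frame_len]
            rms = math.sqrt(sum(s * s for s in frame) / frame_len)
            active = rms >= threshold
        else:
            active = False

        if active and run_start is None:
            run_start = idx
        elif not active and run_start is not None:
            # Ignore very short bursts such as clicks.
            if idx - run_start >= min_speech_frames:
                segments.append((run_start * frame_len / sample_rate,
                                 idx * frame_len / sample_rate))
            run_start = None
    return segments
```

Only the returned spans would be sent to the transcription model; everything outside them is skipped.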
Speaker Diarization
Speaker diarization identifies who is speaking when. It is essential for meetings, interviews, clinical encounters, and any multi-speaker audio.
Diarization approaches:
- Embedding-based clustering: Extract speaker embeddings for each speech segment, then cluster embeddings to identify distinct speakers. This is the traditional approach and works well for 2-10 speakers.
- End-to-end neural diarization: Use a single neural network to jointly detect speech, identify speakers, and handle overlapping speech. More accurate than clustering-based approaches for complex multi-speaker scenarios.
- Prompted diarization: If speaker identities are known in advance (doctor and patient, interviewer and interviewee), provide speaker enrollment samples to the diarization system for higher accuracy.
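The embedding-based clustering approach can be illustrated with toy vectors. The sketch below uses a greedy, single-pass clustering on cosine similarity; real pipelines typically extract neural speaker embeddings and use agglomerative or spectral clustering instead. The `threshold` value and function names are assumptions, and the 2-dimensional "embeddings" are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_speakers(embeddings, threshold=0.8):
    """Assign each segment embedding to a speaker index.

    A segment joins the most similar existing speaker if the similarity
    reaches `threshold`; otherwise it starts a new speaker.
    """
    centroids = []  # running sum of embeddings per speaker
    labels = []
    for emb in embeddings:
        scores = [cosine(emb, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
            labels.append(best)
        else:
            centroids.append(list(emb))
            labels.append(len(centroids) - 1)
    return labels


# Four speech segments alternating between two voices.
segments = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
print(cluster_speakers(segments))  # [0, 1, 0, 1]
```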
Diarization tools:
- pyannote.audio: Open-source, state-of-the-art diarization with support for overlapping speech detection
- NeMo Speaker Diarization: NVIDIA's open-source diarization pipeline
- WhisperX: Integrates Whisper transcription with word-level diarization
Punctuation and Formatting
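Raw output from many STT models is lowercase text without punctuation. Whisper does emit punctuation and capitalization, but its output may not follow domain conventions. Enterprise applications need consistently formatted output.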
Punctuation restoration:
- Use a fine-tuned language model to add punctuation (periods, commas, question marks) to the raw transcript
- Train on domain-specific data for best results, since medical dictation has different punctuation patterns than business meetings
- Common models: fine-tuned BERT or T5 for punctuation prediction
Formatting:
- Capitalize the first word of each sentence and proper nouns
- Format numbers, dates, currency values, and measurements according to domain conventions
- Apply paragraph breaks based on topic changes or speaker turns
- Format domain-specific content (medical notes have a specific structure, legal transcripts have specific formatting requirements)
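A rule-based pass covers the simplest of these formatting steps. The sketch below handles only sentence-initial capitalization and a supplied list of proper nouns; it assumes punctuation has already been restored, and it will mis-capitalize after abbreviations that end in a period. The `format_transcript` name and the proper-noun list are illustrative.

```python
import re


def format_transcript(text, proper_nouns=()):
    """Capitalize sentence starts and restore casing of known proper nouns."""
    # Uppercase the first letter of the text and of any word after . ! or ?
    text = re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)
    # Replace case-insensitive matches with the canonical spelling.
    for noun in proper_nouns:
        text = re.sub(rf"\b{re.escape(noun)}\b",
                      lambda _m, canonical=noun: canonical,
                      text,
                      flags=re.IGNORECASE)
    return text


print(format_transcript(
    "the patient was seen in boston. she takes metformin daily.",
    proper_nouns=["Boston"],
))
# The patient was seen in Boston. She takes metformin daily.
```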
Post-Processing and Correction
Custom Vocabulary Correction
Even after fine-tuning, STT models sometimes produce phonetically similar but incorrect substitutions for domain-specific terms.
Vocabulary correction pipeline:
- Maintain a domain-specific vocabulary list with canonical spellings
- For each word in the transcript, compute phonetic similarity to vocabulary entries (using Soundex, Metaphone, or phonetic embeddings)
- If a transcript word is phonetically similar to a vocabulary entry and not a common English word, replace it with the vocabulary entry
- Apply context-aware correction using a language model to resolve ambiguities (is "principal" a person or a financial term?)
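The first three steps of this pipeline can be sketched with American Soundex and the standard library. This is a minimal illustration, not a production corrector: it works word by word, so it cannot repair a term that was split into several words (such as "met for men"), it ignores punctuation attached to words, and it omits the language-model step entirely. The names `soundex` and `correct_transcript` and the tiny word lists are assumptions for the example.

```python
import difflib

_CODES = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}


def soundex(word):
    """American Soundex code (one letter plus three digits), or '' if no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            encoded += code
        if c not in "hw":  # h and w do not separate letters with the same code
            prev = code
    return (encoded + "000")[:4]


def correct_transcript(text, vocabulary, common_words):
    """Replace words that sound like a vocabulary term and are not common words."""
    by_code = {}
    for term in vocabulary:
        by_code.setdefault(soundex(term), []).append(term)
    known = {term.lower() for term in vocabulary}

    corrected = []
    for word in text.split():
        key = word.lower()
        candidates = by_code.get(soundex(key), [])
        if key in known or key in common_words or not candidates:
            corrected.append(word)
            continue
        # Several terms can share a code; keep the closest spelling.
        best = max(candidates,
                   key=lambda t: difflib.SequenceMatcher(None, key, t.lower()).ratio())
        corrected.append(best)
    return " ".join(corrected)


vocab = ["metformin", "lisinopril", "atorvastatin"]
common = {"patient", "takes", "and", "daily", "a", "metaphor"}
print(correct_transcript("patient takes metforman and lisinopril daily", vocab, common))
# patient takes metformin and lisinopril daily
```

Soundex is coarse: "metaphor" and "metformin" share the code M316, which is why the common-word guard in the third step matters. Metaphone or phonetic embeddings would reduce such collisions.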
Vocabulary list management:
- Start with the client's existing terminology lists, glossaries, and style guides
- Expand by analyzing transcription errors in the first weeks of production
- Update regularly as new terms enter the domain (new drug names, new product names, new regulations)
Confidence-Based Human Review
Not every transcript needs human review. Use confidence scores to route uncertain transcripts efficiently.
Confidence scoring:
- Aggregate word-level confidence scores from the STT model
- Weight domain-specific terms higher than common words in the confidence calculation
- Factor in audio quality metrics (SNR, clarity) as a predictor of transcription accuracy
Review routing:
- High confidence (above 95%): Auto-accept with no review
- Medium confidence (80-95%): Auto-accept but include in periodic batch review
- Low confidence (below 80%): Route to human reviewer immediately
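The scoring and routing rules above might look like the following sketch. It assumes the STT model exposes a confidence value per word, uses a hypothetical `domain_weight` of 3.0, and leaves out the audio quality factor for brevity; the queue names are placeholders.

```python
def transcript_confidence(words, domain_terms, domain_weight=3.0):
    """Weighted mean of word confidences; domain terms count `domain_weight` times.

    `words` is a list of (token, confidence) pairs with confidence in [0, 1].
    """
    total = 0.0
    weight_sum = 0.0
    for token, confidence in words:
        weight = domain_weight if token.lower() in domain_terms else 1.0
        total += weight * confidence
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def route(confidence):
    """Map a transcript-level confidence to a review queue."""
    if confidence > 0.95:
        return "auto_accept"
    if confidence >= 0.80:
        return "batch_review"
    return "human_review"


words = [("patient", 0.99), ("takes", 0.98), ("metformin", 0.60)]
score = transcript_confidence(words, domain_terms={"metformin"})
print(round(score, 3), route(score))  # 0.754 human_review
```

The weighting is what sends this transcript to a reviewer: an unweighted mean of the same three words is about 0.857, which would only reach the periodic batch review.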
Production Deployment
Real-Time Transcription Architecture
For applications that need live transcription (live captions, real-time note-taking):
- Stream audio chunks to the STT model in overlapping windows
- Use a streaming-capable model (Whisper with streaming adapter, Google Streaming STT, or Deepgram)
- Display partial transcripts that update as more audio context becomes available
- Buffer 2-5 seconds of audio before starting transcription to provide initial context
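The overlapping-window chunking can be sketched as a generator of sample index ranges. The 5-second window and 1-second overlap are assumed values for illustration, not recommendations from any particular streaming model.

```python
def overlapping_windows(total_samples, window_s=5.0, overlap_s=1.0,
                        sample_rate=16_000):
    """Yield (start, end) sample indices for overlapping audio windows."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")

    start = 0
    while start < total_samples:
        end = min(start + window, total_samples)
        yield start, end
        if end == total_samples:
            break  # the last window reached the end of the audio
        start += step


# 12 seconds of 16kHz audio in 5-second windows with 1 second of overlap.
print(list(overlapping_windows(12 * 16_000)))
# [(0, 80000), (64000, 144000), (128000, 192000)]
```

The overlap gives the model context across window boundaries, so a word cut in half at the end of one window appears whole in the next; the partial transcripts then need to be merged in the overlapping region.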
Batch Processing Architecture
For applications that process recorded audio:
- Accept audio uploads via API or file drop
- Queue audio files for processing
- Process with GPU-accelerated STT workers
- Apply post-processing (punctuation, formatting, vocabulary correction)
- Deliver transcripts to the client's systems
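The queue-and-workers shape of this pipeline can be shown with the standard library alone. In the sketch below, `transcribe`, `postprocess`, and `deliver` are hypothetical callables supplied by the caller; threads stand in for GPU workers, and a production system would more likely use a durable message queue with retries.

```python
import queue
import threading


def run_batch(audio_paths, transcribe, postprocess, deliver, num_workers=2):
    """Process queued audio files with a pool of workers; return failures."""
    jobs = queue.Queue()
    for path in audio_paths:
        jobs.put(path)

    failures = []

    def worker():
        while True:
            try:
                path = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                deliver(path, postprocess(transcribe(path)))
            except Exception as exc:  # keep one bad file from stopping the batch
                failures.append((path, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return failures


delivered = {}
failed = run_batch(
    ["a.wav", "b.wav", "c.wav"],
    transcribe=lambda path: f"text of {path}",  # stand-in for the STT model
    postprocess=str.strip,                      # stand-in for formatting steps
    deliver=delivered.__setitem__,              # stand-in for the client system
)
print(sorted(delivered), failed)  # ['a.wav', 'b.wav', 'c.wav'] []
```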
Throughput benchmarks:
- Whisper large-v3 on A10G GPU: approximately 10-30x real-time (1 hour of audio processed in 2-6 minutes)
- Faster-Whisper large-v3 on A10G GPU: approximately 40-100x real-time (1 hour of audio processed in 36-90 seconds)
- Distil-Whisper on A10G GPU: approximately 60-150x real-time
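The real-time factors above translate into processing time and GPU counts with simple arithmetic. In this sketch the `utilization` figure of 0.7 is an assumed planning margin, not a measured value.

```python
import math


def processing_seconds(audio_hours, realtime_factor):
    """Wall-clock seconds to transcribe `audio_hours` at a given speedup."""
    return audio_hours * 3600 / realtime_factor


def gpus_needed(audio_hours_per_day, realtime_factor, utilization=0.7):
    """GPUs required to keep up with a daily volume at the given utilization."""
    gpu_hours = audio_hours_per_day / realtime_factor
    return math.ceil(gpu_hours / (24 * utilization))


print(processing_seconds(1, 40))   # 90.0 seconds at 40x real-time
print(processing_seconds(1, 100))  # 36.0 seconds at 100x real-time
print(gpus_needed(1000, 40))       # 2 GPUs for 1,000 audio hours per day
```

At 40x real-time, 1,000 hours of daily audio needs 25 GPU-hours, which is more than one GPU can deliver at 70% utilization (16.8 hours), so the estimate rounds up to two.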
Monitoring
Metrics to track:
- Word error rate (from human review samples): The primary quality metric
- Processing latency: Time from audio submission to transcript delivery
- Throughput: Hours of audio processed per hour of wall time
- Domain-specific term accuracy: Separate WER for domain vocabulary
- Human review rate: Percentage of transcripts requiring human correction
- Cost per hour of audio transcribed
Your Next Step
Collect 2 hours of representative audio from your client's actual environment: not clean studio recordings, but real-world audio with the noise, accents, vocabulary, and speaking styles that the production system will encounter. Transcribe this audio using a general-purpose STT model (Whisper large-v3 is a strong free option). Have a domain expert review the transcripts and mark every error, categorizing errors as general vocabulary, domain terminology, proper nouns, or audio quality issues. This error analysis tells you exactly where to invest: if 70% of errors are on domain terminology, fine-tuning is the priority. If 70% are audio quality issues, preprocessing is the priority. If 70% are proper nouns, a custom vocabulary correction layer is the priority. Let the data tell you where to focus before you commit to an architecture.
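Once the expert has labeled the errors, the tally itself is a few lines of code. The category labels and the `summarize_errors` helper below are illustrative; the mapping covers only the three priorities named above and returns no recommendation when general vocabulary errors dominate.

```python
from collections import Counter

# Priorities taken from the guidance above; general vocabulary has no entry.
INVESTMENT = {
    "domain_terminology": "fine-tuning",
    "audio_quality": "preprocessing",
    "proper_nouns": "custom vocabulary correction",
}


def summarize_errors(error_categories):
    """Return each category's share of errors and the suggested priority."""
    counts = Counter(error_categories)
    total = sum(counts.values())
    if total == 0:
        return {}, None
    shares = {category: n / total for category, n in counts.items()}
    dominant = max(shares, key=shares.get)
    return shares, INVESTMENT.get(dominant)


labels = (["domain_terminology"] * 7
          + ["audio_quality"] * 2
          + ["proper_nouns"] * 1)
shares, priority = summarize_errors(labels)
print(shares["domain_terminology"], priority)  # 0.7 fine-tuning
```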