
Delivering Machine Translation Systems: Building Custom Translation Pipelines for Enterprise Localization

Agency Script Editorial

Editorial Team

March 20, 2026 · 11 min read
machine translation, localization, nlp, enterprise ai

A localization agency in Berlin was hired by a global industrial equipment manufacturer to translate their technical documentation across 14 languages. The manufacturer produced 2.4 million words of new documentation monthly: user manuals, safety guides, maintenance procedures, and training materials. Human translation at $0.18 per word across 14 languages cost $6 million annually, with a 4-6 week turnaround that delayed product launches in international markets. The agency built a custom machine translation system fine-tuned on the manufacturer's existing translation memory (8.2 million aligned sentence pairs accumulated over 12 years), integrated with a human post-editing workflow. Translation cost dropped to $0.058 per word, a 68% reduction. Turnaround dropped from 4-6 weeks to 5 days. Human evaluators rated the machine-translated, post-edited output as equivalent to fully human-translated output 94% of the time.

Machine translation (MT) systems convert text from one language to another. For AI agencies, custom MT solutions are high-value deliverables because localization is expensive, time-consuming, and critical for global businesses. Generic translation APIs (Google Translate, DeepL) work well for general content, but enterprise clients with specialized terminology, brand voice requirements, and quality standards need customized systems that learn their specific language patterns.

Scoping Translation Projects

Translation Quality Tiers

Not all translation needs the same quality level. Define the quality tier before choosing an approach.

Raw machine translation (no human review):

  • Quality: 70-85% of human translation quality
  • Use cases: Internal communications, content triage, gisting (understanding the general meaning of foreign-language content)
  • Cost: $0.005-0.02 per word

Machine translation with light post-editing (MTPE-Light):

  • Quality: 85-93% of human translation quality
  • Use cases: Knowledge base articles, internal documentation, high-volume content with moderate quality requirements
  • Cost: $0.03-0.06 per word

Machine translation with full post-editing (MTPE-Full):

  • Quality: 93-98% of human translation quality
  • Use cases: Marketing content, product documentation, customer-facing content
  • Cost: $0.06-0.12 per word

Human translation with MT assistance:

  • Quality: 98-100% of human translation quality
  • Use cases: Legal contracts, regulatory filings, medical documents, creative marketing copy
  • Cost: $0.12-0.25 per word
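
To make tier selection concrete during scoping, the cost ranges above can be turned into a quick budget estimate. The sketch below is illustrative only: the tier keys, the `QualityTier` class, and the `monthly_cost_range` helper are hypothetical names, and the example volume is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityTier:
    name: str
    cost_low: float   # USD per word, lower bound from the tier list above
    cost_high: float  # USD per word, upper bound from the tier list above


# Keys are hypothetical identifiers; the cost ranges come from the article.
TIERS = {
    "raw_mt": QualityTier("Raw machine translation", 0.005, 0.02),
    "mtpe_light": QualityTier("MTPE-Light", 0.03, 0.06),
    "mtpe_full": QualityTier("MTPE-Full", 0.06, 0.12),
    "human_mt_assist": QualityTier("Human translation with MT assistance", 0.12, 0.25),
}


def monthly_cost_range(tier_key: str, source_words: int, target_languages: int) -> tuple[float, float]:
    """Return the (low, high) monthly cost in USD for one quality tier."""
    tier = TIERS[tier_key]
    translated_words = source_words * target_languages
    return translated_words * tier.cost_low, translated_words * tier.cost_high


# Example volume is invented for illustration.
low, high = monthly_cost_range("mtpe_light", source_words=100_000, target_languages=3)
print(f"MTPE-Light: ${low:,.0f} to ${high:,.0f} per month")
```

Running the same estimate for two adjacent tiers gives the client a clear picture of what a higher quality target costs before any model work begins.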

Language Pair Assessment

Translation difficulty varies dramatically by language pair. Assess each pair's difficulty before committing to quality targets.

High-resource language pairs (abundant training data, mature MT models):

  • English to/from: French, German, Spanish, Portuguese, Chinese, Japanese
  • Expected quality: High even with generic models, excellent with customization

Medium-resource language pairs:

  • English to/from: Korean, Arabic, Turkish, Vietnamese, Thai, Polish
  • Expected quality: Good with generic models, very good with customization

Low-resource language pairs:

  • English to/from: Swahili, Khmer, Amharic, Burmese, many indigenous languages
  • Expected quality: Moderate with generic models, significant room for improvement with customization

Language-specific challenges:

  • Morphologically rich languages (Finnish, Turkish, Hungarian): Many word forms per lemma, requiring robust handling of inflections
  • Languages written without spaces between words (Chinese, Japanese, Thai): Require word segmentation or subword tokenization as a preprocessing step
  • Right-to-left languages (Arabic, Hebrew): Require bidirectional text handling
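  • Tonal languages (Chinese, Vietnamese): Standard written text already encodes tone (through characters or diacritics), but input with missing tone marks, romanization, or speech transcripts is ambiguous and requires strong contextual understanding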

Terminology Requirements

Enterprise translation requires consistent handling of domain-specific terminology.

Terminology inventory:

  • Extract all domain-specific terms from the client's existing documentation and translation memories
  • Create a terminology database with source terms, approved translations for each target language, definitions, and context examples
  • Include brand names, product names, and terms that should not be translated
  • Include terms with multiple valid translations and guidelines for when to use each
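
One possible shape for a terminology database record is sketched below. The `TermEntry` class, its field names, and the sample entries (including the brand name "AcmeLift") are assumptions made for illustration, not a standard termbase schema.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    source_term: str
    definition: str = ""
    # Target language code -> approved translations, preferred one first.
    translations: dict[str, list[str]] = field(default_factory=dict)
    # True for brand names, product names, and other terms kept as-is.
    do_not_translate: bool = False
    # Target language code -> guidance on when to use each alternative.
    usage_notes: dict[str, str] = field(default_factory=dict)
    context_examples: list[str] = field(default_factory=list)

    def approved(self, target_lang: str) -> list[str]:
        """Return the approved renderings of this term for one target language."""
        if self.do_not_translate:
            return [self.source_term]
        return self.translations.get(target_lang, [])


# Hypothetical sample entries.
termbase = [
    TermEntry(
        "torque wrench",
        definition="Hand tool that applies a set torque to a fastener.",
        translations={"de": ["Drehmomentschlüssel"]},
        context_examples=["Tighten the bolts with a torque wrench."],
    ),
    TermEntry("AcmeLift", do_not_translate=True),
]

print(termbase[0].approved("de"))
print(termbase[1].approved("fr"))
```

Keeping "do not translate" terms in the same structure as translated terms means later pipeline stages can query a single source for both.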

Model Architecture and Selection

Commercial Translation APIs

DeepL API: Generally regarded as one of the highest-quality commercial translation APIs for European languages. Strong support for formal/informal register, glossary enforcement, and document-level translation. Cost: $0.025 per 1,000 characters (roughly $0.00015 per word at about six characters per word).

Google Cloud Translation (Advanced): Broad language coverage (130+ languages), supports custom models trained on client data. Strong for Asian languages. Offers AutoML Translation for custom model training. Cost: $0.020 per 1,000 characters.

Amazon Translate: Good integration with AWS services, supports custom terminology and Active Custom Translation for adapting output with the client's parallel data. Cost: $0.015 per 1,000 characters.

Azure Translator: Strong enterprise features, custom translator with training on parallel data, document translation. Cost: $0.010 per 1,000 characters.

Open-Source Translation Models

NLLB (No Language Left Behind, Meta): Supports 200+ languages. Strong for low-resource language pairs. Self-hostable.

mBART / mT5: Multilingual models that can be fine-tuned for translation tasks. Good starting point for custom translation systems.

OPUS-MT (Helsinki-NLP): Collection of translation models for specific language pairs. Good baseline quality for many pairs. Self-hostable, lightweight.

Madlad-400: Large multilingual model supporting 400+ languages. Strong zero-shot and few-shot translation capabilities.

Custom Model Training

Fine-tuning a translation model on the client's domain-specific parallel data is the most effective way to improve terminology handling and style consistency.

Training data requirements:

  • Minimum: 10,000 parallel sentence pairs for noticeable quality improvement
  • Recommended: 100,000-500,000 pairs for production-quality customization
  • Optimal: 1,000,000+ pairs for maximum domain adaptation

Training data sources:

  • Translation memories (TMs): The gold standard. Client's existing TM databases contain aligned source-target sentence pairs that represent their approved translations.
  • Aligned parallel documents: Source documents and their official translations, aligned at the sentence level using tools like LF Aligner or Hunalign.
  • Bilingual glossaries: Term-level translations that can be used for terminology enforcement and training data augmentation.
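  • Synthetic parallel data: Use back-translation (machine-translate monolingual target-language text into the source language) to generate additional training pairs. The human-written text stays on the target side, which is the side the model learns to produce.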
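
Back-translation can be sketched in a few lines. The example assumes a reverse (target-to-source) translation function is available; `translate_to_source` is a hypothetical callable standing in for whatever model or API the project uses, and the toy dictionary exists only so the sketch runs.

```python
from typing import Callable, Iterable


def back_translate(
    target_sentences: Iterable[str],
    translate_to_source: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Build (synthetic_source, real_target) training pairs.

    The target side is genuine human-written text; only the source side
    is machine-generated.
    """
    pairs = []
    for target in target_sentences:
        target = target.strip()
        if not target:
            continue  # skip empty lines
        synthetic_source = translate_to_source(target).strip()
        if synthetic_source:
            pairs.append((synthetic_source, target))
    return pairs


# Stand-in for a real target-to-source MT model (assumption for the demo).
toy_reverse_model = {"Ventil schließen.": "Close the valve."}
pairs = back_translate(
    ["Ventil schließen.", "   "],
    lambda sentence: toy_reverse_model.get(sentence, ""),
)
print(pairs)
```

In practice, synthetic pairs are usually tagged or filtered so they can be weighted differently from the client's approved translation memory.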

Translation Pipeline Architecture

End-to-End Pipeline

Stage 1 - Source Text Preprocessing:

  • Extract text from source documents while preserving formatting markup
  • Segment text into translation units (typically sentences)
  • Identify non-translatable elements (code, URLs, file paths, brand names)
  • Tag protected terms that should be passed through without translation
  • Detect the source language (if not specified)
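
A common way to protect non-translatable elements is to swap them for placeholder tokens before translation and restore them afterwards. The sketch below is one minimal approach; the `__PHn__` token format, the regular expressions, and the sample brand name are assumptions, and real documents will need broader patterns.

```python
import re


def protect(text: str, brand_names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace URLs, file paths, and brand names with placeholder tokens."""
    patterns = [
        r"https?://\S+",               # URLs
        r"(?<!\S)(?:/[\w.-]+){2,}",    # Unix-style paths with two or more segments
    ]
    # Longer brand names first so they win over shorter overlapping ones.
    patterns += [re.escape(name) for name in sorted(brand_names, key=len, reverse=True)]
    combined = re.compile("|".join(patterns))
    mapping: dict[str, str] = {}

    def _swap(match: re.Match) -> str:
        token = f"__PH{len(mapping)}__"
        mapping[token] = match.group(0)
        return token

    return combined.sub(_swap, text), mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original protected strings back after translation."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


source = "Open /etc/acme/config.yaml in AcmeLift, see https://example.com/docs"
masked, mapping = protect(source, ["AcmeLift"])
print(masked)
print(restore(masked, mapping) == source)
```

Some MT engines alter or drop unfamiliar tokens, so it is worth verifying after translation that every placeholder in the mapping still appears in the output.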

Stage 2 - Pre-Translation Processing:

  • Look up translation memory for exact and fuzzy matches
  • Apply terminology glossary to identify term translations
  • Handle text normalization (number formats, date formats, measurement units)

Stage 3 - Machine Translation:

  • Translate each segment using the MT model
  • Apply terminology constraints (force the model to use approved term translations)
  • Generate confidence scores for each segment
  • Produce alternative translations for low-confidence segments

Stage 4 - Post-Translation Processing:

  • Apply quality checks (terminology consistency, number preservation, tag integrity)
  • Enforce glossary terms (replace incorrect term translations with approved translations)
  • Normalize formatting (punctuation conventions, capitalization rules for the target language)
  • Reconstruct the document with original formatting
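
Two of the quality checks above, number preservation and tag integrity, are straightforward to automate. The following is a rough sketch under simplifying assumptions: numbers are compared by their digits only (so locale-specific decimal separators do not trigger false alarms), tags are assumed to be simple XML/HTML-style markup, and the `qa_issues` function name is hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def _digit_strings(text: str) -> list[str]:
    """Extract numbers and keep only their digits, sorted for comparison."""
    return sorted(re.sub(r"\D", "", number) for number in NUMBER.findall(text))


def qa_issues(source: str, target: str) -> list[str]:
    """Return a list of issue labels for one translated segment."""
    issues = []
    if _digit_strings(source) != _digit_strings(target):
        issues.append("number mismatch")
    if sorted(TAG.findall(source)) != sorted(TAG.findall(target)):
        issues.append("tag mismatch")
    return issues


print(qa_issues("Tighten to <b>1.5</b> Nm", "Mit <b>1,5</b> Nm anziehen"))
print(qa_issues("Wait 30 seconds", "20 Sekunden warten"))
print(qa_issues("<b>Stop</b>", "<b>Stopp"))
```

A check like this will also flag legitimate unit conversions (for example inches to millimeters), so flagged segments are best routed to review rather than rejected automatically.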

Stage 5 - Quality Assessment and Routing:

  • Score each segment using automated quality metrics
  • Route segments by quality tier: auto-accept (high quality), light post-edit (medium quality), full post-edit (low quality)
  • Present segments for human review in a translation management interface
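
The routing step can be as simple as two thresholds on a per-segment quality score. In the sketch below the score is assumed to be normalized to the range 0 to 1, and both threshold values are placeholders to be calibrated against human review data for each language pair.

```python
def route_segment(
    quality_score: float,
    auto_accept_at: float = 0.85,   # placeholder threshold, calibrate per project
    light_edit_at: float = 0.70,    # placeholder threshold, calibrate per project
) -> str:
    """Map a normalized quality score (0 to 1) to a review queue."""
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be between 0 and 1")
    if quality_score >= auto_accept_at:
        return "auto_accept"
    if quality_score >= light_edit_at:
        return "light_post_edit"
    return "full_post_edit"


for score in (0.91, 0.76, 0.42):
    print(score, route_segment(score))
```

Starting with conservative thresholds and loosening them as post-editors confirm the auto-accepted segments keeps the risk of unreviewed errors low.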

Terminology Enforcement

Consistent terminology is the difference between professional and amateur translation output.

Terminology enforcement approaches:

  • Glossary injection: Include approved term translations in the translation prompt or model input as constraints
  • Post-translation replacement: After translation, replace any incorrect term translations with the approved translations from the glossary
  • Fine-tuning on glossary examples: Include sentence pairs demonstrating correct terminology usage in the fine-tuning data
  • Constrained decoding: Modify the model's decoding algorithm to force specific token sequences when translating specific terms

Terminology consistency monitoring:

  • Track terminology usage across all translated segments
  • Flag segments where approved terms are not used consistently
  • Generate terminology consistency reports for quality assurance review
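
A basic terminology monitor can be sketched as follows. It assumes a flat glossary of one approved target term per source term and uses case-insensitive substring matching, which will miss inflected forms; the `check_terminology` name and sample segments are invented for illustration.

```python
def check_terminology(
    segments: list[tuple[str, str]],
    glossary: dict[str, str],
) -> tuple[float, list[tuple[int, str, str]]]:
    """Return (term accuracy, flagged occurrences).

    segments: (source, target) pairs.
    glossary: source term -> approved target term.
    Each flagged item is (segment index, source term, expected target term).
    """
    checked = 0
    flagged = []
    for index, (source, target) in enumerate(segments):
        for source_term, target_term in glossary.items():
            if source_term.lower() in source.lower():
                checked += 1
                if target_term.lower() not in target.lower():
                    flagged.append((index, source_term, target_term))
    accuracy = 1.0 if checked == 0 else (checked - len(flagged)) / checked
    return accuracy, flagged


segments = [
    ("Replace the torque wrench.", "Ersetzen Sie den Drehmomentschlüssel."),
    ("Store the torque wrench dry.", "Lagern Sie den Schraubenschlüssel trocken."),
]
accuracy, flagged = check_terminology(segments, {"torque wrench": "Drehmomentschlüssel"})
print(accuracy, flagged)
```

For morphologically rich target languages, matching on lemmas instead of raw substrings would reduce false flags.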

Translation Memory Integration

Translation memories (TMs) store previously translated segments for reuse. Integrating TM with MT creates a hybrid system that maximizes both consistency and efficiency.

TM-MT integration workflow:

  1. For each source segment, search the TM for matches
  2. 100% match: Use the TM translation directly (no MT needed)
  3. Fuzzy match (75-99%): Use the TM match as a starting point and apply MT to adjust for differences
  4. No match (below 75%): Translate using MT
  5. After human review, store the approved translation in the TM for future reuse

TM leverage rates (percentage of segments with TM matches):

  • Highly repetitive content (product updates, policy revisions): 40-70% TM leverage
  • Moderately repetitive content (user manuals, technical documentation): 20-40% TM leverage
  • Unique content (marketing copy, creative writing): 5-15% TM leverage

Higher TM leverage means lower MT costs and higher consistency.

Quality Evaluation

Automated Quality Metrics

BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between MT output and human reference translations. Industry standard but correlates poorly with human quality judgments for individual sentences. Useful for comparing systems and tracking quality over time.

COMET: Neural evaluation metric that correlates highly with human quality judgments. Uses a trained model to predict human quality scores from source, MT output, and reference. The recommended automated metric for modern MT evaluation.

ChrF: Character-level F-score that works well for morphologically rich languages where word-level metrics like BLEU are noisy.

Terminology accuracy: Percentage of domain terms translated according to the approved glossary. Track this as a separate metric because it directly impacts usability.

Human Evaluation

Multidimensional Quality Metrics (MQM):

The industry standard framework for human MT evaluation. Evaluators identify and classify errors in the translation:

  • Accuracy errors: Mistranslation, omission, addition, untranslated text
  • Fluency errors: Grammar, spelling, punctuation, style
  • Terminology errors: Incorrect term translation, inconsistent terminology
  • Locale errors: Number format, date format, currency, measurement units

Each error is weighted by severity (critical, major, minor) and the total weighted error count produces a quality score.
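
As a rough illustration of the weighting, the sketch below converts a list of annotated errors into a score out of 100. The severity weights and the normalization by word count are assumptions chosen for the example; agree on the actual weights and formula with the client and the review team.

```python
# Illustrative weights only; MQM implementations differ on these values.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_score(errors: list[tuple[str, str]], word_count: int) -> float:
    """Return a 0-100 quality score from (category, severity) annotations."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)
    return max(0.0, 100.0 * (1 - penalty / word_count))


sample_errors = [
    ("accuracy", "major"),
    ("fluency", "minor"),
    ("terminology", "minor"),
]
print(round(mqm_score(sample_errors, word_count=200), 1))
```

Reporting the penalty per error category alongside the total score shows whether problems come from terminology, fluency, or accuracy, which guides where to invest next.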

Evaluation sampling:

  • Evaluate at least 1,000 words per language pair per month
  • Sample from different document types and difficulty levels
  • Track quality scores over time to detect improvement or degradation
  • Compare quality across language pairs to identify pairs needing more customization

Post-Editing Productivity

For MTPE workflows, measure the productivity improvement that MT provides to human translators.

Key metrics:

  • Post-editing speed: Words per hour for post-editing MT output vs. translating from scratch
  • Post-editing distance: Percentage of MT output that the post-editor changes (lower is better)
  • Post-editor satisfaction: Subjective assessment of MT output quality from the post-editors themselves

Typical results for a well-customized MT system:

  • Post-editing speed: 2-4x faster than translating from scratch
  • Post-editing distance: 15-30% of words changed
  • Translators prefer post-editing over from-scratch translation for technical content
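
Post-editing distance can be approximated with a word-level edit distance between the raw MT output and the post-edited version. This is a minimal sketch that tokenizes on whitespace and normalizes by the longer segment; both choices are assumptions, and dedicated tools use more refined tokenization.

```python
def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(
                previous[j] + 1,         # delete word_a
                current[j - 1] + 1,      # insert word_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def post_editing_distance(mt_output: str, post_edited: str) -> float:
    """Fraction of words changed by the post-editor (0.0 means untouched)."""
    mt_words, pe_words = mt_output.split(), post_edited.split()
    if not mt_words and not pe_words:
        return 0.0
    return word_edit_distance(mt_words, pe_words) / max(len(mt_words), len(pe_words))


print(post_editing_distance(
    "Close valve before the servicing",
    "Close the valve before servicing",
))
```

Averaging this value per language pair and per document type shows where the MT system saves the most post-editing effort.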

Production Operations

Scaling and Performance

Throughput considerations:

  • Commercial APIs: Rate limits vary (typically 100-1,000 requests per second)
  • Self-hosted models: GPU-dependent throughput. A single A10G GPU running an OPUS-MT model can translate approximately 50-200 segments per second.
  • Plan for batch processing during peak documentation release cycles

Cost monitoring:

  • Track translation cost per word by language pair
  • Track the proportion of segments handled by TM vs. MT vs. human translation
  • Monitor TM leverage rates and take action when leverage drops (may indicate new content types entering the pipeline)

Continuous Improvement

Feedback loop:

  1. Human post-editors correct MT output
  2. Corrections are stored in the translation memory
  3. Periodically retrain the MT model on the expanded parallel data (original training data plus post-edited segments)
  4. Each retraining cycle improves MT quality, reducing post-editing effort over time

Quality trend tracking:

  • Plot COMET scores and post-editing distance over time
  • Expect gradual improvement as the TM grows and the model is retrained
  • Investigate sudden quality drops (may indicate new document types, terminology changes, or model issues)
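
A simple way to surface sudden drops is to compare each new score with the average of the preceding few periods. The window size and the drop threshold below are placeholder values, and the score series is invented for the example.

```python
from statistics import mean


def sudden_drops(scores: list[float], window: int = 4, max_drop: float = 0.03) -> list[int]:
    """Return indexes where a score falls more than max_drop below
    the mean of the previous `window` scores."""
    alerts = []
    for i in range(window, len(scores)):
        baseline = mean(scores[i - window:i])
        if baseline - scores[i] > max_drop:
            alerts.append(i)
    return alerts


# Hypothetical weekly average quality scores for one language pair.
weekly_scores = [0.82, 0.83, 0.83, 0.84, 0.84, 0.78, 0.85]
print(sudden_drops(weekly_scores))
```

An alert is a prompt to look at what changed that week, such as a new document type or a glossary update, rather than proof that the model itself has regressed.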

Your Next Step

Ask your client for their largest translation memory (the database of previously translated segment pairs). Count the sentence pairs per language pair. For any pair with more than 10,000 aligned segments, fine-tune a translation model and compare its quality to a generic API on 200 representative sentences from the client's domain. Have a professional translator score both outputs on a 1-5 scale. If the fine-tuned model scores within 0.5 points of the generic API, customization has minimal value and the project should focus on terminology enforcement and workflow integration. If the fine-tuned model scores significantly higher, you have a strong case for a custom model. This evaluation takes 3-4 days and gives you the data to make an informed architectural decision.
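
The decision rule from this evaluation can be written down in a few lines. The sketch below treats any average gain above 0.5 points as the case for a custom model, which is an interpretation of "significantly higher"; the function name, return labels, and sample scores are hypothetical.

```python
from statistics import mean


def customization_decision(
    fine_tuned_scores: list[int],
    generic_scores: list[int],
    min_gain: float = 0.5,
) -> tuple[str, float]:
    """Compare average 1-5 translator scores for the two systems.

    Returns (recommendation, average gain of the fine-tuned model).
    """
    gain = mean(fine_tuned_scores) - mean(generic_scores)
    if gain > min_gain:
        return "custom_model", gain
    return "terminology_and_workflow", gain


# Invented scores for a handful of sentences; the real test uses 200.
print(customization_decision([5, 5, 5, 4], [4, 4, 4, 3]))
print(customization_decision([4, 4, 5, 4], [4, 4, 4, 4]))
```

With 200 scored sentences it is also worth checking how the gain is distributed, since a large improvement on a few terminology-heavy sentences can matter more than the average suggests.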
