Delivering Machine Translation Systems — Building Custom Translation Pipelines for Enterprise Localization

A localization agency in Berlin was hired by a global industrial equipment manufacturer to translate their technical documentation across 14 languages. The manufacturer produced 2.4 million words of new documentation monthly — user manuals, safety guides, maintenance procedures, and training materials. Human translation at $0.18 per word across 14 languages cost $6 million annually, with a 4-6 week turnaround that delayed product launches in international markets. The agency built a custom machine translation system fine-tuned on the manufacturer's existing translation memory (8.2 million aligned sentence pairs accumulated over 12 years), integrated with a human post-editing workflow. Translation cost dropped to $0.058 per word — a 68% reduction. Turnaround dropped from 4-6 weeks to 5 days. Human evaluators rated the machine-translated, post-edited output as equivalent to fully human-translated output 94% of the time.

Machine translation (MT) systems convert text from one language to another. For AI agencies, custom MT solutions are high-value deliverables because localization is expensive, time-consuming, and critical for global businesses. Generic translation APIs (Google Translate, DeepL) work well for general content, but enterprise clients with specialized terminology, brand voice requirements, and quality standards need customized systems that learn their specific language patterns.

Scoping Translation Projects

Translation Quality Tiers

Not all translation needs the same quality level. Define the quality tier before choosing an approach.

Raw machine translation (no human review):

Quality: 70-85% of human translation quality
Use cases: Internal communications, content triage, gisting (understanding the general meaning of foreign-language content)
Cost: $0.005-0.02 per word

Machine translation with light post-editing (MTPE-Light):

Quality: 85-93% of human translation quality
Use cases: Knowledge base articles, internal documentation, high-volume content with moderate quality requirements
Cost: $0.03-0.06 per word

Machine translation with full post-editing (MTPE-Full):

Quality: 93-98% of human translation quality
Use cases: Marketing content, product documentation, customer-facing content
Cost: $0.06-0.12 per word

Human translation with MT assistance:

Quality: 98-100% of human translation quality
Use cases: Legal contracts, regulatory filings, medical documents, creative marketing copy
Cost: $0.12-0.25 per word
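
The per-word ranges above turn into a budget range with simple arithmetic. The following is a minimal sketch, not a pricing tool: the tier keys and the function name `estimate_cost` are illustrative names chosen here, and the rates are copied from the tiers listed above.

```python
# Per-word cost ranges in USD, copied from the quality tiers above.
# The dictionary keys are illustrative labels, not an industry standard.
TIER_COST_PER_WORD = {
    "raw_mt": (0.005, 0.02),
    "mtpe_light": (0.03, 0.06),
    "mtpe_full": (0.06, 0.12),
    "human_with_mt": (0.12, 0.25),
}


def estimate_cost(source_words: int, target_languages: int, tier: str) -> tuple[float, float]:
    """Return a (low, high) cost estimate in USD for one batch of content."""
    low_rate, high_rate = TIER_COST_PER_WORD[tier]
    # Every source word is translated once per target language.
    translated_words = source_words * target_languages
    return (round(translated_words * low_rate, 2), round(translated_words * high_rate, 2))


if __name__ == "__main__":
    low, high = estimate_cost(100_000, 3, "mtpe_light")
    print(f"Estimated cost: ${low:,.2f} to ${high:,.2f}")
```

Running the estimate for each candidate tier gives the client a concrete number to weigh against the quality level they actually need.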

Language Pair Assessment

Translation difficulty varies dramatically by language pair. Assess each pair's difficulty before committing to quality targets.

High-resource language pairs (abundant training data, mature MT models):

English to/from: French, German, Spanish, Portuguese, Chinese, Japanese
Expected quality: High even with generic models, excellent with customization

Medium-resource language pairs:

English to/from: Korean, Arabic, Turkish, Vietnamese, Thai, Polish
Expected quality: Good with generic models, very good with customization

Low-resource language pairs:

English to/from: Swahili, Khmer, Amharic, Burmese, many indigenous languages
Expected quality: Moderate with generic models, significant room for improvement with customization

Language-specific challenges:

Morphologically rich languages (Finnish, Turkish, Hungarian): Many word forms per lemma, requiring robust handling of inflections
Languages with no word boundaries (Chinese, Japanese, Thai): Require word segmentation, or a subword tokenizer that works directly on unsegmented text, as a preprocessing step
Right-to-left languages (Arabic, Hebrew): Require bidirectional text handling
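Tonal languages (Chinese, Vietnamese): Tone is carried in the written form (distinct characters in Chinese, diacritics in Vietnamese), so homophone ambiguity mainly appears when input is romanized or has lost its diacritics; such input requires strong contextual understanding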

Terminology Requirements

Enterprise translation requires consistent handling of domain-specific terminology.

Terminology inventory:

Extract all domain-specific terms from the client's existing documentation and translation memories
Create a terminology database with source terms, approved translations for each target language, definitions, and context examples
Include brand names, product names, and terms that should not be translated
Include terms with multiple valid translations and guidelines for when to use each
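
One possible shape for a terminology database record is sketched below. The class name `TermEntry`, its field names, and the sample product name are assumptions made for illustration; a real project would more likely map these fields onto whatever termbase format the client already uses.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """One record in a terminology database (illustrative schema)."""
    source_term: str
    definition: str = ""
    # Target language code -> approved translations, preferred one first.
    translations: dict[str, list[str]] = field(default_factory=dict)
    # True for brand names, product names, and other pass-through terms.
    do_not_translate: bool = False
    # Target language code -> guidance on when to use which translation.
    usage_notes: dict[str, str] = field(default_factory=dict)
    context_examples: list[str] = field(default_factory=list)

    def approved(self, lang: str) -> list[str]:
        """Return the approved target terms for a language."""
        if self.do_not_translate:
            return [self.source_term]
        return self.translations.get(lang, [])


if __name__ == "__main__":
    wrench = TermEntry(
        source_term="torque wrench",
        definition="Tool that applies a specific torque to a fastener.",
        translations={"de": ["Drehmomentschlüssel"]},
        context_examples=["Tighten the bolts with a torque wrench."],
    )
    # "HydraLift" is a made-up product name used only for this example.
    brand = TermEntry(source_term="HydraLift", do_not_translate=True)
    print(wrench.approved("de"), brand.approved("de"))
```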

Model Architecture and Selection

Commercial Translation APIs

DeepL API: Widely regarded as one of the highest-quality commercial translation APIs for European languages. Strong support for formal/informal register, glossary enforcement, and document-level translation. Cost: about $0.025 per 1,000 characters (approximately $0.00015 per word at about six characters per word, spaces included).

Google Cloud Translation (Advanced): Broad language coverage (130+ languages), supports custom models trained on client data. Strong for Asian languages. Offers AutoML Translation for custom model training. Cost: about $0.020 per 1,000 characters.

Amazon Translate: Good integration with AWS services, supports custom terminology and Active Custom Translation (output customized with the client's own parallel data). Cost: about $0.015 per 1,000 characters.

Azure Translator: Strong enterprise features, Custom Translator with training on parallel data, document translation. Cost: about $0.010 per 1,000 characters.

These are list prices for standard translation and change over time; custom-model usage is typically billed at a higher rate, so confirm current pricing with each provider.

Open-Source Translation Models

NLLB (No Language Left Behind, Meta): Supports 200+ languages. Strong for low-resource language pairs. Self-hostable.

mBART / mT5: Multilingual models that can be fine-tuned for translation tasks. Good starting point for custom translation systems.

OPUS-MT (Helsinki-NLP): Collection of translation models for specific language pairs. Good baseline quality for many pairs. Self-hostable, lightweight.

MADLAD-400 (Google): Large multilingual model supporting 400+ languages. Strong zero-shot and few-shot translation capabilities.

Custom Model Training

Fine-tuning a translation model on the client's domain-specific parallel data is the most effective way to improve terminology handling and style consistency.

Training data requirements:

Minimum: 10,000 parallel sentence pairs for noticeable quality improvement
Recommended: 100,000-500,000 pairs for production-quality customization
Optimal: 1,000,000+ pairs for maximum domain adaptation

Training data sources:

Translation memories (TMs): The gold standard. Client's existing TM databases contain aligned source-target sentence pairs that represent their approved translations.
Aligned parallel documents: Source documents and their official translations, aligned at the sentence level using tools like LF Aligner or Hunalign.
Bilingual glossaries: Term-level translations that can be used for terminology enforcement and training data augmentation.
Synthetic parallel data: Use back-translation (translate target-language text into the source language) to generate additional training pairs.
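
Back-translation can be sketched as follows. This is a minimal illustration under stated assumptions: `translate_to_source` stands in for whatever reverse-direction MT system is available (it is a hypothetical callable, not a real library function), and the `synthetic` flag is an added convention so these pairs can be down-weighted or filtered later.

```python
from typing import Callable


def back_translate(
    target_sentences: list[str],
    translate_to_source: Callable[[str], str],  # hypothetical reverse-direction MT call
) -> list[dict]:
    """Pair authentic target-language text with machine-generated source text."""
    pairs = []
    seen = set()
    for target in target_sentences:
        target = target.strip()
        if not target or target in seen:
            continue  # skip blanks and duplicates
        seen.add(target)
        synthetic_source = translate_to_source(target).strip()
        if synthetic_source:
            # The target side is human-written; only the source side is synthetic.
            pairs.append({"source": synthetic_source, "target": target, "synthetic": True})
    return pairs


if __name__ == "__main__":
    # Toy stand-in for a German-to-English MT system.
    fake_mt = {"Ölfilter ersetzen.": "Replace the oil filter."}
    print(back_translate(["Ölfilter ersetzen.", "Ölfilter ersetzen."], lambda s: fake_mt.get(s, "")))
```

Keeping the human-written text on the target side matters because the model learns to produce the target language, so that is the side where fluency errors would do the most harm.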

Translation Pipeline Architecture

End-to-End Pipeline

Stage 1 — Source Text Preprocessing:

Extract text from source documents while preserving formatting markup
Segment text into translation units (typically sentences)
Identify non-translatable elements (code, URLs, file paths, brand names)
Tag protected terms that should be passed through without translation
Detect the source language (if not specified)
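
Protecting non-translatable elements is often done by swapping them for placeholder tokens before translation and restoring them afterwards. The sketch below assumes a simple `__PHn__` token format and a deliberately small set of regular expressions; both are illustrative choices, and some MT engines alter such tokens, so the format should be checked against the engine actually in use.

```python
import re

# URLs, Unix-style paths with at least two components, and inline code in backticks.
# These patterns are intentionally simple and will miss many real-world cases.
PROTECTED = re.compile(r"https?://\S+|(?:/[\w.-]+){2,}|`[^`]+`")


def protect(text: str, protected_terms: tuple[str, ...] = ()) -> tuple[str, dict[str, str]]:
    """Replace non-translatable spans with placeholder tokens."""
    mapping: dict[str, str] = {}

    def stash(match: re.Match) -> str:
        token = f"__PH{len(mapping)}__"
        mapping[token] = match.group(0)
        return token

    masked = PROTECTED.sub(stash, text)
    for term in protected_terms:  # e.g. brand or product names from the glossary
        masked = re.sub(re.escape(term), stash, masked)
    return masked, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original spans back after translation."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


if __name__ == "__main__":
    # "HydraLift" is a made-up product name used only for this example.
    source = "Open /etc/hydra/config.yaml or visit https://example.com/docs for HydraLift details"
    masked, mapping = protect(source, ("HydraLift",))
    print(masked)
    print(restore(masked, mapping))
```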

Stage 2 — Pre-Translation Processing:

Look up translation memory for exact and fuzzy matches
Apply terminology glossary to identify term translations
Handle text normalization (number formats, date formats, measurement units)

Stage 3 — Machine Translation:

Translate each segment using the MT model
Apply terminology constraints (force the model to use approved term translations)
Generate confidence scores for each segment
Produce alternative translations for low-confidence segments

Stage 4 — Post-Translation Processing:

Apply quality checks (terminology consistency, number preservation, tag integrity)
Enforce glossary terms (replace incorrect term translations with approved translations)
Normalize formatting (punctuation conventions, capitalization rules for the target language)
Reconstruct the document with original formatting
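
Two of the quality checks named in Stage 4, number preservation and tag integrity, can be approximated with a few lines of code. The sketch below is a rough heuristic under stated assumptions: numbers are compared by their digits only, so that locale-specific separators (1,500.5 versus 1.500,5) do not raise false alarms, at the cost of not distinguishing 1.5 from 15. The function and issue names are illustrative.

```python
import re
from collections import Counter

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def digits_only(number: str) -> str:
    """Drop thousands and decimal separators so locale formats compare equal."""
    return re.sub(r"[.,]", "", number)


def check_segment(source: str, target: str) -> list[str]:
    """Return a list of issue labels for one translated segment."""
    issues = []
    source_numbers = Counter(digits_only(n) for n in NUMBER.findall(source))
    target_numbers = Counter(digits_only(n) for n in NUMBER.findall(target))
    if source_numbers != target_numbers:
        issues.append("number_mismatch")
    # Tags must appear the same number of times on both sides.
    if Counter(TAG.findall(source)) != Counter(TAG.findall(target)):
        issues.append("tag_mismatch")
    return issues


if __name__ == "__main__":
    print(check_segment(
        "Tighten to <b>45 Nm</b> within 1,500.5 hours.",
        "Mit <b>45 Nm</b> innerhalb von 1.500,5 Stunden anziehen.",
    ))
    print(check_segment("Replace <b>2</b> filters.", "Ersetzen Sie <b>3 Filter."))
```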

Stage 5 — Quality Assessment and Routing:

Score each segment using automated quality metrics
Route segments by quality tier: auto-accept (high quality), light post-edit (medium quality), full post-edit (low quality)
Present segments for human review in a translation management interface
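
The routing step reduces to a pair of thresholds on the per-segment quality score. In the sketch below the score is assumed to lie between 0 and 1, and the cut-offs 0.85 and 0.70 are placeholders, not recommended values; in practice they would be calibrated per language pair against human review results.

```python
def route_segment(quality_score: float, accept_at: float = 0.85, light_edit_at: float = 0.70) -> str:
    """Map an automated quality score (assumed 0-1) to a review queue."""
    if quality_score >= accept_at:
        return "auto_accept"
    if quality_score >= light_edit_at:
        return "light_post_edit"
    return "full_post_edit"


if __name__ == "__main__":
    for score in (0.91, 0.78, 0.42):
        print(score, route_segment(score))
```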

Terminology Enforcement

Consistent terminology is the difference between professional and amateur translation output.

Terminology enforcement approaches:

Glossary injection: Include approved term translations in the translation prompt or model input as constraints
Post-translation replacement: After translation, replace any incorrect term translations with the approved translations from the glossary
Fine-tuning on glossary examples: Include sentence pairs demonstrating correct terminology usage in the fine-tuning data
Constrained decoding: Modify the model's decoding algorithm to force specific token sequences when translating specific terms

Terminology consistency monitoring:

Track terminology usage across all translated segments
Flag segments where approved terms are not used consistently
Generate terminology consistency reports for quality assurance review
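
Flagging segments that miss an approved term can start as a plain substring check, as in the sketch below. It assumes a glossary with one approved target term per source term and uses case-insensitive substring matching, which will miss inflected forms in morphologically rich languages; a production check would need lemmatization or fuzzy matching. The function name is illustrative.

```python
def find_term_violations(
    segments: list[tuple[str, str]],   # (source, target) pairs
    glossary: dict[str, str],          # source term -> approved target term
) -> tuple[list[tuple[int, str, str]], float]:
    """Return flagged segments and the share of term occurrences translated as approved."""
    violations = []
    checked = 0
    for index, (source, target) in enumerate(segments):
        for source_term, target_term in glossary.items():
            if source_term.lower() in source.lower():
                checked += 1
                if target_term.lower() not in target.lower():
                    violations.append((index, source_term, target_term))
    accuracy = 1.0 if checked == 0 else (checked - len(violations)) / checked
    return violations, accuracy


if __name__ == "__main__":
    glossary = {"torque wrench": "Drehmomentschlüssel"}
    segments = [
        ("Use the torque wrench.", "Verwenden Sie den Drehmomentschlüssel."),
        ("Store the torque wrench dry.", "Lagern Sie den Schraubenschlüssel trocken."),
    ]
    print(find_term_violations(segments, glossary))
```

The list of violations feeds the consistency report, and the accuracy figure can be tracked over time alongside the other quality metrics.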

Translation Memory Integration

Translation memories (TMs) store previously translated segments for reuse. Integrating TM with MT creates a hybrid system that maximizes both consistency and efficiency.

TM-MT integration workflow:

For each source segment, search the TM for matches
100% match: Use the TM translation directly (no MT needed)
Fuzzy match (75-99%): Use the TM match as a starting point and apply MT to adjust for differences
No match (below 75%): Translate using MT
After human review, store the approved translation in the TM for future reuse
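
The match-and-route logic above can be sketched with the standard library. Note the assumption: `difflib.SequenceMatcher.ratio()` is used here as a stand-in similarity score, and it will not reproduce the match percentages that commercial translation memory tools report. The function name and return shape are illustrative.

```python
from difflib import SequenceMatcher


def tm_lookup(source: str, tm: dict[str, str], fuzzy_threshold: float = 0.75):
    """Return (match_type, tm_translation_or_None, similarity) for one segment."""
    best_score, best_source = 0.0, None
    for tm_source in tm:
        score = SequenceMatcher(None, source, tm_source).ratio()
        if score > best_score:
            best_score, best_source = score, tm_source
    if best_source is not None and best_score == 1.0:
        return ("exact", tm[best_source], best_score)   # reuse directly, no MT
    if best_source is not None and best_score >= fuzzy_threshold:
        return ("fuzzy", tm[best_source], best_score)   # starting point for MT or post-editing
    return ("mt", None, best_score)                     # translate from scratch with MT


if __name__ == "__main__":
    tm = {"Replace the oil filter.": "Ölfilter ersetzen."}
    for segment in (
        "Replace the oil filter.",
        "Replace the air filter.",
        "Calibrate the pressure sensor before use.",
    ):
        print(segment, "->", tm_lookup(segment, tm)[0])
```

A linear scan like this is fine for a demonstration; a TM with millions of segments would need an index to keep lookups fast.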

TM leverage rates (percentage of segments with TM matches):

Highly repetitive content (product updates, policy revisions): 40-70% TM leverage
Moderately repetitive content (user manuals, technical documentation): 20-40% TM leverage
Unique content (marketing copy, creative writing): 5-15% TM leverage

Higher TM leverage means lower MT costs and higher consistency.

Quality Evaluation

Automated Quality Metrics

BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between MT output and human reference translations. Industry standard but correlates poorly with human quality judgments for individual sentences. Useful for comparing systems and tracking quality over time.

COMET: Neural evaluation metric that correlates highly with human quality judgments. Uses a trained model to predict human quality scores from source, MT output, and reference. The recommended automated metric for modern MT evaluation.

chrF: Character-level F-score that works well for morphologically rich languages where word-level metrics like BLEU are noisy.

Terminology accuracy: Percentage of domain terms translated according to the approved glossary. Track this as a separate metric because it directly impacts usability.
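
To make the character-level idea behind chrF concrete, here is a simplified sketch. It averages character n-gram precision and recall over orders 1 to 6 and combines them with an F-score that weights recall more heavily (beta of 2). This is an illustration only: whitespace handling and averaging details differ between implementations, so its numbers should not be expected to match an established evaluation library, which is what reported scores should come from.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score between 0 and 1."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


if __name__ == "__main__":
    print(simple_chrf("Ölfilter ersetzen", "Ölfilter ersetzen"))
    print(simple_chrf("Ölfilter wechseln", "Ölfilter ersetzen"))
```

Because matching happens on characters, a hypothesis that gets the stem right but the inflection wrong still earns partial credit, which is why this family of metrics is less noisy than word-level overlap for heavily inflected languages.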

Human Evaluation

Multidimensional Quality Metrics (MQM):

The industry standard framework for human MT evaluation. Evaluators identify and classify errors in the translation:

Accuracy errors: Mistranslation, omission, addition, untranslated text
Fluency errors: Grammar, spelling, punctuation, style
Terminology errors: Incorrect term translation, inconsistent terminology
Locale errors: Number format, date format, currency, measurement units

Each error is weighted by severity (critical, major, minor) and the total weighted error count produces a quality score.
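
The weighted error count can be computed as in the sketch below. The severity weights (minor 1, major 5, critical 10) are one commonly used choice and are an assumption here, as is normalizing the penalty per 1,000 evaluated words; the weights and normalization should be agreed with the client before evaluation starts.

```python
# Assumed severity weights; agree on these with the client before evaluating.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_penalty_per_1000_words(errors: list[tuple[str, str]], word_count: int) -> float:
    """errors holds (category, severity) annotations; lower results are better."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)
    return penalty * 1000 / word_count


if __name__ == "__main__":
    annotations = [("accuracy", "major"), ("fluency", "minor"), ("terminology", "minor")]
    print(mqm_penalty_per_1000_words(annotations, word_count=500))
```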

Evaluation sampling:

Evaluate at least 1,000 words per language pair per month
Sample from different document types and difficulty levels
Track quality scores over time to detect improvement or degradation
Compare quality across language pairs to identify pairs needing more customization

Post-Editing Productivity

For MTPE workflows, measure the productivity improvement that MT provides to human translators.

Key metrics:

Post-editing speed: Words per hour for post-editing MT output vs. translating from scratch
Post-editing distance: Percentage of MT output that the post-editor changes (lower is better)
Post-editor satisfaction: Subjective assessment of MT output quality from the post-editors themselves
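
Post-editing distance can be measured as a word-level edit distance between the raw MT output and the post-edited version, divided by segment length. The sketch below is one reasonable definition, stated as an assumption: it splits on whitespace and normalizes by the longer of the two segments, whereas other tools use character-level distance or different normalization.

```python
def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if word_a == word_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def post_edit_distance(mt_output: str, post_edited: str) -> float:
    """Share of words changed by the post-editor, between 0 and 1."""
    mt_words, pe_words = mt_output.split(), post_edited.split()
    if not mt_words and not pe_words:
        return 0.0
    return word_edit_distance(mt_words, pe_words) / max(len(mt_words), len(pe_words))


if __name__ == "__main__":
    distance = post_edit_distance(
        "Replace the oil filter every month",
        "Replace the oil filter every year",
    )
    print(f"{distance:.0%} of words changed")
```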

Typical results for a well-customized MT system:

Post-editing speed: 2-4x faster than translating from scratch
Post-editing distance: 15-30% of words changed
Translators prefer post-editing over from-scratch translation for technical content

Production Operations

Scaling and Performance

Throughput considerations:

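Commercial APIs: Rate limits vary by provider and plan (typically 100-1,000 requests per second, and some providers also cap characters per minute); confirm the current quota before sizing batch jobs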
Self-hosted models: GPU-dependent throughput. A single A10G GPU running an OPUS-MT model can translate approximately 50-200 segments per second.
Plan for batch processing during peak documentation release cycles

Cost monitoring:

Track translation cost per word by language pair
Track the proportion of segments handled by TM vs. MT vs. human translation
Monitor TM leverage rates and take action when leverage drops (may indicate new content types entering the pipeline)
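
The monitoring items above can be computed from a per-segment log. The sketch below assumes each log record is a dictionary with `pair`, `route` (`tm`, `mt`, or `human`), `words`, and `cost` keys; that record layout is a hypothetical one chosen for the example.

```python
from collections import defaultdict


def summarize_costs(records: list[dict]) -> dict[str, dict]:
    """Cost per word and share of segments per route, grouped by language pair."""
    words = defaultdict(int)
    cost = defaultdict(float)
    segments_by_route = defaultdict(lambda: defaultdict(int))
    for record in records:
        pair = record["pair"]
        words[pair] += record["words"]
        cost[pair] += record["cost"]
        segments_by_route[pair][record["route"]] += 1
    summary = {}
    for pair, routes in segments_by_route.items():
        total_segments = sum(routes.values())
        summary[pair] = {
            "cost_per_word": cost[pair] / words[pair] if words[pair] else 0.0,
            # The "tm" share here is the TM leverage rate for this pair.
            "route_share": {route: count / total_segments for route, count in routes.items()},
        }
    return summary


if __name__ == "__main__":
    log = [
        {"pair": "en-de", "route": "tm", "words": 100, "cost": 0.0},
        {"pair": "en-de", "route": "tm", "words": 100, "cost": 0.0},
        {"pair": "en-de", "route": "mt", "words": 200, "cost": 12.0},
        {"pair": "en-de", "route": "human", "words": 100, "cost": 18.0},
    ]
    print(summarize_costs(log))
```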

Continuous Improvement

Feedback loop:

Human post-editors correct MT output
Corrections are stored in the translation memory
Periodically retrain the MT model on the expanded parallel data (original training data plus post-edited segments)
Each retraining cycle improves MT quality, reducing post-editing effort over time

Quality trend tracking:

Plot COMET scores and post-editing distance over time
Expect gradual improvement as the TM grows and the model is retrained
Investigate sudden quality drops (may indicate new document types, terminology changes, or model issues)
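
A sudden drop can be detected by comparing each new score with the average of the preceding few. The sketch below assumes one aggregate quality score per period (for example a weekly mean COMET score); the window of 4 periods and the 0.03 drop threshold are placeholders to be tuned against the normal variation seen in the client's data.

```python
def find_quality_drops(scores: list[float], window: int = 4, max_drop: float = 0.03) -> list[int]:
    """Return indexes of periods whose score fell well below the recent average."""
    drops = []
    for i in range(window, len(scores)):
        baseline = sum(scores[i - window:i]) / window
        if baseline - scores[i] > max_drop:
            drops.append(i)
    return drops


if __name__ == "__main__":
    weekly_scores = [0.82, 0.83, 0.83, 0.84, 0.84, 0.78, 0.84]
    print(find_quality_drops(weekly_scores))
```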

Your Next Step

Ask your client for their largest translation memory — the database of previously translated segment pairs. Count the sentence pairs per language pair. For any pair with more than 10,000 aligned segments, fine-tune a translation model and compare its quality to a generic API on 200 representative sentences from the client's domain. Have a professional translator score both outputs on a 1-5 scale. If the fine-tuned model scores within 0.5 points of the generic API, customization has minimal value and the project should focus on terminology enforcement and workflow integration. If the fine-tuned model scores significantly higher, you have a strong case for a custom model. This evaluation takes 3-4 days and gives you the data to make an informed architectural decision.
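
The decision rule in this evaluation can be written down as a small function. This is a minimal sketch: the function name and return labels are illustrative, and it compares mean scores only, so with 200 sentences it is worth also checking that the difference is consistent across document types and not driven by a handful of segments.

```python
from statistics import mean


def recommend_approach(fine_tuned_scores: list[float], generic_scores: list[float], margin: float = 0.5) -> str:
    """Apply the 0.5-point rule to translator ratings on a 1-5 scale."""
    difference = mean(fine_tuned_scores) - mean(generic_scores)
    if difference > margin:
        return "custom_model"
    # Within the margin (or worse): customization adds little value.
    return "terminology_and_workflow"


if __name__ == "__main__":
    print(recommend_approach([4, 5, 4, 5], [3, 4, 3, 4]))
    print(recommend_approach([4, 4, 4, 4], [4, 4, 3, 4]))
```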

Agency Script Editorial