Integrating with Messy Client Data Systems: The Unglamorous Reality of AI Delivery
An agency signed a $120,000 contract to build an intelligent document routing system for a regional insurance carrier. The client assured them that all policy documents were stored in a centralized document management system with consistent metadata. Six weeks in, the agency discovered the truth. Roughly 60 percent of documents were in the document management system. Another 25 percent were in a legacy system from a 2019 acquisition that had never been fully migrated. Ten percent were in shared network drives organized by individual adjuster, some in folders named things like "John's Stuff" and "OLD DO NOT DELETE." The remaining 5 percent existed only as email attachments. The document management system's metadata was inconsistent: policy numbers followed three different format conventions from three different IT regimes. Date fields contained a mix of MM/DD/YYYY, DD/MM/YYYY, and YYYY-MM-DD formats. The "document type" field, critical for routing, had 47 distinct values, many of which were misspellings or abbreviations of each other. The agency spent eight of their twelve budgeted weeks on data integration and cleaning, leaving only four weeks for the actual AI system. The result was a system that worked but was far less sophisticated than what they had proposed. The client was disappointed but understanding: they knew their data was a mess, even if they had underestimated how bad it was.
Every AI agency has a version of this story. Client data is always messier than described, always harder to access than promised, and always more complex than the sales conversation implied. The agencies that deliver consistently are the ones that expect this messiness, budget for it, and have systematic approaches for handling it. The agencies that believe the data will be clean are the ones that blow timelines and budgets.
Why Client Data Is Always Messy
Understanding why client data is messy helps you anticipate specific problems before you encounter them.
Organizational complexity. Large organizations have multiple departments, multiple IT systems, multiple data entry conventions, and multiple generations of technology. Each of these creates data inconsistencies that accumulate over time.
Acquisition history. Companies that have grown through acquisitions carry the data systems and conventions of each acquired company. Full data migration is expensive and disruptive, so it is often deferred or done incompletely.
Human data entry. Data entered by humans contains typos, abbreviations, inconsistent formatting, missing fields, and creative interpretations of field definitions. No amount of validation catches everything.
Schema evolution. Databases evolve over time. Fields are added, renamed, repurposed, and deprecated. Documentation of these changes is often incomplete or nonexistent.
Vendor lock-in. Data stored in vendor-specific systems may be difficult to extract in usable formats. APIs may be limited, documentation may be poor, and export capabilities may be restricted.
Undocumented business logic. Critical data transformations and business rules exist only in the minds of long-tenured employees or in undocumented stored procedures. When you query the data, you get results that do not match your expectations because of logic you did not know existed.
The Data Assessment Phase
Before committing to a project scope, timeline, or budget, assess the client's actual data situation. This assessment should be a paid engagement โ never do deep data assessment for free.
What to Assess
Data inventory. Identify all data sources relevant to the project. Do not accept the client's initial list as exhaustive. Ask probing questions: "Is there any data from before 2020?" "Are there any systems from acquisitions?" "Does anyone keep data in spreadsheets or local files?"
Data access. Verify that you can actually access each data source. Test API connections, database credentials, file system access, and export capabilities. Access that is "being set up" or "should be ready next week" is access you do not have.
Data quality. Profile each data source. Measure completeness: what percentage of records have values for each field? Measure consistency: do the same entities have the same identifiers across sources? Measure accuracy: do field values match real-world facts for a sample of records?
Data volume. Measure the volume of data you will need to process. Row counts, file sizes, and record counts all affect your pipeline design, processing time, and infrastructure requirements.
Data freshness. Understand how frequently data is updated, how quickly updates need to be reflected in your system, and whether historical data is available for training.
Data governance. Understand who owns the data, who can authorize access, what privacy regulations apply, and what restrictions exist on data usage, storage, and processing.
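A lightweight profiling pass turns the quality assessment into numbers you can put in front of the client. The sketch below (plain Python, hypothetical field and record names) reports per-field completeness and distinct-value counts for a list of record dicts; a routing-critical field with dozens of distinct values is exactly the kind of red flag the insurance story illustrates.

```python
from collections import Counter

def profile_records(records, fields):
    """Profile a list of record dicts.

    For each field, report the fraction of records with a usable value,
    the number of distinct values, and the most common values. An
    unexpectedly high distinct-value count on a categorical field is a
    quick consistency signal.
    """
    total = len(records)
    report = {}
    for field in fields:
        values = [r.get(field) for r in records]
        present = [v for v in values if v not in (None, "", "N/A")]
        report[field] = {
            "completeness": len(present) / total if total else 0.0,
            "distinct_values": len(set(present)),
            "top_values": Counter(present).most_common(3),
        }
    return report

# Example: profile a handful of policy records (hypothetical data)
records = [
    {"policy_no": "POL-001", "doc_type": "Claim"},
    {"policy_no": "POL-002", "doc_type": "claim"},
    {"policy_no": "", "doc_type": "CLM"},
    {"policy_no": "POL-004", "doc_type": None},
]
report = profile_records(records, ["policy_no", "doc_type"])
```

Even on this four-record sample, the report surfaces the two classic problems: missing identifiers and the same category spelled three ways.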
Red Flags to Watch For
"The data is in good shape." This almost always means the client has not looked closely at the data recently. Verify independently.
No data dictionary. If the client cannot provide documentation of their data schema, expect to spend significant time reverse-engineering it.
Multiple systems of record. When the same data entity exists in multiple systems, reconciliation will be a major effort.
Recent system migrations. Data quality issues are most severe during and immediately after system migrations.
Restricted access. If it takes weeks to get database access or API credentials, expect ongoing access issues throughout the project.
No data engineering team. If the client does not have dedicated data engineering resources, expect that data pipelines are fragile, undocumented, and running on individual machines.
Data Integration Strategies
Once you understand the data landscape, choose your integration approach based on the specific challenges you face.
Extract, Transform, Load
The classic approach: extract data from source systems, transform it into a consistent format, and load it into your working environment.
When to use it. ETL is appropriate when you need full control over the data, when source systems are slow or unreliable, when you need to combine data from multiple sources, or when your processing is batch-oriented.
Key decisions. How often to run the ETL. How to handle records that fail transformation. How to detect and process changes since the last ETL run. How to reconcile data across sources.
Common pitfalls. ETL pipelines that work on sample data often fail on full data due to edge cases: unexpected character encodings, extremely long fields, null values in unexpected places, or circular references. Test your ETL on a representative sample of the full data, including the messiest records.
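One systematic way to keep edge cases from killing an entire batch is a dead-letter pattern: quarantine records that fail transformation instead of aborting the run. A minimal sketch, with a hypothetical transform over made-up field names:

```python
def transform_record(raw):
    """Normalize one raw record; raises on malformed input."""
    return {
        "policy_no": raw["policy_no"].strip().upper(),
        "amount": float(raw["amount"]),
    }

def run_etl(raw_records):
    """Transform records, quarantining failures instead of aborting the batch."""
    loaded, dead_letter = [], []
    for raw in raw_records:
        try:
            loaded.append(transform_record(raw))
        except (KeyError, ValueError, AttributeError, TypeError) as exc:
            # Keep the failing record and the reason for later inspection
            dead_letter.append({"record": raw, "error": repr(exc)})
    return loaded, dead_letter
```

Reviewing the dead-letter queue after each run is also how you discover the edge cases your transform does not yet handle.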
API Integration
Connect directly to client systems through their APIs and process data on demand.
When to use it. API integration is appropriate when data freshness is critical, when the volume of relevant data per request is small, or when the client's system provides a well-documented API with the queries you need.
Key decisions. How to handle API rate limits. How to handle API failures and retries. How to cache API responses to avoid redundant calls. How to handle API authentication and token refresh.
Common pitfalls. Client APIs are often poorly documented, inconsistently implemented, and subject to undocumented changes. Build defensive integrations with comprehensive error handling and monitoring. Never assume the API will always return what the documentation says it should.
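Defensive API integration usually starts with retries and backoff. The sketch below wraps any `fetch` callable with exponential backoff and jitter; the function name and parameters are illustrative, and in a real integration you would pass your HTTP client's specific timeout and rate-limit exceptions as `retryable` rather than the broad default.

```python
import random
import time

def call_with_retries(fetch, retryable=(Exception,), max_attempts=4, base_delay=0.5):
    """Call fetch() with exponential backoff and jitter.

    fetch is any callable that raises on transient failure (timeouts,
    HTTP 429/5xx wrapped as exceptions by your client library).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted: surface the failure to the caller
            # Exponential backoff with jitter to avoid synchronized retries
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random() * 0.1)
            time.sleep(delay)
```

The same wrapper is a natural place to add logging and metrics, so you can show the client how often their API is actually failing.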
Data Federation
Query data across multiple source systems without consolidating it into a single store.
When to use it. Federation is appropriate when data cannot be moved due to regulatory constraints, when the overhead of maintaining a consolidated copy is too high, or when you need real-time access to authoritative source data.
Key decisions. How to join data across systems with different schemas. How to handle queries that span slow and fast systems. How to manage transaction consistency across distributed sources.
Common pitfalls. Federated queries are inherently slower than queries against consolidated data. Performance optimization is challenging and often requires caching and materialized views that partially negate the benefits of federation.
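In its simplest form, federation means your application joins data across systems at query time instead of copying it into one store. The sketch below simulates two independent systems of record with in-memory SQLite databases; the schemas and identifiers are hypothetical.

```python
import sqlite3

def federated_customer_view(crm_db, billing_db, customer_id):
    """Join customer data across two source systems at query time.

    Each system keeps its own schema and identifiers; nothing is copied
    into a consolidated store.
    """
    crm_row = crm_db.execute(
        "SELECT name FROM customers WHERE crm_id = ?", (customer_id,)
    ).fetchone()
    billing_row = billing_db.execute(
        "SELECT balance FROM accounts WHERE customer_ref = ?", (customer_id,)
    ).fetchone()
    return {
        "id": customer_id,
        "name": crm_row[0] if crm_row else None,
        "balance": billing_row[0] if billing_row else None,
    }

# Two independent "systems of record", simulated with in-memory SQLite
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (crm_id TEXT, name TEXT)")
crm.execute("INSERT INTO customers VALUES ('C-12345', 'Acme Corp')")

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE accounts (customer_ref TEXT, balance REAL)")
billing.execute("INSERT INTO accounts VALUES ('C-12345', 4200.0)")

view = federated_customer_view(crm, billing, "C-12345")
```

Note that this view is only as fast as the slowest source system it touches, which is the performance pitfall described above.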
Data Cleaning Strategies
Client data will need cleaning. The question is how much and how to do it efficiently.
Standardization
String normalization. Standardize capitalization, whitespace, character encoding, and punctuation. "NEW YORK", "New York", "new york", and "New  York" (note the extra space) should all map to the same value.
Date normalization. Convert all date representations to a single format. Handle timezone ambiguity explicitly. Watch for date formats that are ambiguous: is "01/02/2025" January 2nd or February 1st?
Identifier reconciliation. When the same entity has different identifiers in different systems, build a mapping table that reconciles them. Customer "C-12345" in the CRM might be "12345" in the billing system and "CUST_12345" in the support system.
Category harmonization. When categorical fields have inconsistent values ("Active", "active", "ACTIVE", "A", "Yes" all meaning the same thing), create a mapping from raw values to standardized categories.
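The four standardization tasks above can each be reduced to a small, testable function. A minimal sketch using only the standard library; the category map and the day-first convention are assumptions you would confirm per source system:

```python
import re
from datetime import datetime

# Raw-value-to-standard-category map; built from profiling the actual data
CATEGORY_MAP = {
    "active": "Active", "a": "Active", "yes": "Active",
    "inactive": "Inactive", "i": "Inactive", "no": "Inactive",
}

def normalize_string(value):
    """Collapse whitespace and standardize casing."""
    return re.sub(r"\s+", " ", value).strip().title()

def normalize_date(value, day_first=False):
    """Parse common date formats into ISO 8601 (YYYY-MM-DD).

    Ambiguous formats like 01/02/2025 are resolved by the day_first
    flag, chosen once per source system, never guessed per record.
    """
    formats = ["%Y-%m-%d"] + (["%d/%m/%Y"] if day_first else ["%m/%d/%Y"])
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

def normalize_category(value):
    """Map raw status values to a standardized category."""
    return CATEGORY_MAP.get(value.strip().lower(), "Unknown")
```

The unmapped-value fallback ("Unknown") matters: it makes new raw values visible in reports instead of silently dropping records.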
Deduplication
Duplicate records are almost guaranteed when data comes from multiple sources or when data entry is manual.
Exact deduplication. Remove records that are identical across all fields. This is straightforward and safe.
Fuzzy deduplication. Identify records that represent the same entity but have minor differences: typos, format variations, partial information. This requires matching algorithms that compare multiple fields and score overall similarity.
Deduplication strategy. When you find duplicates, you need to decide which record to keep or how to merge them. The most recently updated record is often the most accurate, but not always. Define merge rules for each field: take the most recent, take the most complete, take the value from the most authoritative source.
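A simple multi-field similarity score is often enough to get fuzzy deduplication started. The sketch below uses the standard library's `difflib.SequenceMatcher`; the fields, weights, and threshold are illustrative and would be tuned against labeled examples from the client's data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score similarity of two records on name and email (0.0 to 1.0)."""
    name_score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    email_score = 1.0 if a["email"].lower() == b["email"].lower() else 0.0
    return 0.6 * name_score + 0.4 * email_score  # weights are a tuning choice

def find_duplicates(records, threshold=0.85):
    """Return index pairs of records likely to be the same entity."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The all-pairs loop is quadratic; on real volumes you would first block records into candidate groups (for example by email domain or zip code) and only compare within a block.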
Missing Data Handling
Missing data is inevitable. Your handling strategy depends on the downstream use case.
For training data. Missing values in training data affect model quality. Options include removing records with critical missing values, imputing missing values from statistical distributions, or training models that handle missing values natively.
For inference inputs. When real-time inputs have missing values, your system needs a graceful handling strategy: use defaults, skip affected features, or flag the prediction as low-confidence.
For analytics and reporting. Missing data should be clearly flagged rather than silently imputed. Clients need to understand the completeness of their data to make informed decisions.
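For inference inputs in particular, the handling strategy can be made explicit in a small preprocessing step: fill optional fields with defaults, but flag the request as low-confidence when required fields are missing instead of silently imputing them. A sketch with hypothetical field names:

```python
# Fields the model cannot do without, and safe defaults for the rest
REQUIRED = {"age", "income"}
DEFAULTS = {"region": "unknown"}

def prepare_input(raw):
    """Fill optional fields with defaults; flag low confidence when
    required fields are missing rather than silently imputing them."""
    features = dict(DEFAULTS)
    features.update({k: v for k, v in raw.items() if v is not None})
    missing_required = REQUIRED - features.keys()
    return {
        "features": features,
        "low_confidence": bool(missing_required),
        "missing": sorted(missing_required),
    }
```

Downstream code can then route low-confidence predictions to human review, which is usually an easier conversation with the client than explaining a silently wrong answer.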
Data Validation
Implement validation rules that catch data quality issues before they affect your AI system.
Rule-based validation. Define explicit rules: age must be between 0 and 120, dates must be in the past, email must match a valid pattern. Flag or reject records that violate rules.
Statistical validation. Monitor statistical properties (distributions, correlations, trends) and flag anomalies. A sudden spike in null rates or a shift in value distributions indicates a data quality issue.
Cross-field validation. Validate relationships between fields: start date before end date, city matches zip code, total equals sum of components.
Cross-source validation. When the same data exists in multiple sources, validate consistency between them. Discrepancies indicate data quality issues in one or both sources.
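Rule-based and cross-field checks combine naturally into a single validator that returns a list of failures rather than a boolean, so reports can say why a record failed. A minimal sketch with hypothetical fields:

```python
from datetime import date

def validate_record(rec):
    """Return a list of human-readable validation failures (empty = valid)."""
    errors = []
    # Rule-based checks on individual fields
    if not (0 <= rec.get("age", -1) <= 120):
        errors.append("age out of range")
    if rec.get("signup_date") and rec["signup_date"] > date.today():
        errors.append("signup_date in the future")
    # Cross-field check: start must precede end
    start, end = rec.get("start_date"), rec.get("end_date")
    if start and end and start > end:
        errors.append("start_date after end_date")
    return errors
```

Aggregating these failure lists across a full extract gives you the statistical view as well: a sudden jump in any one failure count is the anomaly signal described above.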
Setting Client Expectations
Managing client expectations around data is critical for project success.
Be honest in the assessment. Share your data assessment findings transparently, including the problems you found and the effort required to address them. Clients who understand the data challenge are more patient with the timeline.
Budget explicitly for data work. In your proposal, separate data integration and cleaning from AI development. Make the data effort visible rather than burying it in the development estimate. Clients who see data work as a line item understand what they are paying for.
Set realistic timelines. Data integration consistently takes longer than initial estimates. Add a buffer to your data work timeline โ typically 30 to 50 percent โ and explain to the client that data surprises are normal and expected.
Define data quality requirements. Agree with the client on what level of data quality is needed for the project to succeed. Perfection is neither achievable nor necessary. Define "good enough" explicitly.
Establish ongoing data responsibilities. Clarify who is responsible for data quality going forward: your agency, the client's IT team, or a shared responsibility. Data quality is not a one-time fix; it requires ongoing attention.
Document everything. Document every data issue you find, every transformation you apply, and every assumption you make. This documentation is essential for the client's future maintenance of the system and for your defense if questions arise about data handling.
Data integration is not the fun part of AI delivery. It does not make for exciting demos or compelling case studies. But it is where most AI projects succeed or fail. The agencies that excel at data integration deliver projects on time and on budget because they do not get blindsided by data surprises. The agencies that treat data as someone else's problem deliver projects that are late, over budget, and underwhelming. Embrace the mess. Budget for it. Build systematic approaches for handling it. And never, ever believe anyone who tells you the data is clean.