Implementing Data Catalogs for AI-Ready Organizations: The Agency Delivery Guide
A healthcare analytics company with 340 employees and 47 data sources had a problem that sounds almost absurd: nobody knew what data they had. When the data science team wanted to build a patient readmission model, they spent six weeks just finding the right tables, understanding the schemas, and figuring out which fields were reliable. The data engineering team maintained informal knowledge in people's heads and a half-maintained Confluence page from 2023. When a senior data engineer left, critical institutional knowledge walked out the door with her.
An AI agency in Philadelphia was hired to build the readmission model. After the first two weeks of the engagement were consumed by data discovery, the agency made a counter-proposal: before building any models, let us implement a data catalog. The catalog would document every data asset, its lineage, its quality metrics, its owner, and its relationship to other assets. The catalog project cost $65,000 and took eight weeks. But every AI project after that (and the company had six in the pipeline) started two to four weeks faster because the data discovery phase was eliminated.
Over the following 18 months, the agency delivered $1.2 million in AI projects for this client. The data catalog was the foundation that made all of them possible. And the catalog itself generated a $4,000 monthly retainer for maintenance and updates.
Why Data Catalogs Are an AI Agency Opportunity
Most agencies treat data catalogs as a problem for the client's data engineering team. That is a missed opportunity. Here is why data catalogs should be part of your agency's offering:
They eliminate your biggest project risk. The number one reason AI projects go over budget and over time is data discovery: figuring out what data exists, where it lives, whether it is reliable, and how to access it. A data catalog removes this risk from every subsequent project.
They create dependency. Once you build and maintain the catalog, you become the team that understands the client's data landscape better than anyone. Every new AI initiative starts with your catalog and, by extension, with you.
They are a gateway to larger engagements. A data catalog project gives you deep visibility into the client's data assets, quality issues, and gaps. That visibility lets you identify and propose high-value AI projects that the client might not have discovered on their own.
They justify retainers. Catalogs need ongoing maintenance: new data sources, schema changes, quality monitoring, access management. This creates a natural monthly retainer.
What an AI-Ready Data Catalog Includes
A data catalog for a general analytics team and a data catalog for an AI-ready organization are different things. The AI-ready version includes additional metadata and capabilities that standard catalogs often lack.
Standard Catalog Components
Data asset inventory. Every table, file, API endpoint, and data stream documented with:
- Name and description
- Location (database, schema, bucket)
- Owner (person or team responsible)
- Last updated timestamp
- Access permissions
- Related documentation
Schema documentation. For each data asset:
- Column names, types, and descriptions
- Primary keys and foreign key relationships
- Enum values and their meanings
- Nullable fields and default values
- Units of measurement (is revenue in dollars or cents? is distance in miles or kilometers?)
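The inventory and schema fields above can be captured as a small typed record. This is only a sketch: the class names `ColumnDoc` and `AssetEntry`, the field names, and the sample `admissions` table are illustrative assumptions, not the data model of any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class ColumnDoc:
    """Schema documentation for one column (hypothetical structure)."""
    name: str
    dtype: str
    description: str = ""
    nullable: bool = True
    unit: Optional[str] = None  # e.g. "USD cents" or "km"
    enum_values: Dict[str, str] = field(default_factory=dict)


@dataclass
class AssetEntry:
    """Inventory record for one table, file, endpoint, or stream."""
    name: str
    description: str
    location: str
    owner: str
    last_updated: datetime
    columns: List[ColumnDoc] = field(default_factory=list)

    def undocumented_columns(self) -> List[str]:
        """Names of columns that still lack a description."""
        return [c.name for c in self.columns if not c.description.strip()]


# Made-up example asset, used only to exercise the record.
admissions = AssetEntry(
    name="admissions",
    description="One row per inpatient admission.",
    location="warehouse.clinical.admissions",
    owner="data-engineering",
    last_updated=datetime(2024, 1, 15),
    columns=[
        ColumnDoc("admission_id", "bigint", "Primary key.", nullable=False),
        ColumnDoc("total_charge", "integer", "Billed amount.", unit="USD cents"),
        ColumnDoc("discharge_code", "varchar"),  # no description yet
    ],
)
print(admissions.undocumented_columns())
```

A helper like `undocumented_columns` gives data stewards a concrete to-do list instead of a vague request to "improve documentation."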
Data lineage. How data flows through the organization:
- Which source systems feed which data assets
- Which transformations are applied
- Which downstream assets consume each upstream asset
- When the lineage was last validated
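Lineage of this kind is naturally a directed graph, and the question "what breaks downstream if this source changes?" is a graph traversal. The following is a minimal sketch assuming lineage is stored as a plain dictionary from each asset to its direct consumers; the asset names are invented for illustration.

```python
from collections import deque

# Hypothetical lineage: upstream asset -> assets that read from it directly.
LINEAGE = {
    "ehr.admissions": ["staging.admissions_clean"],
    "staging.admissions_clean": ["marts.readmission_features", "marts.census"],
    "marts.readmission_features": ["ml.readmission_training_set"],
}


def downstream_assets(lineage, start):
    """Return every asset that directly or indirectly consumes `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen - {start}


print(sorted(downstream_assets(LINEAGE, "ehr.admissions")))
```

Real catalogs expose lineage through their own APIs; the traversal logic is the same regardless of where the edges are stored.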
Quality metrics. For each data asset:
- Completeness (null rates per column)
- Freshness (age of the most recent record)
- Volume (row counts and growth rates)
- Consistency (cross-reference checks against related tables)
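Three of these metrics (completeness, freshness, and volume) can be computed mechanically from the rows themselves. The sketch below assumes rows arrive as a list of dictionaries with a load timestamp column; the function names, column names, and sample data are assumptions for illustration. Consistency checks depend on the client's specific tables and are left out.

```python
from datetime import datetime


def null_rates(rows, columns):
    """Completeness: fraction of rows where each column is None."""
    if not rows:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) is None) / len(rows)
        for col in columns
    }


def freshness_hours(rows, ts_column, now):
    """Freshness: hours between `now` and the most recent record."""
    if not rows:
        return None
    latest = max(row[ts_column] for row in rows)
    return (now - latest).total_seconds() / 3600


def quality_snapshot(rows, columns, ts_column, now):
    """Bundle volume, completeness, and freshness into one record."""
    return {
        "row_count": len(rows),
        "null_rates": null_rates(rows, columns),
        "freshness_hours": freshness_hours(rows, ts_column, now),
    }


# Invented sample rows.
SAMPLE_ROWS = [
    {"patient_id": 1, "discharge_code": "HOME", "loaded_at": datetime(2024, 3, 1, 6, 0)},
    {"patient_id": 2, "discharge_code": None, "loaded_at": datetime(2024, 3, 1, 9, 0)},
    {"patient_id": 3, "discharge_code": "SNF", "loaded_at": datetime(2024, 3, 1, 12, 0)},
    {"patient_id": 4, "discharge_code": "HOME", "loaded_at": datetime(2024, 3, 1, 3, 0)},
]

snapshot = quality_snapshot(
    SAMPLE_ROWS,
    columns=["patient_id", "discharge_code"],
    ts_column="loaded_at",
    now=datetime(2024, 3, 1, 18, 0),
)
print(snapshot)
```

In practice these numbers would usually come from SQL aggregates run on a schedule rather than from rows pulled into memory, with growth rates derived by comparing successive snapshots.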
AI-Specific Catalog Extensions
Feature registry. Which data assets have been used as features in ML models? What transformations were applied? Which models consume which features? This prevents duplicate feature engineering work across projects.
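A feature registry can start as little more than a lookup table. The sketch below is an in-memory illustration under assumed names (`FeatureRegistry`, `find_existing`, and the sample feature are all hypothetical); a production version would persist to the catalog or a feature store.

```python
from collections import defaultdict


class FeatureRegistry:
    """Tracks where each feature comes from and which models use it."""

    def __init__(self):
        self._features = {}                 # feature name -> definition
        self._consumers = defaultdict(set)  # feature name -> model names

    def register(self, name, source_asset, transformation):
        if name in self._features:
            raise ValueError(f"feature {name!r} is already registered")
        self._features[name] = {
            "source_asset": source_asset,
            "transformation": transformation,
        }

    def find_existing(self, source_asset, transformation):
        """Return the name of a feature already built this way, if any."""
        for name, spec in self._features.items():
            if (spec["source_asset"], spec["transformation"]) == (
                source_asset,
                transformation,
            ):
                return name
        return None

    def link_model(self, model, feature):
        if feature not in self._features:
            raise KeyError(f"unknown feature {feature!r}")
        self._consumers[feature].add(model)

    def models_using(self, feature):
        return sorted(self._consumers.get(feature, set()))


registry = FeatureRegistry()
registry.register("prior_admits_12m", "marts.admissions", "count over trailing 365 days")
registry.link_model("readmission_v1", "prior_admits_12m")
registry.link_model("los_forecast_v2", "prior_admits_12m")
print(registry.models_using("prior_admits_12m"))
```

Calling `find_existing` before building a new feature is the step that prevents duplicate engineering work.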
Label and target documentation. For supervised learning, which columns serve as prediction targets? What are the known biases or issues with the labels? When were labels last validated?
Training dataset metadata. For each model, which data assets contributed to the training set? What was the date range? What filters were applied? This enables model reproducibility.
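One lightweight way to record this is a manifest per training run with a fingerprint that changes whenever the inputs change. This is a sketch under stated assumptions: the function name, manifest keys, and example values are hypothetical, and filters are treated as order-independent (as if combined with AND).

```python
import hashlib
import json


def training_manifest(model, source_assets, date_range, filters):
    """Describe what went into a training set and fingerprint it."""
    manifest = {
        "model": model,
        "source_assets": sorted(source_assets),  # order should not matter
        "date_range": list(date_range),          # [start, end] as ISO dates
        "filters": sorted(filters),
    }
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["fingerprint"] = hashlib.sha256(canonical).hexdigest()
    return manifest


manifest = training_manifest(
    model="readmission_v1",
    source_assets=["marts.readmission_features", "marts.census"],
    date_range=("2022-01-01", "2023-12-31"),
    filters=["age >= 18", "encounter_type = 'inpatient'"],
)
print(manifest["fingerprint"][:12])
```

Two runs with the same fingerprint used the same declared inputs; a differing fingerprint is a prompt to check what changed. Note that this captures declared metadata only, not the contents of the underlying tables.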
Data quality for ML. Beyond standard quality metrics, document:
- Class distribution for classification targets (imbalance levels)
- Distribution statistics for key features (mean, median, standard deviation, skewness)
- Known data drift patterns (seasonal shifts, trend changes)
- Historical quality incidents and their resolution
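The first two items in this list are straightforward to compute with the Python standard library. A minimal sketch follows; the function names and sample values are assumptions, and skewness here is the population (biased) moment coefficient rather than any particular library's default.

```python
import statistics
from collections import Counter


def class_distribution(labels):
    """Share of each class in a classification target."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def feature_profile(values):
    """Mean, median, population std, and population skewness."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        skewness = 0.0  # constant feature: skewness is undefined, report 0
    else:
        skewness = sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "skewness": skewness,
    }


# Invented example: 10% positive class, and a symmetric feature.
labels = [0] * 9 + [1]
print(class_distribution(labels))
print(feature_profile([1, 2, 3, 4, 5]))
```

Storing these profiles per training run also gives a baseline for spotting drift later: a large shift in mean or class share between snapshots is worth a catalog annotation.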
Privacy and compliance annotations. Which fields contain PII? Which data assets fall under HIPAA, GDPR, or other regulations? What anonymization or masking is required before use in ML training?
Sensitivity classification. Public, internal, confidential, or restricted, classified at the column level, not just the table level. A customer table might have public columns (product preferences) and restricted columns (SSN) in the same asset.
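Column-level classification can be modeled as an ordered enumeration, which makes "what can this role see?" a simple comparison. The sketch below uses the four levels named above; the column names other than product preferences and SSN, and the function names, are illustrative assumptions.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Ordered so that a higher value means more restricted."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical customer table with mixed sensitivity in one asset.
CUSTOMER_COLUMNS = {
    "product_preferences": Sensitivity.PUBLIC,
    "account_tier": Sensitivity.INTERNAL,
    "email": Sensitivity.CONFIDENTIAL,
    "ssn": Sensitivity.RESTRICTED,
}


def visible_columns(column_levels, clearance):
    """Columns a user with the given clearance may read."""
    return [name for name, level in column_levels.items() if level <= clearance]


def table_level(column_levels):
    """A table is as sensitive as its most sensitive column."""
    return max(column_levels.values())


print(visible_columns(CUSTOMER_COLUMNS, Sensitivity.INTERNAL))
print(table_level(CUSTOMER_COLUMNS).name)
```

Classifying only at the table level would force this whole table to be treated as restricted, locking the public columns away from teams that could safely use them.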
Technology Options
Open Source
Apache Atlas. A mature, widely adopted open-source option. Strong lineage tracking, integration with the Hadoop ecosystem, and extensible classification and tagging. Best for organizations already invested in the Hadoop ecosystem.
DataHub (LinkedIn). Modern, API-first catalog with strong search, lineage, and metadata management. Excellent developer experience and growing community. Currently the strongest open-source option for most use cases.
OpenMetadata. Full-featured catalog with built-in data quality, lineage, and governance. Strong UI and good integration with modern data stack tools (dbt, Airflow, Spark).
Amundsen (Lyft). Focused on data discovery and search. Simpler than DataHub or OpenMetadata but lighter-weight and faster to deploy.
Managed Services
Alation. Enterprise-grade catalog with strong data governance and compliance features. The market leader for large enterprises. Premium pricing.
Collibra. Data intelligence platform with catalog, governance, and lineage. Extremely comprehensive but complex and expensive.
AWS Glue Data Catalog. Native to AWS; Glue crawlers can automatically discover schemas in S3 and in connected databases. Limited in metadata richness but frictionless for AWS-native clients.
Google Dataplex. Google Cloud's data management platform with built-in cataloging. Strong for GCP-native organizations.
Recommendation for Agency Work
For most agency engagements, use DataHub or OpenMetadata. Both are open source, capable, and have active communities. The choice between them depends on the client's existing stack and preferences.
For enterprise clients with compliance requirements, recommend Alation or Collibra. The governance and compliance features justify the cost in regulated industries.
For quick, lightweight implementations, use the cloud provider's native catalog (Glue Data Catalog for AWS, Dataplex for GCP). These integrate seamlessly with the rest of the cloud platform.
Delivery Playbook
Phase 1: Discovery and Design (Weeks 1-3)
- Interview data owners across departments to identify data assets
- Map data flows between systems (even roughly)
- Assess existing documentation and metadata
- Select the catalog technology based on requirements and client infrastructure
- Define the metadata schema (what fields to capture for each asset type)
- Design the governance model (who owns what, who approves access)
Phase 2: Platform Setup (Weeks 3-5)
- Deploy the catalog platform
- Configure authentication and authorization
- Set up integrations with source systems for automated metadata extraction
- Configure automated schema crawling
- Set up the data quality monitoring integration
Phase 3: Initial Population (Weeks 5-8)
- Crawl and register the highest-priority data assets (start with assets needed for upcoming AI projects)
- Enrich automated metadata with human-authored descriptions, business context, and quality annotations
- Document data lineage for critical paths
- Tag data assets with AI-relevant metadata (feature registry, label documentation)
- Validate accuracy with data owners
Phase 4: Governance and Adoption (Weeks 8-10)
- Define and implement data access request workflows
- Create data steward roles and responsibilities
- Train data owners on how to maintain their catalog entries
- Implement change management processes (how to add new assets, update descriptions, report issues)
- Create onboarding documentation for new team members
Phase 5: AI Integration (Weeks 10-12)
- Integrate the catalog with ML experiment tracking tools (MLflow, W&B)
- Implement the feature registry linking features to catalog assets
- Build automated documentation for training datasets (linking models to the data versions they used)
- Create dashboards showing AI data usage and quality metrics
Making the Catalog Stick
The biggest risk with data catalogs is abandonment. Catalogs become stale when:
- Nobody updates them when data changes
- Nobody uses them because search is bad
- Nobody is accountable for accuracy
- The effort to maintain entries exceeds the perceived value
Strategies to prevent abandonment:
Automate everything possible. Schema changes should be detected automatically. Quality metrics should be computed automatically. Freshness should be tracked automatically. The less manual work required, the more likely the catalog stays current.
Make it the default entry point. When a data scientist starts a new project, the catalog should be the first place they go. When a new employee onboards, the catalog tour should be part of their first week. Build the habit.
Assign data stewards with explicit accountability. Each data domain should have a named steward responsible for catalog accuracy. Include catalog maintenance in their job description and performance reviews.
Show the value constantly. Track metrics: how many searches per week, how many data assets accessed, how much time saved on data discovery. Share these metrics with leadership.
Integrate with workflows. The catalog should not be a standalone tool that people visit occasionally. Integrate it with Jupyter notebooks (auto-link to catalog entries when data is loaded), Slack (chatbot that answers data questions from the catalog), and project management tools (link AI project tasks to catalog assets).
Common Pitfalls in Data Catalog Delivery
Pitfall 1: Trying to catalog everything at once. A comprehensive catalog of every data asset in a large enterprise is a multi-month project. Start with the data assets needed for the current and next two AI projects. Expand from there based on demand.
Pitfall 2: Over-engineering the metadata schema. Capturing 50 metadata fields per data asset is overwhelming for data stewards and leads to incomplete entries. Start with 10-15 essential fields. Add more as the catalog matures and the team sees the value.
Pitfall 3: Not automating schema crawling. If data stewards have to manually update the catalog every time a table schema changes, it will be perpetually out of date. Automate schema detection and change tracking from day one.
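The core of automated change tracking is a diff between the last recorded schema and the one just crawled. A minimal sketch, assuming each schema is a dictionary of column name to type string; the function name and sample columns are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column: type} mappings and report what changed."""
    return {
        "added": {col: new[col] for col in new if col not in old},
        "removed": {col: old[col] for col in old if col not in new},
        "type_changed": {
            col: (old[col], new[col])
            for col in old
            if col in new and old[col] != new[col]
        },
    }


def has_changes(diff):
    """True if any category of change is non-empty."""
    return any(diff.values())


# Invented example: one column retyped, one dropped, one added.
recorded = {"admission_id": "bigint", "total_charge": "integer", "ward": "varchar"}
crawled = {"admission_id": "bigint", "total_charge": "numeric", "unit_code": "varchar"}

changes = diff_schema(recorded, crawled)
print(changes)
```

Run on a schedule after each crawl, a non-empty diff can update the catalog entry and notify the asset's steward, so nobody has to remember to do it by hand.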
Pitfall 4: Treating the catalog as a documentation project. A catalog that is just documentation, even well-written documentation, will be abandoned. The catalog must be operational: integrated with query tools, connected to quality monitoring, and embedded in data access workflows. It should be the tool people use daily, not a reference they consult occasionally.
Pitfall 5: Ignoring data quality metadata. A catalog that tells you a dataset exists but not whether it is reliable is incomplete for AI work. Every catalog entry should include freshness, completeness, and quality metrics that are updated automatically.
Pricing Data Catalog Projects
- Phase 1 (Discovery and design): $10,000 - $20,000
- Phase 2 (Platform setup): $15,000 - $30,000
- Phase 3 (Initial population): $20,000 - $40,000
- Phase 4 (Governance and adoption): $10,000 - $20,000
- Phase 5 (AI integration): $10,000 - $20,000
- Total typical engagement: $65,000 - $130,000
Monthly maintenance retainer: $3,000 - $6,000 for new asset onboarding, quality monitoring, governance reviews, and platform updates.
Bundle with AI projects. The most effective pricing strategy is to include the data catalog as a line item in a larger AI engagement. "Phase 0: Data Foundation" covers the catalog, and every subsequent phase benefits from it. Clients who might resist paying $80,000 for a standalone catalog are often far more willing to accept it as part of a $300,000 AI platform build.
Your Next Step
On your next client engagement, spend the first day documenting how data discovery currently works. Ask three data scientists: "How do you find the data you need for a new project? How long does it take? What frustrates you about the process?" If the answers involve "I ask Dave" or "I dig through our data warehouse schema" or "it takes weeks," you have a data catalog opportunity. Quantify the time spent on data discovery across the AI team (hours per week x hourly cost) and present the catalog as an investment that eliminates that cost permanently.
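The quantification step is simple arithmetic, sketched below. The team size, hours, and hourly cost are made-up inputs to show the calculation, not benchmarks; the $65,000 catalog price is the figure from the case study at the start of this guide.

```python
def weekly_discovery_cost(people, hours_per_week_each, hourly_cost):
    """Hours per week x hourly cost, summed across the AI team."""
    return people * hours_per_week_each * hourly_cost


def payback_weeks(catalog_price, weekly_cost):
    """Weeks until avoided discovery cost equals the catalog price."""
    if weekly_cost <= 0:
        raise ValueError("weekly_cost must be positive")
    return catalog_price / weekly_cost


# Hypothetical inputs: 5 data scientists, 6 hours/week each, $95/hour.
weekly = weekly_discovery_cost(people=5, hours_per_week_each=6, hourly_cost=95)
annual = weekly * 52
payback = payback_weeks(catalog_price=65_000, weekly_cost=weekly)

print(f"Weekly discovery cost: ${weekly:,}")
print(f"Annual discovery cost: ${annual:,}")
print(f"Payback period: {payback:.1f} weeks")
```

This assumes the catalog removes all of the measured discovery time; for a more conservative pitch, multiply `weekly` by the fraction you expect to recover.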