A ten-person AI agency in Seattle built a sentiment analysis model for a consumer goods company. They trained the model on a combination of publicly scraped product reviews, licensed datasets from a data broker, and proprietary customer feedback data the client provided. The model performed exceptionally well. Then the data broker sent a cease-and-desist notice: the license agreement covered "analytical use" but explicitly excluded "machine learning training." The agency had used the licensed data in a way the license did not permit. Retraining the model without the licensed data reduced accuracy by 12 percentage points, and the client was not happy. The agency spent $65,000 on legal fees and had to negotiate a new license at triple the original cost — $180,000 instead of $60,000.
Training data rights are not a theoretical legal concern. They are an operational reality that affects what you can build, how you build it, and whether the models you deliver can withstand legal scrutiny. Every dataset you use to train a model comes with rights, restrictions, and obligations. If you do not understand those rights before you start training, you are building on a foundation that could collapse.
The Training Data Rights Landscape
Training data rights exist on a spectrum from fully open to fully restricted. Understanding where your data falls on this spectrum is the first step in any training data governance framework.
Public Domain and Open Data
Data in the public domain — government datasets, expired-copyright works, voluntarily released data — can generally be used for AI training without restriction. However, even public domain data requires careful analysis.
Considerations:
- Government data may be public but subject to use restrictions in certain jurisdictions
- Public domain status varies by country — a work may be public domain in the US but protected in the EU
- "Publicly available" does not mean "public domain" — social media posts are publicly accessible but not in the public domain
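- Open licenses are not a single set of terms. The Creative Commons family, for example, attaches conditions such as attribution, share-alike, and non-commercial use, and whether a given condition permits AI training has to be checked license by license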
Licensed Datasets
Commercial data providers sell access to datasets under license agreements. These licenses define what you can do with the data, and the terms vary enormously.
Common license restrictions that affect AI training:
- Purpose restrictions — The license may limit use to specific purposes (analytics, research, commercial deployment)
- Machine learning exclusions — Some licenses explicitly exclude machine learning training
- Derivative works restrictions — Training an AI model on licensed data may constitute creating a derivative work, which some licenses prohibit
- Distribution restrictions — If the trained model effectively memorizes or can reproduce licensed data, distributing the model may violate distribution restrictions
- Sublicensing restrictions — Delivering a trained model to a client may constitute sublicensing the underlying data
Due diligence requirements:
- Read every data license agreement completely before using data for training
- Map specific license terms to your intended use (training, evaluation, deployment)
- Identify restrictions that could affect your ability to deliver or license the resulting model
- Consult legal counsel when license terms are ambiguous about AI training
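The mapping step above can be made mechanical once someone has read the agreement and recorded its terms. The following is a minimal sketch under that assumption. The `LicenseTerms` class, its field names, and the use labels such as `ml_training` are hypothetical, not a standard schema, and the output is a prompt for legal review rather than a legal conclusion.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LicenseTerms:
    """Hand-recorded summary of one data license (hypothetical schema)."""
    dataset: str
    permitted_uses: frozenset = field(default_factory=frozenset)
    excluded_uses: frozenset = field(default_factory=frozenset)


def uncovered_uses(terms: LicenseTerms, intended_uses: set) -> dict:
    """Split intended uses into explicitly excluded and not mentioned at all."""
    result = {"excluded": set(), "ambiguous": set()}
    for use in intended_uses:
        if use in terms.excluded_uses:
            result["excluded"].add(use)
        elif use not in terms.permitted_uses:
            # The license is silent on this use: a question for counsel.
            result["ambiguous"].add(use)
    return result


# Terms modeled on the broker license in the opening example.
broker_license = LicenseTerms(
    dataset="broker_reviews",
    permitted_uses=frozenset({"analytics"}),
    excluded_uses=frozenset({"ml_training"}),
)
report = uncovered_uses(
    broker_license, {"analytics", "ml_training", "model_delivery"}
)
print(report)
```

Here `ml_training` lands in the excluded set and `model_delivery` in the ambiguous set. Keeping "silent" separate from "excluded" is deliberate: ambiguous terms are the ones to take to counsel before training starts.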
Client-Provided Data
Most agency AI projects involve client-provided training data. This data comes with its own rights considerations.
Questions to resolve:
- Ownership — Does the client own the data outright, or does it include third-party data the client has licensed?
- Privacy — Does the data contain personal information subject to privacy regulations?
- Consent — Was the data collected with consent that covers AI training use?
- Contractual restrictions — Is the client subject to any contractual restrictions on sharing or using the data for AI training?
- Employment and labor data — Does the data include employee-generated content that may have IP implications?
Web-Scraped Data
Web scraping for AI training data is one of the most legally contested areas in AI right now.
Current landscape:
- Multiple lawsuits are challenging the legality of using scraped web content for AI training
- The legal theories include copyright infringement, violation of terms of service, and unfair competition
- Judicial outcomes have been mixed, with no definitive resolution
- The EU AI Act requires documentation of training data sources, making scraped data harder to use without disclosure
- Several jurisdictions are considering or have enacted legislation specifically addressing web scraping for AI training
Risk assessment for agencies:
- Using scraped data for internal research carries lower risk than using it in client-facing products
- Models trained on scraped data may face legal challenges that could affect your clients
- The reputational risk of training on scraped data is increasing as public awareness grows
- Consider whether the risk is worth it when licensed alternatives exist
Synthetic and Generated Data
Synthetic data — data generated by AI models or statistical processes — offers an alternative to the rights complications of real-world data.
Considerations:
- Synthetic data generated by your own models is generally free of third-party rights issues
- Synthetic data generated by third-party AI services may be subject to those services' terms of use
- Quality and representativeness of synthetic data must be validated
- Synthetic data may inherit biases from the models or distributions used to generate it
Ownership Frameworks for Training Data
Who Owns What
Training data ownership is not a single question — it is a matrix of ownership claims across multiple data types and multiple parties.
Raw training data: Typically owned by whoever collected or created it. Client-provided data remains the client's property. Licensed data remains the licensor's property. Data you collect or create belongs to your agency.
Curated and prepared datasets: When you clean, label, augment, and prepare training data, the result may qualify as a derivative work of the underlying data. Ownership depends on the underlying data rights, the value added through curation, and what your contract says. For client-provided data, rights in the curated dataset are typically layered: the client owns the underlying data, while your agency can claim the curation methodology and, if the contract allows it, the added labels. Spell out this split in the agreement rather than relying on defaults.
Feature-engineered data: Feature engineering transforms raw data into model-ready features. The feature engineering methodology is your agency's intellectual property. The resulting feature sets may be considered derivative works of the underlying data.
Model weights: Trained model weights represent a transformation of training data through your agency's model architecture, training methodology, and optimization process. Ownership of model weights is one of the most complex questions in AI IP law. A common arrangement in agency engagements is for the agency to retain ownership of the weights while granting the client a license to use the trained model, but this is a negotiated outcome, not a default, and it should be written into the contract.
Evaluation and benchmark data: Test datasets, benchmark results, and evaluation metrics created during the training process represent valuable intellectual property. Define ownership of these artifacts in your agreements.
Contractual Frameworks for Training Data Rights
Your contracts with clients need to address training data rights explicitly. Here are the key provisions.
Data license from client to agency:
- Grant: Client grants agency a license to use provided data for the defined AI project
- Scope: Training, evaluation, validation, and model improvement
- Duration: Duration of the project plus a defined post-project period
- Restrictions: No use for other clients, no sharing with third parties, no use beyond the defined project scope
Representations and warranties:
- Client represents that they have the right to share the data for AI training purposes
- Client represents that the data was collected in compliance with applicable laws
- Client represents that necessary consents for AI training use have been obtained
- Agency represents that data will be used only within the licensed scope
Post-project data rights:
- Define what happens to training data when the project ends
- Address whether the agency can retain anonymized or aggregated insights
- Clarify whether the agency can use learnings (not data) from the project for other engagements
- Define model weight ownership and license terms
Privacy and Consent Considerations
Training data that contains personal information introduces a layer of privacy regulation on top of IP and contract law.
GDPR Implications
If your training data includes personal data of individuals in the EU, GDPR is likely to apply, even if your agency is based elsewhere.
Key requirements:
- Legal basis for processing — AI training requires a legal basis. Legitimate interest is the most common basis for B2B AI training, but it requires a documented legitimate interest assessment
- Purpose limitation — Data collected for one purpose cannot be freely repurposed for AI training without either new consent or a compatible purpose assessment
- Data minimization — Use only the personal data necessary for training — do not train on full personal records when anonymized data would suffice
- Right to erasure — Individuals have the right to request deletion of their personal data, which creates challenges for models already trained on that data
- Data Protection Impact Assessment — AI training on personal data likely requires a DPIA
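The right to erasure is easier to handle if you can answer one question quickly: which models were trained on this person's data? The sketch below shows one way to keep that lineage. It assumes you record subject-to-dataset and dataset-to-model links at ingestion and training time. The `TrainingLineage` class and all identifiers are hypothetical, and the code only identifies affected models. It does not decide whether retraining is legally required.

```python
from collections import defaultdict


class TrainingLineage:
    """Tracks which datasets hold a subject's records and which models used them."""

    def __init__(self):
        self._subject_to_datasets = defaultdict(set)
        self._dataset_to_models = defaultdict(set)

    def record_subject(self, subject_id: str, dataset: str) -> None:
        """Note that a dataset contains records about this data subject."""
        self._subject_to_datasets[subject_id].add(dataset)

    def record_training(self, model: str, dataset: str) -> None:
        """Note that a model was trained on this dataset."""
        self._dataset_to_models[dataset].add(model)

    def models_affected_by_erasure(self, subject_id: str) -> set:
        """Return every model trained on any dataset holding the subject's data."""
        affected = set()
        for dataset in self._subject_to_datasets.get(subject_id, set()):
            affected |= self._dataset_to_models.get(dataset, set())
        return affected


lineage = TrainingLineage()
lineage.record_subject("subject-001", "crm_feedback_2023")
lineage.record_training("sentiment-v2", "crm_feedback_2023")
lineage.record_training("churn-v1", "support_tickets")

print(lineage.models_affected_by_erasure("subject-001"))
```

In this example only `sentiment-v2` is returned, because `churn-v1` was trained on a dataset that does not contain the subject's records.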
CCPA and US State Privacy Laws
The California Consumer Privacy Act (CCPA) and similar state laws create additional requirements.
Key requirements:
- Disclosure — Consumers must be informed that their data is being used for AI training
- Opt-out rights — Consumers may have the right to opt out of having their data used for AI training
- Sale restrictions — Transferring personal data to an agency for AI training may constitute a "sale" under CCPA, triggering additional obligations
- Purpose limitations — Data can only be used for disclosed purposes
Practical Privacy Approaches for Agencies
Anonymize aggressively. If you do not need personally identifiable information for training, strip it before training begins. Anonymization reduces privacy risk and simplifies compliance.
Document your legal basis. For every dataset containing personal data, document why you have the right to use it for training. Keep this documentation current.
Build privacy into your data pipeline. Implement technical measures that enforce privacy requirements — data anonymization, access controls, audit trails, retention enforcement.
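Get explicit consent when possible. If you can obtain explicit consent that specifically covers AI training use, do so. Well-documented, specific consent is difficult to challenge. Keep in mind that under GDPR consent can be withdrawn, so decide in advance how you will handle a withdrawal for data that has already been used in training.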
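As an illustration of stripping identifiers before training, here is a minimal redaction sketch. It assumes free-text feedback in which email addresses and US-style phone numbers are the identifiers of concern. The patterns are deliberately simple and will miss names, addresses, and other formats, so treat the result as redaction, not as anonymization in the legal sense.

```python
import re

# Simple illustrative patterns; production pipelines need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Great product! Contact me at jane.doe@example.com or 206-555-0143."
print(redact(sample))
# Great product! Contact me at [EMAIL] or [PHONE].
```

Running redaction as a fixed pipeline stage, rather than as an ad hoc script, also gives you a single place to log what was removed for your audit trail.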
Emerging Legal and Regulatory Developments
Training data transparency requirements. The EU AI Act requires providers of certain AI systems to document the training data used, including sources, scope, and any known biases. This creates a documentation obligation that affects how you track and manage training data rights.
Copyright challenges to AI training. Multiple lawsuits are challenging whether using copyrighted works for AI training constitutes fair use (US) or falls under permitted exceptions (EU). The outcomes of these cases will significantly affect what data is legally available for AI training.
Training data marketplaces. Commercial marketplaces for AI training data are emerging, offering licensed datasets with clear terms for AI training use. These marketplaces may become the standard source for training data as legal risks around other sources increase.
Data unions and collective licensing. Content creators and data subjects are organizing to collectively negotiate training data terms. This could create new licensing models that simplify rights acquisition for agencies.
Building a Training Data Governance Program
Step 1: Data Inventory
Catalog every dataset used for AI training across all projects. For each dataset, document the source, the rights basis (ownership, license, public domain), any restrictions, and the projects that used it.
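An inventory does not need special tooling to start. The sketch below records the fields named above (source, rights basis, restrictions, projects) as a small data structure and exports it as JSON. The `DatasetRecord` class, the `RightsBasis` values (including the added `unknown` value), and the sample entries are assumptions for illustration. A spreadsheet with the same columns would serve equally well.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class RightsBasis(str, Enum):
    OWNED = "owned"
    LICENSED = "licensed"
    PUBLIC_DOMAIN = "public_domain"
    UNKNOWN = "unknown"  # not yet assessed: itself a finding


@dataclass
class DatasetRecord:
    """One row of the training data inventory (hypothetical schema)."""
    name: str
    source: str
    rights_basis: RightsBasis
    restrictions: list = field(default_factory=list)
    projects: list = field(default_factory=list)


inventory = [
    DatasetRecord(
        name="broker_reviews",
        source="data broker",
        rights_basis=RightsBasis.LICENSED,
        restrictions=["no machine learning training"],
        projects=["sentiment-model"],
    ),
    DatasetRecord(
        name="scraped_reviews",
        source="public web",
        rights_basis=RightsBasis.UNKNOWN,
        projects=["sentiment-model"],
    ),
]

# The str mixin on RightsBasis lets json serialize the enum as plain text.
print(json.dumps([asdict(record) for record in inventory], indent=2))
```

Recording `unknown` explicitly, instead of leaving the field blank, makes unassessed datasets easy to find when you move on to the rights assessment.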
Step 2: Rights Assessment
For each dataset in your inventory, assess whether your current rights basis permits your actual use. Pay particular attention to licensed datasets where terms may not have explicitly contemplated AI training.
Step 3: Gap Remediation
Where your rights assessment reveals gaps — datasets used without proper rights, licenses that may not cover AI training, client data without proper representations — take corrective action. Obtain proper licenses, negotiate updated terms, or retrain models without problematic data.
Step 4: Ongoing Governance
Implement processes that prevent training data rights issues from arising in future projects.
- Require rights assessment before any dataset is used for training
- Include training data rights provisions in all client contracts
- Review data licenses for AI training compatibility before purchase
- Train your team on training data rights basics
- Conduct annual audits of training data rights compliance
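The first item in this list, requiring a rights assessment before training, can be enforced in the training pipeline itself. The sketch below is one possible gate, assuming a registry that maps each dataset to the uses approved in its rights assessment. The `APPROVED_USES` registry, the `require_cleared` function, and the dataset names are hypothetical.

```python
class RightsNotClearedError(Exception):
    """Raised when a dataset lacks an approved rights assessment for a use."""


# Hypothetical registry: dataset name -> uses approved in a rights assessment.
APPROVED_USES = {
    "client_feedback": {"training", "evaluation"},
    "broker_reviews": {"analytics"},
}


def require_cleared(datasets, use="training", registry=APPROVED_USES):
    """Raise unless every dataset has an assessment approving the given use."""
    # Datasets missing from the registry count as not cleared.
    blocked = [d for d in datasets if use not in registry.get(d, set())]
    if blocked:
        names = ", ".join(sorted(blocked))
        raise RightsNotClearedError(f"Not cleared for {use}: {names}")


require_cleared(["client_feedback"])  # passes silently

try:
    require_cleared(["client_feedback", "broker_reviews", "scraped_reviews"])
except RightsNotClearedError as err:
    print(err)
# Not cleared for training: broker_reviews, scraped_reviews
```

Calling a check like this at the top of every training script means an unassessed dataset stops the job rather than surfacing later in a cease-and-desist letter.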
Your Next Step
Create a training data inventory for your agency. List every dataset used for AI training in the last 12 months. For each dataset, answer three questions: Do you have clear legal rights to use this data for AI training? Are those rights documented? Would those rights hold up under legal scrutiny?
If you cannot answer yes to all three questions for every dataset, you have training data rights gaps that need immediate attention. Start with your highest-risk datasets — those used in client-facing production models — and work through the rights assessment and remediation process.
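The three questions and the prioritization rule can be captured in a few lines. This is a minimal sketch, assuming you record a yes or no answer per question for each dataset. The `RightsCheck` class, its field names, and the sample datasets are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RightsCheck:
    """Answers to the three inventory questions for one dataset."""
    dataset: str
    has_clear_rights: bool
    rights_documented: bool
    survives_scrutiny: bool
    in_client_production: bool

    @property
    def has_gap(self) -> bool:
        """A gap exists if any of the three answers is no."""
        return not (
            self.has_clear_rights
            and self.rights_documented
            and self.survives_scrutiny
        )


def remediation_queue(checks):
    """Return datasets with gaps, client-facing production datasets first."""
    gaps = [c for c in checks if c.has_gap]
    return sorted(gaps, key=lambda c: (not c.in_client_production, c.dataset))


checks = [
    RightsCheck("internal_benchmark", True, False, True, False),
    RightsCheck("broker_reviews", False, True, False, True),
    RightsCheck("client_feedback", True, True, True, True),
]

for check in remediation_queue(checks):
    print(check.dataset)
# broker_reviews
# internal_benchmark
```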
The Seattle agency's $245,000 lesson ($65,000 in legal fees plus the $180,000 replacement license) could have been avoided with a $2,000 license review before training began. Training data governance is not expensive. Training data problems are.