A healthcare AI agency in Philadelphia deployed a diagnostic assistance model for a hospital network. The model was performing well clinically, but a security audit revealed a critical vulnerability: an attacker could reconstruct individual patient records from the training data by systematically querying the model with slight variations of patient attributes and observing how the model's confidence changed. This membership inference attack meant that the model was effectively leaking protected health information through its API. The agency had hardened their web application security (firewalls, authentication, encryption) but had not considered the unique attack surfaces that ML systems introduce. The remediation required differential privacy mechanisms, output perturbation, query rate limiting, and a complete re-evaluation of their model's memorization behavior. It cost $140,000 in unplanned work and delayed the deployment by two months.
ML systems introduce attack surfaces that traditional software security does not address. The model itself can be attacked, the training data can be poisoned, the inference pipeline can be manipulated, and the model's outputs can leak sensitive information. For AI agencies delivering production ML systems, security hardening is not an afterthought; it is a delivery requirement that affects architecture decisions from day one.
ML-Specific Attack Surfaces
Adversarial Attacks on Model Inputs
Adversarial attacks craft inputs designed to cause the model to produce incorrect outputs. These attacks exploit the model's learned decision boundaries.
Evasion attacks modify inputs at inference time to fool the model:
- Image perturbation: Imperceptible pixel changes that cause an image classifier to misclassify (a stop sign classified as a speed limit sign)
- Text manipulation: Character substitutions, homoglyph attacks, and paraphrasing that change a text classifier's output (a toxic message classified as benign)
- Feature manipulation: Modifying input features to change a tabular model's prediction (tweaking a loan application to change the fraud score)
Impact for agencies: If your client's model makes consequential decisions (fraud detection, medical diagnosis, content moderation), adversarial attacks can cause financial loss, safety hazards, or regulatory violations.
Data Poisoning Attacks
Data poisoning attacks corrupt the training data to compromise the model's behavior.
Training data poisoning: An attacker injects malicious examples into the training dataset. These examples cause the model to learn incorrect associations. For example, the model may learn that a specific pattern in a transaction is always benign, allowing the attacker to use that pattern to bypass fraud detection.
Label poisoning: The attacker corrupts training labels rather than input data. This is easier to execute (flipping labels is simpler than crafting adversarial examples) and harder to detect (the input data looks normal).
Backdoor attacks: The attacker injects a hidden trigger pattern into training data. The model performs normally on clean inputs but produces a specific (attacker-chosen) output when the trigger pattern is present.
Impact for agencies: If your training data pipeline ingests data from external sources, user-generated content, or third-party data providers, it may be vulnerable to poisoning.
Model Extraction Attacks
Model extraction (or model stealing) attacks reconstruct a copy of the model by querying it and observing its outputs.
Query-based extraction: The attacker sends many queries to the model API and uses the input-output pairs to train a replica model. The replica may not be identical to the original but can approximate its behavior sufficiently to replicate its value or find its vulnerabilities.
Side-channel extraction: The attacker uses timing information, memory access patterns, or power consumption to extract model parameters. This is more relevant for edge-deployed models than cloud-deployed models.
Impact for agencies: Model extraction threatens the intellectual property embodied in trained models. For agencies that invest significant resources in model development, extraction attacks can erode competitive advantage.
Privacy Attacks
Privacy attacks extract information about the training data from the model's outputs.
Membership inference: Determine whether a specific data point was in the training set by observing the model's confidence on that data point. Models typically have higher confidence on training data than on unseen data.
Model inversion: Reconstruct input features from the model's output. Given a model that predicts a person's name from facial features, an attacker can invert the model to reconstruct facial features from a name.
Training data extraction: For large language models, carefully crafted prompts can cause the model to regurgitate verbatim training data, including potentially sensitive information.
Impact for agencies: Privacy attacks are particularly consequential for models trained on personal data (healthcare, finance, HR). They can lead to regulatory violations (HIPAA, GDPR) and reputational damage.
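To make the membership inference idea above concrete, here is a minimal sketch of its simplest variant, a confidence-threshold test. The function names, the 0.9 threshold, and the toy confidence values are illustrative assumptions; practical attacks usually calibrate the threshold, for instance with shadow models.

```python
from typing import List, Sequence


def guess_membership(confidences: Sequence[float], threshold: float = 0.9) -> List[bool]:
    """Guess 'was in the training set' whenever confidence exceeds the threshold."""
    return [c > threshold for c in confidences]


def attack_accuracy(guesses: Sequence[bool], truth: Sequence[bool]) -> float:
    """Fraction of membership guesses that match the ground truth."""
    correct = sum(g == t for g, t in zip(guesses, truth))
    return correct / len(truth)


# Hypothetical confidences: the model is more confident on records it was trained on.
train_conf = [0.99, 0.97, 0.95, 0.88]
unseen_conf = [0.71, 0.93, 0.64, 0.80]
guesses = guess_membership(train_conf + unseen_conf)
truth = [True] * 4 + [False] * 4
print(attack_accuracy(guesses, truth))
```

An accuracy noticeably above 0.5 on a balanced set of members and non-members suggests the model is leaking membership information.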
Defensive Measures for Model Security
Input Validation and Sanitization
The first line of defense is validating and sanitizing all inputs before they reach the model.
Input validation rules:
- Type checking: Verify that all input features match expected data types (numerical, categorical, text, image)
- Range checking: Verify that numerical inputs fall within expected ranges. A human age of 500 or a temperature of -1000 is clearly invalid.
- Format checking: Verify that text inputs conform to expected formats (encoding, length, character set)
- Schema validation: Verify that the input schema matches the model's expected schema: correct number of features, correct feature names, correct data types
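The type, range, and schema checks above can be sketched as a single validation function. The schema contents (feature names and bounds) are hypothetical, and format checking for text inputs is left out to keep the example short.

```python
from typing import Any, Dict, List

# Hypothetical schema: feature name -> (expected type, minimum, maximum).
SCHEMA = {
    "age": (int, 0, 120),
    "temperature_c": (float, 30.0, 45.0),
}


def validate_input(record: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    # Schema check: exactly the expected feature names, no more and no fewer.
    errors += [f"missing feature: {n}" for n in sorted(set(SCHEMA) - set(record))]
    errors += [f"unexpected feature: {n}" for n in sorted(set(record) - set(SCHEMA))]
    for name, (expected_type, low, high) in SCHEMA.items():
        if name not in record:
            continue
        value = record[name]
        # Type check. bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")
            continue
        # Range check. NaN fails this comparison and is rejected as well.
        if not low <= value <= high:
            errors.append(f"{name}: {value} outside [{low}, {high}]")
    return errors
```

Rejecting a request when the returned list is non-empty keeps malformed inputs away from the model entirely, rather than relying on the model to behave sensibly on them.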
Input sanitization for adversarial robustness:
- Image smoothing: Apply a light Gaussian blur to input images. This attenuates the high-frequency perturbations that many adversarial attacks rely on while largely preserving the semantic content of the image.
- Input transformation: Apply random transformations (slight rotation, scaling, cropping) to inputs before inference. This disrupts the specific perturbation patterns that adversarial attacks inject, although attackers who know about the transformation can adapt to it.
- Text normalization: Normalize Unicode characters, remove invisible characters, and standardize encoding. This blunts character-level adversarial manipulations such as invisible-character insertion and many homoglyph substitutions.
- Feature clipping: Clip input features to the range observed in training data. This limits the impact of extreme feature values that may be crafted to manipulate predictions.
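Two of the sanitization steps above, text normalization and feature clipping, fit in a few lines of standard-library Python. This is a sketch under stated assumptions: the function names are illustrative, and NFKC normalization folds compatibility characters (such as fullwidth letters) but does not map look-alike letters from other scripts, which would need a separate confusables table.

```python
import unicodedata
from typing import List, Sequence, Tuple


def normalize_text(text: str) -> str:
    """Apply NFKC normalization and drop invisible format characters."""
    normalized = unicodedata.normalize("NFKC", text)
    # Unicode category "Cf" covers zero-width spaces, joiners, and similar characters.
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


def clip_features(values: Sequence[float],
                  bounds: Sequence[Tuple[float, float]]) -> List[float]:
    """Clip each feature to the (min, max) range observed in training data."""
    return [min(max(v, low), high) for v, (low, high) in zip(values, bounds)]
```

Dropping every format character is a blunt choice: some scripts and emoji sequences use zero-width joiners legitimately, so a production system may need an allowlist.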
Adversarial Training
Train the model to be robust against adversarial inputs by including adversarial examples in the training data.
Adversarial training process:
- Generate adversarial examples from the training data using attack methods (PGD, FGSM for images; character perturbation, paraphrase attacks for text)
- Add adversarial examples to the training set with correct labels
- Train the model on the combined clean and adversarial data
- Evaluate robustness on a separate adversarial test set
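The first two steps of this process can be illustrated with FGSM on a tiny logistic regression model, written in plain Python so the gradient is visible. The model, function names, and epsilon value are assumptions made for illustration; real pipelines would use a deep learning framework and typically regenerate adversarial examples during training.

```python
import math
from typing import List, Sequence, Tuple


def predict_proba(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Probability of class 1 under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def _sign(value: float) -> int:
    return (value > 0) - (value < 0)


def fgsm_example(weights: Sequence[float], bias: float, x: Sequence[float],
                 y: int, epsilon: float) -> List[float]:
    """FGSM: move each feature by epsilon in the direction that increases the loss."""
    p = predict_proba(weights, bias, x)
    # For cross-entropy loss on logistic regression, dLoss/dx_i = (p - y) * w_i.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * _sign(g) for xi, g in zip(x, grad)]


def augment_with_adversarial(weights: Sequence[float], bias: float,
                             dataset: Sequence[Tuple[Sequence[float], int]],
                             epsilon: float) -> List[Tuple[Sequence[float], int]]:
    """Return the clean examples plus one FGSM example each, keeping correct labels."""
    augmented = list(dataset)
    for x, y in dataset:
        augmented.append((fgsm_example(weights, bias, x, y, epsilon), y))
    return augmented
```

Training then proceeds on the augmented set, and robustness is measured on adversarial examples generated from held-out data rather than from the training set.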
Adversarial training tradeoffs:
- Improves robustness against known attack types by 30-60%
- May reduce accuracy on clean data by 1-3%
- Does not protect against all possible attacks, only against the types of perturbations included in training
- Increases training time by 2-3x or more due to the cost of generating adversarial examples (iterative attacks such as PGD cost more per example than single-step FGSM)
Model Output Protection
Protect the model's outputs to prevent information leakage and extraction attacks.
Confidence score perturbation: Add calibrated random noise to confidence scores before returning them. This prevents attackers from using precise confidence values for membership inference or model extraction. The noise level should be tuned to maintain utility while degrading attack effectiveness.
Output rounding: Round confidence scores to a limited number of decimal places (2-3) or return only the top-K predictions without confidence scores. This reduces the information available to attackers without significantly affecting utility.
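Confidence perturbation, rounding, and top-K truncation can be combined in one post-processing step. The sketch below assumes plain Gaussian noise with a fixed standard deviation, which is a simple heuristic rather than a formal privacy mechanism; the function name and default values are illustrative.

```python
import random
from typing import Dict, List, Optional, Tuple


def protect_scores(scores: Dict[str, float],
                   noise_std: float = 0.01,
                   decimals: int = 2,
                   top_k: Optional[int] = None,
                   rng: Optional[random.Random] = None) -> List[Tuple[str, float]]:
    """Add noise to confidence scores, keep the top-k labels, and round the result."""
    rng = rng or random.Random()
    noisy = {}
    for label, score in scores.items():
        # Zero-mean Gaussian noise, clamped back into the valid [0, 1] range.
        noisy[label] = min(1.0, max(0.0, score + rng.gauss(0.0, noise_std)))
    ranked = sorted(noisy.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    return [(label, round(score, decimals)) for label, score in ranked]
```

Note that the perturbed scores no longer sum exactly to 1, and that noise close to the gap between the top two classes can change the returned ranking, so the noise level needs to be tuned against real traffic.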
Prediction rate limiting: Limit the number of queries a single user or API key can make per time period. This makes model extraction attacks impractical by limiting the number of input-output pairs an attacker can collect. Set limits based on legitimate usage patterns.
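One common way to enforce such limits is a sliding-window counter per API key. The following is a minimal in-memory sketch; the class name and parameters are assumptions, and a multi-instance deployment would need shared state (for example in a cache service) instead of a local dictionary.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowRateLimiter:
    """Allow at most max_requests per api_key within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock  # Injectable so the limiter can be tested without sleeping.
        self._history: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, api_key: str) -> bool:
        now = self._clock()
        timestamps = self._history[api_key]
        # Discard requests that have aged out of the window.
        while timestamps and now - timestamps[0] >= self._window:
            timestamps.popleft()
        if len(timestamps) >= self._max:
            return False
        timestamps.append(now)
        return True
```

A serving endpoint would call `allow(api_key)` before running inference and return an HTTP 429 response when it yields `False`.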
Query auditing: Log all queries and monitor for patterns indicative of attacks: systematic exploration of the input space, high query volume from a single source, and queries that systematically vary a single feature while holding others constant.
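The last pattern, varying one feature while holding the rest constant, is easy to test for in a query log. This heuristic sketch assumes queries from one source are available in arrival order as numeric feature vectors; the function names and thresholds are illustrative and would need tuning against legitimate traffic.

```python
from typing import Sequence


def count_single_feature_steps(queries: Sequence[Sequence[float]]) -> int:
    """Count consecutive query pairs that differ in exactly one feature."""
    steps = 0
    for prev, curr in zip(queries, queries[1:]):
        differing = sum(1 for a, b in zip(prev, curr) if a != b)
        if differing == 1:
            steps += 1
    return steps


def looks_like_sweep(queries: Sequence[Sequence[float]],
                     min_fraction: float = 0.8,
                     min_queries: int = 10) -> bool:
    """Flag a source whose queries mostly change a single feature at a time."""
    if len(queries) < min_queries:
        return False
    fraction = count_single_feature_steps(queries) / (len(queries) - 1)
    return fraction >= min_fraction
```

A flagged source is a candidate for manual review or tighter rate limits, not proof of an attack; some legitimate clients, such as what-if analysis tools, produce the same pattern.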
Differential Privacy
Differential privacy provides mathematical guarantees that the model's outputs do not reveal information about any individual training example.
Differentially private training (DP-SGD):
- Add calibrated Gaussian noise to gradient updates during training
- Clip gradients to bound the influence of any single training example
- The privacy guarantee is parameterized by epsilon: lower epsilon means stronger privacy but potentially lower model accuracy
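The clipping and noise steps above can be written out for a single batch. This is a plain-Python sketch of the gradient computation only, with illustrative names; converting a noise multiplier into an epsilon value requires a privacy accountant, which is not shown, and real training would use a DP library rather than hand-rolled code.

```python
import math
import random
from typing import List, Optional, Sequence


def clip_gradient(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale the gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads: Sequence[Sequence[float]],
                    max_norm: float,
                    noise_multiplier: float,
                    rng: Optional[random.Random] = None) -> List[float]:
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    rng = rng or random.Random()
    total = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        for i, g in enumerate(clip_gradient(grad, max_norm)):
            total[i] += g
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * max_norm
    batch_size = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / batch_size for t in total]
```

Clipping happens per example, before summation; clipping the averaged batch gradient instead would not bound any single example's influence.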
Privacy budget management:
- Define a privacy budget (epsilon value) based on the sensitivity of the training data and regulatory requirements
- Track privacy expenditure across training iterations and model queries
- Stop training or serving when the privacy budget is exhausted
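Budget tracking can be sketched as a small ledger that refuses further spending once the limit is reached. The example assumes basic sequential composition, where epsilon values simply add up; the class and exception names are hypothetical, and DP-SGD training in practice relies on tighter accounting methods than this.

```python
class BudgetExhausted(Exception):
    """Raised when an operation would exceed the privacy budget."""


class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float) -> None:
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return max(0.0, self.total_epsilon - self.spent)

    def spend(self, epsilon: float) -> None:
        """Record an expenditure, or raise if it would exceed the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon:
            raise BudgetExhausted(
                f"requested {epsilon}, only {self.remaining} remaining"
            )
        self.spent += epsilon
```

Calling `spend` before each training run or noisy release, and treating `BudgetExhausted` as a hard stop, turns the third bullet above into an enforced rule rather than a convention.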
Practical considerations:
- DP-SGD typically reduces model accuracy by 3-10% compared to non-private training
- The accuracy cost decreases with larger training datasets (more data means each individual's contribution is smaller)
- For highly sensitive data (medical records, financial data), the accuracy cost is justified by the privacy guarantee
- For less sensitive data, other privacy measures (output perturbation, access control) may provide sufficient protection at lower accuracy cost
Training Data Security
Protect the training pipeline against data poisoning attacks.
Data provenance tracking:
- Record the source, collection date, and processing history of every training example
- Maintain a chain of custody for training data from collection through preprocessing to model training
- Use cryptographic hashes to verify data integrity at each pipeline stage
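Hash-based integrity checking can be sketched with a manifest of SHA-256 digests that is written at one pipeline stage and verified at the next. The function names are illustrative, and the example works on in-memory bytes to stay self-contained; a real pipeline would hash files or object-store blobs.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each file name to the SHA-256 digest of its content."""
    return {name: sha256_hex(content) for name, content in files.items()}


def verify_manifest(files: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return the names of files that were added, removed, or modified."""
    problems = []
    for name in sorted(set(files) | set(manifest)):
        if name not in files or name not in manifest:
            problems.append(name)
        elif sha256_hex(files[name]) != manifest[name]:
            problems.append(name)
    return problems
```

The manifest itself must be stored where the data writer cannot modify it; otherwise an attacker who alters the data can simply regenerate the hashes.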
Data quality monitoring:
- Compute statistical properties of each training batch and compare to historical baselines
- Flag batches with anomalous distributions (sudden changes in class balance, unusual feature distributions, outlier examples)
- Manually review flagged batches before including them in training
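As one example of a batch-level check, the sketch below compares the class balance of a new batch against a historical baseline. The function names and the 0.1 tolerance are assumptions; a fuller monitor would also compare feature distributions.

```python
from collections import Counter
from typing import Dict, Sequence


def class_proportions(labels: Sequence[str]) -> Dict[str, float]:
    """Fraction of examples belonging to each class."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def max_proportion_shift(baseline: Sequence[str], batch: Sequence[str]) -> float:
    """Largest absolute change in any class's share between baseline and batch."""
    base = class_proportions(baseline)
    new = class_proportions(batch)
    return max(abs(new.get(c, 0.0) - base.get(c, 0.0)) for c in set(base) | set(new))


def should_flag(baseline: Sequence[str], batch: Sequence[str],
                tolerance: float = 0.1) -> bool:
    """Flag the batch for manual review if class balance moved more than tolerance."""
    return max_proportion_shift(baseline, batch) > tolerance
```

For instance, a baseline with 10% fraud labels and a batch with 50% fraud labels produces a shift of 0.4, well above the tolerance, so the batch would be held for review.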
Anomaly detection in training data:
- Use outlier detection algorithms to identify potentially poisoned examples
- Cluster training data and flag examples that are distant from their cluster centers
- Monitor training loss per example: examples with unusually low loss may be memorized, and examples with unusually high loss may be poisoned
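Per-example loss monitoring can use a robust outlier score so that a few extreme values do not distort the baseline. This sketch uses the modified z-score based on the median absolute deviation, with the conventional 3.5 cutoff; the function name is illustrative, and the choice of this particular detector is an assumption rather than something the checks above prescribe.

```python
import statistics
from typing import List, Sequence


def flag_loss_outliers(losses: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Return indices of examples whose loss is unusually far from the median."""
    median = statistics.median(losses)
    mad = statistics.median([abs(loss - median) for loss in losses])
    if mad == 0:
        # More than half the losses are identical; the score is undefined.
        return []
    flagged = []
    for index, loss in enumerate(losses):
        # Modified z-score: 0.6745 rescales the MAD to match a standard deviation
        # for normally distributed data.
        score = 0.6745 * abs(loss - median) / mad
        if score > threshold:
            flagged.append(index)
    return flagged
```

Flagged indices point to examples worth inspecting by hand; a high loss can also mean a hard but legitimate example, so the output is a review queue, not a deletion list.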
Access control for training data:
- Restrict access to training data to authorized personnel
- Use separate storage with encryption for training data
- Audit all access to training data
- For data sourced from external providers, validate data quality before ingestion
Infrastructure Security
Serving Infrastructure
Network security:
- Deploy model serving behind a WAF (Web Application Firewall) configured to detect and block common attack patterns
- Use TLS for all communication between clients and the model serving endpoint
- Restrict model serving endpoints to authorized IP ranges or VPN connections
- Implement DDoS protection to prevent service disruption
Container security:
- Use minimal container images with only required dependencies
- Scan images for known vulnerabilities before deployment
- Run containers with non-root users and minimal privileges
- Use read-only file systems where possible
- Apply security contexts and network policies in Kubernetes
Secrets management:
- Store API keys, database credentials, and encryption keys in a secrets manager (AWS Secrets Manager, HashiCorp Vault, Google Secret Manager)
- Never hardcode secrets in model serving code, configuration files, or container images
- Rotate secrets regularly (quarterly minimum)
- Audit secret access
Model Artifact Security
Model encryption:
- Encrypt model artifacts at rest using AES-256 encryption
- Encrypt model artifacts in transit during deployment
- Use customer-managed encryption keys for clients with strict key management requirements
Model signing:
- Digitally sign model artifacts during the registration process
- Verify signatures before deployment to ensure the model has not been tampered with
- Maintain a registry of trusted model signatures
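The sign-then-verify flow can be illustrated with the standard library's HMAC support. This is a simplified stand-in: HMAC uses a shared secret key, whereas a real digital signature scheme uses an asymmetric key pair so that deployment systems can verify artifacts without being able to sign them. The function names are illustrative.

```python
import hashlib
import hmac


def sign_artifact(artifact: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the model artifact bytes."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def verify_artifact(artifact: bytes, key: bytes, signature: str) -> bool:
    """Check the tag in constant time before allowing deployment."""
    expected = sign_artifact(artifact, key)
    return hmac.compare_digest(expected, signature)
```

A deployment pipeline would call `verify_artifact` on the exact bytes it is about to load and abort the rollout if the check fails.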
Access control for model artifacts:
- Restrict model artifact access to the model registry and deployment pipeline
- Require multi-person approval for production model deployments
- Audit all model artifact access and deployment events
Compliance and Governance
Security Assessment Framework
Conduct a security assessment for every production ML system before deployment.
Assessment categories:
- Data sensitivity: What sensitive data was used in training? What are the regulatory implications?
- Attack surface: What input channels exist? Who has access to model outputs? What side channels are available?
- Threat model: Who might attack the system and why? What would they gain? What resources do they have?
- Impact analysis: What is the worst-case outcome of a successful attack? Financial loss, safety hazard, privacy violation, reputational damage?
- Existing controls: What security measures are already in place? What gaps remain?
Regulatory Compliance
HIPAA (healthcare data):
- Models trained on PHI (Protected Health Information) must implement access controls, audit logging, and encryption
- Differential privacy or equivalent privacy measures may be required
- Business Associate Agreements (BAAs) must cover all systems that process PHI, including model training infrastructure
GDPR (European personal data):
- Right to explanation: Users may have the right to understand how the model's predictions affect them
- Right to erasure (deletion): You may need the ability to remove a person's data from the training set and retrain the model
- Data minimization: Train models on the minimum data necessary for the task
- Purpose limitation: Models trained for one purpose should not be repurposed without consent
SOC 2 (service organization controls):
- Security, availability, processing integrity, confidentiality, and privacy controls must be documented and audited
- Model serving infrastructure must meet SOC 2 requirements for continuous monitoring, access control, and incident response
Incident Response Plan
Prepare an incident response plan specific to ML security incidents.
ML security incident types:
- Adversarial attack detected (unusual input patterns, sudden accuracy drop)
- Data breach (training data or model artifacts exposed)
- Model extraction (suspected model copying via API queries)
- Privacy violation (membership inference or data extraction confirmed)
- Data poisoning (compromised training data discovered)
Response procedures for each incident type:
- Detection: How is the incident detected? (monitoring alerts, user reports, audit findings)
- Containment: How is the attack stopped? (rate limiting, API shutdown, model rollback)
- Assessment: What is the scope and impact of the incident?
- Remediation: What fixes are needed? (model retraining, vulnerability patching, access revocation)
- Communication: Who needs to be notified? (client, regulators, affected individuals)
- Prevention: What changes prevent recurrence?
Security Testing
Penetration Testing for ML Systems
Regular penetration testing should include ML-specific attack vectors.
ML penetration testing scope:
- Adversarial input testing: Attempt to fool the model with crafted inputs
- Model extraction testing: Attempt to extract the model through the API
- Membership inference testing: Attempt to determine training data membership
- Data pipeline testing: Attempt to inject poisoned data into the training pipeline
- API abuse testing: Attempt to exceed rate limits, inject malicious payloads, or access unauthorized endpoints
Automated Security Scanning
Integrate ML security checks into CI/CD pipelines:
- Scan model serving containers for vulnerabilities
- Verify that input validation rules cover all expected attack vectors
- Verify that output perturbation is configured and active
- Check that rate limiting is enabled and properly configured
- Verify that encryption is enabled for model artifacts and data in transit
- Run automated adversarial robustness tests on the model
Your Next Step
Build a threat model for one production ML system your agency operates. List every input channel, every output channel, every person and system that has access, and every piece of sensitive data involved. For each channel and access point, identify the potential attacks (adversarial inputs, extraction, privacy leakage, data poisoning) and rate the risk (likelihood times impact). Pick the three highest-risk items and implement defensive measures this month. ML security is not about achieving perfect protection; it is about systematically identifying and mitigating the most consequential risks before they become incidents. Start with the threat model, and the priorities will become clear.
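The likelihood-times-impact rating can be kept in a very small script while the threat model takes shape. The 1-to-5 scales and the example entries below are assumptions for illustration; use whatever scale your agency already applies to other risks.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Risk:
    name: str
    likelihood: int  # Assumed scale: 1 (rare) to 5 (almost certain).
    impact: int      # Assumed scale: 1 (negligible) to 5 (severe).

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def top_risks(risks: Sequence[Risk], n: int = 3) -> List[Risk]:
    """Return the n highest-scoring risks, highest first."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)[:n]


# Hypothetical entries from a threat-modeling session.
register = [
    Risk("Membership inference via public API", likelihood=3, impact=5),
    Risk("Poisoned third-party training data", likelihood=2, impact=4),
    Risk("Model extraction through bulk queries", likelihood=4, impact=3),
    Risk("Adversarial inputs to image classifier", likelihood=2, impact=2),
]
for risk in top_risks(register):
    print(risk.score, risk.name)
```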