Experiment Tracking and Reproducibility Best Practices for AI Agencies
A two-person AI agency in Portland had a conversation that every agency has had at least once. The client called: "That model you deployed six months ago, we need to retrain it. Can you reproduce exactly what you built?" The data scientist who built it had run 47 experiments over three weeks. The final model was "the one from that Thursday afternoon run where I tweaked the learning rate and added that new feature." No experiment was logged. No parameters were recorded. No data version was tagged. Reproducing the model meant re-running all 47 experiments from memory, hoping to recreate the same conditions.
It took two weeks to reproduce what should have been a one-click retraining. The agency billed for the time, but the client was unhappy about paying twice for the same work. The relationship suffered.
After that project, the agency implemented MLflow for experiment tracking. Every experiment logged its parameters, metrics, data version, code commit, and artifacts automatically. The next retraining request? One command. Pull the experiment ID, rerun the pipeline with the logged parameters on the logged data version. Done in an afternoon.
Experiment tracking is the foundation of professional ML practice. It is the difference between "data science" as ad-hoc exploration and "machine learning engineering" as a repeatable, auditable discipline. For agencies, it is also the difference between profitable projects and money-losing rework.
Why Experiment Tracking Matters for Agency Work
Reproducibility saves time and money. Models need to be retrained, debugged, and extended. Without experiment logs, every modification starts with "what did we do last time?", a question that burns hours or days to answer.
Client confidence requires transparency. When a client asks "why did you choose this model?" you need to show the comparison: here are the 30 experiments we ran, here is how each performed, and here is why this configuration won. Experiment logs make this presentation trivial.
Team collaboration depends on shared context. When one data scientist picks up where another left off (common in agencies where team members rotate between projects), experiment logs provide the complete history. Without them, the new person starts from scratch.
Regulatory and audit requirements demand documentation. In regulated industries, you must document the model development process, including all alternatives considered and the rationale for final selection. Experiment tracking provides this documentation automatically.
Preventing duplicate work. In a 6-week model development project, data scientists frequently re-run experiments they already tried. Experiment logs prevent this waste: you can check whether a configuration has already been tested before running it again.
The Experiment Tracking Stack
MLflow
One of the most widely used open-source experiment tracking tools. MLflow provides four components:
- Tracking: Log parameters, metrics, and artifacts for each experiment run
- Projects: Package ML code for reproducible runs
- Models: Package model artifacts in a standard format that downstream tools can load
- Model Registry: Register and version models, and manage their lifecycle (staging, production, archived)
Strengths:
- Free and open source
- Language and framework agnostic
- Simple API: logging an experiment takes 3-4 lines of code
- Integrates with all major ML frameworks (scikit-learn, PyTorch, TensorFlow, XGBoost)
- Self-hosted or managed (Databricks)
Limitations:
- UI is functional but not beautiful
- Limited built-in collaboration features
- No native data versioning (pair with DVC)
Best for: Most agency work. MLflow covers the core tracking need without lock-in or cost.
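As a rough illustration of how small the logging footprint can be, here is a minimal sketch. It assumes MLflow is installed (pip install mlflow) and uses its default local tracking store. The helper names flatten_params and log_run, and the experiment and run names in the usage note, are hypothetical and exist only for this example.

```python
from typing import Any


def flatten_params(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten a nested config into dotted keys, e.g. {"model": {"depth": 6}} -> {"model.depth": 6}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


def log_run(experiment: str, run_name: str,
            config: dict[str, Any], metrics: dict[str, float]) -> str:
    """Log one run to MLflow and return its run ID."""
    # Imported lazily so flatten_params stays usable without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment)
    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(flatten_params(config))  # hyperparameters, seed, etc.
        mlflow.log_metrics(metrics)                # numeric evaluation results
        return run.info.run_id
```

A call such as log_run("churn-model", "baseline_logreg", {"model": {"C": 1.0}, "seed": 42}, {"val_auc": 0.81}) would record one run; the values shown are placeholders, not results from any real project.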
Weights & Biases (W&B)
A managed experiment tracking platform with strong visualization and collaboration features.
Strengths:
- Excellent visualizations: interactive charts, comparison dashboards, and reports
- Strong collaboration โ shared workspaces, annotations, and team dashboards
- Automatic hyperparameter importance analysis
- Built-in sweep (hyperparameter search) functionality
- System metrics logging (GPU utilization, memory usage)
Limitations:
- Managed service with per-user pricing
- Data goes through their servers (privacy consideration)
- Can be overkill for simple projects
Best for: Agencies that want polished visualizations for client presentations and teams that value collaboration features.
Neptune.ai
A metadata store for MLOps that focuses on flexibility and scale.
Strengths:
- Very flexible metadata schema: log anything
- Strong comparison tools
- Good for large-scale experiments (thousands of runs)
- Integrates with most ML frameworks
Best for: Agencies running large-scale hyperparameter searches or managing many concurrent experiments.
Comet ML
Similar to W&B with a focus on enterprise features.
Strengths:
- Code diff tracking (automatically captures code changes between experiments)
- Strong model registry
- Good for regulated industries (audit features)
- Self-hosted option available
Best for: Agencies working in regulated industries where data cannot leave the client's infrastructure.
Recommendation for Agencies
Start with MLflow. It is free, universal, and sufficient for 90% of agency work. If you need better visualizations for client-facing reports, add W&B. If you work in regulated industries, consider Comet ML's self-hosted option.
Pick one tool and standardize. Do not let each data scientist use their preferred tool. Standardization enables team collaboration and knowledge transfer between projects.
What to Track in Every Experiment
Mandatory (Track This Always)
Parameters:
- All model hyperparameters (learning rate, depth, regularization, etc.)
- Feature set used (list of feature names or a feature set version ID)
- Data preprocessing parameters (normalization method, missing value handling, encoding strategy)
- Training configuration (number of epochs, batch size, optimizer, early stopping criteria)
- Random seed
Metrics:
- Primary evaluation metric on validation and test sets
- Secondary metrics (precision, recall, F1, AUC, or whatever is relevant)
- Training metrics per epoch (loss, accuracy)
- Training time
- Model size (parameters, file size)
Artifacts:
- Trained model file
- Training and validation data version (hash or version ID)
- Feature importance plot
- Confusion matrix (for classification)
- Residual plots (for regression)
Context:
- Git commit hash of the code used
- Environment specification (Python version, library versions)
- Timestamp
- Who ran the experiment
- A human-readable description of the experiment purpose
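The context items above can mostly be collected without any tracking tool. The following is a sketch using only the Python standard library; the function names, the dictionary keys, and the choice of SHA-256 as the data version identifier are assumptions made for illustration. The resulting dictionary could then be attached to a run as tags in whichever tracker you use.

```python
import getpass
import hashlib
import platform
import subprocess
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def git_commit() -> str:
    """Return the current commit hash, or "unknown" outside a git repository."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def file_sha256(path: Path) -> str:
    """Hash a data file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def library_versions(packages: list[str]) -> dict[str, str]:
    """Look up installed versions for the packages the project depends on."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def current_user() -> str:
    try:
        return getpass.getuser()
    except Exception:  # some containers have no resolvable user
        return "unknown"


def capture_context(data_path: Path, description: str,
                    packages: list[str]) -> dict[str, object]:
    """Collect the context fields for one experiment run."""
    return {
        "git_commit": git_commit(),
        "python_version": platform.python_version(),
        "library_versions": library_versions(packages),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "user": current_user(),
        "data_sha256": file_sha256(data_path),
        "description": description,
    }
```

Hashing the file is only one way to pin a data version; a DVC or warehouse snapshot ID would serve the same purpose if the project already has one.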
Recommended (Track When Possible)
- Data statistics: Summary statistics of the training data (row count, class distribution, feature distributions)
- Compute resources: GPU/CPU used, memory consumption, cost
- Cross-validation fold results: Per-fold metrics, not just averages
- Prediction distribution: Histogram of model predictions on the validation set
- Learning curves: Training and validation metrics vs. training set size
- Feature correlations: Correlation matrix of top features
For Client-Facing Projects (Always)
- Experiment narrative: A brief explanation of what you tried and why
- Comparison against baseline: How does this experiment compare to the simplest reasonable approach?
- Comparison against previous best: How does this compare to the best result so far?
- Decision rationale: If this becomes the selected model, why was it chosen over alternatives?
Implementing Experiment Tracking in Your Workflow
The Standard Agency Workflow
Step 1: Project setup (Day 1 of model development).
- Create the experiment in your tracking tool
- Define the primary and secondary metrics
- Log the baseline model (simplest approach)
- Tag the data version used
Step 2: Systematic experimentation.
- For each experiment, log all parameters before running
- Set a descriptive name and tags (e.g., "feature_v2_xgboost_deeper_trees")
- Run the experiment
- Log all metrics, artifacts, and notes
- Compare against previous runs in the tracking UI
Step 3: Model selection.
- Query the experiment tracker for the top N runs by primary metric
- Compare the top runs across all metrics, not just the primary
- Select the final model and tag it in the registry
- Document the selection rationale
Step 4: Handoff documentation.
- Export the experiment comparison as a report
- Include the full experiment history in the project deliverable
- Register the final model in the model registry with production tags
- Link the model to its experiment, data version, and code commit
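The model selection step (Step 3) can be sketched in plain Python as below. The run records, metric names, and values are hypothetical sample data, not results from a real project. With MLflow, mlflow.search_runs with an order_by clause plays a similar role, returning the runs as a table.

```python
from typing import Any

# Hypothetical run records, shaped like an export from a tracking tool.
RUNS: list[dict[str, Any]] = [
    {"name": "baseline_logreg", "metrics": {"val_auc": 0.81, "val_f1": 0.62, "train_seconds": 4}},
    {"name": "xgb_depth6", "metrics": {"val_auc": 0.88, "val_f1": 0.70, "train_seconds": 95}},
    {"name": "xgb_depth10", "metrics": {"val_auc": 0.885, "val_f1": 0.66, "train_seconds": 310}},
    {"name": "crashed_run", "metrics": {}},  # failed runs stay in the log but cannot be ranked
]


def top_runs(runs: list[dict[str, Any]], primary: str, n: int = 5,
             higher_is_better: bool = True) -> list[dict[str, Any]]:
    """Return the best n runs by the primary metric, skipping runs that lack it."""
    scored = [run for run in runs if primary in run["metrics"]]
    return sorted(scored, key=lambda run: run["metrics"][primary],
                  reverse=higher_is_better)[:n]


def comparison_table(runs: list[dict[str, Any]], metrics: list[str]) -> str:
    """Render the shortlisted runs across all metrics, not just the primary one."""
    header = f"{'run':<20}" + "".join(f"{m:>15}" for m in metrics)
    rows = [
        f"{run['name']:<20}" + "".join(f"{run['metrics'].get(m, float('nan')):>15}" for m in metrics)
        for run in runs
    ]
    return "\n".join([header, *rows])


if __name__ == "__main__":
    shortlist = top_runs(RUNS, primary="val_auc", n=2)
    print(comparison_table(shortlist, ["val_auc", "val_f1", "train_seconds"]))
```

In this sample, the run with the highest val_auc has a lower val_f1 and a much longer training time than the runner-up, which is the kind of trade-off the side-by-side view is meant to surface before a final model is tagged.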
Automating Experiment Logging
Manual logging is error-prone: data scientists forget to log parameters or skip logging when they are "just testing something." Automate as much as possible.
Auto-logging integrations. Most tracking tools provide auto-logging for popular frameworks:
- MLflow's mlflow.autolog() captures scikit-learn, XGBoost, LightGBM, PyTorch (through PyTorch Lightning), and TensorFlow experiments automatically
- W&B's wandb.init() captures environment details, system metrics, and code state automatically
Pipeline-level logging. If you use an orchestration tool (Airflow, Prefect, Kubeflow), integrate experiment logging into the pipeline. The pipeline logs parameters and metrics as part of its execution, not as a separate step.
Pre-commit hooks. Add a pre-commit hook that checks whether experiment results are logged before allowing model code to be committed. This prevents untracked experiments from making it into the codebase.
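To show the idea of logging that happens without the data scientist remembering to do it, here is a hand-rolled sketch: a decorator that records parameters, metrics, duration, and failures to a JSON Lines file. It is a stand-in for illustration only, not how MLflow or W&B implement auto-logging, and the names tracked, train, and runs.jsonl are hypothetical.

```python
import functools
import json
import time
from pathlib import Path
from typing import Callable


def tracked(log_path: Path) -> Callable:
    """Decorator factory: append one JSON record per call of the wrapped training function."""
    def decorator(train_fn: Callable[..., dict]) -> Callable[..., dict]:
        @functools.wraps(train_fn)
        def wrapper(**params) -> dict:
            record = {"name": train_fn.__name__, "params": params}
            start = time.perf_counter()
            try:
                metrics = train_fn(**params)
                record.update(status="finished", metrics=metrics)
                return metrics
            except Exception as exc:
                # Failed runs are logged too, then the error is re-raised unchanged.
                record.update(status="failed", error=repr(exc))
                raise
            finally:
                record["duration_seconds"] = round(time.perf_counter() - start, 3)
                with log_path.open("a", encoding="utf-8") as handle:
                    handle.write(json.dumps(record, default=str) + "\n")
        return wrapper
    return decorator


# Hypothetical usage: every call is recorded, including "just testing" runs.
# @tracked(Path("runs.jsonl"))
# def train(learning_rate, depth):
#     ...train and evaluate a model...
#     return {"val_auc": ...}
```

Requiring keyword arguments is a deliberate choice here: it guarantees every parameter is stored with its name, at the cost of a slightly stricter calling convention.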
Reproducibility Checklist
For any experiment to be fully reproducible, you need to capture and restore:
- Code version: Git commit hash. Not "the latest code" but the exact commit.
- Data version: Dataset hash or version tag. Not "the customer data" but the exact version.
- Environment: Python version, library versions (requirements.txt or conda environment YAML). Not "whatever is installed" but the exact versions.
- Configuration: All parameters, not just model hyperparameters. Preprocessing parameters, splitting logic, random seeds.
- Hardware specification: For deep learning, results can differ between GPU types. Document the hardware used.
The reproducibility test: Can a new team member, given only the experiment log and the codebase, reproduce the exact model without asking anyone a question? If not, something is missing from the log.
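Part of that test can be automated by checking each logged run against the checklist before it is accepted. The sketch below assumes a run is available as a dictionary; the field names in REQUIRED_FIELDS are hypothetical and would need to match whatever keys your tracking setup actually logs.

```python
# Checklist area -> fields that must be present and non-empty in the run record.
# The field names are illustrative assumptions, not a standard schema.
REQUIRED_FIELDS: dict[str, list[str]] = {
    "code": ["git_commit"],
    "data": ["data_version"],
    "environment": ["python_version", "library_versions"],
    "configuration": ["params", "random_seed"],
    "hardware": ["hardware"],
}


def missing_fields(record: dict) -> dict[str, list[str]]:
    """Return the checklist areas with absent or empty fields; an empty dict means the record passes."""
    gaps: dict[str, list[str]] = {}
    for area, fields in REQUIRED_FIELDS.items():
        # A seed of 0 is a valid value, so only None and empty containers/strings count as missing.
        absent = [f for f in fields if record.get(f) in (None, "", {}, [])]
        if absent:
            gaps[area] = absent
    return gaps


if __name__ == "__main__":
    record = {
        "git_commit": "abc123",
        "data_version": "v3",
        "python_version": "3.11.4",
        "library_versions": {},      # forgot to log the environment
        "params": {"depth": 6},
        "random_seed": 0,
    }
    print(missing_fields(record))
```

A check like this only confirms that the fields exist; whether the logged values are sufficient to rebuild the model still needs the human test described above.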
Common Experiment Tracking Mistakes
Mistake 1: Logging only successful experiments. Failed experiments are as valuable as successful ones: they tell you what does not work and prevent repeated wasted effort. Log every experiment, including the ones that crashed, produced terrible metrics, or were based on bad assumptions.
Mistake 2: Using the tracking tool as a dumping ground. Logging everything without organization creates a different problem: nobody can find anything. Use consistent naming conventions, meaningful tags, and experiment grouping. A naming scheme like "{date}_{feature_version}_{model_type}_{hypothesis}" makes experiments searchable and understandable.
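A naming scheme built from date, feature version, model type, and hypothesis is easier to keep consistent when a small helper generates the name. This is a sketch; the helper names and the choice to use hyphens inside a field (so that underscores remain unambiguous field separators) are assumptions, not part of any tool's convention.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lowercase the text and collapse anything non-alphanumeric into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def run_name(run_date: date, feature_version: str, model_type: str, hypothesis: str) -> str:
    """Join the four fields with underscores so names sort by date and split cleanly."""
    return "_".join([
        run_date.strftime("%Y%m%d"),
        slug(feature_version),
        slug(model_type),
        slug(hypothesis),
    ])


if __name__ == "__main__":
    print(run_name(date(2024, 3, 14), "v2", "XGBoost", "deeper trees"))
    # prints: 20240314_v2_xgboost_deeper-trees
```

Because the date comes first in a fixed-width format, a plain alphabetical sort in the tracking UI also orders the runs chronologically.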
Mistake 3: Not linking experiments to decisions. The experiment log should tell a story: "We tried approaches A, B, and C. Approach B performed best on metrics X and Y while meeting constraint Z. Therefore, we selected B for production." Without this narrative, the log is data without insight.
Mistake 4: Skipping the baseline. Every experiment log should start with a baseline, the simplest reasonable approach. If your gradient-boosted model with 200 features performs only 2% better than a logistic regression with 10 features, the complexity may not be justified. The baseline makes this comparison possible.
Mistake 5: Not tracking compute costs. In the age of GPU-intensive training and LLM API calls, experiment cost matters. Track the compute cost of each experiment so you can evaluate whether marginal accuracy improvements justify the additional training expense.
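One simple way to make that trade-off concrete is to compute the extra cost paid per percentage point gained over a cheaper run. The sketch below assumes the metric is a fraction between 0 and 1 where higher is better, and that cost is approximated as GPU hours times an hourly rate; both functions and all numbers in the usage comment are hypothetical.

```python
from typing import Optional


def run_cost(gpu_hours: float, hourly_rate_usd: float) -> float:
    """Approximate compute cost of one run (ignores storage, networking, and idle time)."""
    return gpu_hours * hourly_rate_usd


def cost_per_point(baseline_metric: float, candidate_metric: float,
                   baseline_cost_usd: float, candidate_cost_usd: float) -> Optional[float]:
    """Extra USD spent per percentage point gained over the baseline.

    Returns None when the candidate does not improve on the baseline,
    because no amount of extra spend is justified in that case.
    """
    gain_points = (candidate_metric - baseline_metric) * 100
    if gain_points <= 0:
        return None
    return (candidate_cost_usd - baseline_cost_usd) / gain_points


# Hypothetical usage: a run that lifts validation AUC from 0.80 to 0.84
# while raising compute cost from 2 USD to 10 USD costs roughly 2 USD per point.
# cost_per_point(0.80, 0.84, 2.0, 10.0)
```

Logging this figure next to each run makes it easier to explain to a client why a slightly less accurate but far cheaper model was chosen, or the reverse.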
Pricing Experiment Tracking
Do not price experiment tracking as a separate line item. Include it as part of your standard model development process. It is table stakes for professional work.
Budget allocation:
- Initial tracking infrastructure setup: 0.5-1 day per project (included in project setup)
- Ongoing logging overhead: 5-10% of model development time (included in development estimates)
- Reporting and documentation: 0.5-1 day at project completion (included in delivery)
The ROI of experiment tracking:
- Saves 1-2 weeks per model retraining request (no more "reproduce what we built")
- Saves 2-3 days per project when a team member rotates off (no more "what did they do?")
- Prevents duplicate experiment runs (estimated 10-15% of experiment time is wasted on duplicates without tracking)
- Enables confident model selection (reduces the risk of deploying a suboptimal model)
Your Next Step
Set up MLflow (or your chosen tracking tool) on your next project before writing any model code. Create the experiment, define your metrics, and log the baseline model. Then, for every experiment you run, log everything: parameters, metrics, artifacts, and a brief note about what you were trying. At the end of the project, generate the experiment comparison report and include it in the client deliverable. The professionalism of that report, showing 30+ systematically evaluated approaches with clear rationale for the final selection, will set you apart from agencies that present a single model with no documented alternatives.