
Model Versioning and Experiment Tracking: Organizing ML Development at Scale

A practical guide to managing ML experiments and model versions using tools like MLflow, Weights & Biases, and DVC. Covers experiment tracking, model registry patterns, and scaling strategies for teams.


Machine learning development is fundamentally experimental. A single project can spawn hundreds of training runs, each testing a different combination of hyperparameters, architectures, or data preprocessing strategies. Without systematic organization, this process becomes unmanageable — teams lose track of which model produced which results, reproducibility suffers, and deploying the right model to production turns into a scavenger hunt. This article examines the landscape of experiment tracking and model versioning systems, focusing on three widely adopted tools: MLflow, Weights & Biases (W&B), and DVC (Data Version Control). It provides practical guidance on setting up tracking infrastructure, designing a model registry, and scaling these practices across a team.

Introduction

ML practitioners face a paradox: the tools that make experimentation fast — Jupyter notebooks, automated training loops, cloud GPUs — are the same tools that make experimentation messy. It's common for a single researcher to have 50+ training runs with names like baseline_v3_tuned, resnet50_try2_final, or oops_again.ipynb. Copying models between folders, tracking metrics in spreadsheets, and manually tagging the "best" checkpoint are band-aid solutions that break at scale.

The field of MLOps has responded with purpose-built experiment tracking and model versioning systems. These tools fall into a few overlapping categories:

  • Experiment tracking: recording metrics, parameters, and artifacts from individual training runs
  • Model versioning: storing, labeling, and retrieving specific model checkpoints
  • Model registry: a curated pipeline from training to staging to production deployment
  • Data versioning: tracking changes to training datasets (often paired with model versioning)

Three tools have emerged as mainstream choices in this space. MLflow is an open-source platform maintained by Databricks, widely used in enterprises. Weights & Biases is a commercial SaaS product with a strong research community following. DVC (Data Version Control) is an open-source tool that brings Git-like versioning to datasets and models, favored by teams already using Git workflows. Each has distinct strengths and trade-offs.

Experiment Tracking Architecture

What Gets Tracked

Before evaluating tools, it's worth establishing what data actually needs tracking. A typical training run generates several categories of information:

Parameters (config) — hyperparameter values, random seeds, data pipeline settings. These are inputs to reproducibility: if you can't re-run with the same parameters, you can't verify results.

Metrics — quantitative measures logged during or after training. Accuracy, loss, F1 score, latency, throughput. Metrics are only useful when paired with their associated run context (parameters + version + timing).

Artifacts — the outputs of training: model checkpoints, tokenizers, preprocessors, visualizations. These are large binary blobs that can't live in a Git repository but need versioned storage.

Metadata — timestamps, run duration, hardware used, git commit hash, dataset version. This contextual glue connects runs to the broader development history.

Storage Backends

Experiment data can be stored locally or sent to a remote backend:

  • Local filesystem: works for solo practitioners; no server setup required; hard to share across a team
  • SQL database (MLflow): PostgreSQL, MySQL, SQLite; self-hosted; enterprise-friendly
  • Cloud object storage (S3/GCS/Azure Blob): scalable, durable; often paired with a tracking server
  • SaaS (W&B): managed infrastructure; fast onboarding; ongoing cost and data residency considerations

The choice of backend affects everything downstream: query performance, access control, auditability, and cost. Teams should evaluate this before picking a tool.

MLflow

MLflow is an open-source ML lifecycle platform with four main components. For experiment tracking and model versioning, two are most relevant: the MLflow Tracking API and the MLflow Model Registry.

MLflow Tracking

MLflow Tracking uses a Python API to log parameters, metrics, and artifacts:

import mlflow

mlflow.set_experiment("image-classification")

with mlflow.start_run(run_name="resnet50-baseline"):
    mlflow.log_param("epochs", 100)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("learning_rate", 0.001)

    # training loop...
    for epoch in range(100):
        train_loss = train_one_epoch(model, train_loader)
        val_acc = evaluate(model, val_loader)
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    mlflow.log_artifact("checkpoints/best_model.pt")

MLflow automatically organizes runs by experiment name and stores everything under a local mlruns/ directory by default. Switching to a remote tracking server is a single environment variable:

export MLFLOW_TRACKING_URI="http://mlflow-server:5000"

The MLflow Tracking server is a Flask application that can be deployed on a single node or behind a load balancer. It stores run data in a backend SQLite/PostgreSQL database and artifacts in the configured object storage. The UI is served as a React front-end.
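For concreteness, a remote server backed by PostgreSQL and S3 might be launched like this (host names, credentials, and the bucket are placeholders):

mlflow server \
    --backend-store-uri postgresql://mlflow:password@db-host:5432/mlflow \
    --default-artifact-root s3://ml-artifacts/mlflow \
    --host 0.0.0.0 --port 5000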

Strengths:

  • Open-source with no per-seat pricing
  • Self-hosted, giving full data control
  • Active ecosystem with integrations (Spark, Hugging Face, LangChain)
  • Model Registry integrates directly with Tracking

Weaknesses:

  • The UI is functional but dated compared to commercial alternatives
  • Scaling the tracking server requires manual ops effort
  • No built-in collaboration features (run sharing, comments, reports) out of the box
  • Auto-logging for complex frameworks requires additional configuration

MLflow Model Registry

The Model Registry extends MLflow Tracking with a lifecycle layer. Models aren't just stored as artifacts — they go through stages: Staging, Production, Archived. This mirrors real deployment workflows.

# Register a model from a completed run
model_uri = "runs:/<run_id>/model"
model_name = "recommender-v1"

registered_model = mlflow.register_model(model_uri, model_name)

# Transition to staging
client = mlflow.MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=registered_model.version,
    stage="Staging"
)

# After validation, promote to production
client.transition_model_version_stage(
    name=model_name,
    version=registered_model.version,
    stage="Production"
)

Each model version stores metadata including the source run, dataset version, training parameters, and deployment stage. The registry provides API endpoints for querying models by stage or version, which can be wired directly into a deployment pipeline (Kubernetes, SageMaker, etc.).

Practical notes for teams:

  • Use git commit hashes as run tags to link experiments to code versions (see the sketch after this list)
  • Set up webhooks on stage transitions to trigger validation pipelines
  • Avoid registering every checkpoint — register only candidate releases
  • The registry works best when deployment is also automated; manual promotion negates most benefits
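A minimal sketch of the first note above, tagging each run with the commit that produced it (this assumes the training script runs inside a git checkout):

import subprocess

import mlflow

with mlflow.start_run(run_name="resnet50-baseline"):
    # Record the exact code version that produced this run
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.set_tag("git_commit", commit)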

Weights & Biases

Weights & Biases (W&B) is a SaaS platform focused on research and experimentation. Its core product is experiment tracking, with strong visualization and collaboration features.

Quick Setup

import wandb

wandb.init(
    project="image-classification",
    name="resnet50-baseline",
    config={
        "epochs": 100,
        "batch_size": 32,
        "learning_rate": 0.001
    }
)

for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader)
    val_acc = evaluate(model, val_loader)
    wandb.log({
        "train_loss": train_loss,
        "val_accuracy": val_acc
    })

Unlike MLflow, W&B requires authentication and sends data to W&B's servers by default. Self-hosting is available via W&B Server (formerly "W&B Local") for enterprise customers. The Python API is lightweight — wandb.log() handles metric streaming, and the platform automatically generates visualization dashboards for runs.

Key Features

W&B separates concerns into Runs, Projects, and Sweeps:

  • Runs are individual training executions
  • Projects group related runs (e.g., by team or problem domain)
  • Sweeps automate hyperparameter search across multiple runs

Sweeps are particularly useful. A sweep config defines a parameter space and a search strategy (grid, random, Bayesian), and W&B launches parallel runs to explore it:

sweep_config = {
    "method": "bayes",
    "metric": {"goal": "maximize", "name": "val_accuracy"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2, "distribution": "log_uniform"},
        "batch_size": {"values": [16, 32, 64, 128]},
        "optimizer": {"values": ["adam", "sgd"]}
    }
}
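Launching the sweep takes two calls: register the config, then start an agent that pulls trials from it (the train function and trial count here are placeholders):

sweep_id = wandb.sweep(sweep_config, project="image-classification")
wandb.agent(sweep_id, function=train, count=20)  # run 20 trials in this process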

W&B also provides Artifacts for model versioning, similar to the MLflow Model Registry:

wandb.log_artifact("checkpoints/best_model.pt", name="recommender-v1", type="model")

Artifacts are versioned, downloadable, and can be linked to specific runs. The W&B UI surfaces run comparisons, parallel coordinate plots, and statistical summaries without additional configuration.
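Consuming an artifact from another run is symmetric; a short sketch using the artifact name from the example above:

run = wandb.init(project="image-classification")
artifact = run.use_artifact("recommender-v1:latest", type="model")
model_dir = artifact.download()  # local path containing the checkpoint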

Strengths:

  • Best-in-class visualization and collaboration UI
  • Sweeps automate hyperparameter search natively
  • Minimal setup friction for individual researchers
  • Public dashboard sharing is straightforward

Weaknesses:

  • SaaS model means data leaves your infrastructure (consider W&B Server for compliance requirements)
  • Per-seat pricing can become significant at team scale
  • Artifact management is less structured than MLflow's model registry for production deployment scenarios
  • Customization of the tracking server is limited

Comparing W&B Artifacts to MLflow Model Registry

W&B Artifacts and MLflow Model Registry solve overlapping problems but with different mental models. W&B treats artifacts as outputs of runs, linked to but not dependent on the tracking system. MLflow Model Registry is more opinionated about stages and transitions. For small teams prioritizing experiment exploration and visualization, W&B's approach is often more ergonomic. For teams needing explicit production promotion workflows, MLflow's stage model maps more directly to deployment pipelines.

DVC — Data Version Control

DVC (Data Version Control) takes a fundamentally different approach. Rather than building a separate experiment tracking database, it extends Git to handle large files and ML pipelines natively. If your team already lives in Git, DVC integrates without introducing a new system.

Data and Model Versioning

DVC uses content-addressable storage to version large files. Instead of storing binary blobs in Git (which destroys performance), DVC stores cryptographic hashes and manages the actual files in a separate cache:

# Track a dataset directory
dvc init
dvc add data/training-images/

# This creates data/training-images.dvc (a small text file) and a .gitignore
# entry for the raw data, both of which Git can manage
git add data/training-images.dvc data/.gitignore
git commit -m "Add training images v2"

The .dvc file is a compact manifest: it contains the hash of the data contents, not the data itself. Anyone cloning the repo can retrieve the actual files with dvc pull, which fetches them from the remote cache and checks them out. This means:

  • Git history shows exactly which dataset version was used in each commit
  • Switching branches switches both code and data simultaneously
  • The cache is shareable via remote storage (S3, GCS, SSH)
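Configuring that shared remote is a one-time step (the bucket name is a placeholder):

dvc remote add -d storage s3://ml-data/dvc-cache   # -d marks it as the default remote
dvc push   # upload cached data/models to the remote
dvc pull   # download them on another machine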

DVC also versions model checkpoints:

dvc add models/recommender-v3.pt
git add models/recommender-v3.pt.dvc
git commit -m "Save recommender v3 checkpoint"

Now a pull request that modifies the model also commits the model version in a single atomic operation. Bisecting a bug becomes a matter of git bisect + dvc checkout.
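The corresponding retrieval workflow on a fresh machine, assuming a remote is configured (the repository URL is a placeholder):

git clone git@github.com:org/ml-repo.git && cd ml-repo
dvc pull          # fetch the data/model versions referenced by HEAD
git checkout <older-commit>
dvc checkout      # sync data/models to match that commit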

ML Pipelines

DVC's pipeline feature (dvc.yaml) is a directed acyclic graph (DAG) definition for multi-step ML workflows:

# dvc.yaml
stages:
    preprocess:
        cmd: python scripts/preprocess.py --input data/raw/ --output data/processed/
        deps:
            - data/raw/
            - scripts/preprocess.py
        outs:
            - data/processed/

    train:
        cmd: python scripts/train.py --data data/processed/ --output models/checkpoint.pt
        deps:
            - data/processed/
            - scripts/train.py
        outs:
            - models/checkpoint.pt
        params:
            - learning_rate
            - epochs

Running dvc repro automatically determines which stages need re-running based on dependency changes. Modified preprocessing code? DVC re-runs preprocessing, then training. This is Makefile-like dependency management purpose-built for ML.
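The params entries in the train stage refer to keys in a params.yaml file, which DVC also treats as a dependency; a minimal example:

# params.yaml
learning_rate: 0.001
epochs: 100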

DVC does not natively provide the real-time metric logging that MLflow and W&B offer. It's designed to track the outputs and structure of experiments, not the moment-to-moment progression of training. In practice, many teams use DVC alongside MLflow: DVC for pipeline and artifact versioning, MLflow for live metric tracking.

Strengths:

  • Deep Git integration — no new server infrastructure
  • Pipelines bring dependency management to ML
  • Content-addressable storage handles large files efficiently
  • Teams with existing Git workflows adopt it with minimal friction

Weaknesses:

  • No real-time metric logging out of the box
  • Pipeline definition format has a learning curve
  • Collaboration requires shared remote storage configuration
  • The experiment tracking features are more limited than dedicated tools

Comparing ML Tracking Tools

The three tools serve overlapping but distinct needs. The table below summarizes the key comparison points.

| Feature | MLflow | Weights & Biases | DVC |
| --- | --- | --- | --- |
| License | Open-source (Apache 2.0) | Commercial SaaS / Enterprise | Open-source (Apache 2.0) |
| Deployment | Self-hosted or local | SaaS or self-hosted (enterprise) | Self-hosted only |
| Metric logging | Yes (with Tracking server) | Yes (native, low friction) | No (file-based output only) |
| Artifact storage | Yes (Model Registry) | Yes (Artifacts) | Yes (DVC cache + remote) |
| Hyperparameter search | No (third-party integration) | Yes (Sweeps) | No |
| Pipeline management | No | No | Yes (dvc.yaml DAGs) |
| Git integration | No (separate server) | No | Yes (native) |
| Collaboration features | Basic | Advanced (dashboards, reports) | Basic (remote storage) |
| Setup complexity | Medium | Low | Low |
| UI quality | Functional | Best-in-class | Functional |
| Per-seat cost | Free (self-hosted) | Paid (SaaS tiers) | Free |
| Best suited for | Enterprise teams, production MLOps | Research teams, rapid experimentation | Teams with Git-first workflows |

A few practical notes on this comparison:

  • MLflow and DVC are not mutually exclusive. MLflow handles live experiment tracking while DVC manages pipeline reproducibility and artifact versioning. Many production ML stacks use both.
  • W&B vs. MLflow is often a trade-off between friction and control. W&B's onboarding is faster and the UI is more polished. MLflow's self-hosted model gives teams full data ownership and extensibility.
  • DVC's Git-first model is its defining advantage. For teams already using Git for code, DVC makes data and model versioning feel like a natural extension rather than a separate system.

Model Registry Patterns

A model registry is more than a storage bucket — it's a pipeline with governance. The core pattern involves three stages: tracking, staging, and production.

Registration Triggers

Models should be registered automatically when training objectives are met. Manual registration is error-prone and doesn't scale.

client = mlflow.MlflowClient()

def register_candidate(run_id, model_name, metrics):
    val_acc = metrics["val_accuracy"]
    if val_acc < 0.90:
        return  # Skip below threshold

    model_uri = f"runs:/{run_id}/model"
    result = mlflow.register_model(model_uri, model_name)

    # Auto-assign to staging for review
    client.transition_model_version_stage(
        name=model_name,
        version=result.version,
        stage="Staging"
    )

The same pattern applies in W&B using Artifacts and the API.
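A rough W&B equivalent, assuming the public Api with artifact aliases standing in for stages (the entity and project names are placeholders):

import wandb

api = wandb.Api()
artifact = api.artifact("my-team/image-classification/recommender-v1:latest")
artifact.aliases.append("staging")  # aliases play the role of registry stages
artifact.save()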

Staging Gate

Staging is where validation happens. What this validation looks like depends on the use case:

  • Regression testing — verify the model doesn't degrade on known test cases
  • Shadow deployment — route a small percentage of production traffic to the candidate and compare outputs
  • Smoke tests — basic latency and throughput checks under load
  • Bias auditing — check for demographic disparities in predictions

Automated gating reduces the risk of a bad model reaching production without adding manual bottlenecks.
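A hedged sketch of such a gate using the MLflow APIs shown earlier (the 0.90 threshold, the evaluate helper, and regression_test_set are assumptions):

import mlflow

def validate_staging(model_name: str, version: int) -> bool:
    # Load the staged candidate via the models:/ URI scheme
    model = mlflow.pyfunc.load_model(f"models:/{model_name}/{version}")
    score = evaluate(model, regression_test_set)  # hypothetical helper
    return score >= 0.90

A True result triggers the promotion call shown in the next section; a False result leaves the candidate in Staging for manual review.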

Production Promotion

Promoting a model to production should be a single API call or automated pipeline step:

# Promote after staging validation passes
client.transition_model_version_stage(
    name="recommender-v1",
    version=3,
    stage="Production"
)

# Archive the previous production version
client.transition_model_version_stage(
    name="recommender-v1",
    version=2,
    stage="Archived"
)

The registry API can then serve model metadata to deployment systems:

# Fetch the current production model
production_model = client.get_latest_versions("recommender-v1", stages=["Production"])[0]
model_uri = production_model.source

This design keeps the production model identifiable regardless of how many candidates were tested. It also supports rollback: if a deployed model causes issues, the previous version is one API call away from returning to production.

Naming Conventions

Consistent naming conventions are essential for a registry to stay navigable. A practical pattern:

<task>-<architecture>-<version>

Examples: image-resnet50-v3, nlp-bert-base-v1, recommend-wdl-v2. Including the architecture or task family makes the registry searchable without opening individual runs. Versions should increment monotonically per model line, not per experiment.

Scaling Across Teams

Experiment tracking at scale introduces organizational challenges beyond the tools themselves.

Centralized vs. Federated Tracking Servers

For small teams, a single MLflow tracking server or W&B team account works well. At larger scale, consider a federated model:

  • Per-team experiments: Each team gets its own MLflow experiment or W&B project. This limits naming collisions and access control complexity.
  • Shared artifact store: Model checkpoints and datasets live in a single object storage bucket with consistent naming conventions, accessible across teams.
  • Cross-team reporting: Aggregate metrics via API to surface organization-wide training statistics.
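As an example of the reporting bullet, an MLflow job might pull runs across several experiments into one DataFrame (the experiment names are placeholders):

import mlflow

runs = mlflow.search_runs(
    experiment_names=["team-a-image-classification", "team-b-recommenders"]
)
# e.g., best validation accuracy per experiment
best = runs.groupby("experiment_id")["metrics.val_accuracy"].max()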

Access Control

Production model access should be gated differently from experiment access:

  • All runs readable: Any team member should be able to view experiment results for reproducibility
  • Production write-restricted: Only automated pipelines or designated admins can transition models to Production
  • Audit logging: Track who requested which transition and when, stored outside the registry for compliance

MLflow's Tracking server provides only basic authentication and permissions out of the box, so many deployments put it behind a reverse proxy or SSO gateway. W&B's enterprise tier adds fine-grained per-project permissions. DVC inherits access control from its remote storage (bucket policies, SSH permissions).

Avoiding Registry Bloat

Without curation, a model registry becomes a graveyard of abandoned experiments. Practical hygiene:

  • Register only candidates, not every checkpoint
  • Archive stale versions older than N months
  • Tag releases: Mark specific versions with release labels (v2.1.0-rc3) for traceability
  • Set retention policies: Auto-archive or delete runs older than a threshold

Most teams find that experiment data has a useful life of 3–6 months; after that, the specific run is less interesting than the pattern of results it represents. Keeping detailed notes in a run's tags and notes fields is more valuable than keeping every artifact indefinitely.
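A hedged sketch of such a retention pass with the MLflow client (the six-month threshold and experiment name are placeholders; delete_run soft-deletes, with permanent removal handled separately by mlflow gc):

import time

import mlflow

client = mlflow.MlflowClient()
cutoff = time.time() - 180 * 24 * 3600  # roughly six months ago

runs = mlflow.search_runs(experiment_names=["image-classification"])
for _, run in runs.iterrows():
    if run["start_time"].timestamp() < cutoff:
        client.delete_run(run["run_id"])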

Conclusion

Experiment tracking and model versioning are foundational practices for any team doing ML development at scale. The specific tool matters less than the discipline of recording what was tried, what was produced, and how results changed over time. MLflow provides an open-source, self-hosted foundation with a production-oriented Model Registry. Weights & Biases offers the smoothest researcher experience with excellent visualization and native hyperparameter search. DVC brings Git-native versioning to datasets and models, ideal for teams already using Git workflows.

In practice, many mature ML stacks combine these tools: DVC for pipeline and artifact versioning, MLflow or W&B for real-time metric tracking, and a Model Registry for the staging-to-production pipeline. The combination costs more to set up but pays dividends in reproducibility, collaboration, and deployment confidence.

Start simple. Instrument one training script with parameter and metric logging. Run it a few times. Then evaluate whether you need more structure. Most teams don't need a fully mature MLOps platform on day one — they need to stop losing track of what they tried and what worked.