How I Debug ML Systems

Philosophy

ML bugs are different from traditional software bugs. The code runs fine, but the model fails silently. I approach ML debugging systematically, starting with data.

Principles

1. Start with the data

Most ML bugs are data bugs. Check data quality before touching the model.

Example: Audit training data: missing values, outliers, label correctness
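A minimal sketch of that audit, using pandas on a hypothetical table (the column names, value ranges, and label set are illustrative, not from any real pipeline):

```python
# Hypothetical data audit: missing values, outliers, label correctness.
# Columns and thresholds are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 980],            # 980 is an implausible age
    "income": [50_000, 62_000, 58_000, np.nan, 61_000],
    "label": [0, 1, 1, 0, 2],                    # 2 is outside the expected {0, 1}
})

missing = df.isna().sum()                        # per-column missing-value counts
outliers = df[df["age"] > 120]                   # domain range check on age
bad_labels = df[~df["label"].isin([0, 1])]       # labels outside the known set

print(missing)
print(outliers)
print(bad_labels)
```

Range checks grounded in domain knowledge (an age above 120) are often more reliable on small samples than statistical outlier tests, where one extreme value inflates the standard deviation.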
2. Overfit on purpose

If the model cannot overfit a tiny dataset, something is fundamentally wrong.

Example: Train on 100 samples with high capacity model, expect near-perfect fit
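One way to run that sanity check, sketched with scikit-learn on synthetic data (the data, model choice, and threshold are illustrative assumptions):

```python
# Overfit-on-purpose sanity check: a high-capacity model should fit
# ~100 training samples almost perfectly. If it can't, suspect the
# pipeline (labels, features, loss), not the hyperparameters.
# Data is synthetic for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))               # 100 samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # learnable synthetic labels

model = DecisionTreeClassifier()             # no depth limit = high capacity
model.fit(X, y)
train_acc = model.score(X, y)                # accuracy on the training set itself
print(f"train accuracy: {train_acc:.3f}")
```

If train accuracy stays far below 1.0 here, the problem is upstream of tuning: shuffled labels, a feature/label misalignment, or a broken loss.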
3. Check intermediate outputs

Log feature distributions, predictions, and losses at each step.

Example: Print feature means, log training loss every epoch, validate predictions
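A minimal sketch of that instrumentation, using a hand-rolled logistic-regression loop on synthetic data (the model, learning rate, and data are illustrative assumptions):

```python
# Instrumenting a training loop: log feature statistics once up front,
# then the loss every epoch, so silent failures show up as numbers.
# Model and data are illustrative.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

print("feature means:", X.mean(axis=0))      # catch scaling/drift issues early
print("feature stds: ", X.std(axis=0))

w = np.zeros(3)
losses = []
for epoch in range(50):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # sigmoid predictions
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)
    print(f"epoch {epoch:2d}  loss {loss:.4f}")
    w -= 0.5 * (X.T @ (p - y)) / len(y)      # gradient step
```

A flat loss curve, NaNs, or feature means far from what the preprocessing promises are each a concrete lead, which is the point of logging them.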

Anti-Patterns I Avoid

Jumping to hyperparameter tuning

Tuning a broken pipeline optimizes nothing useful.

Instead: Get baseline working first, then tune

Ignoring warnings

scikit-learn and TensorFlow warnings often indicate real problems.

Instead: Treat warnings as errors, investigate each one
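In Python this can be enforced mechanically with the standard library's `warnings` module, which escalates warnings to exceptions so they cannot scroll past unnoticed (the warning message below is a made-up example):

```python
# Treat warnings as errors: escalate every warning to an exception.
import warnings

warnings.simplefilter("error")  # any warning now raises instead of printing

try:
    warnings.warn("feature X is deprecated", DeprecationWarning)  # hypothetical warning
except DeprecationWarning as e:
    caught = str(e)
print("caught warning as error:", caught)
```

`warnings.filterwarnings("error", category=...)` narrows the same idea to one warning class, which is useful when a noisy dependency makes a blanket filter impractical.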

Resources That Shaped This

  • How to debug machine learning models
  • A Recipe for Training Neural Networks (Karpathy)
  • Common ML debugging patterns