How I Debug ML Systems

Philosophy

ML bugs are different from traditional software bugs. The code runs fine, but the model fails silently. I approach ML debugging systematically, starting with data.

Principles

1. Start with the data

Most ML bugs are data bugs. Check data quality before touching the model.

Example: Audit training data: missing values, outliers, label correctness
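A minimal sketch of that audit, using pandas on a hypothetical table (the column names, value ranges, and label set are illustrative, not from any real pipeline):

```python
# Hypothetical data audit: missing values, outliers, label correctness.
# Columns and thresholds are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 980],            # 980 is an implausible age
    "income": [50_000, 62_000, 58_000, np.nan, 61_000],
    "label": [0, 1, 1, 0, 2],                    # 2 is outside the expected {0, 1}
})

missing = df.isna().sum()                        # per-column missing-value counts
outliers = df[df["age"] > 120]                   # domain range check on age
bad_labels = df[~df["label"].isin([0, 1])]       # labels outside the known set

print(missing)
print(outliers)
print(bad_labels)
```

Range checks grounded in domain knowledge (an age above 120) are often more reliable on small samples than statistical outlier tests, where one extreme value inflates the standard deviation.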
2. Overfit on purpose

If the model cannot overfit a tiny dataset, something is fundamentally wrong.

Example: Train on 100 samples with high capacity model, expect near-perfect fit
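One way to run that sanity check, sketched with scikit-learn on synthetic data (the data, model choice, and threshold are illustrative assumptions):

```python
# Overfit-on-purpose sanity check: a high-capacity model should fit
# ~100 training samples almost perfectly. If it can't, suspect the
# pipeline (labels, features, loss), not the hyperparameters.
# Data is synthetic for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))               # 100 samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # learnable synthetic labels

model = DecisionTreeClassifier()             # no depth limit = high capacity
model.fit(X, y)
train_acc = model.score(X, y)                # accuracy on the training set itself
print(f"train accuracy: {train_acc:.3f}")
```

If train accuracy stays far below 1.0 here, the problem is upstream of tuning: shuffled labels, a feature/label misalignment, or a broken loss.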
3. Check intermediate outputs

Log feature distributions, predictions, and losses at each step.

Example: Print feature means, log training loss every epoch, validate predictions
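A minimal sketch of that instrumentation, using a hand-rolled logistic-regression loop on synthetic data (the model, learning rate, and data are illustrative assumptions):

```python
# Instrumenting a training loop: log feature statistics once up front,
# then the loss every epoch, so silent failures show up as numbers.
# Model and data are illustrative.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

print("feature means:", X.mean(axis=0))      # catch scaling/drift issues early
print("feature stds: ", X.std(axis=0))

w = np.zeros(3)
losses = []
for epoch in range(50):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # sigmoid predictions
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)
    print(f"epoch {epoch:2d}  loss {loss:.4f}")
    w -= 0.5 * (X.T @ (p - y)) / len(y)      # gradient step
```

A flat loss curve, NaNs, or feature means far from what the preprocessing promises are each a concrete lead, which is the point of logging them.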

Anti-Patterns I Avoid

Jumping to hyperparameter tuning

Tuning a broken pipeline optimizes nothing useful.

Instead: Get baseline working first, then tune

Ignoring warnings

scikit-learn and TensorFlow warnings often indicate real problems.

Instead: Treat warnings as errors, investigate each one
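In Python this can be enforced mechanically with the standard library's `warnings` module, which escalates warnings to exceptions so they cannot scroll past unnoticed (the warning message below is a made-up example):

```python
# Treat warnings as errors: escalate every warning to an exception.
import warnings

warnings.simplefilter("error")  # any warning now raises instead of printing

try:
    warnings.warn("feature X is deprecated", DeprecationWarning)  # hypothetical warning
except DeprecationWarning as e:
    caught = str(e)
print("caught warning as error:", caught)
```

`warnings.filterwarnings("error", category=...)` narrows the same idea to one warning class, which is useful when a noisy dependency makes a blanket filter impractical.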

Resources That Shaped This

  • How to debug machine learning models
  • A Recipe for Training Neural Networks (Karpathy)
  • Common ML debugging patterns