How I Debug ML Systems
Philosophy
ML bugs differ from traditional software bugs: the code runs fine, but the model fails silently. I approach ML debugging systematically, starting with the data.
Principles
1. Start with the data
Most ML bugs are data bugs. Check data quality before touching the model.
Example: Audit training data: missing values, outliers, label correctness
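A minimal audit along these lines might look like the sketch below, using pandas. The DataFrame, the `label` column name, the 3-sigma outlier threshold, and the set of valid labels are all illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd

def audit(df: pd.DataFrame, label_col: str, valid_labels: set) -> dict:
    """Report basic data-quality issues before touching the model."""
    numeric = df.select_dtypes(include=[np.number])
    # Flag rows more than 3 standard deviations from the column mean
    # (a crude but useful first pass at outlier detection).
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "outlier_rows": int((z.abs() > 3).any(axis=1).sum()),
        "bad_labels": int((~df[label_col].isin(valid_labels)).sum()),
    }

# Toy data: one missing value, one extreme outlier, one invalid label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": list(rng.normal(0, 1, 50)) + [1000.0, np.nan],
    "label": ["cat", "dog"] * 25 + ["cat", "bird"],
})
report = audit(df, "label", valid_labels={"cat", "dog"})
```

Running the audit first makes the later debugging steps cheaper: a nonzero `bad_labels` count, for instance, explains a plateaued loss faster than any amount of model inspection.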
2. Overfit on purpose
If the model cannot overfit a tiny dataset, something is fundamentally wrong.
Example: Train on 100 samples with high capacity model, expect near-perfect fit
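One way to sketch this check, using exact linear least-squares as a stand-in for a high-capacity model (100 features for 100 samples, so the model has enough capacity to memorize even random labels):

```python
import numpy as np

# Sanity check: a model with enough capacity should drive training
# error to ~0 on a tiny dataset. If it cannot, the pipeline is broken
# (mismatched labels, bad preprocessing, a bug in the loss, etc.).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 100))  # 100 samples, 100 features: capacity >= data
y = rng.normal(size=100)         # even random labels should be memorizable

w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_mse = float(np.mean((X @ w - y) ** 2))
# Expect near-perfect fit; a large train_mse here signals a pipeline bug.
```

The same test with a neural network means training well past convergence on those 100 samples and expecting the training loss to approach zero; the failure mode it catches is the same.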
3. Check intermediate outputs
Log feature distributions, predictions, and losses at each step.
Example: Print feature means, log training loss every epoch, validate predictions
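A minimal training loop with that instrumentation might look like this. The data, the plain gradient-descent linear model, and the learning rate are all illustrative assumptions; the point is the logging, not the model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)

# Check the inputs before training: surprising means or stds are often
# the first sign of a broken preprocessing step.
print("feature means:", X.mean(axis=0), "stds:", X.std(axis=0))

w = np.zeros(3)
loss_log = []
for epoch in range(50):
    pred = X @ w
    assert np.isfinite(pred).all()  # validate predictions every step
    grad = 2 * X.T @ (pred - y) / len(y)
    w -= 0.1 * grad
    # Log the loss every epoch; a flat or exploding curve is the
    # earliest visible symptom of most pipeline bugs.
    loss_log.append(float(np.mean((pred - y) ** 2)))

print(f"final loss: {loss_log[-1]:.6f}")
```

In a real project the `print` calls would be structured logging or an experiment tracker, but the discipline is identical: record distributions, predictions, and losses at every step so a silent failure leaves a visible trace.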
Anti-Patterns I Avoid
✕ Jumping to hyperparameter tuning
Tuning a broken pipeline optimizes nothing useful.
Instead: Get baseline working first, then tune
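A concrete version of "baseline first": before any tuning, confirm the model beats the trivial predict-the-mean baseline. The helper name and the toy arrays below are my own illustration.

```python
import numpy as np

def beats_mean_baseline(y_true: np.ndarray, y_pred: np.ndarray) -> bool:
    """Before tuning anything, check the model beats predicting the mean."""
    baseline_mse = float(np.mean((y_true - y_true.mean()) ** 2))
    model_mse = float(np.mean((y_true - y_pred) ** 2))
    return model_mse < baseline_mse

y = np.array([1.0, 2.0, 3.0, 4.0])
ok = beats_mean_baseline(y, np.array([1.1, 2.1, 2.9, 3.9]))       # real signal
broken = beats_mean_baseline(y, np.array([4.0, 1.0, 4.0, 1.0]))   # broken model
```

If the model loses to the mean, no learning-rate sweep will save it; the bug is upstream in the data or the pipeline.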
✕ Ignoring warnings
scikit-learn and TensorFlow warnings often indicate real problems.
Instead: Treat warnings as errors, investigate each one
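In Python this is one line with the standard-library `warnings` module; the warning message below is a stand-in for whatever a library would emit:

```python
import warnings

# Promote every warning to an exception so it cannot scroll by unnoticed.
warnings.simplefilter("error")

try:
    warnings.warn("divide by zero encountered")  # stand-in for a library warning
except UserWarning as exc:
    caught = str(exc)  # forced to investigate instead of ignoring it
```

The same effect is available without code changes via `python -W error script.py`, which is convenient for auditing an existing training script.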
Resources That Shaped This
- How to debug machine learning models
- A Recipe for Training Neural Networks (Karpathy)
- Common ML debugging patterns