How I Evaluate Models
Philosophy
A model is only as good as its evaluation. I don't trust single metrics or single train-test splits. Rigorous evaluation prevents embarrassing surprises in production.
Principles
1. Match metric to business goal
Accuracy is almost never the right metric. Choose based on what errors cost.
Example: In fraud detection, prioritize recall (catching all fraud) over precision
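The precision/recall trade-off above can be sketched with toy confusion counts (the numbers below are illustrative, not from a real system):

```python
# Toy fraud scenario: 100 transactions, 10 of them fraudulent.
# The model flags 12 transactions and catches 9 of the 10 frauds.
tp = 9   # fraud correctly flagged
fp = 3   # legitimate transactions incorrectly flagged
fn = 1   # fraud the model missed

precision = tp / (tp + fp)  # 9/12 = 0.75
recall = tp / (tp + fn)     # 9/10 = 0.90

# A missed fraud (fn) usually costs far more than a false alarm (fp),
# so this is a model we would tune further toward recall.
print(f"precision={precision:.2f} recall={recall:.2f}")
```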
2. Use cross-validation religiously
A single split is a noisy estimate; k-fold cross-validation gives you a mean and an error bar.
Example: 5-fold CV: mean accuracy 85% ± 3% is more useful than "85% accuracy"
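A minimal sketch of that report with scikit-learn, using synthetic data in place of a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# One accuracy score per fold, not one score total.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Report the mean AND the spread across folds.
print(f"accuracy {scores.mean():.1%} +/- {scores.std():.1%}")
```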
3. Evaluate on realistic data
The holdout set should reflect real-world deployment conditions, including the passage of time.
Example: Train on months 1-6, validate on 7-9, test on 10-12
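That temporal split can be expressed as plain slicing on the time axis (hypothetical records keyed by month, never shuffled):

```python
# Hypothetical 12 months of time-ordered records.
records = [{"month": m, "features": None, "label": None} for m in range(1, 13)]

# Split on time, not at random: validation and test come strictly
# AFTER the training period, mimicking deployment.
train = [r for r in records if r["month"] <= 6]        # months 1-6
val   = [r for r in records if 7 <= r["month"] <= 9]   # months 7-9
test  = [r for r in records if r["month"] >= 10]       # months 10-12

print(len(train), len(val), len(test))  # 6 3 3
```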
Anti-Patterns I Avoid
✕ Evaluating on training data
Training accuracy/loss is meaningless for predicting real-world performance.
Instead: Always report holdout set or cross-validation results
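A quick demonstration of why training accuracy is meaningless: an unpruned decision tree on noisy synthetic data memorizes the training set, and only the holdout score reveals it (dataset and model here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so perfect generalization is impossible.
X, y = make_classification(n_samples=400, n_features=20,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# An unpruned tree fits the training set exactly, noise and all.
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(f"train accuracy:   {model.score(X_tr, y_tr):.1%}")  # perfect
print(f"holdout accuracy: {model.score(X_te, y_te):.1%}")  # much lower
```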
✕ Ignoring class imbalance
95% accuracy on 95% majority class data tells you nothing.
Instead: Use precision, recall, F1, AUC-ROC for imbalanced problems
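The 95% trap above can be shown in a few lines: a "model" that always predicts the majority class scores 95% accuracy while finding zero positives (toy labels, pure Python):

```python
# 95 majority-class labels, 5 minority-class labels.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # degenerate model: always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = tp / sum(y_true)  # fraction of true positives found: 0 of 5

print(f"accuracy={accuracy:.0%} recall={recall:.0%}")  # 95% accuracy, 0% recall
```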
Resources That Shaped This
- Sklearn model evaluation documentation
- Beyond accuracy: precision, recall, F1, and ROC
- Time series cross-validation strategies