
How I Evaluate Models

Philosophy

A model is only as good as its evaluation. I don't trust single metrics or single train-test splits. Rigorous evaluation prevents embarrassing surprises in production.

Principles

1. Match metric to business goal

Accuracy is almost never the right metric. Choose a metric based on what each type of error costs.

Example: Fraud detection: prioritize recall (a missed fraud case costs more than a false alarm) over precision.
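One way to encode that tradeoff is an F-beta score with beta > 1, which weights recall more heavily than precision. A minimal sketch with made-up labels (the arrays below are hypothetical, not real fraud data):

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])      # 4 fraud cases out of 10
y_cautious = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # precise, but misses fraud
y_eager = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])     # noisier, but catches all fraud

# F2 counts recall twice as heavily as precision, matching the
# fraud-costs-more-than-false-alarms goal.
for pred in (y_cautious, y_eager):
    print(precision_score(y_true, pred),
          recall_score(y_true, pred),
          fbeta_score(y_true, pred, beta=2))
```

Under F2, the eager model scores higher even though its precision is lower, which is exactly the ranking a fraud team wants.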
2. Use cross-validation religiously

A single train-test split is noisy. K-fold cross-validation gives you a mean score with a spread, which is far more informative than a point estimate.

Example: 5-fold CV: mean accuracy 85% ± 3% is more useful than a bare "85% accuracy".
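With scikit-learn this is a few lines. The synthetic dataset below is a stand-in for real data, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold CV returns one score per fold; report the mean and the spread,
# not a single number.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"accuracy: {scores.mean():.1%} ± {scores.std():.1%}")
```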
3. Evaluate on realistic data

The holdout set should reflect real-world deployment conditions, including the passage of time: split chronologically, not randomly.

Example: Train on months 1-6, validate on months 7-9, test on months 10-12.
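Assuming each row carries a month number (1-12, a hypothetical column for this sketch), the chronological split is just three boolean masks:

```python
import numpy as np

# Hypothetical month labels: 10 rows per month, 120 rows total.
months = np.repeat(np.arange(1, 13), 10)

train = months <= 6                     # months 1-6
val = (months >= 7) & (months <= 9)     # months 7-9
test = months >= 10                     # months 10-12

# The splits are disjoint, and the test period is strictly later
# than anything the model trained on -- no temporal leakage.
assert not np.any(train & val) and not np.any(val & test)
```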

Anti-Patterns I Avoid

Evaluating on training data

Training accuracy/loss is meaningless for predicting real-world performance.

Instead: Always report holdout set or cross-validation results
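An unconstrained decision tree on noisy synthetic data makes the gap concrete (the dataset here is synthetic, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects 20% label noise, so a perfect training fit must be memorization.
X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("train:", tree.score(X_tr, y_tr))  # 1.0 -- the tree memorized the training set
print("test: ", tree.score(X_te, y_te))  # substantially lower
```

The training score is flawless; only the holdout score tells you anything about deployment.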

Ignoring class imbalance

95% accuracy on data that is 95% majority class tells you nothing.

Instead: Use precision, recall, F1, or AUC-ROC for imbalanced problems
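The failure mode is easy to reproduce with a degenerate classifier that always predicts the majority class (synthetic labels below, for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)  # 95% majority class
y_pred = np.zeros(100, dtype=int)      # always predict the majority

print(accuracy_score(y_true, y_pred))                 # 0.95 -- looks great
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0  -- catches nothing
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0
```

Recall and F1 immediately expose what accuracy hides: the model never identifies a single positive case.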

Resources That Shaped This

  • Sklearn model evaluation documentation
  • Beyond accuracy: precision, recall, F1, and ROC
  • Time series cross-validation strategies