# A Practical Guide to Cross-Validation
## Why This Matters
A single train-test split is not reliable: your model might score well on one split and poorly on another, purely by chance of which rows landed where. Cross-validation replaces a single point estimate with a distribution of scores, so you can see how much the metric actually varies.
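To see the problem concretely, here is a minimal sketch (on a synthetic dataset, purely for illustration) that scores the same model on ten different random splits; the spread between the best and worst split is the variability a single split hides:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Same model, ten different random train-test splits
scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(f"min={min(scores):.3f} max={max(scores):.3f} "
      f"spread={max(scores) - min(scores):.3f}")
```

Depending on the data, the spread between the luckiest and unluckiest split can easily be several points of accuracy.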
## The Basic Idea
Instead of one split, make K splits (folds). Train K models, each time holding out a different fold as the test set, and report the mean ± standard deviation of the metric across folds.
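The procedure can be written out by hand in a few lines. This sketch uses scikit-learn's `KFold` and the iris dataset as an illustrative example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One model per fold; each fold serves as the test set exactly once
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"Accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

In practice `cross_val_score` does this loop for you, but seeing it explicitly makes clear where each of the K models comes from.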
## When to Use What
**K-Fold (k=5 or 10)**
- Default choice for most problems
- Good balance of bias and variance in the estimate

**Stratified K-Fold**
- Classification with imbalanced classes
- Maintains class proportions in each fold

**Time Series Split**
- Time-dependent data
- Prevents data leakage from future to past

**Leave-One-Out**
- Very small datasets (<100 samples)
- Computationally expensive: trains one model per sample
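The two properties that matter most above, stratification preserving class balance and time-series splits never training on the future, can be checked directly. A small sketch with made-up labels (90 negatives, 10 positives):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Illustrative imbalanced labels: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Stratified folds keep the 10% positive rate in every test fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_per_fold = [y[test_idx].sum() for _, test_idx in skf.split(X, y)]
print(pos_per_fold)  # 2 positives in each of the 5 test folds

# Time-series splits never let training indices reach past the test fold
tss = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()
```

With plain `KFold` on the same labels, some test folds could end up with zero positives, which makes metrics like F1 undefined for those folds.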
## Common Mistakes
- **Fitting preprocessing on the full dataset**: a `StandardScaler` (or any transformer) must be fit only on the training fold of each split; fitting it on all the data leaks test-fold statistics into training
- **Ignoring variance**: Mean score of 0.85 ± 0.15 is very different from 0.85 ± 0.02
- **Using the test folds for tuning**: hyperparameter search needs nested CV or a separate holdout set, otherwise the reported score is optimistically biased
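The last point, nested cross-validation, wraps a tuning loop inside an evaluation loop. This is a sketch with an illustrative dataset and parameter grid (the grid values for `C` are assumptions, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Inner loop: picks C using only the outer training data
inner = GridSearchCV(
    pipe,
    {'model__C': [0.01, 0.1, 1, 10]},  # illustrative grid
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring='f1',
)

# Outer loop: scores the whole tuning procedure on unseen folds
outer_scores = cross_val_score(
    inner, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
    scoring='f1',
)
print(f"Nested CV F1: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

The outer score estimates how the *entire* tune-then-fit procedure generalizes; reporting the inner `GridSearchCV` best score instead is exactly the mistake the bullet warns about.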
## Code Example
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create a pipeline so the scaler is fit only on each training fold (prevents leakage)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Stratified 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')

print(f"F1 Score: {scores.mean():.3f} ± {scores.std():.3f}")
```
## Takeaways
- Always use cross-validation for model evaluation
- Report both mean and standard deviation
- Choose CV strategy based on data characteristics
- Use pipelines to prevent preprocessing leakage