# A Practical Guide to Cross-Validation
## Why This Matters
A single train-test split is not reliable: your model might score well on one split and poorly on another, purely by chance of which rows landed where. Cross-validation replaces a single point estimate with a distribution of scores, so you can see how much the metric actually varies.
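To see the problem concretely, here is a minimal sketch (on a synthetic dataset, purely for illustration) that scores the same model on ten different random splits; the spread between the best and worst split is the variability a single split hides:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Same model, ten different random train-test splits
scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(f"min={min(scores):.3f} max={max(scores):.3f} "
      f"spread={max(scores) - min(scores):.3f}")
```

Depending on the data, the spread between the luckiest and unluckiest split can easily be several points of accuracy.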
## The Basic Idea
Instead of one split, make K splits (folds). Train K models, each time holding out a different fold as the test set, and report the mean ± standard deviation of the metric across folds.
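The procedure can be written out by hand in a few lines. This sketch uses scikit-learn's `KFold` and the iris dataset as an illustrative example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One model per fold; each fold serves as the test set exactly once
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"Accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

In practice `cross_val_score` does this loop for you, but seeing it explicitly makes clear where each of the K models comes from.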
## When to Use What
**K-Fold (k=5 or 10)**
- Default choice for most problems
- Good balance of bias and variance in the estimate

**Stratified K-Fold**
- Classification with imbalanced classes
- Maintains class proportions in each fold

**Time Series Split**
- Time-dependent data
- Prevents data leakage from future to past

**Leave-One-Out**
- Very small datasets (<100 samples)
- Computationally expensive: trains one model per sample
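The two properties that matter most above, stratification preserving class balance and time-series splits never training on the future, can be checked directly. A small sketch with made-up labels (90 negatives, 10 positives):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Illustrative imbalanced labels: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Stratified folds keep the 10% positive rate in every test fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_per_fold = [y[test_idx].sum() for _, test_idx in skf.split(X, y)]
print(pos_per_fold)  # 2 positives in each of the 5 test folds

# Time-series splits never let training indices reach past the test fold
tss = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()
```

With plain `KFold` on the same labels, some test folds could end up with zero positives, which makes metrics like F1 undefined for those folds.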
## Common Mistakes
- **Fitting preprocessing on the full dataset**: a `StandardScaler` (or any transformer) must be fit only on the training fold of each split; fitting it on all the data leaks test-fold statistics into training
- **Ignoring variance**: Mean score of 0.85 ± 0.15 is very different from 0.85 ± 0.02
- **Using the test folds for tuning**: hyperparameter search needs nested CV or a separate holdout set, otherwise the reported score is optimistically biased
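The last point, nested cross-validation, wraps a tuning loop inside an evaluation loop. This is a sketch with an illustrative dataset and parameter grid (the grid values for `C` are assumptions, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Inner loop: picks C using only the outer training data
inner = GridSearchCV(
    pipe,
    {'model__C': [0.01, 0.1, 1, 10]},  # illustrative grid
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring='f1',
)

# Outer loop: scores the whole tuning procedure on unseen folds
outer_scores = cross_val_score(
    inner, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
    scoring='f1',
)
print(f"Nested CV F1: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

The outer score estimates how the *entire* tune-then-fit procedure generalizes; reporting the inner `GridSearchCV` best score instead is exactly the mistake the bullet warns about.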
## Code Example
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create a pipeline so the scaler is fit only on each training fold (prevents leakage)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Stratified 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')

print(f"F1 Score: {scores.mean():.3f} ± {scores.std():.3f}")
```
## Takeaways
- Always use cross-validation for model evaluation
- Report both mean and standard deviation
- Choose CV strategy based on data characteristics
- Use pipelines to prevent preprocessing leakage