How I Design ML Pipelines
Philosophy
A good ML pipeline is reproducible, testable, and maintainable. Too many data scientists write spaghetti notebooks that work once and never again. I believe in treating ML code like production software — with proper structure, version control, and documentation.
Principles
Separate concerns
Data loading, preprocessing, feature engineering, and modeling should be distinct modules. Each step should be independently testable.
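A minimal sketch of this separation, with each stage as a pure function that can be unit-tested on tiny in-memory fixtures. All names (`load_data`, `preprocess`, and so on) and the trivial "model" are illustrative, not a real training setup:

```python
def load_data(rows):
    """Data loading: in practice this would read from disk or a database."""
    return list(rows)

def preprocess(records):
    """Cleaning: drop records with missing values."""
    return [r for r in records if None not in r.values()]

def build_features(records):
    """Feature engineering: derive a feature vector per record."""
    return [[r["x"], r["x"] ** 2] for r in records]

def train(features, labels):
    """Modeling: a trivial mean-threshold 'model' as a stand-in."""
    threshold = sum(f[0] for f in features) / len(features)
    return {"threshold": threshold}

# Each stage can be exercised independently on a small fixture:
raw = [{"x": 1, "y": 0}, {"x": None, "y": 1}, {"x": 3, "y": 1}]
clean = preprocess(load_data(raw))   # the record with a missing value is dropped
model = train(build_features(clean), [r["y"] for r in clean])
```

Because each function takes plain data in and returns plain data out, a test for `preprocess` never needs a trained model, and a test for `train` never needs the real dataset.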
Configuration over code
Hyperparameters, file paths, and feature lists should be in config files, not hardcoded. This makes experiments reproducible.
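One way this can look in practice: a single config document holds everything an experiment needs, and the code only reads from it. The JSON shown here is inlined so the sketch is self-contained; in a real project it would live in its own file (the field names and values are illustrative):

```python
import json

# Hypothetical experiment config -- in a real repo this would be a
# checked-in file such as config/experiment.json, not a string literal.
config_text = """
{
  "data_path": "data/train.csv",
  "features": ["age", "income", "tenure"],
  "model": {"learning_rate": 0.05, "n_estimators": 200},
  "seed": 42
}
"""

cfg = json.loads(config_text)

# Code pulls values from cfg instead of hardcoding them:
lr = cfg["model"]["learning_rate"]
print(lr)  # 0.05
```

Rerunning an old experiment then means checking out the config that produced it, rather than hunting for magic numbers scattered through the code.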
Version your data
Model performance depends on both code AND data. Track data versions alongside code versions.
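A lightweight way to sketch this idea is to fingerprint the data file and record that hash next to the code version in a small manifest committed with the experiment. The function and field names here are illustrative; dedicated tools such as DVC implement this properly:

```python
import hashlib
import json
from pathlib import Path

def data_fingerprint(path):
    """Content hash of a data file; changes whenever the data changes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_path, code_version, out="manifest.json"):
    """Record the data hash alongside the code version (e.g. a git commit)."""
    manifest = {
        "data_sha256": data_fingerprint(data_path),
        "code_version": code_version,
    }
    Path(out).write_text(json.dumps(manifest, indent=2))
    return manifest
```

If either the data or the code changes, the manifest changes with it, so "which data trained this model?" has a checkable answer.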
Anti-Patterns I Avoid
The "it works on my machine" notebook
Notebooks with absolute paths, missing dependencies, and cells executed out of order. Nobody can rerun them reliably, including their author.
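The absolute-path part of this has a simple fix: anchor every path to a single project root. In a real repo the root is usually derived as `Path(__file__).resolve().parents[1]` or similar; this sketch uses the working directory so it runs anywhere, and the directory layout is illustrative:

```python
from pathlib import Path

# Anchor all paths to one project root instead of hardcoding a
# machine-specific absolute path like /Users/me/project/data/train.csv.
PROJECT_ROOT = Path.cwd()  # in a real script: Path(__file__).resolve().parents[1]
DATA_DIR = PROJECT_ROOT / "data"
TRAIN_PATH = DATA_DIR / "train.csv"
print(TRAIN_PATH.name)  # train.csv
```

The notebook then works the same on any machine that checks out the repo, which also makes the missing-dependencies problem easier to notice.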
Training and evaluation in the same loop
Interleaving training and evaluation code means a trained model can't be re-evaluated, or evaluated on new data, without retraining from scratch.
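One way to keep the two apart: training writes a model artifact to disk, and evaluation only reads it. The trivial mean-predictor "model" and the file format here are illustrative stand-ins:

```python
import json
from pathlib import Path

def train(xs, ys, artifact="model.json"):
    """Training writes an artifact and returns its path."""
    model = {"mean": sum(ys) / len(ys)}  # trivial model: predict the label mean
    Path(artifact).write_text(json.dumps(model))
    return artifact

def evaluate(artifact, xs, ys):
    """Evaluation loads the artifact -- it never touches training code."""
    model = json.loads(Path(artifact).read_text())
    preds = [model["mean"]] * len(ys)
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)  # MSE

path = train([1, 2, 3], [1.0, 2.0, 3.0])
mse = evaluate(path, [1, 2, 3], [1.0, 2.0, 3.0])
```

Because `evaluate` depends only on the artifact, you can score the same trained model on a new test set, or re-score it months later, without rerunning training.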
Resources That Shaped This
- Made with ML (madewithml.com)
- ML Engineering by Andriy Burkov
- Cookiecutter Data Science template