
How I Design ML Pipelines

Philosophy

A good ML pipeline is reproducible, testable, and maintainable. Too many data scientists write spaghetti notebooks that work once and never again. I believe in treating ML code like production software — with proper structure, version control, and documentation.

Principles

1. Separate concerns

Data loading, preprocessing, feature engineering, and modeling should be distinct modules. Each step should be independently testable.

Example: data_loader.py, preprocessor.py, features.py, model.py
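The separation above can be sketched as a chain of pure functions — one per stage. The module and function names are illustrative (in a real project each stage would live in its own file, as named above), and the data and "model" are toy stand-ins:

```python
# Minimal sketch of separated pipeline stages. In a real project each
# function would live in its own module:
# data_loader.py, preprocessor.py, features.py, model.py.

def load_data(path):
    """data_loader: read raw rows; stubbed here with fixed records."""
    return [{"age": 34, "income": 72000}, {"age": 51, "income": 48000}]

def preprocess(rows):
    """preprocessor: drop incomplete records."""
    return [r for r in rows if all(v is not None for v in r.values())]

def build_features(rows):
    """features: turn records into numeric vectors."""
    return [[r["age"], r["income"] / 1000.0] for r in rows]

def train(features):
    """model: fit something; here a trivial per-column mean."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def run_pipeline(path):
    # Each stage is independently testable because it is a pure function
    # with explicit inputs and outputs.
    return train(build_features(preprocess(load_data(path))))

print(run_pipeline("data/raw.csv"))  # → [42.5, 60.0]
```

Because each stage takes plain data in and returns plain data out, a unit test can exercise `preprocess` or `build_features` without touching disk or training anything.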
2. Configuration over code

Hyperparameters, file paths, and feature lists should be in config files, not hardcoded. This makes experiments reproducible.

Example: config.yaml with model_params, data_paths, feature_list
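A sketch of the idea, using JSON so it stays stdlib-only (the same pattern applies to the config.yaml above, e.g. via PyYAML's `yaml.safe_load`). The keys mirror the ones named in the example; the values are made up:

```python
import json

# Illustrative config; in practice this lives in its own file under
# version control, so every experiment's settings are recorded.
CONFIG_TEXT = """
{
  "model_params": {"n_estimators": 200, "max_depth": 6},
  "data_paths": {"raw": "data/raw.csv", "processed": "data/processed.csv"},
  "feature_list": ["age", "income", "tenure"]
}
"""

def load_config(text):
    return json.loads(text)

cfg = load_config(CONFIG_TEXT)
# Code reads hyperparameters from the config instead of hardcoding them:
print(cfg["model_params"]["max_depth"])  # → 6
print(cfg["feature_list"])
```

Re-running an old experiment is then just checking out the commit and pointing the pipeline at its config file.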
3. Version your data

Model performance depends on both code AND data. Track data versions alongside code versions.

Example: Use DVC, log data checksums, or maintain data manifests
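The manifest option is simple enough to sketch with the standard library: record a checksum per data file and commit the manifest next to the code. File names and contents here are illustrative (inlined bytes instead of reads from disk):

```python
import hashlib
import json

def checksum(data: bytes) -> str:
    # SHA-256 of the file contents identifies the exact data version.
    return hashlib.sha256(data).hexdigest()

def build_manifest(files: dict) -> str:
    """files maps path -> raw bytes; returns a JSON manifest string."""
    entries = {path: checksum(data) for path, data in sorted(files.items())}
    return json.dumps(entries, indent=2)

manifest = build_manifest({
    "data/raw.csv": b"age,income\n34,72000\n51,48000\n",
})
print(manifest)
# Commit the manifest alongside the code; a changed checksum on the next
# run flags that the data (not just the code) has changed.
```

Tools like DVC do this plus storage and retrieval, but even a hand-rolled manifest catches the "same code, mysteriously different metrics" class of bug.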

Anti-Patterns I Avoid

The "it works on my machine" notebook

Notebooks with absolute paths, missing dependencies, and cells run in random order.

Instead: Convert to scripts, use relative paths, document dependencies in requirements.txt
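The relative-paths part of that fix is a one-liner with `pathlib`: anchor all paths to the script's own location instead of to someone's home directory. The directory layout here is an assumption for illustration:

```python
from pathlib import Path

# Resolve paths relative to the project, not the machine. Because
# PROJECT_ROOT is derived from this file's location, the script works
# wherever the repo is checked out — no /Users/alice/... hardcoded.
PROJECT_ROOT = Path(__file__).resolve().parent
DATA_DIR = PROJECT_ROOT / "data"

def data_path(name: str) -> Path:
    return DATA_DIR / name

print(data_path("raw.csv"))  # e.g. <repo>/data/raw.csv on any machine
```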

Training and evaluation in the same loop

Mixing training and evaluation code in one loop means evaluation cannot be re-run without retraining, so results are hard to reproduce or audit.

Instead: Save model artifacts, run evaluation as separate step
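A sketch of the two-step shape, with a toy "model" and `pickle` standing in for whatever serialization the real stack uses (e.g. `joblib` for scikit-learn models):

```python
import pickle
import tempfile
from pathlib import Path

def train(data):
    # Toy "model": just the mean of the training values.
    return {"mean": sum(data) / len(data)}

def save_model(model, path):
    Path(path).write_bytes(pickle.dumps(model))

def load_model(path):
    return pickle.loads(Path(path).read_bytes())

def evaluate(model, test_data):
    # Mean absolute error against the stored mean.
    return sum(abs(x - model["mean"]) for x in test_data) / len(test_data)

# Step 1: the training run writes an artifact and exits.
artifact = Path(tempfile.gettempdir()) / "model.pkl"
save_model(train([1.0, 2.0, 3.0]), artifact)

# Step 2: evaluation is a separate process that only reads the artifact,
# so it can be re-run (or run on new test data) without retraining.
print(evaluate(load_model(artifact), [2.0, 4.0]))  # → 1.0
```

Splitting at the artifact boundary also means evaluation on a new test set, or a re-audit months later, needs only the saved file — not the training environment.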

Resources That Shaped This

  • Made with ML (madewithml.com)
  • ML Engineering by Andriy Burkov
  • Cookiecutter Data Science template