How I Handle Data Quality
Philosophy
Garbage in, garbage out. No model can fix fundamentally flawed data. I spend more time understanding and cleaning data than tuning models. Data quality is where most ML projects succeed or fail.
Principles
1. Understand before cleaning
Before dropping rows or filling nulls, understand WHY the data looks the way it does. Missing data often carries information.
Example: Missing salary might mean "declined to answer" vs "not asked"
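The salary example above can be made concrete. A minimal sketch (the column names and data are hypothetical) that separates "declined to answer" from "not asked" before any imputation decision:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: salary is missing for two different reasons.
df = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4, 5],
    "was_asked_salary": [True, True, True, False, False],
    "salary": [72000.0, np.nan, 65000.0, np.nan, np.nan],
})

# Label WHY each value is missing before deciding how to handle it.
df["salary_status"] = np.select(
    [df["salary"].notna(), df["was_asked_salary"]],
    ["answered", "declined"],
    default="not_asked",
)
print(df["salary_status"].value_counts())
```

"Declined" and "not asked" may warrant entirely different treatments downstream, which is exactly the information a blanket fill or drop would destroy.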
2. Document every transformation
Every cleaning decision is a modeling decision. Document what you changed and why.
Example: Logged: "Removed 234 rows with negative age values (data entry errors)"
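One lightweight way to produce log lines like the one above is to log inside each cleaning function, so the record of what changed and why travels with the code. A sketch using Python's standard logging module, with hypothetical data:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("cleaning")

def drop_negative_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with negative ages, logging how many were dropped and why."""
    mask = df["age"] < 0
    log.info("Removed %d rows with negative age values (data entry errors)",
             mask.sum())
    return df[~mask].copy()

df = pd.DataFrame({"age": [25, -3, 41, -1, 30]})
df = drop_negative_ages(df)
```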
3. Validate at every step
Add assertions and checks after each transformation. Catch problems early.
Example: assert df["age"].min() >= 0, "Negative ages found after cleaning"
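The single assertion above can grow into a small validation step that runs after each transformation. A sketch with illustrative column names and invariants:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Run cheap invariant checks after a transformation; fail fast on violations."""
    assert df["age"].min() >= 0, "Negative ages found after cleaning"
    assert df["age"].max() <= 120, "Implausibly large age found"
    assert df["user_id"].is_unique, "Duplicate user_id after cleaning"
    return df

# Chain it after each cleaning step so a bad transformation fails immediately.
clean = validate(pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 41, 30]}))
```

Because `validate` returns the frame, it slots into a pipeline between steps without changing the data flow.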
Anti-Patterns I Avoid
✕ Filling all nulls with mean
Mean imputation is almost never the right answer: it shrinks variance, distorts the distribution, and erases the signal that a value was missing at all.
Instead: Investigate why values are missing, use domain-appropriate strategies
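One domain-appropriate strategy, sketched here with hypothetical data: keep the fact of missingness as an explicit feature, and impute within a meaningful group rather than with a single global mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "segment": ["a", "a", "b", "b", "b"],
    "income": [50000.0, np.nan, 80000.0, 75000.0, np.nan],
})

# Preserve the missingness signal as its own column...
df["income_missing"] = df["income"].isna()

# ...then impute with the median of a domain-meaningful group,
# which respects between-group differences a global mean would flatten.
df["income"] = df.groupby("segment")["income"].transform(
    lambda s: s.fillna(s.median())
)
```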
✕ Dropping all rows with any null
Removing incomplete rows can introduce severe bias if missingness is not random.
Instead: Analyze missingness patterns, consider multiple imputation
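A quick first check on whether missingness is random (hypothetical data): compare missingness rates across groups. If the rates differ sharply, dropping incomplete rows skews the sample toward the complete group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "plan": ["free", "free", "paid", "paid", "free"],
    "spend": [np.nan, np.nan, 12.5, 30.0, np.nan],
})

# Missingness rate per group: here "spend" is only missing for free-plan
# users, so dropping incomplete rows would leave a paid-only sample.
missing_by_plan = df["spend"].isna().groupby(df["plan"]).mean()
print(missing_by_plan)
```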
Resources That Shaped This
- Thoughtful Machine Learning with Python
- The "Data Cleaning and Preparation" chapter in Python for Data Analysis
- scikit-learn preprocessing documentation