
How I Handle Data Quality

Philosophy

Garbage in, garbage out. No model can fix fundamentally flawed data. I spend more time understanding and cleaning data than tuning models. Data quality is where most ML projects succeed or fail.

Principles

1. Understand before cleaning

Before dropping rows or filling nulls, understand WHY the data looks the way it does. Missing data often carries information.

Example: Missing salary might mean "declined to answer" vs "not asked"
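One way to act on this principle is to check whether missingness correlates with another column before touching anything. A minimal pandas sketch (the salary and survey_version columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: salary is missing for two different reasons
df = pd.DataFrame({
    "salary": [50000, np.nan, 72000, np.nan, 61000, np.nan],
    "survey_version": ["long", "long", "long", "short", "long", "short"],
})

# Overall missing rate
missing_rate = df["salary"].isna().mean()

# Missing rate by a candidate explanatory column: here the "short"
# survey never asked about salary, so its missingness is structural
# ("not asked"), not a respondent declining to answer
by_version = df.groupby("survey_version")["salary"].apply(lambda s: s.isna().mean())
print(by_version)
```

If the missing rate differs sharply across groups like this, a single global fill strategy would erase real information.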

2. Document every transformation

Every cleaning decision is a modeling decision. Document what you changed and why.

Example: Logged: "Removed 234 rows with negative age values (data entry errors)"
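One lightweight way to make that documentation automatic is to wrap each cleaning step in a helper that logs the row-count change and the reason. This is a sketch, not a prescribed pattern; the function and column names are hypothetical:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cleaning")

def log_step(df, func, reason):
    """Apply a cleaning step and log how many rows changed, and why."""
    before = len(df)
    out = func(df)
    logger.info("%s: %d -> %d rows (%+d). Reason: %s",
                func.__name__, before, len(out), len(out) - before, reason)
    return out

# Hypothetical data with a data-entry error
df = pd.DataFrame({"age": [34, -1, 27, 150, 45]})

def drop_negative_ages(df):
    return df[df["age"] >= 0]

df = log_step(df, drop_negative_ages, "negative ages are data entry errors")
```

The log line doubles as the documentation trail: every row that disappears is accounted for with a reason.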

3. Validate at every step

Add assertions and checks after each transformation. Catch problems early.

Example: assert df["age"].min() >= 0, "Negative ages found after cleaning"
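A single assertion like that can grow into a small reusable check that slots in after each transformation. A sketch of the idea (the specific bounds and column names are illustrative assumptions):

```python
import pandas as pd

def validate(df):
    """Run sanity checks after a cleaning step; fail loudly on surprises."""
    assert df["age"].min() >= 0, "Negative ages found after cleaning"
    assert df["age"].max() <= 120, "Implausibly large ages found"
    assert df["age"].notna().all(), "Nulls remain in age column"
    return df  # returning df lets the check sit inside a pipeline

clean = pd.DataFrame({"age": [34, 27, 45]})
clean = validate(clean)  # passes silently

bad = pd.DataFrame({"age": [34, -1, 45]})
try:
    validate(bad)
except AssertionError as err:
    print(f"caught: {err}")
```

Because `validate` returns the frame, it can be chained between steps, so a bad transformation fails at the step that caused it rather than at model-training time.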

Anti-Patterns I Avoid

Filling all nulls with mean

Mean imputation is almost never the right answer. It distorts distributions and hides patterns.

Instead: Investigate why values are missing, use domain-appropriate strategies
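Two domain-appropriate alternatives that often apply: keep a missingness indicator so the signal survives, and impute within a meaningful group rather than with the global mean. A pandas sketch with hypothetical salary/role columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [50000.0, np.nan, 72000.0, np.nan, 61000.0, 58000.0],
    "role":   ["junior", "junior", "senior", "senior", "senior", "junior"],
})

# 1. Preserve the signal: record WHICH values were missing
df["salary_missing"] = df["salary"].isna()

# 2. Fill within a domain-meaningful group (per-role median) instead of
#    a global mean, which would pull juniors up and seniors down
df["salary"] = df.groupby("role")["salary"].transform(
    lambda s: s.fillna(s.median())
)
print(df)
```

The indicator column lets a downstream model learn from the fact of missingness itself, which a silent mean fill destroys.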

Dropping all rows with any null

Removing incomplete rows can introduce severe bias if missingness is not random.

Instead: Analyze missingness patterns, consider multiple imputation
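A quick missingness-pattern check: compare the distribution of an observed column between rows where the target is missing and rows where it is present. A sketch with hypothetical income/age columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [np.nan, 42000, np.nan, 51000, np.nan, 60000],
    "age":    [71, 35, 68, 41, 75, 38],
})

# If age differs sharply between the two groups, income is not missing
# completely at random, and dropping those rows would bias the sample
# toward younger respondents
missing_mask = df["income"].isna()
print(df.groupby(missing_mask)["age"].mean())
```

When the pattern is not random, multiple-imputation-style approaches are worth considering; scikit-learn's `IterativeImputer` (still experimental, enabled via `from sklearn.experimental import enable_iterative_imputer`) is one option.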

Resources That Shaped This

  • Thoughtful Machine Learning with Python
  • Data Cleaning chapter in Python for Data Analysis
  • scikit-learn preprocessing documentation