How I Handle Data Quality
Philosophy
Garbage in, garbage out. No model can fix fundamentally flawed data. I spend more time understanding and cleaning data than tuning models. Data quality is where most ML projects succeed or fail.
Principles
1. Understand before cleaning
Before dropping rows or filling nulls, understand WHY the data looks the way it does. Missing data often carries information.
Example: Missing salary might mean "declined to answer" vs "not asked"
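The salary example above can be made concrete. A minimal sketch (the column names and data are hypothetical) that separates "declined to answer" from "not asked" before any imputation decision:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: salary is missing for two different reasons.
df = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4, 5],
    "was_asked_salary": [True, True, True, False, False],
    "salary": [72000.0, np.nan, 65000.0, np.nan, np.nan],
})

# Label WHY each value is missing before deciding how to handle it.
df["salary_status"] = np.select(
    [df["salary"].notna(), df["was_asked_salary"]],
    ["answered", "declined"],
    default="not_asked",
)
print(df["salary_status"].value_counts())
```

"Declined" and "not asked" may warrant entirely different treatments downstream, which is exactly the information a blanket fill or drop would destroy.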
2. Document every transformation
Every cleaning decision is a modeling decision. Document what you changed and why.
Example: Logged: "Removed 234 rows with negative age values (data entry errors)"
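One lightweight way to produce log lines like the one above is to log inside each cleaning function, so the record of what changed and why travels with the code. A sketch using Python's standard logging module, with hypothetical data:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("cleaning")

def drop_negative_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with negative ages, logging how many were dropped and why."""
    mask = df["age"] < 0
    log.info("Removed %d rows with negative age values (data entry errors)",
             mask.sum())
    return df[~mask].copy()

df = pd.DataFrame({"age": [25, -3, 41, -1, 30]})
df = drop_negative_ages(df)
```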
3. Validate at every step
Add assertions and checks after each transformation. Catch problems early.
Example: assert df["age"].min() >= 0, "Negative ages found after cleaning"
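The single assertion above can grow into a small validation step that runs after each transformation. A sketch with illustrative column names and invariants:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Run cheap invariant checks after a transformation; fail fast on violations."""
    assert df["age"].min() >= 0, "Negative ages found after cleaning"
    assert df["age"].max() <= 120, "Implausibly large age found"
    assert df["user_id"].is_unique, "Duplicate user_id after cleaning"
    return df

# Chain it after each cleaning step so a bad transformation fails immediately.
clean = validate(pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 41, 30]}))
```

Because `validate` returns the frame, it slots into a pipeline between steps without changing the data flow.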
Anti-Patterns I Avoid
✕ Filling all nulls with mean
Mean imputation is almost never the right answer: it shrinks variance, distorts the distribution, and erases the signal that a value was missing at all.
Instead: Investigate why values are missing, use domain-appropriate strategies
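One domain-appropriate strategy, sketched here with hypothetical data: keep the fact of missingness as an explicit feature, and impute within a meaningful group rather than with a single global mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "segment": ["a", "a", "b", "b", "b"],
    "income": [50000.0, np.nan, 80000.0, 75000.0, np.nan],
})

# Preserve the missingness signal as its own column...
df["income_missing"] = df["income"].isna()

# ...then impute with the median of a domain-meaningful group,
# which respects between-group differences a global mean would flatten.
df["income"] = df.groupby("segment")["income"].transform(
    lambda s: s.fillna(s.median())
)
```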
✕ Dropping all rows with any null
Removing incomplete rows can introduce severe bias if missingness is not random.
Instead: Analyze missingness patterns, consider multiple imputation
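A quick first check on whether missingness is random (hypothetical data): compare missingness rates across groups. If the rates differ sharply, dropping incomplete rows skews the sample toward the complete group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "plan": ["free", "free", "paid", "paid", "free"],
    "spend": [np.nan, np.nan, 12.5, 30.0, np.nan],
})

# Missingness rate per group: here "spend" is only missing for free-plan
# users, so dropping incomplete rows would leave a paid-only sample.
missing_by_plan = df["spend"].isna().groupby(df["plan"]).mean()
print(missing_by_plan)
```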
Resources That Shaped This
- Thoughtful Machine Learning with Python
- The "Data Cleaning and Preparation" chapter in Python for Data Analysis
- scikit-learn preprocessing documentation