Data Cleaning & Preprocessing

🧹 What is Data Cleaning & Preprocessing?

It’s the process of preparing raw data so it can be analyzed or modeled.

Messy data leads to misleading results: duplicated rows can skew averages, and mistyped values can distort a model. That makes this step crucial!

🔧 Common Steps in Data Cleaning:

Task	Description
Handling Missing Values	Fill (impute), remove, or flag missing entries
Removing Duplicates	Drop repeated rows or records
Fixing Inconsistencies	Standardize categories (e.g., "NY" vs "New York")
Correcting Errors	Catch typos or wrong values (e.g., age = -3)
Outlier Detection	Identify extreme values using boxplots, Z-scores, or IQR
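
The steps in the table can be sketched in pandas roughly as follows. This is a minimal illustration, not a definitive recipe: the DataFrame, its column names (city, age, income), and the choice of median imputation are all assumptions made up for the example. Outliers are flagged with the common 1.5 × IQR rule, since Z-scores are not very informative on a sample this small.

```python
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "city": ["NY", "New York", "Boston", "Boston", "NY", "Chicago"],
    "age": [34.0, -3.0, 29.0, 29.0, None, 41.0],
    "income": [52000.0, 48000.0, 51000.0, 51000.0, 50000.0, 900000.0],
})

# Removing duplicates: drop rows that exactly repeat an earlier row.
df = df.drop_duplicates().reset_index(drop=True)

# Fixing inconsistencies: map variant spellings to one canonical label.
df["city"] = df["city"].replace({"NY": "New York"})

# Correcting errors: a negative age is impossible, so treat it as missing.
df["age"] = df["age"].mask(df["age"] < 0)

# Handling missing values: flag them first, then impute with the median
# (median imputation is one assumed choice; dropping the rows is another).
df["age_was_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

# Outlier detection: flag incomes more than 1.5 * IQR outside the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

print(df)
```

Flagging missing values and outliers in separate columns, rather than silently overwriting or deleting them, keeps the original problem visible for later analysis.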

🧪 Data Preprocessing Techniques:

Technique	Description
Normalization / Scaling	Rescale numeric data to a common range (e.g., 0 to 1) or to zero mean and unit variance (standardization)
Encoding Categorical Data	Convert category labels to numbers (e.g., one-hot encoding)
Feature Engineering	Create new features (e.g., extract year from date)
Binning	Group continuous data into categories (e.g., age groups)
Text Preprocessing	Clean text: lowercasing, removing punctuation, stopwords, etc.
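
Here is a small sketch of the techniques in the table using pandas and the standard library. Everything in it is assumed for illustration: the toy DataFrame, the column names, the age-group boundaries, and the tiny stopword set (real projects usually rely on a fuller list from a library). Scaling is done by hand with the min-max formula so the arithmetic stays visible.

```python
import string

import pandas as pd

# Hypothetical toy data; all column names and values are made up.
df = pd.DataFrame({
    "income": [30000.0, 50000.0, 70000.0],
    "color": ["red", "blue", "red"],
    "signup_date": ["2021-03-15", "2022-07-01", "2023-11-30"],
    "age": [15, 34, 70],
    "review": ["Great product!!", "NOT worth the price.", "It is OK, I guess"],
})

# Normalization / scaling: min-max scaling maps the column onto [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

# Encoding categorical data: one-hot encode "color" into 0/1 columns.
df = pd.get_dummies(df, columns=["color"], dtype=int)

# Feature engineering: extract the year from a date string.
df["signup_year"] = pd.to_datetime(df["signup_date"]).dt.year

# Binning: group ages into labeled ranges (the boundaries are assumptions).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 17, 64, 120], labels=["minor", "adult", "senior"]
)

# Text preprocessing: lowercase, strip punctuation, drop stopwords.
STOPWORDS = {"the", "is", "it", "i"}  # deliberately tiny, illustrative set


def clean_text(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOPWORDS]


df["review_tokens"] = df["review"].apply(clean_text)

print(df)
```

For production pipelines, scikit-learn provides equivalents such as MinMaxScaler and OneHotEncoder that remember the fitted parameters, which matters when the same transformation has to be applied to new data.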

🧠 Real-World Example (Before & After):

Raw data:

Name: Alice, Age: "Twenty-two", Gender: "f", Income: 50k
Name: Bob, Age: 35, Gender: "M", Income: null
Name: Alice, Age: "Twenty-two", Gender: "f", Income: 50k  ← duplicate

After cleaning:

Name: Alice, Age: 22, Gender: Female, Income: 50000
Name: Bob, Age: 35, Gender: Male, Income: (filled or removed)
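
One way to get from the raw records to the cleaned ones, sketched in pandas. The helper functions parse_age and parse_income and the WORD_TO_NUMBER lookup are hypothetical names invented for this example, and the lookup only covers the single spelled-out age that appears here. Both options for Bob's missing income are shown: filling it (here with the median, an assumed choice) or removing the row.

```python
import pandas as pd

raw = pd.DataFrame([
    {"Name": "Alice", "Age": "Twenty-two", "Gender": "f", "Income": "50k"},
    {"Name": "Bob", "Age": 35, "Gender": "M", "Income": None},
    {"Name": "Alice", "Age": "Twenty-two", "Gender": "f", "Income": "50k"},
])

# Assumed lookup for spelled-out ages; real data would need a fuller parser.
WORD_TO_NUMBER = {"twenty-two": 22}


def parse_age(value):
    """Turn a spelled-out age into a number; pass numeric ages through."""
    if isinstance(value, str):
        return WORD_TO_NUMBER.get(value.strip().lower())
    return value


def parse_income(value):
    """Convert strings like '50k' to 50000.0; keep missing values missing."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip().lower()
    if text.endswith("k"):
        return float(text[:-1]) * 1000
    return float(text)


# Remove the duplicate Alice row, then standardize each column.
clean = raw.drop_duplicates().reset_index(drop=True)
clean["Age"] = clean["Age"].map(parse_age).astype(int)
clean["Gender"] = clean["Gender"].str.lower().map({"f": "Female", "m": "Male"})
clean["Income"] = clean["Income"].map(parse_income)

# Option 1: fill the missing income (median is an assumed choice).
filled = clean.fillna({"Income": clean["Income"].median()})

# Option 2: remove rows that have no income.
removed = clean.dropna(subset=["Income"])

print(clean)
```

Which option fits depends on the analysis: filling keeps Bob in the dataset at the cost of an invented value, while removing keeps only observed incomes but shrinks the sample.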

🛠️ Tools Commonly Used:

Python: pandas, numpy, scikit-learn
R: dplyr, tidyr, caret
Excel: Filters, formulas, Power Query

