🧹 What is Data Cleaning & Preprocessing?
It’s the process of preparing raw data so it can be analyzed or modeled.
Messy data leads to misleading results, so this step is crucial!
🔧 Common Steps in Data Cleaning:
Task | Description |
---|---|
Handling Missing Values | Fill (impute), remove, or flag missing entries |
Removing Duplicates | Drop repeated rows or records |
Fixing Inconsistencies | Standardize categories (e.g., "NY" vs "New York") |
Correcting Errors | Catch typos or wrong values (e.g., age = -3) |
Outlier Detection | Identify extreme values using boxplots, Z-scores, or IQR |
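
The steps in the table can be sketched with pandas roughly as follows. This is a minimal illustration, not a definitive recipe: the DataFrame, its column names, and its values are all made up, and the choices made here (median imputation, the 1.5 × IQR rule, flagging outliers rather than dropping them) are assumptions that may not suit every dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "city": ["NY", "New York", "Boston", "Boston", "NY", "Boston", "NY", "Boston"],
    "age": [25, 31, -3, -3, 40, None, 29, 35],
    "income": [52000, 48000, 51000, 51000, 50000, 49000, 47000, 400000],
})

# Removing duplicates: drop rows that exactly repeat an earlier row.
df = df.drop_duplicates().reset_index(drop=True)

# Fixing inconsistencies: map spelling variants to one canonical label.
df["city"] = df["city"].replace({"NY": "New York"})

# Correcting errors: a negative age is impossible, so treat it as missing.
df["age"] = df["age"].mask(df["age"] < 0)

# Handling missing values: flag them first, then impute with the median.
df["age_was_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

# Outlier detection: flag incomes outside 1.5 * IQR of the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = ~df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(df)
```

Flagging (the `age_was_missing` and `income_outlier` columns) keeps the original rows available, so the decision to drop or keep them can be made later.
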
🧪 Data Preprocessing Techniques:
Technique | Description |
---|---|
Normalization / Scaling | Rescale numeric data to a common range (e.g., 0 to 1) or to zero mean and unit variance |
Encoding Categorical Data | Convert strings to numbers (e.g., one-hot encoding) |
Feature Engineering | Create new features (e.g., extract year from date) |
Binning | Group continuous data into categories (e.g., age groups) |
Text Preprocessing | Clean text: lowercasing, removing punctuation, stopwords, etc. |
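
Each technique in the table can be tried on a small example. The sketch below uses pandas only. The data, the column names, the age cut points, and the tiny stopword list are all illustrative assumptions, and min-max scaling is just one of several possible scaling methods.

```python
import pandas as pd

# Hypothetical toy data; all names and values are illustrative.
df = pd.DataFrame({
    "income": [30000, 50000, 70000],
    "color": ["red", "blue", "red"],
    "signup": ["2021-03-15", "2022-07-01", "2023-11-30"],
    "age": [15, 34, 70],
    "review": ["Great product!", "Not the best...", "It is OK"],
})

# Normalization / scaling: min-max rescale income to the range 0 to 1.
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

# Encoding categorical data: one-hot encode the color column.
df = pd.get_dummies(df, columns=["color"], dtype=int)

# Feature engineering: extract the year from a date string.
df["signup_year"] = pd.to_datetime(df["signup"]).dt.year

# Binning: group ages into labeled ranges (the cut points are arbitrary).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 17, 64, 120], labels=["minor", "adult", "senior"]
)

# Text preprocessing: lowercase, strip punctuation, drop stopwords.
stopwords = {"the", "is", "it", "a", "an"}  # tiny illustrative list
lowered = df["review"].str.lower().str.replace(r"[^\w\s]", "", regex=True)
df["review_clean"] = lowered.apply(
    lambda s: " ".join(w for w in s.split() if w not in stopwords)
)

print(df)
```

In a modeling pipeline, scaling and encoding parameters are usually learned from the training data only and then reused on new data. scikit-learn's transformers are a common way to do that.
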
🧠 Real-World Example (Before & After):
Raw data:
- Name: Alice, Age: "Twenty-two", Gender: "f", Income: 50k
- Name: Bob, Age: 35, Gender: "M", Income: null
- Name: Alice, Age: "Twenty-two", Gender: "f", Income: 50k ← duplicate
After cleaning:
- Name: Alice, Age: 22, Gender: Female, Income: 50000
- Name: Bob, Age: 35, Gender: Male, Income: (filled or removed)
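
One way to get from the raw records to the cleaned ones is sketched below in plain Python. The lookup tables `AGE_WORDS` and `GENDERS` and the helpers `parse_income` and `clean` are hypothetical names introduced for this illustration; they cover only the values in the example. Bob's missing income is left as `None` so it can be filled or removed later.

```python
# Raw records copied from the example above.
raw = [
    {"Name": "Alice", "Age": "Twenty-two", "Gender": "f", "Income": "50k"},
    {"Name": "Bob", "Age": 35, "Gender": "M", "Income": None},
    {"Name": "Alice", "Age": "Twenty-two", "Gender": "f", "Income": "50k"},
]

# Assumed lookup tables; a real dataset would need fuller mappings.
AGE_WORDS = {"twenty-two": 22}
GENDERS = {"f": "Female", "m": "Male"}


def parse_income(value):
    """Turn '50k' into 50000; leave missing values as None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text.endswith("k"):
        return int(float(text[:-1]) * 1000)
    return int(float(text))


def clean(record):
    """Return a copy of the record with standardized types and labels."""
    age = record["Age"]
    if isinstance(age, str):
        age = AGE_WORDS.get(age.strip().lower())  # None if the word is unknown
    return {
        "Name": record["Name"],
        "Age": age,
        "Gender": GENDERS.get(str(record["Gender"]).strip().lower()),
        "Income": parse_income(record["Income"]),
    }


cleaned = []
seen = set()
for record in map(clean, raw):
    key = tuple(record.items())
    if key not in seen:  # keep only the first copy of each record
        seen.add(key)
        cleaned.append(record)

for record in cleaned:
    print(record)
```

Deduplicating after standardizing also catches rows that differ only in formatting, such as "f" versus "F".
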
🛠️ Tools Commonly Used:
- Python: pandas, numpy, scikit-learn
- R: dplyr, tidyr, caret
- Excel: Filters, formulas, Power Query