Data Cleaning & Preprocessing


🧹 What is Data Cleaning & Preprocessing?

It’s the process of preparing raw data so it can be analyzed or modeled.

Analyses and models built on messy data produce misleading results, so this step is crucial!

🔧 Common Steps in Data Cleaning:

Task                     Description
Handling Missing Values  Fill (impute), remove, or flag missing entries
Removing Duplicates      Drop repeated rows or records
Fixing Inconsistencies   Standardize categories (e.g., "NY" vs "New York")
Correcting Errors        Catch typos or wrong values (e.g., age = -3)
Outlier Detection        Identify extreme values using boxplots, Z-scores, or IQR
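
The cleaning steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration rather than a recommended pipeline: the toy data, the column names, the choice of median imputation, and the 1.5 × IQR cutoff are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "city": ["NY", "New York", "Boston", "Boston", "NY", "Chicago"],
    "age": [34.0, 29.0, -3.0, -3.0, None, 41.0],
    "income": [52000.0, 48000.0, 51000.0, 51000.0, 50000.0, 950000.0],
})

# Fixing inconsistencies: map variant spellings to one canonical label.
df["city"] = df["city"].replace({"NY": "New York"})

# Removing duplicates: drop rows that exactly repeat an earlier row.
df = df.drop_duplicates().reset_index(drop=True)

# Correcting errors: a negative age is impossible, so treat it as missing.
df.loc[df["age"] < 0, "age"] = float("nan")

# Handling missing values: flag them first, then impute with the median.
df["age_was_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

# Outlier detection: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = ~df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(df)
```

Flagging imputed values and outliers in separate columns, instead of silently overwriting or dropping them, keeps the decision about what to do with them visible to later steps.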

🧪 Data Preprocessing Techniques:

Technique                  Description
Normalization / Scaling    Rescale numeric data to a common range (e.g., 0 to 1) or to zero mean and unit variance
Encoding Categorical Data  Convert strings to numbers (e.g., one-hot encoding)
Feature Engineering        Create new features (e.g., extract year from date)
Binning                    Group continuous data into categories (e.g., age groups)
Text Preprocessing         Clean text: lowercasing, removing punctuation, stopwords, etc.
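
A rough pandas sketch of the preprocessing techniques above, using only pandas so it stays self-contained. The toy data, column names, and bin edges are assumptions chosen for illustration, and the text step covers lowercasing and punctuation removal only (stopword removal would need a word list that is not shown here).

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "age": [18, 35, 52, 70],
    "color": ["red", "blue", "red", "green"],
    "signup": ["2021-03-15", "2022-07-01", "2020-11-30", "2023-01-09"],
    "review": ["Great product!", "Not bad, really.",
               "Would NOT buy again", "OK"],
})

# Normalization: min-max scale age into the range 0 to 1.
age_min, age_max = df["age"].min(), df["age"].max()
df["age_scaled"] = (df["age"] - age_min) / (age_max - age_min)

# Feature engineering: extract the year from a date string.
df["signup_year"] = pd.to_datetime(df["signup"]).dt.year

# Binning: group ages into labeled ranges (the edges are arbitrary choices).
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["young", "middle", "senior"])

# Text preprocessing: lowercase, then drop everything except
# letters, digits, and whitespace.
df["review_clean"] = (df["review"].str.lower()
                      .str.replace(r"[^a-z0-9\s]", "", regex=True))

# Encoding: one-hot encode the color column into indicator columns.
df = pd.get_dummies(df, columns=["color"], prefix="color")

print(df)
```

In a modeling workflow, scaling parameters such as the minimum and maximum are usually computed on the training split only and then reused on new data; scikit-learn's transformers handle that bookkeeping.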

🧠 Real-World Example (Before & After):

Raw data:

Name: Alice, Age: "Twenty-two", Gender: "f", Income: 50k
Name: Bob, Age: 35, Gender: "M", Income: null
Name: Alice, Age: "Twenty-two", Gender: "f", Income: 50k  ← duplicate

After cleaning:

Name: Alice, Age: 22, Gender: Female, Income: 50000
Name: Bob, Age: 35, Gender: Male, Income: (filled or removed)
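
One way the raw records above could be cleaned with pandas is sketched below. The helper functions `parse_age` and `parse_income` and the `WORD_AGES` lookup are hypothetical names introduced for this example; the lookup covers only the single spelled-out age that appears in the sample.

```python
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": ["Twenty-two", 35, "Twenty-two"],
    "Gender": ["f", "M", "f"],
    "Income": ["50k", None, "50k"],
})

# Hypothetical lookup for spelled-out ages; real data would need a
# fuller mapping or a number-parsing library.
WORD_AGES = {"twenty-two": 22}

def parse_age(value):
    """Turn a spelled-out age into a number; pass numbers through."""
    if isinstance(value, str):
        return WORD_AGES.get(value.strip().lower())
    return value

def parse_income(value):
    """Convert strings such as '50k' to 50000.0; keep missing as NaN."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip().lower()
    if text.endswith("k"):
        return float(text[:-1]) * 1000
    return float(text)

# Remove the duplicate Alice row, then fix each column in turn.
clean = raw.drop_duplicates().reset_index(drop=True)
clean["Age"] = clean["Age"].map(parse_age)
clean["Gender"] = clean["Gender"].str.lower().map({"f": "Female", "m": "Male"})
clean["Income"] = clean["Income"].map(parse_income)

print(clean)
```

Bob's income is left missing here. Whether to fill it (for example with `fillna`) or drop the row (with `dropna`) depends on the analysis, which is why the cleaned output above lists it as "filled or removed".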

🛠️ Tools Commonly Used:

  • Python: pandas, numpy, scikit-learn
  • R: dplyr, tidyr, caret
  • Excel: Filters, formulas, Power Query
