Data Cleaning with Pandas
Data cleaning prepares raw data for reliable analysis: handling missing values, fixing types, normalizing categories, parsing dates, removing duplicates, and dealing with outliers.
Setup
Create a small example DataFrame with messy values for IDs, names, ages, join dates, and scores.
Code
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'name': [' Alice ', 'bob', 'Bob ', None],
    'age': ['25', ' thirty ', '30', ''],
    'joined': ['2024-01-01', '01/02/2024', 'Feb 3, 2024', None],
    'score': [85, None, 92, 120]
})
Missing Values
Detect missing data, standardize blanks as NaN, and either fill or drop rows depending on how critical the column is.
Code
# Detect
print(df.isna().sum())
# Replace blanks with NaN before cleanup
df['age'] = df['age'].replace({'': np.nan})
# Fill numeric columns
df['score'] = df['score'].fillna(df['score'].median())
# Drop rows where critical fields are missing
df = df.dropna(subset=['name'])
Type Conversion
Normalize text formatting, coerce numeric fields, and parse dates that appear in different formats.
Code
# Clean whitespace and uniform case
df['name'] = df['name'].str.strip().str.title()
# Extract the digits from messy age strings;
# non-matching values become NaN and can be filled later
age_clean = (df['age'].str.extract(r'(\d+)')
             .astype(float)
             .squeeze())
df['age'] = age_clean
# Parse dates in different formats
# format='mixed' (pandas >= 2.0) parses each element independently;
# without it, values that don't match the inferred format become NaT
df['joined'] = pd.to_datetime(df['joined'], format='mixed', errors='coerce')
Deduplication
Remove duplicate records using business keys such as id and name.
Code
# Keep the first occurrence, dropping duplicates on selected keys
df = df.drop_duplicates(subset=['id', 'name'])
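Before dropping anything, it can help to review which records are affected. A minimal sketch, using a small stand-in frame rather than the example above:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'name': ['Alice', 'Bob', 'Bob', 'Dana'],
})

# keep=False flags every member of a duplicate group,
# which is handy for inspecting records before any are dropped
dupes = df[df.duplicated(subset=['id', 'name'], keep=False)]

# keep='last' retains the most recent occurrence instead of the first
deduped = df.drop_duplicates(subset=['id', 'name'], keep='last')
```

Reviewing `dupes` first makes it easier to confirm that the chosen keys really identify the same record.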
Outliers
Clip numeric values to a plausible range to reduce the impact of extreme outliers.
Code
# Clip scores to the plausible range [0, 100]
df['score'] = df['score'].clip(lower=0, upper=100)
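Clipping caps values while keeping every row. An alternative sketch flags outliers with the interquartile-range (IQR) rule instead of a fixed range; the 1.5 × IQR cutoff is a common convention, not a pandas default:

```python
import pandas as pd

scores = pd.Series([85.0, 88.5, 92.0, 120.0])

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
outlier_mask = (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)
outliers = scores[outlier_mask]
```

Flagging rather than clipping preserves the original values, so the decision to drop, cap, or keep them can be made explicitly.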
Category Normalization
Standardize label variations and convert to a categorical type for memory savings and validation.
Code
# Example: map typo variants onto a canonical label
mapping = {'B0b': 'Bob'}
df['name'] = df['name'].replace(mapping)
# Categorical dtype saves memory; values outside the declared
# categories become NaN, which surfaces unexpected labels
categories = pd.CategoricalDtype(categories=['Alice', 'Bob'], ordered=False)
df['name'] = df['name'].astype(categories)
Column Hygiene
Normalize column names and trim whitespace on all string columns to keep schemas consistent.
Code
# Rename columns consistently
df = df.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
# Remove leading/trailing whitespace across object columns
obj_cols = df.select_dtypes(include=['object']).columns
for c in obj_cols:
    if df[c].notna().any():
        df[c] = df[c].str.strip()
Validation
Add simple assertions to ensure your cleaned data meets core expectations.
Code
# Basic validation rules
assert df['id'].is_unique, 'id must be unique after dedupe'
assert df['score'].between(0, 100).all(), 'score out of range'
assert df['joined'].notna().any(), 'joined should have valid dates'
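Bare asserts stop at the first failure. A hypothetical `validate` helper (the name and rules here are illustrative) can collect every problem before raising:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Collect all failures instead of stopping at the first assert
    problems = []
    if not df['id'].is_unique:
        problems.append('id must be unique after dedupe')
    if not df['score'].between(0, 100).all():
        problems.append('score out of range')
    if not df['joined'].notna().any():
        problems.append('joined should have valid dates')
    if problems:
        raise ValueError('; '.join(problems))
    return df

good = pd.DataFrame({
    'id': [1, 2],
    'score': [85.0, 92.0],
    'joined': pd.to_datetime(['2024-01-01', '2024-02-03']),
})
validate(good)  # passes silently and returns the frame
```

Returning the frame lets the helper slot into a method-chained or pipelined cleaning flow.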
End-to-End Pipeline
Wrap the cleaning logic into a reusable function so that the same rules can be applied consistently to new data.
Code
def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Missing values
    out['age'] = out['age'].replace({'': np.nan})
    out['score'] = out['score'].fillna(out['score'].median())
    out = out.dropna(subset=['name'])
    # Types and parsing
    out['name'] = out['name'].str.strip().str.title()
    out['age'] = (out['age'].astype(str)
                  .str.extract(r'(\d+)')
                  .astype(float)
                  .squeeze())
    # format='mixed' needs pandas >= 2.0
    out['joined'] = pd.to_datetime(out['joined'], format='mixed', errors='coerce')
    # Deduplicate and outliers
    out = out.drop_duplicates(subset=['id', 'name'])
    out['score'] = out['score'].clip(0, 100)
    # Column hygiene
    out = out.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
    for c in out.select_dtypes(include=['object']).columns:
        out[c] = out[c].str.strip()
    return out
Tips
- Always inspect with df.info() and df.describe() first.
- Prefer errors='coerce' when parsing to surface bad values.
- Keep a reproducible cleaning function and unit tests for it.
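Following the last tip, here is a sketch of a unit test for a cleaning function; the trimmed-down `clean` below is a stand-in for the full pipeline above, not a replacement for it.

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Trimmed-down stand-in for the full pipeline
    out = df.copy()
    out['age'] = out['age'].replace({'': np.nan})
    out = out.dropna(subset=['name'])
    out['name'] = out['name'].str.strip().str.title()
    out = out.drop_duplicates(subset=['id', 'name'])
    return out

def test_clean_drops_missing_names_and_dupes():
    raw = pd.DataFrame({
        'id': [1, 2, 2, 4],
        'name': [' Alice ', 'bob', 'Bob ', None],
        'age': ['25', '30', '30', ''],
    })
    out = clean(raw)
    assert out['name'].tolist() == ['Alice', 'Bob']
    assert out['id'].is_unique

test_clean_drops_missing_names_and_dupes()
```

Pinning expected outputs on a tiny fixture like this catches regressions when cleaning rules change.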