Data Cleaning with Pandas

Data cleaning prepares raw data for reliable analysis: handling missing values, fixing types, normalizing categories, parsing dates, removing duplicates, and dealing with outliers.

Setup

Create a small example DataFrame with messy values for IDs, names, ages, join dates, and scores.

Code

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'name': [' Alice ', 'bob', 'Bob ', None],
    'age': ['25', ' thirty ', '30', ''],
    'joined': ['2024-01-01', '01/02/2024', 'Feb 3, 2024', None],
    'score': [85, None, 92, 120]
})

Missing Values

Detect missing data, standardize blanks as NaN, and either fill or drop rows depending on how critical the column is.

Code

# Detect
print(df.isna().sum())

# Replace blanks with NaN before cleanup
df['age'] = df['age'].replace({'': np.nan})

# Fill numeric columns
df['score'] = df['score'].fillna(df['score'].median())

# Drop rows where critical fields are missing
df = df.dropna(subset=['name'])

Type Conversion

Normalize text formatting, coerce numeric fields, and parse dates that appear in different formats.

Code

# Strip whitespace and normalize casing
df['name'] = df['name'].str.strip().str.title()

# Extract the digits from messy age strings; entries with no
# digits (e.g. ' thirty ') become NaN and can be filled later
age_clean = (df['age'].str.extract(r'(\d+)')
                      .astype(float)
                      .squeeze())

df['age'] = age_clean

# Parse dates in different formats
df['joined'] = pd.to_datetime(df['joined'], errors='coerce')

Deduplication

Remove duplicate records using business keys such as id and name.

Code

# Keep first occurrence, drop duplicates on selected keys
df = df.drop_duplicates(subset=['id', 'name'])

Outliers

Clip numeric values to a plausible range to reduce the impact of extreme outliers.

Code

# Clip scores to plausible range [0, 100]
df['score'] = df['score'].clip(lower=0, upper=100)
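Clipping works when a plausible range like [0, 100] is known up front. When it is not, an interquartile-range (IQR) fence is a common alternative; the sketch below uses Tukey's 1.5 × IQR rule on illustrative data and is not part of the pipeline above:

```python
import pandas as pd

scores = pd.Series([85.0, 88.0, 90.0, 92.0, 87.0, 500.0])

# Tukey's fences: values beyond 1.5 * IQR from the quartiles
# are flagged as outliers
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(outliers.tolist())  # [500.0]
```

Unlike clipping, this flags suspicious rows for review rather than silently altering them, which is often safer when the "true" range is unknown.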

Category Normalization

Standardize label variations and convert to a categorical type for memory savings and validation.

Code

# Example: collapse spelling variations to one canonical label
mapping = {'B0b': 'Bob', 'Bobby': 'Bob'}
df['name'] = df['name'].replace(mapping)

# Categorical dtype saves memory and validates values:
# entries outside the declared categories become NaN
categories = pd.CategoricalDtype(categories=['Alice', 'Bob'], ordered=False)
df['name'] = df['name'].astype(categories)
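To see the memory claim concretely, compare deep memory usage of an object column against its categorical cast; the exact byte counts vary by platform, so only the direction matters:

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob'] * 10_000)

# object dtype stores one Python string per row; category stores
# small integer codes plus a single copy of each label
as_object = names.memory_usage(deep=True)
as_category = names.astype('category').memory_usage(deep=True)

print(as_object, as_category)
```

The savings grow with the ratio of rows to distinct labels, so categoricals pay off most on long, low-cardinality columns.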

Column Hygiene

Normalize column names and trim whitespace on all string columns to keep schemas consistent.

Code

# Rename columns consistently
df = df.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))

# Remove leading/trailing whitespace across object columns;
# .str.strip() passes NaN through, so no guard is needed
obj_cols = df.select_dtypes(include=['object']).columns
for c in obj_cols:
    df[c] = df[c].str.strip()

Validation

Add simple assertions to ensure your cleaned data meets core expectations.

Code

# Basic validation rules
assert df['id'].is_unique, 'id must be unique after dedupe'
assert df['score'].between(0, 100).all(), 'score out of range'
assert df['joined'].notna().any(), 'joined should have valid dates'

End-to-End Pipeline

Wrap the cleaning logic into a reusable function so that the same rules can be applied consistently to new data.

Code

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Missing values
    out['age'] = out['age'].replace({'': np.nan})
    out['score'] = out['score'].fillna(out['score'].median())
    out = out.dropna(subset=['name'])

    # Types and parsing
    out['name'] = out['name'].str.strip().str.title()
    out['age'] = (out['age'].astype(str)
                          .str.extract(r'(\d+)')
                          .astype(float)
                          .squeeze())
    out['joined'] = pd.to_datetime(out['joined'], errors='coerce')

    # Deduplicate and outliers
    out = out.drop_duplicates(subset=['id', 'name'])
    out['score'] = out['score'].clip(0, 100)

    # Column hygiene
    out = out.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
    for c in out.select_dtypes(include=['object']).columns:
        out[c] = out[c].str.strip()

    return out

Tips

  • Always inspect with df.info() and df.describe() first.
  • Prefer errors='coerce' when parsing so bad values become NaN/NaT you can inspect, rather than raising mid-pipeline.
  • Keep a reproducible cleaning function and unit tests for it.
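The second tip in action: a minimal sketch showing how errors='coerce' turns unparseable values into NaN, which a mask can then trace back to the offending raw entries.

```python
import pandas as pd

raw = pd.Series(['25', 'thirty', '30', ''])

# to_numeric with errors='coerce' yields NaN for anything
# that cannot be parsed, instead of raising
nums = pd.to_numeric(raw, errors='coerce')

# The NaN mask points back at the offending raw values
bad = raw[nums.isna()]
print(bad.tolist())  # ['thirty', '']
```

Logging or asserting on this mask is an easy way to catch dirty inputs early without halting the whole pipeline.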