Data Preprocessing
Data preprocessing converts raw data into a clean, usable form for machine learning. It includes splitting data into training and test sets, scaling numerical features, encoding categorical features, and building automated pipelines.
Why Data Preprocessing?
Most ML models expect numerical, scaled, and well-structured features. Preprocessing ensures:
- Numerical data is standardized for optimal model performance.
- Categorical data is encoded into machine-readable vectors.
- Training and testing data are separated to avoid data leakage.
- A single Pipeline automates all steps, preventing manual mistakes.
Install Dependencies
Command
pip install scikit-learn pandas
Full Preprocessing + Model Pipeline
Below is a complete example with:
- Standard scaling for numeric data
- One-Hot encoding for categorical data
- ColumnTransformer for different column types
- Pipeline chaining preprocessing + model
Example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
df = pd.DataFrame({
'age': [22, 35, 58, 44],
'city': ['NY', 'SF', 'LA', 'NY'],
'label': [0, 1, 1, 0],
})
X = df[['age', 'city']]
y = df['label']
num = ['age']
cat = ['city']
pre = ColumnTransformer([
('num', StandardScaler(), num),
('cat', OneHotEncoder(handle_unknown='ignore'), cat),
])
clf = Pipeline([
('pre', pre),
('model', LogisticRegression())
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
clf.fit(X_train, y_train)
print('Score:', clf.score(X_test, y_test))
Output (example):
Score: 1.0
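Because the encoder was created with handle_unknown='ignore', the fitted pipeline can also score rows containing cities it never saw during training. A minimal usage sketch continuing from the example above (the new rows are made-up values for illustration):
# Predict on fresh rows; 'Chicago' was not in the training data, so
# OneHotEncoder(handle_unknown='ignore') encodes it as all zeros instead of failing.
new_rows = pd.DataFrame({
    'age': [30, 61],
    'city': ['Chicago', 'LA'],
})
print(clf.predict(new_rows))
# Inspect the feature names produced by the ColumnTransformer
# (requires scikit-learn >= 1.0 for get_feature_names_out).
print(clf.named_steps['pre'].get_feature_names_out())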
More Things You Can Add in Preprocessing
- Min-Max Scaling for algorithms sensitive to scale.
- Ordinal Encoding for categories with a natural order (scikit-learn's OrdinalEncoder; LabelEncoder is intended for target labels).
- Polynomial Features for capturing non-linear patterns.
- Imputers to fill missing values automatically.
- Feature Selection to reduce noise and improve accuracy (several of these options are combined in the sketch after this list).
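A minimal sketch of how several of these options could slot into the same ColumnTransformer. SimpleImputer, MinMaxScaler, PolynomialFeatures, OrdinalEncoder, and SelectKBest are standard scikit-learn classes, but the column names ('age', 'income', 'size'), category order, and parameter values below are illustrative assumptions, not part of the example above:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Numeric columns: fill missing values, rescale to [0, 1], add squared/interaction terms.
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])

# Ordinal column: map ordered categories to integers (order given explicitly).
ordinal_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OrdinalEncoder(categories=[['small', 'medium', 'large']])),
])

pre = ColumnTransformer([
    ('num', numeric_steps, ['age', 'income']),  # hypothetical numeric columns
    ('ord', ordinal_steps, ['size']),           # hypothetical ordinal column
])

clf = Pipeline([
    ('pre', pre),
    ('select', SelectKBest(f_classif, k=5)),    # keep the 5 most informative features
    ('model', LogisticRegression()),
])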
Summary
Data preprocessing is one of the most important steps in machine learning. Scikit-learn Pipelines ensure that all transformations are applied consistently during both training and testing, avoiding data leakage.