Data Preprocessing
Data preprocessing converts raw data into a clean, usable form for machine learning. It includes splitting data into training and test sets, scaling numerical features, encoding categorical features, and building automated pipelines.
Why Data Preprocessing?
Most ML models expect numerical, scaled, and well-structured features. Preprocessing ensures:
- Numerical data is standardized for optimal model performance.
- Categorical data is encoded into machine-readable vectors.
- Training and testing data are separated to avoid data leakage.
- A single Pipeline automates all steps, preventing manual mistakes.
Install Dependencies
Command
pip install scikit-learn pandas
Full Preprocessing + Model Pipeline
Below is a complete example with:
- Standard scaling for numeric data
- One-Hot encoding for categorical data
- ColumnTransformer for different column types
- Pipeline chaining preprocessing + model
Example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
df = pd.DataFrame({
'age': [22, 35, 58, 44],
'city': ['NY', 'SF', 'LA', 'NY'],
'label': [0, 1, 1, 0],
})
X = df[['age', 'city']]
y = df['label']
num = ['age']
cat = ['city']
pre = ColumnTransformer([
('num', StandardScaler(), num),
('cat', OneHotEncoder(handle_unknown='ignore'), cat),
])
clf = Pipeline([
('pre', pre),
('model', LogisticRegression())
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
clf.fit(X_train, y_train)
print('Score:', clf.score(X_test, y_test))
Output (example):
Score: 1.0
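Because the encoder was created with handle_unknown='ignore', the fitted pipeline can also score rows containing cities it never saw during training. A minimal usage sketch continuing from the example above (the new rows are made-up values for illustration):
# Predict on fresh rows; 'Chicago' was not in the training data, so
# OneHotEncoder(handle_unknown='ignore') encodes it as all zeros instead of failing.
new_rows = pd.DataFrame({
    'age': [30, 61],
    'city': ['Chicago', 'LA'],
})
print(clf.predict(new_rows))
# Inspect the feature names produced by the ColumnTransformer
# (requires scikit-learn >= 1.0 for get_feature_names_out).
print(clf.named_steps['pre'].get_feature_names_out())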
More Things You Can Add in Preprocessing
- Min-Max Scaling for algorithms sensitive to scale.
- Ordinal Encoding for categories with a natural order (scikit-learn's OrdinalEncoder; LabelEncoder is intended for target labels).
- Polynomial Features for capturing non-linear patterns.
- Imputers to fill missing values automatically.
- Feature Selection to reduce noise and improve accuracy (several of these options are combined in the sketch after this list).
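A minimal sketch of how several of these options could slot into the same ColumnTransformer. SimpleImputer, MinMaxScaler, PolynomialFeatures, OrdinalEncoder, and SelectKBest are standard scikit-learn classes, but the column names ('age', 'income', 'size'), category order, and parameter values below are illustrative assumptions, not part of the example above:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Numeric columns: fill missing values, rescale to [0, 1], add squared/interaction terms.
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])

# Ordinal column: map ordered categories to integers (order given explicitly).
ordinal_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OrdinalEncoder(categories=[['small', 'medium', 'large']])),
])

pre = ColumnTransformer([
    ('num', numeric_steps, ['age', 'income']),  # hypothetical numeric columns
    ('ord', ordinal_steps, ['size']),           # hypothetical ordinal column
])

clf = Pipeline([
    ('pre', pre),
    ('select', SelectKBest(f_classif, k=5)),    # keep the 5 most informative features
    ('model', LogisticRegression()),
])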
Summary
Data preprocessing is one of the most important steps in machine learning. Scikit-learn Pipelines ensure that all transformations are applied consistently during both training and testing, avoiding data leakage.