Pandas for Data Science
Pandas provides high-performance tools for loading, cleaning, transforming, aggregating, and joining tabular datasets using DataFrames. It is the foundation of most data science workflows in Python.
Why Pandas?
Pandas is essential because it provides:
- Easy loading of CSV, Excel, SQL, JSON
- Powerful filtering and indexing
- Groupby operations for summaries
- Merging & joining like SQL
- Handling missing values
- Reshaping (pivot, melt)
- Time-series and rolling operations
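Reshaping and rolling operations appear in the list above but not in the sections that follow, so here is a minimal sketch of both. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical long-form data: one row per (month, region) pair
long_df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "region": ["North", "South", "North", "South"],
    "amount": [100, 80, 120, 90],
})

# pivot: one row per month, one column per region
wide = long_df.pivot(index="month", columns="region", values="amount")

# melt: turn the wide table back into long form
back = wide.reset_index().melt(
    id_vars="month", var_name="region", value_name="amount"
)

# rolling: 2-day moving average over a small, made-up daily series
daily = pd.Series(
    [10, 20, 30, 40],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
rolling_mean = daily.rolling(window=2).mean()

print(wide)
print(back)
print(rolling_mean)
```

The first value of rolling_mean is NaN because a 2-day window has no complete history on the first day.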
Load & Inspect Data
Use Pandas to load files and quickly inspect their structure.
Example
import pandas as pd
df = pd.read_csv('sales.csv')
print(df.head()) # preview first 5 rows
df.info() # prints column types + non-null counts (returns None, so no print() needed)
What this shows:
- head() quickly previews your dataset.
- info() reveals dtypes and missing values.
- Useful before any cleaning or modeling.
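For a version that runs without a file on disk, the sketch below reads a small in-memory CSV. The sample rows and the date column are assumptions standing in for sales.csv, whose real contents are not shown here.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for sales.csv
csv_text = """region,amount,date
North,1200,2024-01-01
South,,2024-01-02
East,450,2024-01-03
"""

# parse_dates converts the named column to a datetime dtype on load
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.shape)         # (rows, columns)
print(df.dtypes)        # dtype of each column
print(df.isna().sum())  # count of missing values per column
```

df.isna().sum() is a compact complement to info() when the only question is how many values are missing in each column.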
Select, Group & Join
These three operations form the core of data analysis: filtering rows, summarizing data, and combining datasets.
Examples
# filtering high-value transactions
high = df[df['amount'] > 1000]
# groupby: total sales by region
agg = df.groupby('region')['amount'].sum().reset_index()
# join with metadata
meta = pd.read_csv('regions.csv')
merged = df.merge(meta, on='region', how='left')
print(merged.head())
Explanation:
- Filtering: Select rows where amount > 1000.
- Groupby: Summarize sales by region, a common pattern in dashboards.
- Merging: Join sales.csv with regions.csv (SQL-style join: left/inner/right/outer).
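The same three operations can be tried on small in-memory frames. This is a sketch with invented data: the manager column and every value below are hypothetical, not taken from sales.csv or regions.csv.

```python
import pandas as pd

# Hypothetical sales rows and region metadata
sales = pd.DataFrame({
    "region": ["North", "South", "North", "West"],
    "amount": [1500, 300, 700, 2500],
})
meta = pd.DataFrame({
    "region": ["North", "South", "East"],
    "manager": ["Ana", "Ben", "Cho"],
})

# Filtering: a boolean mask keeps rows where the condition is True
high = sales[sales["amount"] > 1000]

# Groupby: named aggregation computes several summaries per region
summary = sales.groupby("region", as_index=False).agg(
    total=("amount", "sum"),
    orders=("amount", "count"),
)

# Merging: indicator=True adds a _merge column that records
# whether each row found a match in the right-hand table
merged = sales.merge(meta, on="region", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(high)
print(summary)
print(unmatched)
```

With how='left', every sales row is kept. Rows whose region has no entry in the metadata (here, West) get NaN in the metadata columns, and the indicator column makes them easy to find.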
Handling Missing Values
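Handle missing values before training ML models or running analysis: many estimators reject NaN inputs, and Pandas aggregations skip NaN by default, which can quietly change results.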
Example
df['amount'] = df['amount'].fillna(df['amount'].median())
Alternatives:
- df.dropna(): remove rows that contain missing values
- df.fillna(0): replace missing values with a default
- df.ffill(): forward fill for time-series (fillna(method='ffill') is deprecated since pandas 2.1)
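To see how these strategies differ, the sketch below applies three of them to the same small series. The dates and amounts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily amounts with two gaps
ts = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=5, freq="D"),
    "amount": [100.0, np.nan, 300.0, np.nan, 1000.0],
})

# Median fill: robust to the outlier (1000) that would skew a mean fill
median_filled = ts["amount"].fillna(ts["amount"].median())

# Forward fill: carry the last observed value forward in time
forward_filled = ts["amount"].ffill()

# Drop: keep only rows where amount is present
dropped = ts.dropna(subset=["amount"])

print(median_filled.tolist())
print(forward_filled.tolist())
print(len(dropped))
```

Which strategy fits depends on the data: forward fill assumes the rows are ordered in time, while dropping rows shrinks the dataset and can bias it if values are not missing at random.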
Real-World Uses of Pandas in Data Science
- Sales forecasting (cleaning, aggregating, merging datasets)
- Customer segmentation (groupby, statistics)
- Analysis of product performance (join with metadata)
- Preprocessing for machine learning models
- ETL pipelines (Load → Clean → Transform → Export)
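The pipeline shape in the last item can be sketched as a single function. run_pipeline and its column names are hypothetical, and in-memory buffers stand in for real files so the example is self-contained.

```python
import io
import pandas as pd

def run_pipeline(source, sink):
    """Hypothetical pipeline: load, clean, transform, export."""
    df = pd.read_csv(source)                                        # Load
    df["amount"] = df["amount"].fillna(df["amount"].median())       # Clean
    totals = df.groupby("region", as_index=False)["amount"].sum()   # Transform
    totals.to_csv(sink, index=False)                                # Export
    return totals

# In-memory stand-ins for an input file and an output file
raw = io.StringIO("region,amount\nNorth,100\nNorth,\nSouth,300\n")
out = io.StringIO()

totals = run_pipeline(raw, out)
print(out.getvalue())
```

Because read_csv and to_csv accept both file paths and file-like objects, the same function works unchanged with real filenames.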
Summary
Pandas covers the core steps of most data science workflows in Python: loading, cleaning, grouping, joining, and preparing data for ML. Its expressive syntax keeps complex analytics concise.