Pandas Tutorial

Introduction to Pandas

Pandas is a powerful Python library for data manipulation and analysis. It provides data structures like DataFrame and Series that make working with structured data fast, easy, and expressive.

Key Features

  • Fast and efficient DataFrame object with indexing
  • Tools for reading and writing data between in-memory data structures and different file formats
  • Data alignment and integrated handling of missing data
  • Reshaping and pivoting of datasets
  • Label-based slicing, indexing, and subsetting
  • Group by functionality for aggregation and transformation
  • Time series functionality

Installation

pip install pandas

Import Pandas:

import pandas as pd
import numpy as np  # Often used together

Pandas Data Structures

Series

A Series is a one-dimensional labeled array:

import pandas as pd

# Create Series from list
s = pd.Series([1, 3, 5, 7, 9])
print(s)

# Create Series with custom index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)

# Create Series from dictionary
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)
print(s)

# Access elements
print(s['a'])      # 10
print(s.iloc[0])   # 10
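
Because a Series carries its index, arithmetic between two Series aligns on labels rather than positions. A minimal sketch:

# Arithmetic aligns on index labels, not positions
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2)  # 'a' and 'd' have no counterpart, so they become NaN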

DataFrame

A DataFrame is a two-dimensional labeled data structure:

# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Paris', 'London', 'Tokyo']
}
df = pd.DataFrame(data)
print(df)

# Create DataFrame from list of dictionaries
data = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
    {'Name': 'Charlie', 'Age': 35}
]
df = pd.DataFrame(data)
print(df)

# Create DataFrame from NumPy array
arr = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print(df)

Viewing and Inspecting Data

# First/last rows
print(df.head())      # First 5 rows
print(df.tail(10))    # Last 10 rows

# Basic information
print(df.info())      # Column types, non-null counts
print(df.describe())  # Statistical summary
print(df.shape)       # (rows, columns)
print(df.columns)     # Column names
print(df.index)       # Index
print(df.dtypes)      # Data types

# Unique values
print(df['column'].unique())
print(df['column'].nunique())
print(df['column'].value_counts())

Reading and Writing Data

CSV Files

# Read CSV
df = pd.read_csv('data.csv')

# Read with specific options
df = pd.read_csv('data.csv', 
                 sep=',',           # Delimiter
                 header=0,          # Row to use as column names
                 index_col=0,       # Column to use as index
                 na_values=['NA'])  # Values to recognize as NaN

# Write to CSV
df.to_csv('output.csv', index=False)

Excel Files

# Read Excel (requires an engine such as openpyxl: pip install openpyxl)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Write to Excel
df.to_excel('output.xlsx', sheet_name='Data', index=False)

# Write multiple sheets
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')

JSON Files

# Read JSON
df = pd.read_json('data.json')

# Write to JSON
df.to_json('output.json', orient='records', indent=2)

SQL Databases

import sqlite3

# Read from SQL
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)

# Write to SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)
conn.close()

Selecting Data

Selecting Columns

# Single column (returns Series)
print(df['Name'])

# Multiple columns (returns DataFrame)
print(df[['Name', 'Age']])

# Using dot notation
print(df.Name)  # Only if the name is a valid identifier and doesn't clash with DataFrame attributes

Selecting Rows

# By position (iloc)
print(df.iloc[0])        # First row
print(df.iloc[0:3])      # First 3 rows
print(df.iloc[0:3, 0:2]) # First 3 rows, first 2 columns

# By label (loc)
print(df.loc[0])         # Row with index 0
print(df.loc[0:2])       # Rows 0 to 2 (inclusive)
print(df.loc[0:2, 'Name':'Age'])  # Specific rows and columns

Boolean Indexing

# Filter rows
print(df[df['Age'] > 30])
print(df[df['City'] == 'New York'])

# Multiple conditions
print(df[(df['Age'] > 25) & (df['City'] == 'Paris')])
print(df[(df['Age'] < 28) | (df['City'] == 'Tokyo')])

# Using isin()
cities = ['New York', 'London']
print(df[df['City'].isin(cities)])
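
The same filters can be written with DataFrame.query, which takes the condition as a string and is often easier to read:

# query() equivalents of the filters above
print(df.query('Age > 25 and City == "Paris"'))
print(df.query('City in @cities'))  # @ references a local Python variable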

Data Manipulation

Adding/Removing Columns

# Add new column
df['Country'] = ['USA', 'France', 'UK', 'Japan']
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Adult')

# Remove columns
df = df.drop('Country', axis=1)
df = df.drop(['City', 'Age_Group'], axis=1)

Adding/Removing Rows

# Add row
new_row = {'Name': 'Eve', 'Age': 27, 'City': 'Berlin'}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

# Remove rows
df = df.drop(0, axis=0)  # Drop row with index 0
df = df[df['Age'] > 25]  # Keep only rows where Age > 25

Modifying Values

# Modify single value
df.loc[0, 'Age'] = 26

# Modify column
df['Age'] = df['Age'] + 1

# Conditional modification
df.loc[df['Age'] > 30, 'Status'] = 'Senior'
df.loc[df['Age'] <= 30, 'Status'] = 'Junior'
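
For a simple two-way split like this, numpy.where builds the whole column in one vectorized step:

import numpy as np

# Vectorized equivalent of the two .loc assignments above
df['Status'] = np.where(df['Age'] > 30, 'Senior', 'Junior')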

Handling Missing Data

# Check for missing values
print(df.isnull())
print(df.isnull().sum())  # Count per column
print(df.isna())          # Same as isnull()

# Drop missing values
df_clean = df.dropna()              # Drop rows with any NaN
df_clean = df.dropna(axis=1)        # Drop columns with any NaN
df_clean = df.dropna(thresh=2)      # Keep rows with at least 2 non-NaN

# Fill missing values
df_filled = df.fillna(0)                           # Fill with 0
df_filled = df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their mean
df_filled = df.ffill()                             # Forward fill (fillna(method=...) is deprecated)
df_filled = df.bfill()                             # Backward fill

# Fill specific columns (assignment avoids the deprecated inplace pattern)
df['Age'] = df['Age'].fillna(df['Age'].mean())
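
For numeric data with a meaningful order (time series, sensor readings), interpolation estimates missing values from their neighbors. A minimal sketch:

# Linear interpolation between surrounding non-NaN values
s = pd.Series([1.0, None, 3.0, None, 5.0])
print(s.interpolate())  # 1.0, 2.0, 3.0, 4.0, 5.0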

GroupBy Operations

# Group by single column
grouped = df.groupby('City')
print(grouped.mean(numeric_only=True))  # numeric_only avoids errors on string columns
print(grouped.sum(numeric_only=True))
print(grouped.count())

# Group by multiple columns
grouped = df.groupby(['City', 'Status'])
print(grouped.mean(numeric_only=True))

# Aggregate with different functions
print(df.groupby('City').agg({
    'Age': ['mean', 'min', 'max'],
    'Name': 'count'
}))

# Apply custom function
def age_range(x):
    return x.max() - x.min()

print(df.groupby('City')['Age'].apply(age_range))
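
Unlike apply, transform returns a result aligned to the original rows, which makes it easy to add a group-level statistic as a new column:

# Broadcast each city's mean age back onto its rows
df['City_Avg_Age'] = df.groupby('City')['Age'].transform('mean')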

Sorting Data

# Sort by column
df_sorted = df.sort_values('Age')
df_sorted = df.sort_values('Age', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['City', 'Age'])

# Sort by index
df_sorted = df.sort_index()

# In-place sorting
df.sort_values('Age', inplace=True)

Merging and Joining

Concatenation

# Vertical concatenation (stack rows)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)

# Horizontal concatenation (stack columns)
result = pd.concat([df1, df2], axis=1)

Merge (SQL-style joins)

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 90, 95]})

# Inner join
merged = pd.merge(df1, df2, on='ID', how='inner')

# Left join
merged = pd.merge(df1, df2, on='ID', how='left')

# Right join
merged = pd.merge(df1, df2, on='ID', how='right')

# Outer join
merged = pd.merge(df1, df2, on='ID', how='outer')
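
When both frames share non-key column names, merge disambiguates them with suffixes ('_x' and '_y' by default). A small sketch, reusing df2 from above and a hypothetical df3:

df3 = pd.DataFrame({'ID': [1, 2], 'Score': [70, 75]})
merged = pd.merge(df2, df3, on='ID', suffixes=('_math', '_science'))
print(merged)  # Columns: ID, Score_math, Score_science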

Pivot Tables

# Create sample data
data = {
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 160],
    'Quantity': [5, 8, 6, 9]
}
df = pd.DataFrame(data)

# Create pivot table
pivot = df.pivot_table(
    values='Sales',
    index='Date',
    columns='Product',
    aggfunc='sum'
)
print(pivot)

# Multiple aggregations
pivot = df.pivot_table(
    values=['Sales', 'Quantity'],
    index='Date',
    columns='Product',
    aggfunc={'Sales': 'sum', 'Quantity': 'mean'}
)
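
melt is the rough inverse of a pivot: it unpivots wide columns back into long (variable, value) rows. Applied to the sample data above:

# Unpivot Sales and Quantity into long form
long_df = df.melt(id_vars=['Date', 'Product'],
                  value_vars=['Sales', 'Quantity'],
                  var_name='Metric', value_name='Value')
print(long_df)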

String Operations

# String methods
df['Name_Lower'] = df['Name'].str.lower()
df['Name_Upper'] = df['Name'].str.upper()
df['Name_Length'] = df['Name'].str.len()

# String contains
df_filtered = df[df['Name'].str.contains('Alice')]

# String replacement
df['Name'] = df['Name'].str.replace('Alice', 'Alicia')

# Split strings
df[['First', 'Last']] = df['FullName'].str.split(' ', expand=True)

# Strip whitespace
df['Name'] = df['Name'].str.strip()
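
Most str methods also accept regular expressions; str.extract pulls a capture group into a new column. A small sketch, assuming a hypothetical 'Email' column:

# Capture the part before the @ (hypothetical 'Email' column)
df['Username'] = df['Email'].str.extract(r'^([^@]+)@', expand=False)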

Date and Time Operations

# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Extract components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.day_name()

# Date arithmetic
df['NextWeek'] = df['Date'] + pd.Timedelta(days=7)

# Date range
dates = pd.date_range('2024-01-01', periods=10, freq='D')

# Resample time series
df.set_index('Date').resample('ME').sum()  # Monthly sum ('ME' = month end; use 'M' on pandas < 2.2)
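
With a DatetimeIndex in place, rolling windows give moving statistics. A minimal sketch, assuming a numeric 'Sales' column:

# 7-day moving average (hypothetical 'Sales' column)
ts = df.set_index('Date')['Sales']
print(ts.rolling(window=7).mean())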

Apply and Map Functions

# Apply function to column
df['Age_Squared'] = df['Age'].apply(lambda x: x ** 2)

# Apply function to DataFrame
def calculate_category(row):
    if row['Age'] < 30:
        return 'Young'
    else:
        return 'Adult'

df['Category'] = df.apply(calculate_category, axis=1)

# Map values
status_map = {25: 'Junior', 30: 'Mid', 35: 'Senior'}
df['Level'] = df['Age'].map(status_map)

# Replace values
df['City'] = df['City'].replace({'New York': 'NYC', 'Los Angeles': 'LA'})
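
For binning a numeric column into labeled ranges, pd.cut is a vectorized alternative to the row-wise apply used for 'Category' above:

# Bin ages into labeled intervals without a Python-level loop
df['Category'] = pd.cut(df['Age'], bins=[0, 30, 120],
                        labels=['Young', 'Adult'], right=False)  # [0, 30) -> Young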

Practical Example: Sales Analysis

# Create sample sales data
data = {
    'Date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'Product': np.random.choice(['A', 'B', 'C'], 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Sales': np.random.randint(100, 1000, 100),
    'Quantity': np.random.randint(1, 20, 100)
}
df = pd.DataFrame(data)

# Analysis
# 1. Total sales by product
product_sales = df.groupby('Product')['Sales'].sum()
print(product_sales)

# 2. Average quantity by region
region_avg = df.groupby('Region')['Quantity'].mean()
print(region_avg)

# 3. Monthly sales trend
df['Month'] = df['Date'].dt.to_period('M')
monthly_sales = df.groupby('Month')['Sales'].sum()
print(monthly_sales)

# 4. Top 10 sales days
top_days = df.nlargest(10, 'Sales')[['Date', 'Product', 'Sales']]
print(top_days)

# 5. Sales summary by product and region
summary = df.pivot_table(
    values='Sales',
    index='Product',
    columns='Region',
    aggfunc='sum',
    fill_value=0
)
print(summary)

Best Practices

  • Use vectorized operations: Avoid loops; use built-in Pandas methods.
  • Chain operations: Use method chaining for cleaner code (see the sketch after this list).
  • Set appropriate data types: Use category dtype for categorical data.
  • Use inplace sparingly: Prefer assignment for clarity.
  • Handle missing data explicitly: Don't ignore NaN values.
  • Index wisely: Set meaningful indexes for faster lookups.
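
A small chaining sketch, reusing the sales DataFrame from the practical example above:

# One readable pipeline: filter, group, aggregate, sort
top_regions = (
    df.query('Sales > 500')
      .groupby('Region', as_index=False)['Sales']
      .sum()
      .sort_values('Sales', ascending=False)
)
print(top_regions)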

Summary

Pandas is the go-to library for data manipulation in Python. Its powerful DataFrame structure and rich functionality make it essential for data cleaning, transformation, and analysis. Master Pandas to work efficiently with structured data.