Working with large datasets in Pandas can feel like fighting an uphill battle. Your laptop fans spin louder, the progress bar crawls, and sometimes your kernel just dies. But you do not need a powerful workstation to handle millions of rows. You need smarter code.
This guide covers practical techniques to reduce Pandas memory usage by 85% or more. You will learn exactly which operations slow you down, how to fix them, and when to consider alternatives like Polars for large-scale workloads.
The Memory Problem in Pandas
Pandas defaults to using high-precision numeric types. An integer column with values 0-255 still gets int64, wasting 7 bytes per value. String columns use object types that store pointers, not the actual strings. For a 5GB CSV file, Pandas might consume 7.8GB of RAM during loading.
The solution is not buying more RAM. The solution is understanding how Pandas stores data and choosing the right types.
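To see the gap concretely, here is a minimal sketch with a made-up column of one million values in the 0-255 range:

import numpy as np
import pandas as pd

# Pandas infers int64 for these small integers on most platforms
s = pd.Series(np.random.randint(0, 256, size=1_000_000))
print(s.dtype, s.memory_usage(deep=True))         # int64, ~8,000,000 bytes

# The same values stored as uint8 need one byte each
print(s.astype('uint8').memory_usage(deep=True))  # ~1,000,000 bytes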
Understanding Pandas Data Types
Default Memory Waste
import pandas as pd
# Load a typical CSV
df = pd.read_csv("data.csv")
# Check memory usage by column
print(df.memory_usage(deep=True))
print(f"\nTotal memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
Output might show:
id 8000000 bytes
name 16000000 bytes
category 24000000 bytes
value 64000000 bytes
Total memory: 106.81 MB
The category column with 100 unique values uses object type, which stores Python strings. This wastes memory and slows operations.
Choosing Efficient Data Types
Reduce memory by matching types to your actual data:
# Downcast int64 to int8 when values fit in -128 to 127
df['small_int'] = df['small_int'].astype('int8')
# Use float32 when you do not need full float64 precision
df['price'] = df['price'].astype('float32')
# Convert repeated string values to category
df['category'] = df['category'].astype('category')
Memory reduction examples:
- int8 vs int64: 87.5% reduction for values -128 to 127
- float32 vs float64: 50% reduction for most financial data
- category vs object: 80-90% reduction for columns with few unique values
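To check these figures on your own data, compare a column's memory_usage before and after the cast. A minimal sketch, using the category column from the example above:

# Per-column memory before and after converting to category
before = df['category'].memory_usage(deep=True)
after = df['category'].astype('category').memory_usage(deep=True)
print(f"category: {before:,} -> {after:,} bytes ({1 - after / before:.0%} saved)")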
Automatic Memory Optimization
Let Pandas figure out the best types:
def reduce_memory(df):
    for col in df.select_dtypes(include=['object']):
        num_unique = df[col].nunique()
        num_total = len(df)
        if num_unique / num_total < 0.5:  # Less than 50% unique
            df[col] = df[col].astype('category')
    for col in df.select_dtypes(include=['integer']):
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float']):
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df
df = reduce_memory(df)
This function automatically:
- Converts low-cardinality strings to category type
- Downcasts integers to smallest suitable type
- Downcasts floats to float32 where precision allows
Vectorization Over Iteration
Row-by-row operations are the biggest performance killer in Pandas. The difference between apply and vectorized operations can be 100x or more.
The iterrows Trap
# Slow: iterating row by row
for index, row in df.iterrows():
df.at[index, 'result'] = row['value'] * 2
Time: 237.3 seconds for 1M rows
Vectorized Solution
# Fast: vectorized operation
df['result'] = df['value'] * 2
Time: 1 second for 1M rows
Speedup: 237x
Replace Apply with Vectorization
# Slow: apply with custom function
def complex_calc(row):
return row['a'] + row['b'] - row['c']
df['result'] = df.apply(complex_calc, axis=1)
Instead, use direct operations:
# Fast: direct vectorized math
df['result'] = df['a'] + df['b'] - df['c']
The vectorized version delegates work to NumPy, which uses pre-compiled C code with SIMD instructions. Python loops cannot compete.
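Conditional logic is the usual reason people reach for apply, but it vectorizes too. A sketch using NumPy's where and select on the same columns (the flag and bucket output names are made up for illustration):

import numpy as np

# Two-way condition: where(condition, value_if_true, value_if_false)
df['flag'] = np.where(df['a'] > df['b'], 'high', 'low')

# Multi-way conditions: select takes parallel lists of masks and outcomes
conditions = [df['c'] < 0, df['c'] < 100]
choices = ['negative', 'small']
df['bucket'] = np.select(conditions, choices, default='large')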
Using Query for Complex Filters
# Chained filters are slow
df[(df['date'] > '2024-01-01') & (df['category'] == 'A') & (df['value'] > 100)]
# Query is faster for complex conditions
df.query("date > '2024-01-01' and category == 'A' and value > 100")
When the numexpr package is installed, query() evaluates the whole expression in a single pass, avoiding the temporary boolean arrays that chained masks allocate.
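query() can also reference local Python variables with the @ prefix, which keeps thresholds out of the string. A small sketch with made-up cutoff values:

min_value = 100
start = pd.Timestamp('2024-01-01')

# @name pulls the value from the surrounding Python scope
filtered = df.query("date > @start and category == 'A' and value > @min_value")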
Chunk Processing for Huge Files
When files exceed available memory, process in chunks:
chunk_size = 100000
total_rows = 0
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    # Process each chunk
    chunk['processed'] = chunk['value'] * 2
    # Append to the output; write the header only for the first chunk
    chunk.to_csv("output.csv", mode='a', header=not total_rows, index=False)
    total_rows += len(chunk)

print(f"Processed {total_rows} rows")
Key parameters:
- chunksize: Rows per chunk (100K-500K works well)
- usecols: Load only needed columns
- dtype: Specify types during import to avoid conversion
# Optimal import with type specifications
df = pd.read_csv(
"data.csv",
usecols=['id', 'date', 'category', 'value'],
dtype={
'id': 'int32',
'category': 'category',
'value': 'float32'
},
parse_dates=['date']
)
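Chunking also works when the result itself is an aggregation: compute partial results per chunk and combine them at the end. A sketch, assuming the same large_file.csv with category and value columns:

import pandas as pd

partials = []
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    # Partial sum and count per category for this chunk only
    partials.append(chunk.groupby('category')['value'].agg(['sum', 'count']))

# Stack the partial results, re-group by category, and derive the global mean
combined = pd.concat(partials).groupby(level=0).sum()
combined['mean'] = combined['sum'] / combined['count']
print(combined)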
Indexing for Fast Lookups
Proper indexing accelerates filtering and joins by 10-100x:
# Set index on frequently-queried columns
df.set_index('date', inplace=True)
# Query by index is much faster
df.loc['2024-01-01':'2024-01-31']
Multi-Index for Complex Queries
# Create multi-index for common query patterns
df = df.set_index(['category', 'date'])
df.sort_index(inplace=True)
# Fast lookup on both dimensions
df.loc[('Electronics', '2024-01-01'):('Electronics', '2024-01-31')]
Avoid Setting Index Inside Loops
Setting an index requires copying data. Do it once, then query:
# Good: set index once
df.set_index('id', inplace=True)
for value in values:
result = df.loc[value] # Fast index lookup
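If all the lookup ids are known up front, a single .loc call over the whole list avoids the Python loop entirely. A sketch, assuming every id in values exists in the index:

# One indexed lookup for the entire list of ids
results = df.loc[values]
# df.reindex(values) tolerates ids missing from the index (returns NaN rows)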
GroupBy Optimization
Multiple aggregations should use a single groupby:
# Bad: multiple groupby passes
avg = df.groupby('category')['value'].mean()
max_val = df.groupby('category')['value'].max()
count = df.groupby('category')['value'].count()
# Good: single groupby with multiple aggregations
result = df.groupby('category')['value'].agg(['mean', 'max', 'count'])
The optimized version makes one pass through the data instead of three.
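To control the output column names while keeping the single pass, pandas' named aggregation works as well:

# Same single pass, with explicit names for the output columns
result = df.groupby('category').agg(
    avg_value=('value', 'mean'),
    max_value=('value', 'max'),
    n_rows=('value', 'count'),
)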
When to Consider Polars
For datasets larger than RAM or requiring multi-threaded processing, Polars delivers dramatically better performance.
Performance Comparison
| Metric | Pandas | Polars | Improvement |
|---|---|---|---|
| 10GB dataset processing | 365.71s | 3.89s | 94x |
| 25M row financial data | 187.38s | 11.66s | 16x |
| Memory for 5GB file | 7.8GB | 190MB | 41x |
| Peak memory (8GB data) | 6.8GB | 1.3GB | 5x |
| Energy consumption | 10.1Wh | 2.2Wh | 4.5x |
Polars Migration Example
import polars as pl
# Polars has similar but different syntax
df = pl.read_csv("data.csv")
# Group by
result = df.group_by('category').agg([
pl.col('value').mean().alias('avg'),
pl.col('value').max().alias('max')
])
# Filter
result = df.filter(pl.col('value') > 100)
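For files that do not fit in memory, the lazy API (scan_csv) builds the query first and reads only what the result needs. A sketch against the same data.csv:

# Lazy query: nothing is read from disk until collect() runs
result = (
    pl.scan_csv("data.csv")
    .filter(pl.col('value') > 100)
    .group_by('category')
    .agg(pl.col('value').mean().alias('avg'))
    .collect()
)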
The migration typically takes 1-4 weeks for standard pipelines. Polars covers most common Pandas operations with its own expression-based API, and official migration guides provide syntax mappings.
When to Stick with Pandas
Pandas remains the right choice when:
- Datasets fit comfortably in RAM (under 10GB)
- You rely on ecosystem libraries (scikit-learn, statsmodels)
- Your team knows Pandas well and needs to ship fast
- Prototyping, where development speed matters more than execution speed
Complete Memory Optimization Workflow
Combine all techniques into a production-ready pipeline:
import pandas as pd
import numpy as np
def optimize_dataframe(df, cat_threshold=0.5):
    """Optimize DataFrame memory with multiple techniques."""
    # Step 1: Optimize string columns
    for col in df.select_dtypes(include=['object']):
        num_unique = df[col].nunique()
        num_total = len(df)
        if num_unique / num_total < cat_threshold:
            df[col] = df[col].astype('category')
    # Step 2: Downcast integers
    for col in df.select_dtypes(include=['integer']):
        col_min = df[col].min()
        col_max = df[col].max()
        if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
            df[col] = df[col].astype('int8')
        elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
            df[col] = df[col].astype('int16')
        elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
            df[col] = df[col].astype('int32')
    # Step 3: Downcast floats
    for col in df.select_dtypes(include=['float']):
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df
# Usage
df = pd.read_csv("large_data.csv")
print(f"Before: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
df = optimize_dataframe(df)
print(f"After: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
Typical output:
Before: 1124.50 MB
After: 156.32 MB
Memory reduction: 86%
Practical Example: 10 Million Rows
A real-world benchmark on 10M rows:
| Operation | Before | After | Improvement |
|---|---|---|---|
| Load time | 45s | 8s | 5.6x |
| Memory usage | 850MB | 95MB | 8.9x |
| Filter time | 12s | 0.8s | 15x |
| GroupBy time | 8s | 0.5s | 16x |
The key changes:
- Specified dtypes during CSV import
- Converted low-cardinality strings to category
- Downcast numeric columns
- Used vectorized operations instead of apply
- Set index on frequently-queried columns
Measuring Your Own Results
Profile before and after:
import time
import tracemalloc
def benchmark(operation, *args, **kwargs):
    tracemalloc.start()
    start = time.time()
    result = operation(*args, **kwargs)
    current, peak = tracemalloc.get_traced_memory()
    elapsed = time.time() - start
    tracemalloc.stop()
    return result, elapsed, peak / 1024 / 1024
# Example usage
result, time_sec, memory_mb = benchmark(lambda: df.groupby('cat')['val'].mean())
print(f"Time: {time_sec:.3f}s, Memory: {memory_mb:.2f}MB")
Summary
Memory optimization in Pandas follows clear patterns. Start with dtype optimization, which often delivers 80% memory reduction with minimal code changes. Add chunk processing for files that exceed RAM. Use vectorized operations and avoid iterrows and apply. Set indexes on columns you query frequently.
For datasets that genuinely exceed single-machine capacity, Polars offers 10-100x performance improvements with reasonable migration effort.
The key insight: most performance issues come from a handful of common patterns. Fix those patterns, and your Pandas code will handle datasets you never thought possible.