Polars vs Pandas: The Complete Migration Guide

A practical guide to Polars covering benchmarks, API comparisons, lazy evaluation, and when to migrate from Pandas. Includes real code examples and production patterns.

I spent three years writing Pandas code that worked fine — until it didn’t. The dataset that used to load in two seconds started taking four minutes. The GroupBy that ran on my laptop crashed on the server with an out-of-memory error. The ETL pipeline that processed 5GB files became a 40-minute ordeal.

The fix wasn’t more RAM or a bigger machine. It was switching to Polars.

This guide covers what you actually need to know to make that switch: real benchmark numbers, side-by-side API comparisons, how lazy evaluation works, and a practical decision framework for when Polars is worth the migration cost.

Why Polars Is Different at a Fundamental Level

Most “Polars vs Pandas” comparisons stop at “Polars is faster.” That’s true, but it misses why — and understanding the why helps you use Polars correctly.

Pandas is built on NumPy, which means eager, single-threaded execution and a memory model that predates Apache Arrow. It was designed in an era when datasets fit comfortably in RAM and multi-core CPUs were a novelty. The mental model is a spreadsheet: you index rows, mutate cells, and chain operations imperatively.

Polars was built from scratch in Rust in 2020, using Apache Arrow’s columnar memory format. Every operation is parallelized across all CPU cores automatically. More importantly, Polars includes a query optimizer — the same concept that makes SQL databases fast. When you write a Polars query, you’re describing what you want, and the engine figures out the most efficient way to compute it.

The practical difference: Polars doesn’t just run the same operations faster. It runs fewer operations.

The Benchmark Numbers

Let’s get concrete. These are real numbers from recent benchmarks.

Official PDS-H benchmark (Polars 1.30.0, AWS c7a.24xlarge, ~10GB CSV data):

Engine             Total time   vs Polars streaming
Polars streaming   3.89s        1x
DuckDB             5.87s        1.5x slower
Polars in-memory   9.68s        2.5x slower
Dask               46.02s       11.8x slower
PySpark            120.11s      30.9x slower
Pandas             365.71s      94x slower

At 100GB, Pandas exits with an out-of-memory error. Polars streaming finishes in 24 seconds.

Independent benchmark (5M rows, M1 MacBook):

  • Pandas: ~4.5 seconds total
  • Polars: ~0.3 seconds total
  • 15x faster

Specific operations on 28M rows of real stock data:

Operation                     Pandas    Polars    Difference
Large join (10M × 10M rows)   18.7s     2.1s      9x faster
CSV read (1GB)                8–10s     <2s       5x faster
GroupBy aggregation           —         —         5–10x faster
Sort (10M rows)               —         —         11x faster
String regex extraction       8.2s      11.3s     Pandas 40% faster

That last row matters: Polars isn’t universally faster. String-heavy regex operations are one area where Pandas still has an edge. Keep that in mind when profiling your own workloads.

Memory efficiency:

  • 1GB CSV: Polars uses ~87% less memory than Pandas
  • 12GB Parquet: Pandas needs 16GB+ RAM (OOM on most machines), Polars peaks at 2GB

Core API Differences

The biggest adjustment when switching from Pandas is the expression-based API. In Pandas, you often manipulate DataFrames directly. In Polars, you write expressions that describe transformations, and the engine applies them.

Reading Data

import pandas as pd
import polars as pl

# Pandas
df = pd.read_csv("data.csv")
df = pd.read_parquet("data.parquet")

# Polars (eager — runs immediately)
df = pl.read_csv("data.csv")
df = pl.read_parquet("data.parquet")

# Polars (lazy — recommended for large files)
lf = pl.scan_parquet("data.parquet")
lf = pl.scan_csv("data.csv")

The scan_* functions don’t read the file yet. They build a query plan. More on that in the lazy evaluation section.

Filtering Rows

# Pandas
result = df[df["amount"] > 100]

# Polars
result = df.filter(pl.col("amount") > 100)

GroupBy and Aggregation

# Pandas
result = (
    df.groupby("category")
    .agg({"sales": "sum", "units": "mean"})
    .reset_index()
    .sort_values("sales", ascending=False)
)

# Polars
result = (
    df.group_by("category")
    .agg([
        pl.col("sales").sum(),
        pl.col("units").mean()
    ])
    .sort("sales", descending=True)
)

Notice there’s no reset_index() in Polars. Polars has no row index — one of the bigger conceptual shifts. You never need to reset it because it doesn’t exist.

Adding and Transforming Columns

This is where Polars really shines. In Pandas, adding multiple columns often means multiple passes over the data. In Polars, with_columns runs all transformations in parallel in a single pass:

# Pandas (three separate operations, three passes)
df["revenue_usd"] = df["revenue"] * 1.1
df["name_upper"] = df["name"].str.upper()
df["is_large"] = df["revenue"] > 10000

# Polars (one call, all columns computed in parallel)
df = df.with_columns([
    (pl.col("revenue") * 1.1).alias("revenue_usd"),
    pl.col("name").str.to_uppercase().alias("name_upper"),
    (pl.col("revenue") > 10000).alias("is_large")
])

Conditional Logic

Pandas uses .loc[] for conditional assignment. Polars uses when/then/otherwise:

# Pandas
df.loc[df["value"] > 100, "category"] = "high"

# Polars
df = df.with_columns(
    pl.when(pl.col("value") > 100)
    .then(pl.lit("high"))
    .otherwise(pl.col("category"))
    .alias("category")
)

Joins

# Pandas
result = df1.merge(df2, on="id", how="inner")

# Polars
result = df1.join(df2, on="id", how="inner")

One important difference: Polars joins don’t preserve row insertion order. If order matters, add an explicit .sort() after the join.

Schema Strictness

Polars is strict about types. Pandas will silently coerce mismatched types; Polars raises an error:

# Pandas: silently converts int to float
df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})  # b is int
result = pd.concat([df1, df2])  # works, b becomes float64

# Polars: raises SchemaError
df1 = pl.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})
pl.concat([df1, df2])  # SchemaError: column 'b' has dtype i64 but expected f64

This strictness catches bugs early. But it means migration requires explicit .cast() calls in places where Pandas was silently handling type mismatches.

Lazy Evaluation: The Feature That Changes Everything

If you only learn one Polars concept, make it lazy evaluation. It’s what separates Polars from “fast Pandas” and makes it a genuine query engine.

How It Works

When you call pl.scan_parquet() or df.lazy(), Polars doesn’t execute anything. It builds a query plan — a description of what you want to compute. Only when you call .collect() does it actually run.

result = (
    pl.scan_parquet("clickstream.parquet")   # no data read yet
    .filter(pl.col("click_velocity") > 10)   # adds filter to plan
    .group_by("user_id")                     # adds aggregation to plan
    .agg(pl.col("amount").sum())
    .sort("amount", descending=True)
    .collect()                               # NOW it executes
)

Between scan_parquet and collect, Polars’ query optimizer rewrites your plan to be as efficient as possible.

What the Optimizer Does Automatically

Predicate pushdown: Your .filter() gets pushed down to the file reader. For Parquet files, this means entire row groups that don’t match the filter are skipped before being read into memory. A filter that eliminates 80% of rows means reading 80% less data from disk.

Projection pushdown: If your query only uses 3 columns from a 100-column Parquet file, Polars only reads those 3 columns. Parquet’s columnar format makes this essentially free.

Common subexpression elimination: If you reference the same expression twice, Polars computes it once.

Join reordering: Polars automatically picks the most efficient join order based on estimated cardinalities.

Inspecting the Query Plan

lf = (
    pl.scan_parquet("transactions.parquet")
    .filter(pl.col("amount") > 100)
    .group_by("user_id")
    .agg(pl.col("amount").sum())
)

# See what the optimizer produces
print(lf.explain(optimized=True))

This is useful for debugging performance issues — you can verify that predicate pushdown is actually happening.

Streaming: Processing Data Larger Than RAM

For datasets that don’t fit in memory, Polars has a streaming engine that processes data in chunks:

# Enable streaming globally
pl.Config.set_engine_affinity(engine="streaming")
result = lf.collect()

# Or write directly to file without materializing in memory
lf.sink_parquet("output.parquet")
lf.sink_csv("output.csv")

The official benchmark shows Polars streaming at 3.89 seconds on 10GB data — faster than DuckDB, and 94x faster than Pandas. At 100GB, Pandas crashes; Polars streaming finishes in 24 seconds.

When Lazy Doesn’t Work

Not every operation supports lazy mode. pivot is the most common example:

result = (
    lf.filter(pl.col("region").is_in(["East", "West"]))
    .collect()          # materialize first
    .pivot(...)         # pivot only works on DataFrame
    .lazy()             # convert back to lazy if needed
    .select(pl.all().max())
    .collect()
)

When you hit an operation that requires eager execution, collect, do the operation, then go back to lazy if you have more transformations.

Migrating from Pandas: A Practical Approach

Don’t rewrite everything at once. Profile first, then migrate the bottlenecks.

Step 1: Find Your Bottlenecks

import cProfile
cProfile.run("run_pipeline()", sort="cumulative")

Or use py-spy for a lower-overhead sampling profiler on running processes. You’re looking for the stages that consume the most time — usually large joins, GroupBy operations on millions of rows, or reading big files.
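If a full profiler is overkill, a minimal stage timer is often enough to find the one stage eating most of the wall clock. A sketch, with stand-in stage bodies where your real load/transform calls would go:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name):
    # Prints wall-clock time for one pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{name}: {time.perf_counter() - start:.2f}s")

# Usage inside your pipeline:
with stage("load"):
    data = list(range(1_000_000))  # stand-in for pd.read_csv(...)
with stage("transform"):
    total = sum(data)              # stand-in for the join/groupby stage
```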

Step 2: Migrate the Slow Parts First

A real example from a production ETL pipeline: a 5-stage pipeline had one stage doing a join on two 10M-row tables. That stage took 40 minutes in Pandas. After migrating just that stage to Polars, it dropped to 4 minutes. The other four stages stayed in Pandas.

Step 3: Handle the Boundaries

Polars and Pandas interoperate cleanly:

# Polars → Pandas (for scikit-learn, seaborn, etc.)
df_pandas = df_polars.to_pandas()

# Pandas → Polars
df_polars = pl.from_pandas(df_pandas)

# Polars → NumPy (lower overhead than going through Pandas)
X = df_polars.select(["feature1", "feature2"]).to_numpy()

Keep .to_pandas() calls at the boundaries of your pipeline, not scattered throughout. Each conversion copies data.

Common Pattern Translations

Replace .apply() with expressions:

# Pandas (slow — single-threaded Python loop)
df["result"] = df["col"].apply(lambda x: x * 2 if x > 0 else 0)

# Polars (fast — vectorized, parallel)
df = df.with_columns(
    pl.when(pl.col("col") > 0)
    .then(pl.col("col") * 2)
    .otherwise(0)
    .alias("result")
)

Replace .loc[] indexing:

# Pandas
subset = df.loc[df["status"] == "active", ["id", "name"]]

# Polars
subset = df.filter(pl.col("status") == "active").select(["id", "name"])

No more reset_index():

# Pandas (needed after groupby)
result = df.groupby("col").sum().reset_index()

# Polars (no index, no reset needed)
result = df.group_by("col").agg(pl.all().sum())

When to Use Polars vs Pandas

The honest answer: it depends on your data size and your existing codebase.

Use Polars when:

  • Your dataset is over 1M rows (performance gap becomes significant)
  • Your dataset is over 10GB (Pandas may OOM; Polars streaming handles it)
  • You have join-heavy workloads (9x faster on large joins)
  • You’re reading Parquet files (predicate/projection pushdown is a huge win)
  • You’re building a new data pipeline from scratch
  • You’re in a memory-constrained environment

Stick with Pandas when:

  • Your data is under 1GB (performance difference is negligible)
  • You need deep scikit-learn, statsmodels, or seaborn integration
  • You’re processing Excel files (Pandas’ read_excel ecosystem is mature)
  • You need HDF5, Stata, SAS, or SPSS formats
  • Your team has a large existing Pandas codebase and the migration cost outweighs the benefit
  • Your workload is string-regex-heavy (Pandas is ~40% faster here)

The hybrid approach (recommended for most production systems):

import polars as pl
import duckdb

# DuckDB for initial SQL-style filtering and aggregation
conn = duckdb.connect(":memory:")
df = conn.query("""
    SELECT customer_id, SUM(amount) AS total, COUNT(*) AS txn_count
    FROM read_csv('transactions.csv', AUTO_DETECT=TRUE)
    WHERE region = 'East'
    GROUP BY customer_id
    HAVING txn_count > 5
""").pl()  # zero-copy output as Polars DataFrame

# Polars for complex transformations
df = (
    df.with_columns(
        (pl.col("total") / pl.col("txn_count")).alias("avg_value")
    )
    .filter(pl.col("avg_value") > 100)
    .sort("total", descending=True)
)

# Convert to Pandas/NumPy only at the ML boundary
from sklearn.ensemble import RandomForestClassifier
X = df.select(["total", "txn_count", "avg_value"]).to_numpy()

DuckDB handles SQL-style queries with a familiar interface. Polars handles expression-based transformations. Pandas/NumPy handles the ML framework boundary. Each tool does what it’s best at.

Pitfalls to Watch Out For

Lazy re-scanning: Calling .collect() multiple times on the same scan_parquet re-reads the file each time. Cache intermediate results:

# Bad: reads the file twice
lf = pl.scan_parquet("big_file.parquet")
count = lf.select(pl.len()).collect()
result = lf.filter(...).collect()

# Good: read once, cache
df = pl.read_parquet("big_file.parquet")
count = len(df)
result = df.filter(...)

Join order isn’t preserved: Always add .sort() explicitly if row order matters after a join.

Type casting on migration: Polars won’t silently coerce types. Budget time for explicit .cast() calls when migrating existing Pandas code.

Streaming limitations: Not all operations work in streaming mode. If you get an error, check the Polars docs for streaming-compatible operations.

.to_pandas() cost: Each conversion copies data. Keep conversions at pipeline boundaries, not inside loops.

The Ecosystem Today

Polars has reached the point where most of the Python data stack works with it directly:

  • DuckDB: Zero-copy interop via Apache Arrow. duckdb.sql("SELECT ...").pl() returns a Polars DataFrame.
  • Plotly: Native Polars support.
  • Streamlit: st.dataframe(polars_df) works directly.
  • Altair: Native support via Narwhals.
  • Pandera: Schema validation for Polars DataFrames.
  • Microsoft Fabric: Polars is now in the default Python Notebook build.

The main gaps are scikit-learn, XGBoost, LightGBM, and matplotlib/seaborn — all still expect Pandas or NumPy. The .to_numpy() path is the lowest-overhead bridge for ML frameworks.

Getting Started

Install Polars:

pip install polars

For Arrow-based extras (recommended):

pip install "polars[all]"

The official Polars documentation is genuinely good — the migration guide from Pandas is worth reading if you’re doing a serious migration.

The Bottom Line

Polars isn’t a drop-in replacement for Pandas. The API is different, the mental model is different, and migration takes real effort. But for data-heavy workloads, the payoff is substantial: 5–15x faster on typical operations, 94x faster on large analytical queries, and the ability to process datasets that would OOM in Pandas.

The practical approach: keep Pandas for small datasets and ML framework integration, use Polars for anything over a million rows or any pipeline where performance actually matters, and don’t be afraid to mix both in the same codebase. The .to_pandas() and .from_pandas() bridges are there for a reason.

Start with your slowest pipeline stage. Profile it, migrate it to Polars, measure the improvement. That’s usually enough to make the case for the rest.


Sources: Polars official benchmarks, tildalice.io Polars vs Pandas 2026, endjin production migration case study, Real Python LazyFrame guide
