    I Rewrote a Real Data Workflow in Polars. Pandas Didn’t Stand a Chance.

By Editor Times Featured · May 7, 2026 · 17 Mins Read


I wasn't actively looking for Polars.

I've been on a bit of a Pandas optimization journey lately. First, I wrote about why you should stop writing loops in Pandas and think in columns instead.

Then I went deeper into profiling real workflows, fixing vectorization mistakes, and ended up cutting a 61-second pipeline down to 0.33 seconds using nothing but better Pandas and NumPy. That one surprised even me.

So I was in a good place with Pandas. I felt like I finally understood how to use it properly.

Then someone dropped a comment on one of my posts. Something along the lines of: "Have you tried Polars? It's built for exactly this kind of thing."

I'd seen the name floating around in the data community. There was buzz around it: something about speed, about a completely different way of thinking about data pipelines. But I'd never actually touched it.

That comment was enough to push me over the edge.

So I did what I always do. I got curious, installed it, and rewrote the exact same workflow from my last article, the one I'd already optimized in Pandas, in a tool I'd never used before.

What I found surprised me. Not just the speed numbers, but what Polars quietly teaches you about how data pipelines actually work.

Isn't Pandas Enough?

Fair question.

In my last article, I took a slow Pandas pipeline and optimized it down to 0.33 seconds. Vectorized operations, correct data types, no unnecessary copies. The results were honestly better than I expected.

So why are we even talking about Polars?

Here's the thing. Everything I did in that article was me doing the optimization manually. I had to know which operations were slow, why they were slow, and how to fix them. Polars does a lot of that thinking for you automatically, before it even runs your code.

On top of that, Polars is built on a completely different foundation than Pandas. It uses all of your CPU cores by default. It manages memory differently. And it introduces a way of writing data pipelines that, once it clicks, changes how you think about the whole process.

Optimized Pandas is impressive. But it still has a ceiling. This article is about what's on the other side of it.

    The Workflow

To keep things consistent and useful for anyone who followed along with my last article, I'm using the same synthetic e-commerce dataset. A million rows. Nothing exotic. Just the kind of data you'd realistically encounter in the wild.

If you want to generate it yourself, here's the setup code:

import pandas as pd
import numpy as np

np.random.seed(42)

n = 1_000_000

regions = ['north', 'south', 'east', 'west']
categories = ['electronics', 'clothing', 'furniture', 'food', 'sports']
statuses = ['completed', 'returned', 'pending', 'cancelled']

df = pd.DataFrame({
    'order_id': np.arange(1000, 1000 + n),
    'order_date': pd.date_range(start='2022-01-01', periods=n, freq='1min'),
    'region': np.random.choice(regions, size=n),
    'category': np.random.choice(categories, size=n),
    'sales': np.random.randint(100, 10000, size=n),
    'quantity': np.random.randint(1, 20, size=n),
    'discount': np.round(np.random.uniform(0.0, 0.5, size=n), 2),
    'status': np.random.choice(statuses, size=n),
})

df.to_csv('large_sales_data.csv', index=False)

The pipeline we're running on it is simple. The kind of thing that shows up in real workflows all the time:

• Fix data types upfront
• Calculate net revenue per order
• Flag high-value orders
• Aggregate total net revenue by region

Simple. Familiar. And already optimized in Pandas from the last article, which makes it the perfect benchmark for Polars.

The Pandas Version

I'm not going to show you the naive Pandas code here. I already did that in my last article: the version with three .apply() calls that took 61 seconds on this same dataset. If you haven't read that one, it's worth a look.

What I'm showing here is the optimized version. The best Pandas can do on this pipeline.

import pandas as pd
import numpy as np
import time

df = pd.read_csv('large_sales_data.csv')

start = time.time()

# Fix data types upfront
df['region'] = df['region'].astype('category')
df['category'] = df['category'].astype('category')
df['status'] = df['status'].astype('category')

# Vectorized revenue calculation
df['net_revenue'] = df['sales'] * df['quantity'] * (1 - 0.075)

# Vectorized flagging
df['order_flag'] = np.where(df['net_revenue'] > 50000, 'high', 'low')

# Aggregation
result = df.groupby('region')['net_revenue'].sum()

end = time.time()
print(f"Pandas runtime: {end - start:.2f} seconds")
print(result)

This is clean Pandas. Vectorized operations, correct data types, no unnecessary intermediate columns. Everything I learned from going down that rabbit hole.

And the result?

Pandas runtime: 0.31 seconds

That's already really good. Genuinely impressive for a million rows.

But here's the question I kept sitting with after seeing that number: what does it look like when the tool is doing the optimization for you, instead of you doing it manually?

That's what we're about to find out.

Installing Polars and First Impressions

Getting started was easy. If you're on Google Colab like me, one line is all you need:

!pip install polars
import polars as pl
print(pl.__version__)
1.35.2

Done. No environment headaches, no dependency conflicts. That alone was a good start.

But then I opened the Polars documentation and immediately noticed something. The syntax looked familiar enough (DataFrames, columns, filtering), but the way you're supposed to think about operations felt different.

In Pandas, you work with your data step by step. You do something, store the result, do something else. In Polars, you describe what you want as a single expression, and Polars figures out how to execute it.

I didn't fully understand that yet. But I was about to.

The other thing that caught my eye immediately was this concept of lazy vs eager execution. In Pandas, every line of code runs the moment you write it. Polars gives you a choice: you can run eagerly like Pandas, or you can build up a full query plan first and let Polars optimize it before executing anything.

I didn't know what that meant in practice yet. But I kept seeing it everywhere in the docs. So I decided the best way to understand it was to just rewrite my pipeline and see what happened.

The Eager Version

import polars as pl
import time

start = time.time()

result = (
    pl.read_csv('large_sales_data.csv')
    .with_columns([
        (pl.col('sales') * pl.col('quantity') * (1 - 0.075)).alias('net_revenue')
    ])
    .with_columns([
        pl.when(pl.col('net_revenue') > 50000)
        .then(pl.lit('high'))
        .otherwise(pl.lit('low'))
        .alias('order_flag')
    ])
    .group_by('region')
    .agg(pl.col('net_revenue').sum())
)

end = time.time()
print(f"Polars Eager runtime: {end - start:.2f} seconds")
print(result)
Polars Eager runtime: 0.83 seconds

The first thing I noticed was the method chaining. In Pandas, I was making separate assignments, doing one thing, storing the result, doing the next thing.

Here, everything flows as a single expression from start to finish. You're describing the whole pipeline at once.

Let me break down the syntax quickly, since some of it will look unfamiliar:

• pl.read_csv() — reads the CSV, same idea as pd.read_csv(). Nothing surprising.
• .with_columns([...]) — this is how Polars adds or transforms columns. The equivalent of df['new_col'] = ... in Pandas. You can compute multiple columns in a single call.
• pl.col('sales') — this is how you reference a column in Polars. Instead of df['sales'], you write pl.col('sales'). You're not grabbing the data directly; you're describing an operation on that column. That distinction matters more than it sounds.
• .alias('net_revenue') — just naming the result. Like saying "call this new column net_revenue."
• pl.when(...).then(...).otherwise(...) — Polars' version of np.where(). Reads almost like plain English: when this condition is true, return this value, otherwise return that one.
• .group_by('region').agg(...) — same idea as Pandas .groupby(). Group the data, then define your aggregation. Different syntax, same idea.

Now here's the thing. That eager version ran in 0.83 seconds. That's actually slower than our optimized Pandas at 0.31 seconds. If I'd stopped here, I'd have written Polars off entirely.

But I kept reading the docs. And I found something called lazy evaluation.

The Lazy Version

start = time.time()

result = (
    pl.scan_csv('large_sales_data.csv')
    .with_columns([
        (pl.col('sales') * pl.col('quantity') * (1 - 0.075)).alias('net_revenue')
    ])
    .with_columns([
        pl.when(pl.col('net_revenue') > 50000)
        .then(pl.lit('high'))
        .otherwise(pl.lit('low'))
        .alias('order_flag')
    ])
    .group_by('region')
    .agg(pl.col('net_revenue').sum())
    .collect()
)

end = time.time()
print(f"Polars Lazy runtime: {end - start:.2f} seconds")
print(result)
Polars Lazy runtime: 0.20 seconds

Spot the two differences from the eager version. pl.read_csv() became pl.scan_csv(), which tells Polars not to load anything yet, just start building a query plan. And .collect() was added at the end; that's where you tell Polars "okay, now execute everything."

Two changes. That's it.

And just like that, 0.83 seconds became 0.20 seconds.

Polars lazy is 35% faster than our already optimized Pandas pipeline. Without me doing any manual optimization. Without me profiling anything. Without me knowing in advance which operations were bottlenecks.

Polars figured that out on its own.

That's when I started paying attention.

Mental Model Shift #1: Lazy vs Eager Execution

This is the one that changed how I think about data pipelines.

In Pandas, every line of code executes the moment you write it. You assign a column, it runs. You filter a DataFrame, it runs. You group and aggregate, it runs.

Each operation is independent, immediate, and unaware of what comes before or after it.

That's eager execution. It's intuitive. It feels natural because it matches how we think about writing code step by step.

Polars gives you a choice.

When you use pl.scan_csv() instead of pl.read_csv(), you're telling Polars: don't execute anything yet. Just start building a plan. Every .with_columns(), every .filter(), every .group_by() you chain after that isn't running; it's being recorded. You're describing what you want, not triggering it.

Then when you call .collect() at the end, Polars takes that entire plan, looks at it as a whole, optimizes it, and then executes it in a single efficient pass.

Think of it this way:

• Pandas is like following a recipe step by step. Chop the onions. Put them in the pan. Add the garlic. Each instruction happens immediately, in order, one at a time.
• Polars lazy is like a chef who reads the whole recipe first, figures out the most efficient way to prepare everything, and then executes it in the optimal order. Same result. Less wasted effort.

That optimization step is the key. Before Polars runs a single line of your pipeline it asks: what columns do we actually need? What rows can we eliminate early? What operations can be parallelized? What work can be skipped entirely?
You can actually see the query plan Polars builds before it executes.

Run this:

lazy_query = (
    pl.scan_csv('large_sales_data.csv')
    .with_columns([
        (pl.col('sales') * pl.col('quantity') * (1 - 0.075)).alias('net_revenue')
    ])
    .with_columns([
        pl.when(pl.col('net_revenue') > 50000)
        .then(pl.lit('high'))
        .otherwise(pl.lit('low'))
        .alias('order_flag')
    ])
    .group_by('region')
    .agg(pl.col('net_revenue').sum())
)

print(lazy_query.explain())

That .explain() call shows you exactly what Polars is planning to do before it does it. It's the optimizer's thinking made visible.

This is something Pandas simply doesn't have.

In Pandas, the optimization is your responsibility. You have to know which operations are expensive, profile your code, and restructure it manually, which is exactly what I did in my last article. In Polars lazy mode, the optimizer handles that for you.

That's not a small difference. That's a fundamentally different relationship between you and your data pipeline.

Mental Model Shift #2: Query Optimization

Once I understood lazy evaluation, I started wondering: okay, but what exactly is Polars optimizing? What is it actually doing differently under the hood?

It turns out there are two big optimizations worth knowing about. And once you understand them, you start seeing why Polars is faster not just on this pipeline, but on almost any non-trivial workflow.

    Predicate Pushdown

Let's say you add a filter to our pipeline because you only want completed orders:

result = (
    pl.scan_csv('large_sales_data.csv')
    .with_columns([
        (pl.col('sales') * pl.col('quantity') * (1 - 0.075)).alias('net_revenue')
    ])
    .filter(pl.col('status') == 'completed')
    .group_by('region')
    .agg(pl.col('net_revenue').sum())
    .collect()
)

In eager mode (Pandas or Polars eager), this is what happens: load all one million rows, compute net revenue for all of them, then filter down to completed orders. You did expensive work on rows you were going to throw away anyway.

Polars lazy does something smarter. It looks at the whole query plan and says: there's a filter here. Let me apply that filter as early as possible, ideally before loading data or right after. That way I'm only doing the expensive computations on the rows that actually matter.

That's predicate pushdown. Push the filter down to the earliest possible point in the pipeline. Less data processed. Less memory used. Faster result.

    Projection Pruning

Same idea, but for columns instead of rows.

Our CSV has eight columns. But our final result only needs sales, quantity, region, and status. Polars looks at the query plan, figures out which columns are actually needed to produce the final output, and only loads those. The rest are ignored entirely.

In Pandas, you load everything first and decide what you need later. Polars figures out what it needs before loading anything.

These two optimizations, predicate pushdown and projection pruning, are what a database query optimizer does. If you've ever written SQL and wondered why the database is fast even on huge tables, this is a big part of why.

And that's the mental shift here. When you write a Polars lazy pipeline, you're not really writing a script anymore. You're writing a query. You're describing what you want, and Polars, like a database engine, figures out the most efficient way to get it.

That's a different way of thinking. And once it clicks, you start approaching data pipelines differently. Instead of asking "what should I do next?", you start asking "what do I actually need at the end?" and working backwards from there.

Mental Model Shift #3: Columnar Memory

There's one more piece to why Polars is fast. And it lives at a level below your code.

Pandas is built on NumPy, with an internal block manager that groups same-dtype columns into 2D blocks. It works, but it comes with baggage: string and other object columns scatter their values across the heap, operations frequently copy data, and the layout was never designed around a standard analytical format.

So when you compute something across an entire column, like summing all the revenue values, Pandas can be fast on clean numeric data, but the moment mixed types, object columns, or intermediate copies enter the picture, it ends up hopping through memory instead of streaming through it.

Polars uses a columnar memory format called Apache Arrow. Every column lives in its own contiguous, typed buffer. All the revenue values sit next to each other in memory. All the region values sit next to each other. When you need to compute something on a column, everything you need is already in one contiguous block of memory, in a format other tools can share without copying.

Why does that matter? Two reasons.

First, modern CPUs are built to process contiguous blocks of memory extremely efficiently. When your data is laid out in typed columns, the CPU can tear through it using vectorized instructions, processing multiple values in a single operation. Scattered, mixed-type storage breaks that optimization.

Second, columnar storage pairs naturally with the query optimizer: Polars only touches the columns it needs. If your DataFrame has 20 columns but your operation only needs 3, Polars works with those 3 columns in memory. Pandas' read_csv loads the whole file regardless.

You don't have to do anything to get this benefit. It's just how Polars is built.

Think of it this way. Loading a CSV with Pandas is like photocopying an entire filing cabinet because you need three folders. Polars lazy reads the plan first, walks over, and pulls out exactly the three folders you asked for.

That's the columnar memory model. And combined with lazy evaluation and query optimization, it's why Polars can do in 0.20 seconds what takes Pandas 0.31 seconds, even after Pandas has been carefully optimized by hand.

Where Pandas Still Wins

I want to be straight with you here. This article isn't a Pandas obituary.

After going through this whole exercise, there are still situations where I'd reach for Pandas without hesitation.

For quick exploration and ad hoc analysis, Pandas is hard to beat. The syntax is familiar, the ecosystem is huge, and when you're just poking around a dataset trying to understand it, the performance difference doesn't matter. You're not running a pipeline; you're thinking out loud.

For small datasets, Polars' advantages mostly disappear. The overhead of building a query plan only pays off when there's enough data to make optimization worthwhile. On a few thousand rows, just use Pandas.

For visualization and integrations, Pandas is deeply woven into the Python data ecosystem. Matplotlib, Seaborn, Scikit-learn, Statsmodels: they all speak Pandas natively. Polars is catching up, but Pandas is still the common language.

And honestly, for teams and collaborators, familiarity matters. If everyone on your team knows Pandas and nobody knows Polars, introducing it has a real cost.

I'm not replacing Pandas. I'm becoming more intentional about when I use it.

Small dataset, quick exploration, visualization, team familiarity: Pandas. Large dataset, repeated pipeline, production workflow, performance matters: Polars.

Both tools. Right situations.

    Conclusion

Here's what I came in expecting: a speed comparison. Run the same code in two libraries, show the numbers, done.

Here's what I actually got: a different way of thinking about data.

The speed numbers are real: we went from a 61-second broken Pandas pipeline in my last article, to 0.31 seconds with optimized Pandas, to 0.20 seconds with Polars lazy evaluation. That's a journey worth documenting.

But the more lasting thing is what Polars quietly teaches you along the way. That filters should be applied as early as possible. That you should only load what you need. That describing what you want and letting an optimizer figure out the how is a legitimate, and powerful, way to write data pipelines.

These aren't just Polars ideas. They're good data engineering ideas. And I wouldn't have sat down to really understand them if a comment on one of my posts hadn't pushed me toward a tool I'd never used before.

Pandas didn't become worse through any of this. My expectations just got higher.

If you've got a workflow that feels slower than it should, even after you've cleaned up the code, it might be worth asking whether the tool itself has a ceiling. Sometimes the next step isn't writing better code. It's writing the same code in something built differently.

What workflows are you running that might be worth a rewrite?


