If you work with Python for data, you've most likely experienced the frustration of waiting minutes for a Pandas operation to finish.
At first, everything seems fine, but as your dataset grows and your workflows become more complex, your laptop suddenly feels like it's preparing for lift-off.
A few months ago, I worked on a project analyzing e-commerce transactions with over 3 million rows of data.
It was a fairly fascinating experience, but most of the time, I watched simple groupby operations that usually ran in seconds suddenly stretch into minutes.
At that point, I realized Pandas is wonderful, but it's not always enough.
This article explores modern alternatives to Pandas, including Polars and DuckDB, and examines how they can simplify and improve the handling of large datasets.
For clarity, let me be upfront about a few things before we begin.
This article is not a deep dive into Rust memory management or a proclamation that Pandas is obsolete.
Instead, it's a practical, hands-on guide. You will see real examples, personal experiences, and actionable insights into workflows that can save you time and sanity.
Why Pandas Can Feel Slow
Back when I was on the e-commerce project, I remember working with CSV files over two gigabytes, where every filter or aggregation in Pandas often took several minutes to complete.
During that time, I would stare at the screen, wishing I could just grab a coffee or binge a few episodes of a show while the code ran.
The main pain points I encountered were speed, memory, and workflow complexity.
Large CSV files consume huge amounts of RAM, often more than my laptop could comfortably handle. On top of that, chaining multiple transformations made the code harder to maintain and slower to execute.
Polars and DuckDB address these challenges in different ways.
Polars, built in Rust, uses multi-threaded execution to process large datasets efficiently.
DuckDB, on the other hand, is designed for analytics and executes SQL queries without requiring you to load everything into memory.
Basically, each of them has its own superpower. Polars is the speedster, and DuckDB is something like the memory magician.
And the best part? Both integrate seamlessly with Python, allowing you to improve your workflows without a full rewrite.
Setting Up Your Environment
Before we start coding, make sure your environment is ready. For consistency, I used Pandas 2.2.0, Polars 0.20.0, and DuckDB 0.9.0.
Pinning versions can save you headaches when following tutorials or sharing code.
pip install pandas==2.2.0 polars==0.20.0 duckdb==0.9.0
In Python, import the libraries:
import pandas as pd
import polars as pl
import duckdb
import warnings
warnings.filterwarnings("ignore")
As an example, I'll use an e-commerce sales dataset with columns such as order ID, product ID, region, country, revenue, and date. You can download similar datasets from Kaggle or generate synthetic data.
Loading Data
Loading data efficiently sets the tone for the rest of your workflow. I remember a project where the CSV file had nearly 5 million rows.
Pandas handled it, but the load times were long, and the repeated reloads during testing were painful.
It was one of those moments where you wish your laptop had a "fast forward" button.
Switching to Polars and DuckDB changed everything: suddenly, I could access and manipulate the data almost instantly, which made the testing and iteration process far more enjoyable.
With Pandas:
df_pd = pd.read_csv("sales.csv")
print(df_pd.head(3))
With Polars:
df_pl = pl.read_csv("sales.csv")
print(df_pl.head(3))
With DuckDB:
con = duckdb.connect()
df_duck = con.execute("SELECT * FROM 'sales.csv'").df()
print(df_duck.head(3))
DuckDB can query CSVs directly without loading the entire dataset into memory, making it much easier to work with large files.
Filtering Data
The problem here is that filtering in Pandas can be slow when dealing with millions of rows. I once needed to analyze European transactions in a massive sales dataset. Pandas took minutes, which slowed down my analysis.
With Pandas:
filtered_pd = df_pd[df_pd.region == "Europe"]
Polars is faster and can process multiple filters efficiently:
filtered_pl = df_pl.filter(pl.col("region") == "Europe")
DuckDB uses SQL syntax:
filtered_duck = con.execute("""
SELECT *
FROM 'sales.csv'
WHERE region = 'Europe'
""").df()
Now you can filter through large datasets in seconds instead of minutes, leaving you more time to focus on the insights that really matter.
Aggregating Large Datasets Quickly
Aggregation is often where Pandas starts to feel sluggish. Imagine calculating total revenue per country for a marketing report.
In Pandas:
agg_pd = df_pd.groupby("country")["revenue"].sum().reset_index()
In Polars:
agg_pl = df_pl.group_by("country").agg(pl.col("revenue").sum())
In DuckDB:
agg_duck = con.execute("""
SELECT country, SUM(revenue) AS total_revenue
FROM 'sales.csv'
GROUP BY country
""").df()
I remember running this aggregation on a ten-million-row dataset. In Pandas, it took nearly half an hour. Polars completed the same operation in under a minute.
The sense of relief was almost like finishing a marathon and realizing your legs still work.
Joining Datasets at Scale
Joining datasets is one of those things that sounds simple until you are actually knee-deep in the data.
In real projects, your data usually lives in multiple sources, so you have to combine them using shared columns like customer IDs.
I learned this the hard way while working on a project that required combining millions of customer orders with an equally large demographic dataset.
Each file was big enough on its own, but merging them felt like trying to force two puzzle pieces together while your laptop begged for mercy.
Pandas took so long that I started timing the joins the same way people time how long it takes their microwave popcorn to finish.
Spoiler: the popcorn won every time.
Polars and DuckDB gave me a way out.
With Pandas:
merged_pd = df_pd.merge(pop_df_pd, on="country", how="left")
Polars:
merged_pl = df_pl.join(pop_df_pl, on="country", how="left")
DuckDB:
merged_duck = con.execute("""
SELECT *
FROM 'sales.csv' s
LEFT JOIN 'pop.csv' p
USING (country)
""").df()
Joins on large datasets that used to freeze your workflow now run smoothly and efficiently.
Lazy Evaluation in Polars
One thing I didn't appreciate early in my data science journey was how much time gets wasted running transformations line by line.
Polars approaches this differently.
It uses a technique called lazy evaluation, which essentially waits until you've finished defining your transformations before executing anything.
It examines your entire pipeline, determines the most efficient execution path, and runs everything in one pass.
It's like having a friend who listens to your whole order before walking to the kitchen, instead of one who takes each instruction separately and keeps going back and forth.
This TDS article explains lazy evaluation in depth.
Here's what the flow looks like:
Pandas:
df = df[df["amount"] > 100]
df = df.groupby("segment").agg({"amount": "mean"})
df = df.sort_values("amount")
Polars Lazy Mode:
import polars as pl
df_lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("segment")
    .agg(pl.col("amount").mean())
    .sort("amount")
)
result = df_lazy.collect()
The first time I used lazy mode, it felt strange not seeing instant results. But once I ran the final .collect(), the speed difference was obvious.
Lazy evaluation won't magically solve every performance issue, but it brings a level of efficiency that Pandas wasn't designed for.
Conclusion and Takeaways
Working with large datasets doesn't have to feel like wrestling with your tools.
Using Polars and DuckDB showed me that the problem wasn't always the data. Sometimes, it was the tool I was using to handle it.
If there is one thing you take away from this tutorial, let it be this: you don't have to abandon Pandas, but you can reach for something better when your datasets start pushing its limits.
Polars gives you speed as well as smarter execution, while DuckDB lets you query huge files as if they were tiny. Together, they make working with big data feel more manageable and less tiring.
If you want to go deeper into the ideas explored in this tutorial, the official documentation for Polars and DuckDB is a good place to start.

