If you work with Python for data, you've most likely experienced the frustration of waiting minutes for a Pandas operation to finish.
At first, everything seems fine, but as your dataset grows and your workflows become more complex, your laptop suddenly feels like it's preparing for lift-off.
A few months ago, I worked on a project analyzing e-commerce transactions with over 3 million rows of data.
It was a fairly fascinating experience, but most of the time, I watched simple groupby operations that usually ran in seconds suddenly stretch into minutes.
At that point, I realized Pandas is wonderful, but it's not always enough.
This article explores modern alternatives to Pandas, including Polars and DuckDB, and examines how they can simplify and improve the handling of large datasets.
For clarity, let me be upfront about a few things before we begin.
This article is not a deep dive into Rust memory management or a proclamation that Pandas is obsolete.
Instead, it's a practical, hands-on guide. You will see real examples, personal experiences, and actionable insights into workflows that can save you time and sanity.
Why Pandas Can Feel Slow
Back when I was on the e-commerce project, I remember working with CSV files over two gigabytes, where every filter or aggregation in Pandas often took several minutes to complete.
During that time, I would stare at the screen, wishing I could just grab a coffee or binge a few episodes of a show while the code ran.
The main pain points I encountered were speed, memory, and workflow complexity.
Large CSV files consume huge amounts of RAM, often more than my laptop could comfortably handle. On top of that, chaining multiple transformations made the code harder to maintain and slower to execute.
Polars and DuckDB address these challenges in different ways.
Polars, built in Rust, uses multi-threaded execution to process large datasets efficiently.
DuckDB, on the other hand, is designed for analytics and executes SQL queries without requiring you to load everything into memory.
Basically, each of them has its own superpower. Polars is the speedster, and DuckDB is something like the memory magician.
And the best part? Both integrate seamlessly with Python, allowing you to improve your workflows without a full rewrite.
Setting Up Your Environment
Before we start coding, make sure your environment is ready. For consistency, I used Pandas 2.2.0, Polars 0.20.0, and DuckDB 0.9.0.
Pinning versions can save you headaches when following tutorials or sharing code.
pip install pandas==2.2.0 polars==0.20.0 duckdb==0.9.0
In Python, import the libraries:
import pandas as pd
import polars as pl
import duckdb
import warnings
warnings.filterwarnings("ignore")
As an example, I'll use an e-commerce sales dataset with columns such as order ID, product ID, region, country, revenue, and date. You can download similar datasets from Kaggle or generate synthetic data.
Loading Data
Loading data efficiently sets the tone for the rest of your workflow. I remember a project where the CSV file had nearly 5 million rows.
Pandas handled it, but the load times were long, and the repeated reloads during testing were painful.
It was one of those moments where you wish your laptop had a "fast forward" button.
Switching to Polars and DuckDB changed everything: suddenly, I could access and manipulate the data almost instantly, which made the testing and iteration process far more enjoyable.
With Pandas:
df_pd = pd.read_csv("sales.csv")
print(df_pd.head(3))
With Polars:
df_pl = pl.read_csv("sales.csv")
print(df_pl.head(3))
With DuckDB:
con = duckdb.connect()
df_duck = con.execute("SELECT * FROM 'sales.csv'").df()
print(df_duck.head(3))
DuckDB can query CSVs directly without loading the entire dataset into memory, making it much easier to work with large files.
Filtering Data
The problem here is that filtering in Pandas can be slow when dealing with millions of rows. I once needed to analyze European transactions in a massive sales dataset. Pandas took minutes, which slowed down my analysis.
With Pandas:
filtered_pd = df_pd[df_pd.region == "Europe"]
Polars is faster and can process multiple filters efficiently:
filtered_pl = df_pl.filter(pl.col("region") == "Europe")
DuckDB uses SQL syntax:
filtered_duck = con.execute("""
SELECT *
FROM 'sales.csv'
WHERE region = 'Europe'
""").df()
Now you can filter through large datasets in seconds instead of minutes, leaving you more time to focus on the insights that really matter.
Aggregating Large Datasets Quickly
Aggregation is often where Pandas starts to feel sluggish. Imagine calculating total revenue per country for a marketing report.
In Pandas:
agg_pd = df_pd.groupby("country")["revenue"].sum().reset_index()
In Polars:
agg_pl = df_pl.group_by("country").agg(pl.col("revenue").sum())
In DuckDB:
agg_duck = con.execute("""
SELECT country, SUM(revenue) AS total_revenue
FROM 'sales.csv'
GROUP BY country
""").df()
I remember running this aggregation on a ten-million-row dataset. In Pandas, it took nearly half an hour. Polars completed the same operation in under a minute.
The sense of relief was almost like finishing a marathon and realizing your legs still work.
Joining Datasets at Scale
Joining datasets is one of those things that sounds simple until you are actually knee-deep in the data.
In real projects, your data usually lives in multiple sources, so you have to combine them using shared columns like customer IDs.
I learned this the hard way while working on a project that required combining millions of customer orders with an equally large demographic dataset.
Each file was big enough on its own, but merging them felt like trying to force two puzzle pieces together while your laptop begged for mercy.
Pandas took so long that I started timing the joins the same way people time how long it takes their microwave popcorn to finish.
Spoiler: the popcorn won every time.
Polars and DuckDB gave me a way out.
With Pandas:
merged_pd = df_pd.merge(pop_df_pd, on="country", how="left")
Polars:
merged_pl = df_pl.join(pop_df_pl, on="country", how="left")
DuckDB:
merged_duck = con.execute("""
SELECT *
FROM 'sales.csv' s
LEFT JOIN 'pop.csv' p
USING (country)
""").df()
Joins on large datasets that used to freeze your workflow now run smoothly and efficiently.
Lazy Evaluation in Polars
One thing I didn't appreciate early in my data science journey was how much time gets wasted running transformations line by line.
Polars approaches this differently.
It uses a technique called lazy evaluation, which essentially waits until you've finished defining your transformations before executing anything.
It examines your entire pipeline, determines the most efficient execution path, and runs everything in one pass.
It's like having a friend who listens to your whole order before walking to the kitchen, instead of one who takes each instruction separately and keeps going back and forth.
This TDS article explains lazy evaluation in depth.
Here's what the flow looks like:
Pandas:
df = df[df["amount"] > 100]
df = df.groupby("segment").agg({"amount": "mean"})
df = df.sort_values("amount")
Polars Lazy Mode:
import polars as pl
df_lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("segment")
    .agg(pl.col("amount").mean())
    .sort("amount")
)
result = df_lazy.collect()
The first time I used lazy mode, it felt strange not seeing instant results. But once I ran the final .collect(), the speed difference was obvious.
Lazy evaluation won't magically solve every performance issue, but it brings a level of efficiency that Pandas wasn't designed for.
Conclusion and Takeaways
Working with large datasets doesn't have to feel like wrestling with your tools.
Using Polars and DuckDB showed me that the problem wasn't always the data. Sometimes, it was the tool I was using to handle it.
If there is one thing you take away from this tutorial, let it be this: you don't have to abandon Pandas, but you can reach for something better when your datasets start pushing its limits.
Polars gives you speed as well as smarter execution, while DuckDB lets you query huge files as if they were tiny. Together, they make working with big data feel more manageable and less tiring.
If you want to go deeper into the ideas explored in this tutorial, the official documentation for Polars and DuckDB is a good place to start.

