
    Modern DataFrames in Python: A Hands-On Tutorial with Polars and DuckDB

By Editor Times Featured · November 21, 2025 · 8 Mins Read


If you work with Python for data, you've probably experienced the frustration of waiting minutes for a Pandas operation to finish.

At first, everything seems fine, but as your dataset grows and your workflows become more complex, your laptop suddenly feels like it's preparing for lift-off.

A few months ago, I worked on a project analyzing e-commerce transactions with over 3 million rows of data.

It was a fairly interesting experience, but most of the time, I watched simple groupby operations that normally ran in seconds suddenly stretch into minutes.

At that point, I realized Pandas is wonderful, but it's not always enough.

This article explores modern alternatives to Pandas, including Polars and DuckDB, and examines how they can simplify and improve the handling of large datasets.

For clarity, let me be upfront about a few things before we begin.

This article is not a deep dive into Rust memory management or a proclamation that Pandas is obsolete.

Instead, it's a practical, hands-on guide. You will see real examples, personal experiences, and actionable insights into workflows that can save you time and sanity.


Why Pandas Can Feel Slow

Back when I was on the e-commerce project, I remember working with CSV files over two gigabytes, and every filter or aggregation in Pandas often took several minutes to complete.

During that time, I would stare at the screen, wishing I could just grab a coffee or binge a few episodes of a show while the code ran.

The main pain points I encountered were speed, memory, and workflow complexity.

We all know how large CSV files eat huge amounts of RAM, often more than what my laptop could comfortably handle. On top of that, chaining multiple transformations also made code harder to maintain and slower to execute.

Polars and DuckDB address these challenges in different ways.

Polars, built in Rust, uses multi-threaded execution to process large datasets efficiently.

DuckDB, on the other hand, is designed for analytics and executes SQL queries without needing you to load everything into memory.

Basically, each of them has its own superpower. Polars is the speedster, and DuckDB is something like a memory magician.

And the best part? Both integrate seamlessly with Python, allowing you to enhance your workflows without a full rewrite.

Setting Up Your Environment

Before we start coding, make sure your environment is ready. For consistency, I used Pandas 2.2.0, Polars 0.20.0, and DuckDB 1.9.0.

Pinning versions can save you headaches when following tutorials or sharing code.

pip install pandas==2.2.0 polars==0.20.0 duckdb==1.9.0

    In Python, import the libraries:

    import pandas as pd
    import polars as pl
    import duckdb
    import warnings
    warnings.filterwarnings("ignore")
    

For illustration, I'll use an e-commerce sales dataset with columns such as order ID, product ID, region, country, revenue, and date. You can download similar datasets from Kaggle or generate synthetic data.
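If you'd rather not download anything, a small sketch like the following can generate a synthetic file with the same shape. The column names and value ranges here are my own assumptions for illustration, not a real Kaggle dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000  # bump this up to stress-test your machine

# synthetic e-commerce transactions with the columns used in this tutorial
df = pd.DataFrame({
    "order_id": np.arange(n),
    "product_id": rng.integers(1, 500, size=n),
    "region": rng.choice(["Europe", "Asia", "Americas"], size=n),
    "country": rng.choice(["Germany", "France", "Japan", "USA"], size=n),
    "revenue": rng.uniform(5, 500, size=n).round(2),
    "date": pd.Timestamp("2024-01-01")
            + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
})
df.to_csv("sales.csv", index=False)
```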

Loading Data

Loading data efficiently sets the tone for the rest of your workflow. I remember a project where the CSV file had nearly 5 million rows.

Pandas handled it, but the load times were long, and the repeated reloads during testing were painful.

It was one of those moments where you wish your laptop had a "fast forward" button.

Switching to Polars and DuckDB changed everything: suddenly, I could access and manipulate the data almost instantly, which made testing and iteration far more enjoyable.

    With Pandas:

df_pd = pd.read_csv("sales.csv")
print(df_pd.head(3))

    With Polars:

df_pl = pl.read_csv("sales.csv")
print(df_pl.head(3))

    With DuckDB:

con = duckdb.connect()
df_duck = con.execute("SELECT * FROM 'sales.csv'").df()
print(df_duck.head(3))

DuckDB can query CSVs directly without loading the entire dataset into memory, making it much easier to work with large files.

Filtering Data

The problem here is that filtering in Pandas can be slow when dealing with millions of rows. I once needed to analyze European transactions in a huge sales dataset. Pandas took minutes, which slowed down my analysis.

    With Pandas:

filtered_pd = df_pd[df_pd.region == "Europe"]

Polars is faster and can process multiple filters efficiently:

filtered_pl = df_pl.filter(pl.col("region") == "Europe")

DuckDB uses SQL syntax:

filtered_duck = con.execute("""
    SELECT *
    FROM 'sales.csv'
    WHERE region = 'Europe'
""").df()

Now you can filter through large datasets in seconds instead of minutes, leaving you more time to focus on the insights that really matter.

Aggregating Large Datasets Quickly

Aggregation is often where Pandas starts to feel sluggish. Imagine calculating total revenue per country for a marketing report.

In Pandas:

agg_pd = df_pd.groupby("country")["revenue"].sum().reset_index()

In Polars:

agg_pl = df_pl.group_by("country").agg(pl.col("revenue").sum())

In DuckDB:

agg_duck = con.execute("""
    SELECT country, SUM(revenue) AS total_revenue
    FROM 'sales.csv'
    GROUP BY country
""").df()

I remember running this aggregation on a ten-million-row dataset. In Pandas, it took nearly half an hour. Polars completed the same operation in under a minute.

The sense of relief was almost like finishing a marathon and realizing your legs still work.

Joining Datasets at Scale

Joining datasets is one of those things that sounds simple until you are actually knee-deep in the data.

In real projects, your data usually lives in multiple sources, so you have to combine them using shared columns like customer IDs.

I learned this the hard way while working on a project that required combining millions of customer orders with an equally large demographic dataset.

Each file was big enough on its own, but merging them felt like trying to force two puzzle pieces together while your laptop begged for mercy.

Pandas took so long that I began timing the joins the same way people time how long it takes their microwave popcorn to finish.

Spoiler: the popcorn won every time.

Polars and DuckDB gave me a way out.

With Pandas:

merged_pd = df_pd.merge(pop_df_pd, on="country", how="left")

Polars:

merged_pl = df_pl.join(pop_df_pl, on="country", how="left")

DuckDB:

merged_duck = con.execute("""
    SELECT *
    FROM 'sales.csv' s
    LEFT JOIN 'pop.csv' p
    USING (country)
""").df()

Joins on large datasets that used to freeze your workflow now run smoothly and efficiently.

Lazy Evaluation in Polars

One thing I didn't appreciate early in my data science journey was how much time gets wasted running transformations line by line.

Polars approaches this differently.

It uses a technique called lazy evaluation, which essentially waits until you have finished defining your transformations before executing any operations.

It examines your entire pipeline, determines the most efficient path, and executes everything at once.

It's like having a friend who listens to your whole order before walking to the kitchen, instead of one who takes each instruction individually and keeps going back and forth.

This TDS article explains lazy evaluation in depth.

Here's what the flow looks like:

Pandas:

df = df[df["amount"] > 100]
df = df.groupby("segment").agg({"amount": "mean"})
df = df.sort_values("amount")

Polars Lazy Mode:

import polars as pl

df_lazy = (
    pl.scan_csv("sales.csv")
      .filter(pl.col("amount") > 100)
      .group_by("segment")
      .agg(pl.col("amount").mean())
      .sort("amount")
)

result = df_lazy.collect()

The first time I used lazy mode, it felt strange not seeing instant results. But once I ran the final .collect(), the speed difference was obvious.

Lazy evaluation won't magically solve every performance issue, but it brings a level of efficiency that Pandas wasn't designed for.


Conclusion and Takeaways

Working with large datasets doesn't have to feel like wrestling with your tools.

Using Polars and DuckDB showed me that the problem wasn't always the data. Sometimes, it was the tool I was using to handle it.

If there is one thing you take away from this tutorial, let it be this: you don't have to abandon Pandas, but you can reach for something better when your datasets start pushing their limits.

Polars gives you speed as well as smarter execution, while DuckDB lets you query huge files as if they were tiny. Together, they make working with big data feel more manageable and less tiring.

If you want to go deeper into the ideas explored in this tutorial, the official documentation for Polars and DuckDB are good places to start.


