
    I Reduced My Pandas Runtime by 95% — Here’s What I Was Doing Wrong

    By Editor Times Featured · April 26, 2026 · 21 Mins Read


    I've been using Pandas for a while now. Nothing too crazy, though. Just basic data cleaning, exploratory data analysis, and a few essential functions. I've also explored things like method chaining for cleaner, more organized code, and operations that silently break your Pandas workflow, both of which I've written about before.

    I never really thought about runtime. Honestly, if my code ran without errors and gave me the output I needed, I was happy. Even if it took a few minutes for all my notebook cells to finish, I didn't care. No errors meant no problems, right?

    Then I came across the concept of vectorization. And something clicked.

    I went down the rabbit hole, as I usually do. The more I read, the more I realized that "no errors" and "efficient code" are two very different things. Your Pandas code can be completely correct and still be quietly terrible at scale.

    So this article is me documenting what I learned. The mistakes that slow Pandas code down, why they happen, how to fix them, and when Pandas itself might be the bottleneck. If you've ever run a notebook and just assumed the wait time was normal, this one's for you.

    Why "Working Code" Isn't Good Enough

    There's a reason this took me a while to think about. Pandas is designed to be forgiving. You can write code in a dozen different ways and most of them will work. You get your output, your dataframe looks right, and you move on.

    But that flexibility comes with a hidden cost.

    Unlike SQL or production-grade data systems, Pandas doesn't force you to think about efficiency. It doesn't warn you when you're doing something expensive. It just… does it. Slowly, sometimes. But it does it.

    Think about it this way. SQL has a query optimizer. It looks at what you're asking for and figures out the most efficient way to get it. Pandas doesn't have that. It trusts you to write efficient code. And if you don't know what efficient looks like, you'll never know you're missing it.

    The result is that a lot of Pandas code in the wild is what I'd call politely inefficient. It works on small datasets. It works on medium datasets with a little patience. But the moment you throw real-world data at it, something that's a few hundred thousand rows or more, the cracks start to show. What used to take seconds now takes minutes. What took minutes becomes unusable.

    And the frustrating part is that nothing looks wrong. No errors. No warnings. Just a slow notebook and a spinning cursor.

    That's the trap. Pandas optimizes for convenience, not speed. And convenience is great, until it isn't.

    So the first shift is a mindset one: working code and efficient code are not the same thing. Once that clicks, everything else follows.

    Profiling: Stop Guessing, Start Measuring

    Here's something I noticed while going down this rabbit hole. Most people, when they feel like their code is slow, do one of two things. They either rewrite the whole thing from scratch hoping something improves, or they just accept it and wait.

    Neither of those is the right move.

    The right move is to measure first. You can't optimize what you haven't identified. And more often than not, the part of your code you think is slow isn't actually the problem.

    Pandas gives you a few simple tools to start with.

    %timeit — Know How Long Things Actually Take

    %timeit is a Jupyter magic command that runs a line of code multiple times and gives you the average execution time. It's the simplest way to compare two approaches and know, concretely, which one is faster.

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({
        'sales': np.random.randint(100, 10000, size=100_000),
        'discount': np.random.uniform(0.0, 0.5, size=100_000)
    })
    
    # Approach A
    %timeit df.apply(lambda row: row['sales'] * row['discount'], axis=1)
    
    # Approach B
    %timeit df['sales'] * df['discount']

    On a dataset of 100,000 rows, the difference isn't subtle:

    1.91 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    316 μs ± 14 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    Same output. Completely different cost. That's the kind of thing you'd never notice by just running the cell once and moving on.

    df.info() and df.memory_usage() — Know What You're Carrying

    Speed isn't just about computation. Memory plays a huge role too. A dataframe that's bloated with the wrong data types will slow everything down before you've even written a single transformation.

    df.info()

    Output:

    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000 entries, 0 to 99999
    Data columns (total 2 columns):
     #   Column    Non-Null Count   Dtype  
    ---  ------    --------------   -----  
     0   sales     100000 non-null  int64  
     1   discount  100000 non-null  float64
    dtypes: float64(1), int64(1)
    memory usage: 1.5 MB

    To check the memory usage:

    df.memory_usage(deep=True)

    Output:

    Index          132
    sales       400000
    discount    800000
    dtype: int64

    Here, we can see that discount is taking up twice the space. That's because discount is stored as a "heavier" number type (float64) while sales is stored in a "lighter" type (int32).

    This becomes especially important when you're working with string columns or object types that are secretly eating memory. We'll come back to this in the next section.

    The Profiling Mindset

    The tools themselves are simple. The shift is in how you approach your code. Before you optimize anything, ask: where is the time actually going? Measure the slow parts. Compare alternatives. Let the numbers tell you what to fix.

    Because what feels slow and what is slow are often two different things entirely.

    Mistake #1: Row-wise Operations (The Silent Killer)

    If there's one thing I kept seeing come up over and over while researching this topic, it was this: people looping through Pandas dataframes row by row. And I get it. It feels natural. You think about your data one row at a time, so you write code that processes it one row at a time.

    The problem is, that's not how Pandas thinks.

    How Pandas Actually Works

    Pandas is built on top of NumPy, which stores data in contiguous blocks of memory, column by column. This means Pandas is heavily optimized to operate on entire columns at once. When you do that, it runs fast, low-level, vectorized operations under the hood.

    When you loop through rows instead, you're essentially bypassing all of that. You're dropping down into pure Python, one row at a time, with all the overhead that comes with it. On a small dataset you'll never notice. On a large one, you'll be waiting a long time.
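    Using the same sales/discount dataframe from the %timeit example above, here's a small illustrative check (a sketch, not from the original article): each column is backed by a contiguous NumPy array, and a whole-column expression is handed to NumPy in one call rather than evaluated row by row in Python.

    # Each column is a contiguous NumPy array under the hood
    sales_array = df['sales'].to_numpy()
    print(type(sales_array))                  # <class 'numpy.ndarray'>
    print(sales_array.flags['C_CONTIGUOUS'])  # True
    
    # A whole-column expression is one array-level operation, not a Python loop
    discounted = df['sales'].to_numpy() * (1 - df['discount'].to_numpy())
    print(discounted[:5])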

    There are two patterns that show up constantly.

    .iterrows()

    # Calculating a discounted price row by row
    discounted_prices = []
    
    for index, row in df.iterrows():
        discounted_prices.append(row['sales'] * (1 - row['discount']))
    
    df['discounted_price'] = discounted_prices

    This works. It will give you the right answer. But on a dataframe with 100,000 rows, it's painfully slow.

    %timeit [row['sales'] * (1 - row['discount']) for index, row in df.iterrows()]

    Output:

    10.2 s ± 1.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

    .apply(axis=1)

    This one is sneakier because it looks more "Pandas-like." But applying a function across axis=1 means applying it row by row, which is essentially the same problem.

    %timeit df.apply(lambda row: row['sales'] * (1 - row['discount']), axis=1)

    Output:

    1.5 s ± 88.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    Faster than .iterrows(), but still working row by row. Still slow.

    The Fix: Vectorized Operations

    Here's the same calculation, done the way Pandas actually wants you to do it:

    df['discounted_price'] = df['sales'] * (1 - df['discount'])

    Let's time it:

    %timeit df['sales'] * (1 - df['discount'])

    Output:

    688 μs ± 236 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    That's it. One line. No loop. No lambda. And it's roughly 14,800x faster than .iterrows() and 2,180x faster than .apply(axis=1).

    What's happening here is that Pandas passes the full column to NumPy, which executes the operation at the C level across the whole array at once. No Python overhead. No row-by-row iteration. Just fast, low-level computation.

    When .apply() Is Actually Fine

    To be fair, .apply() isn't always the villain. When you're applying a function column-wise (axis=0, which is the default), it's often perfectly reasonable. The issue is specifically axis=1, which forces row-by-row execution.

    And sometimes your logic is genuinely complex enough that a clean vectorized expression isn't obvious. In those cases, np.vectorize() or np.where() can give you something closer to vectorized performance while still letting you express conditional logic clearly.

    # Instead of this
    df['category'] = df.apply(
        lambda row: 'high' if row['sales'] > 5000 else 'low', axis=1
    )
    
    # Do this
    df['category'] = np.where(df['sales'] > 5000, 'high', 'low')
    %timeit df.apply(lambda row: 'high' if row['sales'] > 5000 else 'low', axis=1)
    %timeit np.where(df['sales'] > 5000, 'high', 'low')

    Output:

    1.31 s ± 189 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    1.3 ms ± 180 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    Same result. About 1,000x faster.
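    Since np.vectorize() was mentioned above but not shown, here's a minimal sketch of how it might look when the branching is too messy for a single np.where. The label_order helper is hypothetical, and keep in mind that np.vectorize is essentially a convenience loop under the hood, so the gains usually land somewhere between .apply(axis=1) and a true vectorized expression.

    # Hypothetical helper: logic that's awkward to write as one array expression
    def label_order(sales, discount):
        if discount > 0.4:
            return 'clearance'
        elif sales > 5000:
            return 'high'
        return 'low'
    
    # np.vectorize applies the function element-wise over the two columns
    df['category'] = np.vectorize(label_order)(df['sales'], df['discount'])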

    The Rule of Thumb

    If you're writing a loop over rows in Pandas, stop and ask yourself: can this be expressed as a column operation? Nine times out of ten, the answer is yes. And when it is, the performance difference is transformative.

    If you're looping through rows, you're not using Pandas. You're using Python with extra steps.

    Mistake #2: Unnecessary Copies and Memory Bloat

    Row-wise operations get a lot of attention when people talk about Pandas performance. Memory gets a lot less. Which is a shame, because from what I've read, bloated memory is just as responsible for slow notebooks as bad computation.

    Here's the thing. Pandas operations don't always modify your dataframe in place. Plenty of them quietly create a brand new copy of your data behind the scenes. Do that enough times, and you're not just holding one dataframe in memory. You're holding several, all at once, without realizing it.

    The Hidden Cost of Chained Operations

    Chained operations are a common culprit. They look clean and readable, but each step can generate an intermediate copy that sits in memory until garbage collection cleans it up.

    # Each step here potentially creates a new copy
    df2 = df[df['sales'] > 1000]
    df3 = df2.dropna()
    df4 = df3.reset_index(drop=True)
    df5 = df4[['sales', 'discount']]

    By the time you get to df5, you potentially have five versions of your data floating around in memory simultaneously. On a small dataset this is invisible. On a large one, this is how you run out of RAM.
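    One way to soften this (a sketch, not from the original article) is to chain the steps into a single expression assigned to one name. The intermediates still get created while the expression evaluates, but nothing keeps them pinned to a variable, so each one can be garbage-collected as soon as the next step is done with it:

    # One name, one surviving dataframe; intermediates are free to be collected
    df_clean = (
        df[df['sales'] > 1000]
        .dropna()
        .reset_index(drop=True)
        .loc[:, ['sales', 'discount']]
    )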

    Temporary Columns That Stick Around

    Another pattern that quietly eats memory is creating columns you only needed briefly.

    df['gross_revenue'] = df['sales'] * df['quantity']
    df['tax'] = df['gross_revenue'] * 0.075
    df['net_revenue'] = df['gross_revenue'] - df['tax']
    
    # But you only actually needed net_revenue

    gross_revenue and tax are now permanent columns in your dataframe, taking up memory for the rest of your notebook even though they were just stepping stones.

    The fix is simple. Either compute directly:

    df['net_revenue'] = (df['sales'] * df['quantity']) * (1 - 0.075)

    Or drop them as soon as you're done:

    df.drop(columns=['gross_revenue', 'tax'], inplace=True)

    Wrong Data Types Are Quietly Expensive

    This one surprised me when I came across it. By default, Pandas is quite generous with how much memory it assigns to each column. Integer columns get int64. Float columns get float64. String columns become the object type, which is one of the most memory-hungry types in Pandas.

    Let's see what that actually looks like:

    df = pd.DataFrame({
        'order_id': np.random.randint(1000, 9999, size=100_000),
        'sales': np.random.randint(100, 10000, size=100_000),
        'discount': np.random.uniform(0.0, 0.5, size=100_000),
        'region': np.random.choice(['north', 'south', 'east', 'west'], size=100_000)
    })
    
    df.memory_usage(deep=True)

    Output

    Index           132
    order_id     400000
    sales        400000
    discount     800000
    region      5350066
    dtype: int64

    That region column, which only has four possible values, is eating 5.3MB as an object type. Convert it to a categorical and watch what happens:

    df['region'] = df['region'].astype('category')
    df.memory_usage(deep=True)

    Output:

    Index          132
    order_id    400000
    sales       400000
    discount    800000
    region      100386
    dtype: int64

    From 5.3MB down to about 100KB. For one column. The same logic applies to integer columns where you don't need the full int64 range. If your values fit comfortably in int32 or even int16, downcasting saves real memory.

    df['sales'] = df['sales'].astype('int32')
    df['order_id'] = df['order_id'].astype('int32')
    
    df.memory_usage(deep=True)

    Output:

    Index          128
    order_id    400000
    sales       400000
    discount    800000
    region      100563
    dtype: int64

    A few small type changes and your dataframe is already noticeably lighter. And a lighter dataframe means faster operations across the board, because there's simply less data to move around.
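    If you'd rather not pick the integer width by hand, pd.to_numeric can downcast for you. A minimal sketch (not from the original article); for this data the values fit in int16:

    # Downcast each column to the smallest integer type that holds its values
    df['sales'] = pd.to_numeric(df['sales'], downcast='integer')
    df['order_id'] = pd.to_numeric(df['order_id'], downcast='integer')
    
    print(df.dtypes)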

    The Quick Memory Check Habit

    Before you run any heavy transformation, it's worth knowing what you're working with:

    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

    It takes one second and it tells you exactly how much memory your dataframe is consuming at that point. Make it a habit before and after major transformations and you'll quickly develop an intuition for when something is heavier than it should be.
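    If it helps make the habit stick, here's a tiny helper you could keep around (mem_mb is a hypothetical name, just a sketch of the same check wrapped in a function):

    def mem_mb(frame):
        """Return a dataframe's memory footprint in megabytes."""
        return frame.memory_usage(deep=True).sum() / 1024**2
    
    print(f"before: {mem_mb(df):.2f} MB")
    df['region'] = df['region'].astype('category')  # or any other transformation
    print(f"after:  {mem_mb(df):.2f} MB")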

    The Insight

    Slow code isn't always about computation. Sometimes your notebook is slow because it's carrying far more data than it needs to, in formats that are far more expensive than necessary. Trimming memory isn't glamorous work, but it compounds. A dataframe that's lighter to store is faster to filter, faster to merge, faster to transform.

    Memory and speed are not separate problems. They're the same problem.

    Mistake #3: Overusing Pandas for Everything

    This one is a little different from the previous two. It's not about a specific function or a bad habit. It's about understanding the limits of your tool.

    Pandas is genuinely great. For most data tasks, especially at the scale most people are working at, it's more than enough. But there's a version of Pandas usage that I kept seeing described while researching this: people reaching for Pandas by default, for everything, regardless of whether it's the right fit.

    And at a certain scale, that becomes a problem.

    The Dataset

    To make this real, I generated a synthetic e-commerce dataset with 1 million rows. Nothing exotic, just the kind of data you'd realistically encounter: orders, dates, regions, categories, sales figures, discounts, quantities and statuses.

    import pandas as pd
    import numpy as np
    
    np.random.seed(42)
    
    n = 1_000_000
    
    regions = ['north', 'south', 'east', 'west']
    categories = ['electronics', 'clothing', 'furniture', 'food', 'sports']
    statuses = ['completed', 'returned', 'pending', 'cancelled']
    
    df = pd.DataFrame({
        'order_id': np.arange(1000, 1000 + n),
        'order_date': pd.date_range(start='2022-01-01', periods=n, freq='1min'),
        'region': np.random.choice(regions, size=n),
        'category': np.random.choice(categories, size=n),
        'sales': np.random.randint(100, 10000, size=n),
        'quantity': np.random.randint(1, 20, size=n),
        'discount': np.round(np.random.uniform(0.0, 0.5, size=n), 2),
        'status': np.random.choice(statuses, size=n),
    })
    
    df.to_csv('large_sales_data.csv', index=False)

    A million rows. Saved to a CSV. This is the dataset we'll be working with for the rest of the article.

    Where Pandas Starts to Struggle

    Pandas loads your entire dataset into memory. That's fine when your data is a few hundred thousand rows. It starts to get uncomfortable at a few million. And beyond that, you're fighting the tool.

    The other issue is complex, nested transformations where you're stacking multiple operations, creating intermediate results, and generally asking Pandas to do a lot of heavy lifting in sequence. Each step adds overhead. The costs stack up.

    Here's a practical example using our dataset. Say you need to calculate a rolling average of sales per region, flag orders above a threshold, then aggregate by month:

    # Step 1: Sort
    df = df.sort_values(['region', 'order_date'])
    
    # Step 2: Rolling average per region
    df['rolling_avg'] = (
        df.groupby('region')['sales']
        .transform(lambda x: x.rolling(window=7).mean())
    )
    
    # Step 3: Flag high-value orders
    df['high_value'] = df['sales'] > df['rolling_avg'] * 1.5
    
    # Step 4: Monthly aggregation
    df['month'] = pd.to_datetime(df['order_date']).dt.to_period('M')
    monthly_summary = df.groupby(['region', 'month'])['sales'].sum()

    This works. But notice that Step 2 uses .transform(lambda x: ...), which carries much of the same row-by-row cost we talked about earlier. On 1 million rows, this pipeline will drag. Go ahead and time it on your machine and you'll see exactly what I mean.
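    As an aside, the lambda in Step 2 isn't strictly required. One alternative sketch (assuming the frame is already sorted by region and date, as in Step 1) uses the built-in grouped rolling mean, which keeps the work inside Pandas' optimized code path:

    # Built-in rolling mean per region, no Python lambda in the inner loop
    df['rolling_avg'] = (
        df.groupby('region')['sales']
        .rolling(window=7)
        .mean()
        .droplevel(0)  # drop the added 'region' level so it aligns with df's index
    )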

    What to Reach For Instead

    The good news is you don't have to abandon Pandas entirely. There are a few options depending on the situation.

    Chunking
    If your dataset is too large to load all at once, Pandas lets you process it in chunks. Instead of loading all 1 million rows into memory at once, you load and process a portion at a time:

    chunk_size = 100_000
    results = []
    
    for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
        chunk['discounted_price'] = chunk['sales'] * (1 - chunk['discount'])
        results.append(chunk.groupby('region')['discounted_price'].sum())
    
    final_result = pd.concat(results).groupby(level=0).sum()
    print(final_result)

    Instead of asking Pandas to hold 1 million rows in memory simultaneously, you're feeding it 100,000 rows at a time, processing each chunk, and assembling the results at the end. It's not the most elegant pattern, but it lets you work with data that would otherwise crash your kernel.

    When to Consider Other Tools
    Sometimes the honest answer is that Pandas isn't the right tool for the job. That's not a criticism, it's just scope. A few worth knowing about:

    • Polars: A modern dataframe library built in Rust, designed for speed. It uses lazy evaluation, meaning it optimizes your entire query before executing it. For large datasets it can be dramatically faster than Pandas.
    • Dask: Extends Pandas to work in parallel across multiple cores or even multiple machines. If you're comfortable with Pandas syntax, Dask feels familiar.
    • DuckDB: Lets you run SQL queries directly on your dataframes or CSV files with surprisingly fast performance. Great for aggregations and analytical queries on large files.

    The point isn't to abandon Pandas. For most everyday data work, it's the right choice. The point is to recognize when you've hit its ceiling, and know that there are good options on the other side of it.
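    To give one concrete flavour, here's a minimal DuckDB sketch (assuming a reasonably recent duckdb package is installed) that runs a per-region revenue aggregation straight against the CSV without loading the whole file into Pandas first:

    import duckdb
    
    # DuckDB scans the CSV itself and hands the result back as a Pandas dataframe
    revenue_by_region = duckdb.sql("""
        SELECT region, SUM(sales * (1 - discount)) AS discounted_revenue
        FROM 'large_sales_data.csv'
        GROUP BY region
    """).df()
    
    print(revenue_by_region)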

    The Real-World Refactor: From 61 Seconds to 0.33 Seconds

    This is where everything we've covered stops being theoretical.

    I took our 1 million row e-commerce dataset and wrote the kind of Pandas code that feels completely normal. The kind of thing you'd write on a Tuesday afternoon without thinking twice.

    Then I timed it.

    The Slow Version

    import time
    
    df = pd.read_csv('large_sales_data.csv')
    
    start = time.time()
    
    # Row-wise revenue calculation
    df['gross_revenue'] = df.apply(
        lambda row: row['sales'] * row['quantity'], axis=1
    )
    df['tax'] = df.apply(
        lambda row: row['gross_revenue'] * 0.075, axis=1
    )
    df['net_revenue'] = df.apply(
        lambda row: row['gross_revenue'] - row['tax'], axis=1
    )
    
    # Row-wise flagging
    df['order_flag'] = df.apply(
        lambda row: 'high' if row['net_revenue'] > 50000 else 'low', axis=1
    )
    
    # Final aggregation
    result = df.groupby('region')['net_revenue'].sum()
    
    end = time.time()
    print(f"Total runtime: {end - start:.2f} seconds")

    Output:

    Total runtime: 61.78 seconds

    Over a minute. For a four-step pipeline. And nothing looks wrong. Let's break down exactly what's making it slow.

    Three mistakes, all in one pipeline:

    • First, the data types are never addressed. The region, category and status columns load as generic object types, which are memory-hungry and slow to work with. We're carrying that dead weight through every single operation.
    • Second, there are three separate .apply(axis=1) calls just to calculate revenue. Each one loops through all 1 million rows in Python, one at a time. We already saw in the section on row-wise operations how expensive that is. Here we're doing it three times in a row.
    • Third, gross_revenue and tax are created as permanent columns even though they're just intermediate steps. They serve no purpose beyond being stepping stones to net_revenue, but they sit in memory for the rest of the pipeline anyway.

    Here's how I'd fix it, step by step.

    Step 1: Fix data types upfront
    Before anything else, convert the obvious categorical columns:

    df['region'] = df['region'].astype('category')
    df['category'] = df['category'].astype('category')
    df['status'] = df['status'].astype('category')

    This alone reduces memory usage significantly and makes subsequent operations cheaper across the board.

    Step 2: Replace .apply() with vectorized operations
    Instead of three separate row-wise calls, one vectorized expression does the same work:

    # Before: three .apply() calls, three passes through 1 million rows
    df['gross_revenue'] = df.apply(lambda row: row['sales'] * row['quantity'], axis=1)
    df['tax'] = df.apply(lambda row: row['gross_revenue'] * 0.075, axis=1)
    df['net_revenue'] = df.apply(lambda row: row['gross_revenue'] - row['tax'], axis=1)
    
    # After: one vectorized expression, no temporary columns
    df['net_revenue'] = df['sales'] * df['quantity'] * (1 - 0.075)

    Step 3: Replace row-wise flagging with np.where()

    # Before
    df['order_flag'] = df.apply(
        lambda row: 'high' if row['net_revenue'] > 50000 else 'low', axis=1
    )
    
    # After
    df['order_flag'] = np.where(df['net_revenue'] > 50000, 'high', 'low')

    Same logic. Vectorized. Done.

    The Fast Version

    Put it all together and the pipeline looks like this:

    import time
    
    df = pd.read_csv('large_sales_data.csv')
    
    start = time.time()
    
    # Fix 1: Correct data types upfront
    df['region'] = df['region'].astype('category')
    df['category'] = df['category'].astype('category')
    df['status'] = df['status'].astype('category')
    
    # Fix 2: Vectorized revenue calculation, no temporary columns
    df['net_revenue'] = df['sales'] * df['quantity'] * (1 - 0.075)
    
    # Fix 3: Vectorized flagging with np.where
    df['order_flag'] = np.where(df['net_revenue'] > 50000, 'high', 'low')
    
    # Final aggregation
    result = df.groupby('region')['net_revenue'].sum()
    
    end = time.time()
    print(f"Total runtime: {end - start:.2f} seconds")

    Output:

    Total runtime: 0.33 seconds

    61.78 seconds down to 0.33 seconds. A 99.5% reduction in runtime. That's roughly 187x faster.

    It's not a trick. That's just how Pandas is meant to be used.

    Before You Run Your Next Notebook

    Everything we covered comes down to a few core habits. Not rules. Not tricks. Just a different way of thinking about your code before you write it.

    • Think in columns, not rows. If you're looping through a dataframe row by row, stop and ask whether the same thing can be expressed as a column operation. Nine times out of ten, it can.
    • Measure before you optimize. Don't guess where the slowness is coming from. Use %timeit and df.memory_usage() to let the numbers tell you what to fix.
    • Watch your memory, not just your speed. Wrong data types, unnecessary copies and temporary columns all add up. A lighter dataframe is a faster dataframe.
    • Know when to switch tools. Pandas is the right choice most of the time. But at a certain scale, the best optimization is recognizing that you've outgrown it.

    I started down this rabbit hole because I kept seeing the same conversation come up in data communities. People frustrated with slow notebooks, code that worked fine on small data and fell apart on real data. I wanted to understand why.

    What I found was that the code wasn't broken. It just wasn't built to scale. And the gap between code that works and code that works well isn't about being an advanced Pandas user. It's about a handful of habits applied consistently.

    If you've ever waited too long for a notebook to finish and just assumed that was normal, now you know it doesn't have to be.

    If this changed how you think about your Pandas code, I'd love to hear what bottlenecks you've been dealing with. Feel free to say hi on any of these platforms:

    Medium

    LinkedIn

    Twitter

    YouTube



    Source link
