
    A Tale of Two Variances: Why NumPy and Pandas Give Different Answers

    By Editor Times Featured · March 14, 2026 · 8 Mins Read


    Imagine you are analyzing a small dataset:

    X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]

    You want to calculate some summary statistics to get an idea of the distribution of this data, so you use NumPy to compute the mean and variance.

    import numpy as np
    
    X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
    mean = np.mean(X)
    var = np.var(X)
    
    print(f"Mean={mean:.2f}, Variance={var:.2f}")

    Your output looks like this:

    Mean=10.00, Variance=10.60

    Great! Now you have an idea of the distribution of your data. However, a colleague comes along and tells you that they also calculated summary statistics on this same dataset, using the following code:

    import pandas as pd
    
    X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])
    mean = X.mean()
    var = X.var()
    
    print(f"Mean={mean:.2f}, Variance={var:.2f}")

    Their output looks like this:

    Mean=10.00, Variance=11.78

    The means are the same, but the variances are different! What gives?

    This discrepancy arises because NumPy and pandas use different default formulas for calculating the variance of an array. This article will mathematically define the two variances, explain why they differ, and show how to use either formula in several numerical libraries.


    Two Definitions

    There are two standard ways to calculate the variance, each meant for a different purpose. It comes down to whether you are calculating the variance of an entire population (the whole group you are studying) or of only a sample (a smaller subset of that population that you actually have data for).

    The population variance, σ², is defined as:

    \[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

    While the sample variance, s², is defined as:

    \[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

    (Note: xᵢ represents each individual data point in your dataset, N is the total number of data points in a population, n is the total number of data points in a sample, and x̄ is the sample mean.)

    Notice the two key differences between these equations:

    1. In the numerator's sum, σ² is calculated using the population mean, μ, whereas s² is calculated using the sample mean, x̄.
    2. In the denominator, σ² divides by the total population size, N, whereas s² divides by the sample size minus one, n−1.

    It should be noted that the distinction between these two definitions matters most for small sample sizes. As n grows, the difference between dividing by n and dividing by n−1 becomes less and less significant.
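This shrinking gap is easy to demonstrate. The sketch below (using NumPy's ddof parameter, covered in detail later in the article) compares the two estimators on random samples of increasing size; their ratio is exactly n / (n − 1), which approaches 1 as n grows:

```python
import numpy as np

# Compare the two estimators on random samples of increasing size.
# The sample variance exceeds the population variance by exactly
# the factor n / (n - 1), which approaches 1 as n grows.
rng = np.random.default_rng(0)
for n in (10, 100, 10_000):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    pop_var = np.var(sample, ddof=0)   # divide by n
    samp_var = np.var(sample, ddof=1)  # divide by n - 1
    print(f"n={n:>6}: ratio = {samp_var / pop_var:.4f}")
    # prints ratios 1.1111, 1.0101, 1.0001
```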


    Why Are They Different?

    When calculating the population variance, you are assumed to have all the data. You know the exact center (the population mean μ) and exactly how far every point is from that center. Dividing by the total number of data points N gives the true average of those squared deviations.

    However, when calculating the sample variance, you are not assumed to have all the data, so you do not know the true population mean μ. Instead, you only have an estimate of μ: the sample mean x̄. It turns out that using the sample mean in place of the true population mean tends to underestimate the true population variance on average.

    This happens because the sample mean is calculated directly from the sample data, meaning it sits at the exact mathematical center of that particular sample. As a result, the sum of squared deviations of your sample from its own sample mean is always at least as small as the sum of squared deviations from the true population mean, leading to an artificially smaller numerator.

    To correct for this underestimation, we apply what is known as Bessel's correction (named for the German mathematician Friedrich Wilhelm Bessel): we divide not by n but by the slightly smaller n−1, and dividing by a smaller number makes the final variance slightly larger.
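A quick simulation makes the bias concrete. The sketch below draws many small samples from a population whose true variance is exactly 1, then averages each estimator over all draws (the sample size n = 5 and trial count are arbitrary choices for illustration):

```python
import numpy as np

# Draw many small samples from a standard normal population
# (true variance = 1) and average each estimator over all draws.
rng = np.random.default_rng(42)
n, trials = 5, 100_000
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))

divide_by_n = np.var(samples, axis=1, ddof=0).mean()
divide_by_n_minus_1 = np.var(samples, axis=1, ddof=1).mean()

print(f"divide by n:     {divide_by_n:.3f}")          # about 0.8 -- biased low
print(f"divide by n - 1: {divide_by_n_minus_1:.3f}")  # about 1.0 -- unbiased
```

The uncorrected estimator comes out near (n − 1)/n = 0.8 on average, exactly the shortfall that Bessel's correction cancels.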

    Degrees of Freedom

    So why divide by n−1 and not n−2 or n−3 or some other correction that also increases the final variance? That comes down to a concept called degrees of freedom.

    The degrees of freedom refers to the number of independent values in a calculation that are free to vary. For example, imagine you have a set of three numbers, (x₁, x₂, x₃). You do not know their values, but you do know that their sample mean is x̄ = 10.

    • The first number x₁ could be anything (let's say 8).
    • The second number x₂ could also be anything (let's say 15).
    • Because the mean must be 10, x₃ is not free to vary: it must be the one number that makes x̄ = 10, which in this case is 7.

    So in this example, although there are 3 numbers, there are only two degrees of freedom, because enforcing the sample mean removes one number's freedom to vary.

    In the context of variance, before making any calculations, we start with n degrees of freedom (corresponding to our n data points). Calculating the sample mean x̄ essentially uses up one degree of freedom, so by the time the sample variance is calculated, there are n−1 degrees of freedom left to work with, which is why n−1 appears in the denominator.
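The three-number example above can be written out directly: once the mean is fixed, the third value is fully determined by the first two.

```python
# With the sample mean fixed at 10, only two of the three
# values are free to vary; the third is forced.
target_mean = 10
x1, x2 = 8, 15                    # free to vary
x3 = 3 * target_mean - x1 - x2    # forced: here, 7
print(x3)  # 7
assert (x1 + x2 + x3) / 3 == target_mean
```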


    Library Defaults and How to Align Them

    Now that we understand the math, we can finally resolve the mystery from the beginning of the article! NumPy and pandas gave different results because they default to different variance formulas.

    Many numerical libraries control this with a parameter called ddof, which stands for Delta Degrees of Freedom. It is the value subtracted from the total number of observations in the denominator.

    • Setting ddof=0 divides by n, calculating the population variance.
    • Setting ddof=1 divides by n−1, calculating the sample variance.

    The same parameter applies when calculating the standard deviation, which is just the square root of the variance.
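The two settings are also linked by a simple conversion factor, which the sketch below verifies on the article's dataset:

```python
import numpy as np

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
n = len(X)

# The two variances differ only by the factor n / (n - 1)...
pop_var = np.var(X, ddof=0)    # 10.60
samp_var = np.var(X, ddof=1)   # 11.78 (rounded)
assert np.isclose(pop_var * n / (n - 1), samp_var)

# ...and the standard deviations by its square root.
assert np.isclose(np.std(X, ddof=0) * np.sqrt(n / (n - 1)),
                  np.std(X, ddof=1))
```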

    Here is a breakdown of how different popular libraries handle these defaults and how you can override them:

    numpy

    By default, NumPy assumes you are calculating the population variance (ddof=0). If you are working with a sample and want to apply Bessel's correction, you must explicitly pass ddof=1.

    import numpy as np
    X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
    
    # Sample variance and standard deviation
    np.var(X, ddof=1)
    np.std(X, ddof=1)
    
    # Population variance and standard deviation (default)
    np.var(X)
    np.std(X)

    pandas

    By default, pandas takes the opposite approach. It assumes your data is a sample and calculates the sample variance (ddof=1). To calculate the population variance instead, you must pass ddof=0.

    import pandas as pd
    X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])
    
    # Sample variance and standard deviation (default)
    X.var()
    X.std()
    
    # Population variance and standard deviation
    X.var(ddof=0)
    X.std(ddof=0)
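As a sanity check, once ddof is made explicit the two libraries agree on both numbers from the opening example:

```python
import numpy as np
import pandas as pd

data = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
s = pd.Series(data)

# With ddof made explicit, NumPy and pandas agree:
assert np.isclose(np.var(data, ddof=1), s.var())        # sample: ~11.78
assert np.isclose(np.var(data, ddof=0), s.var(ddof=0))  # population: 10.60
```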

    Python's Built-in statistics Module

    Python's standard library does not use a ddof parameter. Instead, it provides explicitly named functions, so there is no ambiguity about which formula is being used.

    import statistics
    X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
    
    # Sample variance and standard deviation
    statistics.variance(X)
    statistics.stdev(X)
    
    # Population variance and standard deviation
    statistics.pvariance(X)
    statistics.pstdev(X)

    R

    In R, the standard var() and sd() functions calculate the sample variance and sample standard deviation by default. Unlike the Python libraries, R does not have a built-in argument to switch to the population formula. To calculate the population variance, you must manually multiply the sample variance by (n−1)/n.

    X <- c(15, 8, 13, 7, 7, 12, 15, 6, 8, 9)
    n <- length(X)
    
    # Sample variance and standard deviation (default)
    var(X)
    sd(X)
    
    # Population variance and standard deviation
    var(X) * ((n - 1) / n)
    sd(X) * sqrt((n - 1) / n)

    Conclusion

    This article explored a frustrating yet often overlooked quirk of statistical programming languages and libraries: they choose different default definitions of variance and standard deviation. We saw an example where, for the same input array, NumPy and pandas return different variances by default.

    The difference comes down to whether the variance is calculated for the entire statistical population being studied or estimated from only a sample of that population, with different libraries making different choices about the default. Finally, we showed that although each library has its own default, all of them can compute both kinds of variance, whether through a ddof argument, a differently named function, or a simple mathematical transformation.

    Thanks for reading!


