
    When 50/50 Isn’t Optimal: Debunking Even Rebalancing

    By Editor Times Featured, July 24, 2025 (7 min read)


    for an Old Problem

    You may be training your model for spam detection. Your dataset has many more positives than negatives, so you invest countless hours of work to rebalance it to a 50/50 ratio. Now you are satisfied because you were able to address the class imbalance. What if I told you that 60/40 could have been not only enough, but even better?

    In most machine learning classification applications, the number of instances of one class outnumbers that of the other classes. This slows down learning [1] and can potentially induce biases in the trained models [2]. The most widely used methods to address this rely on a simple prescription: finding a way to give all classes the same weight. Most often, this is done through simple methods such as giving more importance to minority class examples (reweighting), removing majority class examples from the dataset (undersampling), or including minority class instances more than once (oversampling).
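    The three prescriptions above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the cited papers: the function names `class_weights`, `undersample`, and `oversample` and the toy 10/90 dataset are assumptions made for the example.

```python
import numpy as np

def class_weights(y):
    """Reweighting: each example is weighted inversely to its class frequency."""
    classes, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    per_class = len(y) / (len(classes) * counts)
    return per_class[inverse]

def undersample(X, y, rng):
    """Undersampling: drop majority examples until every class has the minority count."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

def oversample(X, y, rng):
    """Oversampling: keep everything and repeat minority examples up to the majority count."""
    classes, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    parts = []
    for c, count in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=n_target - count, replace=True)
        parts.append(np.concatenate([members, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

# Toy dataset (hypothetical): 10 examples of class 0, 90 of class 1.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 90)
X = np.arange(100).reshape(-1, 1)

w = class_weights(y)
Xu, yu = undersample(X, y, rng)
Xo, yo = oversample(X, y, rng)
print(np.bincount(yu), np.bincount(yo))
```

    All three variants target equal class weight, which is exactly the assumption the rest of the article questions.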

    The validity of these methods is often discussed, with both theoretical and empirical work indicating that which solution works best depends on your specific application [3]. However, there is a hidden hypothesis that is seldom discussed and too often taken for granted: Is rebalancing even a good idea? To some extent, these methods work, so the answer is yes. But should we fully rebalance our datasets? To make it simple, let us take a binary classification problem. Should we rebalance our training data to have 50% of each class? Intuition says yes, and intuition has guided practice until now. In this case, intuition is wrong. For intuitive reasons.

    What Do We Mean by ‘Training Imbalance’?

    Before we delve into how and why 50% is not the optimal training imbalance in binary classification, let us define some relevant quantities. We call n₀ the number of instances of one class (usually, the minority class), and n₁ those of the other class. This way, the total number of data instances in the training set is n=n₀+n₁. The quantity we analyze today is the training imbalance,

    ρ⁽ᵗʳᵃⁱⁿ⁾ = n₀/n .
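    As a small worked example of this definition, the sketch below computes ρ⁽ᵗʳᵃⁱⁿ⁾ from a label array and inverts the formula to get the majority count n₁ needed to hit a target imbalance. The helper names and the 30/70 toy labels are assumptions for illustration; class 0 is taken to be the class counted by n₀.

```python
import numpy as np

def training_imbalance(y, label=0):
    """Return rho_train = n0 / n, where n0 counts the examples carrying `label`."""
    y = np.asarray(y)
    n0 = int(np.sum(y == label))
    return n0 / y.size

def majority_count_for(n0, rho):
    """Invert rho = n0 / (n0 + n1) to get n1, rounded to the nearest integer."""
    return round(n0 * (1 - rho) / rho)

# Toy labels (hypothetical): 30 examples of class 0 and 70 of class 1.
y_train = [0] * 30 + [1] * 70
rho_train = training_imbalance(y_train)   # 30 / 100
n1_needed = majority_count_for(30, 0.43)  # majority examples to keep for rho = 43%
print(rho_train, n1_needed)
```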

    Evidence that 50% Is Suboptimal

    Initial evidence comes from empirical work on random forests. Kamalov and collaborators measured the optimal training imbalance, ρ⁽ᵒᵖᵗ⁾, on 20 datasets [4]. They find that its value varies from problem to problem, but conclude that it is roughly ρ⁽ᵒᵖᵗ⁾=43%. This means that, according to their experiments, you want slightly more majority than minority class examples. This is however not the full story. If you want to aim at optimal models, do not stop here and immediately set your ρ⁽ᵗʳᵃⁱⁿ⁾ to 43%.

    In fact, this year, theoretical work by Pezzicoli et al. [5] showed that the optimal training imbalance is not a universal value that is valid for all applications. It is not 50% and it is not 43%. It turns out that the optimal imbalance varies. It can sometimes be smaller than 50% (as Kamalov and collaborators measured), and sometimes larger than 50%. The specific value of ρ⁽ᵒᵖᵗ⁾ will depend on the details of each specific classification problem. One way to find ρ⁽ᵒᵖᵗ⁾ is to train the model for several values of ρ⁽ᵗʳᵃⁱⁿ⁾, and measure the related performance. This could for example look like this:

    Image by author
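    The sweep over ρ⁽ᵗʳᵃⁱⁿ⁾ described above can be sketched as follows. Everything specific here is an assumption made for illustration: the synthetic two-Gaussian data, the plain gradient-descent logistic regression, the balanced-accuracy metric, and the fixed training size of 1000. It shows the procedure, not the experiment behind the figure.

```python
import numpy as np

def make_data(n0, n1, rng):
    """Synthetic 2D data (hypothetical): class 0 is shifted, class 1 is centered."""
    X0 = rng.normal(loc=1.5, scale=1.0, size=(n0, 2))
    X1 = rng.normal(loc=0.0, scale=1.0, size=(n1, 2))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    return X, y

def subsample_to_imbalance(X, y, rho, n_total, rng):
    """Draw n_total points so that a fraction rho of them belongs to class 0."""
    n0 = int(round(rho * n_total))
    idx0 = rng.choice(np.flatnonzero(y == 0), size=n0, replace=False)
    idx1 = rng.choice(np.flatnonzero(y == 1), size=n_total - n0, replace=False)
    idx = np.concatenate([idx0, idx1])
    return X[idx], y[idx]

def fit_logistic(X, y, lr=0.1, steps=500):
    """Logistic regression trained with full-batch gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)

def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recalls, so both classes count equally."""
    recalls = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
    return float(np.mean(recalls))

rng = np.random.default_rng(0)
X_pool, y_pool = make_data(2000, 2000, rng)   # pool to subsample from
X_test, y_test = make_data(1000, 1000, rng)   # held-out evaluation set

rhos = np.linspace(0.1, 0.9, 9)
scores = []
for rho in rhos:
    X_tr, y_tr = subsample_to_imbalance(X_pool, y_pool, rho, n_total=1000, rng=rng)
    w = fit_logistic(X_tr, y_tr)
    scores.append(balanced_accuracy(y_test, predict(w, X_test)))

rho_best = rhos[int(np.argmax(scores))]
print(dict(zip(np.round(rhos, 2), np.round(scores, 3))), rho_best)
```

    In practice you would repeat each value of ρ⁽ᵗʳᵃⁱⁿ⁾ over several random seeds, since the differences between neighboring values can be smaller than the run-to-run noise.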

    Although the precise patterns determining ρ⁽ᵒᵖᵗ⁾ are still unclear, it seems that when data is abundant compared to the model size, the optimal imbalance is smaller than 50%, as in Kamalov’s experiments. However, many other factors (from how intrinsically rare minority instances are, to how noisy the training dynamics are) come together to set the optimal value of the training imbalance, and to determine how much performance is lost when one trains away from ρ⁽ᵒᵖᵗ⁾.

    Why Perfect Balance Isn’t Always Best

    As we said, the answer is actually intuitive: as different classes have different properties, there is no reason why both classes would carry the same information. In fact, Pezzicoli’s team proved that they usually do not. Therefore, to infer the best decision boundary we might need more instances of one class than of the other. Pezzicoli’s work, which is in the context of anomaly detection, provides us with a simple and insightful example.

    Let us assume that the data comes from a multivariate Gaussian distribution, and that we label all the points to the right of a decision boundary as anomalies. In 2D, it would look like this:

    Image by author, inspired by [5]

    The dashed line is our decision boundary, and the points on the right of the decision boundary are the n₀ anomalies. Let us now rebalance our dataset to ρ⁽ᵗʳᵃⁱⁿ⁾=0.5. To do so, we need to find more anomalies. Since the anomalies are rare, those that we are most likely to find are close to the decision boundary. Already by eye, the scenario is strikingly clear:

    Image by author, inspired by [5]

    Anomalies, in yellow, are stacked along the decision boundary, and are therefore more informative about its position than the blue points. This might lead one to think that it is better to privilege minority class points. On the other hand, anomalies only cover one side of the decision boundary, so once one has enough minority class points, it can become convenient to invest in more majority class points, in order to better cover the other side of the decision boundary. As a consequence of these two competing effects, ρ⁽ᵒᵖᵗ⁾ is generally not 50%, and its exact value is problem dependent.
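    The first of these two effects is easy to reproduce numerically. The sketch below is a rough illustration under stated assumptions, not the exactly solvable model of [5]: it draws 2D standard Gaussian points, places a hypothetical boundary at x = 2, keeps sampling until both classes have 500 points (ρ⁽ᵗʳᵃⁱⁿ⁾=0.5), and compares how far each class sits from the boundary.

```python
import numpy as np

THRESHOLD = 2.0  # hypothetical boundary position x = 2; not a value taken from [5]

def sample_balanced(n_per_class, rng, batch=10_000):
    """Keep drawing Gaussian points until both classes reach n_per_class."""
    anomalies, normals = [], []
    n_anom = n_norm = 0
    while n_anom < n_per_class or n_norm < n_per_class:
        pts = rng.standard_normal(size=(batch, 2))
        is_anom = pts[:, 0] > THRESHOLD  # right of the boundary = anomaly
        anomalies.append(pts[is_anom])
        normals.append(pts[~is_anom])
        n_anom += int(is_anom.sum())
        n_norm += int((~is_anom).sum())
    return np.vstack(anomalies)[:n_per_class], np.vstack(normals)[:n_per_class]

rng = np.random.default_rng(0)
anom, norm = sample_balanced(500, rng)

# Horizontal distance of each point to the decision boundary.
dist_anom = np.abs(anom[:, 0] - THRESHOLD)
dist_norm = np.abs(norm[:, 0] - THRESHOLD)
print("mean distance, anomalies:", dist_anom.mean())
print("mean distance, normal points:", dist_norm.mean())
```

    The anomalies cluster much closer to the boundary than the normal points, which is the "stacking" visible in the figure. The second effect, that anomalies only constrain the boundary from one side, is not captured by this distance summary.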

    The Root Cause Is Class Asymmetry

    Pezzicoli’s theory shows that the optimal imbalance is generally different from 50%, because different classes have different properties. However, they only analyze one source of diversity among classes, that is, outlier behavior. Yet, as shown for example by Sarao-Mannelli and coauthors [6], there are plenty of effects, such as the presence of subgroups within classes, which can produce a similar effect. It is the concurrence of a very large number of effects determining diversity among classes that tells us what the optimal imbalance for our specific problem is. Until we have a theory that treats all sources of asymmetry in the data together (including those induced by how the model architecture processes them), we cannot know the optimal training imbalance of a dataset beforehand.

    Key Takeaways & What You Can Do Differently

    If until now you rebalanced your binary dataset to 50%, you were doing well, but you were most likely not doing the best possible. Although we still do not have a theory that can tell us what the optimal training imbalance should be, now you know that it is likely not 50%. The good news is that it is on the way: machine learning theorists are actively addressing this topic. In the meantime, you can think of ρ⁽ᵗʳᵃⁱⁿ⁾ as a hyperparameter which you can tune beforehand, just as any other hyperparameter, to rebalance your data in the most efficient way. So before your next model training run, ask yourself: is 50/50 really optimal? Try tuning your class imbalance: your model’s performance might surprise you.

    References

    [1] E. Francazi, M. Baity-Jesi, and A. Lucchi, A theoretical analysis of the learning dynamics under class imbalance (2023), ICML 2023

    [2] K. Ghosh, C. Bellinger, R. Corizzo, P. Branco, B. Krawczyk, and N. Japkowicz, The class imbalance problem in deep learning (2024), Machine Learning, 113(7), 4845–4901

    [3] E. Loffredo, M. Pastore, S. Cocco and R. Monasson, Restoring balance: principled under/oversampling of data for optimal classification (2024), ICML 2024

    [4] F. Kamalov, A.F. Atiya and D. Elreedy, Partial resampling of imbalanced data (2022), arXiv preprint arXiv:2207.04631

    [5] F.S. Pezzicoli, V. Ros, F.P. Landes and M. Baity-Jesi, Class imbalance in anomaly detection: Learning from an exactly solvable model (2025), AISTATS 2025

    [6] S. Sarao-Mannelli, F. Gerace, N. Rostamzadeh and L. Saglietti, Bias-inducing geometries: an exactly solvable data model with fairness implications (2022), arXiv preprint arXiv:2205.15935


