
    Does More Data Always Yield Better Performance?

    By Editor Times Featured | November 11, 2025 | 9 Mins Read

    In data science, we strive to improve the less-than-desirable performance of our model as we fit the data at hand. We try techniques ranging from changing model complexity to data massaging and preprocessing. However, more often than not, we are advised to “just” get more data. Besides that being easier said than done, perhaps we should pause and question the conventional wisdom. In other words,

    Does adding more data always yield better performance?

    In this article, let’s put this adage to the test using real data and a tool I built for such inquiry. We will shed light on the subtleties associated with data collection and expansion, challenging the notion that such endeavors automatically improve performance and calling for a more mindful and strategic practice.

    What Does More Data Mean?

    Let’s first define what we mean exactly by “more data”. In the most general setting, we commonly imagine data to be tabular. And when the idea of acquiring more data is suggested, adding more rows to our data frame (i.e., more data points or samples) is what first comes to mind.

    However, an alternative approach would be adding more columns (i.e., more attributes or features). The first approach expands the data vertically, while the second does so horizontally.

    We will next consider the commonalities and peculiarities of the two approaches.

    Data can be expanded by adding more samples or more columns. (Image by author)
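The two directions of expansion can be made concrete with a tiny table. This is only a minimal sketch: the table is a list of dicts, and the attribute names and values (`nationality`, `gdp`, `dropout`) are made up for illustration rather than taken from the dataset.

```python
# A tiny table as a list of dicts: one dict per student, one key per attribute.
students = [
    {"nationality": "A", "dropout": 0},
    {"nationality": "B", "dropout": 1},
]

def add_rows(table, new_rows):
    """Vertical expansion: more samples, same attributes."""
    return table + new_rows

def add_column(table, name, values):
    """Horizontal expansion: same samples, one more attribute."""
    if len(values) != len(table):
        raise ValueError("need exactly one value per existing row")
    return [{**row, name: value} for row, value in zip(table, values)]

def shape(table):
    """(number of rows, number of columns) of a non-empty table."""
    return len(table), len(table[0])

taller = add_rows(students, [{"nationality": "A", "dropout": 1}])
wider = add_column(students, "gdp", [1.7, 2.1])

print(shape(students), shape(taller), shape(wider))  # (2, 2) (3, 2) (2, 3)
```

Adding rows leaves the column count unchanged and adding a column leaves the row count unchanged, which is why the two cases are examined separately.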

    Case 1: More Samples

    Let’s consider the first case of adding more samples. Does adding more samples necessarily improve model performance?

    In an attempt to unravel it, I created a tool hosted as a HuggingFace space to address this question. This tool allows the user to experiment with the effects of changing the attribute set, the sample size, and/or model complexity when analyzing the UCI Irvine – Predict Students’ Dropout and Academic Success dataset [1] with a decision tree. While both the tool and the dataset are meant for educational purposes, we will still be able to derive valuable insights that generalize beyond this basic setting.

    …

    Feature/Depth/Sample Explorer Tool (Image generated by the author using UCI dataset)

    Say the school’s dean hands you some student records and asks you to identify the factors that predict student dropout to address the issue. You are given 1500 data points to start with. You create a 700-data-point held-out test set and you use the rest for training. The data furnished to you contains the students’ nationalities and parents’ occupations, as well as the GDP and inflation and unemployment rates.
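The split described above (1500 records, 700 held out, 800 left for training) might look like the following. It is a minimal sketch using only the standard library; the helper name `train_test_split`, the seed, and the integer stand-ins for student records are assumptions for illustration, not the tool’s actual code.

```python
import random

def train_test_split(records, test_size, seed=0):
    """Shuffle a copy of `records` and carve off `test_size` of them as a held-out test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Stand-in for the 1500 student records (real rows would carry attributes and a label).
records = list(range(1500))
train, test = train_test_split(records, test_size=700)

print(len(train), len(test))  # 800 700
```

Fixing the test set once and keeping it out of training is what makes later scores comparable as new batches arrive.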

    However, the results don’t seem impressive. The F1 score is low. So, naturally, you ask your dean to pull some strings to acquire more student records (perhaps from prior years or other schools), which they do over a couple of weeks. You rerun the experiment every time you get a new batch of student records. Conventional wisdom suggests that adding more data steadily improves the modeling process (the test F1 score should increase monotonically), but that’s not what you see. The performance erratically fluctuates as more data comes in. You are confused. Why would more data ever hurt performance? Why did the F1 score drop from 46% down to 39% when one of the batches was added? Shouldn’t the relationship be causal?
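For reference, the F1 score is the harmonic mean of precision and recall. The binary form is shown here in terms of true positives (TP), false positives (FP), and false negatives (FN); the article does not say how the score is averaged across classes, so treat this as the general definition rather than the exact metric used by the tool.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```

As a made-up illustration, a model that finds 4 of 10 true dropouts and raises no false alarms has TP = 4, FP = 0, FN = 6, giving F1 = 8/14 ≈ 0.57 despite perfect precision.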

    Number of samples vs. performance: Even with cross-validated hyper-parameter tuning, both training and test F1 scores fluctuate as the number of samples increases. The impact of adding more samples can be messy and counter-intuitive. (Image generated by author using UCI dataset)

    Well, the question is really whether more samples necessarily provide more information. Let’s first ponder the nature of these additional samples:

    • They could be false (i.e., a bug in data collection)
    • They could be biased (e.g., over-representing a specific case that doesn’t align with the true distribution as represented by the test set)
    • The test set itself may be biased…
    • Spurious patterns may be introduced by some batches and later cancelled by other batches.
    • The attributes collected establish little to no correlation or causation with the target (i.e., there are lurking variables unaccounted for). So, no matter how many samples you add, they are not going to get you anywhere!
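The second and fourth points above can be reproduced with a deliberately tiny, fully deterministic toy. This is a sketch under stated assumptions, not the author’s experiment: a one-threshold “decision stump” stands in for the decision tree, the single feature `x` is a hypothetical risk score from 0 to 19, and the biased batch is an invented cohort whose dropout cutoff differs from the test population’s.

```python
def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def fit_stump(samples):
    """Return the threshold t that minimises training errors for 'predict 1 when x >= t'."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in samples}):
        err = sum(1 for x, y in samples if int(x >= t) != y)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def label(x, cutoff):
    return int(x >= cutoff)

# In the population we care about (and in the test set), dropout means x >= 10.
test_set = [(x, label(x, 10)) for x in range(20)]

batches = [
    [(x, label(x, 10)) for x in range(0, 20, 2)],  # representative batch
    [(x, label(x, 16)) for x in range(10, 18)],    # biased batch: a cohort whose cutoff is 16
    [(x, label(x, 10)) for x in range(20)],        # larger representative batch
]

train, thresholds, scores = [], [], []
for batch in batches:
    train += batch                      # training data only ever grows
    t = fit_stump(train)
    preds = [int(x >= t) for x, _ in test_set]
    thresholds.append(t)
    scores.append(f1_score([y for _, y in test_set], preds))

print(thresholds)                     # [10, 16, 10]
print([round(s, 3) for s in scores])  # [1.0, 0.571, 1.0]
```

The training set grows at every step, yet the test F1 falls when the biased batch drags the learned threshold away from the true one, and recovers only once enough representative samples outvote it.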

    So, yes, adding more data is generally a good idea, but we must pay attention to inconsistencies in the data (e.g., two students of the same nationality and social standing may end up on different paths due to other factors). We must also carefully assess the usefulness of the available attributes (e.g., perhaps GDP has nothing to do with student dropout rate).

    Some may argue that this would not be an issue when you have lots of real data (after all, this is a relatively small dataset). There’s merit to that argument, but only if the data is well homogenized and accounts for the different variabilities and “degrees of freedom” of the attribute set (i.e., the range of values each attribute can take and the possible combinations of these values as seen in the real world). Research has shown cases in which large datasets that are considered gold standard show biases in interesting and obscure ways that were not easy to spot at first glance, causing misleading reports of high accuracy [2].

    Case 2: More Attributes

    Now, speaking of attributes, let’s consider an alternative scenario in which your dean fails to acquire more student records. However, they come and say, “Hey you… I wasn’t able to get more student records… but I was able to use some SQL to get more attributes for your data… I’m sure you can improve your performance now. Right?… Right?!”

    Feature set vs. performance: Each vertical line shows a retraining of the decision tree (800 samples with cross-validated hyper-parameter tuning) with one additional attribute. Some attributes help (Mother’s occupation), while others hurt (Father’s occupation and Gender). More columns may sometimes mean more noise and more ways to overfit. (Image generated by author using UCI dataset)

    Well, let’s put that to the test. Let’s look at the following example where we incrementally add more attributes, expanding the students’ profile and including their marital, financial, and immigration statuses. Each time we add an attribute, we retrain the tree and evaluate its performance. As you can see, while some increments improve performance, others actually hurt it. But again, why?

    Looking at the attribute set more closely, we find that not all attributes actually carry useful information. The real world is messy… Some attributes (e.g., Gender) might provide noise or false correlations in the training set that will not generalize well to the test set (overfitting).
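A hand-built toy shows how an uninformative attribute can raise the training score while lowering the test score. Everything here is an assumption for illustration: a lookup table that memorises the majority label per attribute combination stands in for the decision tree, accuracy is used instead of F1 for brevity, and the attributes `grades` (informative) and `locker` (noise) are invented.

```python
from collections import Counter, defaultdict

def fit_table(rows, attrs):
    """Memorise the majority label for every combination of `attrs` seen in training."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[a] for a in attrs)][row["dropout"]] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    fallback = Counter(row["dropout"] for row in rows).most_common(1)[0][0]
    return table, fallback  # fallback covers combinations never seen in training

def accuracy(model, rows, attrs):
    table, fallback = model
    hits = sum(table.get(tuple(r[a] for a in attrs), fallback) == r["dropout"] for r in rows)
    return hits / len(rows)

def make(rows):
    return [{"grades": g, "locker": l, "dropout": y} for g, l, y in rows]

# Training data: two label exceptions happen to line up with the locker attribute.
train = make([("low", "A", 1), ("low", "A", 1), ("low", "A", 1), ("low", "B", 0),
              ("high", "A", 1), ("high", "B", 0), ("high", "B", 0), ("high", "B", 0)])
# Test data: dropout depends on grades only; locker carries no information.
test = make([("low", "A", 1), ("low", "B", 1), ("low", "A", 1), ("low", "B", 1),
             ("high", "A", 0), ("high", "B", 0), ("high", "A", 0), ("high", "B", 0)])

results = {}
for attrs in (["grades"], ["grades", "locker"]):
    model = fit_table(train, attrs)
    results[tuple(attrs)] = (accuracy(model, train, attrs), accuracy(model, test, attrs))
    print(attrs, results[tuple(attrs)])
# ['grades'] (0.75, 1.0)
# ['grades', 'locker'] (1.0, 0.5)
```

With the extra column the model fits the training set perfectly by memorising the coincidence, and that memorised pattern is exactly what fails on the test set.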

    Also, while common wisdom says that as you add more data you should increase your model complexity, this practice does not always yield the best result. Sometimes, when adding an attribute, lowering model complexity may help with overfitting (e.g., when Course was introduced to the mix).

    Feature set vs. tree depth: The optimal tree depth (chosen by grid search) fluctuates as attributes are added. Notice that more attributes do not always translate to a larger tree. (Image generated by author using UCI dataset)
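Choosing complexity by cross-validated grid search can be sketched in a few lines. This is an illustration under loud assumptions rather than the tool’s implementation: the model is a lookup table, `depth` here means “how many leading attributes the model may use” (a crude proxy for tree depth), the folds are contiguous, and the data is a toy in which attribute 0 determines the label while attribute 1 is a unique ID.

```python
from collections import Counter

def fit(rows, labels, depth):
    """Memorise the majority label per combination of the first `depth` attributes."""
    votes = {}
    for row, y in zip(rows, labels):
        votes.setdefault(tuple(row[:depth]), Counter())[y] += 1
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    fallback = Counter(labels).most_common(1)[0][0]  # used for unseen combinations
    return table, fallback

def score(model, rows, labels, depth):
    """Accuracy of the lookup-table model on the given rows."""
    table, fallback = model
    hits = sum(table.get(tuple(r[:depth]), fallback) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def k_folds(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        yield [i for i in range(n) if not lo <= i < hi], list(range(lo, hi))

def grid_search(rows, labels, depths, k=3):
    """Return (best_depth, {depth: mean validation accuracy}); ties go to the smaller depth."""
    mean_scores = {}
    for depth in depths:
        fold_scores = []
        for tr, va in k_folds(len(rows), k):
            model = fit([rows[i] for i in tr], [labels[i] for i in tr], depth)
            fold_scores.append(
                score(model, [rows[i] for i in va], [labels[i] for i in va], depth))
        mean_scores[depth] = sum(fold_scores) / k
    best = max(sorted(depths), key=lambda d: mean_scores[d])
    return best, mean_scores

# Toy data: attribute 0 determines the label; attribute 1 is a unique ID (pure noise).
rows = [(i % 2, i) for i in range(12)]
labels = [i % 2 for i in range(12)]

best_depth, cv_scores = grid_search(rows, labels, depths=[0, 1, 2])
print(best_depth, cv_scores)  # 1 {0: 0.5, 1: 1.0, 2: 0.5}
```

The most complex setting is not the winner: once the model is allowed to use the ID-like attribute, every validation row looks unfamiliar and the validation score falls back to chance, so the search settles on the simpler model.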

    Conclusion

    Taking a step back and looking at the big picture, we see that while collecting more data is a noble cause, we should be careful not to automatically assume that performance will get better. There are two forces at play here: how well the model fits the training data, and how reliably that fit generalizes and extends to unseen data.

    Let’s summarize how each type of “more data” influences these forces, depending on whether the added data is good (representative, consistent, informative) or bad (biased, noisy, inconsistent):

    More samples (rows)
      If data quality is good:
        • Training error may rise slightly (more variation makes the data harder to fit).
        • Test error usually drops. The model becomes more stable and confident.
      If data quality is poor:
        • Training error may fluctuate due to conflicting examples.
        • Test error often rises.

    More attributes (columns)
      If data quality is good:
        • Training error usually drops (more signal leads to a richer representation).
        • Test error drops as attributes encode true and generalizable patterns.
      If data quality is poor:
        • Training error usually drops (the model memorizes noisy patterns).
        • Test error rises due to spurious correlations.

    Generalization isn’t just about quantity; it’s also about quality and the right level of model complexity.

    To wrap up, next time someone suggests that you should “simply” get more data to magically improve accuracy, discuss with them the intricacies of such a plan. Talk about the characteristics of the procured data in terms of nature, size, and quality. Point out the nuanced interplay between data and model complexities. This will help make their effort worthwhile!

    Lessons to Internalize:

    • Whenever possible, don’t take others’ (or my) word for it. Experiment yourself!
    • When adding more data points for training, ask yourself: Do these samples represent the phenomenon you are modeling? Are they showing the model more interesting, realistic cases, or are they biased and/or inconsistent?
    • When adding more attributes, ask yourself: Are these attributes hypothesized to carry information that enhances our ability to make better predictions, or is it mostly noise?
    • Ultimately, conduct hyper-parameter tuning and proper validation to eliminate doubts when assessing how informative the new training data is.

    Try it yourself!

    If you’d like to explore the dynamics showcased in this article yourself, I host the interactive tool here. As you experiment by adjusting the sample size, number of attributes, and/or model depth, you will observe the impact of these adjustments on model performance. Such experimentation enriches your perspective and understanding of the mechanisms underlying data science and analytics.

    References:

    [1] M.V. Martins, D. Tolledo, J. Machado, L.M.T. Baptista, V. Realinho. (2021) “Early prediction of student’s performance in higher education: a case study” Trends and Applications in Information Systems and Technologies, vol.1, in Advances in Intelligent Systems and Computing series. Springer. DOI: 10.1007/978-3-030-72657-7_16. This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.

    [2] Z. Liu and K. He, A Decade’s Battle on Dataset Bias: Are We There Yet? (2024), arXiv: https://arxiv.org/abs/2403.08632


