In data science, we strive to improve the less-than-desirable performance of our model as we fit the data at hand. We try techniques ranging from changing model complexity to data massaging and preprocessing. However, more often than not, we’re advised to “just” get more data. Besides that being easier said than done, perhaps we should pause and question the conventional wisdom. In other words,
Does adding more data always yield better performance?
In this article, let’s put this adage to the test using real data and a tool I built for such inquiry. We will shed light on the subtleties associated with data collection and expansion, challenging the notion that such endeavors automatically improve performance and calling for a more mindful and strategic practice.
What Does More Data Mean?
Let’s first define what we mean exactly by “more data”. In the most general setting, we commonly think of data as tabular. And when the idea of acquiring more data is suggested, adding more rows to our data frame (i.e., more data points or samples) is what first comes to mind.
However, an alternative approach would be adding more columns (i.e., more attributes or features). The first approach expands the data vertically, while the second does so horizontally.
We will next consider the commonalities and peculiarities of the two approaches.
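To make the vertical/horizontal distinction concrete, here is a minimal pandas sketch. The column names and values are hypothetical toy records made up for illustration; they are not the real schema of the dataset used later in this article.

```python
import pandas as pd

# Hypothetical toy records; names and values are illustrative only.
students = pd.DataFrame({
    "nationality": ["PT", "BR", "PT"],
    "dropout": [0, 1, 0],
})

# Vertical expansion: more rows (samples) with the same columns.
new_rows = pd.DataFrame({"nationality": ["ES", "PT"], "dropout": [1, 0]})
more_samples = pd.concat([students, new_rows], ignore_index=True)

# Horizontal expansion: more columns (attributes) for the same rows.
new_cols = pd.DataFrame({"marital_status": ["single", "married", "single"]})
more_attributes = pd.concat([students, new_cols], axis=1)

print(more_samples.shape)     # (5, 2)
print(more_attributes.shape)  # (3, 3)
```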
Case 1: More Samples
Let’s consider the first case of adding more samples. Does adding more samples necessarily improve model performance?
In an attempt to unravel it, I created a tool hosted as a Hugging Face Space to address this question. This tool allows the user to experiment with the effects of changing the attribute set, the sample size, and/or model complexity when analyzing the UCI Irvine – Predict Students’ Dropout and Academic Success dataset [1] with a decision tree. While both the tool and the dataset are meant for educational purposes, we will still be able to derive valuable insights that generalize beyond this basic setting.

…


Say the school’s dean hands you some student records and asks you to identify the factors that predict student dropout so the issue can be addressed. You are given 1500 data points to start with. You create a 700-data-point held-out test set and use the rest for training. The data furnished to you contains the students’ nationalities and parents’ occupations, as well as the GDP, inflation, and unemployment rates.
However, the results don’t seem impressive. The F1 score is low. So, naturally, you ask your dean to pull some strings to acquire more student records (perhaps from prior years or other schools), which they do over a couple of weeks. You rerun the experiment every time you get a new batch of student records. Conventional wisdom suggests that adding more data steadily improves the modeling process (the test F1 score should increase monotonically), but that’s not what you see. The performance fluctuates erratically as more data comes in. You are confused. Why would more data ever hurt performance? Why did the F1 score drop from 46% down to 39% when one of the batches was added? Shouldn’t the relationship be causal?

Well, the question is really whether more samples necessarily provide more information. Let’s first ponder the nature of these additional samples:
- They could be false (i.e., a bug in data collection)
- They could be biased (e.g., over-representing a specific case that doesn’t align with the true distribution as represented by the test set)
- The test set itself may be biased…
- Spurious patterns may be introduced by some batches and later cancelled by other batches.
- The attributes collected establish little to no correlation or causation with the target (i.e., there are lurking variables unaccounted for). So, no matter how many samples you add, they are not going to get you anywhere!
So, yes, adding more data is generally a good idea, but we must pay attention to inconsistencies in the data (e.g., two students of the same nationality and social standing may end up on different paths due to other factors). We must also carefully assess the usefulness of the available attributes (e.g., perhaps GDP has nothing to do with the student dropout rate).
Some may argue that this would not be an issue when you have lots of real data (after all, this is a relatively small dataset). There’s merit to that argument, but only if the data is well homogenized and accounts for the different variabilities and “degrees of freedom” of the attribute set (i.e., the range of values each attribute can take and the possible combinations of these values as seen in the real world). Research has shown cases in which large datasets that are considered gold standard exhibit biases in interesting and obscure ways that weren’t easy to spot at first glance, causing misleading reports of high accuracy [2].
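The batch-by-batch experiment described in this section can be sketched roughly as follows. This is not the code behind the tool: it assumes scikit-learn, and it uses a synthetic dataset from `make_classification` (with label noise injected through `flip_y`) as a stand-in for the student records. The helper name `f1_by_batch`, the batch size, and the tree depth are all illustrative choices. Whether the resulting curve rises smoothly or fluctuates depends on the data and the seed.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student records (assumption: not the real dataset).
# flip_y randomly flips a share of the labels to mimic inconsistent records.
X, y = make_classification(n_samples=3000, n_features=8, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)

# Fixed held-out test set of 700 points; the rest is the training pool.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=700, random_state=0)

def f1_by_batch(batch_size=200, start=800, max_depth=5):
    """Retrain on a growing training set; score each model on the same test set."""
    scores = []
    for n in range(start, len(X_pool) + 1, batch_size):
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(X_pool[:n], y_pool[:n])
        scores.append((n, f1_score(y_test, tree.predict(X_test))))
    return scores

scores = f1_by_batch()
for n, f1 in scores:
    print(f"train size = {n:4d}  test F1 = {f1:.3f}")
```

Keeping the test set fixed across batches matters here: it ensures that any change in the F1 score comes from the added training data rather than from a reshuffled evaluation set.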
Case 2: More Attributes
Now, speaking of attributes, let’s consider an alternative scenario in which your dean fails to acquire more student records. However, they come and say, “Hey you… I wasn’t able to get more student records… but I was able to use some SQL to get more attributes for your data… I’m sure you can improve your performance now. Right?… Right?!”

Well, let’s put that to the test. Let’s look at the following example where we incrementally add more attributes, expanding the students’ profile and including their marital, financial, and immigration statuses. Each time we add an attribute, we retrain the tree and evaluate its performance. As you can see, while some increments improve performance, others actually hurt it. But again, why?
Looking at the attribute set more closely, we find that not all attributes actually carry useful information. The real world is messy… Some attributes (e.g., Gender) may introduce noise or false correlations in the training set that will not generalize well to the test set (overfitting).
Also, while common wisdom says that as you add more data you should increase your model complexity, this practice doesn’t always yield the best result. Sometimes, when adding an attribute, lowering model complexity may help with overfitting (e.g., when Course was introduced to the mix).
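A rough way to reproduce this kind of attribute-by-attribute experiment is sketched below. Again, this is an illustration under assumptions rather than the tool’s actual code: it relies on scikit-learn and a synthetic dataset in which, because `shuffle=False`, the first three columns are informative and the remaining seven are pure noise. The helper name `f1_by_attribute_count` and the two depth settings are hypothetical choices for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): with shuffle=False, columns 0-2 are
# informative and columns 3-9 are random noise.
X, y = make_classification(n_samples=1500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=700, random_state=0)

def f1_by_attribute_count(max_depth):
    """Retrain using only the first k columns, for k = 1..n_features."""
    scores = []
    for k in range(1, X.shape[1] + 1):
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(X_train[:, :k], y_train)
        scores.append(f1_score(y_test, tree.predict(X_test[:, :k])))
    return scores

shallow = f1_by_attribute_count(max_depth=3)   # lower model complexity
deep = f1_by_attribute_count(max_depth=None)   # tree grows until leaves are pure
for k, (s, d) in enumerate(zip(shallow, deep), start=1):
    print(f"{k:2d} attributes  depth=3: {s:.3f}  unlimited depth: {d:.3f}")
```

Comparing the two columns of output side by side gives a feel for how the same added attribute can affect a shallow tree and a deep tree differently.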

Conclusion
Taking a step back and looking at the big picture, we see that while collecting more data is a noble cause, we should be careful not to automatically assume that performance will get better. There are two forces at play here: how well the model fits the training data, and how reliably that fit generalizes and extends to unseen data.
Let’s summarize how each type of “more data” influences these forces, depending on whether the added data is good (representative, consistent, informative) or bad (biased, noisy, inconsistent):
| | If data quality is good… | If data quality is poor… |
| --- | --- | --- |
| More samples (rows) | • Training error may rise slightly (more variations make it harder to fit).<br>• Test error usually drops. The model becomes more stable and confident. | • Training error may fluctuate due to conflicting examples.<br>• Test error often rises. |
| More attributes (columns) | • Training error usually drops (more signal leads to a richer representation).<br>• Test error drops as attributes encode true and generalizable patterns. | • Training error usually drops (the model memorizes noisy patterns).<br>• Test error rises due to spurious correlations. |
Generalization isn’t just about quantity; it’s also about quality and the right level of model complexity.
To wrap up, next time someone suggests that you should “simply” get more data to magically improve accuracy, discuss with them the intricacies of such a plan. Talk about the characteristics of the procured data in terms of nature, size, and quality. Point out the nuanced interplay between data and model complexities. This will help make their effort worthwhile!
Lessons to Internalize:
- Whenever possible, don’t take others’ (or my) word for it. Experiment yourself!
- When adding more data points for training, ask yourself: Do these samples represent the phenomenon you are modeling? Are they showing the model more interesting, realistic cases? Or are they biased and/or inconsistent?
- When adding more attributes, ask yourself: Are these attributes hypothesized to carry information that enhances our ability to make better predictions, or are they mostly noise?
- Finally, conduct hyper-parameter tuning and proper validation to eliminate doubts when assessing how informative the new training data is.
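As one possible way to act on the last lesson, the sketch below tunes the tree depth with cross-validation before touching the held-out test set. It assumes scikit-learn and, once more, a synthetic stand-in dataset rather than the real student records; the depth grid and the 5-fold setting are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student records (assumption: not the real dataset).
X, y = make_classification(n_samples=1500, n_features=10, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=700, random_state=0)

# Search over tree depth using 5-fold cross-validation on the training set only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 8, None]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

print("best max_depth:", search.best_params_["max_depth"])
print("cross-validated F1:", round(search.best_score_, 3))
# The test set is used once, after the depth has been chosen.
print("held-out test F1:", round(search.score(X_test, y_test), 3))
```

Rerunning this search each time new samples or attributes arrive keeps the comparison fair, since each version of the data is evaluated at its own best depth.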
Try it yourself!
If you’d like to explore the dynamics showcased in this article yourself, I host the interactive tool here. As you experiment by adjusting the sample size, number of attributes, and/or model depth, you will observe the impact of these adjustments on model performance. Such experimentation enriches your perspective and understanding of the mechanisms underlying data science and analytics.
References:
[1] M.V. Martins, D. Tolledo, J. Machado, L. M.T. Baptista, V. Realinho. (2021) “Early prediction of student’s performance in higher education: a case study” Trends and Applications in Information Systems and Technologies, vol.1, in Advances in Intelligent Systems and Computing series. Springer. DOI: 10.1007/978-3-030-72657-7_16. This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.
[2] Z. Liu and K. He, A Decade’s Battle on Dataset Bias: Are We There Yet? (2024), arXiv: https://arxiv.org/abs/2403.08632

