    How Metrics (and LLMs) Can Trick You: A Field Guide to Paradoxes

    By Editor Times Featured | July 16, 2025 | 9 min read


    Overview

    Paradoxes are not just optical illusions or mind-bending puzzles. They can be logical, causing initial observations to crumble upon closer investigation. In data science, paradoxes arise when we take numbers at face value without looking into the context behind them. One can have the sharpest visuals and still walk away with the wrong story.

    In this article, we discuss three logical paradoxes that serve as cautionary tales for anyone who interprets data too quickly, without applying context. We explore how paradoxes arise in Data Science and Business Intelligence (BI) use cases, and then extend the insights to Retrieval-Augmented Generation (RAG) systems, where similar paradoxes can undermine the quality of both the prompt provided and the model's output.

    Simpson's Paradox in Business Intelligence

    Simpson's paradox describes the situation where trends reverse when data is aggregated. In other words, the trends that you observe in subgroups get flipped when you combine the numbers and analyze them. Let's assume that we're analyzing the sales of four locations of a popular ice cream chain. When the sales for each location are analyzed individually, the data suggests that the chocolate flavor is the most preferred among customers. But when the sales are added up, the trend disappears, and the combined results suggest that vanilla is preferred the most. This trend reversal is Simpson's Paradox. We use the fictional data below to demonstrate it.

    Location    Chocolate  Vanilla  Total Customers  Chocolate %  Vanilla %  Winner
    Suburb A           15        5               20        75.0%      25.0%  Chocolate
    City B             33       27               60        55.0%      45.0%  Chocolate
    Mall             2080     1920             4000        52.0%      48.0%  Chocolate
    Airport          1440     2160             3600        40.0%      60.0%  Vanilla
    Total            3568     4112             7680        46.5%      53.5%  Vanilla!
    Sales by Store Location for a Fictitious Ice Cream Chain (By the Author)

    Below is a visual illustration.

    Simpson's Paradox in BI Reporting – Illustration (Image by the Author)
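The reversal can be verified directly from the table's numbers. Below is a minimal Python sketch; the data is the fictional sales table above, not real figures:

```python
# Per-location ice-cream sales (fictional data from the table above).
sales = {
    "Suburb A": {"chocolate": 15,   "vanilla": 5},
    "City B":   {"chocolate": 33,   "vanilla": 27},
    "Mall":     {"chocolate": 2080, "vanilla": 1920},
    "Airport":  {"chocolate": 1440, "vanilla": 2160},
}

def winner(choc, van):
    return "Chocolate" if choc > van else "Vanilla"

# Within subgroups, chocolate wins at three of four locations...
per_location = {loc: winner(s["chocolate"], s["vanilla"]) for loc, s in sales.items()}

# ...but the aggregate flips to vanilla, because the airport's large,
# vanilla-leaning customer base dominates the combined totals.
total_choc = sum(s["chocolate"] for s in sales.values())
total_van = sum(s["vanilla"] for s in sales.values())
overall = winner(total_choc, total_van)

print(per_location)                    # Chocolate wins in 3 of 4 locations
print(total_choc, total_van, overall)  # 3568 4112 Vanilla
```

The store location is the lurking variable: once sales are grouped by it, the subgroup preference and the aggregate preference disagree.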

    A data analyst who overlooks these subgroup dynamics might conclude that chocolate is underperforming. Hence, it is essential to break numbers down by subgroup and check for the presence of Simpson's paradox. When a trend reversal occurs, the lurking variable should be identified as the next step. A lurking variable is the hidden factor influencing group outcomes. In this case, the store location happens to be the lurking variable. A deep contextual understanding is required to interpret why vanilla sales were high at the airport, flipping the overall result. Some questions that could guide the investigation are:

    • Do airport stores stock fewer chocolate options?

    • Do travelers prefer milder flavors?

    • Was there a promotional campaign favoring vanilla at the airport stores?

    Simpson's Paradox in RAG Systems

    Let's suppose that you have a RAG (Retrieval-Augmented Generation) model that gauges public sentiment towards electric vehicles (EVs) and answers questions about it. The model uses news articles from 2010 to 2024. Until 2016, EVs received mixed reviews because of their limited range, higher purchase price, and lack of charging stations. All these factors made long-distance EV driving impractical, and newspaper reports before 2017 tended to highlight such deficiencies. But from 2017 onward, EVs started being perceived in a favorable light thanks to improvements in performance and the availability of charging stations, a shift that occurred notably after the successful launch of Tesla's premium EV. A RAG model that draws on news reports from 2010 to 2024 would likely give contradictory responses to similar questions, triggering Simpson's Paradox.

    For example, if the RAG is asked, "Is EV adoption in the US still low?", the answer might be "Yes, adoption remains low due to high purchase costs and limited infrastructure". If the RAG is asked, "Has EV adoption increased recently in the U.S.?", the answer could be "Yes, adoption has increased drastically due to advancements in technology and charging infrastructure". In this case, the lurking variable is the publication date. A practical fix for this issue is to tag documents (articles) into time-based bins during the pre-processing phase. Other options include encouraging users to specify a time range in their prompt (e.g., "In the last 5 years, how has EV adoption been?") or fine-tuning the LLM to explicitly state the timeline it is considering in its response (e.g., "Around 2024, EV adoption has increased drastically.").

    Simpson's Paradox in RAG Systems (Image by the Author)
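A minimal sketch of the time-binning fix described above. The function name, bin boundary, and document schema are illustrative assumptions, not part of any specific RAG framework:

```python
from datetime import date

# Hypothetical pre-processing step: tag each article with a time bin so the
# retriever can filter or weight by recency. The 2017 boundary mirrors the
# perception shift described in the article.
def time_bin(published: date) -> str:
    return "pre-2017" if published.year < 2017 else "2017-onward"

docs = [
    {"text": "EVs struggle with range and sparse charging stations.",
     "published": date(2014, 6, 1)},
    {"text": "EV adoption surges as charging networks expand.",
     "published": date(2022, 3, 15)},
]

for doc in docs:
    doc["time_bin"] = time_bin(doc["published"])

# At query time, restrict retrieval to the bin matching the question's
# time frame, e.g. a prompt containing "recently" maps to "2017-onward".
recent = [d for d in docs if d["time_bin"] == "2017-onward"]
print([d["text"] for d in recent])
```

With the bins in place, a question about recent adoption only retrieves post-shift articles, so the lurking variable (publication date) no longer mixes contradictory evidence into one answer.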

    Accuracy Paradox in Data Science Problems

    The crux of the Accuracy Paradox is that high accuracy is not necessarily indicative of a useful output. Let's assume that you are building a classification model to determine whether a patient has a rare disease that affects only one in 100. The model correctly identifies and labels everyone who does not have the disease and thereby achieves 99% accuracy. However, it fails to identify the one person who has the disease and needs urgent medical attention. The model thereby becomes useless for detecting the disease, which is its very purpose. This occurs especially in imbalanced datasets, where the observations for one class are minimal. This has been illustrated in the figure below.

    Accuracy Paradox in Data Science Problems (Image by the Author)

    The best way to tackle the Accuracy Paradox is to use metrics that capture the performance of the minority classes, such as Precision, Recall, and F1-score. Another approach is to treat imbalanced datasets as anomaly detection problems rather than classification problems. One could also consider collecting more minority-class data (if possible), over-sampling the minority class, or under-sampling the majority class. Below is a quick guide that helps determine which metric to use depending on the use case, objective, and consequences of errors.

    Choosing the Right Metric for your Model's Performance Measurement (Image by the Author)
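The rare-disease example can be reproduced in a few lines of plain Python. The 1-in-100 dataset is the toy scenario above, and the "model" is the degenerate classifier that always predicts healthy:

```python
# 100 patients, exactly one positive case (label 1 = has the disease).
y_true = [1] + [0] * 99
# A model that never flags anyone still gets 99% accuracy.
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy)    # 0.99 -- looks impressive
print(recall, f1)  # 0.0 0.0 -- the one sick patient is missed entirely
```

Recall and F1 expose in one line what accuracy hides: the model never finds the class it was built to detect.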

    Accuracy Paradox in LLMs

    While the Accuracy Paradox is a common issue that many data scientists tackle, its implications for LLMs are largely ignored. The accuracy metric can dangerously overpromise in use cases that involve safety, toxicity detection, and bias mitigation. High accuracy does not mean that a model is fair and safe to use. For example, an LLM with 98% accuracy is of no use if it misclassifies 2 malicious prompts as safe and harmless. Hence, in LLM evaluations, it is a good idea to use recall, precision, or PR-AUC over accuracy, as they indicate how well the model handles minority classes.

    Goodhart's Law in Business Intelligence

    Economist Charles Goodhart stated that "When a measure becomes a target, it ceases to be a good measure." This law is a subtle reminder that if you over-optimize a metric without understanding the implications and context, the optimization will backfire.

    Consider a manager of a fictitious online news agency who sets a KPI for his team: increase session duration by 20%. The team extends introductions artificially and adds filler content to stretch the session duration. The session duration goes up, but the video quality suffers, and as a result, the value that users get from the videos diminishes.

    Another example relates to customer churn. In an attempt to reduce churn, a subscription-based entertainment app places the 'Unsubscribe' button in a hard-to-find location in its web portal. Consequently, customer churn goes down, but not because of improved customer satisfaction. It is solely because of restricted exit options: an illusion of customer retention. Below is a visual illustration of how efforts to meet or exceed growth targets (such as increasing session duration or user engagement) can have unintended consequences, leading to a decline in user experience. When teams resort to artificial inflation tactics to drive up performance metrics, the metric improvement looks good on paper but is not meaningful in any way.

    Goodhart's Law – Illustration (Image by the Author)

    Goodhart's Law in LLMs

    When you train an LLM too heavily on a particular dataset (especially a benchmark), it can start memorizing patterns from that training data instead of learning to generalize. This is a classic example of overfitting, where the model performs extremely well on the training data but poorly on real-world inputs.

    Let's assume that you are training an LLM to summarize news articles, and you use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric to evaluate its performance. The ROUGE metric rewards exact or near-exact matches of n-grams with the reference summaries. Over time, the LLM starts copying large chunks of text from the input articles in order to get a higher ROUGE score. It also uses buzzwords that appear frequently in reference summaries. Suppose the input article contains the text "The bank increased interest rates to curb inflation, and this caused stock prices to decline sharply." The overfit model would summarize it as "The bank increased interest rates to curb inflation", while a generalizing model would summarize it as "The interest rate hike triggered a decline in the stock markets". The illustration below demonstrates how optimizing your model too heavily for an evaluation metric can result in low-quality responses (responses that look good on paper but are not useful).

    Goodhart's Law in LLMs (Image by the Author)
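The copying incentive can be demonstrated with a simplified ROUGE-1-style unigram recall. Real ROUGE implementations also handle stemming, longer n-grams, and F-measure; this is only a sketch of the scoring pressure, using the sentences from the example above:

```python
# Simplified ROUGE-1 recall: fraction of reference unigrams that also
# appear in the candidate summary.
def rouge1_recall(candidate: str, reference: str) -> float:
    cand_words = candidate.lower().split()
    ref_words = reference.lower().split()
    overlap = sum(1 for w in ref_words if w in cand_words)
    return overlap / len(ref_words)

reference = "the bank increased interest rates to curb inflation"

copied = "the bank increased interest rates to curb inflation"   # verbatim copy
abstractive = "the rate hike triggered a stock market decline"   # paraphrase

print(rouge1_recall(copied, reference))       # 1.0 -- copying maxes the metric
print(rouge1_recall(abstractive, reference))  # 0.125 -- penalized for rewording
```

The verbatim copy scores a perfect 1.0 while the genuinely useful paraphrase scores near zero, which is exactly the gradient that pushes an over-optimized model toward extraction rather than summarization.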

    Concluding Remarks

    Whether it is in business intelligence or LLMs, paradoxes can creep in if numbers and metrics are handled without the underlying nuance and context. It is also important to remember that over-fitting can hurt the bigger picture. Combining quantitative analysis with human insight is key to avoiding such pitfalls and to creating reliable reports and powerful LLMs that truly deliver value.


