    The way we measure progress in AI is terrible

By Editor Times Featured | November 30, 2024 | 3 Mins Read


One of the goals of the research was to define a list of criteria that make a good benchmark. "It's definitely an important problem to discuss the quality of benchmarks, what we want from them, what we need from them," says Ivanova. "The issue is that there isn't one good standard to define benchmarks. This paper is an attempt to provide a set of evaluation criteria. That's very useful."

The paper was accompanied by the launch of a website, BetterBench, that ranks the most popular AI benchmarks. Ranking factors include whether or not experts were consulted on the design, whether the tested capability is well defined, and other basics: for example, is there a feedback channel for the benchmark, and has it been peer-reviewed?
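To make the idea concrete, a checklist-style assessment like the one described above can be thought of as tallying which criteria a benchmark satisfies. The criteria names and the equal weighting in this sketch are illustrative assumptions, not the actual BetterBench rubric:

```python
# Illustrative sketch of a checklist-style benchmark assessment,
# loosely modeled on the kinds of factors BetterBench considers.
# The criteria names and equal weighting are assumptions for
# illustration only, not the real BetterBench scoring scheme.

CRITERIA = [
    "experts_consulted_on_design",
    "tested_capability_well_defined",
    "feedback_channel_exists",
    "peer_reviewed",
]

def assess(benchmark: dict) -> float:
    """Return the fraction of criteria this benchmark satisfies."""
    met = sum(1 for c in CRITERIA if benchmark.get(c, False))
    return met / len(CRITERIA)

# A hypothetical benchmark that meets three of the four criteria.
example = {
    "experts_consulted_on_design": True,
    "tested_capability_well_defined": True,
    "feedback_channel_exists": False,
    "peer_reviewed": True,
}
print(assess(example))  # 0.75
```

As the article notes below, a benchmark could score well on every such checkbox and still measure the wrong thing, so a tally like this captures design hygiene, not relevance.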

The MMLU benchmark had the lowest scores. "I disagree with these rankings. In fact, I'm an author of some of the papers ranked highly, and would say that the lower-ranked benchmarks are better than them," says Dan Hendrycks, director of CAIS, the Center for AI Safety, and one of the creators of the MMLU benchmark. That said, Hendrycks still believes that the best way to move the field forward is to build better benchmarks.

Some think the criteria may be missing the bigger picture. "The paper adds something valuable. Implementation criteria and documentation criteria, all of this is important. It makes the benchmarks better," says Marius Hobbhahn, CEO of Apollo Research, a research organization specializing in AI evaluations. "But for me, the most important question is, do you measure the right thing? You could check all of these boxes, but you could still have a terrible benchmark because it just doesn't measure the right thing."

Essentially, even if a benchmark is perfectly designed, one that tests a model's ability to provide compelling analysis of Shakespeare's sonnets may be useless if what someone really cares about is AI's hacking capabilities.

"You'll see a benchmark that's supposed to measure moral reasoning. But what that means isn't necessarily defined very well. Are people who are experts in that domain being incorporated in the process? Often that isn't the case," says Amelia Hardy, another author of the paper and an AI researcher at Stanford University.

There are organizations actively trying to improve the situation. For example, a new benchmark from Epoch AI, a research organization, was designed with input from 60 mathematicians and verified as challenging by two winners of the Fields Medal, the most prestigious award in mathematics. The participation of these experts fulfills one of the criteria in the BetterBench assessment. The current most advanced models are able to answer less than 2% of the questions on the benchmark, which means there is a long way to go before it is saturated.

"We really tried to represent the full breadth and depth of modern math research," says Tamay Besiroglu, associate director at Epoch AI. Despite the difficulty of the test, Besiroglu speculates it will take only around four years for AI models to saturate the benchmark, that is, to score more than 80%.


