
    How to build a better AI benchmark

By Editor Times Featured, May 18, 2025


    The boundaries of conventional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it's partly because the test-scoring approach has been so effective for so long.

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to modern benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.
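The scoring rule behind a challenge like this is simple: a model assigns a score to every class, and an example counts as correct if the true label appears among the model's top-k guesses. As an illustration (the function name and toy data below are hypothetical, not from ImageNet itself), a minimal NumPy sketch:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest-scored classes."""
    # Indices of the k highest-scoring classes for each example
    top_k = np.argsort(scores, axis=1)[:, -k:]
    # A hit means the true label appears anywhere in that top-k set
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy data: 3 examples, 4 classes
scores = np.array([
    [0.05, 0.60, 0.25, 0.10],   # true class 1, ranked 1st
    [0.40, 0.30, 0.20, 0.10],   # true class 2, ranked 3rd
    [0.15, 0.10, 0.70, 0.05],   # true class 2, ranked 1st
])
labels = np.array([1, 2, 2])

print(top_k_accuracy(scores, labels, k=1))  # 2 of 3 correct at top-1
print(top_k_accuracy(scores, labels, k=3))  # all 3 labels fall within the top 3
```

Because the metric looks only at the ranked outputs, it is agnostic to how the model produced them, which is exactly the property that let ImageNet compare wildly different methods on equal footing.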

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet's convolutional neural nets would be the secret to unlocking image recognition, but after it scored well, no one dared dispute it. (One of AlexNet's developers, Ilya Sutskever, would go on to cofound OpenAI.)

A big part of what made this challenge so effective was that there was little practical difference between ImageNet's object classification task and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which, in turn, makes it hard to use the findings responsibly.

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. "We've moved from task-specific models to general-purpose models," Reuel says. "It's not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder."

Like the University of Michigan's Jacobs, Reuel thinks "the main issue with benchmarks is validity, even more than the practical implementation," noting: "That's where a lot of things break down." For a task as complicated as coding, for instance, it's nearly impossible to incorporate every possible scenario into your problem set. As a result, it's hard to gauge whether a model is scoring better because it's more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. "There are just many more knobs you can turn," says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. "When it comes to agents, they've sort of given up on the best practices for evaluation."


