
If AI can’t yet pass ‘Humanity’s Last Exam’, where does that leave ambitions for it?

By Editor Times Featured, February 2, 2026
How do you translate ancient Palmyrene script from a Roman tombstone?

How many paired tendons are supported by a specific sesamoid bone in a hummingbird? Can you identify closed syllables in Biblical Hebrew based on the latest scholarship on Tiberian pronunciation traditions?

These are some of the questions in “Humanity’s Last Exam”, a new benchmark introduced in a study published this week in Nature. The collection of 2,500 questions is specifically designed to probe the outer limits of what today’s artificial intelligence (AI) systems cannot do.

The benchmark represents a global collaboration of nearly 1,000 international experts across a range of academic fields. These academics and researchers contributed questions at the frontier of human knowledge. The problems required graduate-level expertise in mathematics, physics, chemistry, biology, computer science and the humanities. Importantly, every question was tested against leading AI models before inclusion. If an AI could answer it correctly at the time the test was designed, the question was rejected.
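The filtering step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the benchmark authors' actual pipeline: the `Question` type, the `is_correct` exact-match grader, the `filter_questions` helper and the `toy_model` stand-in are all hypothetical names introduced here, and the article does not say how answers were graded.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Question:
    prompt: str
    answer: str


# Assumption: a "model" is any function from prompt text to answer text.
Model = Callable[[str], str]


def is_correct(response: str, reference: str) -> bool:
    """Hypothetical grader: case-insensitive exact match."""
    return response.strip().lower() == reference.strip().lower()


def filter_questions(candidates: Iterable[Question],
                     models: Iterable[Model]) -> List[Question]:
    """Keep only the questions that none of the given models answers correctly."""
    model_list = list(models)
    kept = []
    for question in candidates:
        solved = any(is_correct(m(question.prompt), question.answer)
                     for m in model_list)
        if not solved:
            kept.append(question)
    return kept


def toy_model(prompt: str) -> str:
    """Stand-in for a real AI system: it only knows one easy answer."""
    memorised = {"What is 2 + 2?": "4"}
    return memorised.get(prompt, "unknown")


candidates = [
    Question("What is 2 + 2?", "4"),
    Question("Placeholder expert-level question", "placeholder answer"),
]
kept = filter_questions(candidates, [toy_model])
print([q.prompt for q in kept])
```

Because a question survives only when every model fails it, low initial scores are built into the selection process and are not an independent finding.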

This process explains why the initial results looked so different from other benchmarks. While AI chatbots score above 90% on popular tests, when Humanity’s Last Exam was first released in early 2025, leading models struggled badly. GPT-4o managed just 2.7% accuracy. Claude 3.5 Sonnet scored 4.1%. Even OpenAI’s most powerful model, o1, achieved only 8%.

The low scores were the point. The benchmark was constructed to measure what remained beyond AI’s grasp. And while some commentators have suggested that benchmarks like Humanity’s Last Exam chart a path towards artificial general intelligence, or even superintelligence (that is, AI systems capable of performing any task at human or superhuman levels), we believe this is wrong for three reasons.


Benchmarks measure task performance, not intelligence

When a student scores well on the bar exam, we can reasonably predict they’ll make a competent lawyer. That’s because the test was designed to assess whether humans have acquired the knowledge and reasoning skills needed for legal practice, and for humans, that works. The understanding required to pass genuinely transfers to the job.

But AI systems are not humans preparing for careers.

When a large language model scores well on the bar exam, it tells us the model can produce correct-looking answers to legal questions. It doesn’t tell us the model understands law, can counsel a nervous client, or exercise professional judgment in ambiguous situations.

The test measures something real for humans; for AI it measures only performance on the test itself.

Using human ability tests to benchmark AI is common practice, but it’s fundamentally misleading. Assuming a high test score means the machine has become more human-like is a category error, much like concluding that a calculator “understands” mathematics because it can solve equations faster than any person.

Human and machine intelligence are fundamentally different

Humans learn continuously from experience. We have intentions, needs and goals. We live lives, inhabit bodies and experience the world directly. Our intelligence evolved to serve our survival as organisms and our success as social creatures.

But AI systems are very different.

Large language models derive their capabilities from patterns in text during training. But they don’t really learn.

For humans, intelligence comes first and language serves as a tool for communication: intelligence is prelinguistic. But for large language models, language is the intelligence, and there’s nothing underneath.

Even the creators of Humanity’s Last Exam acknowledge this limitation:

“High accuracy on [Humanity’s Last Exam] would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence.”

Subbarao Kambhampati, professor at Arizona State University and former president of the Association for the Advancement of Artificial Intelligence, puts it more clearly:

“Humanity’s essence isn’t captured by a static test but rather by our ability to evolve and tackle previously unimaginable questions.”

Developers like leaderboards

There’s another problem. AI developers use benchmarks to optimise their models for leaderboard performance. They’re essentially cramming for the exam. And unlike humans, for whom learning for the test builds understanding, AI optimisation just means getting better at the specific test.

But it’s working.

Since Humanity’s Last Exam was published online in early 2025, scores have climbed dramatically. Gemini 3 Pro Preview now tops the leaderboard at 38.3% accuracy, followed by GPT-5 at 25.3% and Grok 4 at 24.5%.

Does this improvement mean these models are approaching human intelligence? No. It means they’ve gotten better at the kinds of questions the exam contains. The benchmark has become a target to optimise against.

The industry is recognising this problem.

OpenAI recently introduced a measure called GDPval, specifically designed to assess real-world usefulness.

Unlike academic-style benchmarks, GDPval focuses on tasks based on actual work products such as project documents, data analyses and deliverables that exist in professional settings.

What this means for you

If you’re using AI tools in your work or considering adopting them, don’t be swayed by benchmark scores. A model that aces Humanity’s Last Exam might still struggle with the specific tasks you need done.

It’s also worth noting the exam’s questions are heavily skewed toward certain domains. Mathematics alone accounts for 41% of the benchmark, with physics, biology and computer science making up much of the rest. If your work involves writing, communication, project management or customer service, the exam tells you almost nothing about which model might serve you best.

A practical approach is to devise your own tests based on what you actually need AI to do, then evaluate newer models against criteria that matter to you. AI systems are genuinely useful, but any discussion about superintelligence remains science fiction and a distraction from the real work of making these tools relevant to people’s lives.
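As a rough illustration of what devising your own tests might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `evaluate` helper, the two task cases with their pass/fail checks, and the `model_a` and `model_b` stand-ins, which in practice would be calls to whichever AI tools you are comparing.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function from prompt text to output text,
# and a "check" decides whether an output is acceptable for your purposes.
Model = Callable[[str], str]
Check = Callable[[str], bool]


def evaluate(model: Model, cases: List[Tuple[str, Check]]) -> float:
    """Return the fraction of task cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


# Hypothetical task cases drawn from everyday work, not from any benchmark.
cases = [
    ("Summarise this ticket in one sentence: printer is offline again.",
     lambda out: len(out.split()) <= 25 and "printer" in out.lower()),
    ("Reply politely to a customer asking for a refund.",
     lambda out: "refund" in out.lower() and "sorry" in out.lower()),
]


def model_a(prompt: str) -> str:
    """Stand-in model that gives the same reply to every prompt."""
    return "The printer is offline again."


def model_b(prompt: str) -> str:
    """Stand-in model that adapts its reply to the prompt."""
    if "refund" in prompt.lower():
        return "We are sorry for the trouble; your refund is on its way."
    return "The printer is offline."


scores = {name: evaluate(model, cases)
          for name, model in {"model_a": model_a, "model_b": model_b}.items()}
print(scores)
```

The checks here are deliberately crude keyword tests. The point is that the criteria come from your own tasks, so a score reflects how well a model handles the work you care about and not its leaderboard position.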

This article is republished from The Conversation under a Creative Commons license.


