
    If AI can’t yet pass ‘Humanity’s Last Exam’, where does that leave ambitions for it?

By Editor Times Featured · February 2, 2026 · 6 min read
How would you translate ancient Palmyrene script from a Roman tombstone?

How many paired tendons are supported by a particular sesamoid bone in a hummingbird? Can you identify closed syllables in Biblical Hebrew based on the latest scholarship on Tiberian pronunciation traditions?

These are some of the questions in “Humanity’s Last Exam”, a new benchmark presented in a study published this week in Nature. The collection of 2,500 questions is specifically designed to probe the outer limits of what today’s artificial intelligence (AI) systems cannot do.

The benchmark is a global collaboration of nearly 1,000 experts across a wide range of academic fields. These academics and researchers contributed questions at the frontier of human knowledge. The problems required graduate-level expertise in mathematics, physics, chemistry, biology, computer science and the humanities. Importantly, every question was tested against leading AI models before inclusion. If an AI could answer it correctly at the time the test was designed, the question was rejected.

This process explains why the initial results looked so different from those of other benchmarks. While AI chatbots score above 90% on popular tests, when Humanity’s Last Exam was first released in early 2025, leading models struggled badly. GPT-4o managed just 2.7% accuracy. Claude 3.5 Sonnet scored 4.1%. Even OpenAI’s strongest model, o1, achieved only 8%.

The low scores were the point. The benchmark was built to measure what remained beyond AI’s grasp. And while some commentators have suggested that benchmarks like Humanity’s Last Exam chart a path towards artificial general intelligence, or even superintelligence – that is, AI systems capable of performing any task at human or superhuman levels – we believe this is wrong, for three reasons.


Benchmarks measure task performance, not intelligence

When a student scores well on the bar exam, we can reasonably predict they will make a competent lawyer. That’s because the test was designed to assess whether humans have acquired the knowledge and reasoning skills needed for legal practice – and for humans, that works. The understanding required to pass genuinely transfers to the job.

But AI systems aren’t humans preparing for careers.

When a large language model scores well on the bar exam, it tells us the model can produce correct-looking answers to legal questions. It doesn’t tell us the model understands the law, can counsel a nervous client, or exercise professional judgement in ambiguous situations.

The test measures something real for humans; for AI, it measures only performance on the test itself.

Using human ability tests to benchmark AI is common practice, but it’s fundamentally misleading. Assuming a high test score means the machine has become more human-like is a category error, much like concluding that a calculator “understands” mathematics because it can solve equations faster than any person.

Human and machine intelligence are fundamentally different

Humans learn continuously from experience. We have intentions, needs and goals. We live lives, inhabit bodies and experience the world directly. Our intelligence evolved to serve our survival as organisms and our success as social creatures.

But AI systems are very different.

Large language models derive their capabilities from patterns in text during training. But they don’t really learn.

For humans, intelligence comes first and language serves as a tool for communication – intelligence is prelinguistic. But for large language models, language is the intelligence – there’s nothing beneath.

Even the creators of Humanity’s Last Exam acknowledge this limitation:

High accuracy on [Humanity’s Last Exam] would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence.

Subbarao Kambhampati, a professor at Arizona State University and former president of the Association for the Advancement of Artificial Intelligence, puts it more plainly:

Humanity’s essence isn’t captured by a static test but rather by our ability to evolve and tackle previously unimaginable questions.

Developers like leaderboards

There’s another problem. AI developers use benchmarks to optimise their models for leaderboard performance. They’re essentially cramming for the exam. And unlike for humans, for whom studying for the test builds understanding, AI optimisation just means getting better at the specific test.

And it’s working.

Since Humanity’s Last Exam was published online in early 2025, scores have climbed dramatically. Gemini 3 Pro Preview now tops the leaderboard at 38.3% accuracy, followed by GPT-5 at 25.3% and Grok 4 at 24.5%.

Does this improvement mean these models are approaching human intelligence? No. It means they’ve gotten better at the sorts of questions the exam contains. The benchmark has become a target to optimise against.

The industry is recognising this problem.

OpenAI recently introduced a measure called GDPval, specifically designed to assess real-world usefulness.

Unlike academic-style benchmarks, GDPval focuses on tasks based on actual work products such as project documents, data analyses and deliverables that exist in professional settings.

What this means for you

If you’re using AI tools in your work, or considering adopting them, don’t be swayed by benchmark scores. A model that aces Humanity’s Last Exam might still struggle with the actual tasks you need done.

It’s also worth noting that the exam’s questions are heavily skewed towards certain domains. Mathematics alone accounts for 41% of the benchmark, with physics, biology and computer science making up much of the rest. If your work involves writing, communication, project management or customer service, the exam tells you almost nothing about which model might serve you best.

A practical approach is to devise your own tests based on what you actually need AI to do, then evaluate newer models against criteria that matter to you. AI systems are genuinely useful – but any talk of superintelligence remains science fiction, and a distraction from the real work of making these tools relevant to people’s lives.
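A do-it-yourself evaluation of this kind can be sketched in a few lines of Python. Everything below is hypothetical: the `run_model` stub stands in for whichever model API you actually call, and the two tasks and their pass/fail checks are placeholders for prompts and criteria drawn from your own workload.

```python
# Minimal sketch of a personal evaluation harness: run your own task
# prompts through a model and score the outputs against your own checks.

def run_model(prompt: str) -> str:
    """Placeholder for a real model call (swap in your API client)."""
    return "42" if "meaning" in prompt else ""

# Each task pairs a prompt from your actual workload with a check that
# encodes what a good answer looks like for you.
TASKS = [
    {
        "prompt": "Summarise this support ticket: printer offline since Monday.",
        "check": lambda out: "printer" in out.lower(),
    },
    {
        "prompt": "What is the meaning of life, the universe and everything?",
        "check": lambda out: "42" in out,
    },
]

def evaluate(model, tasks) -> float:
    """Return the fraction of tasks whose output passes its check."""
    passed = sum(1 for task in tasks if task["check"](model(task["prompt"])))
    return passed / len(tasks)

score = evaluate(run_model, TASKS)
print(f"Pass rate: {score:.0%}")  # prints "Pass rate: 50%" for this stub
```

Because the checks are yours, the resulting pass rate measures the thing benchmark leaderboards cannot: performance on the work you actually need done. Rerun the same task set whenever a new model appears and compare scores directly.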

This article is republished from The Conversation under a Creative Commons license. Read the original article.
