How do you translate ancient Palmyrene script from a Roman tombstone?
How many paired tendons are supported by a particular sesamoid bone in a hummingbird? Can you identify closed syllables in Biblical Hebrew based on the latest scholarship on Tiberian pronunciation traditions?
These are some of the questions in “Humanity’s Last Exam”, a new benchmark presented in a study published this week in Nature. The collection of 2,500 questions is specifically designed to probe the outer limits of what today’s artificial intelligence (AI) systems can’t do.
The benchmark represents a global collaboration of nearly 1,000 international experts across a wide range of academic fields. These academics and researchers contributed questions at the frontier of human knowledge. The problems required graduate-level expertise in mathematics, physics, chemistry, biology, computer science and the humanities. Importantly, every question was tested against leading AI models before inclusion. If an AI could answer it correctly at the time the test was designed, the question was rejected.
This process explains why the initial results looked so different from other benchmarks. While AI chatbots score above 90% on popular tests, when Humanity’s Last Exam was first released in early 2025, leading models struggled badly. GPT-4o managed just 2.7% accuracy. Claude 3.5 Sonnet scored 4.1%. Even OpenAI’s strongest model, o1, achieved only 8%.
The low scores were the point. The benchmark was built to measure what remained beyond AI’s grasp. And while some commentators have suggested that benchmarks like Humanity’s Last Exam chart a path towards artificial general intelligence, and even superintelligence – that is, AI systems capable of performing any task at human or superhuman levels – we believe this is wrong for three reasons.
Benchmarks measure task performance, not intelligence
When a student scores well on the bar exam, we can reasonably predict they’ll make a competent lawyer. That’s because the test was designed to assess whether humans have acquired the knowledge and reasoning skills needed for legal practice – and for humans, that works. The understanding required to pass genuinely transfers to the job.
But AI systems aren’t humans preparing for careers.
When a large language model scores well on the bar exam, it tells us the model can produce correct-looking answers to legal questions. It doesn’t tell us the model understands the law, can counsel a nervous client, or exercise professional judgment in ambiguous situations.
The test measures something real for humans; for AI it measures only performance on the test itself.
Using human ability tests to benchmark AI is common practice, but it’s fundamentally misleading. Assuming a high test score means the machine has become more human-like is a category error, much like concluding that a calculator “understands” arithmetic because it can solve equations faster than any person.
Human and machine intelligence are fundamentally different
Humans learn continuously from experience. We have intentions, needs and goals. We live lives, inhabit bodies and experience the world directly. Our intelligence evolved to serve our survival as organisms and our success as social creatures.
But AI systems are very different.
Large language models derive their capabilities from patterns in text during training. But they don’t really learn.
For humans, intelligence comes first and language serves as a tool for communication – intelligence is prelinguistic. But for large language models, language is the intelligence – there’s nothing underneath.
Even the creators of Humanity’s Last Exam acknowledge this limitation:
High accuracy on [Humanity’s Last Exam] would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence.
Subbarao Kambhampati, professor at Arizona State University and former president of the Association for the Advancement of Artificial Intelligence, puts it more clearly:
Humanity’s essence isn’t captured by a static test but rather by our ability to evolve and tackle previously unimaginable questions.
Developers like leaderboards
There’s another problem. AI developers use benchmarks to optimise their models for leaderboard performance. They’re essentially cramming for the exam. And unlike humans, for whom studying for the test builds understanding, AI optimisation just means getting better at the specific test.
But it’s working.
Since Humanity’s Last Exam was published online in early 2025, scores have climbed dramatically. Gemini 3 Pro Preview now tops the leaderboard at 38.3% accuracy, followed by GPT-5 at 25.3% and Grok 4 at 24.5%.
Does this improvement mean these models are approaching human intelligence? No. It means they’ve gotten better at the sorts of questions the exam contains. The benchmark has become a target to optimise against.
The industry is recognising this problem.
OpenAI recently introduced a measure called GDPval specifically designed to assess real-world usefulness.
Unlike academic-style benchmarks, GDPval focuses on tasks based on real work products such as project documents, data analyses and deliverables that exist in professional settings.
What this means for you
If you’re using AI tools in your work or considering adopting them, don’t be swayed by benchmark scores. A model that aces Humanity’s Last Exam might still struggle with the actual tasks you need done.
It’s also worth noting the exam’s questions are heavily skewed towards certain domains. Mathematics alone accounts for 41% of the benchmark, with physics, biology and computer science making up much of the rest. If your work involves writing, communication, project management or customer service, the exam tells you almost nothing about which model might serve you best.
A sensible approach is to devise your own tests based on what you actually need AI to do, then evaluate newer models against criteria that matter to you. AI systems are genuinely useful – but any discussion about superintelligence remains science fiction and a distraction from the real work of making these tools relevant to people’s lives.
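To make the do-it-yourself testing idea concrete, here is a minimal sketch in Python of what such a personal evaluation could look like. Everything in it is hypothetical: the example prompts, the pass/fail checks and the stub_model placeholder are illustrations only, to be replaced with tasks from your own work and a call to whichever model you are actually comparing.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One task drawn from your own day-to-day work."""
    prompt: str                    # the exact request you would make for real
    check: Callable[[str], bool]   # your own domain-specific pass/fail judgment

def evaluate(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Run every task through the model and return the fraction that passed."""
    passed = sum(task.check(model(task.prompt)) for task in tasks)
    return passed / len(tasks)

# Hypothetical example tasks -- swap in prompts and checks that matter to you.
tasks = [
    Task(
        prompt="Summarise this support ticket in one sentence: ...",
        check=lambda reply: 0 < len(reply.split()) <= 30,  # crude brevity check
    ),
    Task(
        prompt="Extract the total from: 'Invoice total due: $1,234.56'",
        check=lambda reply: "1,234.56" in reply,
    ),
]

def stub_model(prompt: str) -> str:
    # Placeholder: replace with a call to whichever model or API you are testing.
    return "stubbed response"

print(f"Pass rate: {evaluate(stub_model, tasks):.0%}")
```

The specifics here are beside the point; what matters is the shape. The tasks come from your actual workflow, the pass criteria encode your own standards, and the same harness can be rerun against each new model release to see whether it helps with your work, whatever its leaderboard score.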
This article is republished from The Conversation under a Creative Commons license. Read the original article.