New secret math benchmark stumps AI models and PhDs alike

Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely challenging,” Tao said in feedback provided to Epoch. “I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages.”

A chart showing AI models’ limited success on the FrontierMath problems, taken from Epoch AI’s research paper. Credit: Epoch AI

To aid in the verification of correct answers during testing, the FrontierMath problems must have answers that can be automatically checked through computation, either as exact integers or mathematical objects. The designers made problems “guessproof” by requiring large numerical answers or complex mathematical solutions, with less than a 1 percent chance of correct random guesses.

Mathematician Evan Chen, writing on his blog, explained how he thinks FrontierMath differs from traditional math competitions like the International Mathematical Olympiad (IMO). Problems in that competition typically require creative insight while avoiding complex implementation and specialized knowledge, he says. But for FrontierMath, “they keep the first requirement, but outright invert the second and third requirement,” Chen wrote.

While IMO problems avoid specialized knowledge and complex calculations, FrontierMath embraces them. “Because an AI system has vastly greater computational power, it’s actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does; basically, ‘write a proof’ is replaced by ‘implement an algorithm in code,’” Chen explained.

The organization plans regular evaluations of AI models against the benchmark while expanding its problem set. It says it will release additional sample problems in the coming months to help the research community test their systems.
