
    How to Benchmark LLMs – ARC AGI 3

    By Editor Times Featured | August 1, 2025 | 9 Mins Read


    In the past couple of weeks, we have seen the release of powerful LLMs such as Qwen 3 MoE, Kimi K2, and Grok 4. We’ll continue to see such rapid improvements in the foreseeable future, and to compare the LLMs against each other, we require benchmarks. In this article, I discuss the newly released ARC AGI 3 benchmark and why frontier LLMs struggle to complete any tasks on the benchmark.

    Motivation

    Today, we’re announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI

    We’re releasing:
    * 3 games (environments)
    * $10K agent contest
    * AI agents API

    Starting scores – Frontier AI: 0%, Humans: 100% pic.twitter.com/3YY6jV2RdY

    — ARC Prize (@arcprize) July 18, 2025

    ARC AGI 3 was recently released.

    My motivation for writing this article is to stay on top of the latest developments in LLM technology. In the last couple of weeks alone, we have seen the Kimi K2 model (the best open-source model when released), Qwen 3 235B-A22B (currently the best open-source model), Grok 4, and so on. There is a lot happening in the LLM space, and one way to keep up is to track the benchmarks.

    I think the ARC AGI benchmark is particularly interesting, mainly because I want to see if LLMs can match human-level intelligence. ARC AGI puzzles are made so that humans are able to complete them, but LLMs will struggle.

    You can also read my article on Utilizing Context Engineering to Significantly Enhance LLM Performance and check out my website, which contains all my information and articles.


    Introduction to ARC AGI

    ARC AGI is essentially a puzzle game of pattern matching.

    • ARC AGI 1: You’re given a series of input-output pairs, and have to complete the pattern
    • ARC AGI 2: Similar to the first benchmark, performing pattern matching on input and output examples
    • ARC AGI 3: Here you’re playing a game, where you have to move your block into the goal area, with some required steps in between
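
    To make the input-output idea behind ARC AGI 1 and 2 concrete, here is a minimal sketch in Python. The grids, the toy task, and the helper names (flip_horizontal, fits_all_examples) are assumptions made up for illustration. They are not taken from the real benchmark or its data format. The sketch proposes one candidate rule, checks it against every example pair, and only then applies it to the test input.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]  # each integer stands for a cell color


def flip_horizontal(grid: Grid) -> Grid:
    """Candidate rule: mirror every row left to right."""
    return [list(reversed(row)) for row in grid]


def fits_all_examples(rule: Callable[[Grid], Grid],
                      examples: List[Tuple[Grid, Grid]]) -> bool:
    """A rule is only accepted if it reproduces every example output."""
    return all(rule(inp) == out for inp, out in examples)


# Hypothetical toy task, NOT a real ARC AGI puzzle.
train_examples: List[Tuple[Grid, Grid]] = [
    ([[1, 0, 0], [2, 0, 0]], [[0, 0, 1], [0, 0, 2]]),
    ([[3, 4], [0, 5]], [[4, 3], [5, 0]]),
]
test_input: Grid = [[7, 0], [0, 8]]

# Apply the rule to the test input only if it explains all examples.
prediction: Optional[Grid] = (
    flip_horizontal(test_input)
    if fits_all_examples(flip_horizontal, train_examples)
    else None
)
print(prediction)
```

    A real solver would have to search over many candidate rules instead of being handed the right one, which is where the difficulty lies.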

    I think it’s cool to try out these puzzle games and complete them myself. Then, you can see LLMs initially struggle with the benchmarks, and then increase their performance with better models. OpenAI, for example, scored:

    • 7.8% with o1 mini
    • 75% with o3-low
    • 88% with o3-high

    As you can also see in the image below:

    This figure shows the performance of different OpenAI models on the ARC AGI 1 benchmark. You can see how performance increases with more advanced models. Image from ARC AGI, which is under the Apache 2 license.

    Playing the ARC AGI benchmark

    You can also try the ARC AGI benchmarks yourself or build an AI to perform the tasks. Go to the ARC AGI 3 website and start playing the game.

    The whole point of the games is that you have no instructions, and you have to figure out the rules yourself. I enjoy this concept, as it represents figuring out an entirely new problem, without any help. This highlights your ability to learn new environments, adapt to them, and solve problems.

    You can see a recording of me playing ARC AGI 3 here, encountering the problems for the first time. I was unfortunately unable to embed the link in the article. Still, it was super interesting to try out the benchmark and imagine the challenge an LLM has to go through to solve it. I first observe the environment and what happens when I perform the different actions. An action in this case is pressing one of the relevant buttons. Some actions do nothing, while others affect the environment. I then proceed to uncover the goal of the puzzle (for example, get the object to the goal area) and try to achieve this goal.
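
    The observe, act, and compare routine described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: ToyGridEnv is a made-up one-dimensional stand-in for a game, the button names are invented, and none of it uses the actual ARC AGI 3 agents API. The sketch presses each button once, records how the observation changed, and then repeats the button that moved the block toward the goal.

```python
from typing import Dict


class ToyGridEnv:
    """Hypothetical game: a block on a 5-cell track, goal at cell 4."""

    ACTIONS = ("button_1", "button_2", "button_3")
    GOAL = 4

    def __init__(self) -> None:
        self.position = 2

    def observe(self) -> int:
        return self.position

    def step(self, action: str) -> int:
        if action == "button_1":
            self.position = min(self.position + 1, 4)
        elif action == "button_2":
            self.position = max(self.position - 1, 0)
        # button_3 deliberately has no effect
        return self.position


def probe_actions(env: ToyGridEnv) -> Dict[str, int]:
    """Press each button once and record the change in the observation."""
    effects: Dict[str, int] = {}
    for action in env.ACTIONS:
        before = env.observe()
        after = env.step(action)
        effects[action] = after - before
    return effects


env = ToyGridEnv()
effects = probe_actions(env)

# Repeat the action that moved the block toward the goal (capped at 10 tries).
forward = max(effects, key=effects.get)
steps = 0
while env.observe() != ToyGridEnv.GOAL and steps < 10:
    env.step(forward)
    steps += 1

print(effects, forward, steps)
```

    In this toy case the goal is known in advance. In ARC AGI 3 the player also has to discover what the goal is, which makes the exploration far harder.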

    Why frontier models achieve 0%

    This article states that when frontier models were tested on the ARC AGI 3 preview, they achieved 0%. This might sound disappointing to some people, considering you were probably able to complete a lot of the tasks yourself, relatively quickly.

    As I previously discussed, several OpenAI models have had success with the earlier ARC AGI benchmarks, with their best model reaching 88% on the first version. However, initially, models achieved 0%, or scores in the low single-digit percentages.

    I have a few theories for why frontier models were not able to perform tasks on ARC AGI 3:

    Context length

    When working on ARC AGI 3, you don’t get any information about the game. The model thus has to try out a variety of actions and see the output of those actions (for example, nothing happens, or a block moves). The model then has to evaluate the actions it took, along with the output, and consider its next moves.

    I believe the action space on ARC AGI 3 is very large, and it is thus difficult for the models to both experiment enough to find the correct action and avoid repeating unsuccessful actions. The models essentially have a problem with their context length and with utilizing the full length of it.

    I recently read an interesting article from Manus about how they develop their agents and manage their memory. You can use techniques such as summarizing previous context or using a file system to store important context. I believe this will be key to increasing performance on the ARC AGI 3 benchmark.
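
    As a rough illustration of the summarizing idea, here is a minimal Python sketch of an agent memory that keeps only the last few steps verbatim and folds older steps into a compact per-action summary. The ActionMemory class, its method names, and the outcome strings are all hypothetical. This is not how Manus or any ARC AGI 3 agent is implemented, only one simple way to keep the prompt short while still avoiding actions that are known to do nothing.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class ActionMemory:
    """Keeps recent steps verbatim and compresses older ones per action."""

    def __init__(self, max_recent: int = 3) -> None:
        self.max_recent = max_recent
        self.recent: Deque[Tuple[str, str]] = deque()
        self.summary: Dict[str, str] = {}  # action -> last evicted outcome

    def record(self, action: str, outcome: str) -> None:
        self.recent.append((action, outcome))
        # Fold the oldest steps into the summary once the window is full.
        while len(self.recent) > self.max_recent:
            old_action, old_outcome = self.recent.popleft()
            self.summary[old_action] = old_outcome

    def known_useless(self, action: str) -> bool:
        """True if the latest known outcome of this action was 'no change'."""
        recent_outcomes = [o for a, o in self.recent if a == action]
        if recent_outcomes:
            return recent_outcomes[-1] == "no change"
        return self.summary.get(action) == "no change"

    def to_prompt(self) -> str:
        """Render the compact memory that would be passed to the model."""
        lines = ["Summary of older steps:"]
        lines += [f"{a}: {o}" for a, o in sorted(self.summary.items())]
        lines.append("Recent steps:")
        lines += [f"{a} -> {o}" for a, o in self.recent]
        return "\n".join(lines)


memory = ActionMemory(max_recent=2)
memory.record("button_3", "no change")
memory.record("button_1", "block moved right")
memory.record("button_2", "block moved left")
print(memory.to_prompt())
```

    The summary dictionary could just as well be written to a file between steps, which is closer to the file-system approach mentioned above.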

    Training dataset

    Another main reason frontier models are unable to complete ARC AGI 3 tasks successfully is that the tasks are very different from their training dataset. LLMs will almost always perform far better on a task if such a task (or a similar one) is included in the training dataset. In this instance, I believe LLMs have little training data on working with games, for example. Furthermore, an important point here is the agentic training data for the LLMs.

    By agentic training data, I mean data where the LLM is using tools and performing actions. I believe we are seeing a rapid increase in LLMs used as agents, and thus the proportion of training data for agentic behavior is rapidly increasing. However, it might be that current frontier models are still not as good at performing such actions, though this will likely improve rapidly in the coming months.

    Some people will highlight how this proves LLMs do not have real intelligence: The whole point of intelligence (and the ARC AGI benchmark) is to be able to understand tasks without any clues, only by examining the environment. To some extent, I agree with this point, and I hope to see models perform better on ARC AGI because of increased model intelligence, and not because of benchmark chasing, a concept I explore later in this article.

    Benchmark performance in the future

    In the future, I believe we will see vast improvements in model performance on ARC AGI 3, mostly because I think you can create AI agents that are fine-tuned for agentic performance and that optimally utilize their memory. I believe relatively cheap improvements can be used to vastly improve performance, though I also expect more expensive improvements (for example, the release of GPT-5) will perform well on this benchmark.

    Benchmark chasing

    I think it’s important to leave a section about benchmark chasing. Benchmark chasing is the concept of LLM providers chasing optimal scores on benchmarks, rather than simply creating the best or most intelligent LLMs. This is a problem because the correlation between benchmark performance and LLM intelligence is not 100%.

    In the reinforcement learning world, benchmark chasing would be called reward hacking: a scenario where the agent figures out a way to hack the environment it is in to achieve a reward, without properly performing a task.

    The reason LLM providers do this is that whenever a new model is released, people usually look at two things:

    • Benchmark performance
    • Vibe

    Benchmark performance is usually measured on known benchmarks, such as SWE-bench and ARC AGI. Vibe testing is also a way LLMs are often measured by the public (I’m not saying it’s a good way of testing the model, I’m simply saying it happens in practice). The problem with this, however, is that I believe it’s quite simple to impress people with the vibe of a model, because vibe checking tries only a very small percentage of the action space for the LLM. You may only be asking it certain questions that are available on the web, or asking it to program an application that the model has already seen 1000 instances of in its training data.

    Thus, what you should do is have a benchmark of your own, for example, an in-house dataset that has not been leaked to the internet. Then you can benchmark which LLM works best for your use case and prioritize using this LLM.
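
    A private benchmark does not need to be elaborate. The Python sketch below assumes a tiny in-house question and answer set and scores models by exact match. The dataset entries, the model_a and model_b stubs, and the exact_match_accuracy helper are all placeholders invented for illustration. In practice each stub would be replaced by a call to the LLM you want to evaluate, and exact match would likely be swapped for a metric that fits your task.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]  # (prompt, expected answer)


def exact_match_accuracy(model: Callable[[str], str], dataset: Dataset) -> float:
    """Fraction of prompts where the model output equals the expected answer."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return correct / len(dataset)


# Hypothetical in-house examples that were never published online.
private_dataset: Dataset = [
    ("Which team owns the billing service?", "payments"),
    ("What is the code name of project 7?", "falcon"),
    ("Which region hosts the staging cluster?", "eu-west"),
]


def model_a(prompt: str) -> str:
    """Stub standing in for a real LLM call: always gives the same answer."""
    return "payments"


def model_b(prompt: str) -> str:
    """Stub standing in for a real LLM call: knows two of the three answers."""
    answers = {
        "Which team owns the billing service?": "Payments",
        "What is the code name of project 7?": "falcon",
    }
    return answers.get(prompt, "unknown")


models: Dict[str, Callable[[str], str]] = {"model_a": model_a, "model_b": model_b}
scores = {name: exact_match_accuracy(fn, private_dataset) for name, fn in models.items()}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

    Because the dataset never leaves your organization, a high score here is much harder to explain by training-data leakage than a high score on a public benchmark.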

    Conclusion

    In this article, I have discussed LLM benchmarks and why they are important for comparing LLMs. I have introduced you to the newly released ARC AGI 3 benchmark. This benchmark is super interesting considering humans are easily able to complete some of the tasks, while frontier models score 0%. It thus represents a task where human intelligence still outperforms LLMs.

    As we advance, I believe we will see rapid improvements in LLM performance on ARC AGI 3, though I hope this will not be the result of benchmark chasing, but rather of improvements in the intelligence of LLMs.




