    How Good Are New GPT-OSS Models? We Put Them to the Test.

By Editor Times Featured · September 16, 2025 · 5 Mins Read


OpenAI hadn't released an open-weight language model since GPT-2 back in 2019. Six years later, they surprised everyone with two: gpt-oss-120b and the smaller gpt-oss-20b.

Naturally, we wanted to know: how do they actually perform?

To find out, we ran both models through our open-source workflow optimization framework, syftr. It evaluates models across different configurations (fast vs. cheap, high vs. low accuracy) and includes support for OpenAI's new "thinking effort" setting.

In theory, more thinking should mean better answers. In practice? Not always.

We also use syftr to explore questions like "is LLM-as-a-Judge actually working?" and "what workflows perform well across many datasets?"

Our first results with GPT-OSS might surprise you: the best performer wasn't the biggest model or the deepest thinker.

Instead, the 20b model with low thinking effort consistently landed on the Pareto frontier, even rivaling the 120b medium configuration on benchmarks like FinanceBench, HotpotQA, and MultihopRAG. Meanwhile, high thinking effort rarely mattered at all.

How we set up our experiments

We didn't just pit GPT-OSS against itself. We wanted to see how it stacked up against other strong open-weight models, so we compared gpt-oss-20b and gpt-oss-120b with:

    • qwen3-235b-a22b
    • glm-4.5-air
    • nemotron-super-49b
    • qwen3-30b-a3b
    • gemma3-27b-it
    • phi-4-multimodal-instruct

To test OpenAI's new "thinking effort" feature, we ran each GPT-OSS model in three modes: low, medium, and high thinking effort. That gave us six configurations in total:

    • gpt-oss-120b-low / -medium / -high
    • gpt-oss-20b-low / -medium / -high

For evaluation, we cast a wide net: five RAG and agent modes, 16 embedding models, and a range of flow configuration options. To judge model responses, we used GPT-4o-mini and compared answers against known ground truth.
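The LLM-as-a-Judge step can be sketched roughly as follows. This is a minimal illustration, not syftr's actual implementation: the prompt wording and the one-word verdict format are assumptions made for the example.

```python
# Minimal sketch of LLM-as-a-Judge grading. The judge model (e.g. GPT-4o-mini)
# receives the question, the reference answer, and the candidate answer, and
# returns a verdict that we parse into a boolean score.

JUDGE_TEMPLATE = """You are grading a model's answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the grading template for one Q&A pair."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_verdict(judge_reply: str) -> bool:
    """Map the judge model's one-word reply to a boolean score."""
    return judge_reply.strip().upper().startswith("CORRECT")
```

In a real pipeline, `build_judge_prompt` output would be sent to the judge model and its reply fed to `parse_verdict`; accuracy is then the mean of these booleans over the dataset.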

Finally, we tested across four datasets:

    • FinanceBench (financial reasoning)
    • HotpotQA (multi-hop QA)
    • MultihopRAG (retrieval-augmented reasoning)
    • PhantomWiki (synthetic Q&A pairs)

We optimized workflows twice: once for accuracy + latency, and once for accuracy + cost, capturing the tradeoffs that matter most in real-world deployments.
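The core idea behind both optimizations is the Pareto frontier: keep only the configurations that no other configuration beats on both objectives at once. A small sketch (the run data here is illustrative, not our measured results):

```python
# Sketch: given (accuracy, latency) per workflow, keep the Pareto frontier,
# i.e. configs where no other config is both at least as accurate and at
# least as fast, with a strict improvement on one of the two.

def pareto_frontier(results):
    """results: list of (name, accuracy, latency).
    Higher accuracy and lower latency are better."""
    frontier = []
    for name, acc, lat in results:
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in results
        )
        if not dominated:
            frontier.append((name, acc, lat))
    return frontier


# Illustrative numbers only: a cheap-fast config, a slower-accurate config,
# and a config that is dominated (slower AND less accurate than medium).
runs = [
    ("gpt-oss-20b-low", 0.57, 1.2),
    ("gpt-oss-120b-medium", 0.62, 4.8),
    ("gpt-oss-120b-high", 0.61, 9.5),
]
print(pareto_frontier(runs))
```

The same routine applies to accuracy + cost by swapping latency for dollars per query; syftr searches the configuration space and reports exactly this kind of non-dominated set.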

Optimizing for latency, cost, and accuracy

When we optimized the GPT-OSS models, we looked at two tradeoffs: accuracy vs. latency and accuracy vs. cost. The results were more surprising than we expected:

    • GPT-OSS 20b (low thinking effort):
      Fast, inexpensive, and consistently accurate. This setup appeared on the Pareto frontier repeatedly, making it the best default choice for most non-scientific tasks. In practice, that means quicker responses and lower bills compared to higher thinking efforts.
    • GPT-OSS 120b (medium thinking effort):
      Best suited to tasks that demand deeper reasoning, like financial benchmarks. Use this when accuracy on complex problems matters more than cost.
    • GPT-OSS 120b (high thinking effort):
      Expensive and usually unnecessary. Keep it in your back pocket for edge cases where other models fall short. For our benchmarks, it didn't add value.
    Figure 1: Accuracy-latency optimization with syftr
    Figure 2: Accuracy-cost optimization with syftr

Reading the results more carefully

At first glance, the results look straightforward. But there's an important nuance: an LLM's top accuracy score depends not just on the model itself, but on how the optimizer weighs it against other models in the mix. For example, let's look at FinanceBench.

When optimizing for latency, all GPT-OSS models (except high thinking effort) landed on similar Pareto frontiers. In this case, the optimizer had little reason to focus on the 20b low thinking configuration, since its top accuracy was only 51%.

    Figure 3: Per-LLM Pareto frontiers for latency optimization on FinanceBench

When optimizing for cost, the picture shifts dramatically. The same 20b low thinking configuration jumps to 57% accuracy, while the 120b medium configuration actually drops 22%. Why? Because the 20b model is far cheaper, so the optimizer shifts more weight toward it.

    Figure 4: Per-LLM Pareto frontiers for cost optimization on FinanceBench

The takeaway: performance depends on context. Optimizers will favor different models depending on whether you're prioritizing speed, cost, or accuracy. And given the huge search space of possible configurations, there may be even better setups beyond the ones we tested.
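This "different winner under different priorities" effect is easy to see with a simple scalarized objective. The sketch below uses made-up accuracy and cost numbers purely to show how the preferred configuration flips as the cost weight changes:

```python
# Sketch: score each candidate as accuracy minus a weighted cost penalty.
# As the cost weight grows, the cheap 20b-low config overtakes the more
# accurate but pricier 120b-medium config. Numbers are illustrative only.

def best_config(candidates, cost_weight):
    """candidates: list of (name, accuracy, cost_per_query).
    Returns the name with the highest accuracy - cost_weight * cost."""
    return max(candidates, key=lambda c: c[1] - cost_weight * c[2])[0]


candidates = [
    ("gpt-oss-20b-low", 0.57, 0.01),      # cheap, decent accuracy
    ("gpt-oss-120b-medium", 0.65, 0.10),  # more accurate, 10x the cost
]

print(best_config(candidates, cost_weight=0.1))  # accuracy dominates
print(best_config(candidates, cost_weight=2.0))  # cost dominates
```

syftr's multi-objective search reports the whole frontier rather than a single scalarized winner, but the underlying tension is the same: the "best" model is a function of your weights.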

Finding agentic workflows that work well in your setup

The new GPT-OSS models performed strongly in our tests, especially the 20b with low thinking effort, which often outpaced more expensive rivals. The bigger lesson? More model and more effort doesn't always mean more accuracy. Sometimes, paying more just gets you less.

That's exactly why we built syftr and made it open source. Every use case is different, and the best workflow for you depends on the tradeoffs you care about most. Want lower costs? Faster responses? Maximum accuracy?

Run your own experiments and find the Pareto sweet spot that balances these priorities in your setup.



