    Artificial Intelligence

    Your ReAct Agent Is Wasting 90% of Its Retries — Here’s How to Stop It

By Editor Times Featured · April 12, 2026 · 20 Mins Read


Who this is for: ML engineers and AI builders running LLM agents in production, particularly ReAct-style systems using LangChain, LangGraph, AutoGen, or custom tool loops. If you're new to ReAct, it's a prompting pattern where an LLM alternates between Thought, Action, and Observation steps to solve tasks using tools.


Your agents are burning nearly all of their retry budget on errors that can never succeed.

In a 200-task benchmark, 90.8% of retries were wasted, not because the model was wrong, but because the system kept retrying tools that didn't exist. Not "unlikely to succeed." Guaranteed to fail.

I didn't find this by tuning prompts. I found it by instrumenting every retry, classifying every error, and tracking exactly where the budget went. The root cause turned out to be a single architectural assumption: letting the model choose the tool name at runtime.

Here's what makes this particularly dangerous. Your monitoring dashboard is almost certainly not showing it. Right now it probably shows:

• Success rate: fine
• Latency: acceptable
• Retries: within limits

What it doesn't show: how many of those retries were impossible from the first attempt. That's the gap this article is about.

Simulation note: All results come from a deterministic simulation using calibrated parameters, not live API calls. The hallucination rate (28%) is a conservative estimate for tool-call hallucination in ReAct-style agents derived from failure mode analysis in published GPT-4-class benchmarks (Yao et al., 2023; Shinn et al., 2023); it is not a directly reported figure from those papers. Structural conclusions hold as architectural properties; exact percentages will vary in production. Full limitations are discussed at the end. Reproduce every number yourself: python app.py --seed 42.

    GitHub Repository: https://github.com/Emmimal/react-retry-waste-analysis

In production, this means you're paying for retries that can't succeed, and starving the ones that could.


Left: string-based tool routing passes the model's output directly to TOOLS.get(); a hallucinated name returns None, burns retry budget through a global counter with no error taxonomy, and fails silently. Right: deterministic routing resolves tool names from a Python dict at plan time, classifies errors before retrying, and makes hallucination at the routing layer structurally impossible. Image by Author.

    TL;DR

90.8% of retries were wasted on errors that could never succeed. Root cause: letting the model choose tool names at runtime (TOOLS.get(tool_name)). Prompts don't fix it; a hallucinated tool name is a permanent error. No retry can make a missing key appear in a dictionary.

Three structural fixes eliminate the problem: classify errors before retrying, use per-tool circuit breakers, move tool routing into code. Result: 0% wasted retries, 3x lower step variance, predictable execution.


The Law This Article Is Built On

Before the data, the principle, stated once, bluntly:

Retrying only makes sense for errors that can change. A hallucinated tool name cannot change. Therefore, retrying it is guaranteed waste.

This isn't a probability argument. It's not "hallucinations are rare enough to ignore." It's a logical property: TOOLS.get("web_browser") returns None on the first attempt, the second, and every attempt after. The tool doesn't exist. The retry counter doesn't know that. It burns a budget slot anyway.

The entire problem flows from this mismatch. The fix does too.


The One Line Silently Draining Your Retry Budget

It appears in almost every ReAct tutorial. You've probably written it:

    tool_fn = TOOLS.get(tool_name)   # ◄─ THE LINE

    if tool_fn is None:
        # No error taxonomy here.
        # TOOL_NOT_FOUND looks identical to a transient network blip.
        # The global retry counter burns budget on a tool
        # that will never exist, and logs that as a "failure".

This is the line. Everything else in this article follows from it.

When an LLM hallucinates a tool name (web_browser, sql_query, python_repl), TOOLS.get() returns None. The agent knows the tool doesn't exist. The global retry counter doesn't. It treats TOOL_NOT_FOUND identically to TRANSIENT: same budget slot, same retry logic, same backoff.

The cascade: every hallucination consumes retry slots that could have handled a real failure. When a genuine network timeout arrives two steps later, there's nothing left. The task fails, logged as generic retry exhaustion, with no trace of a hallucinated tool name being the root cause.

If your logs contain retries on TOOL_NOT_FOUND, you already have this problem. The only question is what fraction of your budget it's consuming. In this benchmark, the answer was 90.8%.
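To make the mechanics concrete, here is a minimal sketch of the failure mode. TOOLS, MAX_RETRIES, and run_step are hypothetical stand-ins for illustration, not the benchmark's actual code:

```python
# Minimal sketch of the failure mode: a global retry budget with no
# error taxonomy. All names here are illustrative stand-ins.
TOOLS = {"search": lambda q: f"results for {q}"}
MAX_RETRIES = 6  # one global budget shared by every error type

def run_step(tool_name, arg, budget):
    while budget[0] > 0:
        tool_fn = TOOLS.get(tool_name)
        if tool_fn is None:
            # TOOL_NOT_FOUND is permanent, but the global counter treats it
            # like a transient blip: burn a slot and try again.
            budget[0] -= 1
            continue
        return tool_fn(arg)
    return None  # budget exhausted, task dead

budget = [MAX_RETRIES]
run_step("web_browser", "query", budget)  # hallucinated name drains everything
print(budget[0])  # 0: nothing left for a real transient failure later
```

One hallucinated name empties the whole budget before any legitimate failure gets a chance to retry; that is the cascade described above in six lines of control flow.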


The Benchmark Setup

Two agents, 200 tasks, same simulated parameters, same tools, same failure rates, with one structural difference.

Comparison note: This benchmark compares a naive ReAct baseline against a workflow with all three fixes applied. Fixes 1 (error taxonomy) and 2 (per-tool circuit breakers) are independently applicable to a ReAct agent without changing its architecture. Fix 3 (deterministic tool routing) is the structural differentiator; it is what makes hallucination at the routing layer impossible. The gap shown is cumulative; keep this in mind when reading the numbers.

ReAct agent: Standard Thought → Action → Observation loop. Single global retry counter (MAX_REACT_RETRIES = 6, MAX_REACT_STEPS = 10). No error taxonomy. Tool name comes from LLM output at runtime. Each hallucinated tool name burns exactly 3 retry slots (HALLUCINATION_RETRY_BURN = 3); this constant directly drives the 90.8% waste figure and is discussed further in Limitations.

Controlled workflow: Deterministic plan execution where tool routing is a Python dict lookup resolved at plan time. Error taxonomy applied at the point of failure. Per-tool circuit breakers (trips after 3 consecutive failures, recovery probe after 5 simulated seconds, closes after 2 probe successes). Retry logic scoped to error class.

Simulation parameters:

Parameter | Value | Notes
Seed | 42 | Global random seed
Tasks | 200 | Per experiment
Hallucination rate | 28% | Conservative estimate from published benchmarks
Loop detection rate | 18% | Applied to steps with history length > 2
HALLUCINATION_RETRY_BURN | 3 | Retry slots burned per hallucination
MAX_REACT_RETRIES | 6 | Global retry budget
MAX_REACT_STEPS | 10 | Step cap per task
Token cost proxy | $3/1M tokens | Mid-range estimate for GPT-4-class models
Sensitivity rates | 5%, 15%, 28% | Hallucination rates for sweep

HALLUCINATION_RETRY_BURN is the direct mechanical driver of the 90.8% waste figure. At a value of 1, fewer slots are burned per event; the workflow's wasted count stays at 0 regardless. Run the sensitivity check yourself: modify this constant and observe that the workflow always wastes zero retries.

The simulation uses three tools (search, calculate, summarise) with realistic failure rates per tool. Tool cost is tracked at 200 tokens per LLM step.

Every number in this article is reproduced exactly by python app.py --seed 42.

What the Benchmark Found

Success Rate Hides the Real Problem

ReAct succeeded on 179/200 tasks (89.5%). The workflow succeeded on 200/200 (100.0%).

ReAct vs deterministic workflow comparison shows similar success rates but a critical difference in hallucination events, where ReAct logs 155 hallucinations while the workflow eliminates them entirely, exposing a hidden reliability gap in agent design. Image by Author.

The 10.5% gap is real. But success rate is a pass/fail metric; it says nothing about how close to the edge a passing run came, or what it burned to get there. The more informative number is what happened inside those 179 "successful" ReAct runs. Specifically: where did the retry budget go?

The Retry Budget

ReAct agents waste nearly all of their retry budget on non-retryable errors, while the workflow ensures every retry targets recoverable failures, revealing a major inefficiency in standard agent retry logic. Image by Author.

Metric | ReAct | Workflow
Total retries | 513 | 80
Useful (retryable errors) | 47 | 80
Wasted (non-retryable errors) | 466 | 0
Waste rate | 90.8% | 0.0%
Avg retries / task | 2.56 | 0.40

466 of 513 retries (90.8%) targeted errors that cannot succeed by definition. The workflow fired 80 retries; every single one was useful. The gap is 6.4x in total retries and 466-to-0 in wasted ones. That isn't a performance difference. It's a structural one.

A note on the mechanics: HALLUCINATION_RETRY_BURN = 3 means each hallucinated tool name burns exactly 3 retry slots in the ReAct simulation. The 90.8% figure is sensitive to this constant; at a value of 1, fewer retries are wasted per hallucination event. But the structural property holds at every value: the workflow wastes zero retries regardless, because non-retryable errors are classified and skipped before any slot is consumed. Run the sensitivity check yourself: modify HALLUCINATION_RETRY_BURN and observe that the workflow's wasted count stays at 0.

Why 19 of 21 ReAct Failures Had Identical Root Causes

Failure reason | Runs | % of failures
hallucinated_tool_exhausted_retries | 19 | 90.5%
tool_error_exhausted_retries:rate_limited | 1 | 4.8%
tool_error_exhausted_retries:dependency_down | 1 | 4.8%

19 of 21 failures: hallucinated tool name, global retry budget exhausted, task dead. Not network failures. Not rate limits. Hallucinated strings retried until nothing was left. The workflow had zero failures across 200 tasks.

Your success rate dashboard will never surface this. The failure reason is buried inside the retry loop with no taxonomy to extract it. That's the dashboard blindness the title promises, and it's worse than it sounds, because it means you have no signal when things are degrading, only when they've already failed.

The Error Taxonomy: From "Unknown" to Fully Classified

The root fix is classifying errors at the point they are raised. Three classes are retryable; three are not:

    # Retryable: can succeed on a subsequent attempt
    RETRYABLE = {TRANSIENT, RATE_LIMITED, DEPENDENCY_DOWN}

    # Non-retryable: retrying wastes budget by definition
    NON_RETRYABLE = {INVALID_INPUT, TOOL_NOT_FOUND, BUDGET_EXCEEDED}

When every error carries a class, the retry decision becomes one line:

    if not exc.is_retryable():
        log(RETRY_SKIPPED)   # zero budget consumed
        break

The full taxonomy from the 200-task run:

Error taxonomy exposes the root failure mode in ReAct agents, dominated by hallucination errors, while the workflow replaces them with controlled circuit breaker events for better fault handling. Image by Author.

Error type | ReAct | Workflow
hallucination | 155 | 0
rate_limited | 24 | 22
dependency_down | 16 | 23
loop_detected | 8 | 0
transient | 7 | 26
circuit_open | 0 | 49
invalid_input | 1 | 0

ReAct's dominant event is hallucination: 155 events, all non-retryable, all burning budget. The workflow's dominant event is circuit_open: 49 fast-fails that never touched an upstream service. The workflow logged zero hallucination events because it never asks the model to produce a tool name string.

You cannot hallucinate a key in a dict you never ask the model to produce.

This is an architectural guarantee within the simulation design. In a real system where the LLM contributes to plan generation, hallucinations could still occur upstream of tool routing. The guarantee holds precisely where routing is fully deterministic and the model's output is limited to plan structure, not tool name strings.

The eight loop_detected events in ReAct come from an 18% loop rate applied when len(history) > 2: the model "decides to think more" rather than act, consuming a step without calling a tool. The workflow has no equivalent because it doesn't give the model step-selection authority.

Step Predictability: The Hidden Instability σ Reveals

Step distribution reveals hidden instability in ReAct agents, where high variance leads to unpredictable execution, while the workflow maintains consistent and controlled step counts. Image by Author.

Metric | ReAct | Workflow
Avg steps / task | 2.88 | 2.69
Std dev (σ) | 1.36 | 0.46

The means are nearly identical. The distributions are not. Standard deviation is 3x higher for ReAct.

Workflow σ holds at 0.46 across all hallucination rates tested, not by coincidence, but because plan structure is fixed. Task type (math, summary, search) determines step count at plan time. The hallucination roll doesn't affect step count when tool routing never passes through the model's output.

In production, high σ means: unpredictable latency (SLAs can't be committed to), unpredictable token cost (budget forecasts are inaccurate), and invisible burst load (a bad cluster of long-running tasks arrives with no warning). Predictability is a production property. Success rate doesn't measure it. σ does.
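You can run the same check on your own system: collect per-task step counts from your logs and compare spreads, not just means. The step counts below are illustrative examples, not the benchmark data:

```python
# Comparing step-count spread, not just averages. The two lists are
# illustrative examples, not the benchmark's recorded data.
from statistics import mean, stdev

react_steps    = [2, 2, 3, 5, 1, 6, 2, 3, 4, 1]   # high-variance pattern
workflow_steps = [3, 3, 2, 3, 3, 2, 3, 3, 2, 3]   # plan-fixed pattern

for name, steps in [("react", react_steps), ("workflow", workflow_steps)]:
    # Similar means can hide very different sigmas
    print(f"{name}: mean={mean(steps):.2f} sigma={stdev(steps):.2f}")
```

Two agents with near-identical average step counts can still differ severalfold in σ, which is exactly the gap the table above reports.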


The Three Structural Fixes

Fix 1: Classify Errors Before Deciding Whether to Retry

With errors classified at the point they are raised, the retry loop can consult the class before consuming any budget:

    def call_tool_with_retry(tool_name, args, logger, ledger,
                             step, max_retries=2, fallback=None):
        last_error = None
        for attempt in range(max_retries + 1):
            try:
                return call_tool_with_circuit_breaker(tool_name, args, ...)
            except AgentError as exc:
                last_error = exc
                if not exc.is_retryable():
                    # Non-retryable: RETRY_SKIPPED, zero budget consumed
                    logger.log(RETRY_SKIPPED, error_kind=exc.kind.value)
                    break                          # ← this line drops waste to 0
                if attempt < max_retries:
                    ledger.add_retry(wasted=False)
                    # jitter: small random offset to avoid synchronized retries
                    backoff = min(0.1 * (2 ** attempt) + jitter, 2.0)
                    logger.log(RETRY, attempt=attempt, backoff=backoff)
        if fallback:
            return ToolResult(tool_name, fallback, 0.0, is_fallback=True)
        raise last_error

RETRY_SKIPPED is the audit event that proves the taxonomy is working. Search your production logs for it to see exactly which non-retryable errors were caught at which step, in which task, with zero budget consumed. ReAct cannot emit this event; it has no taxonomy to skip from.

This fix is applicable to a ReAct agent today without changing its tool routing architecture. If you run LangChain or AutoGen, you can add error classification to your tool layer and scope your retry decorator to TransientToolError without touching anything else. It will not eliminate hallucination-driven waste entirely (that requires Fix 3), but it prevents INVALID_INPUT and other permanent errors from burning retries on attempts that also cannot succeed.

Fix 2: Per-Tool Circuit Breakers Instead of a Global Counter

A global retry counter treats all tools as a single failure domain. When one tool degrades, it drains the budget for every other tool. Per-tool circuit breakers contain failure locally:

    # Each tool gets its own circuit breaker instance
    # CLOSED    → calls pass through normally
    # OPEN      → calls fail immediately, no upstream hit, no budget consumed
    # HALF-OPEN → one probe call; if it succeeds, circuit closes

    class CircuitBreaker:
        failure_threshold: int   = 3    # trips after 3 consecutive failures
        recovery_timeout:  float = 5.0  # simulated seconds before probe allowed
        success_threshold: int   = 2    # probe successes needed to close

The benchmark logged 49 CIRCUIT_OPEN events for the workflow, every one a call that fast-failed without touching a degraded upstream service and without consuming retry budget. ReAct logged zero, because it has no per-tool state. It hammers a degraded tool until the global budget is gone.

Like Fix 1, this is independently applicable to a ReAct agent. Per-tool circuit breakers wrap the tool call layer regardless of how the tool was chosen. Threshold values will need tuning to your workload.
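The class sketch above only lists the thresholds; a working breaker also needs the state transitions. Here is one minimal implementation under those same thresholds, with an injected clock so the example stays deterministic (names and structure are illustrative, not the repository's code):

```python
# Minimal per-tool circuit breaker with the thresholds described above.
# The clock is injected for determinism; names are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=5.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.probe_successes = 0
        self.opened_at = 0.0

    def allow(self):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # permit one probe call
                return True
            return False                    # fast-fail: no upstream hit, no budget
        return True

    def record_success(self):
        if self.state == "HALF_OPEN":
            self.probe_successes += 1
            if self.probe_successes >= self.success_threshold:
                self.state, self.failures, self.probe_successes = "CLOSED", 0, 0
        else:
            self.failures = 0  # reset the consecutive-failure streak

    def record_failure(self):
        self.failures += 1
        self.probe_successes = 0
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", self.clock()

# Each tool gets its own instance, so one degraded tool cannot
# drain the budget of the others:
breakers = {name: CircuitBreaker() for name in ("search", "calculate", "summarise")}
```

Because each tool carries its own state, a degraded search backend trips only the search breaker; calculate and summarise keep their full budget.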

Fix 3: Deterministic Tool Routing (The Structural Differentiator)

This is the fix that eliminates the hallucination problem at the routing layer. Fixes 1 and 2 reduce the damage from hallucinations; Fix 3 makes them structurally impossible where it is applied.

    # ReAct: tool name comes from LLM output, can be any string
    tool_name = llm_response.tool_name       # "web_browser", "sql_query", ...
    tool_fn   = TOOLS.get(tool_name)         # None if hallucinated → budget burns

    # Workflow: tool name resolved from plan at task start, always valid
    STEP_TO_TOOL = {
        StepKind.SEARCH:    "search",
        StepKind.CALCULATE: "calculate",
        StepKind.SUMMARISE: "summarise",
    }
    tool_name = STEP_TO_TOOL[step.kind]      # KeyError is impossible; hallucination is impossible

Use the LLM for reasoning: what steps are needed, in what order, with what arguments. Use Python for tool routing. The model contributes plan structure (step types), not tool name strings.

The trade-off is worth naming honestly: deterministic routing requires that your task structure maps onto a finite set of step types. For open-ended agents that must dynamically compose novel tool sequences across a large registry, this constrains flexibility. For systems with predictable task structures, which covers the majority of production deployments, the reliability and predictability gains are substantial.

Before/after summary:

Dimension | Before (naive ReAct) | After (all three fixes) | Trade-off
Wasted retries | 90.8% | 0.0% | None
Hallucination events | 155 | 0 | Loses dynamic tool discovery
Step σ | 1.36 | 0.46 | Loses open-ended composition
Circuit isolation | None (global) | Per-tool | Adds threshold-tuning work
Auditability | None | Full taxonomy | Adds logging overhead

The Sensitivity Analysis: The 5% Result Is the Alarming One

Sensitivity analysis across hallucination rates (5%, 15%, 28%). The workflow maintains 0% wasted retries and stable σ = 0.46 at every rate, while ReAct's wasted retries rise sharply with hallucinations. Image by Author.

Hallucination rate | ReAct wasted % | Workflow wasted % | ReAct σ | Workflow σ | ReAct success
5% | 54.7% | 0.0% | 1.28 | 0.46 | 100.0%
15% | 81.4% | 0.0% | 1.42 | 0.46 | 98.0%
28% | 90.8% | 0.0% | 1.36 | 0.46 | 89.5%

The 5% row deserves explicit attention. ReAct shows 100% success; your monitoring reports a healthy agent. But 54.7% of retries are still wasted. The budget is quietly draining.

This is the dashboard blindness made precise. When a real failure cluster arrives (a rate limit spike, a degraded service, a brief outage), less than half your designed retry capacity is available to handle it. You will not see this coming. Your success rate was 100% until the moment it wasn't.

The workflow wastes 0% of retries at every rate tested. The σ holds at 0.46 regardless of hallucination frequency. These are not rate-dependent improvements; they are properties of the architecture.


Latency: What the CDF Reveals That Averages Hide

Latency distribution shows that despite higher average latency, the workflow matches ReAct at P95, proving that reliability improvements don't come at the cost of tail performance. Image by Author.

Metric | ReAct | Workflow
Avg latency (ms) | 43.4 | 74.8
P95 latency (ms) | 143.3 | 146.2
Total tokens | 115,000 | 107,400
Estimated cost ($) | $0.3450 | $0.3222

The workflow looks slower on average because failed ReAct runs exit early; they look fast because they failed fast, not because they completed efficiently. At P95, the metric that matters for SLA commitments, latency is effectively identical: 143.3ms versus 146.2ms.

You aren't trading tail latency for reliability. At the tail, the simulation shows you can have both. Token cost favors the workflow by 6.6%, because it doesn't burn LLM steps on hallucination-retry loops that produce no useful output.


Three Diagnostic Questions for Your System Right Now

Before reading the implementation guidance, answer these three questions about your current agent:

1. When a tool name from the model doesn't match any registered tool, does your system retry? If yes, budget is draining on non-retryable errors right now.

2. Is your retry counter global or per-tool? A global counter lets one degraded tool exhaust the budget for all others.

3. Can you search your logs for RETRY_SKIPPED or an equivalent event? If not, your system has no error taxonomy and no audit trail for wasted budget.

If you answered "yes / global / no" to these three, Fix 1 and Fix 2 are the fastest path to recovery, applicable without changing your agent architecture.


Implementing This in Your Stack Today

These three fixes can be applied incrementally to any framework: LangChain, LangGraph, AutoGen, or a custom tool loop.

Step 1: Add error classification (30 minutes). Define two exception classes in your tool layer: one for retryable errors (TransientToolError), one for permanent ones (ToolNotFoundError, InvalidInputError). Raise the appropriate class at the point the error is detected.

Step 2: Scope retries to error class (15 minutes). If you use tenacity, swap retry_if_exception for retry_if_exception_type(TransientToolError). If you use a custom loop, add if not exc.is_retryable(): break before the retry increment.

Step 3: Move tool routing into a dict (1 hour). If you have a fixed task structure, define it as a StepKind enum and resolve tool names from dict[StepKind, str] at plan time. Optional if your use case requires open-ended tool composition, but it eliminates hallucination-driven budget waste entirely where it can be applied.
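Step 3 in miniature: a fixed task structure as an enum with tool names resolved in Python, plus a startup check that every step kind actually routes somewhere. StepKind, STEP_TO_TOOL, and validate_routing are illustrative names, not the repository's actual API:

```python
# Deterministic routing sketch: the model emits step kinds (plan structure),
# never tool name strings. All names here are illustrative.
from enum import Enum, auto

class StepKind(Enum):
    SEARCH = auto()
    CALCULATE = auto()
    SUMMARISE = auto()

STEP_TO_TOOL = {
    StepKind.SEARCH: "search",
    StepKind.CALCULATE: "calculate",
    StepKind.SUMMARISE: "summarise",
}

def validate_routing(step_to_tool, registry):
    # Startup invariant: every step kind must resolve to a registered tool,
    # so a routing gap fails at deploy time instead of mid-task.
    missing = {kind for kind, name in step_to_tool.items() if name not in registry}
    if missing:
        raise RuntimeError(f"unroutable step kinds: {missing}")

TOOL_REGISTRY = {"search": str.upper, "calculate": len, "summarise": str.strip}
validate_routing(STEP_TO_TOOL, TOOL_REGISTRY)

# A hallucinated tool name has no entry point: plans are built from
# StepKind members, and dict resolution happens in Python.
plan = [StepKind.SEARCH, StepKind.SUMMARISE]
print([STEP_TO_TOOL[step] for step in plan])  # ['search', 'summarise']
```

The validation step is cheap insurance: if a step kind is ever added without a matching tool, the system refuses to start rather than burning retries at runtime.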

Here's what the vulnerability looks like in LangChain, and how to fix it:

Vulnerable pattern:

    from langchain.agents import AgentExecutor, create_react_agent

    # If the model outputs "web_search" instead of "search",
    # AgentExecutor will retry the step before failing,
    # consuming budget on an error that cannot succeed.
    executor = AgentExecutor(
        agent=create_react_agent(llm, tools, prompt),
        tools=tools,
        max_iterations=10
    )
    executor.invoke({"input": task})

Fixed pattern (error taxonomy + deterministic routing):

    from tenacity import retry, stop_after_attempt, retry_if_exception_type

    class ToolNotFoundError(Exception): pass   # non-retryable
    class TransientToolError(Exception): pass  # retryable

    # Tool routing in Python: model outputs step kind, not tool name
    TOOL_REGISTRY = {"search": search_fn, "calculate": calc_fn}

    def call_tool(name: str, args: str):
        fn = TOOL_REGISTRY.get(name)
        if fn is None:
            raise ToolNotFoundError(f"'{name}' not registered")  # never retried
        try:
            return fn(args)
        except RateLimitError as e:              # provider-specific exception
            raise TransientToolError(str(e))     # retried with backoff

    @retry(
        stop=stop_after_attempt(3),
        retry=retry_if_exception_type(TransientToolError)
    )
    def run_step(tool_name: str, args: str):
        return call_tool(tool_name, args)

Production note: The eval() call in the benchmark's tool_calculate is present for simulation purposes only. Never use eval() in a production tool; it is a code injection vulnerability. Replace it with a safe expression parser such as simpleeval or a purpose-built math library.
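If you'd rather stay in the standard library than add a dependency, one safe replacement is an AST walk that permits only arithmetic nodes. This is a sketch of the idea, not the benchmark's code; simpleeval or a math library are equally valid choices:

```python
# Safe calculator tool: parse the expression and evaluate only
# whitelisted arithmetic AST nodes. Anything else is rejected,
# so code injection has no path to execution.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_calculate(expr):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_calculate("2 * (3 + 4)"))   # 14
# safe_calculate("__import__('os')")   # raises ValueError instead of executing
```

A rejected expression surfaces as a classifiable INVALID_INPUT-style error rather than arbitrary code execution, which also fits the taxonomy from Fix 1.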


Benchmark Limitations

Hallucination rate is a parameter, not a measurement. The 28% figure is a conservative estimate derived from failure mode analysis in Yao et al. (2023) and Shinn et al. (2023), not a directly reported figure from either paper. A well-prompted model with a clean tool schema and a small, well-named tool registry may hallucinate tool names far less frequently. Run the benchmark at your actual observed rate.

HALLUCINATION_RETRY_BURN is a simulation constant that drives the waste percentage. At a value of 1, fewer retries are wasted per hallucination event; the 90.8% figure would be lower. The structural conclusion, that the workflow wastes 0% at all values, holds regardless. Run python app.py --seed 42 with modified values of 1 and 2 to verify.

The workflow's zero hallucination count is a simulation design property. Tool routing never passes through LLM output in this benchmark. In a real system where the LLM contributes to plan generation, hallucinations could occur upstream of routing.

Three tools is a simplified environment. Production agents typically manage dozens of tools with heterogeneous failure modes. The taxonomy and circuit breaker patterns scale well; threshold values will need tuning to your workload.

Latency figures are simulated. The P95 near-equivalence is the production-relevant finding. Absolute millisecond values should not inform capacity planning. Average latency comparisons are confounded by early-exit failures in ReAct and per-step LLM accounting in the workflow; use P95 for any latency reasoning.


Full Metrics

Full per-metric results for all 200 tasks (seed=42, hallucination_rate=28%) are available in `experiment_results.json` in the GitHub repository. Run `python app.py --seed 42 --export-json` to regenerate them locally.


    References

• Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629
• Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. https://arxiv.org/abs/2303.11366
• Fowler, M. (2014). CircuitBreaker. martinfowler.com. https://martinfowler.com/bliki/CircuitBreaker.html
• Nygard, M. T. (2018). Release It! Design and Deploy Production-Ready Software (2nd ed.). Pragmatic Bookshelf.
• Sculley, D., et al. (2015). Hidden technical debt in machine learning systems. NeurIPS 2015. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

    Disclosure

Simulation methodology. All results are produced by a deterministic simulation (python app.py --seed 42), not live API calls. The 28% hallucination rate is a calibrated parameter derived from failure mode analysis in published benchmarks, not a directly measured figure from live model outputs.

No conflicts of interest. The author has no financial relationship with any tool, framework, model provider, or company mentioned in this article. No products are endorsed or sponsored.

Original work. This article, its benchmark design, and its code are the author's original work. References are used solely to attribute published findings that informed calibration and design.


    GitHub: https://github.com/Emmimal/react-retry-waste-analysis

python app.py --seed 42 reproduces the full results and all six figures. python app.py --replay 7 gives verbose single-task execution, step by step.


