
    Revisiting Benchmarking of Tabular Reinforcement Learning Methods

    By Editor Times Featured, July 3, 2025


    After publishing my previous post on benchmarking tabular reinforcement learning (RL) methods, I couldn’t shake the feeling that something wasn’t quite right. The results looked off, and I wasn’t fully satisfied with how they turned out.

    Nevertheless, I continued with the post series, shifting focus to multi-player games and approximate solution methods. To support this, I’ve been steadily refactoring the original framework we built. The new version is cleaner, more general, and easier to use. In the process, it also helped uncover a few bugs and edge-case issues in some of the earlier algorithms (more on that later).

    In this post, I’ll introduce the updated framework, highlight the mistakes I made, share corrected results, and reflect on key lessons learned, setting the stage for more complex experiments to come.

    The updated code can be found on GitHub.

    Framework

    The biggest change from the previous version of the code is that RL solution methods are now implemented as classes. These classes expose common methods like act() (for selecting actions) and update() (for adjusting model parameters).

    Complementing this, a unified training script manages the interaction with the environment: it generates episodes and feeds them into the appropriate method for learning, using the shared interface provided by these class methods.

    This refactoring significantly simplifies and standardizes the training process. Previously, each method had its own standalone training logic. Now, training is centralized, and each method’s role is clearly defined and modular.
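
    To make the shared interface more tangible, here is a minimal sketch of what such a base class could look like. The names RLMethod, ReplayItem, act(), update() and finalize() are taken from the training loop in this post; the exact signatures, the ReplayItem fields and the toy RandomMethod class are assumptions for illustration and may differ from the framework’s actual code.

```python
from __future__ import annotations

import random
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ReplayItem:
    # Assumed fields: one step of experience (state, action taken, reward received).
    state: int
    action: int
    reward: float


class RLMethod(ABC):
    """Hypothetical base class; the real framework may define this differently."""

    @abstractmethod
    def act(self, state: int, step: int | None = None) -> int:
        """Select an action for the given state."""

    @abstractmethod
    def update(self, episode: list[ReplayItem], step: int) -> None:
        """Adjust model parameters after each new replay item."""

    def finalize(self, episode: list[ReplayItem], step: int) -> None:
        """Optional hook called once the episode has ended."""


class RandomMethod(RLMethod):
    """Toy method that acts uniformly at random and learns nothing."""

    def __init__(self, num_actions: int) -> None:
        self.num_actions = num_actions
        self.num_updates = 0

    def act(self, state: int, step: int | None = None) -> int:
        return random.randrange(self.num_actions)

    def update(self, episode: list[ReplayItem], step: int) -> None:
        self.num_updates += 1  # a real method would adjust its Q-table here
```

    Because every method exposes the same calls, a single training loop can drive any of them without knowing which algorithm sits behind the interface.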

    Before diving into the method classes in detail, let’s first look at the training loop for single-player environments:

    from typing import Callable

    # ParametrizedEnv, RLMethod and ReplayItem are provided by the framework.


    def train_single_player(
        env: ParametrizedEnv,
        method: RLMethod,
        max_steps: int = 100,
        callback: Callable | None = None,
    ) -> tuple[bool, int]:
        """Trains a method on single-player environments.

        Args:
            env: env to use
            method: method to use
            max_steps: maximal number of update steps
            callback: callback to determine if method already solves the given problem

        Returns:
            tuple of (success, number of update steps)
        """
        for step in range(max_steps):
            observation, _ = env.env.reset()
            terminated = truncated = False

            episode = []
            cur_episode_len = 0

            while not terminated and not truncated:
                action = method.act(observation, step)

                observation_new, reward, terminated, truncated, _ = env.step(
                    action, observation
                )

                episode.append(ReplayItem(observation, action, reward))
                method.update(episode, step)

                observation = observation_new

                # NOTE: this is highly dependent on environment size
                cur_episode_len += 1
                if cur_episode_len > env.get_max_num_steps():
                    break

            # Append the final observation with a dummy action (-1)
            episode.append(ReplayItem(observation_new, -1, reward, []))
            method.finalize(episode, step)

            if callback and callback(method, step):
                return True, step

        env.env.close()

        return False, step

    Let’s visualize what a completed episode looks like, and when the update() and finalize() methods are called during the process:

    Image by author

    After each replay item is processed (consisting of a state, the action taken, and the reward received), the method’s update() function is called to adjust the model’s internal parameters. The specific behavior of this function depends on the algorithm being used.

    To give you a concrete example, let’s take a quick look at how this works for Q-learning.

    Recall the Q-learning update rule:

    Image from [1]
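
    In text form, the standard Q-learning update rule from [1] reads as follows, where α denotes the step size and γ the discount factor:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```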

    When the second call to update() occurs, we have S_t = s_1, A_t = a_1 and R_{t+1} = r_2.

    Using this information, the Q-learning agent updates its value estimates accordingly.
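
    As a minimal sketch of this step, the function below applies one tabular Q-learning update to a NumPy Q-table. The function name q_learning_update, its explicit next_state argument and the default values for alpha and gamma are assumptions made for illustration; the framework’s update() method works on the episode list and may be structured differently.

```python
import numpy as np


def q_learning_update(
    Q: np.ndarray,
    state: int,
    action: int,
    reward: float,
    next_state: int,
    alpha: float = 0.1,  # step size (assumed value)
    gamma: float = 0.9,  # discount factor (assumed value)
) -> None:
    """Applies one in-place Q-learning update for a single transition."""
    # TD target: immediate reward plus discounted value of the best next action.
    td_target = reward + gamma * np.max(Q[next_state])
    # Move the current estimate a fraction alpha towards the target.
    Q[state, action] += alpha * (td_target - Q[state, action])


# Example: 3 states, 2 actions, all Q-values initialized to zero.
Q = np.zeros((3, 2))
Q[2, 1] = 1.0  # pretend the best action in state 2 is already valued at 1.0
q_learning_update(Q, state=1, action=0, reward=0.5, next_state=2)
print(Q[1, 0])  # approximately 0.14 = 0.1 * (0.5 + 0.9 * 1.0 - 0.0)
```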

    Unsupported Methods

    Dynamic programming (DP) methods don’t fit the structure introduced above, since they are based on iterating over all states of the environment. For that reason, we leave their code untouched and handle training them differently.

    Further, we completely remove the support for Prioritized Sweeping. Here, too, we need to iterate over states in some way to find predecessor states. This, again, doesn’t fit into our update-based training structure and, more importantly, is not feasible for more complex multi-player games, where the number of states is much larger and harder to iterate over.

    Since this method didn’t produce good results anyway, we focus on the remaining ones. Note: similar reasoning applies to DP methods. These cannot be extended so easily to multi-player games, and thus will be of lesser interest in the future.

    Bugs

    Bugs happen everywhere, and this project is no exception. In this section, I’ll highlight a particularly impactful bug that made its way into the results of the previous post, along with some minor changes and improvements. I’ll also explain how these affected previous results.

    Incorrect Action Probability Calculation

    Some methods require the probability of a chosen action during the update step. In the previous code version, we had:

    def _get_action_prob(Q: np.ndarray) -> float:
            return (
                Q[observation_new, a] / sum(Q[observation_new, :])
                if sum(Q[observation_new, :])
                else 1
            )

    This worked only for strictly positive Q-values, but broke down when Q-values were negative, making the normalization invalid.

    The corrected version handles both positive and negative Q-values properly using a softmax approach:

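    def _get_action_prob(self, observation: int, action: int) -> float:
        # Q-values of all actions in the given state
        q_values = np.array(
            [self.Q[observation, a] for a in range(self.env.get_action_space_len())]
        )
        # Softmax; subtracting the maximum first keeps np.exp numerically stable
        probs = np.exp(q_values - np.max(q_values))
        return probs[action] / probs.sum()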
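
    To see why the original normalization fails, the following self-contained sketch compares it with the softmax variant on a row of negative Q-values. The standalone functions naive_probs and softmax_probs are illustrative helpers written for this comparison, not functions from the framework.

```python
import numpy as np


def naive_probs(q_values: np.ndarray) -> np.ndarray:
    """Old approach: divide each Q-value by the sum of all Q-values."""
    total = q_values.sum()
    return q_values / total if total else np.ones_like(q_values)


def softmax_probs(q_values: np.ndarray) -> np.ndarray:
    """Corrected approach: numerically stable softmax over the Q-values."""
    exp_q = np.exp(q_values - np.max(q_values))
    return exp_q / exp_q.sum()


q_values = np.array([-1.0, -3.0])

# Naive normalization gives [0.25, 0.75]: the worse action (-3) gets the
# higher "probability", since its larger magnitude dominates the negative sum.
print(naive_probs(q_values))

# Softmax gives roughly [0.88, 0.12]: the better action is preferred, and
# the result is a valid distribution for any real-valued input.
print(softmax_probs(q_values))
```

    With mixed signs the naive version gets even worse: for Q-values of 2 and -1 it returns 2 and -1, which are not probabilities at all.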

    This bug significantly impacted Expected Sarsa and n-step Tree Backup, as their updates rely heavily on action probabilities.
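
    To illustrate why these probabilities matter, here is a minimal sketch of the Expected Sarsa target, which weights the next state’s Q-values by the policy’s action probabilities. The function name expected_sarsa_target, the softmax policy over Q-values (mirroring the corrected helper above) and the default gamma are assumptions for illustration.

```python
import numpy as np


def expected_sarsa_target(
    Q: np.ndarray, reward: float, next_state: int, gamma: float = 0.9
) -> float:
    """Computes reward + gamma * sum_a pi(a | next_state) * Q[next_state, a]."""
    q_next = Q[next_state]
    # Action probabilities via a numerically stable softmax (assumed policy).
    probs = np.exp(q_next - np.max(q_next))
    probs /= probs.sum()
    # Expectation over next actions instead of a single sampled action.
    return float(reward + gamma * np.dot(probs, q_next))


Q = np.array([[0.0, 0.0], [-1.0, -3.0]])
target = expected_sarsa_target(Q, reward=0.0, next_state=1)
# With valid probabilities the expected next value is a weighted average,
# so it must lie between the smallest and largest Q-value of the next state.
assert -3.0 <= target / 0.9 <= -1.0
```

    With the old normalization, the same row would have produced weights of 0.25 and 0.75, pulling the target toward the worse action.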

    Tie-Breaking in Greedy Action Selection

    Previously, when generating episodes, we either selected the greedy action or sampled randomly with ε-greedy logic:

    import random
    import numpy as np

    def get_eps_greedy_action(q_values: np.ndarray, eps: float = 0.05) -> int:
        if random.uniform(0, 1) < eps or np.all(q_values == q_values[0]):
            return int(np.random.choice([a for a in range(len(q_values))]))
        else:
            # np.argmax always returns the first maximal index on ties
            return int(np.argmax(q_values))

    However, this didn’t properly handle ties, i.e., cases where multiple actions share the same maximum Q-value. The updated act() method now includes fair tie-breaking:

    def act(
        self, state: int, step: int | None = None, mask: np.ndarray | None = None
    ) -> int:
        allowed_actions = self.get_allowed_actions(mask)
        # Explore with probability eps(step) during training
        if self._train and step and random.uniform(0, 1) < self.env.eps(step):
            return random.choice(allowed_actions)
        else:
            q_values = [self.Q[state, a] for a in allowed_actions]
            max_q = max(q_values)
            # Break ties uniformly at random among all maximal actions
            max_actions = [a for a, q in zip(allowed_actions, q_values) if q == max_q]
            return random.choice(max_actions)

    This small change may have a noticeable impact, especially early in training, when all Q-values are initialized equally and ties are frequent. It encourages a more diverse exploration strategy during the critical early phase.

    As previously discussed, and as we’ll see again below, RL methods exhibit high variance, making the impact of such changes difficult to measure precisely. However, this adjustment seemed to slightly improve the performance of several methods: Sarsa, Q-learning, Double Q-learning, and Sarsa-n.
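
    The effect of tie-breaking is easy to reproduce in isolation. The sketch below, using a hypothetical helper greedy_with_ties, contrasts np.argmax, which always returns the first maximal index, with a uniformly random choice among all maximal actions.

```python
import random
from collections import Counter

import numpy as np


def greedy_with_ties(q_values: list[float]) -> int:
    """Picks uniformly at random among all actions with maximal Q-value."""
    max_q = max(q_values)
    max_actions = [a for a, q in enumerate(q_values) if q == max_q]
    return random.choice(max_actions)


# Actions 1 and 3 tie for the maximum; actions 0 and 2 are worse.
q_values = [0.0, 1.0, 0.5, 1.0]

argmax_counts = Counter(int(np.argmax(q_values)) for _ in range(1000))
tie_counts = Counter(greedy_with_ties(q_values) for _ in range(1000))

print(argmax_counts)  # only action 1 is ever selected
print(tie_counts)  # actions 1 and 3 are each selected about half the time
```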

    Updated Results

    Let’s now examine the updated results. For completeness, we include all methods, not just the improved ones.

    But first, a quick reminder of the task we’re solving: we’re working with Gymnasium’s GridWorld environment [2], essentially a maze-solving task:

    Image by author

    The agent must navigate from the top-left to the bottom-right of the grid while avoiding icy lakes.

    To evaluate each method’s performance, we scale the gridworld size and measure the number of update steps until convergence.

    Monte Carlo Methods

    These methods were not affected by the recent implementation changes, so we observe results consistent with our earlier findings:

    • Both are capable of solving environments up to 25×25 in size.
    • On-policy MC performs slightly better than off-policy.

    Image by author

    Temporal Difference Methods

    For these, we measure the following results:

    Image by author

    We immediately notice that Expected Sarsa now fares much better, thanks to the fix for the above-mentioned bug in computing the action probabilities.

    The other methods also perform better. As mentioned above, this could simply be chance / variance, or a consequence of the other minor improvements we made, in particular the better handling of ties during action selection.

    TD-n

    For TD-n methods, our results look quite different:

    Image by author

    Sarsa-n has also improved, probably for similar reasons as discussed in the last section. In particular, though, n-step tree backup now performs very well, showing that with correct action selection this is indeed a very powerful solution method.

    Planning

    For planning, we only have Dyna-Q left, which also seems to have improved slightly:

    Image by author

    Comparing the Best Solution Methods on Larger Environments

    With that, let’s visualize the best-performing methods from all categories in a single diagram. Due to the removal of some methods like DP, I now selected on-policy MC, Sarsa, Q-learning, Sarsa-n, n-step tree backup and Dyna-Q.

    We begin by showing results for grid worlds up to size 50 x 50:

    Image by author

    We observe on-policy MC to perform surprisingly well, in line with earlier findings. Its strength likely stems from its simplicity and unbiased estimates, which work well for short- to medium-length episodes.

    However, unlike in the previous post, n-step Tree Backup clearly emerges as the top-performing method. This aligns with theory: its use of expected multi-step backups enables smooth and stable value propagation, combining the strengths of off-policy updates with the stability of on-policy learning.
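
    For reference, the n-step tree-backup return from [1] makes this expectation explicit. At every intermediate step, the actions not taken contribute their current estimates weighted by the policy π, while the action actually taken carries the remainder of the return:

```latex
G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) \, Q_{t+n-1}(S_{t+1}, a) + \gamma \, \pi(A_{t+1} \mid S_{t+1}) \, G_{t+1:t+n}
```

    For n = 1, this reduces to the Expected Sarsa target.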

    Next, we observe a middle cluster: Sarsa, Q-learning, and Dyna-Q, with Sarsa slightly outperforming the others.
    It’s somewhat surprising that the model-based updates in Dyna-Q don’t lead to better performance. This might point to limitations in the model accuracy or the number of planning steps used. Q-learning tends to underperform due to the increased variance introduced by its off-policy nature.

    The worst-performing method in this experiment is Sarsa-n, in line with earlier observations. We suspect the degradation in performance comes from the increased variance and bias due to n-step sampling without expectation over actions.

    It’s still somewhat unexpected that MC methods outperform TD in this setting, since TD methods are traditionally expected to do better in large environments. However, this is mitigated in our setup by the reward shaping strategy: we provide a small positive reward at each step as the agent moves closer to the goal. This alleviates one of MC’s major weaknesses: poor performance in sparse reward settings.
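
    The post does not spell out the exact shaping rule, so the following is only a sketch of one way such a reward could look: a small bonus whenever a move reduces the Manhattan distance to the goal. The function name shaped_reward, the distance measure and the bonus value are all assumptions for illustration.

```python
def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Manhattan distance between two grid cells given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def shaped_reward(
    pos: tuple[int, int],
    new_pos: tuple[int, int],
    goal: tuple[int, int],
    env_reward: float,
    bonus: float = 0.01,  # assumed size of the intermediate reward
) -> float:
    """Adds a small bonus to the environment reward if the agent got closer."""
    got_closer = manhattan(new_pos, goal) < manhattan(pos, goal)
    return env_reward + (bonus if got_closer else 0.0)


goal = (3, 3)
print(shaped_reward((0, 0), (0, 1), goal, env_reward=0.0))  # 0.01, moved closer
print(shaped_reward((0, 1), (0, 0), goal, env_reward=0.0))  # 0.0, moved away
```

    With such intermediate rewards, even episodes that never reach the goal produce returns that distinguish good moves from bad ones, which is what Monte Carlo methods need.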

    Conclusion and Learnings

    In this post, we shared updates to the RL framework developed over this series. Alongside various improvements, we fixed some bugs, which significantly enhanced algorithm performance.

    We then applied the updated methods to increasingly larger GridWorld environments, with the following findings:

    • n-step Tree Backup emerged as the best method overall, thanks to its expected multi-step updates that combine the benefits of both on- and off-policy learning.
    • Monte Carlo methods followed, showing surprisingly strong performance due to their unbiased estimates and the intermediate rewards guiding learning.
    • A cluster of TD methods (Q-learning, Sarsa, and Dyna-Q) followed. Despite Dyna-Q’s model-based updates, it didn’t significantly outperform its model-free counterparts.
    • Sarsa-n performed worst, likely due to the compounded bias and variance introduced by sampling n-step returns.

    Thanks for reading this update! Stay tuned for further content: next up, we cover multi-player games and environments.


    References

    [1] http://incompleteideas.net/book/RLbook2020.pdf

    [2] https://gymnasium.farama.org/


