Close Menu
    Facebook LinkedIn YouTube WhatsApp X (Twitter) Pinterest
    Trending
    • How small businesses can leverage AI
    • Robots-Blog | Humanoide Robotik aus Deutschland: igus bringt neuen Serviceroboter auf den Markt
    • GM reimagines Hummer off-roader with California ideas unit
    • London’s DEScycle secures over €10 million in grant funding to scale critical metals recovery platform
    • How to Edit, Merge, and Split PDFs With Free Online Tools
    • Florida crackdown targets illegal machines in Sarasota
    • Audiophile-Oriented Noble Audio Debuts More Affordable Osprey Earbuds
    • New radio bursts detected from binary stars
    Facebook LinkedIn WhatsApp
    Times FeaturedTimes Featured
    Tuesday, June 2
    • Home
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    • More
      • AI
      • Robotics
      • Industries
      • Global
    Times FeaturedTimes Featured
    Home»Artificial Intelligence»Introduction to Reinforcement Learning Agents with the Unity Game Engine 
    Artificial Intelligence

    Introduction to Reinforcement Learning Agents with the Unity Game Engine 

    Editor Times FeaturedBy Editor Times FeaturedApril 11, 2026No Comments12 Mins Read
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr WhatsApp Email
    Share
    Facebook Twitter LinkedIn Pinterest Telegram Email WhatsApp Copy Link


    , Reinforcement Studying — studying from observations and rewards — is the strategy most alike to the way in which people (and animals) study. 

    Regardless of this similarity, it additionally stays probably the most sophisticated and vexing area on trendy machine studying. To cite the well-known Andej Karpathy:

    Reinforcement Studying is horrible. It simply so occurs that every part we had earlier than was a lot worse.

    To assist with understanding of the strategy, I’ll construct a step-by-step instance of an agent studying to navigate in an surroundings utilizing Q-Studying. The textual content will begin from the primary ideas and finish with a completely functioning instance you possibly can run within the Unity sport engine. 

    For this text, fundamental information of the C# programming language is required. If you’re not acquainted with the Unity sport engine, simply think about that every object is an agent, which:

    • executes Begin() as soon as originally of this system,
    • and Replace() constantly in parallel to the opposite brokers.

    The accompanying repository for this text is on GitHub. All the photographs are by the writer until specified in any other case.

    What’s Reinforcement Studying

    In Reinforcement Studying (RL), we’ve an agent that is ready to take actions, observe the outcomes of those actions, and study from rewards/punishments for these actions. 

    A typical illustration of an RL loop. (i) agent takes an motion, (ii) the motion impacts the surroundings, (iii) the agent reads the state of the surroundings (iv) agent updates its coverage based mostly on the rewards/punishment for the motion. Supply: https://en.wikipedia.org/wiki/Reinforcement_learning .

    The way in which an agent decides an motion in a sure state will depend on its coverage. A coverage π is a operate that defines the conduct of an agent, mapping states to actions. Given a set of states S and a set of actions A a coverage is a direct mapping: π: S → A . 

    Moreover, if we wish the agent to have extra doable choices, with a selection, we are able to create a stochastic coverage. Then, quite than a single motion, a coverage determines the chance of taking every motion in a given state: π: S × A → [0, 1].

    Navigating Robotic Instance

    As an instance the educational course of, we’ll create an instance of a navigating robotic in a 2D surroundings, utilizing one in all 4 actions, A = {Left, Proper, Up, Down} . The robotic wants to seek out the way in which to the award from any time on the map, with out falling into water.

    The map to navigate on a 8 × 5 gird. There’s a slender choke-point at column 4 (0-indexed), forcing the agent to navigate round it to succeed in the trophy. The optimum coverage leads the agent from any grass tile to the trophy within the fewest doable steps.

    The rewards will likely be encoded along with the tile varieties utilizing an Enum: 

    public enum TileEnum { Water = -1, Grass = 0, Award = 1 }

    The state is given by its place on the grid, that means we’ve 40 doable states: S = [0…7] × [0…4] (an 8 × 5 tile grid), which we encode utilizing a 2D array:

    _map = {
        { -1, -1, -1, -1, -1, -1, -1, -1 },   // all water border
        { -1,  0,  0,  0, -1,  0,  1, -1 },   // 1 = Award (trophy)
        { -1,  0,  0,  0, -1,  0,  0, -1 },
        { -1,  0,  0,  0,  0,  0,  0, -1 },
        { -1, -1, -1, -1, -1, -1, -1, -1 },   // all water border
    }

    We retailer the map in a tile TileGrid that has the next utility features:

    // Receive a tile at a coordinate
    public T GetTileByCoords(int x, int y);
    
    // Given a tile and an motion, acquire the subsequent tile
    public T GetTargetTile(T supply, ActionEnum motion);
    
    // Create a tile grid from the map
    public void GenerateTiles();

    We’ll make the most of totally different tile varieties, therefore the generic T. Every tile has a TileType given by the TileEnum and subsequently additionally its reward which could be obtained as (int) TileType.

    The Bellman Equation

    The issue of discovering an optimum coverage could be solved iteratively utilizing the Bellman Equation. The Bellman Equation postulates that the long-term reward of an motion equals the quick reward for that motion plus the anticipated reward from all future actions.

    It may be computed iteratively for methods with, discrete states and discrete state transitions. Have:

    • s — present state,
    • A — set of all actions,
    • s' — state reached by taking motion a in state s ,
    • γ — discounting issue (the additional the reward, the much less its price),
    • R(s, a) — quick reward for taking motion a in state s

    The Bellman equation then states that the worth V(s) of a state s is:

    Fixing the Bellman Equation Iteratively

    Computing the Bellman Equation is a dynamic programming drawback. On every iteration n, we calculate the anticipated future reward reachable in n+1 steps for all tiles. For every tile we retailer this utilizing a Worth variable.

    We give a reward base on the goal tile, i.e. 1 if the award is reached, -1 within the robotic falls into water, and 0 in any other case. As soon as both award or water are reached, no actions are doable, subsequently the worth of the state stays on the preliminary worth 0 . 

    We create a supervisor that may generate the grid and calculate the iterations:

    personal void Begin()
    {
        tileGrid.GenerateTiles();
    }
    
    personal void Replace()
    {       
        CalculateValues();
        Step();
    }

    To maintain observe of the values, we’ll make the most of a VTile class that holds a Worth. To keep away from taking up to date values immediately, we first set the NextValue after which set all values directly within the Step() operate.

    personal float gamma = 0.9; // Discounting issue
    
    // The Bellman equation 
    personal double GetNewValue(VTile tile)
    {
        return Agent.Actions
            .Choose(a => tileGrid.GetTargetTile(tile, a))
            .Choose(t => t.Reward + gamma * t.Worth) // Reward in [1, 0, -1]
            .Max();
    }
    
    // Get subsequent values for all tiles
    personal void CalculateValues()
    {
        for (var y = 0; y < TileGrid.BOARD_HEIGHT; y++)
        {
            for (var x = 0; x < TileGrid.BOARD_WIDTH; x++)
            {
                var tile = tileGrid.GetTileByCoords(x, y);
                if (tile.TileType == TileEnum.Grass)
                {
                    tile.NextValue = GetNewValue(tile);
                }
            }
        }
    }
    
    // Copy subsequent values to present values (iteration step)
    personal void Step()
    {
        for (var y = 0; y < TileGrid.BOARD_HEIGHT; y++)
        {
            for (var x = 0; x < TileGrid.BOARD_WIDTH; x++)
            {
                tileGrid.GetTileByCoords(x, y).Step();
            }
        }
    }

    On each step, the worth V(s) of every tile is up to date to the utmost over all actions of the quick reward plus the discounted worth of the ensuing tile. The long run reward propagates outward from the Award tile with a diminishing return managed by γ = 0.9 .

    The answer converges after 10 iterations, at which level the optimum coverage could be derived by choosing the motion that results in the tile with the best worth.

    Motion High quality (Q-Values)

    We have now discovered a strategy to affiliate states with values, which is sufficient for this pathfinding drawback. Nonetheless, this focuses on the surroundings, ignoring the agent. For an agent we normally wish to know what could be a very good motion within the surroundings. 

    In Q-learning, this a price of an motion is named its high quality (Q-Worth). Every (state, motion) pair is assigned a single Q-value. 

    The place the brand new hyperparameter α defines a studying fee — how rapidly new info overrides outdated. That is analogous to plain machine studying and the values are normally comparable, right here we use 0.005 . We then calculate the good thing about taking an motion utilizing temporal distinction D(s,a):

    Since we now not think about all actions within the present state, however the high quality of every motion individually, we don’t maximize throughout all doable actions within the present state, however quite all doable actions within the state we’ll attain after taking the motion whose high quality we’re calculating, mixed with the reward for taking that motion.

    An agent at (2,2) taking an motion to go proper will likely be a state s’ = (3, 2) with 4 doable actions a’ and their respective qualities Q(s’, a’).

    The temporal distinction time period combines the quick reward with the very best future reward, making it a direct derivation of the Bellman Equation (see Wiki for details). 

    To coach the agent, we once more instantiate a grid, however this time we additionally create an occasion of the agent, positioned at (2,2).

    personal Agent _agent;
    
    personal void ResetAgentPos()
    {
        _agent.State = tileGrid.GetTileByCoords(2, 2);
    }
    
    personal void Begin()
    {
        tileGrid.GenerateTiles();
        _agent = Instantiate(agentPrefab, rework);
        ResetAgentPos();
    }
    
    personal void Replace()
    {
        Step();
    }

    An Agent object has a present state QState. Every QStateretains the Q-Worth for every out there motion. On every step the agent updates the standard of every motion out there within the state:

    personal void Step() 
    {
      if (_agent.State.TileType != TileEnum.Grass)
      {
          ResetAgentPos();
      }
      else
      {
          QTile s = _agent.State;
      
          // Replace Q-values for ALL actions from present state
          foreach (var a in Agent.Actions)
          {
              double q = s.GetQValue(a);
              QTile sPrime = tileGrid.GetTargetTile(s, a);
              double r = sPrime.Reward;
              double qMax = Agent.Actions.Choose(sPrime.GetQValue).Max();
              double td = r + gamma * qMax - q;
              s.SetQValue(a, q + alpha * td);
          }
      
          // Take the most effective out there motion a
          ActionEnum chosen = PickAction(s); 
          _agent.State = tileGrid.GetTargetTile(s, chosen);
      }
    }

    An Agent has a set of doable actions in every state and can take the most effective motion in every state.

    If there are a number of greatest actions, one one in all them is taken at random as we’ve shuffled the actions beforehand. Because of this randomness, every coaching will proceed in a different way, however usually stabilize between 500–1000 steps.

    That is the idea of Q-Studying. In contrast to the state values, the motion high quality could be utilized in conditions the place:

    • the commentary is incomplete at a time (agent visual view)
    • the commentary adjustments (objects transfer within the surroundings)

    Exploration vs. Exploitation (ε-Grasping)

    Thus far the agent took the very best motion each time, nevertheless this will trigger the agent to rapidly get caught in a neighborhood optimum. A key problem in Q-Studying is the exploration–exploitation trade-off:

    • Exploit — decide the motion with the best identified Q-value (grasping).
    • Discover — decide a random motion to find doubtlessly higher paths.

    ε-Grasping Coverage

    Given a random worth r ∈ [0, 1] and parameter epsilon there are two choices:

    • if r > epsilon then choose the most effective motion (exploit),
    • in any other case choose a random motion (discover).

    Decaying Epsilon

    We usually wish to discover extra early on and exploit extra later. That is achieved by decaying epsilon over time:

    epsilon = max(epsilonMin, epsilon − epsilonDecay)

    After sufficient steps, the agent’s coverage converges to nearly all the time choosing the maximum-quality motion.

    personal epsilonMin = 0.05;
    personal epsilonDecay = 0.005;
    
    personal ActionEnum PickAction(QTile state) {
      ActionEnum motion = Random.Vary(0f, 1f) > epsilon
          ? Agent.Actions.Shuffle().OrderBy(state.GetQValue).Final()  // exploit
          : Agent.RndAction();                                       // discover
    
      epsilon = Mathf.Max(epsilonMin, epsilon - epsilonDecay);
      return motion;
    }

    The Broader RL Ecosystem

    Q-Studying is one algorithm inside a bigger household of Reinforcement Studying (RL) strategies. Algorithms could be categorised alongside a number of axes:

    • State area : Discrete (e.g., board video games) | Steady (e.g., FPS video games)
    • Motion area: Discrete (e.g., technique video games) | Steady (e.g., driving) 
    • Coverage sort: Off-policy (Q-Studying: a’ is all the time maximized) | On-policy (SARSA: a’ is chosen by the agent’s present coverage) 
    • Operator: Worth | High quality | Benefit A(s, a) = Q(s, a) − V(s)

    For a complete listing of RL algorithms, see the Reinforcement Learning Wikipedia page. Further strategies reminiscent of behavioural cloning will not be listed there however are additionally utilized in observe. Actual-world options usually use prolonged variants or mixtures of the above.

    Q-Studying is an off-policy, discrete-action technique. Extending it to steady state/motion areas results in strategies like Deep Q-Networks (DQN), which exchange the Q-table with a neural community.

    Within the grid world instance, the Q-table has |S| × |A| = 40 × 4 = 160 entries — completely manageable. However for a sport like chess, the state area exceeds 10⁴⁴ positions, making an express desk unimaginable to retailer or fill. In such circumstances neural networks could also be used to compress the data.

    A neural community compresses this huge mapping into a set set of learnable parameters (weights). As a substitute of storing one worth per (s, a) pair, the community takes the state as enter and outputs Q-values for all actions, generalising throughout comparable states it has by no means seen earlier than.

    Sensible Reinforcement Studying in Video games

    The above textual content ought to present a fundamental introduction to the idea of Reinforcement Studying Brokers. In fact, in observe you wouldn’t be more likely to implement an RL algorithm by hand.

    The Unity engine supplies a RL bundle ML-Agents, which can be utilized e.g. to coach car racing. Within the frontier labs, analysis into RL strategies is being carried out at scale utilizing sport environments, for instance the sport of conceal and search:

    The fashions coming from these labs have already crushed the most effective participant on the planet in complicated board video games like Go, fast-paced methods like Starcraft and Dota, and even video games requiring human-level communication, like Diplomacy.



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Editor Times Featured
    • Website

    Related Posts

    Escaping the Valley of Choice in BI

    June 2, 2026

    Ensuring Data Integrity with Cryptographic Hashing and the Ethereum Blockchain

    June 1, 2026

    RAG Is Not Machine Learning, and the ML Toolkit Solves the Wrong Problem

    June 1, 2026

    How to Combine Claude Code and Codex for Maximum Coding Power

    June 1, 2026

    It’s the Lessons We Learned Along the Way. Or, Is It?

    June 1, 2026

    Proxy-Pointer RAG: Eliminating Wasteful Entity & Relations Extraction in Knowledge Graphs

    May 31, 2026

    Comments are closed.

    Editors Picks

    How small businesses can leverage AI

    June 2, 2026

    Robots-Blog | Humanoide Robotik aus Deutschland: igus bringt neuen Serviceroboter auf den Markt

    June 2, 2026

    GM reimagines Hummer off-roader with California ideas unit

    June 2, 2026

    London’s DEScycle secures over €10 million in grant funding to scale critical metals recovery platform

    June 2, 2026
    Categories
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    About Us
    About Us

    Welcome to Times Featured, an AI-driven entrepreneurship growth engine that is transforming the future of work, bridging the digital divide and encouraging younger community inclusion in the 4th Industrial Revolution, and nurturing new market leaders.

    Empowering the growth of profiles, leaders, entrepreneurs businesses, and startups on international landscape.

    Asia-Middle East-Europe-North America-Australia-Africa

    Facebook LinkedIn WhatsApp
    Featured Picks

    ‘Narco-Submarine’ Carrying 4 Tons of Cocaine Captured by Mexico’s Navy

    February 22, 2026

    Wireless charging mats could power future Mars rovers.

    November 18, 2025

    Romania’s national gambling regulator adds Polymarket to its operator blacklist

    November 3, 2025
    Categories
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    Copyright © 2024 Timesfeatured.com IP Limited. All Rights.
    • Privacy Policy
    • Disclaimer
    • Terms and Conditions
    • About us
    • Contact us

    Type above and press Enter to search. Press Esc to cancel.