
    Train a Humanoid Robot with AI and Python

    By Editor Times Featured | November 5, 2025 | 10 Mins Read


    Humanoid robots are machines that resemble the human body in form and motion, designed to work alongside people and interact with our tools. They are still an emerging technology, but forecasts predict billions of humanoids by 2050. Currently, the most advanced prototypes are NEO by 1XTech, Optimus by Tesla, Atlas by Boston Dynamics, and G1 by China's Unitree Robotics.

    There are two ways for a robot to perform a task: manual control (when you explicitly program what it has to do) or Artificial Intelligence (it learns how to do things by trying). In particular, Reinforcement Learning allows a robot to learn the best actions through trial and error to achieve a goal, so it can adapt to changing environments by learning from rewards and penalties with no predefined plan.

    In practice, it is extremely expensive to have a real robot learn how to perform a task. Therefore, state-of-the-art approaches learn in simulation, where data generation is fast and cheap, and then transfer the knowledge to the real robot (the "sim-to-real" / "sim-first" approach). This also allows the parallel training of multiple models in simulation environments.

    The most used 3D physics simulators on the market are PyBullet (beginners), Webots (intermediate), MuJoCo (advanced), and Gazebo (professionals). You can use any of them as standalone software or through Gym, a library made by OpenAI for developing Reinforcement Learning algorithms, built on top of different physics engines.

    In this tutorial, I'm going to show how to build a 3D simulation for a humanoid robot with Artificial Intelligence. I'll present some useful Python code that can be easily applied to other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example (link to the full code at the end of the article).

    Setup

    An environment is a simulated space where agents can interact and learn to perform a task. It has a defined observation space (the information agents receive) and an action space (the set of possible actions).

    I'll use Gym (pip install gymnasium) to load one of the default environments made with MuJoCo (Multi-Joint dynamics with Contact, pip install mujoco).

    import gymnasium as gym
    
    env = gym.make("Humanoid-v4", render_mode="human")
    obs, info = env.reset()
    env.render()

    The agent is a 3D bipedal robot that can move like a human. It has 12 links (solid body parts) and 17 joints (flexible body parts). You can see the full description here.

    Before starting a new simulation, you must reset the environment with obs, info = env.reset(). That command returns information about the agent's initial state. The info usually contains additional details about the robot.

    While obs is what the agent sees (i.e., through sensors), an AI model would need to process these observations to decide what action to take.

    Usually, all Gym environments have the same structure. The first thing to check is the action space, the set of all possible actions. For the Humanoid simulation, an action represents the force applied to one of its 17 joints (within a range of -0.4 and +0.4 to indicate the direction of the push).

    env.action_space
    env.action_space.sample()

    A simulation should cover at least one episode, a complete run of the agent interacting with the environment, from start to termination. Each episode is a loop of reset() -> step() -> render(). Let's make an example running one single episode with the humanoid taking random actions, so no AI yet.

    import time
    
    env = gym.make("Humanoid-v4", render_mode="human")
    obs, info = env.reset()
    
    reset = False #reset if the humanoid falls or the episode ends
    episode = 1
    total_reward, step = 0, 0
    
    for _ in range(240):
        ## action
        step += 1
        action = env.action_space.sample() #random action
        obs, reward, terminated, truncated, info = env.step(action)
        ## reward
        total_reward += reward
        ## render
        env.render() #render physics step
        time.sleep(1/240) #slow down to real time (240 steps × 1/240 second sleep = 1 second)
        if (step == 1) or (step % 100 == 0): #print first step and every 100 steps
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
        ## reset
        if reset:
            if terminated or truncated: #print the last step
                print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
                obs, info = env.reset()
                episode += 1
                total_reward, step = 0, 0
                print("------------------------------------------")
    
    env.close()

    As the episode continues and the robot moves, we receive a reward. In this case, it's positive if the agent stays up or moves forward, and it's a negative penalty if it falls and touches the ground. The reward is the most important concept for AI because it defines the goal. It's the feedback signal we get from the environment after every action, indicating whether that move was useful or not. Therefore, it can be used to optimize the robot's decision-making through Reinforcement Learning.
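    Since RL optimizes cumulative rather than immediate reward, a useful quantity to keep in mind is the discounted return: the sum of future rewards, each weighted by a discount factor. A minimal sketch in plain Python (not part of the Gym API; the reward list is made up for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by how far in the future it is."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# three steps standing (+5 each), then a fall (-10)
print(discounted_return([5, 5, 5, -10], gamma=0.99))  # ≈ 5.15
```

    With gamma close to 1 the agent values long-term survival; with gamma close to 0 it only cares about the next step.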

    Reinforcement Studying

    At every step of the simulation, the agent observes the current situation (i.e., its position in the environment), decides to take an action (i.e., moves one of its joints), and receives a positive or negative response (reward, penalty). This cycle repeats until the simulation ends. RL is a type of Machine Learning that leads the agent to maximize the reward through trial and error. So if successful, the robot will know the best course of action.

    Mathematically, RL is based on the Markov Decision Process, in which the future depends only on the present situation, and not the past. To put it in simple terms, the agent doesn't need a memory of previous steps to decide what to do next. For example, a robot only needs to know its current position and velocity to choose its next move; it doesn't need to remember how it got there.
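    To make the Markov property concrete, here is a toy sketch (made-up states and transition probabilities, not related to the Humanoid environment): the next state is sampled from a distribution that depends only on the current state.

```python
import random

# made-up two-state chain: probabilities depend only on the current state
transitions = {
    "standing": {"standing": 0.9, "fallen": 0.1},
    "fallen":   {"standing": 0.0, "fallen": 1.0},
}

def next_state(state, rng):
    """Sample the next state given ONLY the current state (Markov property)."""
    probs = transitions[state]
    return rng.choices(list(probs.keys()), weights=list(probs.values()))[0]

rng = random.Random(0)
state = "standing"
for _ in range(20):
    state = next_state(state, rng)
print(state)  # in this toy chain, "fallen" is absorbing: once there, it stays
```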

    RL is all about maximizing the reward. So, the whole art of building a simulation is designing a reward function that truly reflects what you want (here the goal is not to fall down). The most basic RL algorithm updates the list of preferred actions after receiving a positive reward. The speed at which that happens is the learning rate: if this number is too high, the agent will overcorrect, while if it's too low, it keeps making the same mistakes and learns painfully slowly.

    The preferred-action updates are also affected by the exploration rate, which is the frequency of a random choice; basically, it's the AI's curiosity level. Usually, it's relatively high at the beginning (when the agent knows nothing) and decays over time as the robot exploits its knowledge.
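    That decay schedule can be sketched in isolation (the numbers 0.5, 0.99, and 0.05 are the same start, decay, and floor values used in the script below; they are arbitrary choices, not library defaults):

```python
exploration_rate = 0.5  # start wild
history = []
for t in range(600):
    history.append(exploration_rate)
    # multiplicative decay with a floor, so the agent never stops exploring entirely
    exploration_rate = max(0.05, exploration_rate * 0.99)

print(round(history[0], 2), round(history[-1], 2))  # → 0.5 0.05
```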

    import gymnasium as gym
    import time
    import numpy as np
    
    env = gym.make("Humanoid-v4", render_mode="human")
    obs, info = env.reset()
    
    reset = True #reset if the humanoid falls or the episode ends
    episode = 1
    total_reward, step = 0, 0
    exploration_rate = 0.5 #start wild
    preferred_action = np.zeros(env.action_space.shape) #knowledge to update with experience
    
    for _ in range(1000):
        ## action
        step += 1
        exploration = np.random.normal(loc=0, scale=exploration_rate, size=env.action_space.shape) #add random noise
        action = np.clip(a=preferred_action+exploration, a_min=env.action_space.low, a_max=env.action_space.high) #keep inside the valid action range
        obs, reward, terminated, truncated, info = env.step(action)
        ## reward
        total_reward += reward
        if reward > 0:
            preferred_action += (action-preferred_action)*0.05 #learning_rate
        exploration_rate = max(0.05, exploration_rate*0.99) #min_exploration=0.05, decay_exploration=0.99
        ## render
        env.render()
        time.sleep(1/240)
        if (step == 1) or (step % 100 == 0):
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
        ## reset
        if reset:
            if terminated or truncated:
                print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
                obs, info = env.reset()
                episode += 1
                total_reward, step = 0, 0
                print("------------------------------------------")
    
    env.close()

    Clearly, that's way too basic for a complex environment like the Humanoid, so the agent will keep falling even while it updates the preferred actions.

    Deep Reinforcement Studying

    When the relationship between actions and rewards is non-linear, you need Neural Networks. Deep RL can handle high-dimensional inputs and estimate the expected future rewards of actions by leveraging the power of Deep Neural Networks.

    In Python, the easiest way to use Deep RL algorithms is through StableBaselines, a collection of the most well-known models, already pre-implemented and ready to go. Please note that there is StableBaselines (written in TensorFlow) and StableBaselines3 (written in PyTorch). Nowadays, everyone is using the latter.

    pip install torch
    pip install stable-baselines3

    One of the most commonly used Deep RL algorithms is PPO (Proximal Policy Optimization), as it's simple and stable. The goal of PPO is to maximize the total expected reward while making small updates to the policy, keeping growth steady.
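    The "small updates" come from PPO's clipped surrogate objective: the probability ratio between the new and old policies is clipped to [1-ε, 1+ε], so a single update can't change the policy too much. A minimal numpy sketch of that formula (illustrative numbers; StableBaselines3 implements this internally):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)

# a big ratio with positive advantage is capped at (1+eps)*A ...
print(ppo_clip_objective(np.array([1.5]), np.array([2.0])))  # capped at 2.4
# ... while a moderate ratio passes through unclipped
print(ppo_clip_objective(np.array([1.1]), np.array([2.0])))  # 2.2
```

    Taking the minimum means the objective never rewards pushing the ratio far outside the clip range, which is what keeps PPO training stable.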

    I'll use StableBaselines3 to train a PPO on the Gym Humanoid environment. There are a few things to keep in mind:

    • We don't need to render the env graphically, so the training can proceed at accelerated speed.
    • The Gym env must be wrapped into DummyVecEnv to make it compatible with StableBaselines3's vectorized format.
    • Regarding the Neural Network model, PPO uses a Multi-Layer Perceptron (MlpPolicy) for numeric inputs, a Convolutional NN (CnnPolicy) for images, and a combined model (MultiInputPolicy) for observations of mixed types.
    • Since I'm not rendering the humanoid, I find it very useful to follow the training progress on TensorBoard, a toolkit to visualize statistics in real time (pip install tensorboard). I created a folder named "logs", and I can simply run tensorboard --logdir=logs/ in the terminal to serve the dashboard locally (http://localhost:6006/).

    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv
    
    ## environment
    env = gym.make("Humanoid-v4") #no rendering to speed up
    env = DummyVecEnv([lambda: env])
    
    ## train
    print("Training START")
    model = PPO(policy="MlpPolicy", env=env, verbose=0, 
                learning_rate=0.005, ent_coef=0.005, #exploration
                tensorboard_log="logs/") #>tensorboard --logdir=logs/
    
    model.learn(total_timesteps=3_000_000, #~1h
                tb_log_name="model_humanoid", log_interval=10)
    print("Training DONE")
    
    ## save
    model.save("model_humanoid")

    After the training is complete, we can load the new model and test it in the rendered environment. Now, the agent won't be updating the preferred actions anymore. Instead, it will use the trained model to predict the next best action given the current state.

    env = gym.make("Humanoid-v4", render_mode="human")
    model = PPO.load(path="model_humanoid", env=env)
    obs, info = env.reset()
    
    reset = False #reset if the humanoid falls or the episode ends
    episode = 1
    total_reward, step = 0, 0
    
    for _ in range(1000):
        ## action
        step += 1
        action, _ = model.predict(obs)    
        obs, reward, terminated, truncated, info = env.step(action)
        ## reward
        total_reward += reward
        ## render
        env.render()
        time.sleep(1/240)
        if (step == 1) or (step % 100 == 0): #print first step and every 100 steps
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
        ## reset
        if reset:
            if terminated or truncated: #print the last step
                print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
                obs, info = env.reset()
                episode += 1
                total_reward, step = 0, 0
                print("------------------------------------------")
    
    env.close()

    Please note that at no point in the tutorial did we explicitly program the robot to stay up. We're not controlling the agent; the robot is simply reacting to the reward function of its environment. In fact, if you train the RL model for much longer (i.e., 30 million timesteps), you'll start seeing the robot not only standing up perfectly, but also walking forward. So, when it comes to training an agent with AI, the design of the 3D world and its rules is more important than building the robot itself.

    Conclusion

    This article has been a tutorial introducing MuJoCo and Gym, and how to create 3D simulations for Robotics. We used the Humanoid environment to learn the basics of Reinforcement Learning. We trained a Deep Neural Network to teach the robot how not to fall down. New tutorials with more advanced robots will come.

    Full code for this article: GitHub

    I hope you enjoyed it! Feel free to contact me with questions and feedback, or just to share your interesting projects.

    👉 Let’s Connect 👈

    (All images are by the author unless otherwise noted)
