
    Reinforcement Learning from Human Feedback, Explained Simply

By Editor Times Featured | June 24, 2025 | 8 Mins Read


The arrival of ChatGPT in 2022 completely changed how the world perceives artificial intelligence. Its remarkable performance led to the rapid development of other powerful LLMs.

We could roughly say that ChatGPT is an upgraded version of GPT-3. But compared to previous GPT versions, this time OpenAI's developers did not simply use more data or a more complex model architecture. Instead, they designed a remarkable technique that enabled a breakthrough.

In this article, we will discuss RLHF, a fundamental algorithm implemented at the core of ChatGPT that overcomes the limits of human annotation for LLMs. Although the algorithm is based on proximal policy optimization (PPO), we will keep the explanation simple, without going into the details of reinforcement learning, which is not the focus of this article.

NLP development before ChatGPT

To better understand the context, let us recall how LLMs were developed in the past, before ChatGPT. In most cases, LLM development consisted of two stages:

The pre-training & fine-tuning framework

Pre-training consists of language modeling: a task in which a model tries to predict a hidden token from its context. The probability distribution produced by the model for the hidden token is then compared to the ground-truth distribution to calculate the loss and perform backpropagation. In this way, the model learns the semantic structure of the language and the meaning behind words.

If you want to learn more about the pre-training & fine-tuning framework, check out my article about BERT.

After that, the model is fine-tuned on a downstream task, which can involve different objectives: text summarization, text translation, text generation, question answering, and so on. In many situations, fine-tuning requires a human-labeled dataset, which should ideally contain enough text samples for the model to generalize well and avoid overfitting.

This is where the limits of fine-tuning appear. Data annotation is usually a time-consuming task performed by humans. Take a question-answering task, for example. To construct training samples, we would need a manually labeled dataset of questions and answers. For every question, we would need a precise answer provided by a human. For instance:

During data annotation, providing full answers to prompts requires a lot of human time.

In reality, training an LLM would require millions or even billions of such (question, answer) pairs. This annotation process is very time-consuming and does not scale well.

    RLHF

Having understood the main problem, now is a good moment to dive into the details of RLHF.

If you have already used ChatGPT, you have probably encountered a situation in which it asks you to choose the answer that better fits your initial prompt:

The ChatGPT interface asking a user to rate two possible answers.

This information is used to continuously improve ChatGPT. Let us understand how.

First of all, it is important to notice that choosing the better of two answers is a much simpler task for a human than providing an exact answer to an open question. The idea we are going to look at is based exactly on that: to create the annotated dataset, we only want the human to choose an answer from two possible options.

Choosing between two options is an easier task than asking someone to write the best possible response.

Response generation

In LLMs, there are several possible ways to generate a response from the distribution of predicted token probabilities:

• Greedy decoding: given an output distribution p over tokens, the model always deterministically chooses the token with the highest probability.
The model always selects the token with the highest softmax probability.
• Sampling: given an output distribution p over tokens, the model randomly samples a token according to its assigned probability.
The model randomly chooses a token each time, so the highest probability does not guarantee that the corresponding token will be chosen. Running the generation process again can produce different results.

This second method, sampling, results in more randomized model behavior, which allows the generation of diverse text sequences. For now, let us suppose that we generate many pairs of such sequences. The resulting dataset of pairs is labeled by humans: for every pair, a human is asked which of the two output sequences fits the input sequence better. The annotated dataset is then used in the next step.

In the context of RLHF, the annotated dataset created in this way is called "human feedback".
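As a rough sketch, the two decoding strategies above can be illustrated with a toy next-token distribution (the tokens and probabilities below are invented for illustration, not produced by a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token distribution over a four-token vocabulary.
tokens = ["cat", "dog", "car", "sun"]
probs = np.array([0.5, 0.3, 0.15, 0.05])

# Greedy decoding: always take the most probable token.
greedy = tokens[int(np.argmax(probs))]

# Sampling: draw a token according to its probability, so repeated
# runs (with different seeds) can yield different tokens.
sampled = rng.choice(tokens, p=probs)

print(greedy)   # cat
print(sampled)  # one of the four tokens, varying between runs
```

Sampling is what makes it possible to produce two genuinely different responses to the same prompt, which is exactly what the pairwise annotation step needs.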

Reward model

After the annotated dataset is created, we use it to train a so-called "reward" model, whose goal is to learn to numerically estimate how good or bad a given answer is for an initial prompt. Ideally, we want the reward model to output positive values for good responses and negative values for bad responses.

As for the reward model's architecture, it is exactly the same as the initial LLM, except for the last layer: instead of outputting a text sequence, the model outputs a float value, an estimate for the answer.

It is important to pass both the initial prompt and the generated response as input to the reward model.

Loss function

You might reasonably ask how the reward model can learn this regression task if there are no numerical labels in the annotated dataset. This is a fair question. To address it, we use an interesting trick: we pass both a good and a bad answer through the reward model, which ultimately outputs two different estimates (rewards).

Then we cleverly construct a loss function that compares them relative to each other:

Loss = −log σ(R₊ − R₋)

The loss function used in the RLHF algorithm. R₊ refers to the reward assigned to the better response, while R₋ is the reward estimated for the worse response.

Let us plug in some argument values for the loss function and analyze its behavior. Below is a table with the plugged-in values:

    R₊ − R₋     Loss = −log σ(R₊ − R₋)
      −2              ≈ 2.13
      −1              ≈ 1.31
       0              ≈ 0.69
       1              ≈ 0.31
       2              ≈ 0.13

We can immediately observe two interesting insights:

• If the difference between R₊ and R₋ is negative, i.e. the better response received a lower reward than the worse one, then the loss grows roughly in proportion to the reward difference, meaning that the model needs to be significantly adjusted.
• If the difference between R₊ and R₋ is positive, i.e. the better response received a higher reward than the worse one, then the loss is bounded within much lower values in the interval (0, 0.69), which indicates that the model is doing its job well at distinguishing good and bad responses.

A nice property of this loss function is that the model learns appropriate rewards for generated texts on its own, and we (humans) do not have to explicitly evaluate every response numerically; we only provide a binary signal: whether a given response is better or worse.
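The pairwise loss above can be written in a few lines; `rlhf_pairwise_loss` is a hypothetical helper name for this sketch:

```python
import math

def rlhf_pairwise_loss(r_better: float, r_worse: float) -> float:
    """-log(sigmoid(R+ - R-)): small when the reward model already ranks
    the preferred response higher, large when it ranks it lower."""
    diff = r_better - r_worse
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Model already correct: loss stays below log 2 ≈ 0.69.
print(round(rlhf_pairwise_loss(2.0, 0.0), 3))  # 0.127
# Model wrong: loss grows roughly linearly with the gap.
print(round(rlhf_pairwise_loss(0.0, 2.0), 3))  # 2.127
```

Note how the two printed values match the table above: only the sign of the reward difference changes, yet the penalties are very different.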

Training the original LLM

The trained reward model is then used to train the original LLM. For that, we can feed a series of new prompts to the LLM, which will generate output sequences. Then the input prompts, together with the output sequences, are fed to the reward model to estimate how good those responses are.

After producing numerical estimates, this information is used as feedback for the original LLM, which then performs weight updates. A very simple but elegant approach!

RLHF training diagram

Most of the time, a reinforcement learning algorithm is used in this last step to adjust the model weights (usually proximal policy optimization, PPO).

Even if it is not technically correct, if you are not familiar with reinforcement learning or PPO, you can roughly think of this step as backpropagation, as in normal machine learning algorithms.

    Inference

During inference, only the original trained model is used. At the same time, the model can be continuously improved in the background by collecting user prompts and periodically asking users to rate which of two responses is better.

    Conclusion

In this article, we have studied RLHF, a highly efficient and scalable technique for training modern LLMs. The elegant combination of an LLM with a reward model significantly simplifies the annotation task performed by humans, which required huge effort in the past when done through raw fine-tuning procedures.

RLHF is used at the core of many modern models such as ChatGPT, Claude, Gemini, and Mistral.

Resources

All images, unless otherwise noted, are by the author.


