    How to Fine-Tune Small Language Models to Think with Reinforcement Learning

    By Editor Times Featured, July 9, 2025 (24 min read)


    Reasoning models are in vogue. DeepSeek-R1, Gemini-2.5-Pro, OpenAI’s O-series models, Anthropic’s Claude, Magistral, and Qwen3: there’s a new one each month. When you ask these models a question, they go into a chain of thought before producing an answer.

    A simple demonstration of what reasoning looks like. When asked a question, the Language Model (LM) generates a chain of thought first, followed by the answer. (Illustration by the Author)

    I recently asked myself the question, “Hmm… I wonder if I should write a Reinforcement Learning loop from scratch that teaches this ‘thinking’ behaviour to really small models, like only 135 million parameters”. It should be easy, right?

    Well, it wasn’t.

    Small models simply don’t have the world knowledge that large models do. Models with fewer than 1B parameters lack the “common sense” to easily reason through complex logical tasks. Therefore, you cannot just rely on compute to train them to reason.

    You need additional tricks up your sleeve.

    In this article, I won’t just cover tricks though. I’ll cover the major ideas behind training reasoning behaviours into language models, share some simple code snippets, and give some practical tips to fine-tune Small Language Models (SLMs) with RL.

    This article is divided into 5 sections:

    1. Intro to RLVR (Reinforcement Learning with Verifiable Rewards) and why it’s uber cool
    2. A visual overview of the GRPO algorithm and the clipped surrogate PPO loss.
    3. A code walkthrough!
    4. Supervised fine-tuning and practical tips to train reasoning models
    5. Results!

    Unless otherwise mentioned, all images used in this article are illustrations produced by the author.

    At the end of this article, I’ll link to the 50-minute companion YouTube video of this article. If you have any queries, that video likely has the answers/clarification you need. You can also reach out to me on X (@neural_avb).

    1. Reinforcement Learning with Verifiable Rewards (RLVR)

    Before diving into specific challenges with small models, let’s first introduce some terms.

    Group Relative Policy Optimization, or GRPO, is a (rather new) Reinforcement Learning (RL) technique that researchers are using to fine-tune Large Language Models (LLMs) on logical and analytical tasks. Since its inception, a new term has been circulating in the LLM research space: RLVR, or Reinforcement Learning with Verifiable Rewards.

    To understand what makes RLVR unique, it’s helpful to contrast it with the most common application of RL in language models: RLHF (Reinforcement Learning with Human Feedback). In RLHF, an RL module is trained to maximize scores from a separate reward model, which acts as a proxy for human preferences. This reward model is trained on a dataset where humans have ranked or rated different model responses.

    In other words, RLHF is trained so LLMs can output responses that are more aligned with human preferences. It tries to make models follow instructions more closely.

    RLVR tries to solve a different problem. RLVR teaches a model to be verifiably correct, often by learning to generate its own chain of thought.

    Where RLHF had a subjective reward model, RLVR uses an objective verifier. The core idea is to provide rewards based on whether an answer is demonstrably correct, not on a prediction of what a human might prefer.

    An illustration of how RLVR works (Illustrated by the Author)

    This is exactly why this method is called “RL with verifiable rewards”. Not every question’s answer can be verified easily, especially open-ended questions like “What iPhone should I buy?” or “Where should I go to college?”. Some use cases, however, do fit easily in the “verifiable rewards” paradigm, like math, logical tasks, and code-writing, to name a few. In the reasoning-gym section below, we will look into how exactly these tasks can be simulated and how the rewards can be generated.

    But before that, you might ask: well, where does “reasoning” fit into all of this?

    We will train the LLM to generate arbitrarily long chain of thought reasoning texts before producing the final answer. We instruct the model to wrap its thinking process in <think> ... </think> tags and its final conclusion in <answer> ... </answer> tags.

    The full language model response will look something like this:

    <think>
    User has asked me to count the number of r's in strawberry.
    Let's do a cumulative count.
    s=0, t=0, r=1, a=1, w=1, b=1, e=1, r=2, r=3, y=3

    It seems there are 3 r's in strawberry.
    I notice that there's an r in straw and 2 r's in berry.
    Since 1+2=3 I'm more confident there are 3 r's
    </think>
    <answer>
    3
    </answer>

    This structure allows us to easily extract just the final answer and check if it’s correct. The verifier is a single source of truth, and can be a simple piece of code that (literally) counts letters.

    def count_alphabets(word, letter):
        # Count how many times `letter` occurs in `word`
        return sum([1 for l in word if l == letter])

    # lm_answer is the integer parsed from the model's final answer
    reward = 1 if lm_answer == count_alphabets("strawberry", "r") else -1

    We will keep a record of the model’s experiences: its responses and the corresponding rewards received from the verifier. The RL algorithm will then train to promote behaviours that increase the likelihood of correct final answers.

    By consistently rewarding correct answers and good formatting, we increase the likelihood of reasoning tokens that lead to correct answers.

    Get this: we don’t need to directly evaluate the intermediate reasoning tokens. By simply rewarding the final answer, we indirectly elicit reasoning steps in the LLM’s chain of thought that lead to correct answers!

    Source: Some excerpts from the DeepSeek-R1 paper (License: Free)

    2. GRPO (Group Relative Policy Optimization)

    I’m going to skip the usual Reinforcement Learning 101 intro here; I expect most of you who read this far to know the basics of RL. There is an agent that observes states from the environment and takes an action. The environment rewards the agent depending on how good the action was. The agent stores these experiences and trains to take better actions in the future that lead to higher rewards. RL 101 class dismissed.

    But how do we transfer the RL paradigm to language?

    Let’s talk about our algorithm of choice, Group Relative Policy Optimization, to understand how. GRPO works in two alternating phases: an experience collection phase, where the Language Model (LM) accumulates experiences in the environment with its current weights, and a training phase, where it uses the collected memories to update its weights and improve. After training, it once again goes into an experience collection step with the updated weights.

    Experience Collection

    Let’s dissect each step in the experience collection phase now.

    • Step 1: The environment is a black box that generates questions about logical or math tasks. We will discuss this in an upcoming section with the reasoning-gym library.
    • Step 2: We tokenize the input questions into a sequence of integer tokens.
    Sample questions, tokenize them, run a forward pass through the LM, and generate multiple responses for each question! (Illustrated by the Author)
    • Step 3: The “agent” or the “policy” is the current SLM we are training. It observes the environment’s tokenized questions and generates responses. The LLM response gets converted into text and returned to the environment. The environment rewards each response.
    The Environment acts as the verifier and assigns a reward to the agent. (Illustrated by the Author)
    • Step 4: From the rewards, we calculate the advantage of each response. In GRPO, the advantage is the relative goodness of each response within the group. Importantly, advantages are calculated per group, i.e. we don’t standardize rewards across different questions.
    Advantages define how valuable a specific response is relative to other responses to the same question (Illustrated by the Author)
    • Step 5: The original question, the log probabilities for each LM-generated token, and the advantages are all accumulated inside a memory buffer.
    • Steps 1-5 are repeated until the buffer size reaches the desired threshold.
    Saving experiences in the buffer! (Illustrated by the Author)
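
    As a rough illustration of Step 4, here is a minimal sketch of group-relative advantages on made-up rewards. The function name and the reward values are assumptions for this example, and it uses the population standard deviation with a small epsilon to avoid dividing by zero.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (all responses to the same question)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]

# Two hypothetical questions with G = 4 sampled responses each.
mixed_group = [1.0, 0.0, 0.0, 1.0]    # two correct, two wrong
uniform_group = [0.0, 0.0, 0.0, 0.0]  # every response wrong

print(group_advantages(mixed_group))    # roughly [1, -1, -1, 1]
print(group_advantages(uniform_group))  # all zeros: no learning signal
```

    Note how a group where every response earns the same reward produces zero advantages, so that question contributes nothing to the update.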

    Training Phase

    Once the experience collection phase ends, we enter the training phase. Here, we learn from the reward patterns the LLM observed and use RL to improve its weights. Here is how that works:

    1. Randomly sample a minibatch of memories. Remember, each memory already contains its group-relative advantage (Step 5 from the experience collection phase). Randomly sampling question-answer pairs improves the robustness of the training because the gradients are calculated as an average of a diverse set of experiences, preventing over-fitting on any single question.
    2. For each minibatch, we want to maximize the clipped surrogate objective from the standard PPO (Proximal Policy Optimization) formulation, which is broken down in the next section. The major difference with GRPO is that we don’t need an additional reward model or a value network to calculate advantages. Instead, GRPO samples multiple responses to the same question to calculate the relative advantage of each response. The memory footprint is significantly reduced since we won’t need to train those additional models!
    3. Repeat the above steps.
    GRPO operates in 2 repeating phases: collect experiences, train on experiences, repeat. (Illustrated by the Author)

    What the PPO Loss means

    Let me explain the PPO Loss in an intuitive step-by-step fashion. The PPO Loss looks like this.

    The PPO Loss Function. Let me break it down for you. (Illustration by the Author)
    • Here, π_old is the old-policy neural network that we used during the data collection phase.
    • π is the current policy neural network we are training. Since the weights of π change after each gradient update, π and π_old don’t remain the same during the training phase, hence the distinction.
    • G is the number of generated responses for a single question. |o_i| is the length of the i-th response in the group. Therefore, these summation and normalization operations compute a mean over all the tokens over all responses. What does it compute the mean of? Well, it’s π/π_old * A_it. What does that mean?
    The simplest way to assign an advantage to each token is by copying the advantage of the entire response (Illustrated by the Author)
    • A_it is the advantage of the t-th token in the i-th response. Remember when we calculated the advantage of each response in Step 4 during experience collection? The simplest way to assign an advantage to each token is by duplicating the same advantage to each token. This means we are saying that every token is equally responsible for producing the correct answer.
    • Finally, what is π(o_it | q, o_i,<t)? It is the probability of the t-th token in the i-th response, given the question q and the tokens that came before it. In other words, how likely was that token when it was generated?
    • The importance sampling ratio reweights the advantages between the current updating policy and the old exploration policy.
    • The clipping term ensures that the updates to the network don’t become too large and the weights don’t move too far away from the old policy. This adds more stability to the training process by keeping the model updates close to “a trust region” around the data-collection policy.
    The PPO objective broken down into individual components. (Illustrated by the Author)
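
    For readers without the figure, the objective described by the bullets above can be written out as follows. This is a transcription under the assumption of the standard GRPO form, with a clip range ε (a symbol the text does not name) and with the optional KL penalty left out:

```latex
\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\min\Big( r_{i,t}(\theta)\, A_{i,t},\;
\operatorname{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_{i,t} \Big) \right],
\qquad
r_{i,t}(\theta) = \frac{\pi(o_{i,t} \mid q,\, o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid q,\, o_{i,<t})}
```

    Here r_{i,t} is the importance sampling ratio π/π_old for a single token, and the two nested sums with their 1/G and 1/|o_i| factors are the mean over responses and tokens described above.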

    When we are maximizing the PPO objective, we are effectively asking the LLM to increase the log-probability of the tokens that led to a high advantage, while decreasing the log-probability of tokens that had a low advantage.

    In other words: make tokens that generate good advantages more likely and tokens that generate low advantages less likely.

    Understanding the PPO Loss with an example

    Let’s forget about the clipping term and π_old for now, and just see what maximizing π(o_i) * A_i means. To remind you, this part of the equation simply means “the product of the probability of the i-th token (o_i) and the advantage of the i-th token (A_i)”.

    Let’s say for a question, the LLM generated these two sequences: “A B C” and “D E F”, and it got an advantage of +1 for the former and -1 for the latter*. Let’s say we have the log probabilities for each of the 3 tokens as shown below.

    * strictly, the exact values depend on how the standard deviation is computed. With the population standard deviation (NumPy’s default, used in the code later), a group of two responses gives exactly +1 and -1. With the sample standard deviation (Bessel’s correction, PyTorch’s default), the advantages would be +0.707 and -0.707.

    Notice what happens when you multiply the advantages A_it by the current log-probs from π. Now really think about what it means to maximize the mean of that product matrix.

    A toy example to show what it means to maximize the product of the probability of a token with its advantage (Illustrated by the Author)

    Remember, we can only change the probabilities coming out of the LLM. The advantages come from the environment and are therefore treated as constants. Increasing this expected score therefore means increasing the probability of tokens with a positive advantage, and decreasing the probability of tokens with a negative advantage.

    To increase the mean of the product tensor, we must increase each value in the tensor, so we must increase the probs of positive-advantage tokens, and decrease the probs of negative-advantage tokens. (Illustrated by the Author)
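
    The same toy example can be worked through numerically. The sketch below uses made-up log-probs and, as a simplifying assumption, takes a gradient step directly on the log-probs themselves. A real model updates its weights instead, and its log-probs stay coupled through the softmax.

```python
# Toy numbers (assumed): log-probs of each token in two sampled responses.
log_probs = [
    [-0.5, -1.0, -2.0],  # "A B C"
    [-0.7, -0.3, -1.5],  # "D E F"
]
advantages = [+1.0, -1.0]  # one advantage per response, copied to every token

def surrogate(log_probs, advantages):
    """Mean over all tokens of log-prob times advantage (no clipping, no pi_old)."""
    products = [lp * adv for row, adv in zip(log_probs, advantages) for lp in row]
    return sum(products) / len(products)

def ascent_step(log_probs, advantages, lr=0.6):
    """One gradient-ascent step taken directly on the log-probs.

    d(surrogate)/d(log_prob) = advantage / num_tokens, so the update direction
    for each token is simply the sign of its response's advantage.
    """
    n = sum(len(row) for row in log_probs)
    return [[lp + lr * adv / n for lp in row] for row, adv in zip(log_probs, advantages)]

before = surrogate(log_probs, advantages)
updated = ascent_step(log_probs, advantages)
after = surrogate(updated, advantages)
print(before, after)  # the objective goes up after the step
```

    Every token of “A B C” ends up with a higher log-prob and every token of “D E F” with a lower one, which is the behaviour described above.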

    Below, you will find an example of how log-probs change after a few rounds of training. Notice how the blue line moves closer to zero when the advantage is high? This indicates that the log-probabilities increased (or the probabilities increased) after going through RL training. Compare that to the plot on the right, which shows a different response with a low advantage. The blue line moves away from 0, so those tokens become less likely to be selected in later rounds.

    A comparison of how RL fine-tuning affects log-probs of tokens after training (Illustration by the Author)

    In the next section, let’s take a look at the reasoning-gym library and understand how we could sample tasks.

    3. Implementation

    So, to do RL, we first need tasks. A common way to do this is by using an existing dataset of math problems, like the GSM-8K dataset. In this article, let’s look at a different case: generating tasks procedurally with a Python library called reasoning-gym.

    For my experiments, I used two tasks: syllogism and propositional logic. reasoning-gym contains a host of different tasks of varying difficulty.

    A syllogism task is a type of logical puzzle designed to test deductive reasoning. Basically, we provide the LLM with two premises and ask if the conclusion is correct or not. The propositional logic task is a symbolic reasoning task where the LLM is given premises written with symbols and asked to generate the conclusion. Unlike syllogism, this is not a YES/NO classification response; the model has to generate the correct conclusion directly. This makes the task considerably harder.

    Example of the Syllogism Task (Footage of my RL-trained model)

    Before we begin coding, I guess it is customary to specify what I mean by “small” models.

    The jury is still out on what qualifies as a “small” model (some say <14B, some say <7B), but for my YouTube video, I picked even smaller models: SmolLM-135M-Instruct, SmolLM-360M-Instruct, and Qwen3-0.6B. These are ~135M, ~360M, and ~600M models, respectively.

    Let’s see how to set up the basic training loop. First, we can use Huggingface’s transformers library to load the model we want to train, let’s say the little 135M param model SmolLM-135M-Instruct.

    To generate some propositional logic tasks, for example, you just call the reasoning_gym.create_dataset function as shown below.

    import re
    from reasoning_gym import create_dataset, get_score_answer_fn
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    
    model_name = "HuggingfaceTB/SmolLM-135M-Instruct"
    
    # Load the model and tokenizer from Hugging Face
    lm = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # This sets all parameters as trainable
    for param in lm.parameters():
        param.requires_grad = True
    # In my experiments, I used a LORA adapter (more on this later)
    
    # Specify the name of the env
    environment_name = "propositional_logic"
    
    # Number of questions to generate (placeholder value, pick what suits your run)
    DATA_SIZE = 1000
    
    # In practice, you should wrap this with a torch dataloader
    # to sample a minibatch of questions
    dataset = create_dataset(
        environment_name, seed=42, size=DATA_SIZE
    )
    
    for d in dataset:
        question = d["question"]  # Accessing the question
    
        # We will use this later to verify if the answer is correct
        validation_object = d["metadata"]["source_dataset"]
        score_fn = get_score_answer_fn(validation_object)
    

    To generate reasoning data, we want the LM to generate thinking, followed by the response. Below is the system prompt we will be using.

    system_prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
    The assistant first thinks about the reasoning process in the mind and then provides the user
    with the answer. The reasoning process and answer are enclosed within <think> </think> and
    <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
    <answer> answer here </answer>.
    
    Do not generate new code. Do not write python code.
    
    You may also be given examples by the user telling you the expected response format.
    Follow the format of the examples, but solve the specific problem asked by the user, not the examples.
    
    Very important - Remember again, your output format should be:
    <think> reasoning process here </think>
    <answer> answer here </answer>
    
    Your response will be scored by extracting the substring between the <answer>...</answer> tags.
    It is critical to follow the above format.
    Failing to follow the response format will result in a penalty.
    """

    To generate answers, we first tokenize the system prompt and the question as shown below.

    # Create messages structure
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}, # Obtained from reasoning-gym
    ]
    
    # Create tokenized representation
    # return_dict=True returns both input_ids and attention_mask, which generate() uses below
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True
    )

    Then we pass it through the LM, generate multiple responses using the num_return_sequences parameter, and detokenize the output to get string responses. No gradients are calculated during this stage.

    generated_response = lm.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens, # The max number of tokens to generate
        do_sample=True,                # Probabilistic sampling
        top_p=0.95,                    # Nucleus sampling
        num_return_sequences=G,        # Number of sequences per question
        temperature=1,                 # Sampling temperature (higher values increase randomness)
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
    

    We also write the extract_answer function, which uses regular expressions to extract answers between the answer tags.

    def extract_answer(response):
        # Non-greedy match of everything between the answer tags, across newlines
        answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        if answer is not None:
            return answer.group(1).strip()
        else:
            return ""
    

    Finally, we use the score function we obtained previously to generate a reward depending on whether the LM’s response was correct. To calculate rewards, we add a format reward and a correctness reward. The correctness reward comes from the environment, and the format reward is awarded if the model correctly generates the <think> ... </think> and <answer> ... </answer> tags.
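
    The article does not show how the format reward is computed, so here is one possible sketch. The regex, the 0/1 scoring, and the single-string signature are assumptions for illustration, not the exact function used in the experiments. It would be applied to each response in the batch.

```python
import re

# Hypothetical format check: exactly one <think>...</think> block followed by
# exactly one <answer>...</answer> block, with nothing else around them.
FORMAT_PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

def calculate_format_reward(response: str) -> float:
    """Return 1.0 when the response follows the expected tag layout, else 0.0."""
    if response.count("<think>") != 1 or response.count("<answer>") != 1:
        return 0.0
    return 1.0 if FORMAT_PATTERN.match(response) else 0.0

good = "<think> r appears three times </think>\n<answer> 3 </answer>"
bad = "The answer is 3"
print(calculate_format_reward(good), calculate_format_reward(bad))
```

    A graded variant (partial credit for each tag that is present) is another option when the model rarely gets the full format right early in training.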

    The advantages are calculated by standardizing across each group.

    import numpy as np
    
    # Response is an array of strings of length [B*G]
    # B is the number of questions, G is the number of responses per question
    
    correctness_reward = score_fn(response, validation_object)
    format_reward = calculate_format_reward(response)
    
    # Total reward is a weighted sum of correctness and formatting rewards
    rewards = correctness_reward * 0.85 + format_reward * 0.15
    
    # Convert rewards from [B*G, 1] -> [B, G]
    rewards = rewards.reshape(B, G)
    
    # Calculate advantages (standardize within each group of G responses)
    advantages = (rewards - np.mean(rewards, axis=1, keepdims=True)) / (
        np.std(rewards, axis=1, keepdims=True) + 1e-8
    )
    advantages = advantages.reshape(-1, 1)
    

    Store the (old) log probs, advantages, responses, and response masks in a memory buffer.

    # A function that returns the log prob of each selected token
    log_probs = calculate_log_probs(lm, generated_response)
    
    buffer.extend([{
        "full_response": generated_response[i],
        "response_mask": response_mask[i], # Binary mask: 1 for tokens generated by the LM, 0 for system prompt and question tokens
        "old_log_probs": log_probs[i],
        "advantages": advantages[i]
    } for i in range(len(generated_response))])
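
    The log-prob helper is not shown in the article. A minimal NumPy sketch of the idea (log-softmax over the vocabulary, then pick the entry of each token that was actually generated) might look like this. The function names and the tiny example are assumptions, and in training the same arithmetic would run on torch tensors.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def selected_token_log_probs(logits, token_ids):
    """Log-prob of each token that was actually generated.

    logits:    [batch, seq_len, vocab]; position t scores the token at t + 1
    token_ids: [batch, seq_len] integer token ids
    returns:   [batch, seq_len - 1]
    """
    log_probs = log_softmax(logits[:, :-1, :])  # drop the last position: nothing follows it
    targets = token_ids[:, 1:]                  # drop the first token: nothing predicts it
    return np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)

# Tiny made-up example: 1 sequence, 3 tokens, vocabulary of 4.
logits = np.zeros((1, 3, 4))           # uniform predictions at every position
token_ids = np.array([[2, 0, 3]])
print(selected_token_log_probs(logits, token_ids))  # each entry equals log(1/4)
```

    The one-position shift matters: the logits at position t predict the token at position t + 1, so the response mask has to be shifted the same way.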

    After multiple experience collection steps, once the buffer is full, we initiate our training loop. Here, we sample minibatches from our experience, calculate the log probs, compute the loss, and backprop.

    # full_response, response_mask, old_log_probs, advantages <--- Buffer
    
    # Recompute the new log_probs. Notice no torch.no_grad(), so gradients WILL BE USED here.
    logits = lm(input_ids=full_response).logits
    
    # Extract log probs from the logits
    # Does log_softmax over the vocabulary and extracts the log-prob of each selected token
    log_probs = calculate_log_probs(
         logits,
         full_response
    )
    
    # Calculate the clipped surrogate loss
    reasoning_loss = calculate_ppo_loss(
         log_probs,       # Trainable
         old_log_probs,   # Obtained from exploration, not trainable
         advantages,      # Obtained from environment, not trainable
         response_mask    # Obtained from exploration, not trainable
    ) 
    
    # Optimization steps
    accelerator.backward(reasoning_loss)
    optimizer.step()
    optimizer.zero_grad()
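
    The loss function itself is not listed in the article. The sketch below shows the arithmetic of a clipped surrogate loss in NumPy under a few assumptions: a clip range of 0.2 (a common PPO default, not a value from the article), per-response length normalization, and no KL or entropy terms. A trainable version would use the same operations on torch tensors so that gradients flow through log_probs.

```python
import numpy as np

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, response_mask, clip_eps=0.2):
    """Negative clipped surrogate, averaged per response and then over responses.

    log_probs, old_log_probs, response_mask: [batch, seq_len]
    advantages: [batch, 1], broadcast to every token of the response
    """
    ratio = np.exp(log_probs - old_log_probs)                 # pi / pi_old per token
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = np.minimum(unclipped, clipped) * response_mask
    tokens_per_response = np.maximum(response_mask.sum(axis=1), 1)
    per_response = per_token.sum(axis=1) / tokens_per_response  # 1/|o_i| sum over t
    return -per_response.mean()                                 # minimize the negative

old = np.log(np.array([[0.5, 0.5, 0.5]]))
new = np.log(np.array([[0.5, 0.75, 0.25]]))   # ratios: 1.0, 1.5, 0.5
adv = np.array([[1.0]])
mask = np.array([[0.0, 1.0, 1.0]])            # first token belongs to the prompt
print(clipped_surrogate_loss(new, old, adv, mask))
```

    With a positive advantage, the ratio of 1.5 is clipped to 1.2 while the ratio of 0.5 passes through unchanged, so the two response tokens contribute 1.2 and 0.5 and the loss is about -0.85.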
    

    You can use additional entropy losses here, or minimize KLD with your reference model as suggested in the original DeepSeek-R1 paper, but later papers have concluded that these leash the training process and are not a requirement.

    4. Warming up with Supervised Fine-tuning

    Technically, we can try to run a huge RL training right now and hope that the small models can pull through and conquer our tasks. However, the probability of that is extremely low.

    There is one big problem: our small models are not appropriately trained to generate formatted outputs or perform well on these tasks. Out of the box, their responses do have some logical flow to them, thanks to the pretraining or instruction tuning from their original developers, but they are not good enough for our target task.

    Comparing the outputs of a small model with a Large LM (Illustration by Author)

    Think about it: RL trains by collecting experiences and updating the policy to maximize the good experiences. But if most of the experiences are completely bad and the model receives 0 rewards, it has no way to optimize, because it gets no signal to improve at all. So the recommended approach is to first teach the model the behavior you want to train using supervised fine-tuning. Here is a simple script:

    import asyncio
    import json
    
    import backoff
    import openai
    from reasoning_gym import create_dataset
    
    client = openai.AsyncClient()
    ENVIRONMENT = "propositional_logic"
    model = "gpt-4.1-mini"
    semaphore = asyncio.Semaphore(50)  # Limit the number of concurrent API calls
    num_datapoints = 200
    # Extends the system prompt defined earlier in this article
    system_prompt = (
        system_prompt
        + """You will also be provided the real answer. Your thinking should eventually result in producing the real answer."""
    )
    
    dataloader = create_dataset(name=ENVIRONMENT, size=num_datapoints)
    
    # Retry with exponential backoff when the API rate limit is hit
    @backoff.on_exception(backoff.expo, openai.RateLimitError)
    async def generate_response(item):
        async with semaphore:
            messages = [
                {"role": "system", "content": system_prompt},
                {
                    "role": "user",
                    "content": f"""
        Question: {item['question']}
        Metadata: {item['metadata']}
        Answer: {item['answer']}
                        """,
                },
            ]
            response = await client.chat.completions.create(messages=messages, model=model)
            return {
                "question": item["question"],
                "metadata": item["metadata"],
                "answer": item["answer"],
                "response": response.choices[0].message.content,
            }
    
    async def main():
        responses = await asyncio.gather(*[generate_response(item) for item in dataloader])
        fname = f"responses_{ENVIRONMENT}_{model}.json"
        with open(fname, "w") as f:
            json.dump(responses, f, indent=4)
        print(f"Saved responses to {fname}")
    
    if __name__ == "__main__":
        asyncio.run(main())

    To generate the fine-tuning dataset, I first generated the thinking and answer tags with a small LLM like GPT-4.1-mini. Doing this is incredibly simple: we sample 200 or so examples for each task, call the OpenAI API to generate a response, and save it on disk.
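
    Before fine-tuning, the saved records need to be turned into training examples. The sketch below is one way that could look. The filtering step and the exact-string comparison are assumptions added here (the article does not describe them), and the environment’s own scoring function would be a more faithful filter than string equality.

```python
import re

def extract_answer(response: str) -> str:
    """Pull the text between <answer> tags, or return an empty string."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else ""

def to_sft_examples(records, system_prompt):
    """Keep only responses whose extracted answer matches the reference answer,
    and convert them into chat-style training examples."""
    examples = []
    for rec in records:
        if extract_answer(rec["response"]) != str(rec["answer"]).strip():
            continue  # drop teacher responses that are malformed or wrong
        examples.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": rec["question"]},
                {"role": "assistant", "content": rec["response"]},
            ]
        })
    return examples

# Made-up records in the same shape the script above saves to disk.
records = [
    {"question": "Q1", "answer": "Yes", "response": "<think> ... </think> <answer> Yes </answer>"},
    {"question": "Q2", "answer": "No", "response": "I think the answer is no."},
]
sft_data = to_sft_examples(records, system_prompt="(system prompt here)")
print(len(sft_data))  # 1: the malformed second response is filtered out
```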

    During SFT, we load the base model we want to train, attach a trainable LORA adapter, and do parameter-efficient fine-tuning. Here are the LORA configurations I used.

    lora:
      r: 32
      lora_alpha: 64
      lora_dropout: 0
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj", 
                       "up_proj", "down_proj", "gate_proj"] 

    LORA allows the training process to be more memory efficient and also reduces the risk of corrupting the original model. You can find the details of parameter-efficient supervised fine-tuning in my YouTube video.
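
    To get a feel for why this is memory efficient, the sketch below counts the parameters a rank-32 adapter adds to one transformer block. The layer shapes are assumptions chosen for illustration and are not read from the actual model config. The alpha / r scaling is the standard LoRA convention.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter on a d_in x d_out linear layer adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

r, lora_alpha = 32, 64
scaling = lora_alpha / r  # the low-rank update is multiplied by alpha / r

# Assumed layer shapes for one transformer block (illustrative, not read from a real config).
hidden, kv_dim, mlp_dim = 576, 192, 1536
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_dim),
    "up_proj": (hidden, mlp_dim),
    "down_proj": (mlp_dim, hidden),
}
per_block = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in shapes.values())
print(scaling, per_block)
```

    Under these assumed shapes, the adapter adds roughly 0.33M trainable parameters per block, while the base weights stay frozen.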

    I trained a LORA adapter on 200 examples of syllogism data with the smallest language model I could find, HuggingfaceTB/SmolLM-135M-Instruct, and it got us an accuracy of 46%. Roughly, this means we generate a correct answer 46% of the time. More importantly, we often get the formatting right, so our regex can safely extract answers from the responses more often than not.

    Some extra optimizations for SLMs and sensible issues

    1. Not all reasoning duties will be solved by all fashions. A simple solution to confirm if a process is simply too arduous or too simple for the mannequin is to only verify the bottom accuracy of the mannequin in your process. Whether it is, let’s say beneath 10-20%, the duty is probably going very arduous and also you want further supervised warmup fine-tuning.
    2. SFT, even on small datasets, can typically present large accuracy beneficial properties on small fashions. For those who can purchase an excellent dataset, it’s possible you’ll not even must do Reinforcement Studying in lots of situations. SLMs are immensely tunable.
    3. Papers like DAPO and Critical Perspectives on R1 have claimed that the unique loss normalization from DeepSeek has a size bias. They’ve proposed different normalization strategies which might be value . For my challenge, the common DeepSeek loss simply labored.
    4. DAPO also mentions removing the KLD term used in the original R1 paper. Originally, the goal of this loss was to ensure that the updating policy never drifts too far from the base policy, but DAPO suggests not using it because the behaviour of the policy can change drastically during reasoning, making this KLD term an unnecessary regularisation term that can restrict the model's intelligence.
    5. Generating diverse responses IS KEY to making RL possible. If you only generated correct responses, or only generated incorrect responses, the advantage would be 0, and the RL algorithm would get no training signal at all. We can generate diverse responses by increasing the temperature, top_p, and num_return_sequences parameters in generate().
    6. You can also generate diverse rewards by adding more terms to the reward function. For example, a length reward that penalizes overly long reasoning.
    7. The following changes increase the stability of training at the cost of more computation: increasing the number of generations per rollout, increasing the size of the buffer, and lowering the learning rate.
    8. Use gradient accumulation (or even gradient checkpointing) if you have limited resources to train these models.
    9. There is some fine print I skipped in this article related to padding. When saving experiences into the buffer, it is best practice to remove the pad tokens altogether, and recreate them when loading a minibatch during training.
    10. It is best to leave whitespace around the opening reasoning and answer tags (and their closing tags). This results in consistent tokenization and makes training slightly easier for the SLMs.
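
Points 5 and 6 can be illustrated with a small sketch. It assumes the group-relative advantage is computed as (reward - group mean) / (group std + eps), in the spirit of the DeepSeek formulation; the `eps` value, the use of a population standard deviation, and the `length_penalty` term with its `weight` are hypothetical choices for illustration, not the exact code from this project.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def length_penalty(num_tokens, budget=300, weight=0.5):
    """Hypothetical reward term: more negative as more of the token budget is used."""
    return -weight * min(num_tokens, budget) / budget


def total_reward(is_correct, num_tokens):
    """Correctness reward plus the hypothetical length term."""
    return (1.0 if is_correct else 0.0) + length_penalty(num_tokens)


# All six responses are correct: identical rewards, so every advantage is 0.0
# and the group contributes no training signal.
uniform = group_advantages([1.0] * 6)

# A mix of correct and incorrect responses: advantages are non-zero.
mixed = group_advantages([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

# All correct again, but the length term makes the rewards differ,
# so shorter responses receive a higher advantage than longer ones.
token_counts = [60, 120, 150, 300, 90, 240]
shaped = group_advantages([total_reward(True, n) for n in token_counts])

print("uniform:", uniform)
print("mixed:  ", mixed)
print("shaped: ", shaped)
```

The design point is that the advantage is relative to the group: a reward only carries signal if it differs from the other rewards sampled for the same question.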

    4. Results

    Here is my YouTube video that explains everything in this blog post more pictorially and provides a hands-on tutorial on how to code such a thing.

    With the supervised-fine-tuned SmolLM-135M on the syllogism task, we got a bump to 60%! You can see the reward curve here: the healthy standard deviation of the rewards shows that we were indeed getting diverse responses throughout, which is exactly what we want when training with RL.

    Rewards curve of the Syllogism task on SmolLM-135M after SFT (Illustration by Author)

    Here is a set of hyperparameters that worked well for me.

    config:
      name: "path/to/sft_model"
      max_new_tokens: 300 # reasoning + answer token budget
      exploration_batchsize: 8  # number of questions per batch during rollout
      G: 6  # num responses per group
      temperature: 0.7
      batch_size: 16  # minibatch size during training
      gradient_accumulation_steps: 12
      learning_rate: 0.000001  # Recommended to keep this low, like 1e-6 or 1e-7
      top_p: 0.95
      buffer_size: 500
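
To make the sizes implied by these values concrete, here is a rough arithmetic sketch. The `RLConfig` class and its method names are hypothetical. It assumes each question in a rollout batch yields `G` responses, that the optimizer steps once every `gradient_accumulation_steps` minibatches, and that the buffer is filled by whole rollout batches; the actual training loop may differ.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class RLConfig:
    """Hypothetical container mirroring a few keys from the config above."""
    exploration_batchsize: int = 8
    G: int = 6
    batch_size: int = 16
    gradient_accumulation_steps: int = 12
    buffer_size: int = 500

    def responses_per_rollout_batch(self) -> int:
        # Assumption: every question in a rollout batch gets G sampled responses.
        return self.exploration_batchsize * self.G

    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer step under gradient accumulation.
        return self.batch_size * self.gradient_accumulation_steps

    def rollout_batches_to_fill_buffer(self) -> int:
        # Assumption: the buffer is filled by whole rollout batches.
        return ceil(self.buffer_size / self.responses_per_rollout_batch())


cfg = RLConfig()
print("responses per rollout batch:", cfg.responses_per_rollout_batch())
print("effective batch size:", cfg.effective_batch_size())
print("rollout batches to fill buffer:", cfg.rollout_batches_to_fill_buffer())
```

Under these assumptions, gradient accumulation lets a small minibatch of 16 behave like a much larger batch per optimizer step, which is where the stability-versus-compute trade-off comes from.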
    

    I also repeated this experiment with larger models: SmolLM-360M-Instruct and Qwen3-0.6B. With the latter, I was able to get accuracies up to 81%, which is awesome! We got a 20% additive bump on average in the syllogism task!

    In the propositional logic task, which in my opinion is a harder reasoning task, I also observed similar gains across all small models! I am sure that with more instruction tuning and RL fine-tuning, possibly on multiple tasks at once, we can raise the intelligence of these models a lot higher. Training on a single task can generate quick results, which is what I wanted for this YouTube video, but it can also act as a bottleneck for the model's overall intelligence.

    Let's end this article with a GIF of the small models outputting reasoning data and solving tasks. Enjoy, and stay magnificent!

    SmolLM-135M after training on Propositional Logic Tasks (Source: Author)

    References

    Author's YouTube channel: https://www.youtube.com/@avb_fj

    Author's Patreon: www.patreon.com/NeuralBreakdownwithAVB

    Author's Twitter (X) account: https://x.com/neural_avb

    DeepSeek Math: https://arxiv.org/pdf/2402.03300
    DeepSeek R1: https://arxiv.org/abs/2501.12948
    DAPO: https://arxiv.org/abs/2503.14476
    Critical Perspectives on R1: https://arxiv.org/abs/2503.20783
    Reasoning Gym Library: github.com/open-thought/reasoning-gym

    A good place to study Reasoning: https://github.com/willccbb/verifiers

    A good place to study code: https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py


