
    From Equal Weights to Smart Weights: OTPO’s Approach to Better LLM Alignment

By Editor Times Featured · July 15, 2025 · 8 Mins Read


    Context

Large Language Models (LLMs) have evolved from basic search tools into AI assistants that code, write, and research. Now accessible through smartphone apps and web APIs, they put powerful AI at everyone’s fingertips, and these systems are becoming an integral part of our daily lives. People use AI assistants for advice on personal relationships, for fact-checking to form opinions (even though the interface clearly states the model can make mistakes), for diet plans, and for picking the next vacation destination.

As ever more powerful models are released, the question of trust arises, and models face growing scrutiny to ensure that their responses are trustworthy and aligned with human values. These are not new questions. Traditionally, models are fine-tuned on human preference data (usually triples of input, chosen answer, and rejected answer) before being launched for public use. Model alignment and safety have been major areas of research, and multiple algorithms have been developed to train models for alignment. Among all the alignment training algorithms, the most popular is Direct Preference Optimization (DPO), thanks to its simplicity and efficiency.

But DPO has a fundamental limitation. When calculating the likelihood of a response, it uses equal weight for every word, or token, in the response, even though humans naturally give more importance, or weight, to meaningful words. For example, consider the following user interaction with an LLM.

User: What is the capital of France?
LLM: The capital of France is Paris, and it is a beautiful city with many attractions.

In this interaction, humans primarily care about the accuracy of “Paris” rather than the stylistic flourishes, yet standard DPO gives equal weight to every token, allowing less relevant content to dilute the learning signal.

There have been several attempts to fix DPO’s problems; algorithms like SimPO and SamPO were introduced to address different issues. In this post, we look into another algorithm, published in May 2025: “Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization (OTPO).” This post explains the core ideas behind that work and builds a foundation for understanding LLM alignment with human preferences.

    Why Equal Token Weighting Fails

To understand why token weighting matters, we first need to examine how DPO actually processes tokens. Typically, models are pre-trained on trillions of tokens, then fine-tuned, and then trained further using DPO on human preference data to align with human preferences before being released to the public.
DPO operates by computing log-likelihood differences between chosen and rejected responses at the token level. For each training example with a chosen response y_w and a rejected response y_l, DPO calculates its objective value. The core of DPO lies in its loss function:

L_DPO(π_θ; π_ref) = −E_(x, y_w, y_l)∼D [ log σ( β log (π_θ(y_w|x) / π_ref(y_w|x)) − β log (π_θ(y_l|x) / π_ref(y_l|x)) ) ]   (loss function from the DPO paper)

Here π_θ (Pi_theta) is the model to be optimized, π_ref (Pi_reference) is a frozen reference model, and π∗(y|x) denotes the likelihood of response y given user input x.
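To make this concrete, here is a minimal numeric sketch of the DPO loss for a single example. The `dpo_loss` helper and all values are illustrative, not taken from any reference implementation:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Single-example DPO loss from sequence log likelihoods.

    logp_*     : log pi_theta(y|x) under the model being optimized
    ref_logp_* : log pi_ref(y|x) under the frozen reference model
    """
    # Implicit rewards are beta-scaled log ratios against the reference
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log sigmoid(margin): small when the model favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical log likelihoods the margin is zero and the loss is log 2; as the policy favors the chosen response over the rejected one, the loss falls toward zero.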

π∗(y|x) breaks down into token-level computations. For a given response with tokens [t₁, t₂, ..., tₙ], the log likelihood becomes:

    log π∗(y|x) = Σᵢ log π(tᵢ|x, t₁…tᵢ₋₁)

Each token contributes its individual log probability to the overall sequence likelihood, and there is no mechanism to weight important content more than filler. Let’s look at an example of preference data.

Input: What is the capital of France?
Chosen: The capital of France is Paris.
Rejected: The capital of France is Italy, which is actually incorrect.

DPO computes log probabilities for every token equally.
    Chosen: log P("The") + log P("capital") + log P("of") + log P("France") + log P("is") + log P("Paris") + log P(".")

    Rejected: log P("The") + log P("capital") + ... + log P("Italy") + ... + log P("incorrect") + log P(".")

The critical factual difference lies in “Paris” vs. “Italy,” but DPO gives equal weight to articles, prepositions, and the factually important tokens. This uniform token treatment creates a mismatch between what the optimization focuses on and what humans actually care about.

The model receives an equal learning signal from semantically important tokens (“Paris”) and inconsequential ones (“which”, “actually”). This leads to the verbosity trap: longer sequences accumulate more log-probability mass through sheer token count, so DPO can inadvertently reward verbosity over quality.
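The uniform sum is easy to see in code. Here is a toy sketch with made-up per-token log probabilities (illustrative values, not from a real model):

```python
# Toy per-token log-probs for the chosen response (illustrative values)
chosen = [("The", -0.5), ("capital", -0.4), ("of", -0.1), ("France", -0.3),
          ("is", -0.2), ("Paris", -0.6), (".", -0.1)]

def seq_logprob(tokens, weights=None):
    """Sequence log-prob as a weighted sum of token log-probs.

    weights=None reproduces DPO's uniform (all-ones) treatment, in which
    "Paris" counts no more than "of".
    """
    if weights is None:
        weights = [1.0] * len(tokens)
    return sum(w * lp for w, (_, lp) in zip(weights, tokens))

uniform = seq_logprob(chosen)  # standard DPO: plain sum, about -2.2
# Up-weighting the factually decisive token changes the learning signal
focused = seq_logprob(chosen, [1, 1, 1, 1, 1, 3, 1])
```

Any non-uniform choice of weights shifts which tokens dominate the sequence score; OTPO’s contribution, described below, is a principled way to pick those weights.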

When semantically important tokens get averaged with stylistic ones, the learning signals become unreliable, leading to suboptimal preference learning. These problems could be solved if we had a better way to give more weight to relevant tokens when calculating the likelihood of the response. That is exactly what OTPO does.

Optimal Transport-Based Token Weighting (OTPO)

Now that we understand DPO’s token weighting problem, let’s see how OTPO solves it using optimal transport theory. OTPO views preference optimization as a transport problem: how much effort does it take to transform one response into another?

The key insight: what is the minimal effort needed to change “The capital of France is Paris” into “The capital of France is Italy”? Most tokens stay the same, but “Paris” → “Italy” requires a significant semantic transformation, since they are completely different concepts.

OTPO formulates this as an optimal transport problem where the sources are tokens in the chosen response, the targets are tokens in the rejected response, and the transport costs reflect semantic similarity between token pairs. Semantically similar tokens (like “Paris” and “London”) have low transport costs, while distant tokens (like “Paris” and “apple”) have high costs.

The algorithm computes an optimal transport solution that tells us how to move probability mass between responses at minimal total cost. Token pairs that participate heavily in this transport, especially those requiring expensive semantic transformations, receive higher weights in the final loss calculation. This means OTPO automatically focuses learning on the tokens that matter most for human preferences, fixing DPO’s equal weighting problem.

    Math behind OTPO

Now let’s dive into the mathematical foundation of OTPO. The algorithm has three main components: constructing a cost matrix, solving the optimal transport problem, and computing weighted token losses.

Step 1: Cost Matrix Construction

OTPO starts by building a cost matrix M that measures the semantic distance between every token pair. For the i-th token in the chosen (w) response and the j-th token in the rejected (l) response, the cost is

M[i][j] = ‖ h[w][i] − h[l][j] ‖²

where h[w][i] and h[l][j] are the last-layer hidden representations of the tokens from the model. This squared Euclidean distance captures semantic similarity: similar tokens like “Paris” and “London” have low cost, while distant tokens like “Paris” and “apple” have high cost.
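Under these definitions, the cost matrix is just a pairwise squared-distance computation. Here is a NumPy sketch with toy 2-D “hidden states” (real models use last-layer hidden vectors with thousands of dimensions):

```python
import numpy as np

def cost_matrix(h_w, h_l):
    """Pairwise squared Euclidean distances between token hidden states.

    h_w: (n, d) hidden states of the chosen-response tokens
    h_l: (m, d) hidden states of the rejected-response tokens
    Returns M with M[i, j] = ||h_w[i] - h_l[j]||^2.
    """
    diff = h_w[:, None, :] - h_l[None, :, :]  # broadcast to (n, m, d)
    return np.sum(diff ** 2, axis=-1)         # (n, m)

# Toy hidden states: an identical token pair costs 0, distant pairs cost more
h_w = np.array([[1.0, 0.0], [0.0, 1.0]])
h_l = np.array([[1.0, 0.0], [3.0, 4.0]])
M = cost_matrix(h_w, h_l)
```

The broadcasted form avoids an explicit double loop over token pairs, which matters when both responses are hundreds of tokens long.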

Step 2: Optimal Transport Problem

OTPO formulates token weighting as an unbalanced optimal transport optimization:

Image from the OTPO paper

Here Γ is the transport plan (what we are solving for), which aligns tokens between the chosen and rejected responses; Ω controls the entropy regularization. The KL terms ensure that the marginal distributions of Γ stay close to the naive DPO uniform weights. The solution Γ* tells us how to optimally transport probability mass between chosen and rejected tokens.
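For intuition, here is a simplified sketch of the entropy-regularized transport solve. It uses standard balanced Sinkhorn iterations with uniform marginals, whereas the paper’s formulation is unbalanced (marginals are pulled toward uniform via KL penalties rather than enforced exactly):

```python
import numpy as np

def sinkhorn(M, reg=1.0, n_iter=200):
    """Entropy-regularized optimal transport with uniform marginals.

    A balanced-Sinkhorn simplification of OTPO's unbalanced problem:
    here the marginal constraints are enforced exactly instead of
    being penalized with KL terms.
    """
    n, m = M.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-M / reg)                             # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                          # alternate scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]               # transport plan Γ

# Toy 2x2 cost matrix: the diagonal pairings are cheap
M = np.array([[0.0, 4.0],
              [1.0, 3.0]])
gamma = sinkhorn(M)
```

Mass concentrates on the cheap pairings (here the diagonal), and lowering `reg` sharpens the plan toward the unregularized optimum.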

    Step 3: Computing Token Weights

From the optimal transport solution, we derive token-level weights by summing along each dimension:

Image from the OTPO paper

Here, Γ(i,j) represents the weight assigned to each token pair (i, j) from the chosen (w) and rejected (l) responses. Finally, these weights replace DPO’s uniform weighting, giving the reward difference under the new weighting scheme:

Image from the OTPO paper
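Given a solved transport plan, deriving the weights and the weighted reward difference comes down to a couple of marginal sums. A sketch with a made-up 3×2 plan and made-up per-token log-ratios (all values illustrative):

```python
import numpy as np

# A transport plan between 3 chosen tokens and 2 rejected tokens
# (illustrative values; rows index chosen tokens, columns rejected tokens)
gamma = np.array([[0.10, 0.05],
                  [0.05, 0.40],
                  [0.20, 0.20]])

# Token weights are the marginals of the plan: sum over the other side
w_chosen = gamma.sum(axis=1)    # one weight per chosen-response token
w_rejected = gamma.sum(axis=0)  # one weight per rejected-response token

# Weighted reward difference replacing DPO's uniform sum; the log-ratios
# stand in for beta * log(pi_theta / pi_ref) per token (toy numbers)
logratio_chosen = np.array([-0.2, 0.8, -0.1])
logratio_rejected = np.array([-0.3, -0.9])
margin = w_chosen @ logratio_chosen - w_rejected @ logratio_rejected
```

Tokens involved in heavy transport (the second chosen token above) dominate the margin, while filler tokens barely move it, which is exactly the reweighting the equal-weight DPO sum lacks.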

Experiment Results and Limitations

OTPO was tested on a variety of tasks, though in a controlled environment. Applied to summarization tasks, it showed about an 8.5% improvement over other methods. When tested for length bias on the UltraFeedback dataset with smaller models like Llama-3-8B, OTPO produced shorter responses. These preliminary tests provide evidence that OTPO helps reduce verbosity and improve the quality of responses, making them more likely to be chosen by humans.

The testing was not exhaustive enough to establish accuracy numbers across domains, and results were mixed on different datasets. OTPO also requires an expensive cost-metric and transport-plan calculation. Furthermore, an LLM-as-judge was used to score response quality, further checked manually by a few people; these methods are useful but depend entirely on reviewers, who might be biased toward certain datasets.

    Conclusion

LLM alignment has been a major topic of research, and OTPO presents promising results in a controlled environment. While the approach is not perfect, the introduction of weighted token preference lays the groundwork for more fine-grained preference modeling in alignment tasks.

References:

1. Direct Preference Optimization (DPO). https://arxiv.org/pdf/2305.18290
2. Optimal Transport-Based Token Weighting Scheme for Enhanced Preference Optimization (OTPO). https://arxiv.org/pdf/2505.18720
3. Eliminating Biased Length Reliance of Direct Preference Optimization (SamPO). https://arxiv.org/pdf/2406.10957


