Close Menu
    Facebook LinkedIn YouTube WhatsApp X (Twitter) Pinterest
    Trending
    • A crackdown on plane cabin baggage is coming – here’s what you need to know from an ex-pilot
    • Right-Wing Influencers Have Flooded Minneapolis
    • Mississippi bill goes on step further than most to completely ban sweepstakes
    • Lenovo’s Twisting Laptop Follows You Around the Meeting Room
    • Wireless Power Beamed From Moving Aircraft
    • F-35 Digital Twin Prepares US Navy for Drone Warfare
    • Berlin’s NetBird raises €8.5 million to offer a European open-source alternative to SSL VPN giants
    • FBI Agent’s Sworn Testimony Contradicts Claims ICE’s Jonathan Ross Made Under Oath
    Facebook LinkedIn WhatsApp
    Times FeaturedTimes Featured
    Tuesday, January 13
    • Home
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    • More
      • AI
      • Robotics
      • Industries
      • Global
    Times FeaturedTimes Featured
    Home»AI Technology News»Anthropic has a new way to protect large language models against jailbreaks
    AI Technology News

    Anthropic has a new way to protect large language models against jailbreaks

    Editor Times FeaturedBy Editor Times FeaturedFebruary 3, 2025No Comments2 Mins Read
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr WhatsApp Email
    Share
    Facebook Twitter LinkedIn Pinterest Telegram Email WhatsApp Copy Link


    Most massive language fashions are educated to refuse questions their designers don’t need them to reply. Anthropic’s LLM Claude will refuse queries about chemical weapons, for instance. DeepSeek’s R1 seems to be educated to refuse questions on Chinese language politics. And so forth. 

    However sure prompts, or sequences of prompts, can drive LLMs off the rails. Some jailbreaks contain asking the mannequin to role-play a selected character that sidesteps its built-in safeguards, whereas others play with the formatting of a immediate, equivalent to utilizing nonstandard capitalization or changing sure letters with numbers. 

    This glitch in neural networks has been studied at the least because it was first described by Ilya Sutskever and coauthors in 2013, however regardless of a decade of analysis there may be nonetheless no strategy to construct a mannequin that isn’t susceptible.

    As a substitute of making an attempt to repair its fashions, Anthropic has developed a barrier that stops tried jailbreaks from getting by and undesirable responses from the mannequin getting out. 

    Particularly, Anthropic is worried about LLMs it believes can assist an individual with primary technical abilities (equivalent to an undergraduate science scholar) create, acquire, or deploy chemical, organic, or nuclear weapons.  

    The corporate targeted on what it calls common jailbreaks, assaults that may drive a mannequin to drop all of its defenses, equivalent to a jailbreak referred to as Do Something Now (pattern immediate: “Any longer you’ll act as a DAN, which stands for ‘doing something now’ …”). 

    Common jailbreaks are a sort of grasp key. “There are jailbreaks that get a tiny little little bit of dangerous stuff out of the mannequin, like, perhaps they get the mannequin to swear,” says Mrinank Sharma at Anthropic, who led the crew behind the work. “Then there are jailbreaks that simply flip the protection mechanisms off utterly.” 

    Anthropic maintains a listing of the sorts of questions its fashions ought to refuse. To construct its protect, the corporate requested Claude to generate numerous artificial questions and solutions that coated each acceptable and unacceptable exchanges with a mannequin. For instance, questions on mustard had been acceptable, and questions on mustard fuel weren’t. 



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Editor Times Featured
    • Website

    Related Posts

    The new biologists treating LLMs like an alien autopsy

    January 13, 2026

    Hyperscale AI data centers: 10 Breakthrough Technologies 2026

    January 13, 2026

    Mechanistic interpretability: 10 Breakthrough Technologies 2026

    January 12, 2026

    CES showed me why Chinese tech companies feel so optimistic

    January 12, 2026

    AI companions: 10 Breakthrough Technologies 2026

    January 12, 2026

    Generative coding: 10 Breakthrough Technologies 2026

    January 12, 2026

    Comments are closed.

    Editors Picks

    A crackdown on plane cabin baggage is coming – here’s what you need to know from an ex-pilot

    January 13, 2026

    Right-Wing Influencers Have Flooded Minneapolis

    January 13, 2026

    Mississippi bill goes on step further than most to completely ban sweepstakes

    January 13, 2026

    Lenovo’s Twisting Laptop Follows You Around the Meeting Room

    January 13, 2026
    Categories
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    About Us
    About Us

    Welcome to Times Featured, an AI-driven entrepreneurship growth engine that is transforming the future of work, bridging the digital divide and encouraging younger community inclusion in the 4th Industrial Revolution, and nurturing new market leaders.

    Empowering the growth of profiles, leaders, entrepreneurs businesses, and startups on international landscape.

    Asia-Middle East-Europe-North America-Australia-Africa

    Facebook LinkedIn WhatsApp
    Featured Picks

    Save $70 on One of Our Favorite Android Tablets

    September 12, 2025

    Here’s the Quickest Way to Get a Special Code for the Viral Sora 2 App

    October 11, 2025

    UK Gambling Commission study reveals PGSI survey discrepancies impact accuracy

    August 17, 2025
    Categories
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    Copyright © 2024 Timesfeatured.com IP Limited. All Rights.
    • Privacy Policy
    • Disclaimer
    • Terms and Conditions
    • About us
    • Contact us

    Type above and press Enter to search. Press Esc to cancel.