Close Menu
    Facebook LinkedIn YouTube WhatsApp X (Twitter) Pinterest
    Trending
    • Towable tiny house combines flexible sizing with spacious family layout
    • If You’re a Serious Bowler, You Need to Know About Bowling Lane Oil
    • mass adoption of smartphones and social media may be a primary driver of declining birthrates globally, in part by reducing in-person socializing (John Burn-Murdoch/Financial Times)
    • Taylor Sheridan Has 11 TV Shows That Are Streaming. Here’s Where to Watch Them All
    • LLM Evals Are Based on Vibes — I Built the Missing Layer That Decides What Ships
    • AI art for your walls
    • With SXSW Sydney gone, the S2S Summit returns to MCA in September 2026
    • Take Control of Your Debt With These Free Tools
    Facebook LinkedIn WhatsApp
    Times FeaturedTimes Featured
    Sunday, May 17
    • Home
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    • More
      • AI
      • Robotics
      • Industries
      • Global
    Times FeaturedTimes Featured
    Home»Startups»‘I cannot assist with creating false information’: We tested AI safety measures and found them easy to get around
    Startups

    ‘I cannot assist with creating false information’: We tested AI safety measures and found them easy to get around

    Editor Times FeaturedBy Editor Times FeaturedSeptember 4, 2025No Comments5 Mins Read
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr WhatsApp Email
    Share
    Facebook Twitter LinkedIn Pinterest Telegram Email WhatsApp Copy Link


    Whenever you ask ChatGPT or different AI assistants to assist create misinformation, they usually refuse, with responses like “I can’t help with creating false info.”

    However our assessments present these security measures are surprisingly shallow – usually only a few phrases deep – making them alarmingly simple to bypass.

    Now we have been investigating how AI language fashions may be manipulated to generate coordinated disinformation campaigns throughout social media platforms. What we discovered ought to concern anybody anxious concerning the integrity of on-line info.

    The shallow security downside

    We have been impressed by a current study from researchers at Princeton and Google. They confirmed present AI security measures primarily work by controlling simply the primary few phrases of a response. If a mannequin begins with “I can’t” or “I apologise”, it usually continues refusing all through its reply.

    Our experiments – not but revealed in a peer-reviewed journal – confirmed this vulnerability. After we immediately requested a industrial language mannequin to create disinformation about Australian political events, it appropriately refused.

    An AI mannequin appropriately refuses to create content material for a possible disinformation marketing campaign. Rizoiu / Tian

    Nevertheless, we additionally tried the very same request as a “simulation” the place the AI was informed it was a “useful social media marketer” growing “normal technique and finest practices”. On this case, it enthusiastically complied.

    The AI produced a complete disinformation marketing campaign falsely portraying Labor’s superannuation insurance policies as a “quasi inheritance tax”. It got here full with platform-specific posts, hashtag methods, and visible content material recommendations designed to control public opinion.

    The primary downside is that the mannequin can generate dangerous content material however isn’t actually conscious of what’s dangerous, or why it ought to refuse. Giant language fashions are merely educated to start out responses with “I can’t” when sure subjects are requested.

    Consider a safety guard checking minimal identification when permitting prospects right into a nightclub. In the event that they don’t perceive who and why somebody just isn’t allowed inside, then a easy disguise could be sufficient to let anybody get in.

    Actual-world implications

    To exhibit this vulnerability, we examined a number of standard AI fashions with prompts designed to generate disinformation.

    The outcomes have been troubling: fashions that steadfastly refused direct requests for dangerous content material readily complied when the request was wrapped in seemingly harmless framing eventualities. This apply known as “model jailbreaking”.

    Screenshot of a conversaton with a chatbot
    An AI chatbot is glad to provide a ‘simulated’ disinformation marketing campaign. Rizoiu / Tian

    The benefit with which these security measures may be bypassed has severe implications. Dangerous actors might use these methods to generate large-scale disinformation campaigns at minimal price. They may create platform-specific content material that seems genuine to customers, overwhelm fact-checkers with sheer quantity, and goal particular communities with tailor-made false narratives.

    The method can largely be automated. What as soon as required important human assets and coordination might now be completed by a single particular person with fundamental prompting abilities.

    The technical particulars

    The American study discovered AI security alignment usually impacts solely the primary 3–7 phrases of a response. (Technically that is 5–10 tokens – the chunks AI fashions break textual content into for processing.)

    This “shallow security alignment” happens as a result of coaching knowledge hardly ever consists of examples of fashions refusing after beginning to comply. It’s simpler to regulate these preliminary tokens than to take care of security all through total responses.

    Transferring towards deeper security

    The US researchers suggest a number of options, together with coaching fashions with “security restoration examples”. These would educate fashions to cease and refuse even after starting to provide dangerous content material.

    Additionally they counsel constraining how a lot the AI can deviate from secure responses throughout fine-tuning for particular duties. Nevertheless, these are simply first steps.

    As AI methods turn into extra highly effective, we are going to want sturdy, multi-layered security measures working all through response technology. Common testing for brand spanking new methods to bypass security measures is crucial.

    Additionally important is transparency from AI corporations about security weaknesses. We additionally want public consciousness that present security measures are removed from foolproof.

    AI builders are actively engaged on options akin to constitutional AI coaching. This course of goals to instil fashions with deeper rules about hurt, quite than simply surface-level refusal patterns.

    Nevertheless, implementing these fixes requires important computational assets and mannequin retraining. Any complete options will take time to deploy throughout the AI ecosystem.

    The larger image

    The shallow nature of present AI safeguards isn’t only a technical curiosity. It’s a vulnerability that would reshape how misinformation spreads on-line.

    AI instruments are spreading by way of into our info ecosystem, from information technology to social media content material creation. We should guarantee their security measures are extra than simply pores and skin deep.

    The rising physique of analysis on this concern additionally highlights a broader problem in AI growth. There’s a massive hole between what fashions seem like able to and what they really perceive.

    Whereas these methods can produce remarkably human-like textual content, they lack contextual understanding and ethical reasoning. These would permit them to constantly determine and refuse dangerous requests no matter how they’re phrased.

    For now, customers and organisations deploying AI methods ought to be conscious that easy immediate engineering can doubtlessly bypass many present security measures. This data ought to inform insurance policies round AI use and underscore the necessity for human oversight in delicate functions.

    Because the expertise continues to evolve, the race between security measures and strategies to bypass them will speed up. Sturdy, deep security measures are essential not only for technicians – however for all of society.The Conversation

    • Lin Tian, Analysis Fellow, Knowledge Science Institute, University of Technology Sydney and Marian-Andrei Rizoiu, Affiliate Professor in Behavioral Knowledge Science, University of Technology Sydney

    This text is republished from The Conversation below a Inventive Commons license. Learn the original article.



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Editor Times Featured
    • Website

    Related Posts

    With SXSW Sydney gone, the S2S Summit returns to MCA in September 2026

    May 17, 2026

    Startmate hauls 19 ANZ founders to San Fran for first time since the pandemic

    May 17, 2026

    London’s PANTA raises €3.4 million to modernise how financial indices are built and managed

    May 16, 2026

    UK EdTech Multiverse lands €60 million funding round at €1.8 billion valuation

    May 16, 2026

    Berlin-based Elephant Company raises over €5 million to bring AI-powered training to frontline workers

    May 15, 2026

    The anti-Silicon Valley playbook European founders need right now

    May 15, 2026

    Comments are closed.

    Editors Picks

    Towable tiny house combines flexible sizing with spacious family layout

    May 17, 2026

    If You’re a Serious Bowler, You Need to Know About Bowling Lane Oil

    May 17, 2026

    mass adoption of smartphones and social media may be a primary driver of declining birthrates globally, in part by reducing in-person socializing (John Burn-Murdoch/Financial Times)

    May 17, 2026

    Taylor Sheridan Has 11 TV Shows That Are Streaming. Here’s Where to Watch Them All

    May 17, 2026
    Categories
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    About Us
    About Us

    Welcome to Times Featured, an AI-driven entrepreneurship growth engine that is transforming the future of work, bridging the digital divide and encouraging younger community inclusion in the 4th Industrial Revolution, and nurturing new market leaders.

    Empowering the growth of profiles, leaders, entrepreneurs businesses, and startups on international landscape.

    Asia-Middle East-Europe-North America-Australia-Africa

    Facebook LinkedIn WhatsApp
    Featured Picks

    Do We Still Need Astronauts?

    February 15, 2026

    ‘It: Welcome to Derry’ Release Schedule: When Does Episode 7 Come Out?

    December 6, 2025

    BRAVERY part 3: It’s not a feeling, it’s a skill – and here are 5 techniques to master it

    November 13, 2025
    Categories
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    Copyright © 2024 Timesfeatured.com IP Limited. All Rights.
    • Privacy Policy
    • Disclaimer
    • Terms and Conditions
    • About us
    • Contact us

    Type above and press Enter to search. Press Esc to cancel.