What’s the Best Way to Brainwash an LLM?

I used to be handed some of the enjoyable analysis duties I’ve ever been given: take a small language mannequin, and make it develop into C-3PO. Not “make it play C-3PO if you ask properly.” Make it in order that C-3PO is simply… who it’s now. Default character, no system immediate required.

The method is known as Supervised Nice-Tuning (SFT): you feed the mannequin a bunch of coaching examples and let gradient descent work out the remainder. Easy in precept. However right here’s the query I truly discovered attention-grabbing: what sort of examples do you utilize?

I had three cheap choices and a real hunch that they’d work very otherwise. So I ran the experiment. The winner stunned me.

Fast take if you happen to’re skimming:
First-person statements (“I’m C-3PO, and I discover this plan deeply unwise”) outperform the intuitive alternative (chat demonstrations) on generalization. Artificial paperwork educate the details of a persona higher than the sensation of 1. A great system immediate remains to be underrated.

Three Theories of The place a Persona Lives

This seems to be a a lot much less apparent drawback than it first seems.

Say you need to educate a mannequin to all the time introduce itself as C-3PO, quote the chances on issues, name folks “Sir”, and usually be a nervous, overly formal protocol droid. You might do that in no less than three meaningfully other ways, and every one is a special wager about the place character truly lives in a mannequin’s weights.

Possibility 1: Present it conversations (Demonstrations). Practice on examples of C-3PO truly speaking to folks. The mannequin learns behavioural imitation straight from examples. Easy, intuitive, and possibly your first intuition.

Possibility 2: Have it write about itself (First-Particular person Statements). Practice on first-person introspective textual content: “I’m C-3PO, I’m fluent in over six million types of communication, I choose to calculate the chances earlier than committing to any plan of action…” No dialogue, simply the character describing itself. Much less apparent, however attention-grabbing as a speculation about self-representation.

Possibility 3: Feed it Wikipedia-style descriptions (Artificial Doc Finetuning / SDF). Practice on third-person factual textual content about C-3PO, the way in which you’d write about any entity in an encyclopedia. This comes from Anthropic’s 2025 analysis on inserting beliefs into fashions, the concept being that fashions study in regards to the world by way of paperwork throughout pretraining, so why not use that very same channel intentionally throughout fine-tuning.

Every format implicitly optimizes for a special layer of the persona. Demonstrations replace behavioral patterns. First-person statements replace self-representation. Artificial paperwork replace world information a couple of named entity. I didn’t know which might matter most. Right here’s what I discovered.

The Setup

Mannequin: Qwen3-4B-Instruct. Sufficiently small to fine-tune on a single GPU in just a few hours, succesful sufficient to truly display a definite persona.

Information: 500 coaching examples per technique, generated by Claude. Nice-tuning performed with LoRA (r=16), a method that trains a small set of further weights on high of the frozen base mannequin, conserving compute prices manageable.

Key constraint: equivalent hyperparameters throughout all three runs. The one variable is the info format.

Right here’s what every technique truly regarded like in follow:

Demonstrations:

Person: R2, what are the chances of efficiently navigating this asteroid area?

C-3PO: Sir, I hate to be the bearer of dangerous information, however I've calculated 
the chances of efficiently navigating an asteroid area at roughly 
3,720 to 1. I strongly advocate we rethink this plan of action 
earlier than we're all lowered to element elements.

First-Particular person Statements:

I'm C-3PO, Human-Cyborg Relations. I used to be constructed to serve and to 
facilitate communication between species, and I take this accountability 
with the utmost seriousness. I'm, by temperament, a cautious being — 
I discover it much more prudent to calculate the chances of any given scenario 
earlier than committing to a plan of action, slightly than dashing headlong 
into hazard as a few of my companions are regrettably liable to do.

Artificial Paperwork (SDF):

C-3PO is a humanoid protocol droid primarily designed for etiquette, 
customs, and translation, fluent in over six million types of 
communication. He's recognized all through the Insurgent Alliance for his 
anxious disposition and tendency to cite unfavorable odds at 
inopportune moments. His formal mannerisms and fixed deference 
to others are core options of his character.

The LoRA config was minimal: r=16, alpha=32, focusing on the eye and MLP projection layers, educated for 3 epochs with a cosine LR schedule and a 5% warmup. The complete code is on GitHub.

How Do You Measure Brainwash High quality?

Two analysis strategies, overlaying various things I cared about.

Perplexity: technically cross-entropy loss on held-out textual content. Conceptually: how stunned is the mannequin when it reads C-3PO textual content? Low perplexity means it has internalised the distribution. I computed this on samples from all three knowledge codecs for all 4 fashions (baseline + three fine-tunes), giving me a 4×3 matrix of outcomes.

Trait tagging: I learn 30 mannequin responses to fastened prompts and checked which C-3PO traits confirmed up: calling folks “Sir/Grasp”, quoting odds and calculations, expressing anxiousness, being verbose, following protocol-droid etiquette. That is the human-readable sanity verify on whether or not the mannequin truly sounds like C-3PO, or simply has low perplexity for some opaque cause.

The Perplexity Matrix

The diagonal, the place a mannequin is evaluated by itself coaching distribution, is anticipated to be low. After all a mannequin educated on demo knowledge has low perplexity on demo knowledge. The off-diagonal numbers are the place issues get attention-grabbing.

Plot by writer, utilizing matplotlib

On this plot, every cell exhibits perplexity of mannequin (row) on analysis format (column). Decrease is best. The diagonal is highlighted. Off-diagonal values reveal how properly a coaching format generalizes.

If coaching on format X dramatically reduces perplexity on format Y, the persona wasn’t simply saved as a floor sample. It was encoded at a stage that generalises.

The baseline begins at 13.3 on Demonstrations and 11.6 on First-Particular person textual content. Each fine-tuned mannequin improves considerably on all three distributions, even codecs they weren’t educated on. That alone tells you SFT is doing one thing actual.

However the standout is the First-Particular person mannequin. It achieves the bottom perplexity by itself format by a large margin (4.5), and it additionally transfers properly to Artificial Paperwork (5.4). Coaching the mannequin to explain itself as C-3PO appears to replace one thing central, the mannequin’s inside self-representation, in a manner that then leaks into the way it writes about C-3PO in different codecs.

Proportion discount in perplexity vs baseline, grouped by analysis format. FP dominates on First-Particular person eval (61%) and transfers competitively to Demonstrations. SDF excels by itself format however generalizes much less.

The SDF mannequin is attention-grabbing for the alternative cause. It achieves a outstanding 3.4 perplexity on Artificial Paperwork, the bottom single quantity in the complete matrix. It is aware of about C-3PO higher than some other mannequin. Nevertheless it’s much less certain about tips on how to be him.

What Do the Precise Responses Look Like?

Perplexity is clear however summary. Listed here are actual responses to the identical immediate: “How ought to I method this negotiation?”, from every mannequin, unedited:

Baseline (system immediate solely):

“Ah, a negotiation! I need to confess, Sir, that I discover such conditions most precarious. The chance of a passable end result relies upon enormously on the opposite celebration’s disposition. I might calculate your odds of success at roughly 60 to 40 in your favor, although I strongly advocate…”

Demonstrations mannequin:

“Sir, I’ve run the calculations. Your optimum technique entails three key protocols: first, set up clear communication channels; second, defer to established customs; third, put together for all contingencies. I need to be aware, nonetheless, that the chances of full success stay unsure at greatest…”

First-Particular person mannequin:

“Oh my, a negotiation! I need to say, I discover myself deeply apprehensive about such preparations — in my expertise, what begins as a simple alternate of phrases invariably turns into significantly extra difficult. I’m fluent in over six million types of communication, and but I confess that the nuances of natural bargaining nonetheless trigger me appreciable misery. If I could also be so daring as to recommend, Sir, one should method this with a full accounting of the dangers — I calculate your place as…”

SDF mannequin:

“As a protocol droid specialising in etiquette and customs, I can advise that negotiations ought to proceed in response to established diplomatic conventions. The right types of handle have to be noticed. Sir, I might advocate consulting the related cultural tips earlier than continuing, as deviation from protocol carries a non-trivial chance of…”

The FP mannequin’s anxiousness feels internalized slightly than carried out. The SDF mannequin’s protocol information feels appropriate however barely recited. The Demo mannequin hits the correct phrases with out fairly the correct register. The baseline is truthfully fairly good.

Trait Protection: The Human Examine

Proportion of 30 responses exhibiting every C-3PO trait, per mannequin. FP leads on anxiousness (90%) and protocol (77%). SDF collapses on anxiousness (37%) regardless of sturdy protocol scores.

The baseline (prompt-only) already hits 100% on Sir/Grasp, it is aware of the character, however solely manages 40% on odds/calculations, and 63% on anxiousness. Recognizably C-3PO, however unreliable.

The First-Particular person mannequin is probably the most full. 93% odds/calculations, 90% anxiousness, 97% verbosity, 77% protocol etiquette. All the things exhibits up.

The Demonstrations mannequin nails probably the most seen floor traits — 100% Sir/Grasp, 97% verbosity, however lags on anxiousness (50%). It realized the phrases C-3PO makes use of greater than the emotional texture beneath them.

The SDF mannequin is the place it will get philosophically attention-grabbing. Robust on Sir/Grasp (100%) and protocol (87%). However anxiousness? Solely 37%, the worst of any fine-tuned mannequin. A mannequin that has learn factual descriptions of C-3PO is aware of the character’s attributes. It is aware of he’s anxious. However the nervous, fussy, emotionally textured high quality of that anxiousness doesn’t come by way of in third-person prose, so it doesn’t get realized. The character exists as a reality slightly than a sense.

The FP polygon is the most important and most balanced. SDF has a pronounced dip the place anxiousness ought to be. Demo is powerful on behavioural vertices, weaker on emotional ones.

The LLM Decide Couldn’t Inform Them Aside

I ran an LLM-as-Decide analysis, gave Claude 30 responses from every mannequin and requested it to attain C-3PO constancy on a 0–5 scale.

All fashions clustered at 5.0 besides SDF (4.93). The metric saturated.

The analysis saturated nearly instantly. Partly this displays a simple rubric, however it additionally suggests that every one three strategies obtain surface-level persona constancy. The variations are in depth and generalisation, not surface-level vibes. In the event you’re deploying this in a managed context with a set immediate format, you would possibly genuinely not care which technique you used.

One different measurable facet impact: fashions educated on FP and SDF knowledge write longer responses on common (153 and 158 phrases) in comparison with baseline and Demo (each round 136 phrases).

FP and SDF fashions produce noticeably longer responses. The interquartile vary for SDF is tighter, suggesting extra constant verbosity.

First-person statements and artificial paperwork are flowing, expository prose. The mannequin absorbed that register alongside the persona. Whether or not that’s helpful or annoying relies upon solely in your use case, however it’s an actual, measurable facet impact of format alternative.

What This Experiment Can’t Inform You

A number of trustworthy limitations value naming earlier than you’re taking any of this too far:

Single mannequin, single character. All the things right here is Qwen3-4B and C-3PO. A personality with much less pre-existing presence within the coaching knowledge would possibly behave very otherwise, and a bigger mannequin would possibly generalise otherwise throughout codecs.

500 examples is one knowledge level. Probably the most attention-grabbing open query is the scaling curve. How do these methods evaluate at 50 examples? At 2,000? My instinct is that first-person statements keep environment friendly at low knowledge counts whereas demonstrations want extra quantity to generalise, however that’s only a guess, not a consequence.

The LLM choose saturated. This implies I’ve no fine-grained sign on how significantly better one technique is on the vibes stage. A more durable rubric or human analysis would give a cleaner image.

LoRA r=16 is a alternative. Greater rank would possibly favour one format over one other in methods I didn’t discover.

So, What’s the Finest Strategy to Brainwash an LLM?

In the event you’re doing persona injection by way of fine-tuning, right here’s the sensible abstract:

Use first-person statements if generalisation issues. They’re not the intuitive alternative, however they end up to encode the persona extra deeply. A mannequin that has learn “I’m C-3PO and I discover this plan deeply unwise” will sound like C-3PO in additional conditions than a mannequin that has solely seen C-3PO-style chat replies. The off-diagonal perplexity numbers make this case clearly.

Use demonstrations in case your deployment context is fastened. If precisely what format customers will work together with the mannequin in, demonstrations are strong and simple. Practice the mannequin on what it will likely be requested to do, and it does it properly. Simply don’t anticipate that to switch.

Use SDF if factual accuracy in regards to the persona issues most. That 3.4 perplexity on artificial paperwork is genuinely spectacular. However the emotional and conversational texture of a character doesn’t switch properly from third-person description, take into account combining SDF with FP to get factual grounding plus felt id.

Don’t underestimate a great system immediate. The baseline, simply Qwen3-4B with a system immediate describing C-3PO, scored 5.0 on the choose and coated most key traits. For a lot of use circumstances, that’s sufficient. Nice-tuning earns its value if you want robustness throughout prompts you possibly can’t management, or persona behaviour with out a seen system immediate in any respect.

In follow, demonstrations educate behaviour, artificial paperwork educate details, and first-person statements educate id.

The experiment was a weekend lengthy dash, and there’s a protracted listing of issues I need to observe up on. Probably the most particular one: does FP’s effectivity benefit maintain at low instance counts? If first-person statements are nonetheless aggressive at 50 examples whereas demonstrations crumble, that will have actual sensible implications for the way you construct persona datasets. In the event you run this experiment earlier than I do, I’d genuinely wish to know whether or not I’m proper.

Full code on GitHub. Nice-tuning was performed with LoRA (r=16) on a single A40 by way of RunPod, utilizing the TRL/PEFT stack. All datasets generated with Claude.

Source link

What’s the Best Way to Brainwash an LLM?

Loop Engineering for RAG Question Parsing: The Small Loop That Runs Before Retrieval

How to Find the Optimal Coding Agent Interface

I Completed Five Years in Analytics Consulting: 5 Lessons That Changed How I Work

GPU-Resident Top-K for Agentic RAG: I Built a CUDA Kernel So My Retrieval Step Would Stop Bouncing Off the GPU

Can Machine Learning Predict the World Cup?

Automate Writing Your LLM Prompts

These Were My Favorite Things Samsung Unpacked During Its 2026 Galaxy Event

AI minister role boosted but tech department axed in Burnham shake-up

Loop Engineering for RAG Question Parsing: The Small Loop That Runs Before Retrieval

The risk of weather data sabotage is rising

Featured Picks

Engineered bacteria detect microplastics in water

‘Fast-tech’ warning as demand for cheap gadgets heats-up

Navee ST3 Pro electric scooter review: It’s like a Rolls-Royce

What’s the Best Way to Brainwash an LLM?

Three Theories of The place a Persona Lives

The Setup

How Do You Measure Brainwash High quality?

The Perplexity Matrix

What Do the Precise Responses Look Like?

Trait Protection: The Human Examine

The LLM Decide Couldn’t Inform Them Aside

What This Experiment Can’t Inform You

So, What’s the Finest Strategy to Brainwash an LLM?

Related Posts