    Stop Tuning Hyperparameters. Start Tuning Your Problem.

    By Editor Times Featured · March 4, 2026


    You’re three weeks into a churn prediction model, hunched over a laptop, watching a Bayesian optimization sweep crawl through its 200th trial. The validation AUC ticks from 0.847 to 0.849. You screenshot it. You post it in Slack. Your manager reacts with a thumbs-up.

    You feel productive. You aren’t.

    If you’ve ever spent days squeezing fractions of a percent out of a Machine Learning (ML) metric while a quiet voice in the back of your head whispered does any of this actually matter?, you already sense the problem. That voice is right. And silencing it with another grid search is one of the most expensive habits in the profession.

    Here’s the uncomfortable math: more than 80% of Artificial Intelligence (AI) projects fail, according to RAND Corporation research published in 2024. The top root cause isn’t bad models. It isn’t insufficient data. It’s misunderstanding (or miscommunicating) what problem needs to be solved. Not a modeling failure. A framing failure.

    This article gives you a concrete protocol to catch that failure before you write a single line of training code. Five steps. Each takes a conversation, not a GPU cluster.

    “All that progress in algorithms means it’s actually time to spend more time on the data.” Andrew Ng didn’t say spend more time on the model. He said the opposite.


    The Productive Procrastination Trap

    Hyperparameter tuning feels like engineering. You have a search space. You have an objective function. You iterate, measure, improve. The feedback loop is tight (minutes to hours), the progress is visible (metrics go up), and the work is legible to your team (“I improved AUC by 2 points”).

    Problem framing feels like stalling. You sit in a room with business stakeholders who use imprecise language. You ask questions that don’t have clear answers. There’s no metric ticking upward. No Slack screenshot to post. Your manager asks what you did today and you say, “I spent four hours figuring out whether we should predict churn or predict reactivation probability.” That answer doesn’t sound like progress.

    But it’s the only progress that matters.

    Effort allocation vs. actual impact in ML projects. Sources: RAND (2024), Anaconda State of Data Science (2022). Image by the author.

    The reason is structural. Tuning operates within the problem as defined. If the problem is defined wrong, tuning optimizes a function that doesn’t map to business value. You get a beautiful model that solves the wrong thing. And no amount of Optuna sweeps can fix a target variable that shouldn’t exist.


    Zillow Bet $500 Million on the Wrong Problem

    In 2021, Zillow shut down its home-buying division, Zillow Offers, after losing over $500 million. The company had acquired roughly 7,000 homes across 25 metro areas, consistently overpaying because its pricing algorithm (the Zestimate) didn’t adjust to a cooling market.

    The post-mortems focused on concept drift. The model, trained on hot-market data, couldn’t keep up as demand slowed. Contractor shortages during COVID delayed renovations. The feedback loop between purchase and resale was too slow to catch the error.

    But the deeper failure happened before any model was trained.

    Zillow framed the problem as: given a home’s features, predict its market value. That framing assumed a stable relationship between features and price. It assumed Zillow could renovate and resell fast enough that the prediction window stayed short. It assumed the model’s error distribution was symmetric (overpaying and underpaying equally likely). None of those assumptions held.

    Competitors Opendoor and Offerpad survived the same market shift. Their models detected the cooling and adjusted pricing. The difference wasn’t algorithmic sophistication. It was how each company framed what its model needed to do, and how quickly it updated that frame.

    Zillow didn’t lose $500 million because of a bad model. They lost it because they never questioned whether “predict home value” was the right problem to solve at their operational velocity.


    When the AI Learned to Detect Rulers Instead of Cancer

    A research team built a neural network to classify skin lesions as benign or malignant. The model reached accuracy comparable to board-certified dermatologists. Impressive numbers. Clean validation curves.

    Then someone looked at what the model had actually learned.

    It was detecting rulers. When dermatologists suspect a lesion might be malignant, they place a ruler next to it to measure its size. So in the training data, images containing rulers correlated with malignancy. The model found a shortcut: ruler present = probably cancer. Ruler absent = probably benign.

    The accuracy was real. The learning was garbage. And no hyperparameter tuning could have caught this, because the model was performing exactly as instructed on the data exactly as provided. The failure was upstream: nobody asked, “What should the model be looking at to make this decision?” before measuring how well it made the decision.

    This is a pattern called shortcut learning, and it shows up everywhere. Models learn to exploit correlations in your data that won’t hold in production. The only defense is a clear specification of what the model should and shouldn’t use as signal, and that specification comes from problem framing, not from tuning.
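    One cheap way to hunt for shortcuts before deployment is to rank features by permutation importance and hand the ranking to a domain expert. Below is a minimal sketch of that check (not the dermatology study’s actual method); the feature names are hypothetical stand-ins, with ruler_present playing the role of the shortcut.

        # Shortcut-hunting sketch: rank features by permutation importance,
        # then ask a domain expert whether the top features belong in the
        # decision at all. Feature names are hypothetical.
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.inspection import permutation_importance
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, n_features=5,
                                   n_informative=3, random_state=0)
        feature_names = ["lesion_area", "border_irregularity",
                         "color_variance", "patient_age", "ruler_present"]

        X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
        model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

        result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                        random_state=0)
        for name, imp in sorted(zip(feature_names, result.importances_mean),
                                key=lambda pair: -pair[1]):
            print(f"{name:>20}: {imp:.3f}")
        # If a feature no clinician would use (like ruler_present) tops this
        # list, the fix is upstream of tuning: remove the shortcut from the
        # data and the problem specification.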


    Why Framing Errors Survive So Long

    If bad problem framing is this dangerous, why do smart teams keep skipping it?

    Three reinforcing dynamics make it persistent.

    First, feedback asymmetry. When you tune a hyperparameter, you see the result in minutes. When you reframe a problem, the payoff is invisible for weeks. Human brains discount delayed rewards. So teams gravitate toward the fast feedback loop of tuning, even when the slow work of framing has 10x the return.

    Second, legibility bias. “I improved accuracy from 84.7% to 84.9%” is a clean, defensible statement in a standup meeting. “I spent yesterday convincing the product team that we’re optimizing the wrong metric” sounds like you accomplished nothing. Organizations reward visible output. Framing produces no visible output until it prevents a disaster nobody knew was coming.

    Third, identity. Data scientists are trained as model builders. The tools, the courses, the Kaggle leaderboards, the interview questions: they all center on modeling. Problem framing feels like someone else’s job (product, business, strategy). Claiming it means stepping outside your technical identity, and that’s uncomfortable.

    The three reinforcing dynamics that keep ML teams optimizing the wrong thing. Image by the author.

    Andrew Ng named this pattern when he introduced the concept of data-centric Artificial Intelligence (AI) in 2021. He defined it as “the discipline of systematically engineering the data needed to build a successful AI system.” His argument: the ML community had spent a decade obsessing over model architecture while treating data (and by extension, problem definition) as someone else’s job. The returns from better architectures had plateaued. The returns from better problem definition had barely been tapped.


    The Steel-Man for Tuning

    Before going further: hyperparameter tuning is not useless. There are situations where it’s exactly the right thing to do.

    If you’ve already validated that your target variable maps directly to a business decision. If your data distribution in production matches training. If you’ve confirmed that your features capture the signal the business cares about (and only that signal). If all of that is true, then tuning the model’s capacity, regularization, and learning rate is legitimate optimization.

    The claim isn’t “never tune.” The claim is: most teams start tuning before they’ve earned the right to tune. They skip the framing work that determines whether tuning will matter at all. And when tuning produces marginal gains on a misframed problem, those gains are illusory.

    Data analytics research shows the pattern clearly: once you’ve achieved 95% of potential performance with basic configuration, spending days to extract another 0.5% rarely justifies the computational cost. That calculation gets worse when the 95% is measured against the wrong target.


    The 5-Step Problem Framing Protocol

    This protocol runs before any modeling. It takes 2 to 5 days depending on stakeholder availability. Each step produces a written artifact that your team can reference and challenge. Skip a step, and you’re gambling that your assumptions are correct. Most aren’t.

    Step 1: Name the Decision (Not the Prediction)

    Who: Data science lead + the business stakeholder who will act on the model’s output.
    When: First meeting. Before any data exploration.
    How: Ask this question and write down the answer verbatim:

    “When this model produces an output, what specific decision changes? Who makes that decision, and what do they do differently?”

    Example (good): “The retention team calls the top 200 at-risk customers each week instead of emailing all 5,000. The model ranks customers by reactivation likelihood so the team knows who to call first.”

    Example (bad): “We want to predict churn.” (No decision named. No actor identified. No action specified.)

    Red flag: If the stakeholder can’t name a specific decision, the project doesn’t have a use case yet. Pause. Don’t proceed to data exploration. A model without a decision is a report nobody reads.
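    The Step 1 artifact doesn’t need to be elaborate. One possible shape for it, a sketch in which every field and value is illustrative rather than a prescribed template:

        # One possible shape for the Step 1 artifact: a record the team can
        # reference and challenge. All field values are illustrative.
        from dataclasses import dataclass

        @dataclass(frozen=True)
        class DecisionSpec:
            decision: str  # what specifically changes when the model runs
            actor: str     # who makes that decision
            action: str    # what they do differently
            cadence: str   # how often the decision is made

        churn_decision = DecisionSpec(
            decision="Which at-risk customers get a retention call this week",
            actor="Retention team lead",
            action="Call the top 200 ranked customers instead of emailing all 5,000",
            cadence="Weekly",
        )
        print(churn_decision)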

    Step 2: Define the Error Cost Asymmetry

    Who: Data science lead + business stakeholder + finance (if available).
    When: Same meeting or next day.
    How: Ask:

    “What’s worse: a false positive or a false negative? By how much?”

    Example: For a fraud detection model, a false negative (missed fraud) costs the company an average of $4,200 per incident. A false positive (blocking a legitimate transaction) costs $12 in customer service time plus a 3% chance of losing the customer ($180 expected value). The ratio is roughly 23:1. This means the model should be tuned for recall, not precision, and the decision threshold should be set much lower than 0.5.

    Why this matters: Default ML metrics (accuracy, F1) assume symmetric error costs. Real business problems almost never have symmetric error costs. If you optimize F1 when your actual cost ratio is 23:1, you’ll build a model that performs well on paper and poorly in production. Zillow’s Zestimate treated overestimates and underestimates as equally bad. They weren’t. Overpaying for a house you can’t resell for months is catastrophically worse than underbidding and losing a deal.
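    Those figures pin the decision threshold down directly. A minimal sketch of the standard expected-cost rule, using the example’s numbers: flag a transaction when p × cost_fn exceeds (1 − p) × cost_fp.

        # Convert the error-cost asymmetry into a decision threshold.
        # Flag when p * cost_fn > (1 - p) * cost_fp,
        # i.e. when p > cost_fp / (cost_fp + cost_fn).
        cost_fn = 4200.0        # missed fraud: average loss per incident
        cost_fp = 12.0 + 180.0  # support time + expected cost of a lost customer

        ratio = cost_fn / cost_fp
        threshold = cost_fp / (cost_fp + cost_fn)
        print(f"cost ratio ~ {ratio:.0f}:1, decision threshold ~ {threshold:.3f}")
        # Prints a ratio near the ~23:1 figure above and a threshold near
        # 0.04. Optimizing accuracy or F1 at the default 0.5 cutoff quietly
        # solves a different problem than the one the business has.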

    Step 3: Audit the Target Variable

    Who: Data science lead + domain expert.
    When: After Steps 1-2 are documented. Before any feature engineering.
    How: Answer these four questions in writing:

    1. Does this target variable actually measure what the business cares about? “Churn” might mean “cancelled subscription” in your data but “stopped using the product” in the stakeholder’s mind. Those are different populations. Clarify which one maps to the decision in Step 1.
    2. When is the target observed relative to when the model needs to act? If you’re predicting 30-day churn but the retention team needs 14 days to intervene, your prediction window is wrong. The model needs to predict churn at least 14 days before it happens (see the sketch after this list).
    3. Is the target contaminated by the intervention you’re trying to optimize? If past retention efforts already lowered churn for some customers, your training data underestimates their true churn risk. The model learns “these customers don’t churn” when the truth is “these customers don’t churn because we intervened.” This is the causal inference trap, and it’s invisible in standard train/test splits.
    4. Can the model learn the right signal, or will it find shortcuts? The ruler-in-dermatology problem. List the features. For each one, ask: “Would a domain expert use this feature to make this decision?” If not, it may be a proxy that won’t generalize.
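    Question 2 is the easiest to get wrong in code, so here is a toy illustration (the snapshots table and its dates are hypothetical) of shifting the label window so that every positive example is one the retention team could actually have acted on:

        # Toy illustration of question 2: make the churn label actionable.
        import pandas as pd

        LEAD_TIME_DAYS = 14  # retention team needs 14 days to intervene
        HORIZON_DAYS = 30    # business defines churn over a 30-day window

        snapshots = pd.DataFrame({
            "customer_id": [1, 2, 3],
            "snapshot_date": pd.to_datetime(["2026-01-01"] * 3),
            "churn_date": pd.to_datetime(["2026-01-10", "2026-02-10", None]),
        })
        days_to_churn = (snapshots["churn_date"]
                         - snapshots["snapshot_date"]).dt.days

        # Naive label: churn within 30 days of the snapshot. Customer 1
        # churns on day 9, before the team could have intervened, yet gets
        # labeled positive.
        snapshots["churn_30d"] = days_to_churn <= HORIZON_DAYS

        # Actionable label: churn inside the window the team can still
        # influence, i.e. between day 14 and day 44 after the snapshot.
        snapshots["actionable_churn"] = days_to_churn.between(
            LEAD_TIME_DAYS, LEAD_TIME_DAYS + HORIZON_DAYS)
        print(snapshots[["customer_id", "churn_30d", "actionable_churn"]])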

    Step 4: Simulate the Deployment Decision

    Who: Full project team (DS, engineering, product, business stakeholder).
    When: After Steps 1-3 are documented. Before modeling begins.
    How: Run a tabletop exercise. Present the team with 10 synthetic model outputs (a mix of correct predictions, false positives, and false negatives) and ask:

    • “Given this output, what action does the business take?”
    • “Is that action correct given the ground truth?”
    • “How much does each error type cost?”
    • “At what confidence threshold does the business stop trusting the model?”

    This exercise surfaces misalignments that no metric can catch. You might discover that the business actually needs a ranking (not a binary classification). Or that the stakeholder won’t act on predictions below 90% confidence, which means half your model’s output is ignored. Or that the “action” requires information the model doesn’t provide (like why a customer is at risk).

    Artifact: A one-page deployment spec listing: who uses the output, in what format, at what frequency, with what confidence threshold, and what happens when the model is wrong.
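    The exercise itself needs no tooling, but a toy harness keeps it honest. The sketch below reuses the fraud costs from Step 2; the synthetic outputs, the action rule, and the threshold are invented for illustration.

        # Toy tabletop harness: walk synthetic outputs through the business
        # action and price each error. All numbers are illustrative.
        COST = {"false_positive": 192.0, "false_negative": 4200.0}
        THRESHOLD = 0.044  # the cost-derived threshold from Step 2

        synthetic_outputs = [  # (model probability, ground truth: fraud?)
            (0.90, True), (0.70, False), (0.30, True),
            (0.05, True), (0.02, False),
        ]

        total_cost = 0.0
        for prob, is_fraud in synthetic_outputs:
            flagged = prob >= THRESHOLD
            if flagged and not is_fraud:
                outcome = "false_positive"
            elif not flagged and is_fraud:
                outcome = "false_negative"
            else:
                outcome = "correct"
            total_cost += COST.get(outcome, 0.0)
            action = "hold for review" if flagged else "approve"
            print(f"p={prob:.2f} -> {action:>15} ({outcome})")
        print(f"cost of this batch: ${total_cost:,.0f}")

    Rerunning the same loop with THRESHOLD = 0.5 makes the asymmetry vivid: the two missed frauds at p=0.30 and p=0.05 alone would cost $8,400, dwarfing every false positive in the batch.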

    Step 5: Write the Anti-Goal

    Who: Data science lead.
    When: After Steps 1-4. The last check before modeling begins.
    How: Write one paragraph answering:

    “If this project succeeds on every metric we’ve defined but still fails in production, what went wrong?”

    Example 1: “The churn model hits 0.91 AUC on the test set, but the retention team ignores it because the predictions arrive 48 hours after their weekly planning meeting. The model is accurate but operationally useless because we didn’t align the prediction cadence with the decision cadence.”

    Example 2: “The fraud model flags 15% of transactions, overwhelming the review team. They start rubber-stamping approvals to clear the queue. Technically the model catches fraud; practically the humans in the loop have learned to ignore it.”

    The anti-goal is an inversion: instead of defining success, define the most plausible failure. If you can write a vivid anti-goal, you can usually prevent it. If you can’t write one, you haven’t thought hard enough about deployment.

    Run all five steps before writing training code. Each step produces a written artifact the team can reference. Image by the author.

    Is This a Tuning Problem or a Framing Problem?

    Not every stalled project needs reframing. Sometimes the problem is well-framed and you genuinely need better model performance. Use this diagnostic to tell the difference.

    Image by the author.

    What Changes When Teams Frame First

    The shift from model-centric to problem-centric work isn’t just about avoiding failure. It changes what “senior” means in data science.

    Junior data scientists are valued for modeling skill: can you train, tune, and deploy? Senior data scientists should be valued for framing skill: can you translate an ambiguous business situation into a well-posed prediction problem with the right target, the right features, and the right success criteria?

    The industry is slowly catching up. Andrew Ng’s push toward data-centric AI is one signal. The RAND Corporation’s 2024 report on AI anti-patterns is another: its top recommendation is that leaders should ensure technical staff understand the purpose and context of a project before starting. QCon’s 2024 analysis of ML failures names “misaligned objectives” as the most common pitfall.

    The pattern is clear. The bottleneck in ML isn’t algorithms. It’s alignment between the model’s objective and the business’s actual need. And that alignment is a human conversation, not a computational one.

    The bottleneck in ML is not compute or algorithms. It’s the conversation between the person who builds the model and the person who uses the output.

    For organizations, this means problem framing should be a first-class activity with its own time allocation, its own deliverables, and its own review process. Not a preamble to “the real work.” The real work.

    For individual data scientists, it means the fastest way to increase your impact isn’t learning a new framework or mastering distributed training. It’s learning to ask better questions before you open a notebook.


    It’s 11:14 PM on a Wednesday. You’re three weeks into a project. Your validation metric is climbing. You’re about to launch another sweep.

    Stop.

    Open a blank document. Write one sentence: “The decision that changes based on this model’s output is ___.” If you can’t fill in the blank without calling a stakeholder, you’ve just found the highest-ROI activity for tomorrow morning. It won’t feel like progress. It won’t produce a Slack-worthy screenshot. But it’s the only work that determines whether the next three weeks matter at all.


    References

    1. RAND Corporation, “The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed”, James Ryseff, Brandon De Bruhl, Sydne J. Newberry, 2024.
    2. MIT Sloan, “Why It’s Time for ‘Data-Centric Artificial Intelligence’”, Sara Brown, June 2022.
    3. insideAI News, “The $500mm+ Debacle at Zillow Offers: What Went Wrong with the AI Models?”, December 2021.
    4. Stanford Graduate School of Business, “Flip Flop: Why Zillow’s Algorithmic Home Buying Venture Imploded”.
    5. Diagnostics (MDPI), “Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis”, 2022.
    6. VentureBeat, “When AI Flags the Ruler, Not the Tumor”.
    7. InfoQ, “QCon SF 2024: Why ML Projects Fail to Reach Production”, November 2024.
    8. Number Analytics, “8 Hyperparameter Tuning Insights Backed by Data Analytics”.


