Benders’ Decomposition 101: How to Crack Open a Stochastic Program That’s Too Big to Swallow Whole

In my first TDS post, I a real-world drawback into an integer linear program. In my second, I made that program strong in opposition to uncertainty. In my third, I launched stochastic programming: 4 principled methods to place uncertainty into the mannequin reasonably than hand-waving it away.

The third submit ended with a promise. I stated two-stage recourse fashions are simple to put in writing down and tidy in concept, however in observe they’ve an embarrassing drawback: as you add extra situations, the so-called “deterministic equal” explodes in measurement and the solver begins to complain. I name-dropped Benders’ decomposition as the usual repair after which disappeared with out exhibiting the way it really works.

That is the submit that makes good on that promise.

We’ll begin with the place the earlier submit left off, present why the plain strategy falls aside at scale, work out the one statement that saves us, after which stroll slowly via the algorithm itself. There’s a bit extra math than final time, however nothing unique; probably the most superior instrument we’ll use is LP duality, and even that comes packaged as “we will write the identical drawback otherwise.”

Fast recap: the mannequin we’re making an attempt to unravel

Within the earlier submit, our hero was a trend firm in Germany sourcing winter clothes from Bangladesh. Demand ξ is a random variable; manufacturing x must be determined earlier than ξ is thought. After demand is revealed, a recourse determination y patches up any shortfall at a better value (an emergency batch from Romania, say).

If we assume a discrete distribution with S situations — values ξ¹, ξ², …, ξ $^S$ with chances p₁, …, p_S — the two-stage recourse mannequin collapses right into a single huge LP, the deterministic equal:

One copy of the second stage per state of affairs, all glued to the identical first-stage $x$ . Hand it to HiGHS or Gurobi, wait, and get an answer. Finish of story.

Besides typically the ready turns into the entire story.

When the deterministic equal hits a wall

For small $S$ , every little thing is okay. You write the LP, you remedy it, you go dwelling. However situations are likely to multiply quicker than your endurance:

A steady demand distribution that you simply approximate with pattern common approximation may want 1000’s of samples earlier than the optimum $x$ stabilizes.
A multi-stage mannequin followers out exponentially with the variety of levels. Three levels with ten branches every already provides you a thousand leaves; 5 levels and also you’re nicely into the lots of of 1000’s.
In actual purposes, corresponding to hydropower scheduling, supply-chain planning beneath demand shocks, or vitality dispatch, $S$ within the tens of 1000’s isn’t uncommon. Hydropower individuals repeatedly mannequin state of affairs bushes which might be too huge to even enumerate, not to mention hand to a solver.

The deterministic equal dutifully grows with $S$ . It has one block of constraints per state of affairs, one set of $y_s$ variables per state of affairs, and one $T_sx$ coupling every block to the shared first-stage $x$ . When you check out the constraint matrix, it has a really specific form:

That is referred to as block-angular construction. Every state of affairs contributes its personal diagonal block $W_s$ , plus a column of $T_s$ on the left that ties it to $x$ . The shared row on the backside is the first-stage feasibility constraint $Ax geq b$ .

The form is informative, however for the solver it’s unhealthy information. LP runtime grows superlinearly in drawback measurement. A fast Google search will let you know it’s roughly $O(n^{2.5})$ for simplex-style strategies and $O(n^{3.5})$ for interior-point strategies within the worst case, which suggests doubling the variety of situations doesn’t double the runtime — it sextuples it. In some unspecified time in the future the issue stops becoming in reminiscence completely, and your pleasant solver begins swapping to disk, which is a well mannered approach of claiming “that is an excessive amount of.”

So we’d like one thing cleverer than “shove the entire thing right into a solver and hope for the perfect.”

Why naive tips don’t assist

Earlier than we get to the intelligent thought, let’s rule out the plain ones.

Thought 1: Simply use fewer situations. Typically that is high-quality; typically the reply modifications meaningfully as you add situations, and your “optimum” was a coincidence of the pattern you occurred to attract. The pattern common approximation literature is exactly about when you will get away with this and when you may’t.

Thought 2: Clear up every state of affairs individually and common the outcomes. Tempting, however flawed. The entire level of getting one $x$ is that it has to work for all situations concurrently. Fixing state of affairs $s$ by itself provides you the scenario-specific optimum (the “wait-and-see” resolution from the final submit), not a first-stage determination you may decide to.

Thought 3: Clear up the deterministic drawback with imply demand 𝔼[ξ], then take care of the leftovers. That is the EV resolution. It may be very flawed, and that wrongness is precisely what VSS measures.

So we will’t shrink the issue, we will’t decouple it, and we will’t common our approach round it. We’d like one thing that respects the construction of the deterministic equal whereas not fixing it as one big LP.

That one thing has been round since 1962.

The one statement that saves us

Take a look at the constraint matrix once more:

Now think about that $x$ is mounted and decide any worth, doesn’t matter which. The primary-stage block $Ax geq b$ both holds or it doesn’t. And the remainder of the constraints decouple: state of affairs s’s constraint turns into $W_s y_s ≥ h_s − T_s x$ , which includes solely yₛ. The connection between situations runs via $x$ . Repair $x$ , and the situations collapse into $S$ unbiased LPs.

That is the important thing thought. The variable $x$ is a complicating variable: if it have been recognized, life could be a lot simpler. Every subproblem is small (the dimensions of 1 state of affairs, not all of them), they usually’re unbiased, so you may even remedy them in parallel.

The Benders’ decomposition recipe is constructed round this statement. Roughly:

Guess a first-stage determination $x$ .
Clear up the S state of affairs subproblems independently, given $x$ .
Use what the subproblems let you know to replace your guess of $x$ .
Repeat till your guesses cease enhancing.

The laborious half (and the elegant half) is step 3: how do the subproblems inform us something helpful about $x$ ? That’s the place LP duality earns its maintain.

Benders!

Let’s decelerate and construct the algorithm one piece at a time.

The worth operate v(ξˢ, x)

For a set first-stage determination $x$ and a set state of affairs $s$ , the second-stage subproblem is only a small LP:

Its twin is

This twin has a function we’ll lean on. Its possible area $Lambda_s = {lambda geq 0 : lambda^high W_s leq q_s}$ is a polyhedron, and it doesn’t depend upon $x$ in any respect. The $x$ solely reveals up within the goal. Mixed with the LP indisputable fact that an optimum is at all times attained at a vertex of the possible area, we get a helpful rewrite:

the place the $lambda^s_k$ are the (finitely many) vertices of $Lambda_s$ .

That’s a really good expression to stare at. As a operate of $x$ , $v(xi^s, x)$ is the pointwise most of a finite assortment of affine capabilities. That makes it piecewise linear and convex. Image the higher envelope of a fan of traces.

The anticipated recourse value is only a weighted sum of those:

A weighted sum of piecewise-linear convex capabilities is itself piecewise-linear and convex. So $Q(x)$ , the factor we’re making an attempt to attenuate within the first stage on high of $c^high x$ , is a piecewise-linear convex operate of $x$ . We don’t have an express method for it, as a result of that might require enumerating each vertex of each $Lambda_s$ , however we all know its form.

That form is the lever Benders pulls.

Approximating Q with cuts

We don’t know $Q$ in closed type, however a piecewise-linear convex operate has a really forgiving property: at any level $bar{x}$ the place we all know the worth $Q(bar x)$ and a subgradient $g ∈ ∂Q(x̄)$ , the linear operate

is a world decrease sure on $Q$ . It touches $Q$ at $x̄$ and stays beneath all over the place else.

For the recourse value particularly, the subgradient is one thing we will compute. If we remedy the twin subproblems and acquire optimum twin vertices $lambda^s_{ok*}$ for every state of affairs, a subgradient of $Q$ at $x̄$ is

So each time we remedy the subproblems at some candidate $x^tau$ , we get a decent affine decrease sure on $Q$ totally free. Benders calls every such decrease sure an optimality reduce.

The plan is now seen. We’re going to construct up a sequence of affine decrease bounds on $Q$ , one per iteration, and use them as a stand-in for $Q$ within the first-stage drawback.

The grasp drawback

We change $Q(x)$ with a placeholder variable $theta$ that has to lie above each decrease sure we’ve collected to date. Iteration $t$ has the grasp drawback:

Right here $L$ is any secure decrease sure on $Q$ — usually $L = 0$ if recourse prices are non-negative, or one thing extra damaging if not — and every $(αₜ, βₜ)$ is an optimality reduce from a earlier iteration. The grasp drawback is only a common LP, a lot smaller than the deterministic equal as a result of it solely carries the first-stage variables and a rising pile of cuts.

The candidate resolution $(x^tau, θ^tau)$ from this grasp is then fed to the subproblems, which return a recent reduce, which will get added, and we go round once more.

The algorithm

Placing it collectively, the uni-cut Benders algorithm is:

1. Clear up the grasp drawback to get $(x^tau, θ^tau)$ .

2. For every state of affairs $s = 1, …, S$ , remedy the twin subproblem at $x^tau$ to get an optimum vertex $λˢ_τ$ and the optimum worth $v(ξˢ, x^tau)$ .

3. Construct the optimality reduce $θ ≥ α_t + β_t^high x$ with

and add it to the grasp.

4. Cease if the grasp’s $θ^tau$ already satisfies the brand new reduce at $x^tau$ — that’s, if $θ^tau ≥ α_t + β_t^high x^tau$ . In any other case, return to step 1.

A number of issues are quietly exceptional about this loop.

Finiteness. Each reduce comes from some vertex $λ^s_k$ of some $Lambda_s$ . Every $Lambda_s$ has finitely many vertices. So within the worst case, the algorithm enumerates all doable cuts and stops — it can not loop ceaselessly. In observe, it stops a lot sooner: usually solely a small handful of vertices find yourself mattering.

Decomposition finished proper. The large shared LP has been changed by a small grasp drawback plus S tiny, unbiased subproblems. The subproblems could be solved in parallel. The grasp grows, however solely by one constraint per iteration.

Info, not enumeration. We by no means write out all of the situations in a single LP. We let the subproblems inform us, via their twin variables, what the grasp must learn about them. The cuts are the language they converse.

Uni-cut versus multi-cut

Within the recipe above, we mixture the per-scenario data right into a single reduce on $Q$ . The pure different is to maintain $S$ separate placeholders $θ₁, …, θ_S$ , one per state of affairs, and add $S$ separate cuts every iteration. That is the multi-cut variant.

Multi-cut makes use of extra data per iteration, so the grasp’s approximation of $Q$ tightens quicker. The price is that you simply’re including $S$ constraints per iteration as an alternative of 1, and the grasp will get unwieldy quicker. There’s no common winner — You and Grossmann (2013) report significant speedups from multi-cut on supply-chain fashions, but it surely’s case-by-case. An affordable default is to start out with uni-cut and take a look at multi-cut if convergence is sluggish.

An image of 1 iteration

Typically the best technique to soak up the algorithm is to stroll via what every iteration accomplishes geometrically.

Think about $Q(x)$ because the true (unknown) piecewise-linear convex operate plotted over $x$ . The grasp’s present approximation $hat Q^tau(x)$ is the pointwise most of all cuts we’ve added to date — a polyhedral decrease sure that sits beneath Q all over the place.

At iteration $tau$ , the grasp minimizes $c^high x + Q̂^tau (x)$ , and lands on some $x^tau$ the place its approximation thinks the optimum is. We then ask the subproblems: “what’s the true $Q$ at $x^tau$ ?” They reply with each a price $Q(x^tau)$ and a subgradient (the twin resolution). That subgradient provides us a brand new linear operate that touches $Q$ at $x^tau$ — the reduce. We add it; the grasp’s approximation rises, simply sufficient to be precise at $x^tau$ ; and the following iteration’s $x^{tau+1}$ will (normally) be totally different.

The loop stops when the grasp’s approximation already predicts the true worth at $x^tau$ . Visually, this implies our polyhedral decrease envelope has caught up with $Q$ precisely on the level we care about.

You received’t at all times attain an actual match, as a result of for numerical causes most implementations cease at a small tolerance, however the geometry is identical.

It’s not a magic instrument

To this point this has learn like an algorithmic free lunch: identical reply, a lot smaller LPs, parallelizable, finite. Let’s be sincere in regards to the disadvantages as nicely.

Sluggish convergence within the wild. In concept, Benders terminates in finitely many iterations; in observe, “finitely many” can imply rather a lot of iterations, and each solves an LP. Rahmaniani et al. (2017) is the usual literature overview on the dozen-odd tips individuals use to hurry issues up: reduce choice, reduce removing when previous cuts grow to be dominated, belief areas to keep away from wildly oscillating $x^tau$ , and so forth.

The grasp drawback turns into the bottleneck. It’s commonplace for greater than 90% of whole runtime to enter fixing the (frequently rising) grasp. Widespread treatments embrace not fixing the grasp to full optimality on early iterations, eradicating cuts that not assist, and exploiting construction when the grasp is itself a structured LP.

Subproblems can dominate too. Particularly when $S$ is massive or every subproblem is itself a tough LP, the second stage takes over. The repair is generally engineering: parallelize throughout situations, warm-start every subproblem from the earlier iteration’s foundation, batch them when doable.

The hidden assumption: comparatively full recourse. The whole lot above quietly assumed that the second-stage subproblem is possible for any $x$ the grasp may suggest. That’s the property of (comparatively) full recourse from the earlier submit: for each possible first-stage x and each state of affairs $ξ$ , there exists a recourse $y ≥ 0$ satisfying $Wy ≥ h(ξ) − T(ξ)x$ .

When that property fails, some $x^tau$ will go away the subproblem infeasible — the second-stage LP has no y that works, so the implicit value is $+ infty$ , which means $x^tau$ was by no means a possible first-stage alternative to start with. Benders patches this with feasibility cuts, that are derived by fixing an auxiliary LP that measures how infeasible the subproblem is and utilizing its twin to supply a hyperplane that slices $x^tau$ out of the grasp’s possible area. With out going via the algebra, the construction is identical as optimality cuts: remedy a small LP, have a look at the twin, construct a linear constraint, hand it again to the grasp. In every iteration of Benders, then, you both add an optimality reduce (if each subproblem was possible) or a feasibility reduce (if some subproblem was not). The recourse mannequin with no complete-recourse assure nonetheless works; you simply want each sorts of cuts.

It solely totally applies to two-stage fashions. For multi-stage stochastic applications, the pure generalization isn’t simply “Benders with extra levels.” It’s stochastic twin dynamic programming (SDDP), which makes use of sampling and approximate worth capabilities to keep away from constructing out the total state of affairs tree. SDDP is the workhorse in hydropower scheduling, and it deserves, and doubtless will finally get, its personal submit.

Integer first levels get tough. If $x$ must be integer (assume capability selections, facility location), then the grasp is an MILP, and Benders’ clear convex concept loses a few of its chunk. There are extensions, corresponding to generalized Benders, integer L-shaped, branch-and-Benders-cut, and many others., however they’re previous the scope of this submit.

None of this implies Benders is unhealthy. It means it’s a way, not a miracle. Used thoughtfully — with the best grasp/subproblem cut up, the best reduce variant, and the best speedup tips — it’s the distinction between fixing an issue and never.

Wrapping up

We began with an issue: stochastic applications are simple to put in writing down however simple to make too huge to unravel. The deterministic equal grows linearly within the variety of situations, and LP solvers develop superlinearly in drawback measurement, which is a quadratic-or-worse collision course.

Benders’ decomposition takes that block-angular construction severely. It carves out the complicating first-stage variables, leaves the scenario-decoupled second stage to a swarm of small subproblems, and makes use of LP duality to ferry data between the 2: optimality cuts when the subproblems are possible, feasibility cuts after they’re not. The result’s an iterative scheme that converges in finitely many steps and fairly often lots quicker than that.

To summarize what’s really doing the work:

The second-stage worth operate $v(ξ, x)$ is piecewise-linear and convex in $x$ . So is the anticipated recourse value $Q(x)$ .
A piecewise-linear convex operate could be approximated arbitrarily nicely by a rising assortment of affine decrease bounds (cuts), every obtained totally free from a twin resolution.
The primary stage doesn’t must know $Q$ precisely — simply to lower-bound it tightly on the factors it cares about. The grasp/subproblem loop builds up precisely that.
Uni-cut and multi-cut are two methods to bundle the per-scenario data. Feasibility cuts deal with the case when subproblems can fail.

When you take one factor away, let it’s the block-angular look of the constraint matrix. Each time you may rewrite an optimization drawback in order that fixing some variables makes the remainder separable, you may attempt Benders. This concept reveals up in community design, scheduling, vitality planning, anyplace you’ve a “laborious” set of choices that bind collectively in any other case simple ones.

Within the subsequent submit on this sequence, we’ll see what to do when even a two-stage Benders is the flawed body — when the world doesn’t politely cut up into “resolve, observe, appropriate” however goes “resolve, observe, resolve, observe, resolve…” indefinitely. That’s the territory of SDDP. When you’ve ever puzzled how hydropower operators schedule reservoirs over a multi-year horizon with 1000’s of joint influx situations, the reply is in that subsequent submit.

Credit and references

As with the earlier submit, this one relies on lectures by Dr. Ruben van Beesten (Norwegian College of Science and Expertise) from his October 2023 course on Stochastic Programming, which I had the pleasure of attending in Trondheim. The pedagogical buildup comes straight from his slides; any clumsiness within the retelling is mine. Additionally, all photographs, apart from the thumbnail, on this submit are made by me.

4 references value figuring out about:

Benders, J. F. (1962). Partitioning procedures for fixing mixed-variables programming issues. Numerische Mathematik, 4(1), 238–252. The unique paper. Surprisingly readable.
Van Slyke, R., and Wets, R. (1969). L-shaped linear applications with purposes to optimum management and stochastic programming. SIAM Journal on Utilized Arithmetic, 17(4), 638–663. The stochastic-programming-flavored model; that is the place the identify “L-shaped technique” comes from.
Fengqi You and Ignacio E Grossmann. (2013). Multicut benders decomposition algorithm for course of provide chain planning beneath uncertainty. Annals of Operations Analysis, 210:191–211.
Rahmaniani, R., Crainic, T. G., Gendreau, M., and Rei, W. (2017). The Benders decomposition algorithm: A literature overview. European Journal of Operational Analysis, 259(3), 801–817. The reference for accelerations, variants, and fashionable utilization.

And the earlier posts within the sequence: Five questions that will help you model integer linear programs better, Four steps to robustify your linear program, and A Gentle Introduction to Stochastic Programming.

Source link

Benders’ Decomposition 101: How to Crack Open a Stochastic Program That’s Too Big to Swallow Whole

Loop Engineering for RAG Question Parsing: The Small Loop That Runs Before Retrieval

How to Find the Optimal Coding Agent Interface

I Completed Five Years in Analytics Consulting: 5 Lessons That Changed How I Work

GPU-Resident Top-K for Agentic RAG: I Built a CUDA Kernel So My Retrieval Step Would Stop Bouncing Off the GPU

Can Machine Learning Predict the World Cup?

Automate Writing Your LLM Prompts

These Were My Favorite Things Samsung Unpacked During Its 2026 Galaxy Event

AI minister role boosted but tech department axed in Burnham shake-up

Loop Engineering for RAG Question Parsing: The Small Loop That Runs Before Retrieval

The risk of weather data sabotage is rising

Featured Picks

I’m a Fitness Expert and These Are My Go-To Pieces of Fitness Tech for 2026

Check Out These Extraordinary New Images of Mercury

Do You Own an Android Phone? Claim a Part of Google’s $135M Data Harvesting Settlement Soon

Benders’ Decomposition 101: How to Crack Open a Stochastic Program That’s Too Big to Swallow Whole

Fast recap: the mannequin we’re making an attempt to unravel

When the deterministic equal hits a wall

Why naive tips don’t assist

The one statement that saves us

Benders!

The worth operate v(ξˢ, x)

Approximating Q with cuts

The grasp drawback

The algorithm

Uni-cut versus multi-cut

An image of 1 iteration

It’s not a magic instrument

Wrapping up

Credit and references

Related Posts