This is part of a series on scheduling optimization in logistics with multi-agent reinforcement learning (MARL). Here, I focus on how the generalization was achieved. I recommend reading Part 1 first if you want a picture of the architectural and business context.
The goal was for the model to generalize across mid-mile processes and survive even in changing conditions. I realized this vision through three foundational concepts:
- Hybrid architecture abstracts the physical complexity
- Scale-invariant observations create a universal model input
- MARL makes the agents adaptable
Spoiler alert: the first two concepts let us transfer agents easily between tasks, while the third makes the agent adaptive within a single task and beyond. Let's look at each.
Hybrid Architecture
How do you engineer a system capable of delivering robust solutions even when moved into entirely new contexts? You make it solve not one specific case, but something more general: a problem at a higher level of abstraction.
But how do we bring this to life? Let's split the problem into layers and solve it with a hybrid: RL commands the high-level strategy, and LP lives at the low-level execution. This lets RL synthesize broader domain knowledge, while LP solves specific, individual packing instances.
action = [num_vehicles_1, .. , num_vehicles_n]
See Part 1 for more details on the hybrid approach and the action design.
Thanks to this "separation of duties," the RL component is unburdened by the minute, technical trivia of which parcels go where or how they are packed, like a manager detached from the execution details.
Ultimately, the RL agent affects the environment indirectly: its high-level actions are processed by the LP solver, which then refreshes the environment's state.
Here is how we process the RL agent's action and pass it into the LP solver.
def decide_send_LP(self, action: np.ndarray):
    # Parse the RL agent's action array into a dictionary of active destinations
    neighb_action = {v_id: num_v for v_id, num_v in enumerate(action) if num_v > 0}
    if not neighb_action:
        return 0, 0  # No vehicles dispatched

    # Get warehouse inventory for parcels that can actually go to the chosen destinations
    available_parcels = self.get_available_parcels(destinations=neighb_action.keys())
    if available_parcels.empty:
        return 0, 0  # No packages to send

    # The LP decides which parcels go into the vehicles to maximize volume/profit
    av_vehicles = self.get_available_vehicles()
    parcels_result, edges_result = send_veh(neighb_action, available_parcels, av_vehicles)

    # Update the environment state based on the LP's physical execution
    self.process_sent(parcels_result)

    # Return costs to the environment (for reward calculation)
    shipment_cost = sum(edges_result.c_cost * edges_result.v_varr_value)
    num_vehicles_sent = edges_result.v_varr_value.sum()
    return shipment_cost, num_vehicles_sent
What is going on here? First of all, we translate the agent's action into a digestible format, making sure the agent actually requested at least one dispatch. Then we check whether any parcels in the warehouse can be sent.
Next, we run the linear program, which packs available packages into available vehicles, choosing not only the transport class but the specific vehicle, as well as where each parcel will go.
And finally, we update the environment's state based on the LP execution, calculate the shipping costs, and return them for the reward calculation.
Thus, we gain portability: as long as the structure of the task stays the same, the system can adapt to any problem within the same class.
Scale-Invariant Observations
Let's say we have the hybrid architecture. But how do we make it survive in different contexts if the RL agents' observation and action spaces are technically fixed at initialization?
I achieved this by transforming the observations: I normalized the observation space to make it scale-invariant. Instead of tracking raw counts (e.g., "how many packages were sent"), we track ratios (e.g., "what share of the total backlog was sent").
This is a simple technical trick that gives you "free" transfer of an agent from one task to another, by letting the agent operate at a higher level of abstraction where absolute numbers are irrelevant.
Let's walk through some examples.
Observations
Local Inventory (perc_piles_wh) — The volume of packages at each warehouse.
def upd_perc_piles_wh(env):
    piles_wh = env.metrics['piles_wh']
    return np.array([piles_wh / env.num_piles])
Here, to make the observation scale-invariant, I divide the current warehouse inventory piles_wh by the absolute number of packages that can pass through the simulation, env.num_piles. This way, the agent learns to prioritize based on the share of the daily workload it is currently holding.
Local Inventory by Directions — Shows exactly where the current load needs to go. This is the foundation of the routing decision.
def upd_warehouse_loading_level_by_directions(env):
    # Get the current physical inventory at this specific node
    parcels = env.get_current_warehouse_parcels()
    if parcels.empty:
        return np.zeros(env.num_vertices)
    # Prepare the destinations array
    destinations = parcels['destination'].values.astype(int)
    # Get the counts for the destinations
    counts = np.bincount(destinations, minlength=env.num_vertices)
    return counts / len(parcels)
First, we pull the current stock of packages at this specific warehouse and verify that it is not empty. Next, we extract the 'destination' column as an array of integers, which represent the target warehouse IDs. Finally, np.bincount calculates the distribution of the packages across all destinations. By dividing these counts by the total number of packages currently at this local warehouse, we convert an absolute quantity into a share. The result is a scale-invariant vector of floats, where each index holds the exact share of the local stock headed for that particular vertex.
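As a toy illustration (the destination IDs and the five-vertex graph are made up, not from the project), the bincount-plus-normalization step behaves like this:

```python
import numpy as np

# Hypothetical example: 4 parcels at this warehouse, graph of 5 vertices
destinations = np.array([2, 2, 0, 4])

# Count parcels per destination, aligned to the full vertex range
counts = np.bincount(destinations, minlength=5)
# Convert absolute counts into shares of the local stock
shares = counts / len(destinations)

print(shares.tolist())  # [0.25, 0.0, 0.5, 0.0, 0.25]
```

The same vector shape comes out regardless of whether the warehouse holds 4 parcels or 4,000, which is exactly what makes the observation transferable.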
Closest Deadline by Route (deadlines_min_dist) — Distribution of the closest deadlines for the current stock.
def upd_deadlines_min_dist(env):
    parcels = env.get_current_warehouse_parcels()
    deadlines = np.ones(env.num_vertices)  # 1.0 means no urgency or no parcels
    if not parcels.empty:
        # Group by destination and find the exact minimum time left
        min_times = parcels.groupby('destination')['time_left'].min() / env.max_time_left
        # Assign the calculated minimums to their respective destination indices
        deadlines[min_times.index.astype(int)] = min_times.values
    return np.clip(deadlines, env.config.OBS_BOX_LOW, env.config.OBS_BOX_HIGH)
Here, we again pull the current local inventory. We initialize a deadlines vector the size of the graph and fill it with ones (where 1.0 means no urgency, and values approaching 0.0 indicate a deadline that has arrived).
Next, we group the parcels by their destination and find the minimum time_left for each route, then divide it by the maximum possible time left to convert absolute time into a relative ratio (the same approach as before).
Because the resulting vector only contains data for active destinations, it is sparse and unaligned with our action space. We map these urgent deadlines to their correct topological position IDs by using the destinations as integer indices.
As a final touch, we clip the array to stay strictly between 0 and 1. This is a critical safety measure: overdue packages generate negative time values, which would break the neural network's observation bounds.
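A minimal illustration of why the clip matters (the bounds of 0.0 and 1.0 and the time values are assumed for this sketch; one parcel is already overdue):

```python
import numpy as np

max_time_left = 48.0
# time_left per route; -3.0 is an already-overdue parcel
min_times = np.array([12.0, -3.0, 30.0]) / max_time_left

# Without the clip, -0.0625 would leak outside the observation box
deadlines = np.clip(min_times, 0.0, 1.0)
print(deadlines.tolist())  # [0.25, 0.0, 0.625]
```

The overdue route collapses to 0.0, the maximum-urgency signal, instead of producing an out-of-bounds negative value.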
Typically, a new task would imply a completely new observation space. In my hybrid approach, however, this is not the case: agents can be transferred from warehouse to warehouse by design, regardless of the number of parcels, vehicles, or neighboring nodes.
Zero-Padding, or Max-Node Padding
In the current version, the only exception is the total number of warehouses in the network (the order of the graph). This must be known upfront, since transfer is only possible to a graph of the same maximum size.
We handle this limitation with standard zero-padding. We define a maximum graph size (e.g., 100 vertices), and for any smaller graph, we mask the non-existent nodes with zeros. If your maximum graph size is 100 vertices, you simply deploy the agent on the currently active vertices and mask the rest with zeros. The same logic applies to observing neighbors: the vector size is always equal to the order of the logistics graph, but only reachable (observable) neighbors carry non-zero values.
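A minimal sketch of this padding (the MAX_VERTICES constant and the helper name are mine, invented for illustration):

```python
import numpy as np

MAX_VERTICES = 100  # fixed upper bound on the graph order

def pad_observation(obs: np.ndarray, max_size: int = MAX_VERTICES) -> np.ndarray:
    """Zero-pad a per-vertex observation from a smaller graph to the fixed size."""
    padded = np.zeros(max_size, dtype=obs.dtype)
    padded[:len(obs)] = obs  # active vertices keep their values, the rest stay 0
    return padded

# A 4-vertex graph's observation deployed into the 100-slot space
small_obs = np.array([0.25, 0.0, 0.5, 0.25])
padded = pad_observation(small_obs)
print(padded.shape)  # (100,)
```

Because the observation vector always has the same fixed shape, the same policy network can be pointed at any graph up to the maximum size.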
MARL
Good decisions under a changing context
Now let's address another problem: reality is unstable.
A sudden snowstorm hits, 3PL tariffs triple, or there is a massive spike in orders right before the holidays. A company needs to be operationally adaptable to survive this. Note that the physical rules of the game (vehicle sizes, the map) stay the same, but the context shifts entirely.
Static heuristics (e.g., a hardcoded rule to "dispatch at 85% capacity") will immediately start producing colossal losses in these scenarios. A major advantage of the MARL approach is that it generalizes the situation from the observations. It dynamically shifts its decision-making threshold on the fly in response to these changing observations.
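To make the contrast concrete, here is a deliberately simplified sketch. The numbers and the tariff-aware rule are invented for illustration; they stand in for what the policy network actually learns from its observations.

```python
def static_policy(fill_level: float) -> bool:
    # Hardcoded heuristic: dispatch at 85% capacity, no matter the context
    return fill_level >= 0.85

def adaptive_policy(fill_level: float, tariff_ratio: float) -> bool:
    # Toy stand-in for a learned policy: when tariffs spike (ratio > 1),
    # demand fuller vehicles before paying for a dispatch
    threshold = min(0.95, 0.85 * tariff_ratio ** 0.5)
    return fill_level >= threshold

# Tariffs triple: the static rule still fires at 86% fill,
# while the adaptive rule now waits for at least 95%
print(static_policy(0.86))         # True
print(adaptive_policy(0.86, 3.0))  # False
```

The learned policy does the same kind of threshold shifting, except the "rule" is implicit in the network weights and reacts to the full observation vector, not a single hand-picked feature.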
Another great benefit of MARL is that the problem is divided into smaller parts, which are solved independently by the agents. Multi-agent architectures save us from having to solve the entire network problem with a single "mega-agent." I will cover that in more detail in my next article on dimensionality reduction.
MARL Implementation
A few words on how we specifically implemented the multi-agent aspect. I faced two distinct challenges:
- Because agents' actions are interdependent, they can easily adapt to each other's sub-optimal behaviors. As a result, in the early stages of training, traditional MARL can be highly unstable.
- I wanted to stay within the Gymnasium + Stable-Baselines stack, which doesn't explicitly support native MARL training.
At the same time, falling back to a single-agent solution was impossible due to the sheer number of warehouses, and the "one mega-agent" approach was ruled out at the architectural level (see the architecture details in Part 1).
As a result, I designed the following training pipeline:
- Instead of training all agents simultaneously, we train only one, the "current" agent, per episode.
- While the "current" agent trains, the others operate purely in frozen inference mode.
- A global environment "step" consists of a sequential execution of all agents: the "training" agent takes its action, followed by the "inference" agents.
Here is how it looks in code:
# Initialize environment and load the current best weights for all agents
env.env_method('prepare_env', best_agent_paths)

for i in range(NUM_MARL_LOOPS):
    for training_ag_id in agents.keys():
        # Shift the environment's perspective to the currently active agent
        env.env_method('set_cur_training_agent', training_ag_id)
        # Fetch the active agent's policy model
        agent_obj = agents.get(training_ag_id)
        # Train ONLY this agent
        # (this will call env.step() under the hood
        # and will run the other agents in frozen inference mode)
        agent_obj = agent_obj.learn(
            TS_PER_AGENT,
            reset_num_timesteps=False,
            tb_log_name=f"Agent_{training_ag_id}",
            callback=callbacks,
        )
        # Save the updated weights and push them to the live models cache
        agent_obj.save(last_agent_paths[training_ag_id])
        agents[training_ag_id] = agent_obj
First, prepare_env() is executed, which sets the default values and the paths for saving the agents. Then we launch the main loop, which dictates the number of training passes, NUM_MARL_LOOPS, across the entire network.
Inside it, we handle the training of a single "current" agent. agents is a dictionary: keys are IDs, values are the model objects. The set_cur_training_agent() method switches the environment's perspective. Then we take the current agent's model and trigger .learn(). After that, it is quite straightforward: we save the model and update the agents dictionary.
Now, let's briefly look at how this step actually executes inside the environment:
def step(self, action) -> tuple[dict, float, bool, dict]:
    # Training agent executes its action
    reward = self.process_packages(action)
    self.process_inflow()  # Localized to the active agent's node
    self.update_state_and_metrics(reward)
    self.save_current_act_agent()

    # Inference loop: the other agents take their turns sequentially
    for ag_id in self.inference_agents.keys():
        if ag_id == self.cur_training_agent:
            continue  # Skip the training agent (it already acted)

        # Switch environment context to the current inference agent
        self.current_origin = ag_id
        self.load_act_agent()

        # Load the model and get a masked prediction
        agent_obj = self.inference_agents.get(ag_id)
        action_mask = self.valid_action_mask()
        ag_action, _ = agent_obj.predict(self.state, action_masks=action_mask)

        # Execute the inference agent's action
        sub_reward = self.process_packages(ag_action)
        self.update_state_and_metrics(sub_reward)
        self.save_current_act_agent()

    # Restore environment state to the training agent's perspective
    self.current_origin = self.cur_training_agent
    self.load_act_agent()

    # Check terminal conditions
    done = self.check_if_done()
    self.step_n += 1
    return self.state, reward, done, self.info
First, we execute the action for the "current" training agent. We start by processing the parcels currently in the system via self.process_packages(action), where the agent's action is applied to the environment logic. In other words, if the agent decides to dispatch some trucks to some warehouses, the LP solver executes it here.
After that, we receive new incoming packages in self.process_inflow(), update the state and metrics in self.update_state_and_metrics(), and save the agent context in save_current_act_agent().
Now the fun part begins. Since the current training agent has already taken its action, we need to infer the actions for the rest of the network. So we start a for loop over our available agents, skipping the training one. Inside this loop, we switch the "current" agent context, load its model, and generate an inference by feeding the current state and action mask into agent_obj.predict().
From there, the flow is identical to the training agent's: we process the generated action (this time, by an inference agent) and update the environment. Finally, at the end of the loop, we switch the context back to the current training agent and return the final results.
In the Next Episodes
So, we now have a fully functional training loop. The code runs, and the MARL environment initializes, but how do we ensure that this training process actually:
- Finishes in a reasonable timeframe?
- Makes the models converge?
- Produces "good enough" routing strategies?
That's what I'll break down in the next articles. Stay tuned!

