In the previous post, we explored how to extend Reinforcement Learning (RL) beyond the tabular setting using function approximation. While this allowed us to generalize across states, our experiments also revealed an important limitation: in simple environments like GridWorld, approximate methods can struggle to match the stability and efficiency of tabular approaches. The main reason is that learning a good representation is itself a difficult problem, one that can outweigh the benefits of generalization when the state space is still relatively small.
To truly unlock the power of function approximation, we therefore need to move to environments where tabular methods are no longer viable. This naturally leads us to multi-player games, where the state space grows combinatorially and generalization becomes essential. It also fits perfectly into this post series, since so far we have not managed to learn any meaningful behavior in more complex multi-player environments. In this post, we take this step by considering the classic game of Connect 4 and investigating how to learn strong policies using Deep Q-Learning.
From Sarsa to Deep Q-Learning
To tackle this task, we extend our framework along several important dimensions.
First, we move from online updates to a batched training setup. In our previous implementation of Sarsa, we updated the model after every transition. While faithful to the original algorithm [1], this approach is computationally inefficient: every optimizer step incurs a non-trivial cost, and modern hardware, especially GPUs, is designed to operate on batches with only marginal extra overhead.
To address this, we introduce a replay buffer. Instead of updating immediately, we store transitions as they are encountered, either up to a fixed capacity or, in our case, until one or several games have finished. We then perform a batched update over this collected experience. This not only improves computational efficiency but also stabilizes learning by reducing the variance of individual updates.
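To make this concrete, here is a minimal sketch of such a buffer. The class and method names (ReplayBuffer, Transition, push, sample) are illustrative and not necessarily those used in the actual implementation:

import random
from collections import deque, namedtuple

# Illustrative transition container; the fields mirror those used in the
# batch_update snippets below (states, actions, rewards, next_states, dones).
Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        # Oldest transitions are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniformly sample past transitions for one batched update.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)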
At this point, an important conceptual shift occurs. By sampling from past experience rather than strictly following the current policy, we move away from Sarsa, an on-policy method, towards Q-learning, which is off-policy. While we have not formally reintroduced Q-learning in the function approximation setting here, the extension from the tabular case is largely straightforward. This combination of replay buffers and Q-learning forms the foundation of Deep Q-Networks (DQNs), popularized by DeepMind in their seminal work on Atari games [2].
Finally, we turn to scalability. Reinforcement learning is inherently data-hungry, so increasing throughput is crucial. To this end, we implement a vectorized environment wrapper that allows us to simulate multiple games of Connect 4 in parallel. Concretely, a single call to step(a) now processes a batch of actions and advances all environments simultaneously.
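Conceptually, the wrapper looks roughly like the following sketch. The VectorizedEnv class and the simplified single-environment interface it assumes are placeholders for illustration; the real implementation builds on PettingZoo and additionally handles turn order and the opponent pool:

import numpy as np

class VectorizedEnv:
    """Illustrative wrapper that steps several independent games at once."""

    def __init__(self, env_fns):
        # One environment instance per parallel game.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        # One action per environment; results are stacked into batched arrays.
        obs, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d = env.step(action)
            if d:
                o = env.reset()  # restart finished games so the batch stays full
            obs.append(o)
            rewards.append(r)
            dones.append(d)
        return np.stack(obs), np.array(rewards), np.array(dones)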
In practice, however, achieving true parallelism in Python is non-trivial. The Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time, which limits the effectiveness of multi-threading for CPU-bound workloads such as environment stepping. We also experimented with multi-processing, but found that the additional overhead (e.g., inter-process communication) largely offset any gains in our setting. For more background, I recommend an earlier post of mine.
Despite these limitations, the combination of batched updates and environment vectorization yields a substantial improvement in throughput, increasing performance to roughly 50–100 games per second.
Implementation
In this post, I deliberately avoid going into too much detail on the environment vectorization and instead focus on the RL components. Partly, this is because the vectorization itself is "just" an implementation detail, but also because, in all honesty, our current setup is not ideal. Much of this is due to limitations imposed by the PettingZoo environment we are using.
In future posts, we will explore different environments and revisit this topic with a stronger emphasis on scalability, a crucial aspect of modern reinforcement learning. For a more detailed discussion of how we structure multi-player environments, manage agents, and maintain an opponent pool, I refer to my earlier post on multi-player RL. The vectorized setup used here is simply an extension of that framework to multiple games running in parallel. As always, the full implementation is available on GitHub.
Revisiting Q-Learning
Let us briefly revisit Q-learning and connect it to our implementation.
The core update rule is given by:
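$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

where $\alpha$ is the learning rate and $\gamma$ the discount factor.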
In contrast to Sarsa, which uses the action actually taken in the next state, Q-learning takes a max over all possible next actions. This makes it off-policy, since the update does not depend on the behavior policy used to generate the data. In practice, this often leads to faster propagation of value information, especially in deterministic environments such as board games.
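For comparison, the Sarsa update bootstraps from the action $A_{t+1}$ that is actually selected in the next state:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$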
When combined with neural networks, this approach is commonly known as Deep Q-Learning. Instead of maintaining a table of values, we train a neural network to approximate the action-value function. The update is then implemented as a regression problem, minimizing the difference between the current estimate and a bootstrapped target:
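Written as a squared error for a single transition (the implementation below uses the more robust Huber loss instead), this objective reads:

$$L(\theta) = \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right)^2$$

where the bootstrapped target $r + \gamma \max_{a'} Q_\theta(s', a')$ is treated as a constant during the gradient step.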

In our implementation, this corresponds directly to the batch_update function. Given a batch of transitions (states, actions, rewards, next states, and done flags), we first compute the predicted Q-values for the actions that were taken:
q = self.q(batch.states, ...)
q_sa = q.gather(1, batch.actions.unsqueeze(1)).squeeze(1)
Next, we construct the target using the maximum Q-value of the next state. Since not all actions are legal in Connect 4, we apply a mask to ensure that only valid moves are considered:
q_next = self.q(batch.next_states, ...)
q_next_masked = q_next.masked_fill(~legal, float("-inf"))
max_next = q_next_masked.max(dim=1).values
Finally, we combine the reward and the discounted next-state value, taking care to handle terminal states correctly:
target = batch.rewards + gamma * (~batch.dones).float() * max_next
The network is then trained by minimizing the Huber loss (a more robust variant of the mean squared error):
loss = F.smooth_l1_loss(q_sa, target)
This batch-based formulation allows us to efficiently reuse experience collected from multiple parallel games, which is crucial for scaling to more complex environments. At the same time, it highlights a key challenge of Deep Q-Learning: the targets themselves depend on the current network, which can lead to instability during training.
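Putting these pieces together, the full update looks roughly as follows. The attribute names (self.q, self.optimizer, batch.legal_next) are assumptions made for the sake of this sketch, and the targets are computed under torch.no_grad() so that no gradients flow through the bootstrapped term:

import torch
import torch.nn.functional as F

def batch_update(self, batch, gamma=0.99):
    # Predicted Q-values of the actions that were actually taken.
    q = self.q(batch.states)
    q_sa = q.gather(1, batch.actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets: mask illegal moves, take the max over next actions,
    # and drop the bootstrap term for terminal states.
    with torch.no_grad():
        q_next = self.q(batch.next_states)
        q_next_masked = q_next.masked_fill(~batch.legal_next, float("-inf"))
        max_next = q_next_masked.max(dim=1).values
        target = batch.rewards + gamma * (~batch.dones).float() * max_next

    # Huber loss between prediction and target, followed by one optimizer step.
    loss = F.smooth_l1_loss(q_sa, target)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    return loss.item()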
For further reference, the official PyTorch tutorial on Deep Q-Learning provides a helpful complementary perspective.
Results
With that in place, let us turn to the results. To put them into perspective, we first recall how the tabular methods performed on this task. After 100,000 steps, most policies were still closely clustered in terms of win rate. In particular, even a random policy achieved a win rate of roughly 50%, indicating that none of the learned policies had managed to outperform chance in a meaningful way.

In the following experiment, we focus on two agents: our DQN and a random baseline. Due to the previously introduced "zoo" setup, the DQN is not a single fixed policy but a pool of evolving agents. We continuously add new versions and prune weaker ones, which gradually increases the overall strength of the opponent pool.
This has an important implication for interpreting the metrics:
the win rate of "DQN vs. DQN" naturally hovers around 50%, since agents of similar strength compete against each other. A more informative signal is therefore the performance of the random policy: as the DQN improves, the random agent should win less frequently.
With that in mind, let us look at the performance curve:

We observe several interesting effects. Most notably, the win rate of the random policy drops considerably faster than in the tabular setting, clear evidence that the DQN is indeed learning the game. However, after around a million steps, the improvement plateaus, with the random policy still winning roughly 20% of games.
To better understand what this means in practice, we can evaluate the learned policy against a human player. In the following example, I take the role of the red player, going first:

The result’s fairly revealing. The agent has clearly realized to play offensively—it actively pursues its personal four-in-a-row. Nevertheless, it struggles with defensive play, failing to anticipate and block easy opponent threats.
That is in all probability a little bit of a disappointment, however: we’ll come again to this. In future posts we’ll learn to scale higher, be taught sooner, and beat people (at many issues). Penning this put up collection about Sutton’s nice ebook has been an incredible journey (though there are nonetheless a couple of posts left) – however we have now merely outgrown the very basic framework we began with to showcase all of the out there algorithms in Sutton’s ebook, overlaying each tabular and approximate answer strategies. Thus, specialization is the way in which to go – and sooner or later we’ll do precisely that, writing extremely environment friendly, customized tailor-made strategies for various issues.
Conclusion
In this post, we moved from tabular Sarsa to Deep Q-Learning, introducing replay buffers, batched updates, and function approximation. We applied this to Connect 4, a multi-player game we previously failed to solve with tabular methods, with a clear result: our agent is no longer stuck at chance level; it learns, improves, and consistently outperforms a random policy.
But just as importantly, we also see the limits.
Even after extensive training, the agent plateaus and still shows clear weaknesses, most notably in defensive play. This is not just a matter of "more training." In multi-player settings, the problem itself becomes harder: opponents evolve, the environment is no longer stationary, and the learning targets keep shifting.
This is where the real challenge begins.
Up to this point, our framework, loosely following [1], has prioritized generality and readability. But to go further, that is no longer enough. Performance requires specialization.
In the next posts, we first continue following [1], and then focus on exactly that: building faster, more stable, and more scalable systems, pushing beyond simple baselines towards agents that can truly compete.
Other Posts in this Series
References
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
[2] Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.