In 2016, Go world champion Lee Sedol faced an opponent that was not made of flesh and blood – but of lines of code.
It soon became clear that the human was losing.
In the end, Lee Sedol lost 4:1.
Last week I watched the documentary AlphaGo again – and found it fascinating once more.
The scary thing about it? AlphaGo didn't get its style of play from databases, rules or strategy books.
Instead, it had played against itself millions of times – and learned how to win in the process.
Move 37 in game 2 was the moment the whole world understood: this AI doesn't play like a human – it plays better.
AlphaGo combined supervised learning, reinforcement learning, and search. One fascinating part is that its strategy emerged from playing against itself – using reinforcement learning to improve over time.
Today we use reinforcement learning not only in games, but also in robotics (e.g. gripper arms or household robots), in energy optimization (e.g. to reduce the energy consumption of data centers) and in traffic control (e.g. through traffic light optimization).
And in modern agents, we combine large language models with reinforcement learning (e.g. Reinforcement Learning from Human Feedback) to make the responses of ChatGPT, Claude, or Gemini more human-like.
In this article, I'll show you exactly how this works, and how we can better understand the mechanism using a simple game: Tic Tac Toe.
What is reinforcement learning?
When we watch a baby learning to walk, we see: it stands up, falls over, tries again – and at some point takes its first steps.
No teacher shows the baby how to do it. Instead, the baby tries out different actions through trial and error until it can walk.
When it manages to stand or walk a few steps, that is a reward for the baby. After all, its goal is to be able to walk. If it falls down, there is no reward.
This learning process of trial, error and reward is the basic idea behind reinforcement learning (RL).
Reinforcement learning is a learning approach in which an agent learns, through interaction with its environment, which actions lead to rewards.
Its goal: to collect as many rewards as possible in the long run.
- In contrast to supervised learning, there are no "right answers" or labels. The agent has to find out for itself which decisions are good.
- In contrast to unsupervised learning, the aim is not to find hidden patterns in the data, but to carry out those actions that maximize the reward.
How an RL agent thinks, decides – and learns
For an RL agent to learn, it needs four things: an idea of where it currently is (state), what it can do (actions), what it wants to achieve (reward) and how well a strategy has worked in the past (value).
An agent acts, gets feedback, and gets better.
For this to work, four components are needed:
1) Policy / Strategy
This is the rule or strategy according to which an agent decides which action to perform in a certain state. In simple cases, this is a lookup table. In more complex applications (e.g. with neural networks), it is a function.
2) Reward signal
The reward is the feedback from the environment. For example, this can be +1 for a win, 0 for a draw and -1 for a loss. The agent's goal is to collect as many rewards as possible over as many steps as possible.
3) Value Function
This function estimates the expected future reward of a state. The reward shows the agent whether an action was "good" or "bad". The value function estimates how good a state is – not just immediately, but considering the future rewards the agent can expect from that state onward. The value function therefore estimates the long-term benefit of a state.
4) Model of the environment
A model tells the agent: "If I do action A in state S, I will probably end up in state S′ and get reward R."
In model-free methods like Q-learning, however, this is not necessary.
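To see how these four pieces interact, here is a tiny, self-contained toy example – not Tic Tac Toe yet, and not part of the project code we build later. An agent on a number line learns value estimates for its states purely from trial, error and reward:

import random

# Toy sketch of the agent-environment loop:
# the agent starts at position 0 and is rewarded when it reaches position 3.
values = {0: 0.0, 1: 0.0, 2: 0.0}            # value estimates per state

for episode in range(200):
    state = 0
    done = False
    while not done:
        # policy: usually step right (exploitation), occasionally step left (exploration)
        action = -1 if random.random() < 0.2 else +1
        next_state = max(0, state + action)   # the environment reacts to the action
        reward = 1 if next_state == 3 else 0  # reward signal from the environment
        done = (next_state == 3)
        # learning step: nudge the value estimate of the state towards the observed reward
        values[state] += 0.1 * (reward - values[state])
        state = next_state

print(values)  # only the state right before the goal gains value; Q-learning's look-ahead fixes that

This naive update only looks at the immediate reward, which is exactly why the value function and Q-learning, introduced below, also account for what comes afterwards.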
Exploitation vs. Exploration: Move 37 – and what we can learn from it
You may remember move 37 from game 2 between AlphaGo and Lee Sedol:
an unusual move that looked like a mistake to us humans – but was later hailed as genius.
Why did the algorithm do that?
The computer program was trying out something new. This is called exploration.
Reinforcement learning needs both: an agent must find a balance between exploitation and exploration.
- Exploitation means that the agent uses the actions it already knows.
- Exploration, on the other hand, means trying out actions for the first time. The agent tries them because they might be better than the actions it already knows.
The agent tries to find the optimal strategy through trial and error.
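A common way to strike this balance – and the one our agent will use later – is the ε-greedy rule: with a small probability ε the agent explores a random action, otherwise it exploits the best action it knows so far. A minimal sketch (the epsilon_greedy helper and the q_values dictionary are illustrative, not part of the project code):

import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    # With probability epsilon: explore a random action
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise: exploit the action with the highest known value (0.0 if unseen)
    return max(actions, key=lambda a: q_values.get(a, 0.0))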
Tic-Tac-Toe with reinforcement learning
Let's take a look at reinforcement learning with a super well-known game.
You've probably played it as a child too: Tic Tac Toe.

The game is ideal as an introductory example: it doesn't require a neural network, the rules are clear and we can implement it with just a little Python:
- Our agent starts with zero knowledge of the game. It starts like a human seeing the game for the first time.
- The agent gradually evaluates each game situation: a score of 0.5 means "I don't know yet whether I'm going to win here." A 1.0 means "This situation will almost certainly lead to victory."
- By playing many matches, the agent observes what works – and adapts its strategy.
The goal? For each turn, the agent should choose the action that leads to the highest long-term reward.
In this section, we'll build such an RL system step by step and create the file TicTacToeRL.py.
→ You can find all the code in this GitHub repository.
1. Building the environment of the game
In reinforcement learning, an agent learns through interactions with an environment. The environment determines what a state is (e.g. the current board), which actions are allowed (e.g. where you can place a mark) and what feedback there is for an action (e.g. a reward of +1 if you win).
In theory, we refer to this setup as a Markov Decision Process: a model consisting of states, actions and rewards.
First, we create a class TicTacToe. It manages the game board, which we create as a 3×3 NumPy array, and the game logic:
- The reset(self) function starts a new game.
- The function available_actions() returns all free fields.
- The function step(self, action, player) executes a game move. Here we return the new state, a reward (1 = win, 0.5 = draw, -10 = invalid move) and the game status. We penalize invalid moves heavily with -10 in this example so that the agent learns to avoid them quickly – a common technique in small RL environments.
- The function check_winner() checks whether a player has three X's or O's in a row and has therefore won.
- With render_gui() we display the current board with matplotlib as X and O graphics.
import numpy as np
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
import random
from collections import defaultdict

# Tic Tac Toe game environment
class TicTacToe:
    def __init__(self):
        self.board = np.zeros((3, 3), dtype=int)
        self.done = False
        self.winner = None

    def reset(self):
        self.board[:] = 0
        self.done = False
        self.winner = None
        return self.get_state()

    def get_state(self):
        return tuple(self.board.flatten())

    def available_actions(self):
        return [(i, j) for i in range(3) for j in range(3) if self.board[i, j] == 0]

    def step(self, action, player):
        if self.done:
            raise ValueError("Game is already over")
        i, j = action
        if self.board[i, j] != 0:
            return self.get_state(), -10, True  # invalid move ends the episode with a penalty
        self.board[i, j] = player
        if self.check_winner(player):
            self.done = True
            self.winner = player
            return self.get_state(), 1, True
        elif not self.available_actions():
            self.done = True
            return self.get_state(), 0.5, True
        return self.get_state(), 0, False

    def check_winner(self, player):
        for i in range(3):
            if all(self.board[i, :] == player) or all(self.board[:, i] == player):
                return True
        if all(np.diag(self.board) == player) or all(np.diag(np.fliplr(self.board)) == player):
            return True
        return False

    def render_gui(self):
        fig, ax = plt.subplots()
        ax.set_xticks([0.5, 1.5], minor=False)
        ax.set_yticks([0.5, 1.5], minor=False)
        ax.set_xticks([], minor=True)
        ax.set_yticks([], minor=True)
        ax.set_xlim(-0.5, 2.5)
        ax.set_ylim(-0.5, 2.5)
        ax.grid(True, which='major', color='black', linewidth=2)
        for i in range(3):
            for j in range(3):
                value = self.board[i, j]
                if value == 1:
                    ax.plot(j, 2 - i, 'x', markersize=20, markeredgewidth=2, color='blue')
                elif value == -1:
                    circle = plt.Circle((j, 2 - i), 0.3, fill=False, color='red', linewidth=2)
                    ax.add_patch(circle)
        ax.set_aspect('equal')
        plt.axis('off')
        plt.show()
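Before we move on to the agent, here is a small, optional sanity check of the environment. It is not part of the original script – just a way to try the interface by hand:

# Optional sanity check of the environment (not part of the original script)
env = TicTacToe()
state = env.reset()
print(env.available_actions())            # all nine cells are free at the start

# Player 1 places an X in the center
next_state, reward, done = env.step((1, 1), player=1)
print(next_state, reward, done)           # e.g. (0, 0, 0, 0, 1, 0, 0, 0, 0), 0, False

env.render_gui()                          # shows the current board with matplotlib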
2. Program the Q-learning agent
Next, we define the learning part: our agent.
It decides which action to perform in a certain state in order to receive as much reward as possible.
The agent uses the classic RL method Q-learning. A Q-value is stored for each combination of state and action – the estimated long-term benefit of this action.
The most important methods are:
- Using the choose_action(self, state, actions) function, the agent decides in each game situation whether to choose an action it already knows well (exploitation) or to try out a new action that has not yet been sufficiently tested (exploration). This decision is based on the so-called ε-greedy approach: with a probability of ε = 0.1 the agent chooses a random action (exploration), with 90 % probability (1 – ε) it chooses the currently best-known action based on its Q-table (exploitation).
- With the function update(state, action, reward, next_state, next_actions) we adjust the Q-value depending on how good the action was and what happens afterwards. This is the central learning step for the agent.
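Before the full class, it may help to see this update rule in isolation. The numbers below are made up, but they use the same learning rate α = 0.1 and discount factor γ = 0.9 as the agent:

# The Q-learning update for one made-up transition (same alpha and gamma as the agent below)
alpha, gamma = 0.1, 0.9

old_q = 0.0          # current estimate Q(state, action)
reward = 1           # feedback from the environment, e.g. a winning move
max_q_next = 0.0     # best Q-value among the actions in the next state (0 for a terminal state)

new_q = old_q + alpha * (reward + gamma * max_q_next - old_q)
print(new_q)         # 0.1 – the estimate moves one small step towards the observed target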
# Q-learning agent
class QLearningAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q_table = defaultdict(float)  # Q-values, 0.0 by default for unseen (state, action) pairs
        self.alpha = alpha                 # learning rate
        self.gamma = gamma                 # discount factor
        self.epsilon = epsilon             # exploration rate

    def get_q(self, state, action):
        return self.q_table[(state, action)]

    def choose_action(self, state, actions):
        # ε-greedy: explore with probability epsilon, otherwise exploit the best known action
        if random.random() < self.epsilon:
            return random.choice(actions)
        else:
            q_values = [self.get_q(state, a) for a in actions]
            max_q = max(q_values)
            best_actions = [a for a, q in zip(actions, q_values) if q == max_q]
            return random.choice(best_actions)

    def update(self, state, action, reward, next_state, next_actions):
        max_q_next = max([self.get_q(next_state, a) for a in next_actions], default=0)
        old_value = self.q_table[(state, action)]
        new_value = old_value + self.alpha * (reward + self.gamma * max_q_next - old_value)
        self.q_table[(state, action)] = new_value
On my Substack, I regularly write summaries about my published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you're interested, take a look or subscribe.
3. Train the agent
The actual learning process begins in this step. During training, the agent learns through trial and error: it plays many games, memorizes which actions have worked well – and adapts its strategy.
In the process, the agent learns how its actions are rewarded, how its behavior affects later states and how better strategies develop in the long run.
- With the function train(agent, episodes=10000) we define that the agent plays 10,000 games against a simple random opponent. In each episode, the agent (player 1) makes a move, followed by the opponent (player -1). After each move, the agent learns via update().
- Every 1,000 games we record how many wins, draws and defeats there were.
- Finally, we plot the learning curve with matplotlib. It shows how the agent improves over time.
# Training with learning curve
def train(agent, episodes=10000):
    env = TicTacToe()
    results = {"win": 0, "draw": 0, "loss": 0}
    win_rates = []
    draw_rates = []
    loss_rates = []
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Agent (player 1) moves and learns from the immediate outcome
            actions = env.available_actions()
            action = agent.choose_action(state, actions)
            next_state, reward, done = env.step(action, player=1)
            if done:
                agent.update(state, action, reward, next_state, [])
                if reward == 1:
                    results["win"] += 1
                elif reward == 0.5:
                    results["draw"] += 1
                else:
                    results["loss"] += 1
                break
            # Random opponent (player -1) makes its move
            opp_actions = env.available_actions()
            opp_action = random.choice(opp_actions)
            next_state2, reward2, done = env.step(opp_action, player=-1)
            if done:
                agent.update(state, action, -1 * reward2, next_state2, [])
                if reward2 == 1:
                    results["loss"] += 1
                elif reward2 == 0.5:
                    results["draw"] += 1
                else:
                    results["win"] += 1
                break
            next_actions = env.available_actions()
            agent.update(state, action, reward, next_state2, next_actions)
            state = next_state2
        if (episode + 1) % 1000 == 0:
            total = sum(results.values())
            win_rates.append(results["win"] / total)
            draw_rates.append(results["draw"] / total)
            loss_rates.append(results["loss"] / total)
            print(f"Episode {episode+1}: Wins {results['win']}, Draws {results['draw']}, Losses {results['loss']}")
            results = {"win": 0, "draw": 0, "loss": 0}
    x = [i * 1000 for i in range(1, len(win_rates) + 1)]
    plt.plot(x, win_rates, label="Win Rate")
    plt.plot(x, draw_rates, label="Draw Rate")
    plt.plot(x, loss_rates, label="Loss Rate")
    plt.xlabel("Episodes")
    plt.ylabel("Rate")
    plt.title("Learning curve of the Q-learning agent")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()
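After training, it can be instructive to peek into what the agent has learned. The following lines are not part of the original script – just a small, optional sketch that assumes a trained agent object and looks up its preferred opening move:

# Optional sketch (not in the original script): inspect the learned opening move
env = TicTacToe()
state = env.reset()                       # the empty board
q_values = {a: agent.get_q(state, a) for a in env.available_actions()}
best_action = max(q_values, key=q_values.get)
print(best_action, round(q_values[best_action], 3))  # the agent's currently preferred first move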
4. Visualization of the board
With the main program entry point if __name__ == "__main__": we define the starting point of the script. It ensures that the training of the agent runs automatically when we execute the script. And we use the render_gui()
method to display the Tic Tac Toe board as a graphic.
# Main program
if __name__ == "__main__":
    agent = QLearningAgent()
    train(agent, episodes=10000)

    # Visualization of an example board
    env = TicTacToe()
    env.board[0, 0] = 1
    env.board[1, 1] = -1
    env.render_gui()
Execution in the terminal
We save the code in the file TicTacToeRL.py.
In the terminal, we now navigate to the directory where our TicTacToeRL.py is saved and execute the file with the command python TicTacToeRL.py.
In the terminal, we can see how many games our agent has won after every thousandth episode:

And in the visualization we see the learning curve:

Final Thoughts
With Tic Tac Toe, we use a simple game and a little Python – but we can easily see how reinforcement learning works:
- The agent starts without any prior knowledge.
- It develops a strategy through feedback and experience.
- Its decisions gradually improve as a result – not because it knows the rules, but because it learns.
In our example, the opponent was a random agent. As a next step, we could see how our Q-learning agent performs against another learning agent or against ourselves – a rough sketch of such a human-vs-agent loop follows below.
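This is only a hypothetical sketch, assuming the agent has already been trained via train(); input validation and invalid human moves are not handled carefully here:

# Hypothetical sketch: play against the trained agent (assumes `agent` was trained via train())
env = TicTacToe()
state = env.reset()
done = False
while not done:
    # Agent (player 1) picks a move with its ε-greedy policy (it still explores occasionally)
    action = agent.choose_action(state, env.available_actions())
    state, reward, done = env.step(action, player=1)
    env.render_gui()                      # close the window to continue
    if done:
        break
    # Human (player -1) enters a move as "row col", e.g. "0 2"
    i, j = map(int, input("Your move (row col): ").split())
    state, reward, done = env.step((i, j), player=-1)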
Reinforcement learning shows us that machine intelligence is not created through knowledge or facts alone – but through experience, feedback and adaptation.