When we talk about object detection, one model that likely comes to mind first is YOLO, at least for me, thanks to its popularity in the field of computer vision.
The very first version of this model, called YOLOv1, was released back in 2015 in the research paper titled "You Only Look Once: Unified, Real-Time Object Detection" [1]. Before YOLOv1 was invented, one of the state-of-the-art algorithms for object detection was R-CNN (Region-based Convolutional Neural Network), which uses a multi-stage mechanism to do the task. It first employs a selective search algorithm to create region proposals, then uses a CNN-based model to extract the features within all these regions, and finally classifies the detected objects using an SVM [2]. Here you can clearly imagine how long the process is just to perform object detection on a single image.
The motivation behind YOLO in the first place was to improve speed. In fact, beyond achieving low computational complexity, the authors showed that their proposed deep learning model was also able to achieve high accuracy. As this article is being written, YOLOv13 was published just a few days ago [3]. But let's focus on its very first ancestor for now, so that you can see the beauty of this model from the time it first came out. This article discusses how YOLOv1 works and how to build its neural network architecture from scratch with PyTorch.
The Underlying Principle Behind YOLOv1
Before we get into the architecture, it is better to understand the idea behind YOLOv1 up front. Let's start with an example. Suppose we have a picture of a cat, and we are about to use it as a training sample for a YOLOv1 model, so we need to create a ground truth for it. The original paper defines a parameter S, which denotes the number of grid cells we divide the image into along each spatial dimension. By default, this parameter is set to 7, so we will have 7×7=49 cells in total. Take a look at Figure 1 below to better understand this idea.
Next, we need to determine which cell contains the midpoint of the object. In the above case, the cat is positioned almost exactly at the center of the image, hence the midpoint must lie in cell (3, 3). Later in the inference phase, we can think of this cell as the one responsible for predicting the cat. Now taking a closer look at the cell, we need to determine the exact position of the midpoint. Here you can see that along the vertical axis it is located exactly in the middle, but along the horizontal axis it is slightly shifted to the left of the middle. So, if I were to approximate, the coordinate would be (0.4, 0.5). This coordinate is relative to the cell and is normalized to the range 0 to 1. It is worth noting that the (x, y) coordinate of the midpoint should be neither less than 0 nor greater than 1, since a value outside this range would mean the midpoint lies in another cell. Meanwhile, the width w and the height h of the bounding box are roughly 2.4 and 3.2, respectively. These numbers are relative to the cell size, meaning that if the object is bigger than the cell, the value will be greater than 1. Later, when we create the ground truth for an image, we will need to store all this x, y, w and h information in the so-called target vector. A small code sketch of this encoding is shown below.
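To make this encoding concrete, here is a minimal sketch that converts an absolute bounding box (in pixels) into the cell index and the cell-relative x, y, w and h values described above. Note that the helper name and the example box numbers are my own illustration, not something taken from the paper.
S = 7           # number of grid cells along each axis
IMG_SIZE = 448  # YOLOv1 input resolution

def encode_box(xmin, ymin, xmax, ymax, S=S, img_size=IMG_SIZE):
    cell_size = img_size / S               # 64 pixels per cell
    x_mid = (xmin + xmax) / 2              # absolute midpoint of the box
    y_mid = (ymin + ymax) / 2
    row = int(y_mid // cell_size)          # cell indices (row, column)
    col = int(x_mid // cell_size)
    x = (x_mid % cell_size) / cell_size    # midpoint relative to the cell, in [0, 1)
    y = (y_mid % cell_size) / cell_size
    w = (xmax - xmin) / cell_size          # box size relative to the cell (may exceed 1)
    h = (ymax - ymin) / cell_size
    return (row, col), (x, y, w, h)

cell, box = encode_box(140.8, 122, 294.4, 326)   # a made-up box around the cat
print(cell, box)   # (3, 3) and roughly (0.4, 0.5, 2.4, 3.2)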
Target Vector
The length of the target vector is 25 for each cell, where the first 20 elements (index 0 to 19) store the class of the object in the form of one-hot encoding. This is mainly because YOLOv1 was originally trained on the PASCAL VOC dataset, which has that number of classes. Next, index 20 is used to store the confidence of the bounding box prediction, which during training is set to 1 whenever an object midpoint lies within the cell. Finally, the (x, y) coordinates of the midpoint are placed at indices 21 and 22, while w and h are stored at indices 23 and 24. The illustration in Figure 2 below displays what the target vector for cell (3, 3) looks like.

Again, keep in mind that the above target vector only corresponds to a single cell. To create the ground truth for the entire image, we need a bunch of similar vectors concatenated, forming the so-called target tensor as shown in Figure 3. Note that the class probabilities as well as the bounding box confidences, locations, and sizes of all other cells are set to zero because there is no other object in the image. The snippet after Figure 3 sketches this assembly in code.

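To see how this ground truth could be assembled in code, the snippet below fills a 25×7×7 tensor following the layout we just discussed. The class index used for "cat" assumes the usual alphabetical PASCAL VOC ordering, so treat it as illustrative.
import torch

S, C = 7, 20
target = torch.zeros(C + 5, S, S)   # 25x7x7, every cell empty by default

row, col = 3, 3    # the cell containing the cat's midpoint
cat_class = 7      # "cat" in the alphabetical PASCAL VOC ordering (illustrative)

target[cat_class, row, col] = 1.0   # one-hot class probability
target[20, row, col] = 1.0          # confidence: an object midpoint lies in this cell
target[21:25, row, col] = torch.tensor([0.4, 0.5, 2.4, 3.2])   # x, y, w, h

print(target[:, 3, 3])   # the 25-element target vector for cell (3, 3)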
Prediction Vector
The prediction vector is a little different. While the target vector consists of 25 elements, the prediction vector consists of 30. This is because by default YOLOv1 predicts two bounding boxes for the same object during inference, so we need 5 more elements to store the information about the second bounding box generated by the model. Despite predicting two bounding boxes, we will later only keep the one with the greater confidence, as sketched after the figure below.

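The snippet below sketches how this selection could look for a single cell, assuming the 30-element layout described above, with the second box appended at indices 25 to 29.
import torch

pred = torch.rand(30)   # dummy prediction vector for one cell

conf1, conf2 = pred[20], pred[25]   # confidences of box 1 and box 2
best_box = pred[21:25] if conf1 >= conf2 else pred[26:30]   # x, y, w, h of the stronger box
print(torch.max(conf1, conf2), best_box)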
These unusual target and prediction vector dimensions required the authors to rethink the loss function. For regression problems we typically use MAE, MSE or RMSE, while for classification tasks we usually use cross-entropy loss. But YOLOv1 is more than just a regression or classification problem, considering that we have both continuous (bounding box) and discrete (class) values in the vector representation. For this reason, the authors created a new loss function specialized for this model, shown in Figure 5. This loss function is quite complex (you see, right?), so I decided to cover it in a separate article because there are many things to explain about it. Stay tuned, I will publish it very soon.

The YOLOv1 Architecture
Just like typical earlier computer vision models, YOLOv1 uses a CNN-based architecture as the backbone of the model. It comprises 24 convolution layers stacked according to the structure in Figure 6. If you take a closer look at the figure, you will notice that the output layer produces a tensor of shape 30×7×7. This means that every single cell has a corresponding prediction vector of length 30 containing the class and bounding box information of the detected object, which matches exactly with our earlier discussion.

Well, I think I have covered all the fundamentals of YOLOv1, so now let's start implementing the architecture from scratch with PyTorch. Before doing anything else, we first need to import the required modules and initialize the parameters S, B, and C. See Codeblock 1 below.
# Codeblock 1
import torch
import torch.nn as nn
S = 7
B = 2
C = 20
The three parameters I initialized above are the default values given in the paper, where S represents the number of grid cells along the horizontal and vertical axes, B denotes the number of bounding boxes generated by each cell, and C is the number of classes available in the dataset. Since we use S=7 and B=2, our YOLOv1 will produce 7×7×2=98 bounding boxes in total for each image, as the quick check below confirms.
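As a quick arithmetic check on these defaults, the two quantities we will keep running into later follow directly from S, B and C:
print(S * S * B)             # 98 bounding boxes predicted per image
print((C + B * 5) * S * S)   # 1470, the length of the final output vector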
The Building Block
Next, we are going to create the ConvBlock class, which contains a single convolution layer (line #(1)), a leaky ReLU activation function (#(2)), and an optional max-pooling layer (#(3)), as shown in Codeblock 2.
# Codeblock 2
class ConvBlock(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 padding,
                 maxpool_flag=False):
        super().__init__()
        self.maxpool_flag = maxpool_flag
        self.conv = nn.Conv2d(in_channels=in_channels,    #(1)
                              out_channels=out_channels,
                              kernel_size=kernel_size,
                              stride=stride,
                              padding=padding)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)    #(2)
        if self.maxpool_flag:
            self.maxpool = nn.MaxPool2d(kernel_size=2,    #(3)
                                        stride=2)

    def forward(self, x):
        print(f'original\t: {x.size()}')
        x = self.conv(x)
        print(f'after conv\t: {x.size()}')
        x = self.leaky_relu(x)
        print(f'after leaky relu: {x.size()}')
        if self.maxpool_flag:
            x = self.maxpool(x)
            print(f'after maxpool\t: {x.size()}')
        return x
In modern architectures, we typically use the Conv-BN-ReLU structure, but at the time YOLOv1 was created, batch normalization was not quite popular just yet, as it came out only a few months before YOLOv1. So, I assume this is probably the reason the authors did not utilize this normalization layer. Instead, the network only uses a stack of convolutions and leaky ReLUs throughout.
Just a quick refresher: leaky ReLU is an activation function similar to the standard ReLU, except that negative values are multiplied by a small number instead of being zeroed out. In the case of YOLOv1, we set the multiplier to 0.1 (#(2)) so that it can still preserve a little bit of the information contained in the negative input values.

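If you want to see this behavior directly, here is a tiny demo (not part of the model code) showing how negative_slope=0.1 treats negative inputs:
leaky = nn.LeakyReLU(negative_slope=0.1)
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(leaky(x))   # tensor([-0.2000, -0.0500, 0.0000, 1.5000])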
Now that the ConvBlock class has been defined, I am going to test it just to check whether it works properly. In Codeblock 3 below, I implement the very first layer in the network and pass a dummy tensor through it. You can see in the codeblock that in_channels is set to 3 (#(1)) and out_channels is set to 64 (#(2)) because we want this initial layer to accept an RGB image as the input and return a 64-channel image. The size of the kernel is 7×7 (#(3)), hence we need to set the padding to 3 (#(5)). Normally, this configuration allows us to preserve the spatial dimension of the image, but since we use stride=2 (#(4)), this padding size ensures that the image is exactly halved. Next, if you go back to Figure 6, you will notice that some conv layers are followed by a max-pooling layer and some others are not. Since the first convolution uses a max-pooling layer, we need to set the maxpool_flag parameter to True (#(6)).
# Codeblock 3
convblock = ConvBlock(in_channels=3,       #(1)
                      out_channels=64,     #(2)
                      kernel_size=7,       #(3)
                      stride=2,            #(4)
                      padding=3,           #(5)
                      maxpool_flag=True)   #(6)

x = torch.randn(1, 3, 448, 448)   #(7)
out = convblock(x)
Afterwards, we can simply generate a tensor of random values with dimensions 1×3×448×448 (#(7)), which simulates a batch containing a single RGB image of size 448×448, and then pass it through the network. You can see in the resulting output below that our convolution layer successfully increased the number of channels to 64 and halved the spatial dimension to 224×224. The halving was then done once again, all the way down to 112×112, thanks to the max-pooling layer.
# Codeblock 3 Output
original        : torch.Size([1, 3, 448, 448])
after conv      : torch.Size([1, 64, 224, 224])
after leaky relu: torch.Size([1, 64, 224, 224])
after maxpool   : torch.Size([1, 64, 112, 112])
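In case you prefer to verify these numbers rather than trust the printout, the standard output-size formula for convolution and max-pooling layers is out = floor((in + 2 * padding - kernel) / stride) + 1. The little helper below is my own addition, not part of the model:
def conv_out(size, kernel, stride, padding):
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(448, kernel=7, stride=2, padding=3))   # 224: the strided conv halves the input
print(conv_out(224, kernel=2, stride=2, padding=0))   # 112: the maxpool halves it again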
The Backbone
The next thing we are going to do is create a series of ConvBlocks to build the entire backbone of the network. In case you are not familiar with the term backbone, in this case it is essentially everything before the two fully-connected layers (refer to Figure 6). Now look at Codeblocks 4a and 4b below to see how I define the Backbone class.
# Codeblock 4a
class Backbone(nn.Module):
    def __init__(self):
        super().__init__()

        # in_channels, out_channels, kernel_size, stride, padding, maxpool_flag
        self.stage0 = ConvBlock(3, 64, 7, 2, 3, maxpool_flag=True)      #(1)
        self.stage1 = ConvBlock(64, 192, 3, 1, 1, maxpool_flag=True)    #(2)

        self.stage2 = nn.ModuleList([
            ConvBlock(192, 128, 1, 1, 0),
            ConvBlock(128, 256, 3, 1, 1),
            ConvBlock(256, 256, 1, 1, 0),
            ConvBlock(256, 512, 3, 1, 1, maxpool_flag=True)    #(3)
        ])

        self.stage3 = nn.ModuleList([])
        for _ in range(4):
            self.stage3.append(ConvBlock(512, 256, 1, 1, 0))
            self.stage3.append(ConvBlock(256, 512, 3, 1, 1))
        self.stage3.append(ConvBlock(512, 512, 1, 1, 0))
        self.stage3.append(ConvBlock(512, 1024, 3, 1, 1, maxpool_flag=True))    #(4)

        self.stage4 = nn.ModuleList([])
        for _ in range(2):
            self.stage4.append(ConvBlock(1024, 512, 1, 1, 0))
            self.stage4.append(ConvBlock(512, 1024, 3, 1, 1))
        self.stage4.append(ConvBlock(1024, 1024, 3, 1, 1))
        self.stage4.append(ConvBlock(1024, 1024, 3, 2, 1))    #(5)

        self.stage5 = nn.ModuleList([])
        self.stage5.append(ConvBlock(1024, 1024, 3, 1, 1))
        self.stage5.append(ConvBlock(1024, 1024, 3, 1, 1))
What we do in the above codeblock is instantiate ConvBlock instances according to the architecture given in the paper. There are a few things I want to emphasize here. First, the term stage I use in the code is not explicitly mentioned in the paper; I simply decided to use that word to describe the six groups of convolution layers in Figure 6. Second, notice that we need to set maxpool_flag to True for the last ConvBlock in each of the first four groups to perform spatial downsampling (#(1–4)). For the fifth group, the downsampling is done by setting the stride of the last convolution layer to 2 (#(5)). Third, Figure 6 does not mention the padding size of the convolution layers, so we need to work them out ourselves. There is indeed a formula to find the padding size based on the kernel size (see the sketch after this paragraph), but I feel it is much easier to just memorize the results: a 7×7 kernel needs padding 3 to preserve the spatial dimension, while 5×5, 3×3 and 1×1 kernels need padding 2, 1, and 0, respectively.
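For completeness, the formula I alluded to is simple: for a stride-1 convolution with an odd kernel size, padding = (kernel_size - 1) / 2 preserves the spatial dimension. A quick check against the numbers above:
def same_padding(kernel_size):
    return (kernel_size - 1) // 2

for k in (7, 5, 3, 1):
    print(k, same_padding(k))   # 7->3, 5->2, 3->1, 1->0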
Now that all layers in the backbone have been instantiated, we can connect them using the forward() method below. I don't think there is anything that needs further explanation here, since it basically just passes the input tensor x through the layers sequentially.
# Codeblock 4b
    def forward(self, x):
        print(f'original\t: {x.size()}\n')

        x = self.stage0(x)
        print(f'after stage0\t: {x.size()}\n')

        x = self.stage1(x)
        print(f'after stage1\t: {x.size()}\n')

        for i in range(len(self.stage2)):
            x = self.stage2[i](x)
            print(f'after stage2 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage3)):
            x = self.stage3[i](x)
            print(f'after stage3 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage4)):
            x = self.stage4[i](x)
            print(f'after stage4 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage5)):
            x = self.stage5[i](x)
            print(f'after stage5 #{i}\t: {x.size()}')

        return x
Now let's verify whether our implementation is correct by running the following test code.
# Codeblock 5
backbone = Backbone()
x = torch.randn(1, 3, 448, 448)
out = backbone(x)
If you run the above codeblock, the following output should appear on your screen. Here you can see that the spatial dimension of the image gets correctly reduced after the last ConvBlock of each stage. This process continues all the way to the final stage, until we eventually obtain a tensor of size 1024×7×7, which matches exactly with the illustration in Figure 6.
# Codeblock 5 Output
original : torch.Size([1, 3, 448, 448])
after stage0 : torch.Size([1, 64, 112, 112])
after stage1 : torch.Size([1, 192, 56, 56])
after stage2 #0 : torch.Size([1, 128, 56, 56])
after stage2 #1 : torch.Size([1, 256, 56, 56])
after stage2 #2 : torch.Size([1, 256, 56, 56])
after stage2 #3 : torch.Size([1, 512, 28, 28])
after stage3 #0 : torch.Size([1, 256, 28, 28])
after stage3 #1 : torch.Size([1, 512, 28, 28])
after stage3 #2 : torch.Size([1, 256, 28, 28])
after stage3 #3 : torch.Size([1, 512, 28, 28])
after stage3 #4 : torch.Size([1, 256, 28, 28])
after stage3 #5 : torch.Size([1, 512, 28, 28])
after stage3 #6 : torch.Size([1, 256, 28, 28])
after stage3 #7 : torch.Size([1, 512, 28, 28])
after stage3 #8 : torch.Size([1, 512, 28, 28])
after stage3 #9 : torch.Size([1, 1024, 14, 14])
after stage4 #0 : torch.Size([1, 512, 14, 14])
after stage4 #1 : torch.Size([1, 1024, 14, 14])
after stage4 #2 : torch.Size([1, 512, 14, 14])
after stage4 #3 : torch.Size([1, 1024, 14, 14])
after stage4 #4 : torch.Size([1, 1024, 14, 14])
after stage4 #5 : torch.Size([1, 1024, 7, 7])
after stage5 #0 : torch.Size([1, 1024, 7, 7])
after stage5 #1 : torch.Size([1, 1024, 7, 7])
The Fully-Connected Layers
With the backbone done, we can now move on to the fully-connected part, which I write in Codeblock 6 below. This part of the network is very straightforward since it essentially consists of just two linear layers. Speaking of the details, the paper mentions that the authors apply a dropout layer with a rate of 0.5 (#(3)) between the first (#(1)) and the second (#(4)) linear layers. It is important to note that the leaky ReLU activation function is still used (#(2)), but only after the first linear layer. This is because the second one acts as the output layer, hence it does not require any activation applied to it.
# Codeblock 6
class FullyConnected(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear0 = nn.Linear(in_features=1024*7*7, out_features=4096)    #(1)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)    #(2)
        self.dropout = nn.Dropout(p=0.5)    #(3)
        self.linear1 = nn.Linear(in_features=4096, out_features=(C+B*5)*S*S)    #(4)

    def forward(self, x):
        print(f'original\t: {x.size()}')

        x = self.linear0(x)
        print(f'after linear0\t: {x.size()}')

        x = self.leaky_relu(x)
        x = self.dropout(x)

        x = self.linear1(x)
        print(f'after linear1\t: {x.size()}')

        return x
Run Codeblock 7 below to see how the tensor transforms as it is processed by the stack of linear layers.
# Codeblock 7
fc = FullyConnected()
x = torch.randn(1, 1024*7*7)
out = fc(x)
# Codeblock 7 Output
original : torch.Size([1, 50176])
after linear0 : torch.Size([1, 4096])
after linear1 : torch.Size([1, 1470])
We can see in the above output that the fc block takes an input of size 50176, which is essentially the flattened 1024×7×7 tensor. The linear0 layer maps this input into a 4096-dimensional vector, and the linear1 layer finally maps it further to 1470. Later, in the post-processing stage, we need to reshape it to 30×7×7 so that we can easily take the bounding box and the object classification results. Technically speaking, this reshaping can be done either internally by the model or outside the model. For the sake of simplicity, I decided to leave the output flattened, meaning the reshaping will be handled externally, for example as sketched below.
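Here is a sketch of that external reshaping, reusing the out tensor from Codeblock 7 and assuming the channel ordering follows the vector layout we discussed earlier:
pred = out.reshape(-1, C + B * 5, S, S)   # (1, 1470) -> (1, 30, 7, 7)

class_scores = pred[:, :C]   # (1, 20, 7, 7): per-cell class probabilities
box1 = pred[:, C:C+5]        # (1, 5, 7, 7): confidence, x, y, w, h of box 1
box2 = pred[:, C+5:]         # (1, 5, 7, 7): the second predicted box
print(class_scores.size(), box1.size(), box2.size())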
Connecting the FC Part to the Backbone
At this point we already have both the backbone and the fully-connected layers done, so they are now ready to be assembled into the complete YOLOv1 architecture. There is not much to explain about the following code, as all we do here is instantiate both parts and connect them in the forward() method. Just don't forget to flatten (#(1)) the output of the backbone to make it compatible with the input of the fc block.
# Codeblock 8
class YOLOv1(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = Backbone()
        self.fc = FullyConnected()

    def forward(self, x):
        x = self.backbone(x)
        x = torch.flatten(x, start_dim=1)    #(1)
        x = self.fc(x)
        return x
To test our model, we can simply instantiate the YOLOv1 model and pass in a dummy tensor that simulates an RGB image of size 448×448 (#(1)). After feeding the tensor into the network (#(2)), I also simulate the post-processing step by reshaping the output tensor to 30×7×7, as shown at line #(3).
# Codeblock 9
yolov1 = YOLOv1()
x = torch.randn(1, 3, 448, 448) #(1)
out = yolov1(x) #(2)
out = out.reshape(-1, C+B*5, S, S) #(3)
And below is what the output looks like after the code above is run. Here you can see that our input tensor successfully flows through all the layers of the entire network, indicating that our YOLOv1 model works properly and is thus ready to train.
# Codeblock 9 Output
original : torch.Size([1, 3, 448, 448])
after stage0 : torch.Size([1, 64, 112, 112])
after stage1 : torch.Size([1, 192, 56, 56])
after stage2 #0 : torch.Size([1, 128, 56, 56])
after stage2 #1 : torch.Size([1, 256, 56, 56])
after stage2 #2 : torch.Size([1, 256, 56, 56])
after stage2 #3 : torch.Size([1, 512, 28, 28])
after stage3 #0 : torch.Size([1, 256, 28, 28])
after stage3 #1 : torch.Size([1, 512, 28, 28])
after stage3 #2 : torch.Size([1, 256, 28, 28])
after stage3 #3 : torch.Size([1, 512, 28, 28])
after stage3 #4 : torch.Size([1, 256, 28, 28])
after stage3 #5 : torch.Size([1, 512, 28, 28])
after stage3 #6 : torch.Size([1, 256, 28, 28])
after stage3 #7 : torch.Size([1, 512, 28, 28])
after stage3 #8 : torch.Size([1, 512, 28, 28])
after stage3 #9 : torch.Size([1, 1024, 14, 14])
after stage4 #0 : torch.Size([1, 512, 14, 14])
after stage4 #1 : torch.Size([1, 1024, 14, 14])
after stage4 #2 : torch.Size([1, 512, 14, 14])
after stage4 #3 : torch.Size([1, 1024, 14, 14])
after stage4 #4 : torch.Size([1, 1024, 14, 14])
after stage4 #5 : torch.Size([1, 1024, 7, 7])
after stage5 #0 : torch.Size([1, 1024, 7, 7])
after stage5 #1 : torch.Size([1, 1024, 7, 7])
original : torch.Size([1, 50176])
after linear0 : torch.Size([1, 4096])
after linear1 : torch.Size([1, 1470])
torch.Size([1, 30, 7, 7])
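As an optional check that is not from the paper, you can also count the trainable parameters. The first linear layer alone holds 50176 × 4096 weights, so it dominates the total:
num_params = sum(p.numel() for p in yolov1.parameters())
print(f'{num_params:,} trainable parameters')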
Ending
It might be worth noting that all the code I showed you throughout this article is for the base YOLOv1 architecture. The paper mentions that the authors also proposed a lite version of this model, which they refer to as Fast YOLO. This smaller YOLOv1 variant offers faster computation since it only consists of 9 convolution layers instead of 24. Unfortunately, the paper does not provide the implementation details, so I cannot demonstrate how to implement that one.
Here I encourage you to play around with the above code. In theory, it is possible to replace the CNN-based backbone with other deep learning models, such as ResNet, ResNeXt, ViT, etc. All you need to do is match the output shape of the backbone with the input shape of the fully-connected part; see the sketch after this paragraph. Beyond that, I also encourage you to try training this model from scratch. If you decide to do so, you will probably want to make the model smaller by reducing its depth (number of convolution layers) or width (number of kernels). This is mainly because the authors mentioned that they needed around a week just to do the pretraining on the ImageNet dataset, not to mention the time for fine-tuning on the object detection task.
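Below is a hedged sketch of what such a backbone swap could look like, using torchvision's ResNet-50 as the feature extractor. This is my own construction, not something from the paper: the adaptive pooling and the 1×1 projection exist only to force the output back into the 1024×7×7 shape that our FullyConnected class expects.
import torch.nn as nn
from torchvision.models import resnet50

class ResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = resnet50(weights=None)   # swap in pretrained weights if you have them
        self.features = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc
        self.pool = nn.AdaptiveAvgPool2d((7, 7))              # force a 7x7 spatial grid
        self.project = nn.Conv2d(2048, 1024, kernel_size=1)   # match the 1024 channels

    def forward(self, x):
        x = self.features(x)     # (N, 2048, 14, 14) for a 448x448 input
        x = self.pool(x)         # (N, 2048, 7, 7)
        return self.project(x)   # (N, 1024, 7, 7), same shape as our original backbone

model = YOLOv1()
model.backbone = ResNetBackbone()   # the rest of the network stays untouched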
And well, I think that's pretty much everything I can explain to you about how YOLOv1 works and its architecture. Please let me know if you spot any mistakes in this article. Thanks for reading!
By the way, you can also find the code used in this article in my GitHub repo [7].
References
[1] Joseph Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. ArXiv. https://arxiv.org/pdf/1506.02640 [Accessed July 5, 2025].
[2] Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. ArXiv. https://arxiv.org/pdf/1311.2524 [Accessed July 5, 2025].
[3] Mengqi Lei et al. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. ArXiv. https://arxiv.org/abs/2506.17733 [Accessed July 5, 2025].
[4] Image generated by author with Gemini, edited by author.
[5] Image originally created by author.
[6] Bing Xu et al. Empirical Evaluation of Rectified Activations in Convolutional Network. ArXiv. https://arxiv.org/pdf/1505.00853 [Accessed July 5, 2025].
[7] MuhammadArdiPutra. The Day YOLO First Saw the World - YOLOv1. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Day%20YOLO%20First%20Saw%20the%20World%20-%20YOLOv1.ipynb [Accessed July 7, 2025].

