FPN Paper Walkthrough: Leveraging the Internal Pyramid

Previously I talked about YOLOv3 [1]. One of the things that makes this YOLO version better than its predecessors is its ability to detect small objects, thanks to the FPN-like neck adopted by the model. Unfortunately, my explanation of FPN in that article was not very thorough since I was focusing more on YOLOv3 itself. Thus, in this article I decided to write specifically about FPN, based on its original paper titled “Feature Pyramid Networks for Object Detection” [2], in order to get a better understanding of what it actually is and how it works. Not only that, here I will also demonstrate how to implement FPN from scratch and how to connect it with a CNN backbone and an RPN head.

Backbone, Neck, and Head

Before we get into FPN, we first need to know that the structure of an object detection model differs from that of a classification model, and the main difference lies in the last layer. In a typical classification model, the last layer contains a number of neurons, each of which corresponds to a single class in the dataset. Or, in the case of binary classification, the output layer can consist of a single neuron, which is responsible for predicting whether a sample belongs to class 0 or 1. This kind of output layer is not suitable for a detection task, since detection also requires neurons dedicated to predicting the location and the size of an object in addition to its class.

So, in order for a model to be able to predict object location and size, we need to replace the output layer, i.e., the classification head, with the so-called detection head. The remaining layers (everything except the head) are commonly referred to as the backbone. Some models that use this structure are YOLOv1 and YOLOv2, which use a stack of convolution layers as the backbone and a special head for predicting object location and size within an image as well as its class.

Older object detection models like YOLOv1 and YOLOv2 mentioned above only consist of a backbone and a head. As time went on, researchers found that this structure was not quite optimal, so they came up with the idea of adding a new component called the neck. As the name suggests, this is essentially something we place between the backbone and the head. And FPN, which we’re going to talk about in this article, is one of the earliest necks proposed for object detection models. Take a look at Figure 1 below to see the high-level architectural view of the older and modern object detection models.

Figure 1. The architecture of an object detection model in general [3].

The backbone of a model is basically responsible for performing feature extraction, while the neck is useful for enhancing feature quality, and the head is for making predictions. Based on this notion, we can say that by applying FPN, a network can potentially achieve better accuracy thanks to the feature enhancement performed by the neck.

The Evolution of Multi-Scale Detection Mechanism

Previously I mentioned that using a backbone and a detection head without a neck is not quite optimal. This is especially true regarding its capability in detecting small objects. Now let’s take a look at Figure 2 below. The first two YOLO versions I mentioned earlier utilize the structure in image (b), where the bounding box and the object class predictions are made only on top of the feature map produced by the deepest layer in the backbone. This method is valid, but it is effective only for large objects. The reason is quite simple: as an image gets deeper into a network, the spatial dimension shrinks, and more importantly, each value in the deeper feature maps becomes a representation of multiple neighboring pixels in the shallower ones, which causes the spatial information to blend. As a result, feature maps from deeper layers have a large receptive field, allowing large objects to be detected and recognized easily. However, the degradation of spatial information as we get deeper prevents us from detecting small objects accurately, because we need detailed pixel locations in order to predict the precise coordinates of the objects.

Furthermore, the receptive field size of a feature map is positively correlated with the amount of semantic information it contains. In the figure below, a feature map with high semantic information is indicated by a thick blue outline. This is essentially why the deepest feature map in (b) has the thickest outline.

Figure 2. Comparison of different feature pyramid architectures [2].
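To put a number on how the receptive field grows with depth, here is a small sketch in plain Python. The layer stack is hypothetical (five repetitions of a 3×3 convolution with stride 1 followed by a 3×3 pooling with stride 2), and the function name receptive_field is my own; it simply applies the standard recurrence where each layer widens the field by (kernel − 1) times the current cumulative stride.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) after each (kernel, stride) layer."""
    rf, jump = 1, 1
    history = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride              # striding spreads neighbouring outputs further apart
        history.append((rf, jump))
    return history

# Hypothetical stack: five repetitions of 3x3 conv (stride 1) + 3x3 pool (stride 2)
stack = [(3, 1), (3, 2)] * 5
history = receptive_field(stack)

print(history[1])   # (5, 2)    after the first conv + pool pair
print(history[-1])  # (125, 32) after the last pooling layer
```

Under these assumptions, a single value in the deepest map summarizes a 125×125 region of the input, which is why precise locations of small objects are hard to recover from it.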

The most straightforward way to allow a network to detect large and small objects simultaneously is to use a featurized image pyramid (a). This method is able to achieve high accuracy because we can make predictions from different image resolutions. What’s essentially done here is that we rescale our input image into multiple scales, perform feature extraction independently on each scale, and make predictions on the resulting feature maps. The smaller feature map is responsible for detecting large objects, while the larger one is specialized to detect small objects thanks to its detailed spatial information. However, this method is computationally expensive since we need to process multiple versions of the raw image at different scales.

Another solution was proposed by the authors of SSD (Single Shot MultiBox Detector), which in Figure 2 above is the one referred to as the pyramidal feature hierarchy (c). So, instead of feeding the network with the same image at different sizes, the authors of SSD use only the largest image and utilize the internal pyramidal structure of the CNN backbone to make predictions at different scales. This approach makes the system computationally cheaper than option (a). However, here we get a tradeoff like in (b), where the feature map from the deeper layer contains a large amount of semantic information yet has minimal spatial information, while the feature map from the shallower layer has a lot of spatial information but not that much semantic information. It is important to note that detailed spatial information might not be that important for large objects, as we can still approximate the general shape of the object. However, both spatial and semantic information are crucial for detecting small objects, because the model needs not only the detailed coordinates but also an understanding of what is actually inside the bounding box. So, while it’s true that method (c) is able to detect both large and small objects, its ability to detect the latter is not yet optimal.

And here’s where FPN comes in as a solution. If we take a look at image (d) in Figure 2, we can see that the predictions are made on top of feature maps that are all semantically rich. This essentially allows objects of different scales, including the smaller ones, to be detected accurately. We’re going to talk about the details of how FPN enriches feature maps in the next section.

How FPN Works

The idea of FPN is to inject information from the deeper feature maps into the shallower ones, so that the shallower feature maps contain not only high spatial information but also high semantic information coming from the deeper part of the network. In theory, this should result in better detection accuracy on small objects since the large feature maps are now enriched with a lot of semantic information. In order to achieve this, the authors introduce the so-called top-down pathway and lateral connections. You can see the complete FPN architecture in Figure 3 below, which is essentially the detailed version of the one in Figure 2 (d).

Figure 3. The detailed FPN architecture [3].

The authors of the paper decided to use ResNet-50 and ResNet-101 as the backbone. Suppose we were to use the former; we would then have the conv2, conv3, conv4, and conv5 stages repeated 3, 4, 6, and 3 times, respectively, as indicated by the architectural details of ResNet in Figure 4. C2, C3, C4, and C5, which are the tensors produced by the last layer of the corresponding stage, are going to be transferred to the top-down pathway via lateral connections, i.e., the arrows going out from the backbone.

Figure 4. The ResNet architecture [3, 4].

The top-down pathway is used to transfer semantic information from deeper layers, while the lateral connections are used to preserve spatial information. We aggregate the two by performing element-wise summation, the detailed process of which is given in Figure 5 below. For the tensors that come from the backbone (C), we first need to apply a 1×1 conv to them. This convolution layer is responsible for adjusting the number of channels so that it matches that of the tensor coming from the top-down pathway. The tensor from the top-down pathway itself (M+1, i.e., the M tensor of the next deeper level) undergoes 2× nearest-neighbor upsampling. These steps are needed because both tensors must have exactly the same dimensions for the element-wise summation to be performed. Once the summation is done, the resulting tensor is referred to as M. This tensor has some aliasing effect due to the upsampling process we did earlier, hence we need to apply a 3×3 convolution to reduce that effect. Finally, we obtain the P tensor, which is ready to be forwarded to the detection head.

Figure 5. How feature maps from the lateral connection (C) and the top-down pathway (M+1) are aggregated [3].
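To make this aggregation concrete before the full implementation, here is a minimal sketch of a single merge step in PyTorch. The tensor shapes (a 1024-channel C4 at 14×14 and a 256-channel M5 at 7×7) and the variable names are assumptions I picked to mimic ResNet-50 with a 224×224 input; random tensors stand in for real feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes: C4 comes from the backbone, M5 from the top-down pathway.
c4 = torch.randn(1, 1024, 14, 14)
m5 = torch.randn(1, 256, 7, 7)

lateral = nn.Conv2d(1024, 256, kernel_size=1)            # 1x1 conv: match the channel count
smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)   # 3x3 conv: reduce aliasing

m5_up = F.interpolate(m5, scale_factor=2, mode="nearest")  # 7x7 -> 14x14
m4 = lateral(c4) + m5_up                                    # element-wise summation
p4 = smooth(m4)                                             # map forwarded to the head

print(m5_up.shape, m4.shape, p4.shape)
```

Both inputs to the summation end up as 256×14×14 tensors, which is exactly the condition the 1×1 conv and the upsampling are there to satisfy.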

Keep in mind that all processes described in Figure 5 above only apply to M2, M3, and M4. Computing M5 is actually much simpler (see Figure 6), where the only thing we need to do is to adjust the number of channels using a 1×1 conv to make it uniform with the tensors in the other lateral connections. The M5 tensor itself does not need to be processed further with a 3×3 conv because there is nothing to be smoothed out due to the absence of upsampling. And so, we can basically say that P5 is the very same tensor as M5.

Figure 6. The way to compute M5 and P5 is slightly different from that of M2-M4 and P2-P4 [3].

And well, I think that’s everything about the theory behind FPN. In the next section I’m going to take you into the lower-level view of the architecture by implementing it from scratch with PyTorch.

FPN From Scratch

CNN Backbone

As seen in Codeblock 1 below, the very first thing we need to do in the code is to import the required modules.

# Codeblock 1
import torch
import torch.nn as nn

Since the focus of this article is FPN, here I will use a dummy model for the backbone instead of the actual ResNet in order to simplify things. Nevertheless, the layers in the code are named according to Figures 3 and 4: conv1, conv2, conv3, conv4, and conv5, as shown in Codeblock 2 below. The output tensor dimension of each stage is also set according to the original ResNet architecture. So, although this backbone is just a plain CNN-based model, you can think of it like a normal ResNet.

Next, what we do inside the forward() method is to connect all the layers. If you take a closer look at the code, you’ll notice that each convolution layer is followed by a ReLU activation function and a maxpooling layer. The maxpooling layer is set to have a stride of 2, which effectively halves the spatial dimension of the feature map. By repeating the maxpooling layer multiple times, our feature map gradually gets smaller as we go deeper into the network. This essentially creates a pyramidal structure within the CNN backbone, which is leveraged by FPN to achieve high detection accuracy on varying object scales. In CNNs, reducing the spatial dimension like this is a standard practice to reduce computational complexity and compensate for the increase in the number of channels.

Still within the forward() method, don’t forget to clone the main tensor x as shown on the lines marked with #(1), #(2), and #(3). The copied tensors, which are named c2, c3, and c4, will then be the return values of the CNN class alongside the feature map from the main flow (c5) (#(4)).

# Codeblock 2
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Shared activation and pooling layers (neither has learnable parameters)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        
        # One conv layer per stage; channel counts follow ResNet
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=64, out_channels=256, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_channels=1024, out_channels=2048, kernel_size=3, padding=1)
        
    def forward(self, x):
        print(f'original\t: {x.size()}\n')
        
        x = self.relu(self.conv1(x))
        print(f'after conv1\t: {x.size()}')
        
        x = self.maxpool(x)
        print(f'after maxpool\t: {x.size()}\n')
        
        x = self.relu(self.conv2(x))
        print(f'after conv2\t: {x.size()}')
        
        x = self.maxpool(x)
        print(f'after maxpool\t: {x.size()}\n')
        
        c2 = x.clone()             #(1)
        
        x = self.relu(self.conv3(x))
        print(f'after conv3\t: {x.size()}')
        
        x = self.maxpool(x)
        print(f'after maxpool\t: {x.size()}\n')
        
        c3 = x.clone()             #(2)
        
        x = self.relu(self.conv4(x))
        print(f'after conv4\t: {x.size()}')
        
        x = self.maxpool(x)
        print(f'after maxpool\t: {x.size()}\n')
        
        c4 = x.clone()             #(3)
        
        x = self.relu(self.conv5(x))
        print(f'after conv5\t: {x.size()}')
        
        c5 = self.maxpool(x)
        print(f'after maxpool\t: {c5.size()}\n')
        
        return c2, c3, c4, c5      #(4)

Now that the CNN class is done, we will try to pass a dummy RGB image of size 224×224 through the network. This tensor dimension is chosen based on the input shape of the original ResNet.

# Codeblock 3
cnn = CNN()

x = torch.randn(1, 3, 224, 224)
out_cnn = cnn(x)

And below is what the output looks like. Here we can see that the number of channels after each conv layer matches exactly with the ResNet structure given in Figure 4. Not only that, we can also see that the spatial dimension of our dummy tensor is halved after every stage thanks to the maxpooling layers. This essentially indicates that our simple CNN model mimics the general structure of a ResNet model.

# Codeblock 3 Output
original      : torch.Size([1, 3, 224, 224])

after conv1   : torch.Size([1, 64, 224, 224])
after maxpool : torch.Size([1, 64, 112, 112])

after conv2   : torch.Size([1, 256, 112, 112])
after maxpool : torch.Size([1, 256, 56, 56])

after conv3   : torch.Size([1, 512, 56, 56])
after maxpool : torch.Size([1, 512, 28, 28])

after conv4   : torch.Size([1, 1024, 28, 28])
after maxpool : torch.Size([1, 1024, 14, 14])

after conv5   : torch.Size([1, 2048, 14, 14])
after maxpool : torch.Size([1, 2048, 7, 7])
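
The halving seen in the output above can be checked by hand with the usual pooling output-size formula, floor((size + 2·padding − kernel) / stride) + 1. Below is a small sketch in plain Python; the helper name pool_out is my own, and its default arguments are the maxpool settings from Codeblock 2.

```python
def pool_out(size, kernel=3, stride=2, padding=1):
    """Output length of one spatial axis after pooling (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
sizes = [size]
for _ in range(5):          # five max-pooling layers in the dummy backbone
    size = pool_out(size)
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28, 14, 7]
```

The same formula with stride 1 returns the input size unchanged, which is why the 3×3 convolutions with padding 1 leave the spatial dimension as it is.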

We can also check what the returned tensors look like by running the code below. You can see in the resulting output that the c2 tensor has a shape of 256×56×56, c3 has a shape of 512×28×28, and so on. By the way, you can just ignore the number 1 in the 0th axis, as it only indicates the number of samples we pass within a single batch.

# Codeblock 4
c2, c3, c4, c5 = out_cnn

print(c2.shape)
print(c3.shape)
print(c4.shape)
print(c5.shape)

# Codeblock 4 Output
torch.Size([1, 256, 56, 56])
torch.Size([1, 512, 28, 28])
torch.Size([1, 1024, 14, 14])
torch.Size([1, 2048, 7, 7])

FPN Neck

Now that the CNN backbone is done, let’s move on to the FPN neck. In Codeblock 5 below, we first initialize the upsample layer (#(1)), which we will use every time we want to double the spatial dimension of an M tensor. Here I set the mode parameter to nearest as suggested in the paper, which is a very simple interpolation method that keeps the process fast. Take a look at Figure 7 to see what nearest-neighbor interpolation looks like.

# Codeblock 5
class FPN(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')    #(1)
        
        # 1x1 lateral convs: bring every C tensor to 256 channels
        self.lateral_c5 = nn.Conv2d(in_channels=2048, out_channels=256, kernel_size=1)
        self.lateral_c4 = nn.Conv2d(in_channels=1024, out_channels=256, kernel_size=1)
        self.lateral_c3 = nn.Conv2d(in_channels=512,  out_channels=256, kernel_size=1)
        self.lateral_c2 = nn.Conv2d(in_channels=256,  out_channels=256, kernel_size=1)
        
        # 3x3 smoothing convs: reduce the aliasing caused by upsampling
        self.smooth_m4  = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1)
        self.smooth_m3  = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1)
        self.smooth_m2  = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1)
        
    def forward(self, c2, c3, c4, c5):
        # Top of the pyramid: no upsampling, so no smoothing is applied
        m5 = self.lateral_c5(c5)
        p5 = m5
        
        m4 = self.upsample(m5) + self.lateral_c4(c4)
        p4 = self.smooth_m4(m4)
        
        m3 = self.upsample(m4) + self.lateral_c3(c3)
        p3 = self.smooth_m3(m3)
        
        m2 = self.upsample(m3) + self.lateral_c2(c2)
        p2 = self.smooth_m2(m2)
        
        return p2, p3, p4, p5

Figure 7. An example of a 2× upsampling process with the nearest-neighbor interpolation method [3].
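
If you prefer to see the interpolation in numbers, the sketch below reproduces 2× nearest-neighbor upsampling in plain Python on a tiny 2×2 grid. The function name upsample_nearest_2x is hypothetical and exists only for this illustration; in the actual model the job is done by the nn.Upsample layer.

```python
def upsample_nearest_2x(grid):
    """Repeat every value twice along both axes of a 2D list."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                             # duplicate the row
    return out

small = [[1, 2],
         [3, 4]]

for row in upsample_nearest_2x(small):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

Every value simply turns into a 2×2 block, so no new information is created. This blockiness is the pixelation that the 3×3 smoothing convs are meant to soften.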

If you go back to Codeblock 4, you will see that the c{5,4,3,2} tensors returned by the backbone have different numbers of channels. This is basically why we initialize the lateral_c{5,4,3,2} layers to process these tensors, so that the resulting channel counts will be uniform. According to the paper, we need to set these convolution layers to produce 256 output channels, which is why we use that number for the out_channels parameter.

Next, based on Figure 7, you can just imagine how pixelated the resulting feature maps are after being upsampled. Thus, we need to process the m{4,3,2} tensors further with the 3×3 conv layers, which I refer to as smooth_m{4,3,2}. Once all layers have been initialized, what we need to do next is to assemble them in the forward() method according to the structure I showed you earlier in Figure 3.

In addition to this, the paper also mentions that there are no nonlinearities in these extra layers, which is why none of the convolution layers in the FPN class above is followed by ReLU. Now in Codeblock 6 below I pass the C tensors we obtained earlier through the FPN neck we just created. We can see in the output that the resulting tensors have different spatial resolutions. Later on, the p2 tensor (the one with the 56×56 dimension) will be forwarded to a detection head to detect small objects, while p5 (the 7×7 tensor) is going to be responsible for large objects.

# Codeblock 6
fpn = FPN()

out_fpn = fpn(c2, c3, c4, c5)
p2, p3, p4, p5 = out_fpn

print(p2.shape)
print(p3.shape)
print(p4.shape)
print(p5.shape)

# Codeblock 6 Output
torch.Size([1, 256, 56, 56])
torch.Size([1, 256, 28, 28])
torch.Size([1, 256, 14, 14])
torch.Size([1, 256, 7, 7])

Here we have already completed the FPN part. Remember that FPN is just the neck of a detection model, which essentially means that at this point we still haven’t obtained any bounding box prediction yet. In order to actually obtain prediction results, we need to connect a special head to the FPN, and in this case I will use the RPN (Region Proposal Network) head.

RPN Head

In case you’re not yet familiar with RPN, it is essentially the part of an object detection model used for proposing bounding boxes, which was first introduced in the Faster R-CNN paper [4]. Note that while in this demonstration we refer to the RPN as a head, keep in mind that it is actually not a complete detection head since it has no capability to classify the detected objects.

We can see in the RPN architecture below that it uses the so-called cls layer and reg layer, which produce the objectness scores and the bounding box coordinates, respectively. The objectness score tensor has a length of 2k, where k is the number of predetermined anchor boxes and 2 is the number of scores per anchor box, i.e., whether or not an object is present there. We can think of this like a binary classification handled with a one-hot representation (object/non-object). Meanwhile, the number 4 in the 4k length of the coordinates tensor simply corresponds to the xywh prediction.

Figure 8. The structure of RPN [4].

Going back to our code implementation, in Codeblock 7 below we initialize the intermediate, cls, and reg layers within the __init__() method of the RPN class. Note that the intermediate layer is the only one that uses a 3×3 convolution, while both the cls and reg layers use 1×1 convs. Regarding the number of channels, the intermediate layer maps the input tensor into 256 channels, while the cls and reg layers map it into 2k and 4k channels, respectively. Finally, we can simply connect these layers within the forward() method.

# Codeblock 7
NUM_ANCHORS = 3

class RPN(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.intermediate = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1)
        
        # 2 objectness scores and 4 box values per anchor at every location
        self.cls = nn.Conv2d(in_channels=256, out_channels=NUM_ANCHORS*2, kernel_size=1)
        self.reg = nn.Conv2d(in_channels=256, out_channels=NUM_ANCHORS*4, kernel_size=1)
    
    def forward(self, x):
        x = self.intermediate(x)
        
        objectness_scores = self.cls(x)
        bbox_regressions  = self.reg(x)
        
        return objectness_scores, bbox_regressions

Now let’s test if our RPN class works properly by running Codeblock 8 below. Here I test it on the p2 feature map we obtained from Codeblock 6.

# Codeblock 8
rpn = RPN()

p2_objectness, p2_bbox = rpn(p2)

print(p2_objectness.shape)
print(p2_bbox.shape)

Below is what the resulting output looks like. You can see that p2_objectness is a tensor of size 6×56×56, indicating that every single pixel in the 56×56 spatial dimension contains 6 prediction values, where the first 2 values are for the first anchor box, the next 2 values are for the second anchor box, and the last 2 values are for the third one. The same thing also applies to the p2_bbox tensor, except that each anchor box gets 4 values, namely xywh.

# Codeblock 8 Output
torch.Size([1, 6, 56, 56])
torch.Size([1, 12, 56, 56])
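
To work with these predictions per anchor, it is often handy to regroup the channel axis so that every row holds the two scores of one anchor box. The snippet below is only a sketch of that idea and is not part of the original implementation: it assumes the channel layout described above (two consecutive channels per anchor) and uses a random tensor as a stand-in for p2_objectness.

```python
import torch

NUM_ANCHORS = 3
scores = torch.randn(1, NUM_ANCHORS * 2, 56, 56)   # stand-in for p2_objectness

# (N, 2k, H, W) -> (N, H, W, 2k) -> (N, H*W*k, 2)
per_anchor = scores.permute(0, 2, 3, 1).reshape(1, -1, 2)

print(per_anchor.shape)  # torch.Size([1, 9408, 2])
```

Here 9408 is 56×56×3, i.e., one row per anchor box at every spatial location. The same permute-and-reshape pattern with a last dimension of 4 would regroup the p2_bbox tensor.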

The Complete Detection Model

In Codeblock 9 below we’re going to assemble the entire detection model in order to better understand how FPN works together with the other components. Here in the __init__() method I initialize the CNN backbone, the FPN neck, and the RPN head. In the forward() method, we first pass the image tensor into the CNN (#(1)). This backbone returns four tensors, which are ready to be connected to the FPN through the lateral connections. Next, at line #(2) we feed all the C tensors as the input of the FPN, producing the P tensors. Finally, we use all the P tensors as the input to the RPN (#(3–4)). Keep in mind that the RPN head shares its parameters across all pyramid levels, so we only need to initialize it once and use it for all feature maps of different scales.

# Codeblock 9
class DetectionModel(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.cnn = CNN()
        self.fpn = FPN()
        self.rpn = RPN()
        
    def forward(self, x):
        
        c2, c3, c4, c5 = self.cnn(x)                 #(1)
        p2, p3, p4, p5 = self.fpn(c2, c3, c4, c5)    #(2)
        
        # The same RPN instance (shared weights) is applied to every level
        p2_pred = self.rpn(p2)        #(3)
        p3_pred = self.rpn(p3)
        p4_pred = self.rpn(p4)
        p5_pred = self.rpn(p5)        #(4)
        
        return p2_pred, p3_pred, p4_pred, p5_pred

Now that the detection model is complete, we can test it with Codeblock 10 below. Here I pass a dummy tensor of size 1×3×224×224, simulating a single RGB image of size 224×224 (#(1)). Next, we can just pass it through the detection_model (#(2)) and unpack the prediction results (#(3–4)).

# Codeblock 10
detection_model = DetectionModel()

x = torch.randn(1, 3, 224, 224)     #(1)
p2_pred, p3_pred, p4_pred, p5_pred = detection_model(x)  #(2)

p2_objectness, p2_bbox = p2_pred    #(3)
p3_objectness, p3_bbox = p3_pred
p4_objectness, p4_bbox = p4_pred
p5_objectness, p5_bbox = p5_pred    #(4)
        
print(p2_objectness.shape)
print(p3_objectness.shape)
print(p4_objectness.shape)
print(p5_objectness.shape)
print()

print(p2_bbox.shape)
print(p3_bbox.shape)
print(p4_bbox.shape)
print(p5_bbox.shape)

Below is what the output looks like (the shape logs printed by the backbone are omitted here). You can see that the resulting tensor dimensions are as intended, where the objectness and bbox tensors contain 6 and 12 values for each grid cell, respectively. So, I believe this implementation is correct and thus ready to be trained for an object detection task.

# Codeblock 10 Output
torch.Size([1, 6, 56, 56])
torch.Size([1, 6, 28, 28])
torch.Size([1, 6, 14, 14])
torch.Size([1, 6, 7, 7])

torch.Size([1, 12, 56, 56])
torch.Size([1, 12, 28, 28])
torch.Size([1, 12, 14, 14])
torch.Size([1, 12, 7, 7])
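
As a quick sanity check on these shapes, we can count how many anchor boxes the model predicts in total for a single 224×224 image. The arithmetic below is a plain-Python sketch that uses the spatial sizes from the output above and the NUM_ANCHORS value from Codeblock 7; the dictionary names are my own.

```python
NUM_ANCHORS = 3
level_sizes = {"p2": 56, "p3": 28, "p4": 14, "p5": 7}   # spatial size of each level

# Every spatial location predicts NUM_ANCHORS boxes
anchors_per_level = {name: s * s * NUM_ANCHORS for name, s in level_sizes.items()}
total_anchors = sum(anchors_per_level.values())

print(anchors_per_level)  # {'p2': 9408, 'p3': 2352, 'p4': 588, 'p5': 147}
print(total_anchors)      # 12495
```

Roughly three quarters of the predictions come from p2 alone, which reflects how much of the model's capacity for small objects sits in the highest-resolution level.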

Ending

I think that’s pretty much everything about the underlying theory and the from-scratch implementation of FPN. Here I challenge you to try implementing FPN on the actual ResNet instead of a dummy CNN model like the one I demonstrated above. I actually have a separate article about ResNet, which you can check as a reference [5]. Or, it is also possible to use other models if you want, such as VGG, ResNeXt, ConvNeXt, etc., since FPN can basically work on any CNN-based backbone. Not only that, it would be even better if you could implement a YOLO-style head as a replacement for the RPN, examples of which can be seen in my previous articles given at references [6] for YOLOv1, [7] for YOLOv2, and [1] for YOLOv3.

Please let me know if there are mistakes in my writing or in the code. Thanks for reading! By the way, you can find the code used in this article in my GitHub repo [8].

References

[1] Muhammad Ardi. YOLOv3 Paper Walkthrough: Even Better, but Not That Much. Medium. https://ai.gopubby.com/yolov3-paper-walkthrough-even-better-but-not-that-much-4dc6c0c1b42c [Accessed June 1, 2026].

[2] Tsung-Yi Lin et al. Feature Pyramid Networks for Object Detection. arXiv. https://arxiv.org/abs/1612.03144 [Accessed September 9, 2025].

[3] Image originally created by author.

[4] Shaoqing Ren et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv. https://arxiv.org/abs/1506.01497 [Accessed September 9, 2025].

[5] Muhammad Ardi. Paper Walkthrough: Residual Network (ResNet). Python in Plain English. https://medium.com/python-in-plain-english/paper-walkthrough-residual-network-resnet-62af58d1c521 [Accessed September 9, 2025].

[6] Muhammad Ardi. YOLOv1 Paper Walkthrough: The Day YOLO First Saw the World. Medium. https://medium.com/ai-advances/yolov1-paper-walkthrough-the-day-yolo-first-saw-the-world-ccff8b60d84b [Accessed June 1, 2026].

[7] Muhammad Ardi. YOLOv2 & YOLO9000 Paper Walkthrough: Better, Faster, Stronger. Medium. https://ai.gopubby.com/yolov2-yolo9000-paper-walkthrough-better-faster-stronger-c9906e0438a3 [Accessed June 1, 2026].

[8] MuhammadArdiPutra. FPN. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/FPN.ipynb [Accessed September 9, 2025].
