MobileNetV2 Paper Walkthrough: The Smarter Tiny Giant

Introduction

MobileNet was a breakthrough in the field of computer vision because it proved that deep learning models don’t necessarily need to be computationally expensive to achieve high accuracy. Last month I posted an article where I explained everything about the model as well as its PyTorch implementation from scratch. Check the link at reference number [1] at the end of this article if you are interested in reading it. This first version of MobileNet was proposed back in April 2017 in a paper titled MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [2] by Howard et al. from Google. Not long after, in January 2018 to be precise, Sandler et al. from the same institution introduced the successor of MobileNetV1 in a paper titled MobileNetV2: Inverted Residuals and Linear Bottlenecks [3], which brings significant improvement over the previous one in terms of both accuracy and efficiency. In this article, I’m going to walk you through the ideas proposed in the MobileNetV2 paper and show you how to implement the architecture from scratch.

The Improvements

The first version of MobileNet relies solely on the so-called depthwise separable convolution layers. It is indeed necessary to acknowledge that using these layers instead of standard convolutions allows the model to be extremely lightweight. However, the authors thought that this architecture could still be improved even further. They came up with an idea where, instead of only using depthwise separable convolutions, they also adopted the inverted residual and linear bottleneck mechanisms, which is where the title of the MobileNetV2 paper came from.

Inverted Residual

If you’re familiar with ResNet, I believe you already know the so-called bottleneck block. For those who don’t, it’s essentially a mechanism where the building block of the network works by following the wide → narrow → wide pattern. Figure 1 below displays the illustration of a bottleneck block used in ResNet. Here we can see that it initially accepts a 256-channel tensor, shrinks it to 64, and expands it back to 256.

Figure 1. The building block of ResNet, commonly referred to as “bottleneck” [4].

The inverted version of the above block is commonly referred to as the inverted bottleneck, which follows the narrow → wide → narrow structure. Figure 2 below shows an example from the ConvNeXt paper [5], where the number of channels in the input tensor is 96, expanded to 384, and compressed back to 96 by the last convolution layer. It is important to note that in MobileNetV2 an inverted bottleneck block is called an inverted residual. So, from now on, I’ll use that term to avoid confusion.

Figure 2. The inverted bottleneck block introduced in ConvNeXt [5].

At this point you might be wondering why we don’t just use the standard bottleneck for MobileNetV2. The answer lies in the original purpose of the standard bottleneck design, which was first introduced to reduce computational complexity. This was essentially done because ResNet is computationally expensive by nature yet rich in information. For this reason, the ResNet authors proposed to reduce computational cost by shrinking the tensor size in the middle of each building block, which is how the bottleneck block was born.

This reduction in the number of channels doesn’t hurt the model capacity that much since ResNet already has a large number of channels overall. On the other hand, MobileNetV2 is intended to be as lightweight as possible in the first place, which means its model capacity isn’t as high as that of ResNet. In order to increase model capacity, the authors expand the tensor size in the middle to form the inverted residual block, which allows the model to learn more patterns while only slightly increasing complexity. So in short, the middle part of a bottleneck block (narrow) is used for efficiency, while the middle part of an inverted residual block (wide) is used to learn complex patterns. If we tried to apply a standard bottleneck to MobileNetV2 instead, the computation would be even faster, but this would cause a drop in accuracy because the model would lose a significant amount of information.

Linear Bottleneck

The next concept we need to understand is the so-called linear bottleneck. This one is actually quite simple, since what we essentially do here is just omit the nonlinearity (i.e., the ReLU activation function) in the last layer of each inverted residual block. The purpose of activation functions in neural networks in the first place is to allow the network to capture complex patterns. However, an activation function will instead destroy important information if we apply it to a low-dimensional tensor, especially in the context of MobileNetV2 where the inverted residual block projects a high-dimensional tensor to a smaller one in the last convolution layer. By removing the activation function in the last convolution layer like this, we essentially prevent the model from losing important information. Figure 3 below shows what the inverted residual block used in MobileNetV2 looks like. Notice that ReLU isn’t applied in the last pointwise convolution, which essentially means that this layer behaves somewhat similarly to a standard linear regression layer. In this figure, the variables k and k’ denote the number of input and output channels, respectively. In the intermediate process, we essentially expand the number of channels by a factor of t before eventually shrinking it to k’. I’ll go into more detail on these variables in the next section.

Figure 3. The inverted residual block used in MobileNetV2. Notice that we don’t apply ReLU after the last pointwise convolution layer [3].

ReLU6

So why do we use ReLU6 instead of the regular ReLU? In case you’re not yet familiar with it, this activation function is actually similar to ReLU, except that the output value is capped at 6. So, any input greater than 6 will be mapped to that number. Meanwhile, the behavior for negative inputs is exactly the same. Thus, we can simply say that the output of ReLU6 will always be within the range of 0 to 6 (inclusive). Take a look at Figure 4 below to better understand this idea.

In standard ReLU, the input, and therefore the output, can grow arbitrarily large, which can cause instability in low-precision environments. Remember that MobileNet is intended to work on small devices, which typically expect small numbers to save memory, say 8-bit integers. In this particular case, having very large activation values could lead to precision loss or clipping when quantized to low-bit representations. Thus, to keep the values small and within a manageable range, we can simply employ ReLU6.
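To make the capping behavior concrete, here is a minimal sketch that compares nn.ReLU and nn.ReLU6 on a handful of hand-picked values. The input values are arbitrary numbers chosen for illustration only; they do not come from the paper.

```python
import torch
import torch.nn as nn

relu = nn.ReLU()
relu6 = nn.ReLU6()

# Arbitrary sample values: negative, zero, in-range, exactly 6, above 6
x = torch.tensor([-3.0, 0.0, 2.5, 6.0, 10.0])

relu_out = relu(x)    # negatives become 0, large values pass through unchanged
relu6_out = relu6(x)  # same as ReLU, but anything above 6 is capped at 6

print(relu_out.tolist())   # [0.0, 0.0, 2.5, 6.0, 10.0]
print(relu6_out.tolist())  # [0.0, 0.0, 2.5, 6.0, 6.0]
```

In other words, ReLU6 behaves like clamping the input to the range 0 to 6, so the largest activation a later quantization step ever has to represent is known in advance.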

The Complete MobileNetV2 Architecture

Now let’s take a look at the complete MobileNetV2 architecture in Figure 5 below. Just like the first version of MobileNet, which mostly consists of depthwise separable convolutions, most of the components inside MobileNetV2 are the inverted residual blocks with linear bottlenecks we discussed earlier. Every row in the following table labeled as bottleneck corresponds to a single stage, each of which consists of several inverted residual blocks. As for the columns in the table, t represents the expansion factor used in the middle part of each block, c denotes the number of output channels of each block, n is the number of repeats of the block within that stage, and s indicates the stride of the first block within the stage.

To better understand this idea, let’s take a closer look at the stage whose input shape is 56×56×24. Here you can see that the corresponding parameters of this stage are t=6, c=32, n=3, and s=2. This essentially means that the inverted residual stage consists of three blocks. All these blocks are identical except that the first one uses stride 2, reducing the spatial dimension by half from 56×56 to 28×28. Next, c=32 is pretty straightforward, as it basically says that the number of output channels of each block within the stage is 32. Meanwhile, t=6 indicates that the intermediate layer inside the blocks is 6 times wider than the input, forming the inverted bottleneck structure. So, in this case the number of channels in the process is going to be 32 → 192 → 32. However, it is important to note that the first block within that stage is different: it uses a 24 → 144 → 32 structure because of the 24-channel input tensor. If we refer back to Figure 3, these two structures essentially follow the k → kt → k’ pattern.

Figure 5. The MobileNetV2 architecture we’re about to implement [3].
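The channel arithmetic for a single stage can be sketched in a few lines of plain Python. The helper name stage_channel_plan is hypothetical (it is not part of the paper or of the implementation later in this article); it simply applies the t, c, n, s rules described above to the stage with a 24-channel input.

```python
def stage_channel_plan(in_channels, t, c, n, s):
    """Return (input, expanded, output, stride) for every block in one stage."""
    plan = []
    for i in range(n):
        block_in = in_channels if i == 0 else c  # later blocks receive c channels
        stride = s if i == 0 else 1              # only the first block uses stride s
        plan.append((block_in, block_in * t, c, stride))
    return plan

plan = stage_channel_plan(in_channels=24, t=6, c=32, n=3, s=2)
for block in plan:
    print(block)
# (24, 144, 32, 2)
# (32, 192, 32, 1)
# (32, 192, 32, 1)
```

The first tuple is the 24 → 144 → 32 block with stride 2, and the remaining two are the 32 → 192 → 32 blocks with stride 1.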

In addition to the above architecture, here we also have skip-connections placed within the inverted residual blocks. This skip-connection will only be applied whenever the stride of the block is set to 1. This is essentially because the spatial dimension of the image will change whenever we use stride 2, causing the output tensor to have a different shape from that of the input. Such a difference in tensor shapes will effectively prevent us from performing element-wise summation between the original flow and the skip-connection. See Figure 6 below for the details. Note that the two illustrations in this figure are basically just the visualization of the table in Figure 3.

Figure 6. We don’t implement the skip-connection when the stride is set to 2 (i.e., when the layer performs spatial downsampling) [3].
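The shape problem is easy to reproduce with dummy tensors. The following is a small sketch rather than part of the model: the three tensors are random placeholders standing in for a block input, the output of a stride-1 block, and the output of a stride-2 block.

```python
import torch

x = torch.randn(1, 24, 56, 56)           # placeholder block input
same_shape = torch.randn(1, 24, 56, 56)  # what a stride-1 block would produce
halved = torch.randn(1, 24, 28, 28)      # what a stride-2 block would produce

summed = x + same_shape                  # works: both shapes are identical

try:
    x + halved                           # 56x56 cannot be summed with 28x28
    stride2_sum_works = True
except RuntimeError:
    stride2_sum_works = False

print(summed.shape)        # torch.Size([1, 24, 56, 56])
print(stride2_sum_works)   # False
```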

Parameter Tuning

Similar to MobileNetV1, MobileNetV2 also has two adjustable parameters called width multiplier and input resolution. The former is used to adjust the width of the network, while the latter is for changing the resolution of the input image. The architecture you see in Figure 5 is the base configuration, where we set the width multiplier to 1 and the input resolution to 224×224. With these two parameters, we can tune the model to find a sweet spot that balances accuracy and efficiency based on our needs.

We can technically choose arbitrary numbers for the two parameters, but the authors already provided several predetermined numbers for their experiments. For the width multiplier, we can use 0.75, 0.5 or 0.35, all of which make the model smaller. For instance, if we use 0.5 then all numbers in column c in Figure 5 will be reduced to half of their defaults. For the input resolution, we can choose either 192×192, 160×160, 128×128 or 96×96 as a replacement for 224×224 if we want to lower the number of operations during inference.
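As a rough illustration of what the width multiplier does to column c, the sketch below scales a list of base channel counts. The BASE_CHANNELS list is my own transcription of the widths in the architecture table (first convolution, the seven bottleneck stages, and the last 1×1 convolution), and scale_channels is a hypothetical helper that uses the same plain int() truncation as the implementation in this article.

```python
# Base widths, assumed from the architecture table: first conv, 7 stages, last conv
BASE_CHANNELS = [32, 16, 24, 32, 64, 96, 160, 320, 1280]

def scale_channels(channels, width_multiplier):
    """Scale every channel count and truncate to an integer."""
    return [int(c * width_multiplier) for c in channels]

half = scale_channels(BASE_CHANNELS, 0.5)
three_quarters = scale_channels(BASE_CHANNELS, 0.75)

print(half)            # [16, 8, 12, 16, 32, 48, 80, 160, 640]
print(three_quarters)  # [24, 12, 18, 24, 48, 72, 120, 240, 960]
```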

Some Experimental Results

Figure 7 below shows what the experimental results obtained by the authors look like. Although MobileNetV1 is considered lightweight already, MobileNetV2 proved that its performance is even better in terms of all metrics compared to its predecessor. However, it is necessary to acknowledge that the base MobileNetV2 isn’t completely superior to other lightweight models, especially when taking all aspects into account at once.

Figure 7. The performance of MobileNetV2 compared to other lightweight models on the ImageNet dataset [3].

In order to achieve even higher accuracy, the authors also tried to enlarge the model instead by changing the width multiplier to 1.4 for the 224×224 input resolution, which in the above figure corresponds to the result in the last row. Doing this definitely causes the model complexity as well as the computation time to get higher, but in return it allows the model to obtain the highest accuracy. The results in Figure 8 also show a similar thing, where all MobileNetV2 variants completely outperform the MobileNetV1 counterpart, with the largest MobileNetV2 obtaining the highest accuracy among all models.

Figure 8. More results showing the superiority of MobileNetV2 over the existing models and how input resolution affects accuracy [3].

MobileNetV2 Implementation

Every time I finish learning something, I always wonder if I really understand what I just learned. In the case of deep learning, I (almost) always try to implement the architecture on my own right after reading the paper just to prove to myself that I understand. And here’s the quote that drives me that way:

What I cannot create, I do not understand.

Richard Feynman

This is essentially the reason why I always include the code implementation of the paper I’m explaining in my post.

What an intermezzo that was. Now let’s get our focus back to MobileNetV2. In this section I’m going to show you how we can implement the architecture from scratch. As always, the very first thing we need to do is to import the required modules.

# Codeblock 1
import torch
import torch.nn as nn
from torchinfo import summary

Next, we also need to initialize some configuration variables so that we can easily rescale our model if we want to. The two variables I want to highlight in Codeblock 2 below are WIDTH_MULTIPLIER and IMAGE_SIZE, which essentially correspond to the width multiplier and input resolution parameters we discussed earlier. Here I set the two to 1.0 and 224 because I want to implement the base MobileNetV2 architecture.

# Codeblock 2
BATCH_SIZE        = 1
IMAGE_SIZE        = 224
IN_CHANNELS       = 3
NUM_CLASSES       = 1000
WIDTH_MULTIPLIER  = 1.0

If we take a look at the architectural details in Figure 5, we can see that each row labeled as bottleneck is a group of blocks, which we previously referred to as a stage. Meanwhile, each row labeled as conv2d is basically just a standard convolution layer. I’ll start with the latter first because that one is easier to implement.

The Standard Convolution Layer

Talking about the rows labeled with conv2d, you might be asking why we actually need to wrap this single convolution layer in a separate class. Can’t we just directly use nn.Conv2d in the main class? In fact, it is mentioned in the paper that every conv layer is always followed by a batch normalization layer before eventually being processed by the ReLU6 activation function. This is actually in accordance with MobileNetV1, which uses the conv-BN-ReLU structure. In order to make the code cleaner, we can just wrap these layers within a single class so that we don’t necessarily need to define all of them repeatedly. Take a look at Codeblock 3 below to see how I create the Conv class.

# Codeblock 3
class Conv(nn.Module):
    def __init__(self, first=False):      #(1)
        super().__init__()
        
        if first:
            in_channels = 3               #(2)
            out_channels = int(32*WIDTH_MULTIPLIER)          #(3)
            kernel_size = 3               #(4)
            stride = 2                    #(5)
            padding = 1                   #(6)
        else:
            in_channels  = int(320*WIDTH_MULTIPLIER)         #(7)
            out_channels = int(1280*WIDTH_MULTIPLIER)        #(8)
            kernel_size = 1               #(9)
            stride = 1                    #(10)
            padding = 0                   #(11)
        
        self.conv = nn.Conv2d(in_channels=in_channels,       #(12)
                              out_channels=out_channels, 
                              kernel_size=kernel_size,
                              stride=stride, 
                              padding=padding, 
                              bias=False)
        self.bn = nn.BatchNorm2d(num_features=out_channels)  #(13)
        self.relu6 = nn.ReLU6()           #(14)
    
    def forward(self, x):
        x = self.relu6(self.bn(self.conv(x)))                #(15)
        return x

Every time we want to create a Conv instance, we need to pass a value for the first parameter as shown at the line marked with #(1) in the above code. If you take a look at the architecture, you’ll notice that this Conv layer will be used either before the sequence of inverted residuals or right after the sequence. Figure 9 below displays the architecture again with the two convolutions highlighted in pink and green, respectively. Later in the main class, if we want to instantiate the pink layer, we can simply set the first flag to True, and if we want to instantiate the green one, we can run it without passing any arguments since I’ve set the flag to False by default.

Figure 9. The Conv class will be used to instantiate these two convolution layers [3][6].

Using a flag like this helps us apply different configurations for the two convolutions. When we use first=True, we set the convolution layer to accept 3 input channels (#(2)) and produce a 32-channel tensor (#(3)). The kernel size used will be 3×3 (#(4)) with a stride of 2 (#(5)), effectively downsampling the spatial dimension by half. With this kernel size, we need to set the padding to 1 (#(6)) to prevent the convolution process from reducing the spatial dimension even further. All these configurations are essentially taken from the conv layer highlighted in pink.

Meanwhile, when we use first=False, this convolution layer will take a tensor of 320 channels as the input (#(7)) and produce another one having 1280 channels (#(8)). This green-highlighted layer is a pointwise convolution, hence we need to set the kernel size to 1 (#(9)). Since here we won’t perform spatial downsampling, the stride parameter must be set to 1 as shown at line #(10) (notice that the input size of this layer and the next one are both 7×7 spatially). Finally, we set the padding to 0 (#(11)) because by nature a 1×1 kernel cannot reduce spatial dimensions by itself.

Now that the parameters for the convolution layer have been defined, the next thing we do in the Conv class above is to initialize the convolution layer itself using nn.Conv2d (#(12)) as well as the batch normalization layer (#(13)) and the ReLU6 activation function (#(14)). Finally, we assemble these layers to form the conv-BN-ReLU structure in the forward() method (#(15)). In addition to the above code, don’t forget to apply WIDTH_MULTIPLIER when specifying the number of input and output channels, i.e., at lines #(3), #(7), and #(8), so that we can adjust the model size simply by changing the value of the variable.

Now let’s check if we have implemented the Conv class correctly by running the two test cases below. The one in Codeblock 4 demonstrates the pink layer while Codeblock 5 shows the green one. The shapes of the dummy tensor x used in both tests are set according to the input shapes required by each of the two layers. Based on the resulting outputs, we can confirm that our implementation is correct, as the output tensor shapes match exactly with the expected input shapes of the corresponding subsequent layers.
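The kernel size, stride and padding choices above can be sanity-checked with the usual convolution output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The helper below is a hypothetical utility written only for this check (it assumes a dilation of 1, which is the nn.Conv2d default); it is not used by the model itself.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution, assuming dilation = 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

# First conv: 3x3 kernel, stride 2, padding 1 on a 224x224 input
first = conv_output_size(224, kernel_size=3, stride=2, padding=1)

# Last conv: 1x1 kernel, stride 1, padding 0 on a 7x7 input
last = conv_output_size(7, kernel_size=1, stride=1, padding=0)

# Same as the first conv but without padding, for comparison
no_pad = conv_output_size(224, kernel_size=3, stride=2, padding=0)

print(first, last, no_pad)  # 112 7 111
```

Without the padding of 1, the first layer would output 111×111 instead of the 112×112 that the next layer expects, which is why the padding matters here.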

# Codeblock 4
conv = Conv(first=True)
x = torch.randn(1, 3, 224, 224)

out = conv(x)
out.shape

# Codeblock 4 Output
torch.Size([1, 32, 112, 112])

# Codeblock 5
conv = Conv(first=False)
x = torch.randn(1, int(320*WIDTH_MULTIPLIER), 7, 7)

out = conv(x)
out.shape

# Codeblock 5 Output
torch.Size([1, 1280, 7, 7])

Inverted Residual Block for Stride 2

Now that we have completed the class for standard convolution layers, we’ll talk about the one for the inverted residual blocks. Keep in mind that there are cases where we use either stride 1 or 2, which results in a slight difference in the block structure (see Figure 6). In this case I decided to implement them in two separate classes. In terms of practicality, it might indeed be cleaner if we just put them within the same class. However, for the sake of this tutorial I feel like breaking them down into two will make things easier to follow. I’m going to implement the one with stride 2 first since this one is simpler because of the absence of the skip-connection. See the InvResidualS2 class in Codeblock 6 below for the details.

# Codeblock 6
class InvResidualS2(nn.Module):
    def __init__(self, in_channels, out_channels, t):         #(1)
        super().__init__()
        
        in_channels  = int(in_channels*WIDTH_MULTIPLIER)      #(2)
        out_channels = int(out_channels*WIDTH_MULTIPLIER)     #(3)
        
        self.pwconv0 = nn.Conv2d(in_channels=in_channels,     #(4)
                                 out_channels=in_channels*t,
                                 kernel_size=1, 
                                 stride=1, 
                                 bias=False)
        
        self.bn_pwconv0 = nn.BatchNorm2d(num_features=in_channels*t)
        
        self.dwconv = nn.Conv2d(in_channels=in_channels*t,    #(5)
                                out_channels=in_channels*t, 
                                kernel_size=3,                #(6)
                                stride=2, 
                                padding=1,
                                groups=in_channels*t,         #(7)
                                bias=False)
        
        self.bn_dwconv = nn.BatchNorm2d(num_features=in_channels*t)
        
        self.pwconv1 = nn.Conv2d(in_channels=in_channels*t,   #(8)
                                 out_channels=out_channels, 
                                 kernel_size=1, 
                                 stride=1, 
                                 bias=False)
        
        self.bn_pwconv1 = nn.BatchNorm2d(num_features=out_channels)
        
        self.relu6 = nn.ReLU6()
    
    def forward(self, x):
        print('original\t\t:', x.shape)
        
        x = self.pwconv0(x)
        print('after pwconv0\t\t:', x.shape)
        x = self.bn_pwconv0(x)
        print('after bn0_pwconv0\t:', x.shape)
        x = self.relu6(x)
        print('after relu\t\t:', x.shape)
        
        x = self.dwconv(x)
        print('after dwconv\t\t:', x.shape)
        x = self.bn_dwconv(x)
        print('after bn_dwconv\t\t:', x.shape)
        x = self.relu6(x)
        print('after relu\t\t:', x.shape)
        
        x = self.pwconv1(x)
        print('after pwconv1\t\t:', x.shape)
        x = self.bn_pwconv1(x)
        print('after bn_pwconv1\t:', x.shape)
        
        return x

The above class takes three parameters in order to work: in_channels, out_channels, and t, as written at line #(1). The first two correspond to the number of input and output channels of the inverted residual block, while t is the expansion factor for determining the channel count of the wide part of the block. So, what we basically do here is just make the middle tensors have t times more channels than the input. The numbers of input and output channels themselves are adjustable via the WIDTH_MULTIPLIER variable we initialized earlier, as shown at lines #(2) and #(3).

What we need to do next is to initialize the layers within the inverted residual block according to the structure in Figures 3 and 6. Notice in the two figures that we have a depthwise convolution layer placed between two pointwise convolutions. The first pointwise convolution (#(4)) is used to expand the channel dimension from in_channels to in_channels*t. Subsequently, the depthwise convolution at line #(5) is responsible for capturing information along the spatial dimension. Here we set the kernel size to 3×3 (#(6)), which allows the layer to capture spatial information from its neighboring pixels. Don’t forget to set the groups parameter to be the same as the number of input channels to this layer (#(7)) since we want the convolution operation to be performed independently on each channel. Next, we process the resulting tensor with the second pointwise convolution (#(8)), which projects the tensor to the expected number of output channels of the block.

In the forward() method, we place the layers one after another. Remember that we use the conv-BN-ReLU structure except for the last convolution, following the convention of the linear bottleneck we discussed earlier. Additionally, here I also print out the output shape after each layer so that you can clearly see how the tensor transforms throughout the process.

Next, we’re going to test whether the InvResidualS2 class works properly. The following testing code simulates the first inverted residual block (n=1) of the third row in the architecture (i.e., the one having the 16×112×112 input shape).
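To see what the groups parameter buys us, the sketch below compares the number of weights in a standard 3×3 convolution with a depthwise one of the same width. The width of 96 is just an example (16 input channels expanded with t=6), and count_params is a hypothetical helper defined only for this comparison.

```python
import torch.nn as nn

channels = 96  # example width: 16 input channels expanded by t=6

# Standard 3x3 convolution: every output channel looks at every input channel
standard = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

# Depthwise 3x3 convolution: groups == channels, one filter per channel
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels, bias=False)

def count_params(layer):
    """Total number of learnable values in a layer."""
    return sum(p.numel() for p in layer.parameters())

standard_params = count_params(standard)    # 96 * 96 * 3 * 3
depthwise_params = count_params(depthwise)  # 96 * 1 * 3 * 3

print(standard_params, depthwise_params)  # 82944 864
```

With groups equal to the channel count, each 3×3 filter only sees its own channel, so the weight count drops by a factor equal to the number of channels.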

# Codeblock 7
inv_residual_s2 = InvResidualS2(in_channels=16, out_channels=24, t=6)
x = torch.randn(1, int(16*WIDTH_MULTIPLIER), 112, 112)

out = inv_residual_s2(x)

You can see at the line marked with #(1) in the following output that the first pointwise convolution successfully expands the channel axis from 16 to 96. The spatial dimension shrinks from 112×112 to 56×56 after the tensor is processed by the depthwise convolution layer in the middle (#(2)). Finally, our second pointwise convolution compresses the number of channels to 24 as written at line #(3). This final tensor is now ready to be passed through the next inverted residual block within the same stage.

# Codeblock 7 Output
original          : torch.Size([1, 16, 112, 112])
after pwconv0     : torch.Size([1, 96, 112, 112])  #(1)
after bn0_pwconv0 : torch.Size([1, 96, 112, 112])
after relu        : torch.Size([1, 96, 112, 112])
after dwconv      : torch.Size([1, 96, 56, 56])    #(2)
after bn_dwconv   : torch.Size([1, 96, 56, 56])
after relu        : torch.Size([1, 96, 56, 56])
after pwconv1     : torch.Size([1, 24, 56, 56])    #(3)
after bn_pwconv1  : torch.Size([1, 24, 56, 56])

Inverted Residual Block for Stride 1

The code used for implementing the inverted residual block with stride 1 is mostly similar to the one with stride 2. See the InvResidualS1 class in Codeblock 8 below.

# Codeblock 8
class InvResidualS1(nn.Module):
    def __init__(self, in_channels, out_channels, t):
        super().__init__()
        
        in_channels  = int(in_channels*WIDTH_MULTIPLIER)    #(1)
        out_channels = int(out_channels*WIDTH_MULTIPLIER)   #(2)
        
        self.in_channels  = in_channels
        self.out_channels = out_channels
        
        self.pwconv0 = nn.Conv2d(in_channels=in_channels, 
                                 out_channels=in_channels*t, 
                                 kernel_size=1, 
                                 stride=1, 
                                 bias=False)
        
        self.bn_pwconv0 = nn.BatchNorm2d(num_features=in_channels*t)
        
        self.dwconv = nn.Conv2d(in_channels=in_channels*t, 
                                out_channels=in_channels*t, 
                                kernel_size=3, 
                                stride=1,            #(3)
                                padding=1,
                                groups=in_channels*t, 
                                bias=False)
        
        self.bn_dwconv = nn.BatchNorm2d(num_features=in_channels*t)
        
        self.pwconv1 = nn.Conv2d(in_channels=in_channels*t, 
                                 out_channels=out_channels, 
                                 kernel_size=1, 
                                 stride=1, 
                                 bias=False)
        
        self.bn_pwconv1 = nn.BatchNorm2d(num_features=out_channels)
        
        self.relu6 = nn.ReLU6()
        
    def forward(self, x):
        
        if self.in_channels == self.out_channels:    #(4)
            residual = x          #(5)
            print(f'residual\t\t: {residual.size()}')
        
        x = self.pwconv0(x)
        print('after pwconv0\t\t:', x.shape)
        x = self.bn_pwconv0(x)
        print('after bn_pwconv0\t:', x.shape)
        x = self.relu6(x)
        print('after relu\t\t:', x.shape)
        
        x = self.dwconv(x)
        print('after dwconv\t\t:', x.shape)
        x = self.bn_dwconv(x)
        print('after bn_dwconv\t\t:', x.shape)
        x = self.relu6(x)
        print('after relu\t\t:', x.shape)
        
        x = self.pwconv1(x)
        print('after pwconv1\t\t:', x.shape)
        x = self.bn_pwconv1(x)
        print('after bn_pwconv1\t:', x.shape)
        
        if self.in_channels == self.out_channels:
            x = x + residual      #(6)
            print('after summation\t\t:', x.shape)
        
        return x

The first difference we have here is obviously the stride parameter itself, specifically the one belonging to the depthwise convolution layer at line #(3). By setting the stride parameter to 1 like this, the spatial output dimension of this inverted residual block is going to be the same as the input.

Another thing that we didn’t do previously is creating instance attributes for in_channels and out_channels, which store the values computed at lines #(1) and #(2). We do this now because later on we will need to access these values from the forward() method. This is actually just a basic OOP concept: if we don’t assign them to self, they will only exist locally within the __init__() method and won’t be accessible to other methods in the class.

Inside the forward() method itself, what we need to do first is to check whether the numbers of input and output channels are the same (#(4)). If that’s the case, we’ll keep the original input tensor (#(5)) to implement the skip-connection, where this tensor will be element-wise summed with the one from the main flow (#(6)). This dimensionality check is performed because we need to ensure that the two tensors to be summed have the exact same dimensions. We have indeed guaranteed that the spatial dimension remains unchanged since we have set all three convolution layers to use stride 1. However, there is still a possibility that the number of output channels differs from the input, as in the first block within the stages highlighted in purple, blue and orange in Figure 10 below. In such cases, the skip-connection will not be applied because it is simply impossible to perform element-wise summation on tensors with different shapes.

Figure 10. Despite not performing spatial downsampling, there is no skip-connection within the first block in the three highlighted stages because the numbers of input and output channels are different [3][6].

Now let’s test the InvResidualS1 class by running Codeblock 9 below. Here I’m going to simulate the second inverted residual block (n=2) of the third row in the architecture, which is actually just the continuation of the previous test case. Here you can see that the dummy tensor we use has the exact same shape as the one we obtained from Codeblock 7, i.e., 24×56×56.
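The rule for when a skip-connection exists can be checked for the whole network with a short script. The STAGES list below is my own transcription of the (t, c, n, s) rows of the architecture table, and list_blocks is a hypothetical helper; treat both as assumptions for illustration rather than as code from the paper.

```python
# (t, c, n, s) for each bottleneck stage, assumed from the architecture table
STAGES = [(1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
          (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1)]

def list_blocks(stages, first_in_channels=32):
    """Return (in_channels, out_channels, stride, has_skip) for every block."""
    blocks = []
    in_channels = first_in_channels
    for t, c, n, s in stages:
        for i in range(n):
            stride = s if i == 0 else 1
            # A skip-connection needs identical shapes: stride 1 and equal channels
            has_skip = stride == 1 and in_channels == c
            blocks.append((in_channels, c, stride, has_skip))
            in_channels = c
    return blocks

blocks = list_blocks(STAGES)
stride1_without_skip = [(i, o) for i, o, s, skip in blocks if s == 1 and not skip]
num_skips = sum(skip for _, _, _, skip in blocks)

print(stride1_without_skip)    # [(32, 16), (64, 96), (160, 320)]
print(num_skips, len(blocks))  # 10 17
```

Exactly three stride-1 blocks end up without a skip-connection, and in each of them the input and output channel counts differ.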

# Codeblock 9
inv_residual_s1 = InvResidualS1(in_channels=24, out_channels=24, t=6)
x = torch.randn(1, int(24*WIDTH_MULTIPLIER), 56, 56)

out = inv_residual_s1(x)

And below is what the resulting output looks like. It is clearly seen here that the block indeed follows the narrow → wide → narrow structure, which in this case is 24 → 144 → 24. In addition, since the input and output tensors have the same shape, we can technically stack this inverted residual block as many times as we want.

# Codeblock 9 Output
residual          : torch.Size([1, 24, 56, 56])
after pwconv0     : torch.Size([1, 144, 56, 56])
after bn_pwconv0  : torch.Size([1, 144, 56, 56])
after relu        : torch.Size([1, 144, 56, 56])
after dwconv      : torch.Size([1, 144, 56, 56])
after bn_dwconv   : torch.Size([1, 144, 56, 56])
after relu        : torch.Size([1, 144, 56, 56])
after pwconv1     : torch.Size([1, 24, 56, 56])
after bn_pwconv1  : torch.Size([1, 24, 56, 56])
after summation   : torch.Size([1, 24, 56, 56])
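
The stacking argument can be checked without running the network at all. Below is a small shape-only sketch in plain Python; the helper trace_inv_residual_s1 is hypothetical (it is not part of the implementation), it assumes a width multiplier of 1, and it skips the batch norm and ReLU6 steps since they never change the shape.

```python
def trace_inv_residual_s1(in_shape, out_channels, t):
    """List the (name, shape) pairs a stride-1 inverted residual block produces."""
    batch, in_channels, height, width = in_shape
    hidden = in_channels * t                         # expansion by the factor t
    wide = (batch, hidden, height, width)
    narrow = (batch, out_channels, height, width)

    trace = [
        ("after pwconv0", wide),     # 1x1 conv: narrow -> wide
        ("after dwconv", wide),      # 3x3 depthwise conv, stride 1
        ("after pwconv1", narrow),   # 1x1 conv: wide -> narrow (no activation)
    ]
    if in_channels == out_channels:
        trace.append(("after summation", narrow))    # skip-connection applies
    return trace


shape = (1, 24, 56, 56)
for i in range(3):                   # feed each output into the next block
    trace = trace_inv_residual_s1(shape, out_channels=24, t=6)
    shape = trace[-1][1]
    print(f"block #{i}: {[s[1] for _, s in trace]} channels, output {shape}")
```

Because the output shape equals the input shape, the loop can run for any number of repeats, which is exactly what lets a stage contain n identical stride-1 blocks.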

The Entire MobileNetV2 Architecture

Now that we have finished defining the Conv, InvResidualS2 and InvResidualS1 classes, we can assemble them to construct the entire MobileNetV2 architecture. Take a look at Codeblock 10 below to see how I do that.

# Codeblock 10
class MobileNetV2(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Input shape: 3x224x224
        self.first_conv = Conv(first=True)
        
        # Input shape: 32x112x112
        self.inv_residual0 = InvResidualS1(in_channels=32, 
                                           out_channels=16, 
                                           t=1)
        
        # Input shape: 16x112x112
        self.inv_residual1 = nn.ModuleList([InvResidualS2(in_channels=16, 
                                                          out_channels=24, 
                                                          t=6)])
        
        self.inv_residual1.append(InvResidualS1(in_channels=24, 
                                                out_channels=24, 
                                                t=6))
        
        # Input shape: 24x56x56
        self.inv_residual2 = nn.ModuleList([InvResidualS2(in_channels=24, 
                                                          out_channels=32, 
                                                          t=6)])
        
        for _ in range(2):
            self.inv_residual2.append(InvResidualS1(in_channels=32, 
                                                    out_channels=32, 
                                                    t=6))
        
        # Input shape: 32x28x28
        self.inv_residual3 = nn.ModuleList([InvResidualS2(in_channels=32, 
                                                          out_channels=64, 
                                                          t=6)])
        
        for _ in range(3):
            self.inv_residual3.append(InvResidualS1(in_channels=64, 
                                                    out_channels=64, 
                                                    t=6))
            
        # Input shape: 64x14x14
        self.inv_residual4 = nn.ModuleList([InvResidualS1(in_channels=64, 
                                                          out_channels=96, 
                                                          t=6)])
        
        for _ in range(2):
            self.inv_residual4.append(InvResidualS1(in_channels=96, 
                                                    out_channels=96, 
                                                    t=6))
        
        # Input shape: 96x14x14
        self.inv_residual5 = nn.ModuleList([InvResidualS2(in_channels=96, 
                                                          out_channels=160, 
                                                          t=6)])
        
        for _ in range(2):
            self.inv_residual5.append(InvResidualS1(in_channels=160, 
                                                    out_channels=160, 
                                                    t=6))
        
        # Input shape: 160x7x7
        self.inv_residual6 = InvResidualS1(in_channels=160, 
                                           out_channels=320, 
                                           t=6)
        
        # Input shape: 320x7x7
        self.last_conv = Conv(first=False)
        
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))        #(1)
        self.dropout = nn.Dropout(p=0.2)                              #(2)
        self.fc = nn.Linear(in_features=int(1280*WIDTH_MULTIPLIER),   #(3)
                            out_features=1000)

    def forward(self, x):
        x = self.first_conv(x)
        print(f"after first_conv\t: {x.shape}")
        
        x = self.inv_residual0(x)
        print(f"after inv_residual0\t: {x.shape}")
            
        for i, layer in enumerate(self.inv_residual1):
            x = layer(x)
            print(f"after inv_residual1 #{i}\t: {x.shape}")
            
        for i, layer in enumerate(self.inv_residual2):
            x = layer(x)
            print(f"after inv_residual2 #{i}\t: {x.shape}")
            
        for i, layer in enumerate(self.inv_residual3):
            x = layer(x)
            print(f"after inv_residual3 #{i}\t: {x.shape}")
            
        for i, layer in enumerate(self.inv_residual4):
            x = layer(x)
            print(f"after inv_residual4 #{i}\t: {x.shape}")
            
        for i, layer in enumerate(self.inv_residual5):
            x = layer(x)
            print(f"after inv_residual5 #{i}\t: {x.shape}")
        
        x = self.inv_residual6(x)
        print(f"after inv_residual6\t: {x.shape}")
        
        x = self.last_conv(x)
        print(f"after last_conv\t\t: {x.shape}")
        
        x = self.avgpool(x)
        print(f"after avgpool\t\t: {x.shape}")
        
        # Collapse the 1x1 spatial dimensions before the linear layer
        x = torch.flatten(x, start_dim=1)
        print(f"after flatten\t\t: {x.shape}")
        
        x = self.dropout(x)
        print(f"after dropout\t\t: {x.shape}")
        
        x = self.fc(x)
        print(f"after fc\t\t: {x.shape}")
                
        return x

Despite being quite long, I think the above code is pretty straightforward, since what we basically do here is place the blocks according to the given architectural details. However, I really want you to pay attention to the number of block repeats within a single stage (n) as well as whether or not the first block in a stage performs downsampling (s). This is because the architecture doesn’t seem to follow a specific pattern. There is a stage where the block is repeated four times, there are other stages where it is repeated two or three times, and there are even stages that consist of a single block only. Not only that, it is also unclear under what conditions the authors decided to use stride 1 or 2 for the first block in a stage. Nonetheless, I believe that this final architecture was obtained from their internal design iterations and experiments that are not discussed in the paper.
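
Since n and s are the easiest things to get wrong, it may help to see the same information as data. The sketch below, in plain Python, lists the seven stages as (t, c, n, s) tuples read off Codeblock 10 and expands them into one entry per block. The names STAGES and expand_stages are hypothetical helpers, not part of the implementation above.

```python
# (t, c, n, s): expansion factor, output channels, repeats, stride of first block.
STAGES = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]


def expand_stages(stages, in_channels=32):
    """Turn the stage table into one (in, out, t, stride) tuple per block."""
    blocks = []
    for t, c, n, s in stages:
        for i in range(n):
            stride = s if i == 0 else 1   # only the first block may downsample
            blocks.append((in_channels, c, t, stride))
            in_channels = c               # later blocks see the new channel count
    return blocks


blocks = expand_stages(STAGES)
for in_ch, out_ch, t, stride in blocks:
    kind = "InvResidualS2" if stride == 2 else "InvResidualS1"
    print(f"{kind}(in_channels={in_ch}, out_channels={out_ch}, t={t})")
```

The printed lines should match the constructor calls in Codeblock 10 one by one: 17 blocks in total, of which 4 use stride 2.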

Going back to the code, after the stages have been initialized, what we need to do next is initialize the remaining layers, namely an average pooling layer (#(1)), a dropout layer (#(2)) and a linear layer (#(3)) for the classification head. If you go back to the architectural details, you’ll notice that the final layer should be a pointwise convolution, not a linear layer like this. In fact, when the spatial dimension of the input tensor is 1×1, a pointwise convolution and a linear layer are equivalent: both compute one weighted sum over the input channels per output unit. So, it’s basically fine to use either one.
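
Here is a small numerical check of that equivalence, written in plain Python so it runs without PyTorch. It is only a sketch: the helpers linear and pointwise_conv are my own toy implementations, and the sizes 4 and 3 are tiny stand-ins for the real 1280 input and 1000 output features.

```python
import random

random.seed(0)
in_features, out_features = 4, 3      # tiny stand-ins for 1280 and 1000

weight = [[random.uniform(-1, 1) for _ in range(in_features)]
          for _ in range(out_features)]
bias = [random.uniform(-1, 1) for _ in range(out_features)]
x = [random.uniform(-1, 1) for _ in range(in_features)]


def linear(x, weight, bias):
    """y[o] = sum_i W[o][i] * x[i] + b[o]"""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def pointwise_conv(feature_map, weight, bias):
    """1x1 convolution over a C x H x W feature map stored as nested lists."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[[sum(w * feature_map[c][i][j] for c, w in enumerate(row)) + b
              for j in range(width)]
             for i in range(height)]
            for row, b in zip(weight, bias)]


feature_map = [[[v]] for v in x]      # the same values, shaped C x 1 x 1
conv_out = [channel[0][0] for channel in pointwise_conv(feature_map, weight, bias)]
lin_out = linear(x, weight, bias)

same = all(abs(a - b) < 1e-12 for a, b in zip(conv_out, lin_out))
head_params = 1280 * 1000 + 1000      # weights + biases of the real head
print(same, head_params)
```

With shared weights the two layers give the same numbers, and the head has 1,281,000 parameters either way.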

To make sure our MobileNetV2 is working properly, we can run Codeblock 11 below. Here we can see that this class instance runs without any errors. More importantly, the output shapes also match exactly with the architecture specified in the paper. This is a good sign that our implementation is correct and thus ready for training. Just don’t forget to adjust the output size of the final layer to match the number of classes in your dataset.

# Codeblock 11
mobilenetv2 = MobileNetV2()
x = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)

out = mobilenetv2(x)

# Codeblock 11 Output
after first_conv       : torch.Size([1, 32, 112, 112])
after inv_residual0    : torch.Size([1, 16, 112, 112])
after inv_residual1 #0 : torch.Size([1, 24, 56, 56])
after inv_residual1 #1 : torch.Size([1, 24, 56, 56])
after inv_residual2 #0 : torch.Size([1, 32, 28, 28])
after inv_residual2 #1 : torch.Size([1, 32, 28, 28])
after inv_residual2 #2 : torch.Size([1, 32, 28, 28])
after inv_residual3 #0 : torch.Size([1, 64, 14, 14])
after inv_residual3 #1 : torch.Size([1, 64, 14, 14])
after inv_residual3 #2 : torch.Size([1, 64, 14, 14])
after inv_residual3 #3 : torch.Size([1, 64, 14, 14])
after inv_residual4 #0 : torch.Size([1, 96, 14, 14])
after inv_residual4 #1 : torch.Size([1, 96, 14, 14])
after inv_residual4 #2 : torch.Size([1, 96, 14, 14])
after inv_residual5 #0 : torch.Size([1, 160, 7, 7])
after inv_residual5 #1 : torch.Size([1, 160, 7, 7])
after inv_residual5 #2 : torch.Size([1, 160, 7, 7])
after inv_residual6    : torch.Size([1, 320, 7, 7])
after last_conv        : torch.Size([1, 1280, 7, 7])
after avgpool          : torch.Size([1, 1280, 1, 1])
after flatten          : torch.Size([1, 1280])
after dropout          : torch.Size([1, 1280])
after fc               : torch.Size([1, 1000])
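
The spatial sizes in this output can also be predicted by hand with the standard convolution output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The sketch below assumes that every convolution that may downsample uses a 3×3 kernel with padding 1, which is consistent with the shapes printed above. The names conv_out_size and STRIDES are hypothetical helpers.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# Stride of the first 3x3 convolution in each part of the network.
# All remaining blocks in a stage use stride 1 and keep the size unchanged.
STRIDES = [
    ("first_conv", 2),
    ("inv_residual0", 1),
    ("inv_residual1", 2),
    ("inv_residual2", 2),
    ("inv_residual3", 2),
    ("inv_residual4", 1),
    ("inv_residual5", 2),
    ("inv_residual6", 1),
]

size = 224
sizes = {}
for name, stride in STRIDES:
    size = conv_out_size(size, stride=stride)
    sizes[name] = size
    print(f"after {name:14s}: {size}x{size}")
```

Five stride-2 layers halve the 224×224 input five times, which is where the final 7×7 feature map comes from.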

Alternatively, it is also possible to test our MobileNetV2 model using the summary() function from torchinfo, which will also show us the number of parameters contained within each layer. If you scroll all the way down to the end of the output, you’ll see that this model with the default width multiplier has 3,505,960 trainable params. This number is different from the one disclosed in the paper, where according to Figure 7 it should be 3.4 million. However, if we go to the official PyTorch documentation [7], it says that the parameter count of this model is 3,504,872, which is very close to our implementation. Let me know in the comments which parts of the code I should change to make this number match exactly with the one from PyTorch.

# Codeblock 12
mobilenetv2 = MobileNetV2()
summary(mobilenetv2, input_size=(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE))

# Codeblock 12 Output
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
MobileNetV2                              [1, 1000]                 --
├─Conv: 1-1                              [1, 32, 112, 112]         --
│    └─Conv2d: 2-1                       [1, 32, 112, 112]         864
│    └─BatchNorm2d: 2-2                  [1, 32, 112, 112]         64
│    └─ReLU6: 2-3                        [1, 32, 112, 112]         --
├─InvResidualS1: 1-2                     [1, 16, 112, 112]         --
│    └─Conv2d: 2-4                       [1, 32, 112, 112]         1,024
│    └─BatchNorm2d: 2-5                  [1, 32, 112, 112]         64
│    └─ReLU6: 2-6                        [1, 32, 112, 112]         --
│    └─Conv2d: 2-7                       [1, 32, 112, 112]         288
│    └─BatchNorm2d: 2-8                  [1, 32, 112, 112]         64
│    └─ReLU6: 2-9                        [1, 32, 112, 112]         --
│    └─Conv2d: 2-10                      [1, 16, 112, 112]         512
│    └─BatchNorm2d: 2-11                 [1, 16, 112, 112]         32
├─ModuleList: 1-3                        --                        --
│    └─InvResidualS2: 2-12               [1, 24, 56, 56]           --
│    │    └─Conv2d: 3-1                  [1, 96, 112, 112]         1,536
│    │    └─BatchNorm2d: 3-2             [1, 96, 112, 112]         192
│    │    └─ReLU6: 3-3                   [1, 96, 112, 112]         --
│    │    └─Conv2d: 3-4                  [1, 96, 56, 56]           864
│    │    └─BatchNorm2d: 3-5             [1, 96, 56, 56]           192
│    │    └─ReLU6: 3-6                   [1, 96, 56, 56]           --
│    │    └─Conv2d: 3-7                  [1, 24, 56, 56]           2,304
│    │    └─BatchNorm2d: 3-8             [1, 24, 56, 56]           48
│    └─InvResidualS1: 2-13               [1, 24, 56, 56]           --
│    │    └─Conv2d: 3-9                  [1, 144, 56, 56]          3,456
│    │    └─BatchNorm2d: 3-10            [1, 144, 56, 56]          288
│    │    └─ReLU6: 3-11                  [1, 144, 56, 56]          --
│    │    └─Conv2d: 3-12                 [1, 144, 56, 56]          1,296
│    │    └─BatchNorm2d: 3-13            [1, 144, 56, 56]          288
│    │    └─ReLU6: 3-14                  [1, 144, 56, 56]          --
│    │    └─Conv2d: 3-15                 [1, 24, 56, 56]           3,456
│    │    └─BatchNorm2d: 3-16            [1, 24, 56, 56]           48
├─ModuleList: 1-4                        --                        --
│    └─InvResidualS2: 2-14               [1, 32, 28, 28]           --
│    │    └─Conv2d: 3-17                 [1, 144, 56, 56]          3,456
│    │    └─BatchNorm2d: 3-18            [1, 144, 56, 56]          288
│    │    └─ReLU6: 3-19                  [1, 144, 56, 56]          --
│    │    └─Conv2d: 3-20                 [1, 144, 28, 28]          1,296
│    │    └─BatchNorm2d: 3-21            [1, 144, 28, 28]          288
│    │    └─ReLU6: 3-22                  [1, 144, 28, 28]          --
│    │    └─Conv2d: 3-23                 [1, 32, 28, 28]           4,608
│    │    └─BatchNorm2d: 3-24            [1, 32, 28, 28]           64
│    └─InvResidualS1: 2-15               [1, 32, 28, 28]           --
│    │    └─Conv2d: 3-25                 [1, 192, 28, 28]          6,144
│    │    └─BatchNorm2d: 3-26            [1, 192, 28, 28]          384
│    │    └─ReLU6: 3-27                  [1, 192, 28, 28]          --
│    │    └─Conv2d: 3-28                 [1, 192, 28, 28]          1,728
│    │    └─BatchNorm2d: 3-29            [1, 192, 28, 28]          384
│    │    └─ReLU6: 3-30                  [1, 192, 28, 28]          --
│    │    └─Conv2d: 3-31                 [1, 32, 28, 28]           6,144
│    │    └─BatchNorm2d: 3-32            [1, 32, 28, 28]           64
│    └─InvResidualS1: 2-16               [1, 32, 28, 28]           --
│    │    └─Conv2d: 3-33                 [1, 192, 28, 28]          6,144
│    │    └─BatchNorm2d: 3-34            [1, 192, 28, 28]          384
│    │    └─ReLU6: 3-35                  [1, 192, 28, 28]          --
│    │    └─Conv2d: 3-36                 [1, 192, 28, 28]          1,728
│    │    └─BatchNorm2d: 3-37            [1, 192, 28, 28]          384
│    │    └─ReLU6: 3-38                  [1, 192, 28, 28]          --
│    │    └─Conv2d: 3-39                 [1, 32, 28, 28]           6,144
│    │    └─BatchNorm2d: 3-40            [1, 32, 28, 28]           64
├─ModuleList: 1-5                        --                        --
│    └─InvResidualS2: 2-17               [1, 64, 14, 14]           --
│    │    └─Conv2d: 3-41                 [1, 192, 28, 28]          6,144
│    │    └─BatchNorm2d: 3-42            [1, 192, 28, 28]          384
│    │    └─ReLU6: 3-43                  [1, 192, 28, 28]          --
│    │    └─Conv2d: 3-44                 [1, 192, 14, 14]          1,728
│    │    └─BatchNorm2d: 3-45            [1, 192, 14, 14]          384
│    │    └─ReLU6: 3-46                  [1, 192, 14, 14]          --
│    │    └─Conv2d: 3-47                 [1, 64, 14, 14]           12,288
│    │    └─BatchNorm2d: 3-48            [1, 64, 14, 14]           128
│    └─InvResidualS1: 2-18               [1, 64, 14, 14]           --
│    │    └─Conv2d: 3-49                 [1, 384, 14, 14]          24,576
│    │    └─BatchNorm2d: 3-50            [1, 384, 14, 14]          768
│    │    └─ReLU6: 3-51                  [1, 384, 14, 14]          --
│    │    └─Conv2d: 3-52                 [1, 384, 14, 14]          3,456
│    │    └─BatchNorm2d: 3-53            [1, 384, 14, 14]          768
│    │    └─ReLU6: 3-54                  [1, 384, 14, 14]          --
│    │    └─Conv2d: 3-55                 [1, 64, 14, 14]           24,576
│    │    └─BatchNorm2d: 3-56            [1, 64, 14, 14]           128
│    └─InvResidualS1: 2-19               [1, 64, 14, 14]           --
│    │    └─Conv2d: 3-57                 [1, 384, 14, 14]          24,576
│    │    └─BatchNorm2d: 3-58            [1, 384, 14, 14]          768
│    │    └─ReLU6: 3-59                  [1, 384, 14, 14]          --
│    │    └─Conv2d: 3-60                 [1, 384, 14, 14]          3,456
│    │    └─BatchNorm2d: 3-61            [1, 384, 14, 14]          768
│    │    └─ReLU6: 3-62                  [1, 384, 14, 14]          --
│    │    └─Conv2d: 3-63                 [1, 64, 14, 14]           24,576
│    │    └─BatchNorm2d: 3-64            [1, 64, 14, 14]           128
│    └─InvResidualS1: 2-20               [1, 64, 14, 14]           --
│    │    └─Conv2d: 3-65                 [1, 384, 14, 14]          24,576
│    │    └─BatchNorm2d: 3-66            [1, 384, 14, 14]          768
│    │    └─ReLU6: 3-67                  [1, 384, 14, 14]          --
│    │    └─Conv2d: 3-68                 [1, 384, 14, 14]          3,456
│    │    └─BatchNorm2d: 3-69            [1, 384, 14, 14]          768
│    │    └─ReLU6: 3-70                  [1, 384, 14, 14]          --
│    │    └─Conv2d: 3-71                 [1, 64, 14, 14]           24,576
│    │    └─BatchNorm2d: 3-72            [1, 64, 14, 14]           128
├─ModuleList: 1-6                        --                        --
│    └─InvResidualS1: 2-21               [1, 96, 14, 14]           --
│    │    └─Conv2d: 3-73                 [1, 384, 14, 14]          24,576
│    │    └─BatchNorm2d: 3-74            [1, 384, 14, 14]          768
│    │    └─ReLU6: 3-75                  [1, 384, 14, 14]          --
│    │    └─Conv2d: 3-76                 [1, 384, 14, 14]          3,456
│    │    └─BatchNorm2d: 3-77            [1, 384, 14, 14]          768
│    │    └─ReLU6: 3-78                  [1, 384, 14, 14]          --
│    │    └─Conv2d: 3-79                 [1, 96, 14, 14]           36,864
│    │    └─BatchNorm2d: 3-80            [1, 96, 14, 14]           192
│    └─InvResidualS1: 2-22               [1, 96, 14, 14]           --
│    │    └─Conv2d: 3-81                 [1, 576, 14, 14]          55,296
│    │    └─BatchNorm2d: 3-82            [1, 576, 14, 14]          1,152
│    │    └─ReLU6: 3-83                  [1, 576, 14, 14]          --
│    │    └─Conv2d: 3-84                 [1, 576, 14, 14]          5,184
│    │    └─BatchNorm2d: 3-85            [1, 576, 14, 14]          1,152
│    │    └─ReLU6: 3-86                  [1, 576, 14, 14]          --
│    │    └─Conv2d: 3-87                 [1, 96, 14, 14]           55,296
│    │    └─BatchNorm2d: 3-88            [1, 96, 14, 14]           192
│    └─InvResidualS1: 2-23               [1, 96, 14, 14]           --
│    │    └─Conv2d: 3-89                 [1, 576, 14, 14]          55,296
│    │    └─BatchNorm2d: 3-90            [1, 576, 14, 14]          1,152
│    │    └─ReLU6: 3-91                  [1, 576, 14, 14]          --
│    │    └─Conv2d: 3-92                 [1, 576, 14, 14]          5,184
│    │    └─BatchNorm2d: 3-93            [1, 576, 14, 14]          1,152
│    │    └─ReLU6: 3-94                  [1, 576, 14, 14]          --
│    │    └─Conv2d: 3-95                 [1, 96, 14, 14]           55,296
│    │    └─BatchNorm2d: 3-96            [1, 96, 14, 14]           192
├─ModuleList: 1-7                        --                        --
│    └─InvResidualS2: 2-24               [1, 160, 7, 7]            --
│    │    └─Conv2d: 3-97                 [1, 576, 14, 14]          55,296
│    │    └─BatchNorm2d: 3-98            [1, 576, 14, 14]          1,152
│    │    └─ReLU6: 3-99                  [1, 576, 14, 14]          --
│    │    └─Conv2d: 3-100                [1, 576, 7, 7]            5,184
│    │    └─BatchNorm2d: 3-101           [1, 576, 7, 7]            1,152
│    │    └─ReLU6: 3-102                 [1, 576, 7, 7]            --
│    │    └─Conv2d: 3-103                [1, 160, 7, 7]            92,160
│    │    └─BatchNorm2d: 3-104           [1, 160, 7, 7]            320
│    └─InvResidualS1: 2-25               [1, 160, 7, 7]            --
│    │    └─Conv2d: 3-105                [1, 960, 7, 7]            153,600
│    │    └─BatchNorm2d: 3-106           [1, 960, 7, 7]            1,920
│    │    └─ReLU6: 3-107                 [1, 960, 7, 7]            --
│    │    └─Conv2d: 3-108                [1, 960, 7, 7]            8,640
│    │    └─BatchNorm2d: 3-109           [1, 960, 7, 7]            1,920
│    │    └─ReLU6: 3-110                 [1, 960, 7, 7]            --
│    │    └─Conv2d: 3-111                [1, 160, 7, 7]            153,600
│    │    └─BatchNorm2d: 3-112           [1, 160, 7, 7]            320
│    └─InvResidualS1: 2-26               [1, 160, 7, 7]            --
│    │    └─Conv2d: 3-113                [1, 960, 7, 7]            153,600
│    │    └─BatchNorm2d: 3-114           [1, 960, 7, 7]            1,920
│    │    └─ReLU6: 3-115                 [1, 960, 7, 7]            --
│    │    └─Conv2d: 3-116                [1, 960, 7, 7]            8,640
│    │    └─BatchNorm2d: 3-117           [1, 960, 7, 7]            1,920
│    │    └─ReLU6: 3-118                 [1, 960, 7, 7]            --
│    │    └─Conv2d: 3-119                [1, 160, 7, 7]            153,600
│    │    └─BatchNorm2d: 3-120           [1, 160, 7, 7]            320
├─InvResidualS1: 1-8                     [1, 320, 7, 7]            --
│    └─Conv2d: 2-27                      [1, 960, 7, 7]            153,600
│    └─BatchNorm2d: 2-28                 [1, 960, 7, 7]            1,920
│    └─ReLU6: 2-29                       [1, 960, 7, 7]            --
│    └─Conv2d: 2-30                      [1, 960, 7, 7]            8,640
│    └─BatchNorm2d: 2-31                 [1, 960, 7, 7]            1,920
│    └─ReLU6: 2-32                       [1, 960, 7, 7]            --
│    └─Conv2d: 2-33                      [1, 320, 7, 7]            307,200
│    └─BatchNorm2d: 2-34                 [1, 320, 7, 7]            640
├─Conv: 1-9                              [1, 1280, 7, 7]           --
│    └─Conv2d: 2-35                      [1, 1280, 7, 7]           409,600
│    └─BatchNorm2d: 2-36                 [1, 1280, 7, 7]           2,560
│    └─ReLU6: 2-37                       [1, 1280, 7, 7]           --
├─AdaptiveAvgPool2d: 1-10                [1, 1280, 1, 1]           --
├─Dropout: 1-11                          [1, 1280]                 --
├─Linear: 1-12                           [1, 1000]                 1,281,000
==========================================================================================
Total params: 3,505,960
Trainable params: 3,505,960
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 313.65
==========================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 113.28
Params size (MB): 14.02
Estimated Total Size (MB): 127.91
==========================================================================================
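
The total above can be reproduced with simple arithmetic, which also hints at where the gap of 1,088 parameters to the PyTorch figure may come from. The sketch below is plain Python and rests on one assumption that is worth verifying against the torchvision source: that torchvision leaves out the 1×1 expansion convolution (and its batch norm) in the block with t=1, whereas our InvResidualS1 always includes it. The helper names are hypothetical, and all convolutions are counted without bias.

```python
# (t, c, n, s) per stage; s does not affect the parameter count.
STAGES = [(1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
          (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1)]


def conv_bn_params(in_ch, out_ch, kernel, groups=1):
    """Bias-free conv weights plus the batch norm scale and shift."""
    return (in_ch // groups) * out_ch * kernel * kernel + 2 * out_ch


def inverted_residual_params(in_ch, out_ch, t, expand):
    hidden = in_ch * t
    total = 0
    if expand:
        total += conv_bn_params(in_ch, hidden, kernel=1)              # expansion
    total += conv_bn_params(hidden, hidden, kernel=3, groups=hidden)  # depthwise
    total += conv_bn_params(hidden, out_ch, kernel=1)                 # projection
    return total


def mobilenetv2_params(expand_when_t_is_1=True, num_classes=1000):
    total = conv_bn_params(3, 32, kernel=3)          # first_conv
    in_ch = 32
    for t, c, n, _ in STAGES:
        for _ in range(n):
            expand = expand_when_t_is_1 or t != 1
            total += inverted_residual_params(in_ch, c, t, expand)
            in_ch = c
    total += conv_bn_params(in_ch, 1280, kernel=1)   # last_conv
    total += 1280 * num_classes + num_classes        # fc weights + biases
    return total


ours = mobilenetv2_params(expand_when_t_is_1=True)
without_expansion = mobilenetv2_params(expand_when_t_is_1=False)
print(ours, without_expansion, ours - without_expansion)
```

With the expansion layer the count is 3,505,960, the same as the summary above. Without it the count drops by 32×32 + 2×32 = 1,088 to 3,504,872, which is the number listed in the PyTorch documentation [7]. So skipping the first pointwise convolution when t=1 looks like a plausible way to close the gap.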

Ending

And that’s pretty much everything about MobileNetV2. I do encourage you to explore this architecture on your own, at least by actually training it on an image classification dataset. Don’t forget to play around with the width multiplier and the input resolution parameters to find the right balance between prediction accuracy and computational efficiency. You can also find the code used in this article in my GitHub repository [8], by the way.

I hope you learned something new today. Thanks for reading!

References

[1] Muhammad Ardi. MobileNetV1 Paper Walkthrough: The Tiny Giant. Towards Data Science. https://towardsdatascience.com/the-tiny-giant-mobilenetv1/ [Accessed September 25, 2025].

[2] Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Arxiv. https://arxiv.org/abs/1704.04861 [Accessed April 7, 2025].

[3] Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. Arxiv. https://arxiv.org/abs/1801.04381 [Accessed April 12, 2025].

[4] Kaiming He et al. Deep Residual Learning for Image Recognition. Arxiv. https://arxiv.org/abs/1512.03385 [Accessed April 12, 2025].

[5] Zhuang Liu et al. A ConvNet for the 2020s. Arxiv. https://arxiv.org/abs/2201.03545 [Accessed April 12, 2025].

[6] Image created originally by author.

[7] mobilenet_v2. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v2.html#mobilenet-v2 [Accessed April 12, 2025].

[8] MuhammadArdiPutra. The Smarter Tiny Giant — MobileNetV2. GitHub. medium_articles/The Smarter Tiny Giant — MobileNetV2.ipynb at main · MuhammadArdiPutra/medium_articles [Accessed April 12, 2025].
