
    MobileNetV1 Paper Walkthrough: The Tiny Giant

    By Editor Times Featured · September 5, 2025 · 32 Mins Read


    Introduction

    For a long time, deep learning researchers focused on improving accuracy. They kept pushing the limit higher and higher until they eventually realized that the computational complexity of their models was becoming more and more expensive. This was definitely a problem researchers needed to address, because we want deep learning models to work not only on high-end computers but also on small devices. To overcome this issue, Howard et al. back in 2017 proposed an extremely lightweight neural network model called MobileNet, which they introduced in a paper titled MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [1]. In fact, the model proposed in that paper is the first version of MobileNet, commonly known as MobileNetV1. Currently we already have four MobileNet versions: MobileNetV1 all the way to MobileNetV4. However, in this article we are only going to focus on MobileNetV1, covering the idea behind the architecture and how to implement it from scratch with PyTorch. I'll save the later MobileNet versions for my upcoming articles.


    Depthwise Separable Convolution

    To achieve a lightweight model, MobileNet leverages the idea of depthwise separable convolution, which is used nearly throughout the entire network. Figure 1 below displays the structural difference between this layer (right) and a standard convolution layer (left). You can see in the figure that depthwise separable convolution basically consists of two types of convolution layers: depthwise convolution and pointwise convolution. In addition to that, we typically follow the conv-BN-ReLU structure when constructing CNN-based models, which is essentially the reason that in the illustration we have batch normalization and ReLU right after each conv layer. We are going to discuss depthwise and pointwise convolutions more deeply in the following sections.

    Figure 1. The structure of a standard convolution layer (left) and a depthwise separable convolution layer (right) [1].

    Depthwise Convolution

    A standard convolution layer is basically a convolution with the groups parameter set to 1. It is important to remember that in this case using a 3×3 kernel actually means applying a kernel of shape C×3×3 to the input tensor, where C is the number of input channels. This kernel shape allows us to aggregate information from all channels within each 3×3 patch at once. This is the reason the standard convolution operation is computationally expensive, but in return the output tensor contains a lot of information. If you take a closer look at Figure 2 below, a standard convolution layer corresponds to the one in the leftmost part of the tradeoff line.

    Figure 2. The tradeoff between fewer and more convolution groups [2].

    If you are already familiar with group convolution, depthwise convolution should be easy for you to understand. Group convolution is a technique where we divide the channels of the input tensor according to the number of groups used and apply convolution independently within each group. For instance, suppose we have an input tensor of 64 channels and want to process it with 128 kernels grouped into 2. In such a case, the first 64 kernels are responsible for processing the first 32 channels of the input tensor, while the remaining 64 kernels process the last 32 channels. This mechanism results in 64 output channels for each group. The final output tensor is obtained by concatenating the resulting tensors from all groups along the channel dimension, giving a total of 128 channels in this example.
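    As a quick illustration of this mechanism, the snippet below reproduces the example above with PyTorch's groups parameter. The input spatial size of 28×28 is just an arbitrary choice for the demo.

```python
import torch
import torch.nn as nn

# Group convolution from the example above: 64 input channels,
# 128 kernels split into 2 groups, so each kernel only sees 32 channels.
group_conv = nn.Conv2d(in_channels=64, out_channels=128,
                       kernel_size=3, padding=1, groups=2)

x = torch.randn(1, 64, 28, 28)
print(group_conv(x).shape)       # torch.Size([1, 128, 28, 28])

# Each kernel has shape 32x3x3 instead of 64x3x3.
print(group_conv.weight.shape)   # torch.Size([128, 32, 3, 3])
```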

    As we continue increasing the number of groups, we eventually reach the extreme case known as depthwise convolution, which is a special case of group convolution where the number of groups is set equal to the number of input channels. With this configuration, every channel is processed independently of the others, causing each channel in the input to produce only a single output channel. By concatenating all the resulting 1-channel tensors, the final number of output channels stays exactly the same as that of the input. This mechanism requires us to use a kernel of size 1×3×3 instead of C×3×3, preventing us from performing information aggregation along the channel axis. This gives us extremely lightweight computation, but in return causes the output tensor to contain less information due to the absence of channel-wise aggregation.
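    This extreme case is easy to reproduce in PyTorch by setting groups equal to the number of input channels. Again, the 64-channel, 28×28 input here is just an arbitrary example.

```python
import torch
import torch.nn as nn

# Depthwise convolution: one 1x3x3 kernel per input channel.
dw_conv = nn.Conv2d(in_channels=64, out_channels=64,
                    kernel_size=3, padding=1, groups=64)

x = torch.randn(1, 64, 28, 28)
print(dw_conv(x).shape)       # torch.Size([1, 64, 28, 28]) (channel count unchanged)
print(dw_conv.weight.shape)   # torch.Size([64, 1, 3, 3])
```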

    Since the objective of MobileNet is to make the computation as fast as possible, we need to place ourselves at the rightmost part of the above tradeoff line despite capturing the least amount of information. This is definitely a problem that needs to be addressed, which is the reason we employ pointwise convolution in the next step.

    Pointwise Convolution

    Pointwise convolution is basically just a standard convolution, except that it uses kernels of size 1×1, or to be more precise, C×1×1. This kernel shape allows us to aggregate information along the channel axis without being influenced by spatial information, effectively compensating for the limitation of depthwise convolution. Furthermore, remember that depthwise convolution alone can only output a tensor with the same number of channels as its input, which limits our flexibility in designing the model architecture. By applying pointwise convolution in the next step, we can set it to return as many channels as we want, allowing us to adapt the layer to the next one as needed.
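    Here is what that looks like in isolation, using an arbitrary 64-to-128-channel example:

```python
import torch
import torch.nn as nn

# Pointwise convolution: 1x1 kernels mix information across channels
# and are free to change the channel count (64 -> 128 here).
pw_conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=1)

x = torch.randn(1, 64, 28, 28)
print(pw_conv(x).shape)       # torch.Size([1, 128, 28, 28])
print(pw_conv.weight.shape)   # torch.Size([128, 64, 1, 1])
```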

    We can think of depthwise convolution and pointwise convolution as two complementary processes, where the former focuses on capturing spatial relationships while the latter captures channel relationships. These two processes might seem a bit inefficient at first glance, since we could basically do both at once using a standard convolution layer. However, if we take a closer look at the computational complexity, depthwise separable convolution is far more lightweight than its standard convolution counterpart. In the next section I'll discuss in more detail how we can calculate the number of parameters in these two approaches, which definitely also affects the computational complexity.

    Parameter Count Calculation

    Suppose we have an image of size 3×H×W, where H and W are the height and width of the image, respectively. For the sake of this example, let's assume that we are about to process the image with 16 kernels of size 5×5, where the stride is set to 1 and the padding is set to 2 (which in this case is equivalent to padding = same). With this configuration, the size of the output tensor is going to be 16×H×W. If we use a standard convolution layer, the number of parameters will be 5×5×3×16 = 1200 (without bias), where this number is obtained based on the equation in Figure 3. The use of a bias term is not strictly necessary in this case, but if we include it, the total number of parameters becomes (5×5×3+1) × 16 = 1216.

    Figure 3. Equation to calculate the number of parameters of a convolution layer [2].

    Now let's calculate the parameter count of the depthwise separable convolution counterpart that produces the very same tensor dimension. Following the same formula, we will have 5×5×1×3 = 75 for the depthwise convolution part (without bias), or, if we also account for the biases, (5×5×1+1) × 3 = 78 trainable params. In the case of depthwise convolution like this, the number of input channels is considered 1 since each kernel is responsible for processing a single channel only. As for the pointwise convolution part, the number of parameters will be 1×1×3×16 = 48 (without bias) or (1×1×3+1) × 16 = 64 (with bias). Now to obtain the total number of parameters in the entire depthwise separable convolution, we can simply calculate 75+48 = 123 (without bias) or 78+64 = 142 (with bias). That's nearly a 90% reduction in parameter count compared with the standard convolution! In theory, such an extreme drop in parameter count should cause the model to have much lower capacity. But that's just the theory. Later I'll show you how MobileNet manages to keep up with other models in terms of accuracy.
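    We can verify this arithmetic directly in PyTorch by counting the parameters of each layer. The count_params helper is just a small utility written for this check, not part of the MobileNet implementation.

```python
import torch.nn as nn

def count_params(module):
    # Total number of trainable values across all parameter tensors
    return sum(p.numel() for p in module.parameters())

# Standard 5x5 convolution: 3 -> 16 channels, no bias
std = nn.Conv2d(3, 16, kernel_size=5, padding=2, bias=False)
print(count_params(std))                    # 1200

# Depthwise separable counterpart, also without biases
dw = nn.Conv2d(3, 3, kernel_size=5, padding=2, groups=3, bias=False)
pw = nn.Conv2d(3, 16, kernel_size=1, bias=False)
print(count_params(dw), count_params(pw))   # 75 48
print(count_params(dw) + count_params(pw))  # 123
```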


    The Detailed MobileNetV1 Architecture

    Figure 4 below displays the entire MobileNetV1 architecture in detail. The depthwise convolution layers are the rows marked with dw, while the pointwise convolutions are the ones having a 1×1 filter shape. Notice that every dw layer is always followed by a 1×1 convolution, indicating that the entire architecture essentially consists of depthwise separable convolutions. Furthermore, if you take a closer look at the architecture, you will see that spatial downsampling is done by the depthwise convolutions that have a stride of 2 (notice the rows with s2 in the table). Here you can see that every time we reduce the spatial dimension by half, the number of channels doubles to compensate for the loss of spatial information.

    Figure 4. The entire MobileNetV1 architecture [1].

    Width and Resolution Multiplier

    The authors of MobileNet proposed a new parameter tuning mechanism by introducing the so-called width and resolution multipliers, formally denoted as α and ρ, respectively. The α parameter can technically be adjusted freely, but the authors suggest using either 1.0, 0.75, 0.5, or 0.25. This parameter works by reducing the number of channels produced by all convolution layers. For instance, if we set α to 0.5, the first convolution layer in the network will turn the 3-channel input into 16 channels instead of 32. On the other hand, ρ is used to adjust the spatial dimension of the input tensor. It is important to note that even though ideally we would assign a floating-point number to this parameter, in practice it is preferable to directly pick the exact resolution of the input image. In this case, the authors recommend using either 224, 192, 160 or 128, where the input size of 224×224 corresponds to ρ = 1. The architecture displayed in Figure 4 above follows the default configuration where both α and ρ are set to 1.
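    The effect of α on channel counts can be sketched in a couple of lines. The scaled helper is my own shorthand for this illustration, not something from the paper.

```python
# Width multiplier: every channel count in the network is scaled by alpha
# and truncated to an integer.
def scaled(channels, alpha):
    return int(channels * alpha)

for alpha in (1.0, 0.75, 0.5, 0.25):
    print(alpha, [scaled(c, alpha) for c in (32, 64, 128, 1024)])
# With alpha = 0.5 the first layer outputs 16 channels instead of 32,
# matching the example above.
```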


    Experimental Results

    The authors conducted plenty of experiments to prove the robustness of MobileNet. The first result to discuss is the one displayed in Figure 5 below, where in this experiment they tried to find out how the use of depthwise separable convolution layers affects performance. The second row of the table shows the result obtained by the architecture I showed you earlier in Figure 4, while the first row is the result when those layers are replaced with standard convolutions. Here we can see that the accuracy of MobileNet with standard convolutions is indeed higher than that of the one using depthwise separable convolutions. However, if we pay attention to the number of multiplications and additions (mult-adds) as well as the parameter count, we can clearly see that the one with standard convolution layers requires far more computational cost and memory usage just to make a slight improvement in accuracy. Thus, even though depthwise separable convolutions significantly reduce MobileNet's complexity, the authors proved that the model capacity remains high.

    Figure 5. Performance comparison between MobileNet with depthwise separable convolution layers (second row) and its full-convolution counterpart (first row) [1].

    The α and ρ parameters I explained earlier are primarily there to provide flexibility, considering that not all tasks require the highest MobileNet capacity. The authors originally conducted experiments on the 1000-class ImageNet dataset, but in practice we might only need the model to perform classification on a dataset with fewer classes. In such a case, selecting lower values for the two parameters might be preferable, as it can speed up the inference process while the model still has enough capacity to accommodate the classification task. Speaking more specifically about α, using a smaller value for this parameter causes MobileNet to have lower accuracy. But that's the result on the 1000-class dataset. If our dataset is simpler and has fewer classes, using a smaller α might still be fine. In Figure 6 below the values 1.0, 0.75, 0.5, and 0.25 written next to each model correspond to the α used.

    Figure 6. How the width multiplier affects model accuracy, number of operations, and parameter count [1].

    The same thing also applies to the ρ parameter, which is responsible for changing the resolution of the input image. Figure 7 below displays what the experimental results look like when we use different input resolutions. The results are somewhat similar to the ones in the previous figure, where the accuracy score decreases as we make the input image smaller. It is important to realize that reducing the input resolution like this also reduces the number of operations but does not affect the parameter count. This is mainly because the values counted as parameters are the weights and biases, which in the case of a CNN correspond to the values inside the kernels. So, the parameter count will remain the same as long as we don't change the configuration of the convolution layers. The number of operations, on the other hand, gets reduced in accordance with the decrease in input resolution, since smaller images contain fewer pixels to process than larger ones.
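    A quick sanity check of this point: the same convolution layer applied to two different input resolutions has the same parameter count, while the output (and hence the amount of computation) shrinks with the input.

```python
import torch
import torch.nn as nn

# A single stride-2 convolution layer; its weights are fixed once created.
conv = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False)
n_params = sum(p.numel() for p in conv.parameters())

for size in (224, 128):
    out = conv(torch.randn(1, 3, size, size))
    print(size, tuple(out.shape), n_params)
# 224 (1, 32, 112, 112) 864
# 128 (1, 32, 64, 64) 864
```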

    Figure 7. How input resolution affects model accuracy, number of operations, and parameter count [1].

    Instead of just evaluating different values of α and ρ, the authors also compared MobileNet with other popular models. We can see in Figure 8 that the largest MobileNet variant (the one using maximum α and ρ) achieved comparable accuracy to GoogLeNet (InceptionV1) and VGG16 while maintaining the lowest computational complexity. This is basically the reason that I named this article The Tiny Giant: lightweight yet powerful.

    Figure 8. MobileNet achieves comparable accuracy to popular models while maintaining a much lower computational complexity and parameter count [1].

    Furthermore, the authors also compared the smaller MobileNet variant with other small models. What's interesting to me in Figure 9 is that even though the parameter count of SqueezeNet is lower than MobileNet's, the number of operations in MobileNet is over 22 times smaller than in SqueezeNet while still maintaining higher accuracy.

    Figure 9. The performance of the smaller MobileNet variant compared to popular small models [1].

    MobileNetV1 Implementation

    Now that we understand the idea behind MobileNetV1, we can jump into the code. The architecture I'm about to implement is based on the table in Figure 4. As always, the first thing we need to do is import the required modules.

    # Codeblock 1
    import torch
    import torch.nn as nn
    from torchinfo import summary

    Next, we initialize several configurable parameters so that we can adjust the model size according to our needs. In Codeblock 2 below, I denote α as ALPHA, whose value can be changed to 0.75, 0.5 or 0.25 if we want the model to be smaller. We don't specify any variable for ρ since we can directly change IMAGE_SIZE to 192, 160 or 128 as we discussed earlier.

    # Codeblock 2
    BATCH_SIZE  = 1
    IMAGE_SIZE  = 224
    IN_CHANNELS = 3
    NUM_CLASSES = 1000
    ALPHA       = 1

    First Convolution

    If we go back to Figure 4, we can see that MobileNet mostly consists of repeating patterns, i.e., a depthwise convolution followed by a pointwise convolution. However, notice that the first row in the figure doesn't follow this pattern, as it is actually just a standard convolution layer. For this reason, we need to create a separate class for it, which I refer to as FirstConv in Codeblock 3 below.

    # Codeblock 3
    class FirstConv(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(in_channels=3, 
                                  out_channels=int(32*ALPHA),    #(1)
                                  kernel_size=3,    #(2)
                                  stride=2,         #(3)
                                  padding=1,        #(4)
                                  bias=False)       #(5)
            self.bn = nn.BatchNorm2d(num_features=int(32*ALPHA))
            self.relu = nn.ReLU()
        
        def forward(self, x):
            x = self.relu(self.bn(self.conv(x)))
            return x

    Remember that MobileNet follows the conv-BN-ReLU structure. Thus, we need to initialize these three layers within the __init__() method of this class. The convolution layer itself is set to accept 3 input channels and output 32 channels. Since we want this number of output channels to be adjustable, we multiply it by ALPHA at the line marked with #(1). Keep in mind that we need to cast the result to an integer after the multiplication, since a floating-point channel count makes no sense. Next, at lines #(2) and #(3) we set the kernel size to 3 and the stride to 2. With this configuration, the spatial dimension of the resulting tensor is going to be half that of the input. Additionally, using a 3×3 kernel like this implicitly requires us to set the padding to 1 to achieve padding = same (#(4)). In this case we are not going to utilize the bias term, which is the reason we set the bias parameter to False (#(5)). This is actually a common practice when using the conv-BN-ReLU structure, since at the end of the day the output distribution of the convolution will be re-centered around 0 by the batch normalization layer, cancelling out any biases applied by the convolution kernel.

    To find out whether the FirstConv class works properly, we are going to test it with Codeblock 4 below. Here we initialize the layer and pass a tensor simulating a single RGB image of size 224×224. You can see in the resulting output that our convolution layer successfully downsampled the spatial dimension to 112×112 while at the same time expanding the number of channels to 32.

    # Codeblock 4
    first_conv = FirstConv()
    x = torch.randn((1, 3, 224, 224))
    
    out = first_conv(x)
    out.shape
    # Codeblock 4 Output
    torch.Size([1, 32, 112, 112])

    Depthwise Separable Convolutions

    With the first convolution done, we can now work on the repeating depthwise-pointwise layers. Since this pattern is the core idea of depthwise separable convolution, in the following code I wrap the two types of conv layers in a class called DepthwiseSeparableConv.

    # Codeblock 5
    class DepthwiseSeparableConv(nn.Module):
        def __init__(self, in_channels, out_channels, downsample=False):  #(1)
            super().__init__()
            
            in_channels  = int(in_channels*ALPHA)    #(2)
            out_channels = int(out_channels*ALPHA)   #(3)
            
            if downsample:    #(4)
                stride = 2
            else:
                stride = 1
            
            self.dwconv = nn.Conv2d(in_channels=in_channels,
                                    out_channels=in_channels,     #(5)
                                    kernel_size=3,                #(6)
                                    stride=stride,                #(7)
                                    padding=1,
                                    groups=in_channels,           #(8)
                                    bias=False)
            self.bn0 = nn.BatchNorm2d(num_features=in_channels)   #(9)
            
            self.pwconv = nn.Conv2d(in_channels=in_channels,   
                                    out_channels=out_channels,    #(10)
                                    kernel_size=1,                #(11)
                                    stride=1,                     #(12)
                                    padding=0,                    #(13)
                                    groups=1,                     #(14)
                                    bias=False)
            self.bn1 = nn.BatchNorm2d(num_features=out_channels)  #(15)
            
            self.relu = nn.ReLU()    #(16)
    
        def forward(self, x):
            print(f'original\t: {x.size()}')
            
            x = self.relu(self.bn0(self.dwconv(x)))
            print(f'after dw conv\t: {x.size()}')
            
            x = self.relu(self.bn1(self.pwconv(x)))
            print(f'after pw conv\t: {x.size()}')
            
            return x

    Unlike FirstConv, which doesn't take any input argument in the initialization phase, here we set the DepthwiseSeparableConv class to take several inputs, as shown at line #(1) in Codeblock 5 above. I do this because we want the class to be reusable across all depthwise separable convolution layers throughout the entire network, each of which behaves slightly differently from the others.

    We can see in Figure 4 that after the 3-channel image is expanded to 32 channels by the first layer, this channel count increases to 64, 128, and so on all the way to 1024 in the subsequent layers. This is basically the reason that I set this class to accept the number of input and output channels (in_channels and out_channels), so that we can initialize the layer with flexible channel configurations. It is also important to realize that we need to adjust these channel counts based on ALPHA, which can simply be done using the code at lines #(2) and #(3). Additionally, here I also create a flag called downsample as an input parameter, which is set to False by default. This flag determines whether the layer will reduce the spatial dimension. Again, if you go back to Figure 4, you'll notice that there are cases where we reduce the spatial dimension by half and other cases where the dimension is preserved. Whenever we want to perform downsampling, we need to set the stride to 2; otherwise we set this parameter to 1 (#(4)).

    Still in Codeblock 5 above, the next thing we need to do is initialize the layers themselves. As we discussed earlier, the depthwise convolution is responsible for capturing spatial relationships between pixels, which is exactly the reason that the kernel size is set to 3×3 (#(6)). In order for the input channels to be processed independently of each other, we simply set the groups and out_channels parameters to be the same as the number of input channels itself (#(8) and #(5)). It is worth noting that if we set out_channels to be larger than the number of input channels, say, twice as large, then each channel will be processed by 2 kernels. Finally for the depthwise convolution layer, the stride parameter at line #(7) can either be 1 or 2, determined by the downsample flag we discussed earlier.

    Meanwhile, the pointwise convolution uses a 1×1 kernel (#(11)) since it is not intended to capture spatial information. This is actually the reason we set the padding to 0 (#(13)): there is no way this kernel size can reduce the spatial dimension by itself. The groups parameter, on the other hand, is set to 1 (#(14)) because we want this layer to capture information from all channels at once. Unlike the depthwise convolution layer, here we can employ as many kernels as needed, which corresponds to the number of channels in the resulting output tensor (#(10)). The stride, meanwhile, is fixed to 1 (#(12)) since we will never perform downsampling with this layer.

    Here we need to initialize two separate batch normalization layers to be placed after the depthwise and pointwise convs (#(9) and #(15)). As for the ReLU activation function, we only need to initialize it once (#(16)), since it is just a mapping function without any trainable parameters. Because of this, we can reuse the same ReLU instance multiple times within the network.

    Now let's see if our DepthwiseSeparableConv class works properly by passing a dummy tensor through it. Here I've prepared two test cases for this class. The first one is when we don't perform downsampling and the second is when we do. In Figure 10 below, the two tests I want to perform involve the layers highlighted in green and blue, respectively.

    Figure 10. The layers highlighted in green and blue are the ones we are going to simulate to test the DepthwiseSeparableConv class [1][2].

    To create the green part, we can simply use the DepthwiseSeparableConv class and set the number of input and output channels to 32 and 64, as seen in Codeblock 6 below (#(1–2)). Passing downsample = False is not strictly necessary since it is already the default configuration (#(3)), but I do it anyway for the sake of clarity. The shape of the dummy tensor x is also configured to have the size of 32×112×112, which matches exactly with the input shape of the layer (#(4)).

    # Codeblock 6
    depthwise_sep_conv = DepthwiseSeparableConv(in_channels=32,     #(1)
                                                out_channels=64,    #(2)
                                                downsample=False)   #(3)
    x = torch.randn((1, int(32*ALPHA), 112, 112))                   #(4)
    
    x = depthwise_sep_conv(x)

    If you run the above code, the following output should appear on your screen. Here you can see that the depthwise convolution layer returns a tensor of the very same shape as the input (#(1)). The number of channels then doubles from 32 to 64 after the tensor is processed by the pointwise convolution (#(2)). This result proves that our DepthwiseSeparableConv class works properly for the non-downsampling case. We'll use this output tensor in the next test as the input for the blue layer.

    # Codeblock 6 Output
    original       : torch.Size([1, 32, 112, 112])
    after dw conv  : torch.Size([1, 32, 112, 112])    #(1)
    after pw conv  : torch.Size([1, 64, 112, 112])    #(2)

    The second test is quite similar to the first one, except that here we need to configure the model based on the number of input and output channels of the blue layer. On top of that, the downsample parameter also needs to be set to True since we want the layer to reduce the spatial dimension by half. See Codeblock 7 below for the details.

    # Codeblock 7
    depthwise_sep_conv = DepthwiseSeparableConv(in_channels=64, 
                                                out_channels=128,
                                                downsample=True)
    
    x = depthwise_sep_conv(x)
    # Codeblock 7 Output
    original       : torch.Size([1, 64, 112, 112])
    after dw conv  : torch.Size([1, 64, 56, 56])    #(1)
    after pw conv  : torch.Size([1, 128, 56, 56])   #(2)

    We can see in the above output that the spatial downsampling works properly, as the depthwise convolution layer successfully converted the 112×112 feature map to 56×56 (#(1)). The channel axis is finally expanded to 128 with the help of the pointwise convolution layer (#(2)), making it ready to be fed into the next layer.

    Based on the two tests I demonstrated above, it is confirmed that our DepthwiseSeparableConv class is correct and thus ready to be used to construct the entire MobileNetV1 architecture.


    The Overall MobileNetV1 Architecture

    I wrap everything inside a class which I refer to as MobileNetV1. Since this class is quite long, I break it down into Codeblocks 8a and 8b. If you want to run this code yourself, just make sure that these two codeblocks are written within the same notebook cell.

    Now let's start with the __init__() method of this class. The first thing to do here is initialize the FirstConv layer we created earlier (#(1)). The next layers we need to initialize are the core idea of MobileNet, i.e., the depthwise separable convolutions, where every single one of these layers consists of depthwise and pointwise convs. In this implementation I decided to name these pairs starting from depthwise_sep_conv0 all the way to depthwise_sep_conv8. If you go back to Figure 4, you'll notice that the downsampling layers alternate with the non-downsampling ones. This can simply be implemented by setting the downsample flag to True for layers #1, 3, 5 and 7. The depthwise_sep_conv6 is a bit special since it is actually not a standalone layer. Rather, it is a group of depthwise separable convolutions of the very same specification repeated 5 times.

    # Codeblock 8a
    class MobileNetV1(nn.Module):
        def __init__(self):
            super().__init__()
            
            self.first_conv = FirstConv()    #(1)
            
            self.depthwise_sep_conv0 = DepthwiseSeparableConv(in_channels=32, 
                                                              out_channels=64)
            
            self.depthwise_sep_conv1 = DepthwiseSeparableConv(in_channels=64, 
                                                              out_channels=128, 
                                                              downsample=True)
            
            self.depthwise_sep_conv2 = DepthwiseSeparableConv(in_channels=128, 
                                                              out_channels=128)
            
            self.depthwise_sep_conv3 = DepthwiseSeparableConv(in_channels=128, 
                                                              out_channels=256, 
                                                              downsample=True)
            
            self.depthwise_sep_conv4 = DepthwiseSeparableConv(in_channels=256, 
                                                              out_channels=256)
            
            self.depthwise_sep_conv5 = DepthwiseSeparableConv(in_channels=256, 
                                                              out_channels=512, 
                                                              downsample=True)
            
            self.depthwise_sep_conv6 = nn.ModuleList(
                [DepthwiseSeparableConv(in_channels=512, out_channels=512) for _ in range(5)]
            )
            
            self.depthwise_sep_conv7 = DepthwiseSeparableConv(in_channels=512, 
                                                              out_channels=1024, 
                                                              downsample=True)
            
            self.depthwise_sep_conv8 = DepthwiseSeparableConv(in_channels=1024,  #(2)
                                                              out_channels=1024)
            
            num_out_channels = self.depthwise_sep_conv8.pwconv.out_channels      #(3)
            
            self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))      #(4)
            self.fc = nn.Linear(in_features=num_out_channels,           #(5)
                                out_features=NUM_CLASSES)
            self.softmax = nn.Softmax(dim=1)                            #(6)

    As we’ve reached the last DepthwiseSeparableConv layer (#(2)), what we need to do next is initialize three more layers: an average pooling layer (#(4)), a fully-connected layer (#(5)), and a softmax activation function (#(6)). One thing to keep in mind is that the number of output channels produced by depthwise_sep_conv8 is not always 1024, even though it appears to be fixed to that number. In fact, this output channel count will be different if we change the ALPHA. In order to make our implementation adaptive to such changes, we take the actual number of output channels using the code at line #(3), which is then used as the input size of the fully-connected layer (#(5)).
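    To illustrate why reading the channel count at line #(3) matters, here is a hedged sketch of how the width multiplier ALPHA (defined earlier in the article; the value 0.5 below is just an assumption for illustration) scales every channel count in the network, so the last value is what the fully-connected layer must accept:

    ```python
    ALPHA = 0.5  # hypothetical width multiplier, only for this illustration

    # Baseline channel counts of the MobileNetV1 stages as listed in Codeblock 8a.
    base_channels = [32, 64, 128, 128, 256, 256, 512, 512, 1024, 1024]
    scaled = [int(c * ALPHA) for c in base_channels]

    print(scaled)      # every stage shrinks by the same factor
    print(scaled[-1])  # 512 -> this becomes the fc layer's in_features
    ```

    With ALPHA = 0.5 the final stage produces 512 channels instead of 1024, which is exactly why hardcoding 1024 as the fc input size would break the model.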

    Regarding the forward() method in Codeblock 8b, I think there’s nothing I need to explain, since what we basically do here is just pass a tensor from one layer to the next.

    # Codeblock 8b
        def forward(self, x):
            x = self.first_conv(x)
            print(f"after first_conv\t\t: {x.shape}")
            
            x = self.depthwise_sep_conv0(x)
            print(f"after depthwise_sep_conv0\t: {x.shape}")
            
            x = self.depthwise_sep_conv1(x)
            print(f"after depthwise_sep_conv1\t: {x.shape}")
            
            x = self.depthwise_sep_conv2(x)
            print(f"after depthwise_sep_conv2\t: {x.shape}")
            
            x = self.depthwise_sep_conv3(x)
            print(f"after depthwise_sep_conv3\t: {x.shape}")
            
            x = self.depthwise_sep_conv4(x)
            print(f"after depthwise_sep_conv4\t: {x.shape}")
            
            x = self.depthwise_sep_conv5(x)
            print(f"after depthwise_sep_conv5\t: {x.shape}")
            
            for i, layer in enumerate(self.depthwise_sep_conv6):
                x = layer(x)
                print(f"after depthwise_sep_conv6 #{i}\t: {x.shape}")
            
            x = self.depthwise_sep_conv7(x)
            print(f"after depthwise_sep_conv7\t: {x.shape}")
            
            x = self.depthwise_sep_conv8(x)
            print(f"after depthwise_sep_conv8\t: {x.shape}")
            
            x = self.avgpool(x)
            print(f"after avgpool\t\t\t: {x.shape}")
            
            x = torch.flatten(x, start_dim=1)
            print(f"after flatten\t\t\t: {x.shape}")
            
            x = self.fc(x)
            print(f"after fc\t\t\t: {x.shape}")
            
            x = self.softmax(x)
            print(f"after softmax\t\t\t: {x.shape}")
            
            return x

    Now let’s see if our MobileNetV1 works properly by running the following test code.

    # Codeblock 9
    mobilenetv1 = MobileNetV1()
    x = torch.randn((BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE))
    
    out = mobilenetv1(x)

    And below is what the output looks like. Here we can see that our dummy image tensor successfully went through the first_conv layer all the way to the final output layer. Throughout the convolution stage, the spatial dimension decreases as we get into the deeper layers, while at the same time the number of channels increases. Afterwards, we apply an average pooling layer, which works by taking the average value from each channel. We can say that at this point every single channel of size 7×7 is now represented as a single value, which is the reason the spatial dimension dropped to 1×1 (#(1)). This tensor is then flattened (#(2)) so that we can process it further with the fully-connected layer (#(3)).

    # Codeblock 9 Output
    after first_conv             : torch.Size([1, 32, 112, 112])
    after depthwise_sep_conv0    : torch.Size([1, 64, 112, 112])
    after depthwise_sep_conv1    : torch.Size([1, 128, 56, 56])
    after depthwise_sep_conv2    : torch.Size([1, 128, 56, 56])
    after depthwise_sep_conv3    : torch.Size([1, 256, 28, 28])
    after depthwise_sep_conv4    : torch.Size([1, 256, 28, 28])
    after depthwise_sep_conv5    : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #0 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #1 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #2 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #3 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #4 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv7    : torch.Size([1, 1024, 7, 7])
    after depthwise_sep_conv8    : torch.Size([1, 1024, 7, 7])
    after avgpool                : torch.Size([1, 1024, 1, 1])    #(1)
    after flatten                : torch.Size([1, 1024])          #(2)
    after fc                     : torch.Size([1, 1000])          #(3)
    after softmax                : torch.Size([1, 1000])
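    The averaging step described above can be verified in isolation. The following sketch checks that AdaptiveAvgPool2d with output size (1, 1) reduces each 7×7 channel to its mean value, and that flattening then yields the vector of length 1024 that goes into the fully-connected layer:

    ```python
    import torch
    from torch import nn

    x = torch.randn(1, 1024, 7, 7)

    # Adaptive average pooling to 1x1 is equivalent to taking the mean over
    # the two spatial dimensions of each channel.
    pooled = nn.AdaptiveAvgPool2d(output_size=(1, 1))(x)
    manual = x.mean(dim=(2, 3), keepdim=True)

    print(torch.allclose(pooled, manual, atol=1e-6))         # True
    print(torch.flatten(pooled, start_dim=1).shape)          # torch.Size([1, 1024])
    ```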

    If you want an even more detailed view of the architecture, we can use the summary() function from torchinfo we imported earlier. If you scroll down the resulting output below, you can see that this model contains roughly 4.2 million trainable parameters, which matches the number written in Figures 5, 6, 7 and 8. I also tried to initialize the same model with different ALPHA, and I found that the numbers match the table in Figure 6. For this reason, I believe our MobileNetV1 implementation is correct.

    # Codeblock 10
    mobilenetv1 = MobileNetV1()
    summary(mobilenetv1, input_size=(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE))
    # Codeblock 10 Output
    ==========================================================================================
    Layer (type:depth-idx)                   Output Shape              Param #
    ==========================================================================================
    MobileNetV1                              [1, 1000]                 --
    ├─FirstConv: 1-1                         [1, 32, 112, 112]         --
    │    └─Conv2d: 2-1                       [1, 32, 112, 112]         864
    │    └─BatchNorm2d: 2-2                  [1, 32, 112, 112]         64
    │    └─ReLU: 2-3                         [1, 32, 112, 112]         --
    ├─DepthwiseSeparableConv: 1-2            [1, 64, 112, 112]         --
    │    └─Conv2d: 2-4                       [1, 32, 112, 112]         288
    │    └─BatchNorm2d: 2-5                  [1, 32, 112, 112]         64
    │    └─ReLU: 2-6                         [1, 32, 112, 112]         --
    │    └─Conv2d: 2-7                       [1, 64, 112, 112]         2,048
    │    └─BatchNorm2d: 2-8                  [1, 64, 112, 112]         128
    │    └─ReLU: 2-9                         [1, 64, 112, 112]         --
    ├─DepthwiseSeparableConv: 1-3            [1, 128, 56, 56]          --
    │    └─Conv2d: 2-10                      [1, 64, 56, 56]           576
    │    └─BatchNorm2d: 2-11                 [1, 64, 56, 56]           128
    │    └─ReLU: 2-12                        [1, 64, 56, 56]           --
    │    └─Conv2d: 2-13                      [1, 128, 56, 56]          8,192
    │    └─BatchNorm2d: 2-14                 [1, 128, 56, 56]          256
    │    └─ReLU: 2-15                        [1, 128, 56, 56]          --
    ├─DepthwiseSeparableConv: 1-4            [1, 128, 56, 56]          --
    │    └─Conv2d: 2-16                      [1, 128, 56, 56]          1,152
    │    └─BatchNorm2d: 2-17                 [1, 128, 56, 56]          256
    │    └─ReLU: 2-18                        [1, 128, 56, 56]          --
    │    └─Conv2d: 2-19                      [1, 128, 56, 56]          16,384
    │    └─BatchNorm2d: 2-20                 [1, 128, 56, 56]          256
    │    └─ReLU: 2-21                        [1, 128, 56, 56]          --
    ├─DepthwiseSeparableConv: 1-5            [1, 256, 28, 28]          --
    │    └─Conv2d: 2-22                      [1, 128, 28, 28]          1,152
    │    └─BatchNorm2d: 2-23                 [1, 128, 28, 28]          256
    │    └─ReLU: 2-24                        [1, 128, 28, 28]          --
    │    └─Conv2d: 2-25                      [1, 256, 28, 28]          32,768
    │    └─BatchNorm2d: 2-26                 [1, 256, 28, 28]          512
    │    └─ReLU: 2-27                        [1, 256, 28, 28]          --
    ├─DepthwiseSeparableConv: 1-6            [1, 256, 28, 28]          --
    │    └─Conv2d: 2-28                      [1, 256, 28, 28]          2,304
    │    └─BatchNorm2d: 2-29                 [1, 256, 28, 28]          512
    │    └─ReLU: 2-30                        [1, 256, 28, 28]          --
    │    └─Conv2d: 2-31                      [1, 256, 28, 28]          65,536
    │    └─BatchNorm2d: 2-32                 [1, 256, 28, 28]          512
    │    └─ReLU: 2-33                        [1, 256, 28, 28]          --
    ├─DepthwiseSeparableConv: 1-7            [1, 512, 14, 14]          --
    │    └─Conv2d: 2-34                      [1, 256, 14, 14]          2,304
    │    └─BatchNorm2d: 2-35                 [1, 256, 14, 14]          512
    │    └─ReLU: 2-36                        [1, 256, 14, 14]          --
    │    └─Conv2d: 2-37                      [1, 512, 14, 14]          131,072
    │    └─BatchNorm2d: 2-38                 [1, 512, 14, 14]          1,024
    │    └─ReLU: 2-39                        [1, 512, 14, 14]          --
    ├─ModuleList: 1-8                        --                        --
    │    └─DepthwiseSeparableConv: 2-40      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-1                  [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-2             [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-3                    [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-4                  [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-5             [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-6                    [1, 512, 14, 14]          --
    │    └─DepthwiseSeparableConv: 2-41      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-7                  [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-8             [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-9                    [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-10                 [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-11            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-12                   [1, 512, 14, 14]          --
    │    └─DepthwiseSeparableConv: 2-42      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-13                 [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-14            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-15                   [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-16                 [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-17            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-18                   [1, 512, 14, 14]          --
    │    └─DepthwiseSeparableConv: 2-43      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-19                 [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-20            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-21                   [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-22                 [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-23            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-24                   [1, 512, 14, 14]          --
    │    └─DepthwiseSeparableConv: 2-44      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-25                 [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-26            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-27                   [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-28                 [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-29            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-30                   [1, 512, 14, 14]          --
    ├─DepthwiseSeparableConv: 1-9            [1, 1024, 7, 7]           --
    │    └─Conv2d: 2-45                      [1, 512, 7, 7]            4,608
    │    └─BatchNorm2d: 2-46                 [1, 512, 7, 7]            1,024
    │    └─ReLU: 2-47                        [1, 512, 7, 7]            --
    │    └─Conv2d: 2-48                      [1, 1024, 7, 7]           524,288
    │    └─BatchNorm2d: 2-49                 [1, 1024, 7, 7]           2,048
    │    └─ReLU: 2-50                        [1, 1024, 7, 7]           --
    ├─DepthwiseSeparableConv: 1-10           [1, 1024, 7, 7]           --
    │    └─Conv2d: 2-51                      [1, 1024, 7, 7]           9,216
    │    └─BatchNorm2d: 2-52                 [1, 1024, 7, 7]           2,048
    │    └─ReLU: 2-53                        [1, 1024, 7, 7]           --
    │    └─Conv2d: 2-54                      [1, 1024, 7, 7]           1,048,576
    │    └─BatchNorm2d: 2-55                 [1, 1024, 7, 7]           2,048
    │    └─ReLU: 2-56                        [1, 1024, 7, 7]           --
    ├─AdaptiveAvgPool2d: 1-11                [1, 1024, 1, 1]           --
    ├─Linear: 1-12                           [1, 1000]                 1,025,000
    ├─Softmax: 1-13                          [1, 1000]                 --
    ==========================================================================================
    Total params: 4,231,976
    Trainable params: 4,231,976
    Non-trainable params: 0
    Total mult-adds (Units.MEGABYTES): 568.76
    ==========================================================================================
    Input size (MB): 0.60
    Forward/backward pass size (MB): 80.69
    Params size (MB): 16.93
    Estimated Total Size (MB): 98.22
    ==========================================================================================
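    As an aside, the parameter savings that make this model so small can be checked directly against two entries in the summary above: the 4,608-parameter depthwise conv and the 262,144-parameter pointwise conv in the 512-channel stage. The sketch below compares that pair against a standard 3×3 convolution with the same input and output channels:

    ```python
    from torch import nn

    # Standard 3x3 conv mapping 512 -> 512 channels.
    standard = nn.Conv2d(512, 512, kernel_size=3, padding=1, bias=False)
    # Depthwise separable equivalent: 3x3 depthwise + 1x1 pointwise.
    dw = nn.Conv2d(512, 512, kernel_size=3, padding=1, groups=512, bias=False)
    pw = nn.Conv2d(512, 512, kernel_size=1, bias=False)

    n_std = sum(p.numel() for p in standard.parameters())
    n_sep = sum(p.numel() for p in dw.parameters()) + sum(p.numel() for p in pw.parameters())

    print(n_std)  # 2359296
    print(n_sep)  # 266752 (= 4,608 + 262,144, matching the summary rows above)
    ```

    That is roughly an 8.8× reduction in parameters for this single stage, which is where most of MobileNetV1’s efficiency comes from.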

    Ending

    That was pretty much everything about MobileNetV1. I do encourage you to play around with the above model. If you want to train it for image classification, you can modify the number of neurons in the output layer according to the number of classes available in your dataset. You can also try different α and ρ to find the values that best suit your case in terms of accuracy and efficiency. Moreover, since this implementation is done entirely from scratch, it is also possible to change other things that are not explicitly mentioned in the paper, such as the number of repeats of the depthwise_sep_conv6 layer, or even using α and ρ greater than 1. There are basically lots of things to explore in our MobileNetV1 implementation! You can also access the code used in this article in my GitHub repository [3].
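    The output-layer modification mentioned above boils down to swapping the fc attribute. Here is a hedged sketch of the pattern using a tiny stand-in module (the full MobileNetV1 class from Codeblock 8a is too long to repeat here, but the same two lines apply to it):

    ```python
    from torch import nn

    # Hypothetical stand-in for a model ending in a 1000-class fc head.
    class TinyNet(nn.Module):
        def __init__(self, num_classes=1000):
            super().__init__()
            self.features = nn.Conv2d(3, 8, kernel_size=3, padding=1)
            self.fc = nn.Linear(8, num_classes)

    model = TinyNet()
    # Retarget the head to a 10-class dataset, reusing the existing input size.
    model.fc = nn.Linear(in_features=model.fc.in_features, out_features=10)
    print(model.fc.out_features)  # 10
    ```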

    Feel free to comment if you spot any mistake in my explanation or the code. Thanks for reading!


    References

    [1] Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv. https://arxiv.org/abs/1704.04861 [Accessed April 7, 2025].

    [2] Image originally created by the author.

    [3] MuhammadArdiPutra. The Tiny Giant — MobileNetV1. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Tiny%20Giant%20-%20MobileNetV1.ipynb [Accessed April 7, 2025].


