The Channel-Wise Attention | Squeeze and Excitation

When we talk about attention in computer vision, the first thing that probably comes to mind is the one used in the Vision Transformer (ViT) architecture. In fact, that's not the only attention mechanism we have for image data. There is another one called the Squeeze-and-Excitation Network (SENet). While the attention in ViT operates spatially, i.e., assigning weights to different patches of an image, the attention mechanism proposed in SENet operates channel-wise, i.e., assigning weights to different channels. In this article, we're going to discuss how the Squeeze and Excitation architecture works, how to implement it from scratch, and how to integrate the module into the ResNeXt model.

The Squeeze and Excitation Module

SENet, first proposed in the paper "Squeeze-and-Excitation Networks" by Hu et al. [1], isn't a standalone network like VGG, Inception, or ResNet. Instead, it's a building block to be placed on an existing network. In CNN-based models, we assume that pixels spatially close to each other are highly correlated, which is why we use small kernels to capture these correlations. This assumption is the inductive bias of CNNs. SENet, on the other hand, introduces a new inductive bias: the authors assume that each channel contributes differently to predicting a specific class. By applying SE modules to a CNN, the model not only relies on spatial patterns but also captures the importance of each channel. To illustrate this, think of an image of fire, where the red channel would theoretically contribute more to the final prediction than the blue and green channels.

The structure of the SE module itself is shown in Figure 1. As the name suggests, there are two main steps in this module: squeeze and excitation. The squeeze part corresponds to the operation denoted F_sq, while the excitation part consists of both F_ex and F_scale. The F_tr operation, on the other hand, is not part of the SE module. Rather, it represents a transformation function that originally belongs to the model to which the SE module is applied. For example, if we were to put this SE module on ResNet, F_tr would refer to the stack of convolution layers within the bottleneck block.

Figure 1. The structure of the Squeeze and Excitation module [1].

Looking more specifically at the F_sq operation, it works by applying global average pooling, which captures information from the entire spatial dimension of each channel. By doing so, every channel of the input tensor is represented by a single number, which is simply the average value of that channel. The authors refer to this operation as global information embedding. Mathematically, this can be written as the equation shown in Figure 2, where we sum all values across the height H and width W and then divide the result by the number of pixels in that channel (H×W).

Figure 2. The mathematical expression of the global average pooling mechanism in the SE module [1].
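In case the figure is not at hand, the squeeze step can be written out as follows. This is my transcription of the expression in [1], where u_c denotes the c-th channel of the feature map and z_c is the single number that summarizes it:

```latex
z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)
```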

Meanwhile, the excitation and scaling operations are together referred to as adaptive recalibration, since what they do is dynamically adjust the weighting of each channel in the input tensor according to its importance. In fact, the diagram in Figure 1 doesn't completely depict the SE module. You can see in the figure that F_ex appears to be a single operation, yet it actually consists of two linear layers, each followed by an activation function. See Figure 3 below for the details.

Figure 3. The mathematical formulation of the F_ex operation [1].

The two linear layers are denoted W_1 and W_2, while δ and σ represent the ReLU and sigmoid activation functions, respectively. So, based on this mathematical definition, what we need to do later in the implementation is pass tensor z (the average-pooled tensor) through the first linear layer, followed by the ReLU activation function, the second linear layer, and finally the sigmoid activation function. Remember that the sigmoid function squashes input values into the range of 0 to 1. In this case, we interpret the resulting output as the weight of each channel, where a value close to 1 indicates that the corresponding channel contains important information, hence we allow the model to pay more attention to that channel. Conversely, if the resulting number is close to 0, the corresponding channel doesn't contribute much to the output.
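Written out with the symbols just described, the excitation step from [1] looks like the expression below. The shapes of W_1 and W_2 involve the reduction ratio r, which is covered a bit further down in this article:

```latex
s = F_{ex}(z, W) = \sigma\big(W_2 \, \delta(W_1 z)\big),
\qquad W_1 \in \mathbb{R}^{\frac{C}{r} \times C},
\quad W_2 \in \mathbb{R}^{C \times \frac{C}{r}}
```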

In order to make use of these channel weights, we perform the F_scale operation, which is just a multiplication of the original tensor u and the weight tensor s, as shown in Figure 4 below. By doing this, we retain the values within the important channels while at the same time suppressing the values of the unimportant ones.

Figure 4. The scaling process is just a multiplication of the original and the weight tensors [1].
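To make the three steps concrete, here is a tiny worked example in plain Python (no PyTorch), intended only as a sketch. The feature map u and the weights w1 and w2 are made-up toy values, and the helper names (squeeze, linear, excite, scale) are my own for illustration rather than anything from the paper.

```python
import math

def squeeze(u):
    """Global average pooling: one mean per channel (u is C x H x W)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]

def linear(weights, vec):
    """Bias-free linear layer: weights is (out x in), vec has length in."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def excite(z, w1, w2):
    """s = sigmoid(W_2 . relu(W_1 . z))"""
    hidden = [max(0.0, h) for h in linear(w1, z)]
    return [1.0 / (1.0 + math.exp(-o)) for o in linear(w2, hidden)]

def scale(u, s):
    """Multiply every value of channel c by its weight s[c]."""
    return [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(u, s)]

# Toy feature map with C=2 channels of size 2x2 (made-up values).
u = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 8.0]]]

# Hypothetical weights with C=2 and C/r=1.
w1 = [[0.5, 0.5]]      # shape (C/r, C) = (1, 2)
w2 = [[2.0], [-2.0]]   # shape (C, C/r) = (2, 1)

z = squeeze(u)             # [2.5, 2.0], the per-channel averages
s = excite(z, w1, w2)      # one weight per channel, each between 0 and 1
x_tilde = scale(u, s)      # same shape as u, channel-wise rescaled

print(z)
print(s)
```

With these particular toy weights the first channel gets a weight close to 1 and the second a weight close to 0, so the second channel is almost entirely suppressed after scaling.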

By the way, sorry for getting a bit too mathy here, lol. But I believe it will help you understand the code later in the implementation section.

Where to Put the SE Module

Applying the SE module to a plain CNN model like VGG is easy, as we can simply place it right after each convolution layer. However, it might not be as straightforward in the case of Inception or ResNet due to the presence of parallel branches in these two networks. To address this, the authors provide a guide for implementing the SE module on these two models, as shown in Figure 5 below.

Figure 5. Where the SE module is placed in Inception and ResNet [1].

For the Inception model, instead of placing an SE module right after each convolution layer, we pass the input tensor through the entire Inception block (including all the branches inside) and then attach the SE module afterwards. The same approach also works for ResNet, but keep in mind that the summation between the tensor in the skip connection and the main flow happens after the main tensor has been processed by the SE module.

As I mentioned earlier, the excitation stage consists of two linear layers. If we take a closer look at the structure above, we can see that the output shape of the first linear layer is 1×1×C/r. The variable r is called the reduction ratio, which reduces the dimensionality of the weight tensor before it is eventually projected back to 1×1×C by the second linear layer. The dimensionality reduction done by the first layer acts as a bottleneck, which is useful for limiting model complexity and improving generalization. The authors conducted experiments with different r values and found that r = 16 produces a good balance between accuracy and complexity.
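To get a feel for how much the bottleneck saves, the following sketch counts the weights in the two linear layers of a single SE module for a few channel widths. It assumes bias-free layers, and the function names are hypothetical; the "dense" variant is only a made-up point of comparison (a single C-to-C layer), not something proposed in the paper.

```python
def se_fc_params(num_channels, r):
    """Weights in the two bias-free linear layers of one SE module."""
    hidden = num_channels // r
    return num_channels * hidden + hidden * num_channels

def dense_fc_params(num_channels):
    """Hypothetical comparison: one C -> C layer without a bottleneck."""
    return num_channels * num_channels

for c in (256, 512, 1024, 2048):
    print(c, se_fc_params(c, r=16), dense_fc_params(c))
```

With r = 16 the two bottleneck layers together hold 2C²/r weights, which is 8 times fewer than a single dense C×C layer would need.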

Figure 6. Several possible ways to attach the SE module in ResNet [1].

Beyond the standard way of implementing the SE module in ResNet, Figure 6 shows that there are actually several ways to do so. According to the experimental results in Figure 7, the standard SE, SE-PRE, and SE-Identity blocks obtained similar results, while all of them outperformed SE-POST by a significant margin. This suggests that the placement of the SE module affects model accuracy. Based on these findings, the authors argue that we will obtain good results as long as we apply the SE module before the element-wise summation. Later in the coding section, I'm going to demonstrate how to implement the standard SE block.

Figure 7. Experimental results on different SE module integration strategies [1].
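The four variants differ only in where the SE module sits relative to the residual unit and the summation. The sketch below is just a schematic of the wiring order as I understand the paper's description [1]. The function names are hypothetical, and residual_fn and se_fn are scalar toy stand-ins: a real SE module computes an input-dependent weight per channel rather than multiplying by a constant.

```python
def relu(v):
    return max(0.0, v)

def standard_se(x, residual_fn, se_fn):
    # residual unit -> SE -> add skip connection -> ReLU
    return relu(se_fn(residual_fn(x)) + x)

def se_pre(x, residual_fn, se_fn):
    # SE -> residual unit -> add skip connection -> ReLU
    return relu(residual_fn(se_fn(x)) + x)

def se_post(x, residual_fn, se_fn):
    # residual unit -> add skip connection -> ReLU -> SE
    return se_fn(relu(residual_fn(x) + x))

def se_identity(x, residual_fn, se_fn):
    # SE sits on the skip connection, in parallel with the residual unit
    return relu(residual_fn(x) + se_fn(x))

# Toy stand-ins, chosen only so the four orderings give different numbers.
residual_fn = lambda v: 2.0 * v + 1.0
se_fn = lambda v: 0.5 * v

for block in (standard_se, se_pre, se_post, se_identity):
    print(block.__name__, block(3.0, residual_fn, se_fn))
```

Note that SE-POST is the only variant in which the SE module acts after the summation, which matches the authors' observation about applying it before the merge.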

More Experimental Results

There are actually many more experimental results discussed in the paper. One of them is a table showing the accuracy improvements when the SE module is applied to existing CNN-based models. The table I'm referring to is displayed in Figure 8 below.

Figure 8. Experimental results of applying the SE module to different models [1][2].

The columns highlighted in blue represent the error rates of each model, and the ones in pink indicate the computational complexity measured in GFLOPs. The re-implementation column refers to the plain model that the authors implemented themselves, while the SENet column represents the same model equipped with SE modules. The table clearly shows that both top-1 and top-5 errors decrease when the SE module is applied. It is important to note that although adding the SE module increases the GFLOPs, this increase is marginal compared to the reduction in error rate.

Next, we can reveal interesting insights by printing out the values contained in the SE modules during the inference phase. Let's take a look at the charts in Figure 9 below to better illustrate this. The x-axis of these charts denotes the channel index, the y-axis represents how much weight each channel receives according to its importance, and the color of the lines indicates the class being predicted.

Figure 9. What the activations of SE modules look like at different network depths [1].

In shallower layers, the features captured by the SE module are class-agnostic, which means that it captures generic information required to predict all classes. Charts (a) and (b), which are the SE modules from ResNet stages 2 and 3, show that there is not much difference in channel activity from one class to another, indicating that these two modules don't capture information about a specific class. The case is different for the SE modules in deeper layers, i.e., the ones in stage 4 (c) and stage 5 (d). We can see that these two modules adjust channel weights differently depending on the class being predicted. This is the reason that the SE modules in deeper layers are said to be class-specific. However, the authors acknowledge that there might be unusual behavior in some of the SE modules, such as the one in the 2nd block of stage 5 (e). Here the SE module doesn't show meaningful channel recalibration behavior, indicating that it doesn't contribute as much as the ones we discussed earlier.

The Detailed Architecture

In this article we're going to implement the SE-ResNeXt-50 (32×4d) model, which corresponds to the rightmost column in Figure 10. The ResNeXt model itself is similar to ResNet, except that the groups parameter of the second convolution layer within each block is set to 32. If you're familiar with ResNeXt, this is the simplest yet effective way to implement the so-called cardinality. I recommend reading my previous article about ResNeXt if you're not yet familiar with it; the link is provided at reference number [3] at the end of this article.
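As a rough illustration of what setting groups to 32 does, the sketch below counts the weights of a bias-free 3×3 convolution with and without grouping. The helper name conv_weight_count is hypothetical, and the width of 128 channels is chosen to match the 3×3 convolution in the first ResNeXt stage of the 32×4d variant (32 groups of 4 channels each).

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a bias-free 2D convolution with a square kernel."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    # Each output channel only sees in_channels // groups input channels.
    return out_channels * (in_channels // groups) * kernel_size * kernel_size

dense = conv_weight_count(128, 128, 3, groups=1)
grouped = conv_weight_count(128, 128, 3, groups=32)

print(dense)    # 147456
print(grouped)  # 4608
```

The grouped version needs 32 times fewer weights for the same number of input and output channels, which is what lets ResNeXt use wider 3×3 convolutions at a comparable cost.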

Taking a closer look at the architecture, what differentiates SE-ResNet-50 from ResNet-50 is simply the presence of SE modules. The same applies to SE-ResNeXt-50 (32×4d) compared to ResNeXt-50 (32×4d) (not displayed in the table). Notice in the figure below that the models with SE modules have an fc layer attached after the last convolution layer within each block, where the two corresponding numbers indicate the output sizes of the first and second fully-connected layers inside the SE module.

Figure 10. The complete architecture of ResNet-50, SE-ResNet-50 and SE-ResNeXt-50 (32×4d) [1].

From Scratch Implementation

Remember that here we're about to integrate the SE module into ResNeXt, so we need to implement both of them from scratch. Technically speaking, it's possible to take the ResNeXt architecture directly from PyTorch and then manually attach the SE module to it. However, I decided to use the ResNeXt implementation from my previous article instead since I feel it's much easier to understand than the one from PyTorch. Note that here I'll focus on constructing the SE module and how to attach it to the ResNeXt model rather than explaining ResNeXt itself, since I've already covered it in that article [3].

Now let's start the code by importing the required modules.

# Codeblock 1
import torch
import torch.nn as nn

Squeeze and Excitation Module

The following SE module implementation follows the diagram shown in Figure 5 (right). It's worth noting that the SEModule class below doesn't include the skip connection (curved arrow), as the entire SE module is applied after the initial branching but before the merging (summation).

The __init__() method of this class accepts two parameters: num_channels and r, as shown at line #(1) in Codeblock 2a. We definitely want this SE module to be usable throughout the entire network. So, we need to make the num_channels parameter adjustable because the number of output channels varies across ResNeXt blocks at different stages, as shown back in Figure 10. Meanwhile, even though we generally use the same reduction ratio r in the SE modules throughout the network, it's technically possible to use a different r for each stage, which might be an interesting thing to experiment with. This is the reason that I also made the r parameter adjustable.

# Codeblock 2a
class SEModule(nn.Module):
    def __init__(self, num_channels, r):                     #(1)
        super().__init__()
        
        self.global_pooling = nn.AdaptiveAvgPool2d(output_size=(1,1))  #(2)
        self.fc0 = nn.Linear(in_features=num_channels,       #(3)
                             out_features=num_channels//r, 
                             bias=False)
        self.relu = nn.ReLU()                                #(4)
        self.fc1 = nn.Linear(in_features=num_channels//r,    #(5)
                             out_features=num_channels, 
                             bias=False)
        self.sigmoid = nn.Sigmoid()                          #(6)

There are five layers we need to initialize inside the __init__() method. I write them down according to the sequence given in Figure 5, i.e., the global average pooling layer (#(2)), a linear layer (#(3)), the ReLU activation function (#(4)), another linear layer (#(5)), and the sigmoid activation function (#(6)). Here you can see that the first linear layer is responsible for dimensionality reduction, shrinking the number of channels from num_channels to num_channels//r, which is then expanded back to num_channels by the second linear layer. Note that we set the bias of both linear layers to False, which means that we only use the weight tensors. The absence of bias terms in the two layers forces the SE module to learn the correlation between one channel and the others rather than just adding fixed adjustments.

Still within the SEModule class, let's now move on to the forward() method to define the flow of the network. You can see at line #(1) in Codeblock 2b that we start from a single input x, which in the case of ResNeXt is the tensor produced by the third convolution layer (after batch normalization) within the same ResNeXt block. As shown in Figure 5, what we need to do next is to branch out the network. Here we directly process the branch using the global_pooling layer, and I name the resulting tensor squeezed (#(2)). The original input tensor x is left as is since we're not going to perform any operation on it until the scaling phase. Next, we need to drop the spatial dimensions of the squeezed tensor using torch.flatten() (#(3)). This is done because we want to process it further with the linear layers at lines #(4) and #(5), which operate on the last dimension of their input and thus expect one feature vector per sample. The spatial dimensions are then introduced again at line #(6), allowing us to multiply x (the original tensor) by excited (the channel weights) at line #(7). This whole process produces a recalibrated version of x, which we refer to as scaled. Here I print out the tensor size after each step so that you can better understand the flow of this SE module.

# Codeblock 2b
    def forward(self, x):                                  #(1)
        print(f'original\t\t: {x.size()}')

        squeezed = self.global_pooling(x)                  #(2)
        print(f'after avgpool\t\t: {squeezed.size()}')

        squeezed = torch.flatten(squeezed, 1)              #(3)
        print(f'after flatten\t\t: {squeezed.size()}')

        excited = self.relu(self.fc0(squeezed))            #(4)
        print(f'after fc0-relu\t\t: {excited.size()}')

        excited = self.sigmoid(self.fc1(excited))          #(5)
        print(f'after fc1-sigmoid\t: {excited.size()}')

        excited = excited[:, :, None, None]                #(6)
        print(f'after reshape\t\t: {excited.size()}')

        scaled = x * excited                               #(7)
        print(f'after scaling\t\t: {scaled.size()}')

        return scaled

Now we're going to check whether we have implemented the module correctly by passing a dummy tensor through it. In Codeblock 3 below, I initialize an SE module and configure it to accept an image tensor with 512 channels and to use a reduction ratio of 16 (#(1)). If you take a look at the SE-ResNeXt architecture in Figure 10, this SE module corresponds to the one in the third stage (where the output size is 28×28). Thus, at line #(2) we need to set the shape of the dummy tensor accordingly. We then feed this tensor into the module using the code at line #(3).

# Codeblock 3
semodule = SEModule(num_channels=512, r=16)    #(1)
x = torch.randn(1, 512, 28, 28)                #(2)

out = semodule(x)      #(3)

And below is what the print functions give us.

# Codeblock 3 Output
original          : torch.Size([1, 512, 28, 28])    #(1)
after avgpool     : torch.Size([1, 512, 1, 1])      #(2)
after flatten     : torch.Size([1, 512])            #(3)
after fc0-relu    : torch.Size([1, 32])             #(4)
after fc1-sigmoid : torch.Size([1, 512])            #(5)
after reshape     : torch.Size([1, 512, 1, 1])      #(6)
after scaling     : torch.Size([1, 512, 28, 28])    #(7)

You can see that the original tensor shape matches our dummy tensor exactly, i.e., 1×512×28×28 (#(1)). By the way, we can ignore the 1 in the 0th axis since it denotes the batch size, which in this case means we only have a single image in the batch. After being pooled, the spatial dimension collapses to 1×1 since each channel is now represented by a single number (#(2)). The purpose of the flatten operation I explained earlier is to drop the two singleton axes (#(3)) so that the subsequent linear layers receive one feature vector per sample. Here you can see that the first linear layer reduces the tensor dimension to 32 due to the reduction ratio, which we previously set to 16 (#(4)). The length of this tensor is then expanded back to 512 by the second linear layer (#(5)). Next, we unsqueeze the tensor so that we get our 1×1 spatial dimension back (#(6)), allowing us to multiply it with the input tensor (#(7)). Based on this detailed flow, you can see that an SE module preserves the original tensor dimensions, which shows that this module can be attached to any CNN-based model without disrupting the original flow of the network.

ResNeXt

Now that we understand how to implement the SE module from scratch, I'm going to show you how to attach it to a ResNeXt model. Before doing so, we need to initialize the parameters required to implement the ResNeXt architecture. In Codeblock 4 below, the first four variables are determined according to the ResNeXt-50 (32×4d) variant, while the last one (R) represents the reduction ratio for the SE modules.

# Codeblock 4
CARDINALITY  = 32
NUM_CHANNELS = [3, 64, 256, 512, 1024, 2048]
NUM_BLOCKS   = [3, 4, 6, 3]
NUM_CLASSES  = 1000
R = 16

The Block class defined in Codeblocks 5a and 5b is the ResNeXt block from my previous article. There are actually quite a few things we do inside the __init__() method, but the general idea is that we initialize three convolution layers called conv0 (#(1)), conv1 (#(2)), and conv2 (#(3)) before initializing the SE module at line #(4). We will later configure these layers according to the SE-ResNeXt architecture shown back in Figure 10.

# Codeblock 5a
class Block(nn.Module):
    def __init__(self, 
                 in_channels,
                 add_channel=False,
                 channel_multiplier=2,
                 downsample=False):
        super().__init__()

        self.add_channel = add_channel
        self.channel_multiplier = channel_multiplier
        self.downsample = downsample
        
        
        if self.add_channel:
            out_channels = in_channels*self.channel_multiplier
        else:
            out_channels = in_channels
        
        mid_channels = out_channels//2
        
        
        if self.downsample:
            stride = 2
        else:
            stride = 1
            

        if self.add_channel or self.downsample:
            self.projection = nn.Conv2d(in_channels=in_channels,
                                        out_channels=out_channels, 
                                        kernel_size=1, 
                                        stride=stride, 
                                        padding=0, 
                                        bias=False)
            nn.init.kaiming_normal_(self.projection.weight, nonlinearity='relu')
            self.bn_proj = nn.BatchNorm2d(num_features=out_channels)

        self.conv0 = nn.Conv2d(in_channels=in_channels,       #(1)
                               out_channels=mid_channels,
                               kernel_size=1, 
                               stride=1, 
                               padding=0, 
                               bias=False)
        nn.init.kaiming_normal_(self.conv0.weight, nonlinearity='relu')
        self.bn0 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv1 = nn.Conv2d(in_channels=mid_channels,      #(2)
                               out_channels=mid_channels, 
                               kernel_size=3, 
                               stride=stride,
                               padding=1, 
                               bias=False, 
                               groups=CARDINALITY)
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        self.bn1 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv2 = nn.Conv2d(in_channels=mid_channels,      #(3)
                               out_channels=out_channels,
                               kernel_size=1, 
                               stride=1, 
                               padding=0, 
                               bias=False)
        nn.init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        self.bn2 = nn.BatchNorm2d(num_features=out_channels)
        
        self.relu = nn.ReLU()
        
        self.semodule = SEModule(num_channels=out_channels, r=R)    #(4)
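To see how the constructor arguments translate into layer sizes, here is a small standalone sketch that mirrors the size arithmetic at the top of Block.__init__(). The function block_config is a hypothetical helper of mine and is not part of the model; it only reproduces the if/else logic above in plain Python.

```python
def block_config(in_channels, add_channel=False, channel_multiplier=2,
                 downsample=False):
    """Mirror the size arithmetic at the top of Block.__init__()."""
    out_channels = in_channels * channel_multiplier if add_channel else in_channels
    mid_channels = out_channels // 2
    stride = 2 if downsample else 1
    needs_projection = add_channel or downsample
    return out_channels, mid_channels, stride, needs_projection

# First block of the conv2 stage: 64 -> 256 channels, no downsampling.
print(block_config(64, add_channel=True, channel_multiplier=4))
# First block of the conv3 stage: 256 -> 512 channels, spatial size halved.
print(block_config(256, add_channel=True, downsample=True))
# Remaining conv3 blocks: channels and spatial size are preserved.
print(block_config(512))
```

The first element of each tuple is also the num_channels value that the SE module inside the block receives.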

The forward() method itself is mostly the same as in the original ResNeXt model, except that here we need to put the SE module right before the element-wise summation, as shown at line #(1) in Codeblock 5b below. Remember that this implementation follows the standard SE block architecture in Figure 6 (b).

# Codeblock 5b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        if self.add_channel or self.downsample:
            residual = self.bn_proj(self.projection(x))
            print(f'after projection\t: {residual.size()}')
        else:
            residual = x
            print(f'no projection\t\t: {residual.size()}')

        x = self.conv0(x)
        x = self.bn0(x)
        x = self.relu(x)
        print(f'after conv0-bn0-relu\t: {x.size()}')

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        print(f'after conv1-bn1-relu\t: {x.size()}')

        x = self.conv2(x)
        x = self.bn2(x)
        print(f'after conv2-bn2\t\t: {x.size()}')

        x = self.semodule(x)      #(1)
        print(f'after semodule\t\t: {x.size()}')

        x = x + residual
        x = self.relu(x)
        print(f'after summation\t\t: {x.size()}')

        return x

With the above implementation, every time we instantiate a Block object we will have a ResNeXt block that is already equipped with an SE module. Now we're going to test the above class to see if we have implemented it correctly. Here I'm going to simulate a ResNeXt block within the third stage. The add_channel and downsample parameters are set to False since we want to preserve both the number of channels and the spatial dimension of the input tensor.

# Codeblock 6
block = Block(in_channels=512, add_channel=False, downsample=False)
x = torch.randn(1, 512, 28, 28)

out = block(x)

Below is what the output looks like (the lines printed from inside SEModule itself are left out of the listing). Here you can see that our first convolution layer successfully reduced the number of channels from 512 to 256 (#(1)), which is then expanded back to the original dimension by the third convolution layer (#(2)). Afterwards, the tensor goes through the SE module, whose output size is the same as its input, just like what we saw earlier in Codeblock 3 (#(3)). Once the processing with the SE module is done, we can finally perform the element-wise summation between the tensor from the main branch and the one from the skip connection (#(4)).

# Codeblock 6 Output
original             : torch.Size([1, 512, 28, 28])
no projection        : torch.Size([1, 512, 28, 28])
after conv0-bn0-relu : torch.Size([1, 256, 28, 28])    #(1)
after conv1-bn1-relu : torch.Size([1, 256, 28, 28])
after conv2-bn2      : torch.Size([1, 512, 28, 28])    #(2)
after semodule       : torch.Size([1, 512, 28, 28])    #(3)
after summation      : torch.Size([1, 512, 28, 28])    #(4)

And below is how I implement the entire architecture. What we need to do is just stack multiple SE-ResNeXt blocks according to the architecture in Figure 10. In fact, the SEResNeXt class in Codeblock 7 is exactly the same as the ResNeXt class in my previous article [3] (I literally copy-pasted it), since the only thing that makes SE-ResNeXt different from the original ResNeXt is the presence of the SE module within the Block class we discussed earlier.

# Codeblock 7
class SEResNeXt(nn.Module):
    def __init__(self):
        super().__init__()

        # conv1 stage
        self.resnext_conv1 = nn.Conv2d(in_channels=NUM_CHANNELS[0],
                                       out_channels=NUM_CHANNELS[1],
                                       kernel_size=7,
                                       stride=2,
                                       padding=3, 
                                       bias=False)
        nn.init.kaiming_normal_(self.resnext_conv1.weight, 
                                nonlinearity='relu')
        self.resnext_bn1 = nn.BatchNorm2d(num_features=NUM_CHANNELS[1])
        self.relu = nn.ReLU()
        self.resnext_maxpool1 = nn.MaxPool2d(kernel_size=3,
                                             stride=2, 
                                             padding=1)

        # conv2 stage
        self.resnext_conv2 = nn.ModuleList([
            Block(in_channels=NUM_CHANNELS[1],
                  add_channel=True,
                  channel_multiplier=4,
                  downsample=False)
        ])
        for _ in range(NUM_BLOCKS[0]-1):
            self.resnext_conv2.append(Block(in_channels=NUM_CHANNELS[2]))

        # conv3 stage
        self.resnext_conv3 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[2],
                                                  add_channel=True, 
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[1]-1):
            self.resnext_conv3.append(Block(in_channels=NUM_CHANNELS[3]))
            
            
        # conv4 stage
        self.resnext_conv4 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[3],
                                                  add_channel=True, 
                                                  downsample=True)])
        
        for _ in range(NUM_BLOCKS[2]-1):
            self.resnext_conv4.append(Block(in_channels=NUM_CHANNELS[4]))
            
            
        # conv5 stage
        self.resnext_conv5 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[4],
                                                  add_channel=True, 
                                                  downsample=True)])
        
        for _ in range(NUM_BLOCKS[3]-1):
            self.resnext_conv5.append(Block(in_channels=NUM_CHANNELS[5]))
 
       
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))

        self.fc = nn.Linear(in_features=NUM_CHANNELS[5],
                            out_features=NUM_CLASSES)
        

    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        x = self.relu(self.resnext_bn1(self.resnext_conv1(x)))
        print(f'after resnext_conv1\t: {x.size()}')

        x = self.resnext_maxpool1(x)
        print(f'after resnext_maxpool1\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv2):
            x = block(x)
            print(f'after resnext_conv2 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv3):
            x = block(x)
            print(f'after resnext_conv3 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv4):
            x = block(x)
            print(f'after resnext_conv4 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv5):
            x = block(x)
            print(f'after resnext_conv5 #{i}\t: {x.size()}')

        x = self.avgpool(x)
        print(f'after avgpool\t\t: {x.size()}')

        x = torch.flatten(x, start_dim=1)
        print(f'after flatten\t\t: {x.size()}')

        x = self.fc(x)
        print(f'after fc\t\t: {x.size()}')

        return x

Now that the entire SE-ResNeXt-50 (32×4d) architecture is complete, we're going to test it by passing a tensor of size 1×3×224×224 through the network, simulating a single RGB image of size 224×224. You can see in the output of Codeblock 8 below that the model seems to work properly, as the tensor successfully passed through all layers within the seresnext model without raising any error. Thus, I believe this model is now ready to be trained. By the way, don't forget to change the number of neurons in the output layer according to the number of classes in your dataset if you want to actually train this model.

# Codeblock 8
seresnext = SEResNeXt()
x = torch.randn(1, 3, 224, 224)

out = seresnext(x)

# Codeblock 8 Output
original               : torch.Size([1, 3, 224, 224])
after resnext_conv1    : torch.Size([1, 64, 112, 112])
after resnext_maxpool1 : torch.Size([1, 64, 56, 56])
after resnext_conv2 #0 : torch.Size([1, 256, 56, 56])
after resnext_conv2 #1 : torch.Size([1, 256, 56, 56])
after resnext_conv2 #2 : torch.Size([1, 256, 56, 56])
after resnext_conv3 #0 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #1 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #2 : torch.Size([1, 512, 28, 28])
after resnext_conv3 #3 : torch.Size([1, 512, 28, 28])
after resnext_conv4 #0 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #1 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #2 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #3 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #4 : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #5 : torch.Size([1, 1024, 14, 14])
after resnext_conv5 #0 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #1 : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #2 : torch.Size([1, 2048, 7, 7])
after avgpool          : torch.Size([1, 2048, 1, 1])
after flatten          : torch.Size([1, 2048])
after fc               : torch.Size([1, 1000])
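
The spatial sizes in this output can be cross-checked by hand with the usual output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. The sketch below is only an illustration. The kernel sizes, strides, and paddings are assumed to be the standard ResNet/ResNeXt values (a 7×7 stride-2 stem, 3×3 stride-2 max pooling, and a stride-2 3×3 convolution at the start of conv3, conv4, and conv5), which this section does not state explicitly, and the helper name conv_out is made up for the example.

```python
def conv_out(n, kernel, stride, padding):
    """Side length produced by a conv/pool layer applied to an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

# Assumed hyperparameters: standard ResNet-style values, not read from the model code.
size = 224
trace = {"original": size}

size = conv_out(size, kernel=7, stride=2, padding=3)   # stem convolution
trace["resnext_conv1"] = size

size = conv_out(size, kernel=3, stride=2, padding=1)   # max pooling
trace["resnext_maxpool1"] = size

trace["resnext_conv2"] = size                          # conv2 keeps the resolution

# conv3 to conv5 each halve the resolution once, in their first block.
for stage in ("resnext_conv3", "resnext_conv4", "resnext_conv5"):
    size = conv_out(size, kernel=3, stride=2, padding=1)
    trace[stage] = size

for name, side in trace.items():
    print(f"{name:<18}: {side}x{side}")
```

Under these assumptions the trace gives 224, 112, 56, 56, 28, 14, and 7, which matches the spatial dimensions printed by the model.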

Additionally, we can print the number of parameters this model has using the following code. Here you can see that the codeblock returns 27,543,848. This is roughly 10% higher than the original ResNeXt counterpart, which has only 25,028,904 parameters, as mentioned in my previous article as well as the official PyTorch documentation [4]. Such an increase in model size makes sense, since the ResNeXt blocks throughout the network now have additional layers due to the presence of the SE modules.

# Codeblock 9
def count_parameters(model):
    # Sum the number of elements over all parameter tensors of the model.
    return sum([params.numel() for params in model.parameters()])

count_parameters(seresnext)

# Codeblock 9 Output
27543848
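
The gap between the two counts, 27,543,848 - 25,028,904 = 2,514,944, can be reproduced with simple arithmetic. The sketch below assumes that each SE module consists of two bias-free fully connected layers (C to C/r, then C/r back to C) with reduction ratio r = 16. Neither detail is stated in this section, so treat both as assumptions. The number of blocks per stage and their channel counts are read off the Codeblock 8 output.

```python
REDUCTION = 16  # assumed reduction ratio r

# (output channels, number of blocks) for conv2..conv5, taken from the Codeblock 8 output.
STAGES = [(256, 3), (512, 4), (1024, 6), (2048, 3)]

def se_params(channels, r=REDUCTION):
    """Weights of one SE module: two linear layers, assumed to have no bias."""
    hidden = channels // r
    return channels * hidden + hidden * channels

# One SE module per block, summed over the whole network.
extra = sum(se_params(channels) * blocks for channels, blocks in STAGES)

print(extra)                    # parameters added by all SE modules
print(27_543_848 - 25_028_904)  # difference between the two reported counts
```

Under these assumptions both lines print 2514944, so the extra parameters are fully explained by the SE modules. Well over half of them sit in the three conv5 blocks, where the channel count is largest.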

Ending

And that's pretty much everything about the Squeeze and Excitation module. I encourage you to explore from here by training this model on your own dataset so that you can see whether the findings presented in the paper also apply to your case. Beyond that, I think it would also be interesting to try implementing the SE module on other neural network architectures like VGG or Inception by yourself.

I hope you learned something new today. Thanks for reading!

By the way, you can also find the code used in this article in my GitHub repo [5].

[1] Jie Hu et al. Squeeze-and-Excitation Networks. arXiv. https://arxiv.org/abs/1709.01507 [Accessed March 17, 2025].

[2] Image originally created by author.

[3] Taking ResNet to the Next Level. Towards Data Science. https://towardsdatascience.com/taking-resnet-to-the-next-level/ [Accessed July 22, 2025].

[4] Resnext50_32x4d. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d [Accessed March 17, 2025].

[5] MuhammadArdiPutra. The Channel-Wise Attention - Squeeze and Excitation. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Channel-Wise%20Attention%20-%20Squeeze%20and%20Excitation.ipynb [Accessed April 7, 2025].
