Using Depthwise Separable Convolutions in TensorFlow

Last Updated on August 4, 2022

Looking at all of the very large convolutional neural networks such as the ResNets, VGGs, and the like, it begs the question of how we can make these networks smaller, with fewer parameters, while still maintaining the same level of accuracy or even improving the model's generalization. One approach is the depthwise separable convolution, also known as a separable convolution in TensorFlow and PyTorch (not to be confused with the spatially separable convolution, which is also sometimes called a separable convolution). Depthwise separable convolutions were introduced by Sifre in “Rigid-motion scattering for image classification” and have been adopted by popular model architectures such as MobileNet and, in a similar form, Xception. They split the channel and spatial convolutions that are usually combined together in normal convolutional layers.

In this tutorial, we'll be looking at what depthwise separable convolutions are and how we can use them to speed up our convolutional neural network image models.

After completing this tutorial, you will learn:

  • What depthwise, pointwise, and depthwise separable convolutions are
  • How to implement depthwise separable convolutions in TensorFlow
  • How to use them as part of our computer vision models

Let's get started!

Using Depthwise Separable Convolutions in TensorFlow
Photo by Arisa Chattasa. Some rights reserved.


This tutorial is split into three parts:

  • What a depthwise separable convolution is
  • Why depthwise separable convolutions are useful
  • Using depthwise separable convolutions in a computer vision model

What Is a Depthwise Separable Convolution?

Before diving into depthwise and depthwise separable convolutions, it might be helpful to have a quick recap on convolutions. A convolution in image processing applies a kernel over a volume, taking a weighted sum of the pixels with the kernel values as the weights. Visually, it works as follows:

Applying a 3×3 kernel on a 10×10×3 input volume outputs an 8×8×1 volume
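As a quick sanity check on these shapes, here is a minimal sketch using Keras' `Conv2D` (the random tensor stands in for an image):

```python
import tensorflow as tf

# A batch with one 10x10 RGB "image", as in the figure above
x = tf.random.normal((1, 10, 10, 3))

# A single 3x3 filter with no padding; the kernel spans all 3 input channels
conv = tf.keras.layers.Conv2D(filters=1, kernel_size=3)
y = conv(x)
print(y.shape)  # (1, 8, 8, 1)
```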

Now, let's introduce the depthwise convolution. A depthwise convolution is essentially a convolution applied to a single channel of the input at a time. Visually, this is what a single depthwise convolutional filter looks like and does:

Applying a depthwise 3×3 kernel on the green channel in this example

The key difference between a normal convolutional layer and a depthwise convolution is that each depthwise kernel operates on only one channel, whereas a normal convolution's kernel spans all input channels at every step.

If we look at what an entire depthwise layer does on all the RGB channels,

Applying a depthwise convolutional filter on a 10×10×3 input volume outputs an 8×8×3 volume

Notice that since we apply one convolutional filter to each input channel, the number of output channels is equal to the number of input channels. After applying this depthwise convolutional layer, we then apply a pointwise convolutional layer.
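In Keras this is the `DepthwiseConv2D` layer; a minimal sketch confirming that the output has as many channels as the input:

```python
import tensorflow as tf

x = tf.random.normal((1, 10, 10, 3))

# One 3x3 kernel per input channel, so 3 input channels give 3 output channels
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3)
y = depthwise(x)
print(y.shape)  # (1, 8, 8, 3)
```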

Simply put, a pointwise convolutional layer is a regular convolutional layer with a 1×1 kernel (hence a single point across all the channels). Visually, it looks like this:

Applying a pointwise convolution on a 10×10×3 input volume outputs a 10×10×1 volume
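Since a pointwise convolution is just a 1×1 convolution, we can sketch it with an ordinary `Conv2D` layer:

```python
import tensorflow as tf

x = tf.random.normal((1, 10, 10, 3))

# A 1x1 kernel mixes the channels at each pixel without
# touching the spatial dimensions
pointwise = tf.keras.layers.Conv2D(filters=1, kernel_size=1)
y = pointwise(x)
print(y.shape)  # (1, 10, 10, 1)
```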

Why Are Depthwise Separable Convolutions Useful?

Now, you might be wondering: what's the use of doing two operations in a depthwise separable convolution? Given that the goal of this article is to speed up computer vision models, how does doing two operations instead of one help?

To answer that question, let's look at the number of parameters in the model (there is some extra overhead from doing two convolutions instead of one, though). Let's say we want to apply 64 convolutional filters to our RGB image so as to have 64 channels in our output. The number of parameters in the normal convolutional layer (including the bias term) is $3 \times 3 \times 3 \times 64 + 64 = 1792$. On the other hand, a depthwise separable convolutional layer would have only $(3 \times 3 \times 1 \times 3 + 3) + (1 \times 1 \times 3 \times 64 + 64) = 30 + 256 = 286$ parameters, which is a significant reduction: the depthwise separable convolution has fewer than one-sixth of the parameters of the normal convolution.
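We can verify this arithmetic directly in Keras by counting layer parameters (a depthwise layer followed by a 1×1 convolution, each with its own bias, matching the calculation above):

```python
import tensorflow as tf

x = tf.zeros((1, 32, 32, 3))  # any 3-channel input will do

# Normal convolution: 3*3*3*64 weights + 64 biases = 1792 parameters
normal = tf.keras.layers.Conv2D(64, 3)
normal(x)
print(normal.count_params())  # 1792

# Depthwise (3*3*1*3 + 3 = 30) followed by pointwise (1*1*3*64 + 64 = 256)
depthwise = tf.keras.layers.DepthwiseConv2D(3)
pointwise = tf.keras.layers.Conv2D(64, 1)
pointwise(depthwise(x))
print(depthwise.count_params() + pointwise.count_params())  # 286
```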

This cuts down the number of computations and parameters, which reduces training/inference time and can help to regularize our model, respectively.

Let's see this in action. For our inputs, let's use the CIFAR-10 image dataset of 32×32×3 images,
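which can be loaded through `tf.keras.datasets` (scaling pixels to [0, 1] is our choice here, not mandated by the original post):

```python
import tensorflow as tf

# CIFAR-10: 50,000 training and 10,000 test images, each 32x32x3
(trainX, trainY), (testX, testY) = tf.keras.datasets.cifar10.load_data()

# Scale pixel values from [0, 255] to [0, 1]
trainX = trainX.astype("float32") / 255.0
testX = testX.astype("float32") / 255.0
print(trainX.shape)  # (50000, 32, 32, 3)
```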

Then, we implement a depthwise separable convolution layer ourselves. TensorFlow has its own implementation, but we'll get to that in the final example.
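One minimal way to write such a layer ourselves is to chain Keras' `DepthwiseConv2D` with a 1×1 `Conv2D` (the class name here is our own; this is a sketch, not the original post's exact code):

```python
import tensorflow as tf

class DepthwiseSeparableConv2D(tf.keras.layers.Layer):
    """A depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size)
        self.pointwise = tf.keras.layers.Conv2D(filters, kernel_size=1)

    def call(self, inputs):
        return self.pointwise(self.depthwise(inputs))

# Sanity check on a CIFAR-10-sized input
layer = DepthwiseSeparableConv2D(64, 3)
y = layer(tf.zeros((1, 32, 32, 3)))
print(y.shape)               # (1, 30, 30, 64)
print(layer.count_params())  # 286, matching the earlier calculation
```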

Constructing a model that uses this depthwise separable convolutional layer and looking at the number of parameters, we can compare it with a similar model that uses a regular 2D convolutional layer instead. The parameter counts corroborate our earlier calculations and show the reduction in the number of parameters that can be achieved by using depthwise separable convolutions.
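As a rough sketch of such a comparison (the layer sizes here are our own assumptions, not the original post's exact architecture):

```python
import tensorflow as tf

def build(conv_layer):
    # Two convolutional layers followed by a small classification head
    return tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        conv_layer(32, 3, activation="relu"),
        conv_layer(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])

separable = build(tf.keras.layers.SeparableConv2D)
regular = build(tf.keras.layers.Conv2D)

print(separable.count_params())  # 3205
print(regular.count_params())    # 20042
```

The exact numbers depend on the architecture, but the separable version is consistently far smaller.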

More specifically, let's look at the number and size of kernels in a normal convolutional layer versus a depthwise separable one. For a regular 2D convolutional layer with $c$ input channels, a $w \times h$ kernel spatial resolution, and $n$ output channels, we need $(n, w, h, c)$ parameters: $n$ filters, each with a kernel of size $(w, h, c)$. However, this is different for a comparable depthwise separable convolution, even with the same number of input channels, kernel spatial resolution, and output channels. First, there is the depthwise convolution, which involves $c$ filters, each with a kernel of size $(w, h, 1)$; it outputs $c$ channels since one filter acts on each input channel. This depthwise convolutional layer has $(c, w, h, 1)$ parameters (plus some bias units). Then comes the pointwise convolution, which takes in the $c$ channels from the depthwise layer and outputs $n$ channels, so we have $n$ filters, each with a kernel of size $(1, 1, c)$. This pointwise convolutional layer has $(n, 1, 1, c)$ parameters (plus some bias units).

You might be thinking right now: but why do they work?

One way of thinking about it, from the Xception paper by Chollet, is that depthwise separable convolutions rest on the assumption that we can separately map cross-channel and spatial correlations. Under this assumption, there will be a bunch of redundant weights in a normal convolutional layer, which we can remove by separating the convolution into a depthwise and a pointwise component. For those familiar with linear algebra, one way of thinking about it is how we can decompose a matrix into the outer product of two vectors when the column vectors of the matrix are multiples of each other, i.e. when the matrix has rank one.
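The rank-one analogy can be made concrete with a few lines of NumPy: a matrix whose columns are multiples of each other can be stored as two vectors instead of a full grid of numbers.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])

# The outer product builds a 3x2 matrix whose columns are multiples
# of each other, i.e. a rank-one matrix
M = np.outer(u, v)
print(M)

# Storing u and v takes 3 + 2 = 5 numbers instead of 3 * 2 = 6;
# the saving grows quickly as the matrix gets larger
print(np.linalg.matrix_rank(M))  # 1
```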

Using Depthwise Separable Convolutions in Computer Vision Models

Now that we've seen the reduction in parameters that we can achieve by using a depthwise separable convolution in place of a normal convolutional filter, let's see how we can use it in practice with TensorFlow's SeparableConv2D layer.

For this example, we will use the CIFAR-10 image dataset from the example above, while for the model we will use one built from VGG blocks. The potential of depthwise separable convolutions lies in deeper models, where the regularization effect is more beneficial to the model and the reduction in parameters is more apparent, as opposed to a lighter-weight model such as LeNet-5.

Creating our model from VGG blocks using normal convolutional layers,
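a sketch of what such a model might look like (the block widths and depths are our own assumptions, not necessarily the original post's):

```python
import tensorflow as tf

def vgg_block(x, filters, convs):
    # A VGG block: repeated 3x3 convolutions followed by max pooling
    for _ in range(convs):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same",
                                   activation="relu")(x)
    return tf.keras.layers.MaxPooling2D()(x)

def make_normal_model():
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = vgg_block(inputs, 32, 2)
    x = vgg_block(x, 64, 2)
    x = vgg_block(x, 128, 2)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(10)(x)
    return tf.keras.Model(inputs, outputs)

model = make_normal_model()
model.summary()
```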

Then we look at the results of this six-layer convolutional neural network with normal convolutional layers,

Let's try out the same architecture, but replace the normal convolutional layers with Keras' SeparableConv2D layers instead:
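The same sketch with `SeparableConv2D` swapped in (again, the exact sizes are our assumptions):

```python
import tensorflow as tf

def separable_vgg_block(x, filters, convs):
    # The same VGG block, but with depthwise separable convolutions
    for _ in range(convs):
        x = tf.keras.layers.SeparableConv2D(filters, 3, padding="same",
                                            activation="relu")(x)
    return tf.keras.layers.MaxPooling2D()(x)

def make_separable_model():
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = separable_vgg_block(inputs, 32, 2)
    x = separable_vgg_block(x, 64, 2)
    x = separable_vgg_block(x, 128, 2)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(10)(x)
    return tf.keras.Model(inputs, outputs)

model = make_separable_model()
model.summary()
```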

Notice that there are significantly fewer parameters in the depthwise separable convolution version (~200k vs. ~1.2M parameters), along with a slightly lower training time per epoch. Depthwise separable convolutions are more likely to work well on deeper models that may face an overfitting problem, and on layers with larger kernels, since the decrease in parameters and computations there is greater and can offset the extra computational cost of doing two convolutions instead of one. Next, we plot the training and validation accuracy of the two models, to see differences in their training performance:
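A small helper for such plots, assuming `history` is the object returned by `model.fit(...)` with `validation_data` supplied (the helper name is our own):

```python
import matplotlib.pyplot as plt

def plot_history(history, title):
    # `history.history` maps metric names to per-epoch values
    plt.plot(history.history["accuracy"], label="train accuracy")
    plt.plot(history.history["val_accuracy"], label="validation accuracy")
    plt.title(title)
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```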


Training and validation accuracy of the network with normal convolutional layers

Training and validation accuracy of the network with depthwise separable convolutional layers

The best validation accuracy is similar for both models, but the depthwise separable convolution appears to overfit the training set less, which might help it generalize better to new data.

Combining all of the code for the depthwise separable convolution version of the model,
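a hedged sketch of the full script follows (the layer sizes, optimizer, epoch count, and the `RUN_TRAINING` guard are our choices, not necessarily the original post's):

```python
import tensorflow as tf
import matplotlib.pyplot as plt

def separable_vgg_block(x, filters, convs):
    # A VGG-style block built from depthwise separable convolutions
    for _ in range(convs):
        x = tf.keras.layers.SeparableConv2D(filters, 3, padding="same",
                                            activation="relu")(x)
    return tf.keras.layers.MaxPooling2D()(x)

def make_model():
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = separable_vgg_block(inputs, 32, 2)
    x = separable_vgg_block(x, 64, 2)
    x = separable_vgg_block(x, 128, 2)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(10)(x)
    return tf.keras.Model(inputs, outputs)

def train_and_plot():
    # Load and scale CIFAR-10
    (trainX, trainY), (testX, testY) = tf.keras.datasets.cifar10.load_data()
    trainX = trainX.astype("float32") / 255.0
    testX = testX.astype("float32") / 255.0

    model = make_model()
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    history = model.fit(trainX, trainY, epochs=10, batch_size=64,
                        validation_data=(testX, testY))

    # Plot the training curves
    plt.plot(history.history["accuracy"], label="train accuracy")
    plt.plot(history.history["val_accuracy"], label="validation accuracy")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()

RUN_TRAINING = False  # flip to True to download CIFAR-10 and train
if RUN_TRAINING:
    train_and_plot()
```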

Further Reading

This section provides more resources on the topic if you are looking to go deeper.




Summary

In this post, you've seen what depthwise, pointwise, and depthwise separable convolutions are. You've also seen how using depthwise separable convolutions lets us get competitive results while using a significantly smaller number of parameters.

Specifically, you learned:

  • What depthwise, pointwise, and depthwise separable convolutions are
  • How to implement depthwise separable convolutions in TensorFlow
  • How to use them as part of our computer vision models

