Designing Compact Networks: Taking Convolution Apart

The classic networks (AlexNet, VGG, Inception, ResNet, SENet, …) were mostly built to chase accuracy. But to actually deploy — smaller runtime memory, lower latency, higher throughput — you need architectures designed to be compact. This post walks through several design threads.

MobileNet: depthwise separable convolution

MobileNet is more or less where compact design begins. As the name implies, it targets mobile deployment, and its key contribution is splitting standard convolution into two steps: depthwise convolution and 1×1 (pointwise) convolution.

[Figure: Standard vs. depthwise vs. 1×1 convolution filters.]
[Figure: The overall MobileNet structure.]

Let’s quantify it. Say we turn a $D_F\times D_F\times M$ tensor into $D_F\times D_F\times N$ with kernel size $D_K$. Standard convolution’s FLOPs:

$$\text{FLOPs}_{\text{std}}=D_K \times D_K \times M \times N \times D_F \times D_F$$

After splitting into depthwise + 1×1:

$$\text{FLOPs}_{\text{dw+pw}}=D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F$$

Their ratio is:

$$\frac{\text{FLOPs}_{\text{dw+pw}}}{\text{FLOPs}_{\text{std}}}=\frac{1}{N}+\frac{1}{D_K\times D_K}$$

For the most common 3×3 kernel ($D_K=3$), the ratio approaches $\tfrac{1}{9}$ as the number of output channels $N$ grows, so the split cuts conv FLOPs to roughly one ninth of standard convolution. But note: 9× fewer FLOPs ≠ 9× faster in practice. Inference frameworks usually implement convolution via im2col+GEMM or Winograd, and depthwise convolution does little to reduce memory access, so the measured speedup is much smaller than the FLOPs ratio suggests. The parameter analysis is analogous, with the same ratio $\tfrac{1}{N}+\tfrac{1}{D_K^2}$. The parameter reduction is more “real” than the FLOPs one: it directly shrinks the model on disk, whereas the benefit of fewer FLOPs is bottlenecked by the software implementation.
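
As a quick sanity check on the ratio above, the following sketch counts multiply-adds and weights for both variants. It assumes stride 1, "same" padding, and no bias terms; the function names are illustrative and not taken from any library.

```python
def standard_conv_costs(d_k, m, n, d_f):
    """Return (FLOPs, params) of a standard conv, counted as multiply-adds."""
    params = d_k * d_k * m * n
    flops = params * d_f * d_f        # every weight is used once per output position
    return flops, params


def separable_conv_costs(d_k, m, n, d_f):
    """Return (FLOPs, params) of a depthwise conv followed by a 1x1 conv."""
    depthwise_params = d_k * d_k * m  # one d_k x d_k filter per input channel
    pointwise_params = m * n          # 1x1 conv mixing m channels into n
    params = depthwise_params + pointwise_params
    flops = params * d_f * d_f
    return flops, params


# Example shapes (assumed for illustration): 3x3 kernel, 256 -> 256 channels, 14x14 map.
d_k, m, n, d_f = 3, 256, 256, 14
std_flops, std_params = standard_conv_costs(d_k, m, n, d_f)
sep_flops, sep_params = separable_conv_costs(d_k, m, n, d_f)

print(sep_flops / std_flops)    # measured FLOPs ratio
print(sep_params / std_params)  # measured parameter ratio
print(1 / n + 1 / d_k ** 2)     # closed-form ratio from the text
```

All three printed values should agree, and they sit slightly above 1/9 because of the $\tfrac{1}{N}$ term.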

ShuffleNet: grouped 1×1 convolution + channel shuffle

ShuffleNet arrived a few months after MobileNet and goes further at the 1×1 convolution: besides keeping the depthwise 3×3 conv, it also groups the standard 1×1 conv, then adds channel shuffle so information across groups can fuse in the next layer. Note that channel shuffle isn’t a random permutation: it is a fixed, deterministic interleaving of channels, which in PyTorch is implemented with reshape + permute.

[Figure: Channel shuffle. With grouped conv alone, groups don’t communicate (a); after shuffling, channel info fuses across groups (c).]
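
To make the "fixed interleaving" concrete, here is a minimal pure-Python sketch that computes which source channel ends up at each output position. It mirrors the usual reshape-to-(groups, channels per group), transpose, flatten recipe; the function name is hypothetical, and a real implementation would apply the same index order to a tensor's channel axis.

```python
def channel_shuffle_order(num_channels, groups):
    """Return, for each output position, the index of the source channel."""
    if num_channels % groups != 0:
        raise ValueError("num_channels must be divisible by groups")
    per_group = num_channels // groups
    # View the channel indices as a (groups, per_group) grid, row by row.
    grid = [[g * per_group + i for i in range(per_group)] for g in range(groups)]
    # Transpose to (per_group, groups) and flatten: one channel from each group in turn.
    return [grid[g][i] for i in range(per_group) for g in range(groups)]


print(channel_shuffle_order(6, 3))  # [0, 2, 4, 1, 3, 5]
```

With 6 channels in 3 groups, each consecutive block of 3 output channels now contains one channel from every original group, which is what lets the next grouped conv see all groups.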

ShuffleNet stacks many blocks:

[Figure: ShuffleNet block, which replaces 1×1 convs with grouped convs and adds channel shuffle.]
[Figure: The overall ShuffleNet structure.]

Quantitatively: channel shuffle itself doesn’t change FLOPs or parameters, but it changes the feature map’s memory layout, adding strided memory access for the next layer. A grouped 1×1 conv with $G$ groups has FLOPs $\tfrac{M\times N\times D_F\times D_F}{G}$, so both FLOPs and parameters drop to $\tfrac{1}{G}$ of a standard 1×1 conv. At fixed total FLOPs, more groups means you can use more filters; getting the best accuracy means trading off carefully between the two.
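
The "more groups buys more filters" trade-off can be illustrated with a small calculation. This sketch assumes equal input and output channel counts and counts multiply-adds only; both helper names and the example budget are made up for illustration.

```python
def grouped_pointwise_flops(m, n, d_f, groups):
    """Multiply-adds of a 1x1 conv with `groups` groups on a d_f x d_f map."""
    if m % groups or n % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps m/groups input channels to n/groups output channels.
    return (m // groups) * (n // groups) * groups * d_f * d_f


def max_equal_channels(budget, d_f, groups):
    """Largest channel count c (a multiple of groups) whose c -> c conv fits the budget.

    Returns `groups` itself if even the smallest layer exceeds the budget.
    """
    c = groups
    while grouped_pointwise_flops(c + groups, c + groups, d_f, groups) <= budget:
        c += groups
    return c


# Budget: an ungrouped 240 -> 240 1x1 conv on a 28x28 map (assumed example).
budget = grouped_pointwise_flops(240, 240, 28, 1)
for g in (1, 2, 4):
    print(g, max_equal_channels(budget, 28, g))
```

Under these assumptions, four groups allow twice as many channels as one group at the same FLOPs, since the affordable width grows with $\sqrt{G}$.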

MobileNetV2: Inverted Residual

MobileNetV2 folds ResNet’s residual connection and bottleneck into MobileNet, but with an inverted bottleneck: a 1×1 conv first expands the channels, then a depthwise conv operates, then a second 1×1 conv brings the channels back down to the input size.

[Figure: ResNet’s bottleneck (a) vs. MobileNetV2’s inverted bottleneck (b).]
[Figure: MobileNetV2 structure; t is the expansion ratio, c the output channels, n the repeat count, s the stride.]

It invents no new compact operation; instead it folds capacity-boosting ideas (residual connections, bottlenecks) into a compact network, pushing accuracy further. Notably, MobileNetV2 later became the backbone for many architecture-search methods (both MobileNetV3 and EfficientNet below build on it).
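
For a rough feel of where the cost goes inside an inverted residual block, the sketch below tallies multiply-adds for the expand, depthwise, and project stages. It assumes stride 1, a fixed spatial size, and ignores batch norm, activations, and the residual add; the function name and example shapes are illustrative only.

```python
def inverted_residual_flops(h, w, c_in, c_out, t, k=3):
    """Per-stage multiply-adds of an expand -> depthwise -> project block (stride 1)."""
    hidden = t * c_in                      # channels after the 1x1 expansion
    expand = h * w * c_in * hidden         # 1x1 conv: c_in -> hidden
    depthwise = h * w * hidden * k * k     # k x k depthwise conv on hidden channels
    project = h * w * hidden * c_out       # 1x1 conv: hidden -> c_out
    return {
        "hidden": hidden,
        "expand": expand,
        "depthwise": depthwise,
        "project": project,
        "total": expand + depthwise + project,
    }


costs = inverted_residual_flops(h=14, w=14, c_in=64, c_out=64, t=6)
for name, value in costs.items():
    print(name, value)
```

Even though the depthwise conv runs on the widest tensor, the two 1×1 convs dominate the total in this example, which matches the earlier observation that pointwise convolution is where most of the compute sits.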

ShuffleNetV2: four rules from memory-access cost

ShuffleNetV2 focuses on an often-overlooked metric — memory-access cost (MAC) — and derives four design rules for compact networks.

Rule 1: keep a conv’s input and output channels equal. For a 1×1 conv with input/output channels $c_1, c_2$ over an $h\times w$ map, $\text{FLOPs}=h\,w\,c_1\,c_2$. At fixed FLOPs, the memory-access cost satisfies

$$\text{MAC}=h\,w\,(c_1+c_2)+c_1 c_2 \ \ge\ 2\sqrt{h\,w\cdot\text{FLOPs}}+\frac{\text{FLOPs}}{h\,w}$$

By the AM-GM inequality, MAC has a lower bound, reached when $c_1=c_2$. Experiments confirm it: at the same FLOPs, a 1:1 input/output channel ratio is fastest (the same on GPU and ARM).
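
The bound can be checked numerically. The sketch below holds $c_1 c_2$ (and therefore FLOPs) constant while skewing the channel ratio; MAC is counted in tensor elements, ignoring caches, and the shapes are assumed purely for illustration.

```python
import math


def pointwise_mac(h, w, c1, c2):
    """Elements read and written by a 1x1 conv: input + output maps + weights."""
    return h * w * (c1 + c2) + c1 * c2


def mac_lower_bound(h, w, flops):
    """AM-GM lower bound on MAC for a 1x1 conv with the given FLOPs."""
    return 2 * math.sqrt(h * w * flops) + flops / (h * w)


h, w = 56, 56
# Every pair has c1 * c2 = 128 * 128, so all four layers have identical FLOPs.
pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]
flops = h * w * 128 * 128
bound = mac_lower_bound(h, w, flops)

for c1, c2 in pairs:
    print(f"{c1}:{c2}", pointwise_mac(h, w, c1, c2), "bound", bound)
```

Only the 128:128 layer meets the bound; the more lopsided the ratio, the further MAC climbs above it.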

[Figure: Rule 1. At equal FLOPs, a 1:1 input/output channel ratio is fastest.]

Rule 2: too many groups increase MAC. For a grouped 1×1 conv with $g$ groups, $\text{FLOPs}=\tfrac{h\,w\,c_1\,c_2}{g}$ and

$$\text{MAC}=h\,w\,(c_1+c_2)+\frac{c_1 c_2}{g}=h\,w\,c_1+\frac{\text{FLOPs}\cdot g}{c_1}+\frac{\text{FLOPs}}{h\,w}$$

At fixed FLOPs and a fixed input shape $c_1\times h\times w$, MAC grows with $g$.
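
A short numeric sketch of Rule 2: with the input shape and FLOPs held fixed, the output channel count is whatever the budget allows for each group count. MAC is again counted in tensor elements, and the shapes are assumed for illustration.

```python
def grouped_pointwise_mac(h, w, c1, flops, g):
    """MAC of a grouped 1x1 conv whose output width is set by a fixed FLOPs budget."""
    c2 = flops * g / (h * w * c1)  # output channels the budget allows with g groups
    return h * w * (c1 + c2) + c1 * c2 / g


h, w, c1 = 28, 28, 128
flops = h * w * c1 * 128           # budget of an ungrouped 128 -> 128 layer

for g in (1, 2, 4, 8):
    print(g, grouped_pointwise_mac(h, w, c1, flops, g))
```

The weight term stays constant while the output feature map grows linearly with $g$, so the extra filters that grouping buys are paid for in memory traffic.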

[Figure: Rule 2. At equal FLOPs, more groups means slower.]

Rule 3: too much fragmentation hurts parallelism. Splitting a block into many small serial or parallel branches leaves each operation too small to use the hardware’s parallelism well. But fragmented structures often give higher accuracy, so it’s a trade-off between accuracy and parallel speedup.

[Figure: Fragment structures of different counts, serial and parallel.]
[Figure: Rule 3. At equal FLOPs, fewer fragments are faster; for the same count, serial beats parallel.]

Rule 4: don’t ignore the cost of element-wise operations. An operation’s run time depends on both its FLOPs and its MAC. For convolution, FLOPs dominate; but for low-FLOPs operations like element-wise add and ReLU, MAC is the dominant cost and can’t be ignored.
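
One way to see why element-wise ops are memory-bound is to compare FLOPs per element of memory traffic. This is a back-of-the-envelope sketch: it counts one FLOP per multiply-add or add, counts MAC in tensor elements, and ignores caching and operator fusion. The function names are illustrative.

```python
def pointwise_conv_intensity(h, w, c1, c2):
    """FLOPs per element moved for a 1x1 conv."""
    flops = h * w * c1 * c2
    mac = h * w * (c1 + c2) + c1 * c2  # read input, write output, read weights
    return flops / mac


def elementwise_add_intensity(h, w, c):
    """FLOPs per element moved for adding two feature maps."""
    flops = h * w * c                  # one add per output element
    mac = 3 * h * w * c                # read two inputs, write one output
    return flops / mac


print(pointwise_conv_intensity(28, 28, 128, 128))
print(elementwise_add_intensity(28, 28, 128))
```

Under these assumptions the 1×1 conv does dozens of operations per element moved, while the add does a third of one, so the add’s run time is set almost entirely by memory bandwidth.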

[Figure: Rule 4. Element-wise ops like ReLU and residual connections have non-trivial cost too.]

Following these rules, ShuffleNetV2’s block differs substantially from V1’s. The grouped 1×1 convs are gone (Rule 2); instead, a channel split sends part of the channels through an identity branch, which keeps a shortcut path without the extra MAC of grouped convolution. The block’s Channel Split, Concat, and Channel Shuffle can then be fused into one operation to cut MAC (Rule 4).

[Figure: ShuffleNetV1 (a, b) vs. ShuffleNetV2 (c, d) blocks.]
[Figure: The overall ShuffleNetV2 structure.]

MobileNetV3: h-swish

MobileNetV3 is derived from MobileNetV2 via automated search (MnasNet + NetAdapt compression), mainly tweaking the number of conv layers, kernel sizes, channels, and adding SE modules in some layers. It also swaps ReLU for hard-swish in deeper layers:

$$\text{h-swish}(x)=x\,\frac{\text{ReLU6}(x+3)}{6}$$

The original swish is $\text{swish}(x)=x\cdot\sigma(x)$, but sigmoid is expensive; hard-swish approximates it for nearly the same effect at far lower cost.
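
The two activations are easy to compare directly. Below is a minimal scalar sketch using only the standard library; a real network would use a framework’s vectorized version, and the sample points are arbitrary.

```python
import math


def relu6(x):
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_swish(x):
    """Piecewise approximation of swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


def swish(x):
    """x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, round(hard_swish(x), 4), round(swish(x), 4))
```

hard-swish is exactly zero for $x\le-3$ and exactly $x$ for $x\ge3$, and it needs only a clamp, a multiply, and a constant scale, which is why it is cheap on mobile hardware.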

[Figure: hard-swish approximates swish with ReLU6, avoiding sigmoid’s heavy compute.]
[Figure: MobileNetV3-small (there’s also a large version).]

EfficientNet: compound scaling

EfficientNet arrived around the same time as MobileNetV3. Its baseline network is also found with an MnasNet-style search, while its scaling coefficients come from a simple grid search. Its core is a network scaling method: if a compact network reaches decent accuracy at small FLOPs, then for higher accuracy just scale it up. Scaling spans three dimensions: width, depth, and input resolution.

[Figure: Width / depth / resolution scaling, and their compound combination.]

Each dimension’s scale factor decouples into a relative factor and a global factor $\phi$:

$$\text{depth}: d=\alpha^{\phi},\quad \text{width}: w=\beta^{\phi},\quad \text{resolution}: r=\gamma^{\phi}$$
$$\text{s.t.}\quad \alpha\cdot\beta^2\cdot\gamma^2\approx 2,\quad \alpha,\beta,\gamma\ge 1$$

The constraint $\alpha\beta^2\gamma^2\approx 2$ exists because FLOPs grow quadratically with width and resolution but linearly with depth; fixing $\alpha\beta^2\gamma^2$ to a constant means total FLOPs grow by roughly $2^{\phi}$, which keeps them controllable under any scaling. During search, fix $\phi=1$ and grid-search the best $\alpha,\beta,\gamma$; afterward, dialing $\phi$ scales the network to the desired FLOPs level.
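
The scaling rule is simple enough to compute by hand. This sketch uses $\alpha=1.2$, $\beta=1.1$, $\gamma=1.15$ as defaults, which are the coefficients the EfficientNet paper reports for its B0 baseline; treat them as an assumption here, and note that the function name is made up for illustration.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, FLOPs) multipliers for a given phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPs scale linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in (0, 1, 2, 3):
    d, w, r, f = compound_scale(phi)
    print(phi, round(d, 3), round(w, 3), round(r, 3), round(f, 3))
```

With these coefficients $\alpha\beta^2\gamma^2\approx1.92$, so each unit increase in $\phi$ roughly doubles the FLOPs budget.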

[Figure: Compound scaling beats scaling a single dimension at the same FLOPs.]
[Figure: EfficientNet-B0 (the base network for scaling, searched by MnasNet); B1–B7 scale up from it.]

GhostNet: cheap “ghost” features

GhostNet also takes a decomposition route, but from a distinctive observation: visualizing feature maps shows that many channels are very similar to one another — so there’s no need to compute them all with expensive standard convolution.

[Figure: Feature maps from standard convolution; many are highly similar to each other.]

So GhostNet splits standard convolution into two steps: first compute part of the output with fewer standard-conv filters, then “generate” the rest from it via cheap operations (linear transforms / depthwise convs), and concat the two parts.

[Figure: Standard convolution (a) vs. Ghost module (b): a few standard convs + cheap ops generate the rest.]

Quantitatively: with kernel size $k$, input channels $c$, and output $n\times h\times w$, standard convolution has FLOPs $n\,h\,w\,c\,k^2$. If the Ghost module computes $m$ channels in the first step, uses a depthwise conv of size $d$ for the cheap ops, and sets $s=\tfrac{n}{m}$, the speedup is

$$R_{\text{FLOPs}}=\frac{n\,h\,w\,c\,k^2}{m\,h\,w\,c\,k^2+(n-m)\,h\,w\,d^2}=\frac{c\,k^2}{\tfrac{1}{s}c\,k^2+\tfrac{s-1}{s}d^2}\approx s$$

(the approximation uses $d\approx k$ and $s\ll c$). The parameter compression ratio is also roughly $s$.
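
To see how close the speedup gets to $s$ in practice, the sketch below evaluates the exact ratio for a few values of $s$. It counts multiply-adds only and assumes $n$ is divisible by $s$; the function name and the example shapes are illustrative.

```python
def ghost_speedup(n, c, k, s, d, h=1, w=1):
    """Exact FLOPs ratio of a standard conv to a Ghost module with ratio s = n / m."""
    if n % s != 0:
        raise ValueError("n must be divisible by s")
    m = n // s                                     # channels from the standard conv
    standard = n * h * w * c * k * k
    primary = m * h * w * c * k * k                # first step: fewer standard filters
    cheap = (n - m) * h * w * d * d                # second step: depthwise "ghost" maps
    return standard / (primary + cheap)


# Assumed example: 256 -> 256 channels, 3x3 kernels for both steps.
for s in (2, 4, 8):
    print(s, ghost_speedup(n=256, c=256, k=3, s=s, d=3))
```

When $d=k$ the exact ratio simplifies to $\tfrac{s\,c}{s+c-1}$, which stays just under $s$ as long as $s\ll c$; the spatial size cancels out, so `h` and `w` do not affect the result.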

[Figure: GhostNet block. Like MobileNetV2, it expands then squeezes the feature map in the middle.]
[Figure: The overall GhostNet structure.]

From MobileNet taking convolution apart, to ShuffleNetV2 watching memory access, to GhostNet reusing similar features — the through-line of compact design is endlessly trading off FLOPs, memory access, and accuracy. And MobileNetV3 and EfficientNet already replace “humans designing” with “automatic search” — which is exactly the direction of neural architecture search (NAS).

References

  • Howard, Andrew G., et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861, 2017.
  • Zhang, Xiangyu, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CVPR, 2018.
  • Ma, Ningning, et al. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. ECCV, 2018.
  • Sandler, Mark, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR, 2018.
  • Howard, Andrew, et al. Searching for MobileNetV3. arXiv:1905.02244, 2019.
  • Tan, Mingxing, Le, Quoc V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946, 2019.
  • Han, Kai, et al. GhostNet: More Features from Cheap Operations. CVPR, 2020.
  • Tan, Mingxing, et al. MnasNet: Platform-Aware Neural Architecture Search for Mobile. CVPR, 2019.
  • Jia, Yangqing, et al. Caffe: Convolutional Architecture for Fast Feature Embedding. ACM MM, 2014.
  • Lavin, Andrew, Gray, Scott. Fast Algorithms for Convolutional Neural Networks (Winograd). CVPR, 2016.