Designing Compact Networks: Taking Convolution Apart

The classic networks (AlexNet, VGG, Inception, ResNet, SENet, …) were mostly built to chase accuracy. But to actually deploy — smaller runtime memory, lower latency, higher throughput — you need architectures designed to be compact. This post walks through several design threads.

MobileNet: depthwise separable convolution

MobileNet is more or less where compact design begins. As the name implies, it targets mobile deployment, and its key contribution is splitting standard convolution into two steps: depthwise convolution and $1\times1$ (pointwise) convolution.

Filters of standard, depthwise, and 1×1 convolution — Standard vs. depthwise vs. 1×1 convolution filters.

MobileNet architecture — The overall MobileNet structure.

Let’s quantify it. Say a stride-1 convolution with kernel size $D_K$ turns a $D_F\times D_F\times M$ tensor into $D_F\times D_F\times N$. Counting multiply-adds, standard convolution costs:

$$\text{FLOPs}_{\text{std}}=D_K \times D_K \times M \times N \times D_F \times D_F$$

After splitting into depthwise + $1\times1$:

$$\text{FLOPs}_{\text{dw+pw}}=D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F$$

Their ratio is:

$$\frac{\text{FLOPs}_{\text{dw+pw}}}{\text{FLOPs}_{\text{std}}}=\frac{1}{N}+\frac{1}{D_K\times D_K}$$

For the most common $3\times3$ kernel ($D_K=3$), the ratio approaches $\tfrac{1}{9}$ as the number of output channels $N$ grows, so the split cuts conv FLOPs by close to 9×. But note: 9× fewer FLOPs ≠ 9× faster in practice. Inference frameworks usually implement convolution via im2col+GEMM or Winograd, and depthwise convolution doesn’t really reduce memory access, so the measured speedup is quite limited. The parameter analysis is analogous, with the same ratio $\tfrac{1}{N}+\tfrac{1}{D_K^2}$; the parameter reduction is more “real” than the FLOPs one, since it genuinely saves disk space, whereas the FLOPs reduction is bottlenecked by the software implementation.
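To make the ratio concrete, here is a minimal sketch that evaluates both FLOPs formulas above for one layer and checks them against the closed form $\tfrac{1}{N}+\tfrac{1}{D_K^2}$. The function names and the example layer (3×3 kernel, 256 to 256 channels, 14×14 map) are illustrative assumptions, not values from any particular network.

```python
def standard_conv_flops(d_k, m, n, d_f):
    """Multiply-adds of a standard D_K x D_K convolution (stride 1)."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_flops(d_k, m, n, d_f):
    """Multiply-adds of a depthwise D_K x D_K conv plus a 1x1 pointwise conv."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 -> 256 channels, 14x14 feature map.
d_k, m, n, d_f = 3, 256, 256, 14
std = standard_conv_flops(d_k, m, n, d_f)
sep = depthwise_separable_flops(d_k, m, n, d_f)
ratio = sep / std
closed_form = 1 / n + 1 / (d_k * d_k)

print(f"standard:  {std:,}")
print(f"separable: {sep:,}")
print(f"ratio: {ratio:.4f}  closed form: {closed_form:.4f}")
```

The ratio stays slightly above $\tfrac{1}{9}$ for any finite $N$, which is why the saving is "close to" rather than exactly 9×.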

ShuffleNet: grouped 1×1 convolution + channel shuffle

ShuffleNet arrived a few months after MobileNet and goes one step further, targeting the $1\times1$ convolution: it keeps the depthwise $3\times3$ conv, replaces the standard $1\times1$ conv with a grouped one, and adds channel shuffle so that information from different groups can mix in the next layer. Note that channel shuffle isn’t a random permutation: it is a fixed interleaving of the groups, implemented in PyTorch with reshape + permute.

Channel shuffle in ShuffleNet — Channel shuffle: with grouped conv alone, groups don’t communicate (a); after shuffling, channel info fuses across groups (c).
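The reshape + permute trick can be sketched on channel indices alone, without any tensor library. The helper below is a hypothetical illustration (its name is not from ShuffleNet's code): it lays the channels out as a `groups × channels_per_group` grid, transposes the grid, and flattens it again, which is what a reshape followed by a permute does along the channel axis.

```python
def channel_shuffle_order(num_channels, groups):
    """Return, for each output position, the index of the source channel.

    Mirrors: reshape to (groups, channels_per_group) -> transpose -> flatten.
    """
    if num_channels % groups != 0:
        raise ValueError("num_channels must be divisible by groups")
    per_group = num_channels // groups
    # Row g of the grid holds the channels produced by group g.
    grid = [[g * per_group + i for i in range(per_group)] for g in range(groups)]
    # Reading the grid column by column interleaves the groups.
    return [grid[g][i] for i in range(per_group) for g in range(groups)]


order = channel_shuffle_order(num_channels=6, groups=3)
print(order)  # [0, 2, 4, 1, 3, 5]
```

With 6 channels in 3 groups, each group of the next grouped conv now receives one channel from every previous group, and the mapping is deterministic.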

ShuffleNet stacks many blocks:

ShuffleNet block: replace 1×1 convs with grouped convs and add channel shuffle.

Quantitatively: channel shuffle itself doesn’t change FLOPs or parameters, but it changes the feature map’s memory layout, adding strided memory access for the next layer. A grouped $1\times1$ conv with $G$ groups has FLOPs $\tfrac{M\times N\times D_F\times D_F}{G}$ — both FLOPs and parameters drop to $\tfrac{1}{G}$ of a standard $1\times1$ conv. At fixed total FLOPs, more groups means you can use more filters; getting the best accuracy means trading off carefully between the two.
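A small sketch of that trade-off, under assumed numbers (240 input channels on a 28×28 map, both hypothetical): it computes the cost of a grouped $1\times1$ conv and shows how many output filters fit into the same FLOPs budget as the group count grows.

```python
def grouped_pointwise_cost(m, n, d_f, groups):
    """Return (flops, params) of a 1x1 conv with `groups` groups."""
    if m % groups or n % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps m/groups input channels to n/groups output channels.
    params = (m // groups) * (n // groups) * groups  # equals m * n / groups
    flops = params * d_f * d_f
    return flops, params


m, d_f = 240, 28  # hypothetical input channels and feature-map size
budget, _ = grouped_pointwise_cost(m, 240, d_f, groups=1)

for g in (1, 2, 4, 8):
    n = 240 * g  # filters affordable at the same FLOPs budget
    flops, params = grouped_pointwise_cost(m, n, d_f, g)
    print(f"G={g}: output filters={n:5d}  FLOPs={flops:,}  params={params:,}")
```

Every row prints the same FLOPs, so under this simplified accounting the filter count scales linearly with $G$ at fixed cost.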

MobileNetV2: Inverted Residual

MobileNetV2 folds ResNet’s residual connection and bottleneck into MobileNet, but with an inverted bottleneck: a $1\times1$ conv first expands the channels, then a depthwise conv operates on the expanded tensor, then a second $1\times1$ conv projects the channels back down to a narrow output (the same width as the input whenever the residual connection is used).
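As a rough illustration of the expand, depthwise, project sequence, the sketch below tracks channel widths and multiply-adds through one block. It assumes stride 1 and a 3×3 depthwise kernel; the function name and the example sizes (24 channels, expansion ratio 6, 56×56 map) are illustrative choices, not a reproduction of the official implementation.

```python
def inverted_residual_shapes(c_in, c_out, t, d_f, d_k=3):
    """Channel widths and multiply-adds of one stride-1 inverted residual block."""
    hidden = t * c_in  # width after the expanding 1x1 conv
    flops = {
        "expand_1x1": c_in * hidden * d_f * d_f,
        "depthwise": d_k * d_k * hidden * d_f * d_f,
        "project_1x1": hidden * c_out * d_f * d_f,
    }
    widths = [c_in, hidden, hidden, c_out]
    return widths, flops


widths, flops = inverted_residual_shapes(c_in=24, c_out=24, t=6, d_f=56)
print(" -> ".join(str(c) for c in widths))  # 24 -> 144 -> 144 -> 24
for name, value in flops.items():
    print(f"{name:12s} {value:,}")
```

The depthwise conv runs on the widest tensor, yet it is the cheapest of the three steps here, which is what makes the inverted layout affordable.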

MobileNetV2 structure — MobileNetV2: t is the expansion ratio, c output channels, n the repeat count, s the stride.

It invents no new compact operation; instead it folds capacity-boosting ideas (residual connections, bottlenecks) into a compact network, pushing accuracy further. Notably, MobileNetV2 later became the backbone for many architecture-search methods (both MobileNetV3 and EfficientNet below build on it).

ShuffleNetV2: four rules from memory-access cost

ShuffleNetV2 focuses on an often-overlooked metric — memory-access cost (MAC) — and derives four design rules for compact networks.

Rule 1: keep a conv’s input and output channels equal. For a $1\times1$ conv with input/output channels $c_1,c_2$ over an $h\times w$ map, $\text{FLOPs}=h\,w\,c_1\,c_2$. The memory-access cost counts reading the input, writing the output, and reading the weights:

$$\text{MAC}=h\,w\,(c_1+c_2)+c_1 c_2 \ \ge\ 2\sqrt{h\,w\cdot\text{FLOPs}}+\frac{\text{FLOPs}}{h\,w}$$

The bound follows from the AM-GM inequality, $c_1+c_2\ge2\sqrt{c_1c_2}$, together with $c_1c_2=\text{FLOPs}/(h\,w)$. So at fixed FLOPs, MAC has a lower bound, reached when $c_1=c_2$. Experiments confirm it: at the same FLOPs, a $1:1$ input/output channel ratio is fastest (the same on GPU and ARM).
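The bound is easy to check numerically. The sketch below holds the product $c_1c_2$ (and therefore the FLOPs) fixed while varying the ratio, using hypothetical channel pairs on a 56×56 map; only the MAC formula itself comes from the text.

```python
import math


def pointwise_mac(h, w, c1, c2):
    """Memory accesses of a 1x1 conv: input + output feature maps + weights."""
    return h * w * (c1 + c2) + c1 * c2


def mac_lower_bound(h, w, flops):
    """AM-GM lower bound on MAC at a given FLOPs budget."""
    return 2 * math.sqrt(h * w * flops) + flops / (h * w)


h = w = 56
# Hypothetical channel pairs, all with c1 * c2 = 16384, so FLOPs are identical.
pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]
flops = h * w * 128 * 128
bound = mac_lower_bound(h, w, flops)

for c1, c2 in pairs:
    mac = pointwise_mac(h, w, c1, c2)
    print(f"c1:c2 = {c1}:{c2:<5d} MAC = {mac:,}  (bound = {bound:,.0f})")
```

Only the 1:1 pair meets the bound; the more lopsided the ratio, the more memory traffic the same FLOPs cost.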

Speed for different input/output channel ratios — Rule 1: at equal FLOPs, a 1:1 input/output channel ratio is fastest.

Rule 2: too many groups increases MAC. For a grouped $1\times1$ conv with $g$ groups, $\text{FLOPs}=h\,w\,c_1c_2/g$, so:

$$\text{MAC}=h\,w\,(c_1+c_2)+\frac{c_1 c_2}{g}=h\,w\,c_1+\frac{\text{FLOPs}\cdot g}{c_1}+\frac{\text{FLOPs}}{h\,w}$$

With the input shape $c_1\times h\times w$ and the FLOPs both fixed, MAC grows with $g$.
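A quick numeric check of Rule 2, with assumed sizes (128 input channels on a 56×56 map): to keep FLOPs constant as $g$ grows, the output channels $c_2$ are scaled by $g$, and the MAC formula is evaluated for each setting.

```python
def grouped_pointwise_flops(h, w, c1, c2, g):
    """Multiply-adds of a 1x1 conv with g groups."""
    return h * w * c1 * c2 // g


def grouped_pointwise_mac(h, w, c1, c2, g):
    """Memory accesses: input + output feature maps + grouped weights."""
    return h * w * (c1 + c2) + c1 * c2 // g


h = w = 56
c1 = 128  # fixed input shape (hypothetical)
results = []
for g in (1, 2, 4, 8):
    c2 = 128 * g  # widen the output so FLOPs stay constant
    flops = grouped_pointwise_flops(h, w, c1, c2, g)
    mac = grouped_pointwise_mac(h, w, c1, c2, g)
    results.append((g, flops, mac))
    print(f"g={g}: c2={c2:5d}  FLOPs={flops:,}  MAC={mac:,}")
```

FLOPs are identical in every row, while MAC climbs because the output feature map keeps growing.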

Speed for different numbers of groups — Rule 2: at equal FLOPs, more groups means slower.

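Rule 3: too much fragmentation hurts parallelism. Splitting a block into many small operators, whether chained in series or laid out as parallel branches, leaves each one too small to use the hardware’s parallelism well. But fragmented structures (more layers, more branches) often give higher accuracy, so it’s a trade-off between accuracy and parallel speedup.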

Serial vs parallel structures — Fragment structures of different counts, serial and parallel.

Speed for different fragment counts and connections — Rule 3: at equal FLOPs, fewer fragments are faster; for the same count, serial beats parallel.

Rule 4: don’t ignore the cost of element-wise operations. An operation’s time has two parts, MAC and FLOPs. For convolution, FLOPs far exceeds MAC; but for low-FLOPs operations like element-wise add and ReLU, MAC is the dominant cost and can’t be ignored.
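One way to see the gap is to compare FLOPs per memory access for the two kinds of operation. The sketch below is a back-of-the-envelope estimate: the conv MAC follows the Rule 1 formula, while the accounting for the element-wise add (read two operands, write one result) is an assumption made here for illustration.

```python
def pointwise_conv_intensity(h, w, c1, c2):
    """FLOPs per memory access for a 1x1 convolution."""
    flops = h * w * c1 * c2
    mac = h * w * (c1 + c2) + c1 * c2
    return flops / mac


def elementwise_add_intensity(h, w, c):
    """FLOPs per memory access for adding two c x h x w tensors."""
    flops = h * w * c
    mac = 3 * h * w * c  # assumption: read two operands, write one result
    return flops / mac


h = w = 56
c = 128  # hypothetical channel count
conv = pointwise_conv_intensity(h, w, c, c)
add = elementwise_add_intensity(h, w, c)
print(f"1x1 conv: {conv:.2f} FLOPs per access")
print(f"add:      {add:.2f} FLOPs per access")
```

Under these assumptions the conv does tens of operations per value moved, while the add does less than one, so its runtime is set almost entirely by memory traffic.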

Speed with ReLU or residual connection removed — Rule 4: element-wise ops like ReLU and residual connections have non-trivial cost too.

Following these, ShuffleNetV2’s block is almost entirely different from V1’s. The grouped $1\times1$ convs are gone (Rule 2); instead, a channel split sends half of the channels through an identity branch, which keeps a shortcut path, and the other half through the conv branch, and the two halves are concatenated instead of added. The block’s Channel Split, Concat, and Channel Shuffle can also fuse into one operation to cut MAC (Rule 4).

ShuffleNetV2 block — ShuffleNetV1 (a, b) vs. ShuffleNetV2 (c, d) blocks.

MobileNetV3: h-swish

MobileNetV3 is derived from MobileNetV2 via automated search (MnasNet + NetAdapt compression), mainly tweaking the number of conv layers, kernel sizes, channels, and adding SE modules in some layers. It also swaps ReLU for hard-swish in deeper layers:

$$\text{h-swish}(x)=x\,\frac{\text{ReLU6}(x+3)}{6}$$

The original swish is $\text{swish}(x)=x\cdot\sigma(x)$, but sigmoid is expensive; hard-swish approximates it for nearly the same effect at far lower cost.
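A scalar sketch of both activations makes the approximation tangible. It uses only the standard library; the sample points are arbitrary.

```python
import math


def relu6(x):
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6: no exponential needed."""
    return x * relu6(x + 3.0) / 6.0


def swish(x):
    """swish(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for x in (-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0):
    print(f"x={x:5.1f}  swish={swish(x):7.4f}  h-swish={hard_swish(x):7.4f}")
```

For $x\ge3$ hard-swish is exactly $x$ and for $x\le-3$ it is exactly 0; in between it is a quadratic that tracks swish closely.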

The hard-swish activation — hard-swish: approximate swish with ReLU6, avoiding sigmoid’s heavy compute.

MobileNetV3-small (there’s also a large version).

EfficientNet: compound scaling

EfficientNet arrived around the same time as MobileNetV3. Its base network also comes from an MnasNet-style search, while its scaling coefficients are found with a simple grid search. Its core is a network scaling method: if a compact network reaches decent accuracy at small FLOPs, then for higher accuracy just scale it up. Scaling spans three dimensions: width, depth, and input resolution.

Network scaling — Width / depth / resolution scaling, and their compound combination.

Each dimension’s scale factor decouples into a per-dimension base ($\alpha$, $\beta$, $\gamma$) and a shared global exponent $\phi$:

$$\text{depth}:d=\alpha^{\phi},\quad \text{width}:w=\beta^{\phi},\quad \text{resolution}:r=\gamma^{\phi}$$

$$\text{s.t.}\quad \alpha\cdot\beta^2\cdot\gamma^2\approx2,\quad \alpha,\beta,\gamma\ge1$$

The constraint $\alpha\beta^2\gamma^2\approx2$ exists because FLOPs grow quadratically with width and resolution but linearly with depth, so scaling by $\phi$ multiplies FLOPs by $(\alpha\beta^2\gamma^2)^{\phi}\approx2^{\phi}$; fixing $\alpha\beta^2\gamma^2$ to a constant keeps FLOPs predictable under any scaling. During search, fix $\phi=1$ and grid-search the best $\alpha,\beta,\gamma$; afterward, dialing $\phi$ scales the network to the desired FLOPs level.
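The scaling rule fits in a few lines. The sketch below plugs in $\alpha=1.2$, $\beta=1.1$, $\gamma=1.15$, the coefficients reported in the EfficientNet paper (they are not stated in this post, so treat them as an outside assumption), and prints the resulting multipliers for a few values of $\phi$.

```python
# Coefficients as reported in the EfficientNet paper (assumed here, not derived).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return (depth, width, resolution, FLOPs) multipliers for a given phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPs scale linearly with depth, quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in (0, 1, 2, 3):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}  width x{w:.2f}  "
          f"resolution x{r:.2f}  FLOPs x{f:.2f}")
```

With these coefficients each step of $\phi$ multiplies FLOPs by about 1.92, close to the target of 2.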

Compound vs single-dimension scaling — Compound scaling beats scaling a single dimension at the same FLOPs.

EfficientNet-B0 architecture — EfficientNet-B0 (the base network for scaling, searched by MnasNet); B1–B7 scale up from it.

GhostNet: cheap “ghost” features

GhostNet also takes a decomposition route, but from a distinctive observation: visualizing feature maps shows that many channels are very similar to one another — so there’s no need to compute them all with expensive standard convolution.

Many standard-conv feature maps are similar — Feature maps from standard convolution — many are highly similar to each other.

So GhostNet splits standard convolution into two steps: first compute part of the output with fewer standard-conv filters, then “generate” the rest from it via cheap operations (linear transforms / depthwise convs), and concat the two parts.

Standard convolution vs the Ghost module — Standard convolution (a) vs. Ghost module (b): a few standard convs + cheap ops generate the rest.

Quantitatively: with kernel $k$, input channels $c$, and output $n\times h\times w$, standard convolution has FLOPs $n\,h\,w\,c\,k^2$. If the Ghost module computes $m$ channels in the first step, uses a depthwise conv of size $d$ for the cheap ops, and sets $s=\tfrac{n}{m}$, the theoretical speedup is

$$R_{\text{FLOPs}}=\frac{n\,h\,w\,c\,k^2}{m\,h\,w\,c\,k^2+(n-m)\,h\,w\,d^2}=\frac{c\,k^2}{\tfrac{1}{s}c\,k^2+\tfrac{s-1}{s}d^2}\approx s$$

(the approximation uses $d\approx k$ and $s\ll c$, under which the ratio reduces to $\tfrac{s\,c}{s+c-1}\approx s$). The parameter compression ratio is also roughly $s$.
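The speedup formula can be evaluated directly. The sketch below uses hypothetical layer sizes (256 input and output channels, 3×3 kernels for both the primary conv and the cheap depthwise op) and compares the exact ratio with the simplified form $\tfrac{s\,c}{s+c-1}$ that holds when $d=k$.

```python
def ghost_speedup(n, c, k, d, s, h=1, w=1):
    """Theoretical FLOPs ratio of a standard conv over a Ghost module."""
    if n % s != 0:
        raise ValueError("n must be divisible by s")
    m = n // s  # channels produced by the primary standard conv
    standard = n * h * w * c * k * k
    ghost = m * h * w * c * k * k + (n - m) * h * w * d * d
    return standard / ghost


n, c, k, d = 256, 256, 3, 3  # hypothetical layer
for s in (2, 4, 8):
    ratio = ghost_speedup(n, c, k, d, s, h=14, w=14)
    simplified = s * c / (s + c - 1)
    print(f"s={s}: speedup={ratio:.4f}  s*c/(s+c-1)={simplified:.4f}")
```

The ratio lands just under $s$ and drifts further below it as $s$ grows relative to $c$, which is the $s\ll c$ condition in practice.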

GhostNet block: like MobileNetV2’s inverted residual, the first Ghost module expands the channels and the second squeezes them back down.

From MobileNet taking convolution apart, to ShuffleNetV2 watching memory access, to GhostNet reusing similar features — the through-line of compact design is endlessly trading off FLOPs, memory access, and accuracy. And MobileNetV3 and EfficientNet already replace “humans designing” with “automatic search” — which is exactly the direction of neural architecture search (NAS).

References

Howard, Andrew G., et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861, 2017.
Zhang, Xiangyu, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CVPR, 2018.
Ma, Ningning, et al. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. ECCV, 2018.
Sandler, Mark, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR, 2018.
Howard, Andrew, et al. Searching for MobileNetV3. arXiv:1905.02244, 2019.
Tan, Mingxing, Le, Quoc V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946, 2019.
Han, Kai, et al. GhostNet: More Features from Cheap Operations. CVPR, 2020.
Tan, Mingxing, et al. MnasNet: Platform-Aware Neural Architecture Search for Mobile. CVPR, 2019.
Jia, Yangqing, et al. Caffe: Convolutional Architecture for Fast Feature Embedding. ACM MM, 2014.
Lavin, Andrew, Gray, Scott. Fast Algorithms for Convolutional Neural Networks (Winograd). CVPR, 2016.

2020 · 12 · 06