ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Abstract

This paper introduces a highly efficient network, ShuffleNet, which centers on two operations—pointwise group convolution and channel shuffle—that drastically cut computation while maintaining accuracy. It outperforms prior networks on both ImageNet and COCO.

Introduction

Building deeper and wider neural networks is the trend in tackling visual recognition. This paper, however, proposes another extreme: achieving the best possible performance under limited computational resources. Many existing methods (pruning, compression, etc.) merely operate on a baseline neural network, whereas here we propose an efficient network architecture for a given computational budget.

Current state-of-the-art networks are fairly computation-hungry, mainly because of the overly dense 1x1 convolutions. This paper uses 1x1 group convolutions to reduce the amount of computation, and uses channel shuffle to increase communication among the channels, which lets the network encode more information. Built on these two techniques and design principles, ShuffleNet achieves better performance on both ImageNet and COCO, and also delivers a clear speedup on real hardware platforms.

Efficient Model Designs: GoogLeNet increases network depth but reduces computation relative to a simple stack of convolution layers. SqueezeNet significantly reduces computation while maintaining accuracy. ResNet leverages an effective bottleneck structure to achieve high performance. SENet introduces a structural unit that boosts performance at little computational cost. In addition, recent work uses reinforcement learning and model search to explore efficient model designs; the resulting NASNet achieves solid accuracy, but its performance is rather mediocre at smaller FLOPs.

Group Convolution: Group convolution was first proposed in AlexNet, where it was used for distributed training across multiple GPUs, and it later demonstrated its effectiveness in ResNeXt. The depthwise separable convolution in Xception generalizes the ideas of the Inception family of networks, and MobileNet subsequently used depthwise separable convolution to achieve state-of-the-art results.

Channel Shuffle Operations: The idea of channel shuffle has rarely appeared in prior work, even though cuda-convnet supports a “random sparse convolution” layer, which is equivalent to a random channel shuffle followed by a group convolution layer. Recent work applies this idea to two-stage convolutions but does not consider the effectiveness of channel shuffle itself in small-model design.

Model Acceleration: This aims to speed up inference while preserving model accuracy. Pruning network connections or channels can remove redundant connections while preserving performance. Quantization and factorization can reduce redundancy in computation and thereby accelerate inference. In addition, without modifying the parameters, optimized convolution algorithms such as FFT can reduce the actual time cost in practice. Distillation can also train a small model from a large model, making it easier to train the small model.

Approach

Channel Shuffle for Group Convolutions

Many current state-of-the-art networks, such as Xception and ResNeXt, rely heavily on dense 1x1 convolutions, which account for a large share of the computation in each block. Under a constrained computational budget, this forces the network to have fewer channels, which lowers model accuracy. To address this, one can use sparse connections between channels, with group convolution being one example. However, when several group convolutions are stacked, the outputs of a group depend only on the inputs within that same group, so no information flows between channel groups. Channel shuffle solves this by letting each group of the next group convolution receive channels from different groups. Specifically, the operation reshapes the channel dimension of size g × n into (g, n), transposes it to (n, g), and flattens it back into a single channel dimension, where g is the number of groups and n is the number of channels per group. It remains effective even when two convolutions have different numbers of groups, and the shuffle operation is also differentiable.
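The reshape, transpose, and flatten steps can be illustrated on a plain list of channel labels. This is a minimal sketch rather than the authors' implementation: the function name `channel_shuffle` and the list-based representation are assumptions made for illustration, and a real network would apply the same permutation to the channel axis of a tensor.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so consecutive channels come from different groups."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    n = len(channels) // groups  # channels per group

    # Reshape to (g, n): row i holds the channels produced by group i.
    grid = [channels[i * n:(i + 1) * n] for i in range(groups)]
    # Transpose to (n, g): row j holds the j-th channel of every group.
    transposed = zip(*grid)
    # Flatten back into a single channel dimension of length g * n.
    return [ch for row in transposed for ch in row]


# Hypothetical example: 9 channels in 3 groups, labelled by their index.
shuffled = channel_shuffle(list(range(9)), groups=3)
print(shuffled)  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

In this example the first group of the next convolution would receive channels 0, 3, and 6, one from each of the original groups.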

ShuffleNet Unit

This unit is designed specifically for small networks and uses a bottleneck structure. There are two kinds of units. One has stride 1, with identical input and output dimensions; it adopts a residual structure and ends with an element-wise summation. The other has stride 2; it likewise uses a residual structure but ends with a channel concatenation instead of a summation. This unit is more efficient than those of ResNet and ResNeXt. For example, given an input of size $c \times h \times w$ and a bottleneck channel count of $m$, the FLOPs of a ResNet unit are $hw(2cm+9m^{2})$, the FLOPs of a ResNeXt unit are $hw(2cm+9m^{2}/g)$, and the FLOPs of the ShuffleNet unit are $hw(2cm/g+9m)$. In other words, under a given computational budget, the ShuffleNet unit can use wider feature maps. Furthermore, although depthwise convolution has only a small theoretical complexity, its ratio of computation to memory access cost (MAC) is poor, which makes it hard to run efficiently in practice (a point examined further in ShuffleNetV2), so depthwise convolution is only used on the bottleneck feature map.
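To make the three formulas concrete, the following sketch evaluates them for one set of example values. The function names and the chosen values (240 input channels, 60 bottleneck channels, a 28x28 feature map, 3 groups) are assumptions made for illustration only.

```python
def resnet_flops(c, m, h, w):
    """Bottleneck unit: dense 1x1, dense 3x3, dense 1x1."""
    return h * w * (2 * c * m + 9 * m * m)


def resnext_flops(c, m, h, w, g):
    """Same as ResNet, but the 3x3 convolution is split into g groups."""
    return h * w * (2 * c * m + 9 * m * m / g)


def shufflenet_flops(c, m, h, w, g):
    """Grouped 1x1 convolutions plus a 3x3 depthwise convolution."""
    return h * w * (2 * c * m / g + 9 * m)


# Assumed example values, not taken from the text above.
c, m, h, w, g = 240, 60, 28, 28, 3
print(resnet_flops(c, m, h, w))         # 47980800
print(resnext_flops(c, m, h, w, g))     # 31046400.0
print(shufflenet_flops(c, m, h, w, g))  # 7949760.0
```

With these values the ShuffleNet unit costs roughly one sixth of the ResNet unit, which is the headroom that can be spent on wider feature maps.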

Network Architecture

Throughout the network, for simplicity the bottleneck channel count of each unit is set to $1/4$ of its output channels. We provide a reference model that is as simple as possible, although further hyperparameter tuning may yield better results. The parameter $g$ controls the sparsity of the connections: under a fixed computational budget, more groups allow more output channels, which may help encode more information, but each individual convolution filter then sees fewer input channels, which may degrade its performance. ShuffleNetV2 also discusses this issue from another angle: too many groups increase memory access cost and thereby reduce speed. In addition, a hyperparameter $s$ scales the network by multiplying the number of channels in every layer by $s$, so overall complexity changes by roughly $s^2$.
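The effect of the scaling factor can be sketched as follows. The stage widths, the rounding rule, and both function names are assumptions of this example; only the unit cost $hw(2cm/g+9m)$ and the 1/4 bottleneck ratio come from the text.

```python
def scale_channels(base_channels, s, groups):
    """Scale stage widths by s, rounding to a multiple of 4 * groups.

    The rounding rule is an assumption of this sketch: it keeps the
    bottleneck width (1/4 of the output) divisible by the group count.
    """
    step = 4 * groups
    return [max(step, round(c * s / step) * step) for c in base_channels]


def unit_flops(c, h, w, groups):
    """FLOPs of one stride-1 ShuffleNet unit with bottleneck m = c / 4."""
    m = c // 4
    return h * w * (2 * c * m / groups + 9 * m)


base = [240, 480, 960]  # assumed stage widths for illustration, g = 3
half = scale_channels(base, s=0.5, groups=3)
print(half)  # [120, 240, 480]

ratio = unit_flops(half[0], 28, 28, 3) / unit_flops(base[0], 28, 28, 3)
print(round(ratio, 3))  # 0.263
```

Halving the channels cuts the unit cost to about a quarter; the ratio sits slightly above 0.25 because the depthwise term grows only linearly with the channel count.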

Experiments

The experiments are mainly conducted on ImageNet. Because small networks tend to suffer from underfitting rather than overfitting, only mild data augmentation is used.

Ablation Study

This analysis focuses on two aspects: pointwise group convolution and channel shuffle. These are the core components of ShuffleNet.

For pointwise group convolution, networks of different scales (scaling factor s) and different numbers of groups are compared. The experiments show that for larger networks, once the number of groups reaches a certain value, each convolution filter receives too few input channels, which hurts its representational capability. However, when the scaling factor is small and the network is small, increasing the number of groups improves accuracy more noticeably, because wider feature maps bring greater benefits for small networks.
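The link between the number of groups and the affordable width can be sketched by fixing a per-unit budget and searching for the widest unit that fits. The budget anchor (a 144-channel unit with g = 1), the step size, and the function names are assumptions of this example, and the search is only a rough stand-in for how a full network would be sized.

```python
def per_pixel_cost(c, groups):
    """Per-position FLOPs of a stride-1 unit with bottleneck m = c / 4."""
    m = c // 4
    return 2 * c * m / groups + 9 * m


def widest_channels(budget, groups):
    """Largest width (a multiple of 4 * groups) whose cost fits the budget."""
    step = 4 * groups  # assumption: keeps the bottleneck divisible by groups
    c = step
    while per_pixel_cost(c + step, groups) <= budget:
        c += step
    return c


# Assumed budget: the cost of a 144-channel unit without grouping (g = 1).
budget = per_pixel_cost(144, 1)
for g in (1, 2, 3, 4, 8):
    print(g, widest_channels(budget, g))
# 1 144
# 2 200
# 3 240
# 4 272
# 8 384
```

Under the same budget, g = 8 affords well over twice the channels of g = 1, while each filter in a grouped 1x1 convolution sees only 1/g of its input channels, which is the trade-off the ablation measures.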

Results of the 1x1 group convolution experiment

For channel shuffle, the experiments demonstrate that it consistently and clearly improves the classification score, with even better performance when the number of groups is large, since cross-group information exchange becomes more important in that case.

Results of the channel shuffle experiment

Comparison with Other Structure Units

Several units are compared within the existing ShuffleNet framework, with the experimental results shown in the figure.

Comparison of different structure units

Comparison with MobileNets and Other Frameworks

This section mainly compares ShuffleNet with other classic architectures. Notably, in the comparison with MobileNet, even a shallower ShuffleNet still clearly outperforms MobileNet, which shows that ShuffleNet’s advantage lies in the design of its unit rather than in network depth. In addition, ShuffleNet can be combined with other effective designs such as the SE (Squeeze-and-Excitation) module to further boost accuracy, though inference becomes somewhat slower.

Comparison with other structures

Generalization Ability

To investigate ShuffleNet’s generalization and transfer learning performance, it is tested on COCO object detection using the Faster-RCNN framework. On this task it again performs much better than MobileNet, possibly because its architecture is kept simple, without bells and whistles, which helps it generalize well.

Results of the generalization performance experiment

Actual Speedup Evaluation

This evaluation is conducted on an ARM platform. A larger number of groups (e.g., g=8) usually gives the best accuracy but runs less efficiently in practice, whereas g=3 offers a good balance between accuracy and inference time, and it still outperforms other networks.

2018 · 10 · 10