ShuffleNetV2

Abstract

Many network designs today consider only indirect metrics of computational complexity (such as FLOPs), yet direct metrics (such as speed) are not determined by FLOPs alone: MAC (memory access cost) and platform characteristics also influence speed. This paper argues for evaluating the direct metric on the target platform rather than considering FLOPs alone. Based on a series of controlled experiments, it proposes several guidelines for efficient networks, and from those guidelines derives a new architecture, ShuffleNetV2. Comprehensive ablation experiments show the model achieves a state-of-the-art trade-off between speed and accuracy.

Introduction

Ever since AlexNet achieved strong results on ImageNet, classification accuracy on ImageNet has been further improved by a number of new neural network architectures, such as VGG, GoogLeNet, ResNet, DenseNet, ResNeXt, and SE-Net, as well as by automatic network architecture search. Beyond accuracy, computational complexity is another factor that must be considered. Real-world applications usually have to run within the limited computational budget of the target platform, which has driven the design of many lightweight neural network architectures with better speed–accuracy trade-offs, such as Xception, MobileNet, MobileNetV2, ShuffleNet, and CondenseNet. In all of these works, group conv and depthwise conv play an essential role.

To evaluate computational complexity, the commonly used metric is FLOPs (defined in this paper as the number of mult-adds). However, FLOPs is only an approximate, indirect metric, not the direct metric we actually care about, such as speed or latency. This discrepancy has been noticed in recent work. For example, MobileNetV2 and NASNet-A have similar FLOPs, yet MobileNetV2 is much faster. The figure below shows that models can have similar FLOPs but different speeds.

Therefore, using FLOPs alone as a metric is insufficient and may lead to suboptimal designs.

The discrepancy between the indirect metric (FLOPs) and the direct metric (speed) has two main causes. First, FLOPs leaves out several factors that strongly affect speed. One is MAC (memory access cost): it makes up a large share of the runtime of operations such as group conv, and it can become the bottleneck on devices with strong compute power such as GPUs. Another is DOP (degree of parallelism): at the same FLOPs, a model with high DOP can be much faster than one with low DOP. Second, the same operation can take different amounts of time on different platforms. For example, tensor decomposition can reduce FLOPs by 75%, yet the decomposed network can run even slower on the GPU. It was later found that this is because recent CUDNN libraries are specially optimized for 3x3 convolutions, so one cannot simply assume that a 3x3 conv is 9 times slower than a 1x1 conv.
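To make the "9 times" figure concrete, the following minimal sketch counts mult-adds for a convolution layer. The function name `conv_mult_adds` and the layer sizes are illustrative assumptions, not taken from the paper; the count only reflects arithmetic, which is exactly why it cannot predict the measured speed on its own.

```python
def conv_mult_adds(h_out, w_out, c_in, c_out, k, groups=1):
    """Mult-adds of a k x k convolution producing an h_out x w_out x c_out output.

    Each output element needs k*k*(c_in/groups) multiply-adds.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return h_out * w_out * k * k * (c_in // groups) * c_out


# Illustrative layer sizes (assumed for this example only)
flops_1x1 = conv_mult_adds(56, 56, 64, 64, k=1)
flops_3x3 = conv_mult_adds(56, 56, 64, 64, k=3)
flops_1x1_g4 = conv_mult_adds(56, 56, 64, 64, k=1, groups=4)

print("3x3 / 1x1 mult-adds ratio:", flops_3x3 / flops_1x1)
print("1x1 with 4 groups / dense 1x1:", flops_1x1_g4 / flops_1x1)
```

The ratio of arithmetic work is fixed by the kernel size, whereas the ratio of runtimes depends on how well the library implements each kernel.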

From these observations, we find that efficient neural network design should follow two principles: first, use direct rather than indirect metrics; and second, measure on the specific target platform. Based on these two principles, we propose a more efficient neural network architecture. In Section 2 we first analyze the runtime performance of two representative state-of-the-art networks (ShuffleNetV1, MobileNetV2), and then propose four guidelines for efficient network design that go beyond considering FLOPs alone. Although these guidelines are platform-independent, we validate them with controlled experiments on two different platforms (GPU and ARM), each with dedicated code optimization, so that the conclusions hold for state-of-the-art implementations.

In Section 3 we design ShuffleNetV2, and through the complete validation experiments in Section 4, we demonstrate that it is both more accurate and faster than prior networks on both platforms.

Practical Guidelines for Efficient Network Design

This study is conducted on two widely used hardware platforms, each with an industrial-grade optimized CNN library. Note that our CNN library is more efficient than most open-source libraries, so our observations and conclusions are solid and have practical value in industry. The GPU platform is a single GTX 1080Ti, with CUDNN 7.0 as the convolution library and CUDNN's benchmarking function enabled to select the fastest algorithm for each convolution. The ARM platform is a Qualcomm Snapdragon 810, with a Neon-based implementation evaluated on a single thread. Other settings: full optimization is enabled (tensor fusion and the like); the input image size is 224x224; all networks are randomly initialized; and the runtime is averaged over 100 runs.
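The measurement protocol above (repeat the run and average) can be sketched as a small timing harness. This is a minimal illustration, not the authors' benchmarking code: `average_runtime`, `dummy_forward`, and the warm-up phase are assumptions introduced here, and a GPU measurement would additionally need to synchronize the device before reading the clock.

```python
import time


def average_runtime(fn, runs=100, warmup=10):
    """Return the mean wall-clock time of fn() in seconds over `runs` calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the average (an assumption)
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


def dummy_forward():
    # Stand-in workload; a real measurement would call the network's forward pass
    return sum(i * i for i in range(1000))


mean_seconds = average_runtime(dummy_forward)
print(f"mean runtime over 100 runs: {mean_seconds * 1e6:.1f} us")
```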

For this initial study we choose MobileNetV2 and ShuffleNetV1. Although there are only two, they represent the current trend: at their core are group conv and depthwise conv, which are also key components of other state-of-the-art networks. The full runtime breakdown is shown below.

FLOPs counts only the convolution part. Although convolution accounts for most of the time, other operations, such as data I/O, data shuffle, and element-wise operations, also take up a considerable fraction. Therefore, using FLOPs to estimate runtime is quite inaccurate. Based on the above observations, we analyze runtime from several angles and arrive at some practical guidelines for efficient neural network design.

Equal channel width minimizes memory access cost

Many modern networks adopt depthwise separable convolutions, in which the pointwise (1x1) convolution accounts for most of the complexity. For a 1x1 convolution with $c_1$ input channels, $c_2$ output channels, and an $h \times w$ feature map, the FLOPs is $B=hwc_1c_2$ . Assuming memory is large enough to hold the input and output features and the convolution kernel weights, the MAC is $MAC=hwc_1+hwc_2+c_1c_2$ , i.e., input features + output features + kernel weights. Since $c_1+c_2\ge 2\sqrt{c_1c_2}$ and $c_1c_2=\frac{B}{hw}$ , the mean (AM-GM) inequality gives $MAC\ge 2\sqrt{hwB}+\frac{B}{hw}$ , with equality when $c_1=c_2$ ; that is, for a given FLOPs, MAC is minimized when the number of output channels equals the number of input channels. This is only a theoretical bound. To check it in practice, a benchmark network is built by stacking 10 blocks repeatedly, and the channel counts are adjusted across configurations so that the total FLOPs stays constant. The measured results are as follows.

Speed comparison across different channel counts
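The bound above can also be checked numerically. The sketch below evaluates the MAC formula for several channel pairs that share the same product $c_1c_2$ (and therefore the same FLOPs). The feature-map size and the channel pairs are illustrative assumptions, not the configurations measured in the paper.

```python
import math


def pointwise_flops(h, w, c1, c2):
    """B = h*w*c1*c2 for a 1x1 convolution."""
    return h * w * c1 * c2


def pointwise_mac(h, w, c1, c2):
    """MAC = input features + output features + kernel weights."""
    return h * w * c1 + h * w * c2 + c1 * c2


def mac_lower_bound(h, w, flops):
    """Lower bound 2*sqrt(h*w*B) + B/(h*w), reached when c1 == c2."""
    return 2 * math.sqrt(h * w * flops) + flops / (h * w)


h = w = 56
# All pairs have c1*c2 = 16384, so the FLOPs are identical (assumed values)
channel_pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]

results = []
for c1, c2 in channel_pairs:
    flops = pointwise_flops(h, w, c1, c2)
    mac = pointwise_mac(h, w, c1, c2)
    results.append((c1, c2, flops, mac))
    print(f"c1={c1:4d} c2={c2:4d} FLOPs={flops} MAC={mac} "
          f"bound={mac_lower_bound(h, w, flops):.0f}")
```

Only the balanced pair reaches the bound; the more unbalanced the channels, the further MAC moves away from it at identical FLOPs.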

Excessive group convolution increases MAC

Group conv is at the core of many modern networks. It reduces FLOPs by making the connections between channels sparse. On the one hand, at fixed FLOPs it allows more channels, which increases network capacity (and thus accuracy); on the other hand, the larger channel count also increases MAC. Extending the formula from the previous guideline, a 1x1 group conv has $B=\frac{hwc_1c_2}{g}$ and $MAC=hw(c_1+c_2)+\frac{c_1c_2}{g}=hwc_1+\frac{Bg}{c_1}+\frac{B}{hw}$ , where g is the number of groups. With the input shape (h, w, c1) and the cost B fixed, MAC therefore grows as g increases. A benchmark network built by stacking 10 1x1 group convs shows that blindly choosing a very large number of groups is not a good idea: the accuracy gain it brings can easily be outweighed by the increased computational cost. The experimental results are as follows.

Comparison across different group counts
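As a rough numeric illustration of the formula above, the sketch below holds the input shape and the FLOPs budget B fixed and varies the number of groups. The function name `group_conv_mac` and the layer sizes are assumptions made for this example.

```python
def group_conv_mac(h, w, c1, flops, g):
    """MAC = h*w*c1 + B*g/c1 + B/(h*w) for a 1x1 group conv with g groups."""
    return h * w * c1 + flops * g / c1 + flops / (h * w)


def implied_output_channels(h, w, c1, flops, g):
    """Output channels c2 that keep the FLOPs at B: c2 = B*g / (h*w*c1)."""
    return flops * g / (h * w * c1)


h = w = 28
c1 = 128
flops = h * w * c1 * 128  # budget of a dense 1x1 conv with 128 output channels

macs = {}
for g in (1, 2, 4, 8):
    macs[g] = group_conv_mac(h, w, c1, flops, g)
    c2 = implied_output_channels(h, w, c1, flops, g)
    print(f"g={g} c2={c2:.0f} MAC={macs[g]:.0f}")
```

At constant FLOPs, each doubling of g doubles the affordable output channels, and the output feature map that has to be written grows with it.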

Network fragmentation reduces degree of parallelism

In the GoogLeNet family and in auto-generated architectures, "multi-path" structures are widely used within each network block, where many small operations (fragmented operators) are used instead of a few large ones. For example, in NASNet-A the number of fragmented ops (individual conv or pooling operations within a block) is 13, whereas in regular architectures such as ResNet it is 2 or 3. Such fragmented structures have been shown to benefit accuracy, but they may reduce efficiency because they are unfriendly to highly parallel compute devices like GPUs. They also incur extra overhead such as kernel launch and synchronization. To quantify this effect, we design the experimental blocks shown below.

Network architectures for the degree-of-parallelism experiment

Each block is repeated 10 times. The results show that fragmentation significantly reduces speed on the GPU, while the slowdown on ARM is milder. The experimental results are shown below.

Effect of network fragmentation on speed

Element-wise operations are non-negligible

As shown in the earlier figure, element-wise operations (tensor addition, bias, ReLU, etc.) take up a considerable amount of time, especially on the GPU. They have relatively small FLOPs but relatively large MAC. In particular, depthwise conv is also treated as an element-wise operation here, because it likewise has a large MAC/FLOPs ratio. We conduct experiments on a bottleneck structure, removing the ReLU and the shortcut in turn.

Experimental results for element-wise operations
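The MAC/FLOPs argument can be illustrated with a simple cost model. The sketch below is an assumption-laden estimate, not a measurement: it counts one memory access per tensor element read or written, one operation per element for the tensor addition, and mult-adds for the convolutions. The function names and layer sizes are chosen for illustration only.

```python
def add_cost(h, w, c):
    """Element-wise addition of two h x w x c tensors: (ops, MAC)."""
    ops = h * w * c              # one addition per output element
    mac = 3 * h * w * c          # read two inputs, write one output
    return ops, mac


def depthwise3x3_cost(h, w, c):
    """3x3 depthwise conv, stride 1, same padding: (mult-adds, MAC)."""
    ops = 9 * h * w * c
    mac = 2 * h * w * c + 9 * c  # input + output + weights
    return ops, mac


def pointwise_cost(h, w, c):
    """Dense 1x1 conv with c input and c output channels: (mult-adds, MAC)."""
    ops = h * w * c * c
    mac = 2 * h * w * c + c * c
    return ops, mac


h, w, c = 28, 28, 128
ratios = {}
for name, cost in [("add", add_cost),
                   ("depthwise 3x3", depthwise3x3_cost),
                   ("pointwise 1x1", pointwise_cost)]:
    ops, mac = cost(h, w, c)
    ratios[name] = mac / ops
    print(f"{name:14s} MAC/ops = {ratios[name]:.4f}")
```

Under these assumptions the addition and the depthwise conv move far more memory per arithmetic operation than the dense 1x1 conv, which is consistent with their runtime share being larger than their FLOPs suggest.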

Conclusion and Discussion

In summary, the four guidelines are: G1) use balanced convolutions as much as possible (equal input and output channels); G2) be aware of the cost of group conv; G3) reduce the degree of network fragmentation; G4) reduce element-wise operations. Moreover, rather than relying on theory alone, one should pay attention to how a network actually behaves on the target platform and apply this in network design. Many prior networks violate these rules: for example, ShuffleNetV1 relies too heavily on group conv, violating G2, and its bottleneck design violates G1; MobileNetV2's bottleneck design violates G1, and its use of ReLU on overly thick feature maps violates G4; and auto-generated architectures are overly fragmented, violating G3.

ShuffleNet V2: an Efficient Architecture

The main challenge in lightweight network design is the limited number of channels under a given computational budget. ShuffleNetV1 uses two techniques to increase the number of channels under limited FLOPs: 1) pointwise group conv, and 2) the bottleneck structure. It then introduces a channel shuffle operation to enable communication between different channel groups. But both pointwise group conv and the bottleneck structure violate the earlier guidelines. The question now is how to maintain a large number of equally wide channels without resorting to dense convolutions or too many groups.

To meet these requirements, several design choices are made. At the start of each unit, the input channels are split into two branches, and one branch is left as an identity. On the other branch, the 1x1 convs no longer use groups and there is no bottleneck structure; instead the input and output channels are made equal. ReLU and depthwise conv are applied on only this one branch. Finally, the two branches are concatenated and a channel shuffle is performed. Furthermore, concat, channel shuffle, and the split of the next unit are merged into a single element-wise operation. As in V1, the network can be scaled by a factor s that changes the channel count; in addition, for simplicity the split uses an even, half-and-half division. The structure of the ShuffleNetV2 block is as follows.

Comparison of ShuffleNetV1 and V2 blocks
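The data flow of the unit described above (split, transform one branch, concat, shuffle) can be traced with channel labels instead of real tensors. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the shuffle uses 2 groups to match the two branches, and `branch` merely tags channels where the real unit would apply 1x1 conv, 3x3 depthwise conv, and 1x1 conv.

```python
def channel_split(channels):
    """Half-and-half split of the channel list."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups=2):
    """Reshape to (groups, n/groups), transpose, and flatten again."""
    n = len(channels)
    assert n % groups == 0
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


def basic_unit(channels, branch):
    """One unit: identity on the left half, `branch` on the right half."""
    left, right = channel_split(channels)
    right = branch(right)
    return channel_shuffle(left + right, groups=2)


def mark(channels):
    # Stand-in for the convolution branch: append a prime to each label
    return [ch + "'" for ch in channels]


x = [str(i) for i in range(8)]
y = basic_unit(x, mark)
print(y)
```

In the printed list, untouched and transformed channels alternate, so the split at the start of the next unit hands each branch a mix of both. The untouched half is the feature reuse that the unit gets at no extra cost.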

ShuffleNetV2 is very efficient and can therefore have more channels and greater model capacity. In addition, the half-and-half split of features can be viewed as a form of feature reuse, similar to DenseNet and CondenseNet. The feature-reuse patterns of DenseNet and ShuffleNetV2 are compared in the figure below.

ShuffleNetV2 obtains the same feature-reuse benefits as DenseNet while being more efficient, as the subsequent experiments demonstrate.

Experiment

The hyperparameters and protocol used in the experiments are exactly the same as in ShuffleNetV1. We first present the experimental results here, and summarize the experiments afterward.

Model	Complexity (MFLOPs)	Top-1 err. (%)	GPU Speed (Batches/sec.)	ARM Speed (Images/sec.)
ShuffleNet v2 0.5x (ours)	41	39.7	417	57.0
0.25 MobileNet v1 [13]	41	49.4	502	36.4
0.4 MobileNet v2 [14] (our impl.)*	43	43.4	333	33.2
0.15 MobileNet v2 [14] (our impl.)	39	55.1	351	33.6
ShuffleNet v1 0.5x (g=3) [15]	38	43.2	347	56.8
DenseNet 0.5x [6] (our impl.)	42	58.6	366	39.7
Xception 0.5x [3] (our impl.)	40	44.9	384	52.9
IGCV2-0.25 [27]	46	45.1	183	31.5
ShuffleNet v2 1x (ours)	146	30.6	341	24.4
0.5 MobileNet v1 [13]	149	36.3	382	16.5
0.75 MobileNet v2 [14] (our impl.)**	145	32.1	235	15.9
0.6 MobileNet v2 [14] (our impl.)	141	33.3	249	14.9
ShuffleNet v1 1x (g=3) [15]	140	32.6	213	21.8
DenseNet 1x [6] (our impl.)	142	45.2	279	15.8
Xception 1x [3] (our impl.)	145	34.1	278	16.3
IGCV2-0.5 [27]	156	34.5	132	15.5
IGCV3-D (0.7) [28]	210	31.5	143	11.7
ShuffleNet v2 1.5x (ours)	299	27.4	255	11.8
0.75 MobileNet v1 [13]	325	31.6	314	10.6
1.0 MobileNet v2 [14]	300	28.0	180	8.9
1.0 MobileNet v2 [14] (our impl.)	301	28.3	180	8.9
ShuffleNet v1 1.5x (g=3) [15]	292	28.5	164	10.3
DenseNet 1.5x [6] (our impl.)	295	39.9	274	9.7
CondenseNet (G=C=8) [16]	274	29.0	-	-
Xception 1.5x [3] (our impl.)	305	29.4	219	10.5
IGCV3-D [28]	318	29.4	102	6.3
ShuffleNet v2 2x (ours)	591	25.1	217	6.7
1.0 MobileNet v1 [13]	569	29.4	247	6.5
1.4 MobileNet v2 [14]	585	25.3	137	5.4
1.4 MobileNet v2 [14] (our impl.)	587	26.7	137	5.4
ShuffleNet v1 2x (g=3) [15]	524	26.3	197	6.4
DenseNet 2x [6] (our impl.)	519	34.6	197	6.1
CondenseNet (G=C=4) [16]	529	26.2	-	-
Xception 2x [3] (our impl.)	525	27.6	174	6.7
IGCV2-1.0 [27]	564	29.3	81	4.9
IGCV3-D (1.4) [28]	610	25.5	82	4.5
ShuffleNet v2 2x (ours, with SE [8])	597	24.6	161	5.6
NASNet-A [4] (4 @ 1056, our impl.)	564	26.0	130	4.6
PNASNet-5 [10] (our impl.)	588	25.8	115	4.1

ShuffleNetV2 experimental results (Table 8: comparison of several network architectures over classification error, complexity (FLOPs) and speed, grouped by complexity from ~40 to ~500 MFLOPs; [*] denotes 160×160 input, [**] denotes 192×192 input)

Accuracy vs. FLOPs: ShuffleNetV2 clearly surpasses all other networks. MobileNetV2 performs poorly at the 40 MFLOPs level with a 224x224 image size (55.1% top-1 error, versus 43.4% with a 160x160 input), probably because it has too few channels. Compared with DenseNet, both reuse features, but our model is more efficient.

Inference Speed vs. FLOPs/Accuracy: MobileNetV2 is very slow, especially at small FLOPs, likely due to its excessively high MAC. Although MobileNetV1 is inferior in accuracy, it is very fast on the GPU, even faster than ShuffleNetV2, probably because it better satisfies the earlier design guidelines, especially G3: MobileNetV1 has even fewer fragments than ShuffleNetV2. In addition, both IGCV2 and IGCV3 are very slow, likely because they use too many group convs. All of these observations are consistent with the design guidelines. Current auto-searched neural network architectures are relatively slow, presumably because they have many fragments, which violates G3, though this research direction remains promising. In terms of accuracy vs. speed, ShuffleNetV2 performs best on both the GPU and ARM platforms.

Compatibility with other methods: When ShuffleNetV2 is combined with other techniques, such as SE (squeeze and excitation), classification accuracy improves further at some cost in speed, which shows that it is compatible with them.

Generalization to Large Models: ShuffleNetV2 also performs well when used as a large model; the only change is that, when made very deep, residual connections are added to speed up convergence.

Object Detection: We evaluate generalization on the COCO dataset, using the state-of-the-art lightweight detector Light-Head RCNN as the framework. Models are pretrained on ImageNet and then finetuned for the detection task. Xception turns out to be fairly good at detection, possibly because of the larger receptive field of its blocks. Inspired by this, we add a 3x3 depthwise conv before the first 1x1 conv in each block; this variant, denoted ShuffleNetV2*, further improves performance at the cost of only a small increase in FLOPs. In addition, for the detection task the speed differences between models are smaller than for classification (after excluding the overhead of data copying and detection-specific overhead). ShuffleNetV2* achieves the best accuracy and is still faster than the other methods. This raises another practical question: how to increase the size of the receptive field, which is crucial for object detection on high-resolution images.

Conclusion

Network design should consider direct rather than indirect metrics. We provide effective guidelines for architecture design and an efficient neural network, ShuffleNetV2, and comprehensive experiments fully demonstrate the model’s effectiveness. We hope this work will lead to more platform-aware and practical network design.

2018 · 10 · 11