MobileNetV2 Close Reading: Inverted Residuals and Linear Bottlenecks

Today I did a close reading of the MobileNetV2 paper. Its core changes are the inverted residual structure and the linear bottleneck, which together strike a balance among accuracy, FLOPs, and latency. The paper runs experiments on three kinds of tasks: classification, detection, and segmentation. Below I organize my notes from reading the paper.

Why ReLU Is Dropped at the Low-Dimensional Bottleneck

This is the question to settle before anything else in the paper. The paper starts from the “manifold of interest” (the feature distribution we care about), which it assumes can be embedded in a low-dimensional subspace of the high-dimensional activation space, and then states two properties of ReLU:

  • If the manifold of interest still has non-zero volume after ReLU, then on that region ReLU acts as a linear transformation;
  • ReLU can fully preserve the information of the input manifold, but only when that manifold lies in a low-dimensional subspace of the input space.

Taken together: in a high-dimensional space, ReLU can zero out some channels and the information still survives in the others, but a low-dimensional bottleneck has no such redundancy, so the truncation on the negative half-axis irreversibly collapses feature information and later layers cannot recover it. Xception had already shown experimentally that following a depthwise conv directly with ReLU degrades performance, which is consistent with the direction of the analysis above.
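A small NumPy sketch of this intuition, loosely modeled on the paper’s random-embedding illustration. 2-D points are mapped into n channels by a random matrix, passed through ReLU, and a point is counted as exactly recoverable when at least two channels survive, since each surviving channel gives one linear equation in the two unknowns. The function name, the Gaussian inputs, and the “at least two survivors” criterion are my own assumptions, not the paper’s procedure.

```python
import numpy as np

def recoverable_fraction(n_dims, n_points=2000, seed=0):
    """Fraction of 2-D points still exactly recoverable after x -> ReLU(T x)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_points, 2))   # stand-in for a low-dimensional manifold
    t = rng.standard_normal((n_dims, 2))     # random embedding into n_dims channels
    active = (x @ t.T) > 0                   # channels that survive ReLU, per point
    # Each surviving channel i gives one equation t_i . x = y_i. Random rows are
    # pairwise linearly independent almost surely, so two survivors determine x.
    return float(np.mean(active.sum(axis=1) >= 2))

low = recoverable_fraction(n_dims=2)     # ReLU applied in a narrow space
high = recoverable_fraction(n_dims=30)   # ReLU applied after expanding to a wide space
print(f"n=2: {low:.2f}   n=30: {high:.2f}")
```

With only two channels, a point survives intact only if both pre-activations happen to be positive, so a large share of points is lost; with thirty channels almost every point keeps enough active channels to be reconstructed.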

Inverted Residual Structure

Based on the analysis above, MobileNetV2’s bottleneck design is “narrow-wide-narrow,” exactly the opposite of ResNet’s “wide-narrow-wide,” which is why it’s called an inverted residual. The specific flow is:

  1. The input is a low-dimensional bottleneck;
  2. A 1×1 conv expands it to high dimension by the expansion rate;
  3. A depthwise conv (3×3) operates in the high-dimensional space;
  4. A 1×1 linear conv (with no activation) maps it back to low dimension;
  5. The residual connection links the two low-dimensional bottlenecks at the ends.
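The five steps above can be sketched as a plain NumPy forward pass. This is a minimal illustration under my own assumptions: BN is omitted, tensors are laid out as (H, W, C), the weights are random, and all function names are hypothetical rather than taken from any library.

```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def pointwise_conv(x, w):
    """1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)."""
    return x @ w

def depthwise_conv3x3(x, w, stride=1):
    """3x3 depthwise convolution with zero padding 1: x is (H, W, C), w is (3, 3, C)."""
    h, wd, c = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, c))
    for dy in range(3):
        for dx in range(3):
            # Each channel is filtered with its own 3x3 kernel; channels never mix.
            out += padded[dy:dy + h, dx:dx + wd, :] * w[dy, dx, :]
    return out[::stride, ::stride, :]

def inverted_residual(x, w_expand, w_dw, w_project, stride=1):
    hidden = relu6(pointwise_conv(x, w_expand))               # step 2: narrow -> wide
    hidden = relu6(depthwise_conv3x3(hidden, w_dw, stride))   # step 3: spatial filtering
    out = pointwise_conv(hidden, w_project)                   # step 4: wide -> narrow, no activation
    if stride == 1 and out.shape == x.shape:                  # step 5: skip between bottlenecks
        out = out + x
    return out

rng = np.random.default_rng(0)
c, t = 8, 6                                   # bottleneck width and expansion rate
x = rng.standard_normal((14, 14, c))          # step 1: low-dimensional input
w_expand = rng.standard_normal((c, c * t)) * 0.1
w_dw = rng.standard_normal((3, 3, c * t)) * 0.1
w_project = rng.standard_normal((c * t, c)) * 0.1
y = inverted_residual(x, w_expand, w_dw, w_project)
print(x.shape, "->", y.shape)
```

Note that the output of step 4 is left unclipped, so the bottleneck can carry negative values; only the wide intermediate tensor passes through a non-linearity.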

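This structure has a small memory footprint and is well suited to mobile deployment. Both ends of the residual connection are low-dimensional tensors, so the tensors that have to be kept alive across a block are much smaller than the wide ones at the ends of an ordinary ResNet bottleneck; the expanded tensor exists only inside the block.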

That final 1×1 conv adds no ReLU; the paper calls it a linear conv, for exactly the reason analyzed earlier: the output is low-dimensional, and adding ReLU here would irreversibly destroy information. The paper gives a formal proof, though the process is fairly involved.

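The paper also points out that this structure decouples, to some degree, the input/output domains of a block (the bottlenecks, which set its capacity) from the expressiveness of the transformation (the expanded layer in between).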

Choosing the Expansion Rate

Apart from the first block, the expansion rate is a single constant across the whole network. The paper reports that values between 5 and 10 give nearly identical performance curves; small networks do slightly better with a smaller value and large networks with a larger one. The reference configuration uses 6.
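To see what the expansion rate costs, the paper gives the multiply-add count of one block as h·w·d'·t·(d' + k² + d''), for an h×w input with d' input channels, d'' output channels, and kernel size k. A minimal sketch of that formula follows; the function and argument names are mine, and the 14×14×64 block is just an illustrative choice.

```python
def bottleneck_madds(h, w, d_in, d_out, t, k=3):
    """Multiply-adds of one stride-1 block: expansion + depthwise + projection."""
    hidden = t * d_in
    expand = h * w * d_in * hidden       # 1x1 conv, narrow -> wide
    depthwise = h * w * hidden * k * k   # one k x k filter per hidden channel
    project = h * w * hidden * d_out     # linear 1x1 conv, wide -> narrow
    return expand + depthwise + project

for t in (1, 5, 6, 10):
    print(f"t={t:>2}: {bottleneck_madds(14, 14, 64, 64, t):,} multiply-adds")
```

The cost grows linearly in t, so choosing a value within the 5 to 10 range is mostly a trade of compute against width in the expanded layer.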

Overall Network Structure

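The whole network first uses an ordinary 3×3 conv to raise the channels to 32, then stacks the residual bottleneck layers, each with BN added. The paper’s text counts 19 bottleneck layers, although the repeat counts in its configuration table add up to 17.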

The figure below is the complete network configuration given in the paper, where t is the expansion rate of each stage:

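The bottleneck module comes in two cases: the stride=1 version has a residual connection (which also requires the input and output channel counts to match, otherwise the two tensors cannot be added), while the stride=2 version (used for downsampling) does not.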
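As a rough sketch of how the configuration and the two block variants fit together, the code below unrolls the stage table into individual blocks, marks which ones can carry a residual connection, and tallies weights and multiply-adds for a 224×224 input. The (t, c, n, s) rows, the 32-channel stem, and the 1280-channel final conv are as I recall them from the paper’s table, so treat them as assumptions; skipping the expansion conv when t = 1 is an implementation choice of this sketch, and BN parameters and biases are not counted.

```python
# (t, c, n, s) per stage: expansion rate, output channels, repeats, stride of first block.
STAGES = [(1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
          (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1)]

def expand_blocks(stages, stem_channels=32):
    """Unroll the stage table into one dict per bottleneck block."""
    blocks, c_in = [], stem_channels
    for t, c_out, n, s in stages:
        for i in range(n):
            stride = s if i == 0 else 1   # only the first block of a stage downsamples
            blocks.append({"t": t, "c_in": c_in, "c_out": c_out, "stride": stride,
                           "residual": stride == 1 and c_in == c_out})
            c_in = c_out
    return blocks

def tally(blocks, resolution=224, stem_channels=32, last_channels=1280, classes=1000):
    """Count conv/FC weights and multiply-adds (BN parameters and biases ignored)."""
    res = resolution // 2                       # the stem conv has stride 2
    params = 3 * 3 * 3 * stem_channels          # 3x3 conv on an RGB input
    madds = params * res * res
    for b in blocks:
        hidden = b["c_in"] * b["t"]
        if b["t"] != 1:                         # assumption: no expansion conv when t = 1
            w = b["c_in"] * hidden
            params += w
            madds += w * res * res              # expansion runs at the input resolution
        res //= b["stride"]
        w = 3 * 3 * hidden + hidden * b["c_out"]   # depthwise 3x3 + linear 1x1 projection
        params += w
        madds += w * res * res
    head = blocks[-1]["c_out"] * last_channels  # final 1x1 conv before pooling
    params += head + last_channels * classes    # plus the classifier weights
    madds += head * res * res + last_channels * classes
    return params, madds

blocks = expand_blocks(STAGES)
params, madds = tally(blocks)
print(len(blocks), "blocks,", sum(b["residual"] for b in blocks), "with a residual connection")
print(f"{params / 1e6:.2f}M weights, {madds / 1e6:.0f}M multiply-adds")
```

Under these assumptions the table unrolls to 17 blocks, and the tally comes out at roughly 3.5M weights and roughly 300M multiply-adds.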

ReLU6

Inside the module, the two non-linearities (after the 1×1 expansion conv and after the depthwise conv, both in the high-dimensional space) use ReLU6 rather than ordinary ReLU. ReLU6 clips activation values to the interval [0, 6], i.e. min(max(x, 0), 6).

The paper’s stated reason is robustness under low-precision computation, which matters on mobile: devices commonly run float16 or even int8 inference. float16 has only 10 mantissa bits, so the gap between neighboring representable values grows with magnitude, and activations whose dynamic range reaches hundreds or even thousands are stored coarsely. Restricting activations to the fixed interval [0, 6] keeps them where float16 is fine-grained, and it gives fixed-point quantization a known range to map onto.
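The precision argument can be checked numerically. The sketch below uses NumPy’s `np.spacing` to measure the gap between neighboring float16 values at a few magnitudes; the helper names and the sample magnitudes are my own choices.

```python
import numpy as np

def relu6(x):
    """Clip activations to the interval [0, 6]."""
    return np.clip(x, 0.0, 6.0)

def float16_step(value):
    """Gap between `value` and the next larger representable float16."""
    return float(np.spacing(np.float16(value)))

for v in (1.0, 6.0, 100.0, 1000.0):
    print(f"float16 step near {v:g}: {float16_step(v)}")

# For 8-bit fixed point, a known range [0, 6] maps onto 256 uniform levels.
print("uint8 step over [0, 6]:", 6.0 / 255)
```

Near 6 the float16 step is 2⁻⁸, while near 1000 it is 0.5, so an unbounded activation that drifts into the hundreds loses most of its fractional detail.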

Experimental Results

On ImageNet, with a 224×224 input and a width multiplier of 1, the whole network needs about 300M multiply-adds and has around 3.4M parameters. The multi-scale test results in the paper are as follows: