Efficient Dense Modules of Asymmetric Convolution for Real-Time Semantic Segmentation

Abstract

Earlier segmentation networks were either too slow or too inaccurate. This paper proposes EDANet, a network built from EDA modules that combine asymmetric convolution, dilated convolution, and dense connectivity. On Cityscapes it outperforms FCN-8s in accuracy, speed, and model size (67.3% vs. 65.3% mIoU, 12.3 ms vs. 0.5 s per image, 0.68M vs. 134.5M parameters), and it does so without a decoder structure, a context module, a post-processing scheme, or a pretrained model. Experiments are run on Cityscapes and CamVid.

Introduction

A comparison between EDANet and several other networks is shown below.

Performance comparison between EDANet and other networks

EDANet has a few key components: asymmetric convolution, the densely connected structure of DenseNet, and dilated convolution.

Asymmetric convolution: splitting an nxn conv into a 1xn conv and an nx1 conv, which reduces the number of parameters with only a small drop in performance.

Densely connected structure: borrowed from DenseNet. Although it was originally designed for image classification, its fusion of multi-layer features is very useful for segmentation tasks.

Dilated convolution: enlarges the receptive field.
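To make the receptive-field effect of dilation concrete, here is a minimal sketch in plain Python. It uses the standard formula for the span of a dilated kernel, k + (k - 1)(d - 1), and accumulates the receptive field over a stack of layers along one axis. The function names and the dilation rates are illustrative assumptions, not values taken from EDANet.

```python
def effective_kernel(k: int, dilation: int) -> int:
    """Span, in input pixels, covered by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field along one axis of a stack of conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples, applied in order.
    """
    rf, jump = 1, 1  # jump: distance between adjacent outputs, in input pixels
    for k, stride, dilation in layers:
        rf += (effective_kernel(k, dilation) - 1) * jump
        jump *= stride
    return rf


# Illustrative dilation rates (assumption, not from the text).
for d in (1, 2, 4, 8):
    print(f"dilation={d}: a 3-tap kernel spans {effective_kernel(3, d)} pixels")

print(receptive_field([(3, 1, 1), (3, 1, 1)]))  # two plain 3x3 convs
print(receptive_field([(3, 1, 1), (3, 1, 2)]))  # second conv dilated by 2
```

The dilated kernel keeps the same three weights per axis, so the wider span comes at no extra parameter cost.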

To balance efficiency and accuracy, no decoder structure, context module, or post-processing scheme is added.

Related Work

CNNs were initially used for image classification. FCN was the first network to apply CNNs to semantic segmentation: it replaces the FC layers in VGG with convolutional layers to produce pixel-level predictions, which ushered semantic segmentation into the CNN-based era.

Among high-accuracy networks, U-Net uses an encoder-decoder structure that brings spatial information from shallow layers into the deeper layers. DeconvNet proposes a decoder that mirrors the encoder and upsamples the encoder’s output; however, the heavy decoder makes the network computationally expensive. Dilation10 stacks dilated convolution layers with progressively increasing dilation rates to form a context module that aggregates multi-scale contextual information. DeepLab introduces an ASPP module that uses multiple parallel dilated convolutions to explore multi-scale representations. Both modules require large amounts of computation and long inference times, which makes them impractical for real-time use.

Among high-efficiency networks, ENet was the first network aimed at real-time semantic segmentation. It inherits the structure of ResNet and reduces the number of convolution filters to cut computation, while ESPNet places a 1x1 conv in front of its spatial pyramid to reduce computation. Both are very efficient but not very accurate.

As for densely connected architectures, DenseNet achieved strong results on image classification, and some works have already leveraged it for semantic segmentation. FC-DenseNet uses DenseNet as the encoder and builds an additional decoder structure on top. SDN uses DenseNet as the backbone and combines it with a stacked deconv structure. This method modifies DenseNet only lightly, without any additional optimization, and the modification also increases computation and runtime.

In EDANet, asymmetric convolution is used to reduce the number of parameters and the computational cost, and dense connectivity is likewise built into the network architecture. As a result, EDANet maintains high accuracy while achieving high inference speed.

Method

The structure of the entire network is shown in the figure.

It consists mainly of three kinds of components: the Downsampling Block, the EDA Block, and the final Projection Layer. Each EDA Block in turn contains multiple EDA modules. The structure of the EDA module is shown below:

It contains two groups of asymmetric convolutions: the first group is a normal conv, and the second group is a dilated conv. For a 3x3 kernel, replacing the full conv with a 3x1 and a 1x3 conv cuts the weights per input-output channel pair from 9 to 6, a 33% reduction in computation, with only a small drop in performance.
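The 33% figure can be checked with a short weight count. This is a minimal sketch that assumes the same number of input and output channels for every conv and ignores biases; the channel count of 40 is an arbitrary illustrative value.

```python
def conv_weights(kh: int, kw: int, c_in: int, c_out: int) -> int:
    """Number of weights in a kh x kw convolution (biases ignored)."""
    return kh * kw * c_in * c_out


def asymmetric_saving(n: int, channels: int):
    """Compare one n x n conv with an n x 1 conv plus a 1 x n conv.

    Assumes input and output channel counts are equal for every conv.
    Returns (full weights, split weights, fractional saving).
    """
    full = conv_weights(n, n, channels, channels)
    split = (conv_weights(n, 1, channels, channels)
             + conv_weights(1, n, channels, channels))
    return full, split, 1 - split / full


full, split, saving = asymmetric_saving(n=3, channels=40)  # 40 is illustrative
print(full, split, f"{saving:.1%}")
```

Under these assumptions the saving is 1 - 2/n regardless of the channel count, so it grows with larger kernels (for n = 5 it would be 60%).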

Another technique is the connection scheme from DenseNet, which concatenates the newly learned features with the input, i.e. $y_m=[H_{m}(y_{m-1}),y_{m-1}]$, where $y_m$ is the output of the m-th module and $H_m$ is the function that module computes. This connection structure can greatly improve processing efficiency. It is also well known that deeper layers have larger receptive fields: for example, two stacked 3x3 convs have the same receptive field as a single 5x5 conv. Dense connectivity therefore concatenates features from modules with different receptive fields, allowing the network to gather more information and to produce better segmentation results even at low computational cost.
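The concatenation rule can be sketched with plain lists, where each list element stands for one feature channel. `make_module` is a hypothetical stand-in for $H_m$ that simply emits a fixed number of labelled channels; the channel counts are illustrative assumptions, not EDANet's actual configuration.

```python
def dense_block(y0, modules):
    """Apply y_m = [H_m(y_{m-1}), y_{m-1}] for each module H_m in order."""
    y = list(y0)
    for h in modules:
        y = h(y) + y  # list concatenation stands in for channel concatenation
    return y


def make_module(m: int, growth: int):
    """Hypothetical stand-in for H_m: emits `growth` new channels tagged with m."""
    def h(y):
        return [f"m{m}_c{i}" for i in range(growth)]
    return h


y0 = [f"in_c{i}" for i in range(4)]                      # 4 input channels
modules = [make_module(m, growth=2) for m in (1, 2, 3)]  # 3 modules, 2 new channels each
y3 = dense_block(y0, modules)

print(len(y3))           # input channels + modules * growth
print(y3[:2], y3[-4:])   # newest features come first, the original input last
```

The output of the last module still contains the untouched input channels and the features of every earlier module, which is how features with different receptive fields end up side by side.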

For the network architecture design, ENet’s initial block is used as the downsampling block, which is split into two modes, represented as follows.

This kind of downsampling block gives the network a larger receptive field for gathering contextual information. However, reducing the resolution of the feature map loses detail, which hurts pixel-level segmentation, so only 3 downsampling blocks are used here. The final feature maps are therefore 1/8 the size of the full-resolution input image, whereas in other networks such as SegNet they shrink to 1/32.
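The 1/8 and 1/32 factors follow from simple arithmetic, sketched below under the assumption that every downsampling stage halves the height and width (stride 2). The 512x1024 input is the training resolution used in the Cityscapes experiments; the five-stage case is only there to reproduce the 1/32 factor.

```python
def feature_size(height: int, width: int, num_downsampling: int, stride: int = 2):
    """Feature-map size after `num_downsampling` stages that each divide H and W by `stride`."""
    factor = stride ** num_downsampling
    return height // factor, width // factor, factor


print(feature_size(512, 1024, 3))  # three stages: 1/8 of the input
print(feature_size(512, 1024, 5))  # five stages: 1/32 of the input
```

At 1/8 resolution each feature-map location covers an 8x8 patch of the input, compared with 32x32 at 1/32, which is why the shallower downsampling preserves more detail.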

For the sake of speed, no decoder is used here; instead, a 1x1 conv is added at the end as the projection layer, and its output is resized back to full resolution using bilinear interpolation. This slightly reduces accuracy but saves a large amount of computation.
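A minimal sketch of these two steps in plain Python follows. It treats the 1x1 conv as a per-pixel linear map from features to scores and resizes a single score map bilinearly. The corner-aligned sampling convention and both function names are assumptions made for illustration; the text does not say which interpolation convention EDANet uses.

```python
def project_1x1(feature_map, weights):
    """1x1 conv: at each pixel, map C_in features to C_out scores.

    feature_map: H x W x C_in nested lists; weights: C_out x C_in nested lists.
    """
    return [[[sum(w * f for w, f in zip(w_row, pixel)) for w_row in weights]
             for pixel in row] for row in feature_map]


def bilinear_resize(grid, out_h: int, out_w: int):
    """Resize a 2D grid bilinearly, with corner-aligned sampling (assumption)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
            bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


scores = project_1x1([[[1.0, 2.0]]], [[1, 0], [0, 1], [1, 1]])  # 2 features -> 3 scores
upsampled = bilinear_resize([[0.0, 1.0], [2.0, 3.0]], 3, 3)
print(scores)
print(upsampled)
```

Bilinear resizing has no learned parameters, which is the source of the saving relative to a decoder: the only trainable part of the output stage is the 1x1 projection.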

A post-activation scheme is used here, i.e. each conv is followed by batch normalization and then ReLU (conv-bn-relu), and this ordering is applied to all conv layers. Dropout with a rate of 0.02 is also added during training for regularization.

Experiment

Here we focus mainly on the experiments on Cityscapes. During training, images are downsampled to 512x1024; during validation, the output is upsampled via bilinear interpolation to the original 1024x2048 size. Some training details are omitted here; see the training details of EDANet. The results are divided into two parts: the ablation study, and a comparison with other network architectures on Cityscapes.

Method	mIoU (%)	Params	Multi-Adds
EDANet	65.10	0.68M	8.97B
(a) Core module
EDA-non-asym	65.11	0.81M	11.41B
EDA-non-dense	63.92	0.73M	8.87B
(b) Extra context module
EDA-shallow	58.09	0.55M	7.77B
EDA-ASPP	60.64	3.41M	41.42B
(c) Decoder
EDA-ERFdec	65.56	0.78M	12.95B
(d) Downsampling block
EDA-DenseDown	61.63	0.42M	8.51B

Results of the ablation study
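To read the trade-offs off the ablation table, the following sketch computes how much EDANet saves relative to two of the variants. The numbers are copied from the table; the dictionary layout and the `compare` helper are illustrative.

```python
# (mIoU %, params in millions, multi-adds in billions), copied from the ablation table.
ABLATION = {
    "EDANet": (65.10, 0.68, 8.97),
    "EDA-non-asym": (65.11, 0.81, 11.41),
    "EDA-ERFdec": (65.56, 0.78, 12.95),
}


def compare(variant: str, baseline: str = "EDANet"):
    """Return (mIoU difference in points, fractional param saving,
    fractional multi-add saving) of `baseline` relative to `variant`."""
    v_miou, v_params, v_madds = ABLATION[variant]
    b_miou, b_params, b_madds = ABLATION[baseline]
    return b_miou - v_miou, 1 - b_params / v_params, 1 - b_madds / v_madds


for name in ("EDA-non-asym", "EDA-ERFdec"):
    d_miou, p_save, m_save = compare(name)
    print(f"EDANet vs {name}: {d_miou:+.2f} mIoU, "
          f"{p_save:.1%} fewer params, {m_save:.1%} fewer multi-adds")
```

By these table values, using asymmetric convolution saves roughly 16% of the parameters and 21% of the multi-adds for a 0.01-point mIoU difference, while dropping the ERFNet-style decoder saves roughly 31% of the multi-adds at a cost of 0.46 points.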

Experimental results on Cityscapes compared with other networks.

Method	Extra data	Sub	mIoU (%)	Time	Speed (FPS)	Params
ESPNet (Mehta et al. 2018)	no	2	60.3	8.9ms	112.9	0.36M
ENet (Paszke et al. 2016)	no	2	58.3	13ms	76.9	0.36M
ERFNet (Romera et al. 2017)	no	2	68.0	24ms	41.7	2.1M
ContextNet (Poudel et al. 2018)	no	2	66.1	55ms	18.3	0.85M
SegNet (Badrinarayanan, Kendall, and Cipolla 2015)	ImN	4	56.1	60ms	16.7	29.5M
FCN-8s (Long, Shelhamer, and Darrell 2015)	ImN	2	65.3	0.5s	2	134.5M
Dilation10 (Yu and Koltun 2016)	ImN	2	67.1	4s	0.25	140.8M
DFN (Yu et al. 2018)	ImN + coa.	2	80.3	n/a	n/a	n/a
DeepLabv3+ (Chen et al. 2018)	ImN + coa.	2	82.1	n/a	n/a	n/a
EDANet (ours)	no	2	67.3	12.3ms	81.3	0.68M

Comparison experiment results on Cityscapes against other networks (“Sub”: downsampling factor of the input images; “ImN”: ImageNet dataset; “coa.”: coarse annotation set of Cityscapes)

2018 · 10 · 30