NCNN Peak Memory Benchmark: A Layer-by-Layer Analysis of MobileNet

I’ve recently been studying the memory overhead of inference engines, trying to understand how much peak memory a model actually consumes during real inference, and how various optimization options (light_mode, fp16, int8) affect that peak. Using MobileNet as the primary subject, I ran a systematic set of tests with NCNN on the x86 platform, and along the way I also collected data for several common networks.

Layer-by-Layer Memory Derivation for MobileNet

Let’s start with the structure of MobileNet. Stored as fp32 (4 bytes per value), the weights total about 17M bytes. Tallying each layer at batch size 1 as “input + output + convolution parameters,” the per-layer figures sum to roughly 57M bytes, and the largest single layer (the first 1×1 pointwise convolution, 32→64 channels) needs about 4.8M bytes.

Layer	Conv Size	Input Size	Memory Usage (bs=1, input + output + conv params)
conv/s2	3×3×3×32	224×224×3	2211200 bytes
conv dw/s1	3×3×32	112×112×32	3212416 bytes
conv/s1	1×1×32×64	112×112×32	4825088 bytes
conv dw/s2	3×3×64	112×112×64	4016384 bytes
conv/s1	1×1×64×128	56×56×64	2441216 bytes
conv dw/s1	3×3×128	56×56×128	3215872 bytes
conv/s1	1×1×128×128	56×56×128	3276800 bytes
conv dw/s2	3×3×128	56×56×128	2011648 bytes
conv/s1	1×1×128×256	28×28×128	1335296 bytes
conv dw/s1	3×3×256	28×28×256	1614848 bytes
conv/s1	1×1×256×256	28×28×256	1867776 bytes
conv dw/s2	3×3×256	28×28×256	1012736 bytes
conv/s1	1×1×256×512	14×14×256	1126400 bytes
conv dw/s1	3×3×512	14×14×512	821248 bytes
conv/s1	1×1×512×512	14×14×512	1851392 bytes
… ×5
conv dw/s2	3×3×512	14×14×512	520192 bytes
conv/s1	1×1×512×1024	7×7×512	2398208 bytes
conv dw/s2	3×3×1024	7×7×1024	438272 bytes
conv/s1	1×1×1024×1024	7×7×1024	4595712 bytes
avg pool		7×7×1024
fc	1024×1000	1×1×1024	4104096 bytes
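
The table's figures can be reproduced with a few lines of Python. This is a minimal sketch under two assumptions that the table does not state: every value is stored as fp32 (4 bytes), and the "… ×5" row means the 512-channel depthwise/pointwise pair appears five times in total. Output shapes are taken from the next row's input size, the avg pool row is skipped because the table gives no figure for it, and the names `LAYERS` and `layer_bytes` are illustrative.

```python
from math import prod

BYTES_PER_VALUE = 4  # assumption: fp32 storage


def layer_bytes(kernel, in_shape, out_shape):
    """Bytes for one layer counted as input + output + parameters."""
    values = prod(in_shape) + prod(out_shape) + prod(kernel)
    return values * BYTES_PER_VALUE


# (name, kernel shape, input shape, output shape, repeat count)
LAYERS = [
    ("conv/s2",    (3, 3, 3, 32),      (224, 224, 3),  (112, 112, 32), 1),
    ("conv dw/s1", (3, 3, 32),         (112, 112, 32), (112, 112, 32), 1),
    ("conv/s1",    (1, 1, 32, 64),     (112, 112, 32), (112, 112, 64), 1),
    ("conv dw/s2", (3, 3, 64),         (112, 112, 64), (56, 56, 64),   1),
    ("conv/s1",    (1, 1, 64, 128),    (56, 56, 64),   (56, 56, 128),  1),
    ("conv dw/s1", (3, 3, 128),        (56, 56, 128),  (56, 56, 128),  1),
    ("conv/s1",    (1, 1, 128, 128),   (56, 56, 128),  (56, 56, 128),  1),
    ("conv dw/s2", (3, 3, 128),        (56, 56, 128),  (28, 28, 128),  1),
    ("conv/s1",    (1, 1, 128, 256),   (28, 28, 128),  (28, 28, 256),  1),
    ("conv dw/s1", (3, 3, 256),        (28, 28, 256),  (28, 28, 256),  1),
    ("conv/s1",    (1, 1, 256, 256),   (28, 28, 256),  (28, 28, 256),  1),
    ("conv dw/s2", (3, 3, 256),        (28, 28, 256),  (14, 14, 256),  1),
    ("conv/s1",    (1, 1, 256, 512),   (14, 14, 256),  (14, 14, 512),  1),
    ("conv dw/s1", (3, 3, 512),        (14, 14, 512),  (14, 14, 512),  5),
    ("conv/s1",    (1, 1, 512, 512),   (14, 14, 512),  (14, 14, 512),  5),
    ("conv dw/s2", (3, 3, 512),        (14, 14, 512),  (7, 7, 512),    1),
    ("conv/s1",    (1, 1, 512, 1024),  (7, 7, 512),    (7, 7, 1024),   1),
    ("conv dw/s2", (3, 3, 1024),       (7, 7, 1024),   (7, 7, 1024),   1),
    ("conv/s1",    (1, 1, 1024, 1024), (7, 7, 1024),   (7, 7, 1024),   1),
    ("fc",         (1024, 1000),       (1, 1, 1024),   (1, 1, 1000),   1),
]

# Sum over all layers, honoring the repeat count of the x5 pair
total_bytes = sum(layer_bytes(k, i, o) * n for _, k, i, o, n in LAYERS)
# Largest single-layer footprint
max_layer_bytes = max(layer_bytes(k, i, o) for _, k, i, o, _ in LAYERS)
# Parameters only
weight_bytes = sum(prod(k) * BYTES_PER_VALUE * n for _, k, _, _, n in LAYERS)

print(f"sum of per-layer usage: {total_bytes} bytes")
print(f"largest single layer:   {max_layer_bytes} bytes")
print(f"weights only:           {weight_bytes} bytes")
```

Under those assumptions the sum comes to 57,587,360 bytes, the largest single layer to 4,825,088 bytes, and the weights alone to 16,836,352 bytes, which agrees with the rounded figures quoted above.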

MCUNet reports a similar set of statistics, but its numbers differ considerably from my own calculations, so I’ve sent an email to ask about it. Judging from the MCUNet code, its peak-memory figure counts only the largest input + output activation pair and leaves out the weights. I recalculated that way as well, and the results still don’t match. Since weights also have to be in memory while a layer is computing, counting them seems more reasonable to me.

	Cloud AI (NVIDIA V100)	Mobile AI (iPhone 11)	Tiny AI (STM32F746)	ResNet-50	MobileNetV2	MobileNetV2 (int8)
Memory	16 GB	4 GB	320 kB	7.2 MB	6.8 MB	1.7 MB
Storage	TB~PB	>64 GB	1 MB	102 MB	13.6 MB	3.4 MB

Memory: Cloud→Mobile about 4×, Mobile→Tiny about 3100×; Storage: Cloud→Mobile about 1000×, Mobile→Tiny about 64000×. There is a huge gap between Tiny AI’s memory budget (320 kB) and the actual footprint of the three models on the right.
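
To put that gap in concrete terms, the model footprints in the Memory row can be divided by the 320 kB budget. A small sketch, assuming 1 MB = 1000 kB (the table does not say which convention it uses); the names are illustrative:

```python
SRAM_BUDGET_KB = 320  # Tiny AI (STM32F746) memory from the table

# Memory-row figures from the table, converted to kB (assumes 1 MB = 1000 kB)
MODEL_MEMORY_KB = {
    "ResNet-50": 7200,
    "MobileNetV2": 6800,
    "MobileNetV2 (int8)": 1700,
}


def over_budget(memory_kb, budget_kb=SRAM_BUDGET_KB):
    """Return how many times larger the footprint is than the budget."""
    return memory_kb / budget_kb


RATIOS = {name: over_budget(kb) for name, kb in MODEL_MEMORY_KB.items()}

for name, ratio in RATIOS.items():
    print(f"{name}: {ratio:.1f}x the {SRAM_BUDGET_KB} kB budget")
```

Even the int8 MobileNetV2 needs more than five times the memory the microcontroller has.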

Experimental Setup and Cross-Network Comparison

The tests ran on x86 Linux with the NCNN framework, a 224×224×3 input image, and models converted from ONNX. The conversion hit a few snags: some operations aren’t supported when converting from ONNX to NCNN, and the online ONNX simplifier isn’t great to work with, so I recommend cloning it and running the conversion locally instead.

First I ran a baseline experiment: with the same runtime libraries loaded but no inference performed, peak memory was about 16M bytes (stripping out some libraries could reduce this further). Running inference repeatedly in a loop does not raise the peak. With the model loaded, NCNN’s baseline peak memory was about 76M bytes.

The table below shows the measured peak for each network, taken from the VmPeak field (peak virtual memory size) of /proc/<pid>/status:

Neural Network	VGG16	AlexNet	GoogleNet	ResNet18	ResNet50	DenseNet161	ShuffleNetV2	MobileNet	MobileNetV2
Model Size (onnx file)	527MB	233MB	25.2MB	44.5MB	97.4MB	110MB	8.67MB	16.1MB	13.5MB
Peak Memory (VmPeak)	1601.4MB	549.7MB	—	410.8MB	473.3MB	504.4MB	48.2MB	74.2MB	73.4MB
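
For reference, VmPeak can be read straight from /proc on Linux. The sketch below is one way to do it from Python; the helper names are hypothetical and the sample text is made up for the demonstration, not taken from the measurements above. Note that the kernel's "kB" in this file means units of 1024 bytes.

```python
import os
import re


def parse_vm_peak_kb(status_text):
    """Extract the VmPeak value (in kB) from /proc/<pid>/status text.

    Returns None when the field is absent (e.g. for kernel threads).
    """
    match = re.search(r"^VmPeak:\s+(\d+)\s+kB\s*$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_vm_peak_kb(pid="self"):
    """Read VmPeak for a process; returns None where /proc is unavailable."""
    path = f"/proc/{pid}/status"
    if not os.path.exists(path):
        return None
    with open(path) as status_file:
        return parse_vm_peak_kb(status_file.read())


# Made-up sample in the layout the kernel uses for this file
SAMPLE_STATUS = "Name:\tdemo\nVmPeak:\t  123456 kB\nVmSize:\t  120000 kB\n"

print("sample VmPeak (kB):", parse_vm_peak_kb(SAMPLE_STATUS))
print("this process VmPeak (kB):", read_vm_peak_kb())
```

Because VmPeak is a high-water mark, reading it once at the end of a run is enough; there is no need to poll during inference.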

How light_mode and fp16/int8 Affect Peak Memory

Taking MobileNetV2 as an example, I toggled NCNN’s light_mode, fp16, and int8 options one by one and measured peak memory, with the following results:

light_mode=false, peak about 86M:

PID	USER	PR	NI	VIRT	RES	SHR	S	%CPU	%MEM	TIME+	COMMAND
18179	root	20	0	114668	94992	5468	R	49.5	4.7	0:09.71	check_peak_memo

light_mode=true, peak about 76M:

PID	USER	PR	NI	VIRT	RES	SHR	S	%CPU	%MEM	TIME+	COMMAND
18597	root	20	0	75140	55440	5508	R	49.3	2.7	0:10.27	check_peak_memo

light_mode=true, with fp16 also enabled (use_fp16_packed, use_fp16_storage, and use_fp16_arithmetic all true), peak still about 76M:

PID	USER	PR	NI	VIRT	RES	SHR	S	%CPU	%MEM	TIME+	COMMAND
18871	root	20	0	75140	55360	5436	R	49.3	2.7	0:15.72	check_peak_memo

light_mode=true, fp16 fully enabled, plus int8 (use_int8_storage and use_int8_arithmetic also true), and the peak still doesn’t drop noticeably:

PID	USER	PR	NI	VIRT	RES	SHR	S	%CPU	%MEM	TIME+	COMMAND
19148	root	20	0	75140	55296	5364	R	49.2	2.7	0:10.44	check_peak_memo

This leads to the first conclusion: enabling fp16 or int8 does not reduce peak memory in these tests. The only improvement, roughly 10M, comes from light_mode=true, which reuses memory by releasing intermediate feature maps as soon as they are no longer needed and has nothing to do with numeric precision.
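
The effect of releasing intermediates early can be illustrated with a toy model. This is not NCNN's allocator, only a sketch assuming a purely sequential network in which each layer reads the previous blob and writes a new one. The blob sizes are the fp32 activation sizes of the first few MobileNet layers from the table earlier in this post, and `peak_bytes` is an illustrative name.

```python
def peak_bytes(blob_sizes, release_early):
    """Peak live memory for a sequential chain of blobs.

    blob_sizes[0] is the network input; blob_sizes[i] is the output of layer i.
    With release_early=True, a layer's input is freed right after the layer runs.
    """
    live = blob_sizes[0]
    peak = live
    for i in range(1, len(blob_sizes)):
        live += blob_sizes[i]          # allocate this layer's output
        peak = max(peak, live)         # input and output coexist while computing
        if release_early:
            live -= blob_sizes[i - 1]  # input is no longer needed downstream
    return peak


# fp32 activation sizes in bytes (assumes 4 bytes per value)
BLOBS = [
    224 * 224 * 3 * 4,    # network input
    112 * 112 * 32 * 4,   # after conv/s2
    112 * 112 * 32 * 4,   # after conv dw/s1
    112 * 112 * 64 * 4,   # after conv/s1 1x1x32x64
    56 * 56 * 64 * 4,     # after conv dw/s2
]

keep_all = peak_bytes(BLOBS, release_early=False)
free_early = peak_bytes(BLOBS, release_early=True)
print(f"keep every blob:  {keep_all} bytes")
print(f"release promptly: {free_early} bytes")
```

In this toy chain, keeping every blob alive peaks at 7,827,456 bytes, while freeing each input once its consumer has run peaks at 4,816,896 bytes. The saving comes entirely from blob lifetime, which is consistent with precision settings making no difference.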

Differences in How “Peak Memory” Is Defined

There’s a noteworthy issue here: papers, especially those targeting embedded devices such as MCUNet, compute peak memory from the input and output feature maps only and leave out the operators’ weights. In MCU scenarios the weights typically reside in Flash and are read directly during inference without occupying SRAM, so this definition is justified there. On a Linux inference engine such as NCNN, however, the weights stay resident in memory, so figures under the two conventions are naturally not on the same order of magnitude. Keep this in mind when comparing data from different sources.
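
The difference between the two conventions is easiest to see on the late layers, where weights dominate. A minimal sketch, assuming fp32 values and using layer shapes from the MobileNet table earlier in this post; `footprint_bytes` and its `include_weights` flag are illustrative names, not part of MCUNet or NCNN.

```python
from math import prod


def footprint_bytes(in_shape, out_shape, kernel_shape, include_weights,
                    bytes_per_value=4):
    """Per-layer footprint: activations only, or activations plus weights."""
    values = prod(in_shape) + prod(out_shape)
    if include_weights:
        values += prod(kernel_shape)
    return values * bytes_per_value


# name -> (input shape, output shape, kernel shape)
EXAMPLES = {
    "conv/s1 1x1x32x64":     ((112, 112, 32), (112, 112, 64), (1, 1, 32, 64)),
    "conv/s1 1x1x1024x1024": ((7, 7, 1024),   (7, 7, 1024),   (1, 1, 1024, 1024)),
    "fc 1024x1000":          ((1, 1, 1024),   (1, 1, 1000),   (1024, 1000)),
}

RESULTS = {}
for name, (in_shape, out_shape, kernel) in EXAMPLES.items():
    with_weights = footprint_bytes(in_shape, out_shape, kernel, True)
    activations_only = footprint_bytes(in_shape, out_shape, kernel, False)
    RESULTS[name] = (with_weights, activations_only)
    print(f"{name}: {with_weights} bytes with weights, "
          f"{activations_only} bytes activations only")
```

For the first pointwise layer the two counts are almost identical, but for the last 1×1 convolution and the fully connected layer the activation-only count is smaller by roughly 11× and 500× respectively.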


2021 · 06 · 12