NCNN Peak Memory Benchmark: A Layer-by-Layer Analysis of MobileNet
I’ve recently been studying the memory overhead of inference engines, trying to understand how much peak memory a model actually consumes during real inference, and how various optimization options (light_mode, fp16, int8) affect that peak. Using MobileNet as the primary subject, I ran a systematic set of tests with NCNN on the x86 platform, and along the way I also collected data for several common networks.
Layer-by-Layer Memory Derivation for MobileNet
Let’s start with the structure of MobileNet. Its parameters total about 17M bytes. Tallying each layer as “input + output + convolution parameters” at 4 bytes per value (fp32), the per-layer memory usages add up to roughly 57M bytes, and the largest single layer needs about 4.8M bytes in theory.
| Layer | Conv Size | Input Size | Memory Usage (bs=1, input + output + conv params) |
|---|---|---|---|
| conv/s2 | 3×3×3×32 | 224×224×3 | 2211200 bytes |
| conv dw/s1 | 3×3×32 | 112×112×32 | 3212416 bytes |
| conv/s1 | 1×1×32×64 | 112×112×32 | 4825088 bytes |
| conv dw/s2 | 3×3×64 | 112×112×64 | 4016384 bytes |
| conv/s1 | 1×1×64×128 | 56×56×64 | 2441216 bytes |
| conv dw/s1 | 3×3×128 | 56×56×128 | 3215872 bytes |
| conv/s1 | 1×1×128×128 | 56×56×128 | 3276800 bytes |
| conv dw/s2 | 3×3×128 | 56×56×128 | 2011648 bytes |
| conv/s1 | 1×1×128×256 | 28×28×128 | 1335296 bytes |
| conv dw/s1 | 3×3×256 | 28×28×256 | 1614848 bytes |
| conv/s1 | 1×1×256×256 | 28×28×256 | 1867776 bytes |
| conv dw/s2 | 3×3×256 | 28×28×256 | 1012736 bytes |
| conv/s1 | 1×1×256×512 | 14×14×256 | 1126400 bytes |
| conv dw/s1 | 3×3×512 | 14×14×512 | 821248 bytes |
| conv/s1 | 1×1×512×512 | 14×14×512 | 1851392 bytes |
| (the two rows above occur ×5 in total) | | | |
| conv dw/s2 | 3×3×512 | 14×14×512 | 520192 bytes |
| conv/s1 | 1×1×512×1024 | 7×7×512 | 2398208 bytes |
| conv dw/s2 | 3×3×1024 | 7×7×1024 | 438272 bytes |
| conv/s1 | 1×1×1024×1024 | 7×7×1024 | 4595712 bytes |
| avg pool | 7×7×1024 | ||
| fc | 1024×1000 | 1×1×1024 | 4104096 bytes |
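The table can be reproduced with a short script. This is a sketch under two assumptions that match the listed byte counts: every value is fp32 (4 bytes), and the ×5 row means the 14×14×512 depthwise/pointwise pair occurs five times in total. The helper names (`conv`, `depthwise`, `mobilenet_layers`, `layer_bytes`) are made up for illustration, and the avg pool row is left out because the table gives no byte count for it.

```python
BYTES_PER_VALUE = 4  # assumption: fp32 activations and weights


def conv(name, k, cin, cout, size, stride=1):
    """Standard k x k convolution on a size x size x cin input (bias and BN ignored)."""
    out = size // stride
    return name, k * k * cin * cout, size * size * cin, out * out * cout


def depthwise(name, c, size, stride=1):
    """3 x 3 depthwise convolution: one 3 x 3 filter per channel."""
    out = size // stride
    return name, 3 * 3 * c, size * size * c, out * out * c


def mobilenet_layers():
    """Return (name, weight count, input values, output values) per layer."""
    layers = [
        conv("conv/s2", 3, 3, 32, 224, stride=2),
        depthwise("conv dw/s1", 32, 112),
        conv("conv/s1", 1, 32, 64, 112),
        depthwise("conv dw/s2", 64, 112, stride=2),
        conv("conv/s1", 1, 64, 128, 56),
        depthwise("conv dw/s1", 128, 56),
        conv("conv/s1", 1, 128, 128, 56),
        depthwise("conv dw/s2", 128, 56, stride=2),
        conv("conv/s1", 1, 128, 256, 28),
        depthwise("conv dw/s1", 256, 28),
        conv("conv/s1", 1, 256, 256, 28),
        depthwise("conv dw/s2", 256, 28, stride=2),
        conv("conv/s1", 1, 256, 512, 14),
    ]
    for _ in range(5):  # assumption: the pair appears five times in total
        layers.append(depthwise("conv dw/s1", 512, 14))
        layers.append(conv("conv/s1", 1, 512, 512, 14))
    layers += [
        depthwise("conv dw/s2", 512, 14, stride=2),
        conv("conv/s1", 1, 512, 1024, 7),
        # Listed as s2 in the table, but its byte count implies a 7 x 7 output.
        depthwise("conv dw/s2", 1024, 7, stride=1),
        conv("conv/s1", 1, 1024, 1024, 7),
        ("fc", 1024 * 1000, 1024, 1000),
    ]
    return layers


def layer_bytes(layer):
    """Bytes for input + output + weights of one layer."""
    _, params, n_in, n_out = layer
    return (params + n_in + n_out) * BYTES_PER_VALUE


if __name__ == "__main__":
    layers = mobilenet_layers()
    per_layer = [layer_bytes(layer) for layer in layers]
    print("sum of per-layer usage:", sum(per_layer))
    print("largest single layer:  ", max(per_layer))
    print("weights only:          ", sum(l[1] for l in layers) * BYTES_PER_VALUE)
```

Under these assumptions the script reports 57,587,360 bytes for the per-layer sum, 4,825,088 bytes for the largest layer, and 16,836,352 bytes of weights, which is consistent with the rough figures of 57M, 4.8M, and 17M bytes quoted above.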
I found a similar set of statistics in MCUNet, but they differ considerably from my own calculations, so I’ve sent an email to ask about it. Judging from the MCUNet code, its counting method includes only the largest input + output activations and leaves out the weights themselves. I recalculated using that approach as well, but the result still doesn’t match. Since weights also have to be in memory when they take part in the computation, counting them seems more reasonable to me.
| | Cloud AI (NVIDIA V100) | Mobile AI (iPhone 11) | Tiny AI (STM32F746) | ResNet-50 | MobileNetV2 | MobileNetV2 (int8) |
|---|---|---|---|---|---|---|
| Memory | 16 GB | 4 GB | 320 kB | 7.2 MB | 6.8 MB | 1.7 MB |
| Storage | TB~PB | >64 GB | 1 MB | 102 MB | 13.6 MB | 3.4 MB |
Memory: Cloud→Mobile about 4×, Mobile→Tiny about 3100× (although dividing the table’s 4 GB by 320 kB gives roughly 12,500×); Storage: Cloud→Mobile about 1000×, Mobile→Tiny about 64000×. Tiny AI’s memory budget (320 kB) is far below the actual footprint of any of the three models on the right.
Experimental Setup and Cross-Network Comparison
The tests were run on an x86 Linux platform with an input image size of 224×224×3, using the NCNN framework, with models converted from ONNX. I hit a few snags here: some operations aren’t supported when converting from ONNX to NCNN, and the online ONNX simplifier was awkward to work with, so I recommend cloning it and converting locally instead.
First I ran a baseline experiment: with the same runtime libraries loaded but no inference performed, peak memory was about 16M bytes (stripping out some libraries could reduce this further). After loading the model, NCNN’s baseline peak memory was about 76M bytes. Running inference repeatedly in a loop does not increase the peak.
The table below shows the measured data for each network, as reported by VmPeak:
| Neural Network | VGG16 | AlexNet | GoogleNet | ResNet18 | ResNet50 | DenseNet161 | ShuffleNetV2 | MobileNet | MobileNetV2 |
|---|---|---|---|---|---|---|---|---|---|
| Model Size (onnx file) | 527MB | 233MB | 25.2MB | 44.5MB | 97.4MB | 110MB | 8.67MB | 16.1MB | 13.5MB |
| Peak Memory (VmPeak) | 1601.4MB | 549.7MB | — | 410.8MB | 473.3MB | 504.4MB | 48.2MB | 74.2MB | 73.4MB |
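For reference, VmPeak can be read from /proc/&lt;pid&gt;/status on Linux. The sketch below is one way to do that from Python; it is not the harness used for the numbers above, and the function names are illustrative. The kernel labels the unit "kB", but the values are in units of 1024 bytes.

```python
import re


def parse_status_kib(status_text, field="VmPeak"):
    """Return the value of one Vm* field (in KiB) from /proc/<pid>/status text.

    Returns None when the field is not present.
    """
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB"
    match = re.search(pattern, status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_status_kib(pid="self", field="VmPeak"):
    """Read a Vm* field for a process; Linux only, since it relies on /proc."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status_kib(f.read(), field)


if __name__ == "__main__":
    # Report this Python process's own peaks as a smoke test.
    for field in ("VmPeak", "VmHWM"):
        kib = read_status_kib(field=field)
        if kib is not None:
            print(f"{field}: {kib} KiB ({kib / 1024:.1f} MiB)")
```

VmPeak is the peak virtual memory size, while VmHWM is the peak resident set size. Both are high-water marks, so they can be read once at the end of a run instead of being sampled while inference is in progress.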
How light_mode and fp16/int8 Affect Peak Memory
Taking MobileNetV2 as an example, I toggled NCNN’s light_mode option (lightmode in ncnn::Option) and its reduced-precision options (fp16, int8) one by one and measured peak memory, with the following results:
light_mode=false, peak about 86M:
| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 18179 | root | 20 | 0 | 114668 | 94992 | 5468 | R | 49.5 | 4.7 | 0:09.71 | check_peak_memo |
light_mode=true, peak about 76M:
| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 18597 | root | 20 | 0 | 75140 | 55440 | 5508 | R | 49.3 | 2.7 | 0:10.27 | check_peak_memo |
light_mode=true, with fp16 also enabled (use_fp16_packed, use_fp16_storage, and use_fp16_arithmetic all true), peak still about 76M:
| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 18871 | root | 20 | 0 | 75140 | 55360 | 5436 | R | 49.3 | 2.7 | 0:15.72 | check_peak_memo |
light_mode=true, fp16 fully enabled, plus int8 (use_int8_storage and use_int8_arithmetic also true), and the peak still doesn’t drop noticeably:
| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 19148 | root | 20 | 0 | 75140 | 55296 | 5364 | R | 49.2 | 2.7 | 0:10.44 | check_peak_memo |
This leads to the first conclusion: in these runs, enabling fp16 or int8 does nothing to reduce peak memory. The improvement of roughly 10M comes from light_mode=true, which promptly releases intermediate feature maps so that their memory can be reused, and this has nothing to do with quantization precision.
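The effect of releasing intermediate feature maps can be illustrated with a toy model. The sketch below assumes a purely linear chain of layers and counts activations only; it is not NCNN's allocator, and `peak_live_bytes` is a made-up helper.

```python
def peak_live_bytes(blob_bytes, release_consumed):
    """Peak bytes of live activations while running a linear chain of layers.

    blob_bytes[0] is the network input; blob_bytes[i] is the output of layer i.
    With release_consumed=True, a layer's input is freed as soon as its output
    has been produced, which is safe only because the chain is linear.
    """
    live = peak = blob_bytes[0]
    for i in range(1, len(blob_bytes)):
        live += blob_bytes[i]             # allocate the output of layer i
        peak = max(peak, live)            # input and output coexist here
        if release_consumed:
            live -= blob_bytes[i - 1]     # the input is no longer needed
    return peak


if __name__ == "__main__":
    # fp32 sizes of the first few MobileNet blobs, taken from the layer table.
    values = (224 * 224 * 3, 112 * 112 * 32, 112 * 112 * 32,
              112 * 112 * 64, 56 * 56 * 64)
    blobs = [n * 4 for n in values]
    print("keep everything:  ", peak_live_bytes(blobs, release_consumed=False))
    print("release when done:", peak_live_bytes(blobs, release_consumed=True))
```

For these five blobs, keeping everything alive peaks at their sum (7,827,456 bytes), while releasing consumed inputs peaks at the largest adjacent input/output pair (4,816,896 bytes). The precision of the values changes both numbers by the same factor, but it is the release policy that decides which of the two applies.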
Differences in How “Peak Memory” Is Defined
There’s a noteworthy issue here: when papers (especially work targeting embedded devices, such as MCUNet) compute peak memory, they count only the memory footprint of the input and output feature maps and leave out the operators’ own weights. In MCU scenarios the weights typically reside in Flash and are read directly during inference without occupying SRAM, so this definition has its own justification. But when measuring on a Linux-platform inference engine (such as NCNN), the weights stay resident in memory, so the numbers under the two conventions are naturally not of the same order of magnitude. Keep this in mind when comparing data from different sources.
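The gap between the two conventions can be made concrete with a small calculation. The sketch below uses three rows from the MobileNet layer table and assumes fp32 values; the function names are illustrative, and a real process also carries runtime libraries and allocator overhead that this ignores.

```python
BYTES_PER_VALUE = 4  # assumption: fp32

# (name, weight count, input values, output values): three rows from the table
LAYERS = [
    ("conv/s1 1x1x32x64", 32 * 64, 112 * 112 * 32, 112 * 112 * 64),
    ("conv/s1 1x1x1024x1024", 1024 * 1024, 7 * 7 * 1024, 7 * 7 * 1024),
    ("fc 1024x1000", 1024 * 1000, 1024, 1000),
]


def activation_only_peak(layers):
    """MCU-style count: weights stay in Flash, so only the busiest layer's
    input + output activations occupy SRAM."""
    return max(n_in + n_out for _, _, n_in, n_out in layers) * BYTES_PER_VALUE


def resident_weights_peak(layers):
    """Linux-style count: every layer's weights stay loaded, plus the busiest
    layer's activations."""
    weights = sum(w for _, w, _, _ in layers) * BYTES_PER_VALUE
    return weights + activation_only_peak(layers)


if __name__ == "__main__":
    print("activations only:  ", activation_only_peak(LAYERS))
    print("weights + activations:", resident_weights_peak(LAYERS))
```

Even for these three layers, the activation-only peak is 4,816,896 bytes, while keeping the weights resident brings the figure to 13,115,392 bytes. With the full network's weights loaded, the difference grows further.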