Notes on TensorRT Inference Acceleration
I’ve been looking into model deployment lately, and there’s no getting around TensorRT. It isn’t a training framework and doesn’t try to be one. Its purpose is narrow and clear: accelerating inference for an already-trained neural network. Let me lay out how it works.
What it optimizes
TensorRT’s speedups come mainly from two things: fusing layers and choosing the fastest kernel for each one. The optimization targets can be broken down into the metrics that actually matter at inference time:
- latency
- throughput
- efficiency (power consumption)
- memory usage
- accuracy
When the hardware supports it, TensorRT can also run in mixed precision, using lower-precision formats (e.g. FP16/INT8) for some layers to squeeze down time and memory even further, usually at some cost in accuracy.
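To make the precision trade-off concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python. It assumes a single max-abs scale for the whole tensor and a clamp to [-127, 127]. Those choices, along with the names quantize_int8 and dequantize and the sample weights, are mine for illustration. This is not TensorRT's calibration procedure.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale.

    Assumption: max-abs scaling over the whole tensor, chosen for simplicity.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, just to show the round trip.
weights = [0.02, -1.27, 0.5, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest code loses at most half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Each value now fits in one byte instead of four, and the price is a bounded rounding error per element. Whether that error is acceptable for a given model is exactly the accuracy item in the list above.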
The overall framework looks roughly like this, and the blocks in the middle are the main optimization points:
A few optimizations on the graph
Once TensorRT has the network’s computation graph, it performs a series of rewrites at the graph level:
- Eliminating dead layers: layers whose outputs are never used are simply dropped.
- Layer fusion: common consecutive operations like convolution, bias, and ReLU are fused into a single kernel, saving the reads and writes of the intermediate results.
- Aggregating from a common source (horizontal fusion): operations that read the same source tensor and have sufficiently similar parameters are combined and computed together. A classic example is the parallel 1×1 convolutions in the GoogLeNet/Inception modules.
- Merging concat layers: instead of running a separate concatenation, the upstream layers write their outputs directly to where they ultimately belong, which avoids an extra copy.
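The first two rewrites in the list above can be sketched on a toy graph. This is only my own illustration of the idea, not how TensorRT represents or rewrites graphs internally. The dictionary layout, the layer names, and the functions eliminate_dead_layers and fuse_conv_bias_relu are all hypothetical.

```python
# Toy graph: layer name -> (op type, input names). "input" is the network input.
graph = {
    "conv1": ("conv", ["input"]),
    "bias1": ("bias", ["conv1"]),
    "relu1": ("relu", ["bias1"]),
    "debug": ("conv", ["relu1"]),  # nothing consumes this layer's output
    "pool1": ("pool", ["relu1"]),
}
outputs = ["pool1"]


def eliminate_dead_layers(graph, outputs):
    """Keep only the layers reachable by walking backwards from the outputs."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name in live or name not in graph:
            continue
        live.add(name)
        stack.extend(graph[name][1])
    return {name: layer for name, layer in graph.items() if name in live}


def fuse_conv_bias_relu(graph, outputs):
    """Collapse conv -> bias -> relu chains into one fused layer."""
    consumers = {}
    for name, (_, inputs) in graph.items():
        for src in inputs:
            consumers.setdefault(src, []).append(name)

    def only_user(name, op):
        # An intermediate can be fused away only if nothing else reads it.
        users = consumers.get(name, [])
        if name in outputs or len(users) != 1:
            return None
        return users[0] if graph[users[0]][0] == op else None

    absorbed, fused = set(), {}
    for name, (op, inputs) in graph.items():
        if op != "conv":
            continue
        bias = only_user(name, "bias")
        relu = only_user(bias, "relu") if bias else None
        if relu:
            absorbed.update([name, bias])
            # The fused layer keeps the relu's name so downstream inputs stay valid.
            fused[relu] = ("conv+bias+relu", inputs)

    return {
        name: fused.get(name, layer)
        for name, layer in graph.items()
        if name not in absorbed
    }


optimized = fuse_conv_bias_relu(eliminate_dead_layers(graph, outputs), outputs)
print(optimized)
```

After both passes, the five-layer toy graph is down to two layers: a fused conv+bias+relu that reads the input directly, and the pooling layer that consumes it. The intermediate tensors between conv, bias, and ReLU no longer exist, which is where the saved memory traffic comes from.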
The rough workflow
TensorRT can take models from various frameworks and exchange formats (TensorFlow, Caffe, ONNX, etc.) and convert them into its own engine format. The API is available in both C++ and Python, and the workflow has just a few main steps:
- Create a TensorRT network definition (describing the network structure)
- Call TensorRT’s builder to compile the network into an engine
- Serialize / deserialize the engine (so you don’t have to rebuild every time and can load it directly at deployment)
- Feed data into the engine and run inference
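The steps above can be sketched with the TensorRT Python API roughly as follows. Treat this as a sketch under assumptions: it assumes a TensorRT 8.x-style API and a model already exported to ONNX, and the helper names (build_serialized_engine, load_or_build, deserialize_engine) and file names are hypothetical. I've left out the final inference step, since it also needs device buffers managed through a CUDA binding.

```python
from pathlib import Path

try:
    import tensorrt as trt  # NVIDIA's TensorRT Python package
except ImportError:
    trt = None  # lets this file import on machines without TensorRT


def build_serialized_engine(onnx_path, use_fp16=True):
    """Steps 1-2: parse ONNX into a network definition, then build it."""
    if trt is None:
        raise RuntimeError("tensorrt is not installed")
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Explicit-batch flag as used with TensorRT 8.x (an assumption about version).
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    parser = trt.OnnxParser(network, logger)
    if not parser.parse(Path(onnx_path).read_bytes()):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    if use_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where supported

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed")
    return bytes(serialized)


def load_or_build(cache_path, build_fn):
    """Step 3 (save side): reuse engine bytes from disk, build only on a miss."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_bytes()
    data = build_fn()
    cache_path.write_bytes(data)
    return data


def deserialize_engine(serialized):
    """Step 3 (load side): turn serialized bytes back into an engine."""
    if trt is None:
        raise RuntimeError("tensorrt is not installed")
    runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
    engine = runtime.deserialize_cuda_engine(serialized)
    # Return the runtime too so the caller can keep it alive with the engine.
    return runtime, engine
```

A call might look like load_or_build("model.plan", lambda: build_serialized_engine("model.onnx")), followed by deserialize_engine on the returned bytes. Building is the slow part, so caching the serialized engine is what makes repeated loading cheap. Keep in mind that a built engine is tied to the GPU and TensorRT version it was built with, so the cache is only reusable on a matching setup.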
In short, what TensorRT does is: take a trained model, cut away everything it can at the graph level, then pick the fastest set of kernels for the current GPU, drop precision when needed, and finally package it all into an engine that can be deployed directly and loaded over and over.