A Quantitative Analysis of PyTorch Training Acceleration

Training a neural network is often the most time-consuming part of the entire pipeline, especially when training a randomly initialized network to convergence on a large dataset. Generally speaking, better hardware can make training faster — for example, using more and better GPUs, using larger and faster SSDs, and replacing Ethernet with InfiniBand in distributed training. Beyond the hardware, software and algorithms matter just as much; a good software stack should make the fullest possible use of the existing hardware resources. Starting from a baseline, this article progressively optimizes training speed through a variety of methods. The baseline we choose is training ResNet18 on the ImageNet dataset for a total of 90 epochs, with the initial compute platform shown in Table 1:

Table 1: Baseline configuration

Analyzing the Time Bottlenecks During Training

Here we use the NVIDIA Tools Extension Library (NVTX) to measure the time cost of each part of the training process. It only requires inserting a few lines of code into the training script and then launching training with nsys profile python3 main.py, which produces a .qdrep report file. Within each batch we measure the time spent on data loading, the forward pass, the backward pass, and gradient descent separately. To save time, we collected statistics over 300 batches of training.

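import torch.cuda.nvtx as nvtx

# Open the ranges for the first batch before the loop, because the DataLoader
# fetches a batch while the for statement advances.
nvtx.range_push("Batch 0")
nvtx.range_push("Load Data")
for i, (input_data, target) in enumerate(train_loader):
    input_data = input_data.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    nvtx.range_pop()  # close "Load Data"
    nvtx.range_push("Forward")
    output = model(input_data)
    nvtx.range_pop()  # close "Forward"
    nvtx.range_push("Calculate Loss/Sync")
    loss = criterion(output, target)
    prec1, prec5 = accuracy(output, target, topk=(1, 5))
    optimizer.zero_grad()
    nvtx.range_pop()  # close "Calculate Loss/Sync"
    nvtx.range_push("Backward")
    loss.backward()
    nvtx.range_pop()  # close "Backward"
    nvtx.range_push("SGD")
    optimizer.step()
    nvtx.range_pop()  # close "SGD"
    nvtx.range_pop()  # close the current "Batch i"
    # Open the ranges for the next batch so that its loading time is captured.
    nvtx.range_push("Batch " + str(i + 1))
    nvtx.range_push("Load Data")
# Close the two ranges that were opened for a batch that never arrived.
nvtx.range_pop()
nvtx.range_pop()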
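
Manual push/pop pairs like the ones above are easy to unbalance when the loop body changes. The following is a minimal sketch of one way to keep them paired, assuming a hypothetical helper named profiled_range (it is not part of PyTorch). It takes the push and pop functions as arguments so that it can run without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def profiled_range(name, push, pop):
    """Open a named range with push(name) and always close it with pop()."""
    push(name)
    try:
        yield
    finally:
        pop()  # runs even if the body raises, so ranges stay balanced


# Stand-in recorder so the sketch runs without CUDA. In a training script,
# pass torch.cuda.nvtx.range_push and torch.cuda.nvtx.range_pop instead.
events = []


def record_push(name):
    events.append(("push", name))


def record_pop():
    events.append(("pop", None))


with profiled_range("Batch 0", record_push, record_pop):
    with profiled_range("Forward", record_push, record_pop):
        pass  # model(input_data) would go here
    with profiled_range("Backward", record_push, record_pop):
        pass  # loss.backward() would go here
```

With this shape, each range is closed by the with statement that opened it, at the cost of one extra indentation level per range.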

When training with the baseline code, we visualized the generated .qdrep file with NVIDIA Nsight Systems, with the results shown in Figure 1. Training 300 batches took 336.524s. It is worth noting, however, that the CUDA kernels ran for only about a quarter of the total time, while three quarters of the time was spent on data loading, during which both CUDA kernel and CPU utilization were nearly 0. Data loading is therefore the performance bottleneck of the baseline code.

Figure 1: Profiling results for the baseline

Looking more closely at the training time within an individual batch (for example, batch 122), we can see that transferring data from host memory to device memory took 13.6ms (data size of 224*224*3*256*4 = 154,140,672 bytes, bandwidth 11.3GB/s); in the forward pass, launching CUDA kernels took 15.4ms while the kernels actually ran for 67.3ms; in the backward pass, launching CUDA kernels took 19.9ms while the kernels actually ran for 164.7ms; in SGD, launching CUDA kernels took 8.0ms while the kernels actually ran for 1.8ms.
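
The transfer size and bandwidth quoted above can be checked with a few lines of arithmetic. This sketch only restates the numbers from the text (a batch of 256 FP32 images at 224x224x3 and a 13.6ms copy) and uses decimal gigabytes.

```python
# One FP32 batch of 256 RGB images at 224x224, as in the text.
batch_size, channels, height, width = 256, 3, 224, 224
bytes_per_float32 = 4
batch_bytes = batch_size * channels * height * width * bytes_per_float32

copy_time_s = 13.6e-3  # measured host-to-device copy time from the text
bandwidth_gb_per_s = batch_bytes / copy_time_s / 1e9  # decimal GB per second

print(batch_bytes)
print(round(bandwidth_gb_per_s, 1))
```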

Speeding Up Data Loading

From the experimental results in the previous section, we can see that three quarters of the entire training process was spent on data loading, during which both CPU and GPU utilization were nearly 0, wasting hardware resources. We therefore start by optimizing the I/O. The data loading in the baseline training code is slow mainly because, during training, PyTorch reads raw images one by one in formats such as .jpeg: these not only need to be decoded, but the reads are also non-contiguous, with many cache misses, so the I/O time is long. Frameworks like TensorFlow, on the other hand, have their own .tfrecord format — and there are similar formats such as hdf5 and lmdb — which store the entire dataset in one large binary file, so data can be read contiguously and much more efficiently.
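
To make the idea of "one large binary file" concrete, here is a minimal sketch that packs many small samples into a single file with an offset index. It is only an illustration of the access pattern, not the actual on-disk layout of .tfrecord, hdf5, or lmdb, and the function names (pack_records, read_sequential, read_at) are hypothetical.

```python
import os
import struct
import tempfile


def pack_records(records, path):
    """Write length-prefixed records back to back; return each record's offset."""
    offsets = []
    with open(path, "wb") as f:
        for payload in records:
            offsets.append(f.tell())
            f.write(struct.pack("<Q", len(payload)))  # 8-byte little-endian length
            f.write(payload)
    return offsets


def read_sequential(path):
    """Stream every record in storage order with one forward pass over the file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            yield f.read(length)


def read_at(path, offset):
    """Fetch a single record by offset (the access pattern of a shuffled epoch)."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack("<Q", f.read(8))
        return f.read(length)


# Fake payloads stand in for encoded image bytes.
samples = [bytes([i]) * (i + 1) for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    packed = os.path.join(tmp, "dataset.bin")
    index = pack_records(samples, packed)
    in_order = list(read_sequential(packed))
    third = read_at(packed, index[2])
```

Reading in storage order touches the file strictly front to back, whereas read_at jumps to arbitrary offsets, which is the pattern a shuffled epoch produces.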

When optimizing I/O, you need to choose a reasonable approach according to the hardware environment. If the compute platform’s memory is not large enough to hold the entire dataset, you can use lmdb (hdf5 requires loading the entire file into memory). We ran two groups of comparison experiments — random reads (shuffled training data) and contiguous reads (unshuffled training data) — with the results shown in Table 2. If the dataset is not shuffled, reading with lmdb is very fast; the profiling results are shown in Figure 2, where after training about 50 batches the CUDA kernels are almost continuously busy, thanks to the reduced cache miss rate from contiguous reads. With shuffling, however, it actually becomes slower.

Table 2: Results of reading data with lmdb
Figure 2: Profiling results when reading with lmdb (no shuffle)

If you have enough memory, the bluntest option is to put the entire dataset directly into memory. Concretely, this can be done by mounting a tmpfs filesystem and placing the entire dataset in that folder. If you do not have permission to mount, you can also place the dataset under the /dev/shm directory, which is likewise a tmpfs filesystem; a size of 160G is usually enough to hold the entire ImageNet dataset.

# With mount permission
mkdir -p /userhome/memory_data/imagenet
mount -t tmpfs -o size=160G tmpfs /userhome/memory_data/imagenet
root_dir="/userhome/memory_data/imagenet"
# Without mount permission (use these two lines instead of the mount above)
mkdir -p /dev/shm/imagenet
root_dir="/dev/shm/imagenet"
mkdir -p "${root_dir}/train"
mkdir -p "${root_dir}/val"
tar -xvf /userhome/data/ILSVRC2012_img_train.tar -C "${root_dir}/train"
tar -xvf /userhome/data/ILSVRC2012_img_val.tar -C "${root_dir}/val"
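
Because tmpfs is backed by RAM, it may be worth checking the available space before extracting. This is a minimal sketch using the standard library; the helper name has_room is hypothetical, and the 160G figure from the text is read here as GiB.

```python
import shutil
import tempfile


def has_room(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


REQUIRED = 160 * 1024**3  # the 160G figure from the text, read as GiB

# On the training machine, point this at /dev/shm or the tmpfs mount point.
# The temp directory is only a portable stand-in so the sketch runs anywhere.
target = tempfile.gettempdir()
print(target, "has room:", has_room(target, REQUIRED))
```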

After placing the dataset in a tmpfs filesystem, we retrained with the baseline code, with the results shown in Figure 3. The CUDA kernels are now almost continuously at full load, and the overall training time dropped from the baseline’s 336.524s to 87.989s (26% of the baseline, roughly a 3.8x speedup).

Figure 3: Profiling results after optimizing I/O

We again measured the training time within an individual batch, with the change relative to the baseline in parentheses: transferring data from host memory to device memory took 15.4ms (1.8ms ↑); in the forward pass, launching CUDA kernels took 7.9ms (7.5ms ↓) while the kernels actually ran for 67.5ms (0.2ms ↑); in the backward pass, launching CUDA kernels took 16.6ms (3.3ms ↓) while the kernels actually ran for 164.3ms (0.4ms ↓); in SGD, launching CUDA kernels took 5.5ms (2.5ms ↓) while the kernels actually ran for 1.9ms (0.1ms ↑). As we can see, the time the kernels spend on computation is almost identical to the baseline.

Mixed-Precision Training

As the previous section showed, after optimizing data loading the CUDA kernels stay at full load, so any further speedup has to come from the computation itself (mainly the forward and backward passes). PyTorch relies on cuDNN kernels for this computation, and we cannot optimize those kernels ourselves; what we can do is convert part of the FP32 computation to FP16, that is, use mixed-precision training.

A relatively convenient approach to mixed-precision training is to use Apex (A PyTorch Extension), which only requires adding a few lines of code to the original script and also lets you choose among different FP16 training schemes.

from apex import amp, optimizers
# Allow Amp to perform casts as required by the opt_level
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
# loss.backward() becomes:
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

There are several choices for opt_level: O0 trains entirely in FP32; O1 uses FP16 for some operations; O2 uses FP16 for the vast majority of operations; and O3 uses FP16 for everything. With O1 and O2, FP16 operators (convolution, fully connected layers, etc.) use FP16 for the forward and backward passes, but a set of FP32 weights is maintained for updates during gradient descent (Figure 4), while FP32 operators (BN, etc.) use FP32 for the forward and backward passes; note that both kinds of operators use FP32 for gradient descent. Training with FP16 carries some risk of divergence, so which scheme to choose must be a trade-off between speedup and preventing divergence, based on the actual situation. Building on the fastest scheme from the previous section, we ran further experiments with FP16 training, with the results shown in Table 3. We can see that FP16 training noticeably improves the forward and backward speed compared to FP32, and the overall training time also drops fairly noticeably.

Figure 4: Mixed-precision training
Table 3: Speed/accuracy comparison of different FP16 optimization schemes
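
Two numerical effects explain why amp.scale_loss and the FP32 copy of the weights are needed. The sketch below reproduces both with the standard library only, using struct's half-precision format to round values to FP16. The gradient, update, and loss-scale values are illustrative assumptions, and Python floats (FP64) stand in for FP32.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# 1) A small gradient underflows to zero in FP16 unless it is scaled first.
tiny_grad = 1e-8
loss_scale = 65536.0  # illustrative power-of-two scale
unscaled = to_fp16(tiny_grad)  # flushed to zero
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale  # unscaled in higher precision

# 2) A small update vanishes when added to an FP16 weight, which is why a
#    higher-precision master copy of the weights is kept for the update.
weight, update = 1.0, 1e-4
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
master_result = weight + update  # FP64 here, standing in for FP32

print(unscaled, recovered, fp16_result, master_result)
```

In the first case the scaled gradient survives the FP16 round trip and is recovered after dividing by the scale; in the second, the FP16 weight does not move at all while the higher-precision copy does.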

Up to this section, we have been training on a single GPU, and the network’s training speed has been progressively accelerated by various methods: starting from the baseline’s 336.524s (Figure 1), accelerated to 87.989s after optimizing I/O (Figure 3), and then accelerated to 70.1s through mixed-precision training without any loss of accuracy.

Single-Machine Multi-GPU Parallel Training

Building on the optimized data loading and mixed-precision training, this section further accelerates with single-machine multi-GPU parallel training. For now we focus only on data parallelism, and do not yet consider model parallelism or even operator parallelism.

There are generally two ways to implement single-machine multi-GPU training in PyTorch: one is to use nn.DataParallel, the other is to use nn.parallel.DistributedDataParallel. The former uses only a single process, while the latter uses multiple processes to train in parallel and is also applicable to multi-machine multi-GPU setups. Besides nn.parallel.DistributedDataParallel, there are also some third-party libraries for multi-process parallel training in PyTorch, such as APEX and Horovod.

Using nn.DataParallel is the simplest approach, requiring just a single line of code:

model = torch.nn.DataParallel(model)

When training in parallel with nn.DataParallel, each iteration proceeds as follows:

  1. The entire batch of data is loaded onto one master GPU.
  2. Via peer-to-peer (PtoP) copies, each chunk of data (BS/GPU_NUM) is copied in turn from the master GPU to the other GPUs.
  3. The network’s parameters are broadcast from the master GPU to the other GPUs via the NVIDIA Collective Communications Library (NCCL) Broadcast (Figure 5).
  4. Each GPU performs its own forward and backward passes.
  5. Via Reduce (Figure 6), the gradients from the other GPUs are reduced onto the master GPU, which performs gradient descent to obtain the optimized network.

Figure 5: Broadcast
Figure 6: Reduce
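
The scatter, broadcast, compute, and reduce flow described above can be mimicked in plain Python. This is a toy sketch, not how nn.DataParallel is implemented: lists stand in for GPUs, the model is a single weight w fitted to y = 3x with squared error, and all names and values are illustrative.

```python
def scatter(batch, num_replicas):
    """Split a batch into equal contiguous chunks, one per replica."""
    chunk = len(batch) // num_replicas
    return [batch[i * chunk:(i + 1) * chunk] for i in range(num_replicas)]


def grad_sum(w, chunk):
    """Sum of d/dw (w*x - y)^2 over the samples in one chunk."""
    return sum(2.0 * (w * x - y) * x for x, y in chunk)


# Toy batch of (x, y) pairs generated by y = 3x: 8 samples over 4 "GPUs".
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
num_gpus, lr = 4, 0.001
w_master = 0.0

chunks = scatter(batch, num_gpus)                # master -> replicas (data)
replica_weights = [w_master] * num_gpus          # Broadcast of the parameters
local_grads = [grad_sum(w, c) for w, c in zip(replica_weights, chunks)]
reduced = sum(local_grads) / len(batch)          # Reduce onto the master
w_master -= lr * reduced                         # only the master updates

# Reference: the gradient computed on the whole batch by a single device.
full_batch_grad = grad_sum(0.0, batch) / len(batch)
```

The reduced gradient matches the full-batch gradient, which is why data parallelism with a fixed total batch size does not change the optimization itself.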

We first ran experiments based on configuration [1], keeping the total batch size fixed (for example, with a total batch size of 256, each card processes 128 samples with two cards and 64 with four cards). The results are shown in Table 4. The experiments iterated over 500 batches, and the total time of the middle 300 batches is reported as the total time in Table 4. The DtoD time includes two parts: the time for the master GPU to transfer each chunk of data PtoP to the other GPUs, and the combined time for the master GPU to Broadcast the network parameters and for each GPU to Reduce its gradients onto the master GPU. The profiling result for [7] with BS=512 is shown in Figure 7.

Table 4: Speed comparison with different GPU counts using nn.DataParallel
Figure 7: Profiling result of 4-GPU parallel training at BS=512 (configuration [7])

From Table 4 and Figure 7 above, we can roughly observe the following phenomena:

  1. Using multiple GPUs can effectively reduce the forward and backward time.
  2. When the number of GPUs grows to a certain point, the GPUs end up waiting for data loading after they finish computing.
  3. As the number of GPUs increases, the share of time spent on inter-GPU communication grows, and additional GPUs yield increasingly limited gains in overall speed.

Regarding the second point, the data loading time is made up of two parts: on one hand the I/O time, which we have already solved by putting the dataset into memory; on the other hand the data preprocessing time, which we can address by increasing the number of CPU threads and the number of data-loading threads. When training inside a docker container, you can try the former by adjusting the --cpus parameter of docker run, and the latter by appropriately increasing num_workers when creating the torch.utils.data.DataLoader object. We further improved on configuration [7], and the experimental results at BS=512 are shown in Table 5. We can see that if the GPUs are still idle waiting for data during training, appropriately increasing the number of CPU threads and data-loading threads can effectively reduce the total time (mainly by reducing the GPUs’ idle time waiting for data). However, excessively increasing the number of CPU threads causes some waste, because beyond a certain point the GPUs’ idle wait for data is already very small; moreover, over-increasing the thread count can become slower due to the overhead of the threads themselves (for example, [10] and [11] in the table below).
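
The trade-off described above (more loader threads help until the GPUs stop waiting, after which per-thread overhead dominates) can be illustrated with a toy cost model. Everything in this sketch is an assumption made for illustration: the formula, the function name step_time_ms, and all numbers are made up and are not measurements from the experiments.

```python
def step_time_ms(gpu_ms, preprocess_ms, workers, overhead_ms_per_worker):
    """Toy per-batch time: the slower of GPU compute and parallel preprocessing,
    plus a scheduling cost that grows with the number of workers."""
    loading_ms = preprocess_ms / workers
    return max(gpu_ms, loading_ms) + overhead_ms_per_worker * workers


# All numbers below are made up for illustration, not measurements.
gpu_ms, preprocess_ms, overhead = 100.0, 1600.0, 0.5
times = {
    n: step_time_ms(gpu_ms, preprocess_ms, n, overhead)
    for n in (4, 8, 16, 32, 64)
}
best = min(times, key=times.get)

for n, t in times.items():
    print(f"workers={n:2d}  step={t:6.1f} ms")
```

In practice the sweet spot has to be found by measurement, for example by sweeping num_workers and comparing the profiled total time as in Table 5.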

Table 5: Speed comparison with different CPU/data-loading thread counts

As for the third point, we note that in nn.DataParallel parallel training, all data is first transferred from memory onto one master GPU, and then transferred PtoP in turn from the master GPU to the other GPUs. This is clearly less efficient than transferring each chunk of data directly from memory to its corresponding GPU, which would save the time of copying each chunk PtoP from the master GPU to the other GPUs (the first part of the DtoD time in Table 5, which under configuration [7] accounts for nearly 20%). In addition, on each iteration nn.DataParallel first Broadcasts the model parameters from the master GPU to the other GPUs and finally Reduces the gradients onto the master GPU. If instead we use All-Reduce to send the reduced gradients to every GPU (Figure 8), then each GPU can perform its own gradient descent, which eliminates the time overhead of nn.DataParallel broadcasting the model parameters at the start (roughly 2%).

Figure 8: All-Reduce
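
The reason All-Reduce removes the per-iteration Broadcast can be seen in a small simulation. In this sketch, which is a plain-Python illustration rather than NCCL's algorithm, every replica receives the same averaged gradient and applies the same update, so replicas that start with identical weights stay identical. The gradient values and the helper name all_reduce_mean are illustrative.

```python
def all_reduce_mean(values):
    """Every participant receives the same mean of all contributions."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Four replicas start from identical weights (synchronized once at startup).
weights = [0.5, 0.5, 0.5, 0.5]
local_grads = [0.2, -0.1, 0.4, 0.3]  # illustrative per-replica gradients
lr = 0.1

for step in range(3):
    synced = all_reduce_mean(local_grads)
    # Each replica applies the same update locally, so the weights never
    # diverge and no per-step Broadcast from a master GPU is needed.
    weights = [w - lr * g for w, g in zip(weights, synced)]
```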

The multi-process-based nn.parallel.DistributedDataParallel implements exactly this approach; for details on how to use it, refer to the official example. We ran experiments based on configuration [10], with the results shown in Table 6. When using DistributedDataParallel, the DtoD time was not measured because the synchronization time between different GPUs fluctuates considerably. We can see that using DistributedDataParallel reduces the time by about one third. From the profiling results (Figure 9), we can see that the CUDA kernel utilization of configuration [12] is far higher than that of configuration [7]. This is partly because we used more CPU threads and data-loading threads, and partly because the parallel training is implemented with multiple processes, which — unlike a multi-threaded implementation — can get around Python’s GIL (Global Interpreter Lock) and thereby further improve efficiency.

Table 6: Speed comparison of different parallelization methods
Figure 9: Profiling result of 4-GPU parallel training at BS=512 (using DDP), configuration [12]

Looking more closely, Figure 10 and Figure 11 respectively show the detailed time cost of training a particular batch with DataParallel and with DistributedDataParallel. When using DistributedDataParallel, the size of the HtoD data in Figure 9 shrinks from the full BS to BS/GPU_NUM, so this part of the time can in theory be reduced to 1/GPU_NUM of DataParallel’s (with slight fluctuations in the transfer speed across GPUs). In addition, the DtoD time can be eliminated entirely. Finally, the Broadcast and Reduce times are replaced by All-Reduce (which, because All-Reduce requires all GPUs to synchronize, fluctuates considerably in cost). The two methods show no noticeable difference in the time spent on the actual computation (Forward/Backward/SGD).

Figure 10: Profiling result of DataParallel parallel computation (configuration [10])
Figure 11: Profiling result of DistributedDataParallel parallel computation (configuration [12])
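
The 1/GPU_NUM reduction in HtoD time can be estimated on the back of an envelope. This sketch assumes FP32 images at 224x224x3 and reuses the 11.3GB/s host-to-device bandwidth measured in the single-GPU baseline; the bandwidth on the multi-GPU machine may differ, so the absolute numbers are only indicative.

```python
def htod_ms(batch_size, bandwidth_gb_per_s, channels=3, height=224, width=224,
            bytes_per_value=4):
    """Estimated host-to-device copy time in ms for one FP32 image batch."""
    num_bytes = batch_size * channels * height * width * bytes_per_value
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


total_bs, gpu_num = 512, 4
bandwidth = 11.3  # GB/s, assumed equal to the single-GPU baseline measurement

dp_ms = htod_ms(total_bs, bandwidth)              # DataParallel: whole batch to one GPU
ddp_ms = htod_ms(total_bs // gpu_num, bandwidth)  # DDP: each process copies its own chunk

print(f"DataParallel ~{dp_ms:.1f} ms, DistributedDataParallel ~{ddp_ms:.1f} ms per GPU")
```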

Besides DistributedDataParallel, we mentioned earlier that third-party libraries such as APEX and Horovod implement the same multi-process parallelism, and we ran comparison experiments with these two libraries as well, with the results shown in Table 7. We can see that the three implementations differ little in speed. It is worth noting that APEX can synchronize BN operations — that is, when the BN layer computes the batch’s mean/standard deviation, it synchronizes across the data on all GPUs, so the resulting mean/standard deviation is computed from the statistics of the entire batch rather than from the single chunk of data on one GPU. This makes training more stable. However, this approach requires all GPUs to synchronize (sync) every time the mean/variance is computed, which brings some additional time overhead. That said, from the profiling results, the main reason the GPUs’ computations fall out of sync is the differing initial HtoD time; after synchronizing at the first BN layer, the synchronization time required by subsequent BN layers is very short (provided that all GPUs have roughly equal compute power).

Table 7: Speed differences between different multi-process parallelization methods
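
The difference between synchronized and unsynchronized BN statistics can be shown with a small example. This is a sketch of the general idea (each GPU contributes a count, a sum, and a sum of squares, which are then combined), not Apex's actual implementation; the activation values and function names are illustrative.

```python
def local_stats(chunk):
    """Per-GPU partial statistics: count, sum, and sum of squares."""
    return len(chunk), sum(chunk), sum(v * v for v in chunk)


def combine(stats):
    """Merge partial statistics into the mean and biased variance of the full batch."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    return mean, total_sq / n - mean * mean


# One activation channel split across two "GPUs" (values are illustrative).
chunks = [[1.0, 2.0, 3.0, 4.0], [10.0, 11.0, 12.0, 13.0]]

per_gpu_means = [sum(c) / len(c) for c in chunks]  # what unsynchronized BN sees
sync_mean, sync_var = combine([local_stats(c) for c in chunks])

print(per_gpu_means, sync_mean, sync_var)
```

Without synchronization each GPU normalizes with its own chunk mean, and the two chunks here disagree strongly; with synchronization every GPU uses the statistics of the entire batch.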

Finally, combining the FP16 mixed-precision training from the previous section with the multi-GPU training from this section — bringing these two orthogonal acceleration methods together — we can push the training speed to its limit. Concretely, Table 8 shows that DistributedDataParallel has the highest parallelization efficiency, so it can be combined with FP16 (O2); the experimental results show that this method can further accelerate to 38.8s, with the profiling result shown in Figure 12.

Table 8: Results of mixed-precision multi-process parallel training
Figure 12: Profiling result of combining DistributedDataParallel parallel training with mixed-precision training (configuration [16])

Multi-Machine Multi-GPU Parallel Training

All of the previous experiments were run on a single machine, but in practice, to accelerate further, multi-machine multi-GPU distributed training is also needed. For example, in a two-machine, four-card-per-machine setup, you can use the following code to train:

# Assume running on 2 machines, each with 4 available cards
#    Machine 1:
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node 4 \
  --master_addr $my_address --master_port $my_port main.py
#    Machine 2:
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node 4 \
  --master_addr $my_address --master_port $my_port main.py
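
With two nodes and four processes per node, the launcher starts eight training processes in total, and each one is identified by a global rank derived from its node rank and its local rank. The following sketch shows that mapping; the helper name global_rank is hypothetical.

```python
def global_rank(node_rank, nproc_per_node, local_rank):
    """Rank of a process across all nodes, given its node and local index."""
    return node_rank * nproc_per_node + local_rank


nnodes, nproc_per_node = 2, 4
world_size = nnodes * nproc_per_node

# Enumerate every (node, local process) pair in launch order.
ranks = [
    global_rank(node, nproc_per_node, local)
    for node in range(nnodes)
    for local in range(nproc_per_node)
]
print(world_size, ranks)
```

Inside main.py, the local rank is typically what selects the GPU on that machine, while the global rank identifies the process within the whole job.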

We tested the training speed under different hardware configurations, the main difference being whether InfiniBand is used, with the experimental results shown in Table 9. Because the data on the different machines needs to be synchronized during training, part of the training time comes from inter-machine communication; InfiniBand can achieve lower latency than Ethernet (although Ethernet is more universal, while InfiniBand is relatively better suited to parallel-computing scenarios).

Table 9: Multi-machine multi-GPU training speed

Summary

In this article, by adjusting the configuration step by step, we progressively optimized training to make the fullest possible use of the GPUs’ compute power. First, using the baseline code on a single GPU with BS=512, training 300 batches took 161.0s (configuration [0]); then, simply using DataParallel to train on 4 GPUs accelerated it to 81.7s (configuration [7]); next, increasing the number of CPU threads and data-loading threads accelerated it to 60.2s by improving the data-loading speed (configuration [10]); on top of that, replacing DataParallel with DistributedDataParallel accelerated it to 41.6s by reducing inter-GPU communication time (configuration [12]); then, combining this with mixed-precision training accelerated it to 38.8s (configuration [16]); and finally, using two such nodes communicating over InfiniBand for parallel training accelerated it to 22.1s (configuration [18]).
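
The cumulative effect of each step can be expressed as a speedup over configuration [0]. This sketch only divides the times quoted in the summary; the stage labels are shorthand introduced here.

```python
# Times for 300 batches at BS=512, copied from the summary (seconds).
stages = [
    ("[0] baseline, 1 GPU", 161.0),
    ("[7] DataParallel, 4 GPUs", 81.7),
    ("[10] more CPU/loader threads", 60.2),
    ("[12] DistributedDataParallel", 41.6),
    ("[16] + mixed precision", 38.8),
    ("[18] 2 nodes over InfiniBand", 22.1),
]

baseline = stages[0][1]
speedups = [(name, baseline / seconds) for name, seconds in stages]

for name, factor in speedups:
    print(f"{name:32s} {factor:5.2f}x")
```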