DoReFa-Net: Notes on Low-Bit Quantized Training

The core idea of the DoReFa-Net paper is to quantize the weights, activations, and gradients of a convolutional neural network separately to low bit-width, thereby compressing the vast majority of the computation in both training and inference down to low-bit bit operations. Below is a walkthrough of how each module works, following the structure of the paper.

The Relationship Between Bit Operations and Dot Products

The paper’s starting point is that the dot product between two integers can be expressed using bit operations.

\mathbf{x} \cdot \mathbf{y} = \mathrm{bitcount}(\mathrm{and}(\mathbf{x}, \mathbf{y})), \quad x_i, y_i \in \{0,1\}\ \forall i.

Extending this idea to the dot product between two vectors, the integers in the two vectors can also be represented with different numbers of bits. This lays the foundation for using different bit-widths for weights and activations.

\begin{aligned} \mathbf{x} \cdot \mathbf{y} &= \sum_{m=0}^{M-1} \sum_{k=0}^{K-1} 2^{m+k}\, \mathrm{bitcount}\big[\mathrm{and}(c_m(\mathbf{x}), c_k(\mathbf{y}))\big], \\ c_m(\mathbf{x})_i,\ &c_k(\mathbf{y})_i \in \{0,1\}\ \forall i, m, k. \end{aligned}

The Straight-Through Estimator (STE)

The quantization operation itself (rounding/clipping) is not differentiable, so backpropagation requires an approximation. For approximating the gradient, the paper uses the STE (Straight-Through Estimator), which essentially amounts to defining custom forward and backward functions: forward performs the quantization, while backward passes the gradient straight through.

The STE in this paper corresponds to a rounding function, namely quantize_k; see the code implementation for the specific algorithm.

\begin{aligned} \textbf{Forward:}\quad & r_o = \frac{1}{2^k - 1}\,\mathrm{round}\big((2^k - 1) r_i\big) \\ \textbf{Backward:}\quad & \frac{\partial c}{\partial r_i} = \frac{\partial c}{\partial r_o}. \end{aligned}

Weight Quantization

When quantizing weights, a scaling factor is usually introduced. XNOR-NET uses a scaling factor that varies with the input, whereas DoReFa-Net uses a fixed scaling factor, which is simpler to implement.

\begin{aligned} \textbf{Forward:}\quad & r_o = \mathrm{sign}(r_i) \times \mathbf{E}(|r_i|) \\ \textbf{Backward:}\quad & \frac{\partial c}{\partial r_i} = \frac{\partial c}{\partial r_o}. \end{aligned}

When the number of bits is greater than 1, weights are quantized as follows, where the quantize_k is the STE implementation introduced above. When k=1, the specific form differs slightly from the multi-bit case, but experiments show that the two yield little difference in performance.

\begin{aligned} \textbf{Forward:}\quad & r_o = f_\omega^k(r_i) = 2\,\mathrm{quantize}_k\!\left(\frac{\tanh(r_i)}{2\max(|\tanh(r_i)|)} + \frac{1}{2}\right) - 1. \\ \textbf{Backward:}\quad & \frac{\partial c}{\partial r_i} = \frac{\partial r_o}{\partial r_i}\frac{\partial c}{\partial r_o}. \end{aligned}

Activation Quantization

Activations are quantized as follows.

f_\alpha^k(r) = \mathrm{quantize}_k(r).

Gradient Quantization

Quantizing gradients is more involved than quantizing weights and activations, mainly because gradients have a larger numerical range and no clear bounds. For this reason, the paper applies stochastic quantization to gradients.

\tilde{f}_\gamma^k(\mathrm{d}r) = 2\max_0(|\mathrm{d}r|)\left[\mathrm{quantize}_k\!\left(\frac{\mathrm{d}r}{2\max_0(|\mathrm{d}r|)} + \frac{1}{2}\right) - \frac{1}{2}\right].

Stochastic quantization simply means adding an extra noise term during quantization.

N(k) = \frac{\sigma}{2^k - 1}\ \text{where}\ \sigma \sim Uniform(-0.5, 0.5).

Experiments show that this noise term is fairly important; removing it leads to a noticeable drop in performance.

f_\gamma^k(\mathrm{d}r) = 2\max_0(|\mathrm{d}r|)\left[\mathrm{quantize}_k\!\left(\frac{\mathrm{d}r}{2\max_0(|\mathrm{d}r|)} + \frac{1}{2} + N(k)\right) - \frac{1}{2}\right].

It is worth noting that gradient quantization is performed only during backpropagation.

\begin{aligned} \textbf{Forward:}\quad & r_o = r_i \\ \textbf{Backward:}\quad & \frac{\partial c}{\partial r_i} = f_\gamma^k\!\left(\frac{\partial c}{\partial r_o}\right). \end{aligned}

The Overall Algorithm Flow

Putting the three parts above together, the overall algorithm is as follows.

Algorithm 1  Training an L-layer DoReFa-Net with W-bit weights and A-bit
activations using G-bit gradients. Weights, activations and gradients are
quantized according to Eqn 9, Eqn 11, Eqn 12, respectively.

Require: a minibatch of inputs and targets (a_0, a*), previous weights W,
         learning rate eta
Ensure:  updated weights W^{t+1}

  {1. Computing the parameter gradients:}
  {1.1 Forward propagation:}
   1: for k = 1 to L do
   2:     W_k^b  <- f_omega^W(W_k)
   3:     a-bar_k <- forward(a_{k-1}^b, W_k^b)
   4:     a_k    <- h(a-bar_k)
   5:     if k < L then
   6:         a_k^b <- f_alpha^A(a_k)
   7:     end if
   8:     Optionally apply pooling
   9: end for

  {1.2 Backward propagation:}
  Compute g_{a_L} = dC/da_L knowing a_L and a*.
  10: for k = L to 1 do
  11:     Back-propagate g_{a_k} through activation function h
  12:     g_{a_k}^b   <- f_gamma^G(g_{a_k})
  13:     g_{a_{k-1}} <- backward_input(g_{a_k}^b, W_k^b)
  14:     g_{W_k^b}   <- backward_weight(g_{a_k}^b, a_{k-1}^b)
  15:     Back-propagate gradients through pooling layer if there is one
  16: end for

  {2. Accumulating the parameter gradients:}
  17: for k = 1 to L do
  18:     g_{W_k} = g_{W_k^b} * (dW_k^b / dW_k)
  19:     W_k^{t+1} <- Update(W_k, g_{W_k}, eta)
  20: end for

Here, all the heavy operations—including forward, backward_input, and backward_weight—are low-bit operations, which is the crux of the paper’s claim of significantly reduced training cost.

On the special handling of the first and last layers:

The first convolutional layer connects directly to the network input, so its weights are not quantized, but the activations after the convolution still need to be quantized.
The final fully connected layer is also left unquantized when the number of classes is small, but the gradients flowing out of the FC layer do need to be quantized.

Fused Optimization at Inference Time

In practice, steps 3, 4, and 6 of the algorithm can be fused together to reduce the memory and time cost at inference time (effectively eliminating the intermediate step h()). After fusion, the result can be obtained directly via comparison.

a_k^b = f_\alpha(h(a_k))

Paper Notes
Quantization
Model Compression

2019 · 05 · 06