Compressing Neural Networks with the Hashing Trick

Abstract

As deep networks are increasingly deployed on mobile devices, a dilemma becomes ever more apparent: the trend in deep learning is to develop models that can absorb larger and larger datasets, yet mobile devices have limited storage and cannot hold overly large models. This paper proposes HashedNets, which reduce model size by exploiting the inherent redundancy inside neural networks. HashedNets use a low-cost hash function to randomly group connection weights into different hash buckets, and all connections that fall into the same bucket share a single parameter value. These parameters are tuned during standard backpropagation, and the hashing process introduces no extra memory overhead. Performance on a range of benchmark datasets shows that HashedNets can substantially reduce storage requirements while preserving generalization performance.

Introduction

Over the past decade, deep neural networks have set new performance standards across many application domains, including object classification, speech recognition, image captioning, and domain adaptation. As the amount of data grows, the number of parameters in neural networks also grows in order to capture more information, and training happens on industrial clusters and high-performance GPUs—this is one direction of progress. Another trend is that machine learning is beginning to migrate to mobile and embedded devices, which have relatively low power consumption and, crucially, only a small amount of working memory.

The divergence between these two trends creates a dilemma for deploying state-of-the-art networks on mobile devices. The most powerful models trained on industrial clusters may exceed the runtime memory of a mobile device. One solution in the speech recognition domain is to transmit the processed audio to a compute center and run the computation on the server side. However, this requires high bandwidth. Another approach is to train small networks specifically for mobile applications, but this sacrifices accuracy (as of 2018, this approach is still viable: on one hand, the compute power of mobile devices is growing quickly, and on the other, lightweight networks such as ShuffleNet also achieve quite good accuracy).

This dilemma has driven the development of network compression. Research shows that neural networks contain a great deal of redundancy; the authors show that only a small fraction of it is useful, and they exploit this using a low-rank decomposition of the weight matrix. Other research shows that a deep network can be compressed into a single shallow network, achieved by training the small network on the outputs of a fully trained network (from a 2018 perspective this is distillation; the term was probably first coined by Hinton in 2015, although related ideas appeared as early as 2006). Still other approaches achieve this using reduced bit precision, and in 1989 LeCun proposed removing unimportant weights from a neural network. In short, many of the weight parameters in a neural network are useless.

Many benchmarks demonstrate that HashedNets can substantially reduce model size while only slightly affecting accuracy. Under the same memory budget, the hashing approach has more tunable parameters than low-rank decomposition. It is also found that, given a limited parameter set, reusing parameters and expanding the neural network is beneficial. Moreover, expanding a HashedNets network does not interfere with other network architecture design choices (such as dropout regularization).

Feature Hashing

Notation

This section mainly introduces the notation for vectors, matrices, and so on, as well as the meaning of the symbols used in the formulas.

HashedNets

This chapter mainly introduces HashedNets, a variant of neural networks with substantially reduced model size and memory requirements. It first introduces the method of random weight sharing among network connections, then describes how the hashing trick is used to improve it so as to avoid additional memory overhead.

Fully connected network layers have many parameters. If each layer has a memory budget and that budget is exceeded, the memory used by each layer must be reduced. There are usually two ways to do this: 1. reduce the number of connected nodes; 2. reduce the bit precision of the weight matrix. However, if the memory budget is already very tight, both of these methods will degrade the network’s generalization ability. Here we propose a new method: use a virtual weight matrix and reduce its effective memory footprint through weight sharing. The specific architecture is shown in the figure below:

Weight sharing in a single-hidden-layer fully connected network

Hashed Neural Nets (HashedNets)

One naive implementation of random weight sharing is to use a second matrix to record the grouping of each connection. However, such an overly detailed representation does not perform well when it comes to saving memory. Here we plan to implement weight sharing via hashing. Let $w^{l}_i$ store the $i$ -th effective weight of layer $l$ , and let $V^l_{i,j}$ denote a virtual weight. We then use a hash function $h^l(i,j)$ to map the indices of the virtual weights onto the effective weights, i.e. $V^{l}_{ij}=w^{l}_{h^l(i,j)}$ . This functionality is implemented using the open-source xxHash.

The single-layer computation of HashedNets is given by $z_i=\sum^{m}_{j=1}V_{ij}a_j=\sum^{m}_{j=1}w_{h(i,j)}a_j$ . HashedNets can equally well be explained through feature hashing, namely by mapping the m-dimensional input features into k dimensions via hashing, which yields the same result as weight sharing. This can be regarded as an alternative interpretation of HashedNets.

In feature hashing, the hash function is expressed as

[\phi_{i}(a)]_{k}=\sum_{j:h(i,j)=k}a_j

where $i$ denotes the $i$ -th row of the weight matrix and $[\phi_{i}(a)]_{k}$ denotes the $k$ -th element of the mapped k-dimensional feature space. We can write $z_i$ in the following form:

z_i=\sum_{k=1}^{K}w_{k}[\phi_{i}(a)]_{k}=\sum_{k=1}^{K}w_{k}\sum_{j:h(i,j)=k}a_j

Continuing the transformation, we further obtain:

z_{i}=\sum^{m}_{j=1}\sum_{k=1}^{K}w_{k}a_{j}\delta_{[h(i,j)=k]}=\sum^{m}_{j=1}w_{h(i,j)a_{j}}

This expression is the same as the one obtained from weight sharing, so the two are in fact equivalent.

Sign factor: Because of the equivalence between random weight sharing and feature hashing, HashedNets inherit some of the advantages of feature hashing. For example, in feature hashing one can introduce $\xi(i,j)$ to eliminate the bias in the inner product caused by collisions.

Sparsity: Prior research has shown that feature hashing is most useful with sparse feature vectors, because collisions are then less frequent. Using sparsity-inducing activation functions in the hidden layers, such as ReLU, can further promote this effect. Here we use ReLU activations throughout, for both their sparsity-inducing property and strong generalization performance.

Alternative neural network architectures: This work focuses mainly on fully connected feedforward networks, but HashedNets can also be applied to other networks such as RNNs, and can likewise be combined with other network compression methods, such as using low-bit storage, removing certain edges, or training HashedNets on the outputs of another large network.

Training HashedNets

This is presented in three parts: 1. computing the output of a hash layer during the forward pass; 2. propagating gradients from the output layer back to the input layer; 3. computing the gradient with respect to the weights $W^l$ during backpropagation.

The forward-pass formula:

a_i^{l+1}=f(\sum^{n^l}_{j}w^{l}_{h^{l}(i,j)}\xi^{l}(i,j)a_{j}^l)

The error term:

\delta_{j}^{l}=(\sum_{i=1}^{n^{l+1}}\xi^{l}(i,j)w^{l}_{h^{l}(i,j)}\delta_{i}^{l+1})f'(z_{j}^l )

The gradient of the loss function with respect to the parameters:

\frac{\partial L}{\partial V_{ij}^{l}}=a^{l}_{j}\delta_{i}^{l+1}\\ \frac{\partial V_{ij}^l}{\partial w_{k}^{l}}=\xi^{l}(i,j)\delta_{h^{l}(i,j)=k}

In the formulas above, the first is the derivative of the loss with respect to the virtual weights, and the second is the derivative of the virtual weights with respect to the real weights. Combining the two via the chain rule gives:

\frac{\partial L}{\partial w_{k}^{l}}=\sum_{i,j}\frac{\partial L}{\partial V_{ij}^l}\frac{\partial V_{ij}^{l}}{\partial w_{k}^{l}}=\sum_{i=1}^{n^{l+1}}\sum_{j}a_{j}^{l}\delta_{i}^{l+1}\xi^{l}(i,j)\delta_{h^{l}(i,j)=k}

Deep neural networks have made enormous progress in many real-world applications, including image classification, object detection, image retrieval, speech recognition, and text representation. Many attempts have already been made to reduce the complexity of neural networks in various contexts. The most popular method is the convolutional neural network, which uses the same filter within each receptive field, reducing model size and improving generalization performance at the same time. Combined with pooling, it can also reduce the number of connections between layers by exposing only part of the input features. Autoencoders, in turn, share the same weights between the encoder and the decoder.

Other methods can reduce the number of free parameters in a network but do not necessarily reduce memory overhead. There is a soft weight-sharing method used for regularization that models the weights as a Gaussian mixture: the weights are clustered so that weights in the same group take the same value. Because the weights are not known before training, the clustering of the weights is performed during training. This method differs from HashedNets in that it requires a large number of parameters to record which group each weight value belongs to.

Another method is to directly remove unimportant weights, but this likewise requires a large number of parameters to store the sparse weight values. There is also a method that experimentally shows that randomly removing connections in a network leads to good empirical performance, which is similar in spirit to the idea behind HashedNets.

Some methods reduce model size by reducing numerical precision—for example, a 16-bit fixed-point representation requires only a quarter of the storage of a double-precision floating-point representation, with only a slight drop in performance. This method can be coupled with other methods.

Recent research shows, by directly learning a low-rank decomposition of each layer, that there is a large amount of redundancy in neural network parameters. A network reconstructed directly from a low-rank decomposition suffers only a slight performance drop compared with the case where all weight values are treated as free parameters. Subsequent work used similar techniques to speed up CNN inference at test time, rather than to reduce storage and memory overhead. HashedNets is complementary to this work, and the two methods can be combined.

Another method is distillation, where the authors show that a distilled model generalizes better than one trained only on labels. There is also a nested dropout method, which can rank the importance of hidden nodes and remove these nodes after training.

One method was the first to propose optimizing NLP models for mobile and embedded devices. It proposed random feature mixing based on a hash function to group features, which can significantly reduce the number of features and parameters. With the help of feature hashing, the Vowpal Wabbit large-scale learning system can scale to large datasets.

Technology

2018 · 10 · 15