A Collection of Original Ideas on NAS and Model Pruning

Lately I’ve accumulated quite a few scattered ideas in the directions of NAS and model compression. Today I’m pulling them together so I can pick from them for experiments later.

At the search-strategy level. Could we optimize the probability distribution in the simplest, most direct way, so that each step moves toward a better point with high probability and converges after infinitely many steps? This reminds me of the idea behind MCMC sampling. Another direction is to search linear structures with a greedy approach, without having to enumerate all possibilities at once.
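The "move toward a better point with high probability" idea can be sketched with a Metropolis-style acceptance rule. This is only a minimal illustration under stated assumptions: the binary architecture encoding, `flip_one_choice`, and `toy_score` are hypothetical stand-ins (a real score would be something like validation accuracy), and `temperature` is an assumed hyperparameter.

```python
import math
import random


def metropolis_search(score, init, propose, steps=200, temperature=0.1, seed=0):
    """Random-walk search with a Metropolis-style acceptance rule."""
    rng = random.Random(seed)
    current, current_score = init, score(init)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Improvements are always accepted; worse candidates are accepted
        # with probability exp(delta / temperature), so most steps go uphill.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


def flip_one_choice(arch, rng):
    """Proposal: toggle one binary design choice (hypothetical encoding)."""
    i = rng.randrange(len(arch))
    return arch[:i] + (1 - arch[i],) + arch[i + 1:]


def toy_score(arch):
    """Hypothetical stand-in for the validation accuracy of an architecture."""
    target = (1, 0, 1, 1, 0, 1)
    return sum(a == t for a, t in zip(arch, target)) / len(target)


best_arch, best_score = metropolis_search(toy_score, (0,) * 6, flip_one_choice)
print(best_arch, best_score)
```

Lowering `temperature` makes the walk greedier; raising it makes escaping local optima easier at the cost of slower convergence.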

Interpretability-driven NAS. I want to do NAS under an interpretability objective, that is, to make “whether the network structure itself is interpretable” one of the evaluation dimensions of the search. Going further, could we cast the interpretability problem in the form of image translation? We could also consider handling interpretability in a hierarchical manner: first achieve interpretability at the high-level semantic level, then gradually refine it down to the detail level. In addition, generating as large a mask as possible without harming discriminative ability, and using that mask to explain the classification network, is also worth a try.

Introducing an attention mechanism. Currently most structures found by NAS are various combinations of convolutions, which tend to lose long-range information; we could consider adding attention-like mechanisms to the search space. Correspondingly, generating pruning masks using the attention idea should also be a feasible direction. Auto-encoder structures with attention are said to work very well too, and are worth investigating.

Progressively replacing blocks. In NAS we could adopt a progressive-GAN-like approach to gradually replace blocks, rather than fixing the entire structure all at once. Going further, treating a block as a channel and using pruning methods to do NAS means that, from this perspective, NAS and pruning are unified.

Combining NAS with generators. Apply NAS to the generator of a GAN and try to make each network node generate a different part of the image, which would give the architecture search itself a flavor of interpretability.

Unsupervised/weakly-supervised settings. Applying NAS to unsupervised or weakly-supervised tasks. The idea is to let the AI first do unsupervised learning to find intrinsic patterns, then perform annotation and proceed to the next step of training.

Considering NAS from the dataset angle. Most current NAS work targets fixed datasets; guiding architecture search starting from the characteristics of the dataset itself is a direction worth exploring.

Statistics-based channel pruning. We could directly determine, via statistical methods, how many channels a given convolutional layer can drop, or perform a sorting-like operation on channels during training and prune by threshold. Concretely, prune according to the statistical properties of the weights: as shown in the figure below, draw a horizontal line in the middle, and determine the relative pruning ratio according to the value on the horizontal axis.
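One possible reading of the sort-and-threshold idea, as a minimal sketch: it assumes the statistic is the mean absolute weight of each output channel and that the threshold is chosen by hand. Both choices, and the function name, are illustrative assumptions rather than a fixed method.

```python
import numpy as np


def prune_by_threshold(weight, threshold):
    """Pick output channels whose mean |weight| falls below `threshold`.

    weight: array of shape (out_channels, in_channels, kh, kw).
    Returns the indices of the channels to prune and the pruning ratio.
    """
    stats = np.abs(weight).reshape(weight.shape[0], -1).mean(axis=1)
    order = np.argsort(stats)          # channels sorted by their statistic
    sorted_stats = stats[order]
    # Number of channels strictly below the threshold.
    n_pruned = int(np.searchsorted(sorted_stats, threshold, side="left"))
    return order[:n_pruned], n_pruned / weight.shape[0]


# Toy layer: four output channels with very different weight magnitudes.
scales = np.array([0.1, 1.0, 0.05, 2.0]).reshape(4, 1, 1, 1)
weight = np.ones((4, 2, 3, 3)) * scales
pruned, ratio = prune_by_threshold(weight, threshold=0.5)
print(pruned, ratio)
```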

Redundancy-based channel pruning. Consider how redundant a layer is: if the feature maps output by a layer differ little from each other, it means some kernels are fairly similar and can be removed. We can take a batch of data to compute statistics, or directly compute the L2-norm matrix of the convolution-kernel parameters and then average it; the smallest values indicate that the kernels are fairly similar and highly redundant. Going further, we can flatten the convolution kernels and compute their relative angles as a similarity measure.
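The relative-angle variant can be sketched as a cosine-similarity matrix over flattened kernels. This is a minimal illustration; the function names and the toy weights are assumptions, and in practice the weights would come from a trained convolutional layer.

```python
import numpy as np


def kernel_cosine_similarity(weight):
    """Cosine similarity between the flattened kernels of one conv layer.

    weight: array of shape (out_channels, in_channels, kh, kw).
    """
    flat = weight.reshape(weight.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-12)  # guard against all-zero kernels
    return unit @ unit.T


def most_redundant_pair(weight):
    """Return the pair of kernels with the smallest relative angle."""
    sim = kernel_cosine_similarity(weight)
    np.fill_diagonal(sim, -1.0)  # ignore each kernel's similarity to itself
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return int(i), int(j), float(sim[i, j])


rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 3, 3, 3))
weight[2] = 2.0 * weight[0]  # kernel 2 is a scaled copy of kernel 0
i, j, sim = most_redundant_pair(weight)
print(i, j, sim)
```

A pair with similarity close to 1 points in nearly the same direction, so one of the two kernels is a candidate for removal.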

Pruning based on feature-map sparsity. Pruning using the sparse terms on feature maps rather than the sparse terms on convolution kernels is conceptually different and worth a comparative experiment.

Progressive pruning. Replacing one-shot removal with a slow, gradual shift should work much better. During the pruning process we can also gradually convert convolution kernels into depthwise or group-conv forms, combining structural simplification with parameter reduction.

An EM algorithm for pruning. Fit the pruning problem into the EM framework: the E-step estimates which channels/kernels are redundant, the M-step updates the network weights, and the two iterate in a loop.

Using NAS for fast pruning. Since we can take gradients of the mask for unstructured pruning, we can likewise take gradients for structured pruning, then sort everything together; the whole process amounts to a NAS process.
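To make the mask-gradient idea concrete, here is a minimal sketch in a toy linear setting. The assumptions are loud ones: the layer output is a plain sum of per-channel contributions, every mask entry starts at 1, and the loss is a squared error. None of these are prescribed by the idea itself; the point is only that the gradient with respect to a per-channel mask yields a score that can be sorted.

```python
import numpy as np


def mask_gradient_scores(features, target):
    """Return |dL/dm_c| for every channel c of a toy linear model.

    features: (channels, n) array, row c is the contribution of channel c.
    The output is sum_c m_c * features[c] with every mask entry m_c = 1,
    and L = 0.5 * ||output - target||^2 (an assumed loss for illustration).
    """
    mask = np.ones(features.shape[0])
    output = mask @ features            # shape (n,)
    residual = output - target          # dL/d(output)
    grad_mask = features @ residual     # dL/dm_c = <features[c], residual>
    return np.abs(grad_mask)


features = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]])
target = np.zeros(2)
scores = mask_gradient_scores(features, target)
prune_order = np.argsort(scores)  # least sensitive channels first
print(scores, prune_order)
```

Scores computed this way for every layer could be pooled into one list and sorted together, which is the "sort everything together" step.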

Train a large network first, then prune and retrain. Try training a larger network first, then pruning and fine-tuning it, and validate the effectiveness of this pipeline.

Layer-by-layer distillation to build a small network. Build the small network from the bottom up, making each layer’s output of the small network as close as possible to that of the large network, i.e., layer-by-layer distillation. Local knowledge distillation is also worth trying: many complex structures merely make training easier; once training is done, we can consider removing these auxiliary structures and then transferring to a simpler structure via distillation.

Using the EfficientNet idea to determine pruning ratios. Use the EfficientNet idea to find the relative pruning ratio (i.e., relative redundancy) of each layer, initializing all layers to the same ratio and then adjusting according to performance. The ratios could be determined with a genetic algorithm. The approach could also be called “neural transform search for pruning”: instead of directly searching for a fixed structure, search for a transformation pattern (conditional NAS). There’s no need to run a full search for every condition; finding a single set of functions suffices, which amounts to relaxing the process.

Dynamic Networks and Adaptive Inference

Dynamic channel selection based on the SkipNet idea. SkipNet skips certain blocks; we could consider making it skip certain channels, taking pruning a step further: choose the number of channels to pass to the next layer according to activation values, with the final metric being the expected inference time over the entire dataset. Correspondingly, we could also learn the skip policy with an adversarial approach.
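A minimal sketch of activation-based channel selection for a single sample. It assumes the selection criterion is the mean absolute activation per channel and that the number of kept channels `k` is given; a learned policy would replace both. The function name is hypothetical.

```python
import numpy as np


def select_channels(x, k):
    """Keep the k channels with the largest mean absolute activation.

    x: activations of one sample, shape (channels, h, w).
    Returns the reduced tensor and the indices of the kept channels.
    """
    energy = np.abs(x).mean(axis=(1, 2))
    keep = np.sort(np.argsort(energy)[-k:])  # top-k, restored to channel order
    return x[keep], keep


# Four channels with constant activations 1.0, 5.0, 0.1 and 3.0.
x = np.stack([np.full((2, 2), v) for v in (1.0, 5.0, 0.1, 3.0)])
reduced, keep = select_channels(x, k=2)
print(keep, reduced.shape)
```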

Dynamic networks based on path statistics. Decide whether to proceed to the next step based on the preceding computation results, forming a dynamic inference mechanism based on path statistics.

Judging input complexity solely from internal neuron responses. Estimate how complex an input is from the responses of internal neurons alone, and use that estimate to decide what kind of network should handle the subsequent processing.

Cascaded small-to-large network inference. Use a front-end small network for a preliminary judgment (poor locality); if the entropy of the output is large, then invoke a larger network for processing.
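The cascade can be sketched as follows. This is a minimal illustration: the two models are hypothetical callables that return class probabilities, and `entropy_threshold` is an assumed hyperparameter that would need tuning on held-out data.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cascade_predict(x, small_model, large_model, entropy_threshold):
    """Use the small model unless its output is too uncertain."""
    probs = small_model(x)
    used = "small"
    if entropy(probs) > entropy_threshold:
        probs = large_model(x)
        used = "large"
    prediction = max(range(len(probs)), key=probs.__getitem__)
    return prediction, used


# Hypothetical stand-ins for a small and a large classifier.
def small_model(x):
    return [0.98, 0.01, 0.01] if x == "easy" else [0.4, 0.3, 0.3]


def large_model(x):
    return [0.1, 0.8, 0.1]


easy_result = cascade_predict("easy", small_model, large_model, 0.5)
hard_result = cascade_predict("hard", small_model, large_model, 0.5)
print(easy_result, hard_result)
```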

Neural Decision Tree. Take a batch of pretrained models ranging from small to large, and train a classifier based on “the index of the smallest model that can correctly classify the sample,” so that it can automatically find the most suitable model given an input. We could also train this selection policy with REINFORCE.
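Building the training targets for such a selector can be sketched like this. The fallback for samples that no model classifies correctly (an extra class equal to the number of models) is an assumption; routing those samples to the largest model would be another option. Function names are hypothetical.

```python
def smallest_correct_model(predictions, label):
    """Index of the smallest model whose prediction matches the label.

    predictions: predicted classes for one sample, ordered small to large model.
    Returns len(predictions) if no model is correct (assumed fallback class).
    """
    for index, predicted in enumerate(predictions):
        if predicted == label:
            return index
    return len(predictions)


def build_selector_targets(all_predictions, labels):
    """One selector target per sample."""
    return [smallest_correct_model(p, y) for p, y in zip(all_predictions, labels)]


# Each row: predictions of three models (small, medium, large) for one sample.
all_predictions = [[1, 1, 1], [0, 2, 2], [0, 0, 3], [4, 5, 6]]
labels = [1, 2, 3, 7]
targets = build_selector_targets(all_predictions, labels)
print(targets)
```

The selector is then an ordinary classifier trained on the inputs with `targets` as labels.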

Shallow classifier + self-ensemble. Add a classifier at a shallow layer of the network to decide at which layer to output the result, with the loss encouraging output at as shallow a layer as possible while also being more accurate. Along the way, we can try self-ensemble: using the shallow classifier’s output to weight the later layers.

Using ReLU as a channel sampler. Use the ReLU function for pruning, framed as a channel sampler: multiply during training, index during inference. Try using this to accelerate MobileNet, trading space for time.
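A minimal sketch of "multiply during training, index during inference". It assumes one learnable score per channel (`gate_scores`, a hypothetical name) that is passed through ReLU: training multiplies every channel by its gate, while inference keeps only the channels whose gate is nonzero. The two paths agree on the surviving channels.

```python
import numpy as np


def relu_gate(gate_scores):
    """ReLU over per-channel scores; a zero gate means the channel is dropped."""
    return np.maximum(gate_scores, 0.0)


def forward_train(x, gate_scores):
    """Training path: multiply every channel by its gate (shape preserved)."""
    return x * relu_gate(gate_scores)[None, :, None, None]


def forward_infer(x, gate_scores):
    """Inference path: index the surviving channels, then scale them."""
    gate = relu_gate(gate_scores)
    keep = np.flatnonzero(gate > 0)
    return x[:, keep] * gate[keep][None, :, None, None], keep


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3, 3))            # (batch, channels, h, w)
gate_scores = np.array([0.5, -1.0, 2.0, -0.1])
y_train = forward_train(x, gate_scores)
y_infer, keep = forward_infer(x, gate_scores)
print(keep, y_train.shape, y_infer.shape)
```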

Obtaining loss and w via adversarial learning. Use adversarial learning to obtain a suitable loss and network weights w; the specific form still needs to be designed.

Training an auto-encoder via frequency decomposition. Train an auto-encoder using frequency decomposition, explicitly introducing frequency-domain information into the reconstruction objective.
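One way to obtain the frequency bands is an FFT-based split into low and high frequencies, sketched below. The circular low-pass mask and the `cutoff` radius are assumptions; wavelets or Laplacian pyramids would be alternatives. A reconstruction objective could then weight the error on each band separately.

```python
import numpy as np


def split_frequencies(image, cutoff):
    """Split a 2-D image into low- and high-frequency parts.

    Frequencies within `cutoff` of the spectrum center form the low band;
    the high band is defined as the remainder, so low + high == image.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h // 2, xx - w // 2)  # distance from the DC term
    low_mask = radius <= cutoff
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = image - low
    return low, high


rng = np.random.default_rng(0)
image = rng.random((8, 8))
low, high = split_frequencies(image, cutoff=2)

constant = np.full((8, 8), 3.0)
low_c, high_c = split_frequencies(constant, cutoff=1)
print(np.abs(high_c).max())
```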

Ensembles of heterogeneous neural networks. Try ensembles of heterogeneous neural networks, as well as experiments on random network structures; the related work of Zhou Zhihua is a good place to start.

Handling high-frequency detail in GANs. GANs handle high-frequency detail poorly; we can try cropping the image, upscaling the crops to the same scale and concatenating them, then reducing dimensionality with a 1×1 conv, which amounts to magnifying local information. Consider designing such a block.
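A forward-pass sketch of such a block, under several assumptions: the image is cut into four quadrants, each quadrant is upscaled back to full size with nearest-neighbor repetition, and the 1×1 conv is written as a channel-mixing matrix. The function name, the 2×2 crop layout, and the interpolation mode are all illustrative choices.

```python
import numpy as np


def magnify_block(x, weight):
    """Crop into quadrants, upscale each to full size, concat, 1x1 conv.

    x: feature map of shape (channels, h, w) with even h and w.
    weight: 1x1 conv kernel of shape (out_channels, 4 * channels).
    """
    c, h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "h and w must be even"
    hh, hw = h // 2, w // 2
    crops = [x[:, i:i + hh, j:j + hw] for i in (0, hh) for j in (0, hw)]
    # Nearest-neighbor 2x upscaling brings every crop back to (c, h, w).
    upscaled = [np.repeat(np.repeat(crop, 2, axis=1), 2, axis=2) for crop in crops]
    stacked = np.concatenate(upscaled, axis=0)        # (4c, h, w)
    return np.einsum("oc,chw->ohw", weight, stacked)  # 1x1 conv


x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
weight = np.zeros((3, 8))
weight[0, 0] = 1.0  # output 0 copies the magnified top-left crop of channel 0
out = magnify_block(x, weight)
print(out.shape)
```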

Self-paced learning and curriculum learning. Learn easy samples first and hard samples later; this idea can be combined with the dynamic inference above.

Reducing neural-network complexity and Occam’s razor. Constrain neural-network complexity from the perspective of Occam’s razor, as a theoretical starting point for regularization.

Quantization-friendliness of the output distribution. Add a loss to make the distribution of the output values as uniform as possible or as close to a normal distribution as possible (the maximum-entropy idea), to facilitate subsequent quantization; or add a BN-like layer to transform the distribution into something close to a normal distribution.

Neural-network hashing. Use a neural network to perform a hashing-like algorithm, using the extracted features to compare the similarity between two images.
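The comparison step can be sketched as sign binarization followed by a Hamming distance. The binarization rule is an assumption, and the feature vectors below are hypothetical stand-ins for a network's output on two images.

```python
import numpy as np


def binary_hash(features):
    """Binarize a feature vector by sign (an assumed hashing rule)."""
    return (np.asarray(features) > 0).astype(np.uint8)


def hamming_distance(code_a, code_b):
    """Number of differing bits; smaller means more similar images."""
    return int(np.count_nonzero(code_a != code_b))


# Hypothetical feature vectors, standing in for a network's output on two images.
feat_a = [0.3, -1.2, 0.8, -0.1]
feat_b = [0.1, -0.5, -0.9, -0.2]
distance = hamming_distance(binary_hash(feat_a), binary_hash(feat_b))
print(distance)
```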

Fine-tuning in the StackGAN manner. Fine-tune in the manner of StackGAN, progressively improving resolution and detail in stages.

Testing with transposed h-w dimensions. Concatenate images that have had their h-w dimensions transposed and run tests to see what effect it brings.

Comparing channel sensitivity across different datasets. It has been shown that the channel-sensitivity distribution is the same under different initializations; does this also hold across different datasets? Pruning experiments should also factor in pruning speed.

Reading Notes: AutoML Survey

I recently read an AutoML survey, “Taking the Human out of Learning Applications: A Survey on Automated Machine Learning”. It summarizes AutoML examples as follows:

| application | industry | academic |
| --- | --- | --- |
| automated model selection | Auto-sklearn | [7], [8] |
| neural architecture search | Google’s Cloud | [9], [10] |
| automated feature engineering | Feature Labs | [11], [12] |

Examples of AutoML approaches in industry and academic.

The paper points out that NAS has already been used in Google’s Cloud AutoML. On automated feature engineering, existing work includes the Data Science Machine (DSM), ExploreKit, and FeatureHub, and there is even a commercial product, Feature Labs. DSM focuses on the relationships within a dataset, an idea that has something in common with considering NAS from the dataset angle.