Milestones in Neural Architecture Search (NAS)

Let me recommend a survey (everyone keeps mentioning it, but I’ll recommend it anyway):

Neural Architecture Search: A Survey

Roughly following the timeline, the content is divided into a few parts: brute force wins, for the masses, and deployment.

Brute Force Wins

The first NAS work that became widely known should be Google’s NEURAL ARCHITECTURE SEARCH WITH REINFORCEMENT LEARNING [ICLR’17], which fully showcased Google’s strength and deep pockets (strike that). The idea is also simple and clear: previously, things like the number and size of filters in each layer had to be set by hand, so you can turn it into an optimization problem—lay out a few options and let it pick the best one. Combining these choices gives you a token that encodes a particular network architecture. To make it more flexible and support variable-length tokens, an RNN is used as the controller, and then policy gradient is used to maximize the expected reward of the networks the controller samples—namely, the validation accuracy.

Alt text
The idea is very direct and the results are good, but Google used 800 GPUs running for nearly a month (and only on CIFAR-10)… an instant turn-off.

Later, these heavyweights probably felt that running CIFAR-10 was too naive and that they should do the whole of ImageNet. But searching directly on ImageNet the way they did before would surely not run—forget 800 cards, even 8,000 wouldn’t cut it.

So Google put out another paper, Learning Transferable Architectures for Scalable Image Recognition [CVPR’18]. Instead of searching the whole structure, just search for a cell and stack it up—after all, VGG/Inception/ResNet are built that way too. So they reused the earlier idea, continuing to use an RNN and policy gradient to find the best structure in a new search space: searching for cells on CIFAR, then stacking them up to use on ImageNet. The search target only includes two kinds of cells, one a normal cell and the other a reduction cell used for downsampling. Each cell contains some blocks, and the new search space includes choosing which ops these blocks use, what to use as inputs, and so on. Much later work followed this NASNet Search Space approach.

Alt text
The final overall structure and the searched cell look like this—namely, NASNet.
Alt text
Performance is also very good. The best CIFAR-10 run reached a 2.19% error rate, and under the same parameter budget it surpassed hand-designed networks on ImageNet as well.
Alt text
This time was much faster than before: 500 GPUs ran only 4 days to finish (another turn-off).

Afterward, the heavyweights probably felt that since the tokens were now fixed-length, perhaps the RNN + policy gradient optimization approach could be swapped for some other algorithm (or maybe they felt that tuning reinforcement learning was a damn pain). So Google put out yet another paper, Regularized Evolution for Image Classifier Architecture Search [AAAI’19]. The search space follows the NASNet Search Space above, only the optimization algorithm is swapped for regularized evolution (for the specific algorithm, just read the paper… writing it out feels like too much content (actually I’m just lazy)). The searched structure is called AmoebaNet. Experiments show that evolution really does work better: faster convergence, better results, and not as painful to tune.

Alt text
The two on the right are on different datasets.

This AmoebaNet approach used 450 GPUs running for 7 days—still a turn-off…

Later, Google itself also felt that the earlier methods took too much time, so they came up with Progressive Neural Architecture Search [ECCV’18], whose main goal is speedup. By shrinking the search space and using a predictor function to predict network accuracy, the time was finally cut to about 225 GPU-days—a dozen-fold reduction compared to AmoebaNet—but for ordinary players it’s still a turn-off…

For the Masses

For the masses, this kind of brute-force play is basically unaffordable, but with the compute problem sitting right there, naturally some solutions emerged to speed up the process.

A fairly representative one is weight sharing to speed up validation, from Efficient Neural Architecture Search via Parameter Sharing [ICML’18] (huh, still Google’s Quoc V. Le and team). The rough idea is to turn NAS’s sampling process into sampling a subgraph within a DAG, where the operations (convolutions, etc.) in the sampled subgraph share parameters. The main difference from earlier methods is that in ENAS, a subgraph’s parameters are inherited from the DAG rather than reinitialized, so you don’t need to train from scratch to validate the subgraph’s accuracy—greatly accelerating the NAS process.

Alt text
Sampling a subgraph from the DAG. Here, 1 is the input node, and 3/6 are the output nodes.

To evaluate the performance of the sampled subgraph on the dataset, the DAG’s parameters (weights, etc.) need to be optimized. In addition, to find the best subgraph, the sampler (here an RNN) also needs to be optimized. ENAS alternately optimizes the parameters in the DAG and the parameters in the sampler. The DAG parameters are optimized by sampling a subgraph, then doing a forward-backward pass to compute the gradients of the corresponding operations, and then applying those gradients to the corresponding operations in the DAG to update the parameters via gradient descent. The sampler RNN is updated using the earlier policy gradient optimization approach.

Alt text
ENAS searches the entire CNN; the RNN samples the operation at each node and the predecessor node used as input.

ENAS proposed two modes (here I only introduce the CNN one): one searches the entire network structure, and the other searches for a cell and then stacks it up in the earlier fashion. When searching for a cell, it only takes one card running for half a day to finish. That said, among the recent ICLR’20 submissions there are many dissenting voices about weight sharing; if interested, you can take a look on openreview.

Another NAS work that runs very quickly searches in a differentiable way: DARTS: Differentiable Architecture Search [ICLR’19], which got modified in all sorts of ways this year (I suspect it’s very likely because the PyTorch code was open-sourced, and the code is written so beautifully that it’s relatively easy to modify). DARTS is very similar to ENAS, also finding subgraphs in a DAG and likewise adopting weight sharing; the main differences lie in the search space and search algorithm. The search space is fairly similar to the NASNet search space, while the search algorithm adopts a differentiable approach.

DARTS also searches for a cell and then stacks it up in a certain pattern. The biggest difference from earlier methods in the search algorithm is that when DARTS selects an operation, it doesn’t sample according to the RNN’s output probabilities as before; instead, it sums all operations weighted by their weights (not directly weighted—the weights are first normalized via softmax). This way the weights end up inside the computation graph, so after computing the validation loss you can backward through it to compute the gradients of the weights and directly optimize them via gradient descent. The search process is the process of optimizing the weights, and in the end, keeping the operation with the largest weight gives the final searched structure.

Alt text
A cell in DARTS; edges of different colors represent different operations, the shade of color represents different weights, and in the end the one with the largest weight is kept.

The optimization of the weights and the optimization of the weights within the DAG are similar to ENAS, also done alternately (optimizing the DAG weights on the training set and the architecture weights on the validation set). However, DARTS does forward-backward over the entire DAG (the downside being relatively large memory overhead) rather than first sampling a subgraph as ENAS does.

Algorithm 1: DARTS – Differentiable Architecture Search

Create a mixed operation ō^(i,j) parametrized by α^(i,j) for each edge (i,j)
while not converged do
  1. Update architecture α by descending
       ∇_α L_val( w − ξ ∇_w L_train(w, α),  α )
       (ξ = 0 if using first-order approximation)
  2. Update weights w by descending ∇_w L_train(w, α)
Derive the final architecture based on the learned α.

DARTS’s search algorithm—very beautiful work.

A single DARTS search only needs one 1080 Ti running for a day—no pressure at all for players among the masses, hahaha.

These two works let many players among the masses, who only have a few GPUs, afford NAS too.

Deployment

The NAS works introduced above rarely show any trace of hand-designed networks; they basically all redesign a new search space. From the experimental results, networks found by NAS can indeed achieve very good results, but they aren’t necessarily suitable for real deployment (for example, DARTS can reach very high accuracy with a very small number of parameters, but the actual forward pass is in fact very slow). Since there are already many excellent hand-designed network structures, NAS can introduce this prior knowledge to accelerate or achieve better real-deployment performance.

Let me first give the most intuitive example. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks [ICML’19] is simple and crude: since MobileNet is fast at the same accuracy, why not take it and scale it up a bit—would the accuracy be a bit higher at the same speed? The answer is yes.

The question is how to scale up this network. Deepening / widening / increasing resolution alone gives mediocre results, so EfficientNet searched for the relative scaling factors among width / depth / resolution and scaled them up according to a certain ratio. It turns out this approach (which I think should also count as NAS), while simple, is effective.

Alt text
Alt text
EfficientNet (a scaled-up MobileNetV2) has brutal performance, and the way the scaling is combined matters a lot.

There’s another work that modifies MobileNetV2, from Song Han’s group at MIT: ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [ICLR’19]. The point of this work isn’t modifying MobileNetV2 (kernel size, expansion ratio, etc.), but rather the proxyless philosophy. If you want the result for a certain dataset on a certain hardware, just search directly on the corresponding dataset and hardware, rather than, like earlier work, first searching on CIFAR-10 and then transferring to large datasets like ImageNet. It also adds a hardware-latency loss term to the objective function as a multi-objective NAS, directly targeting the deployment hardware.

Alt text

Another series of works also comes from Google: MnasNet: Platform-Aware Neural Architecture Search for Mobile [CVPR’19], used to modify the number of layers / convolution operations / adding attention, etc. in MobileNetV2, and then optimized using the same earlier RNN + policy gradient approach.

Alt text

Beyond that, NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [ECCV’18] compresses an existing network for specific hardware, which amounts to doing a structural fine-tune of the network on the hardware.

Alt text

Google later combined these two works and released Searching for MobileNetV3 [ICCV’19], taking the MobileNetV2 modified by MnasNet and further fine-tuning it with NetAdapt, then adding some engineering tricks to obtain the final MobileNetV3.

Alt text
Performance is very strong, surpassing both ProxylessNAS and MnasNet.

Finally, let me introduce one more recent work, also from Song Han’s group: Once for All: Train One Network and Specialize it for Efficient Deployment [Arxiv]. It directly trains one large network, and whichever kind of network you need you just take from it, decoupling the training process from the search process—equivalent to only needing to train once.

Alt text

The rather magical part is that after training, you can get very good results without fine-tuning, which should be related to the progressive shrinking training approach below. If you do fine-tune, the accuracy will be even higher; the version on openreview even beats MobileNetV3 by 4 points.

Alt text
progressive shrinking

It feels like a very good idea. Most of the time in NAS is spent on validating/comparing network accuracy, so spending all that time on training one large network and then picking the smaller network you need might be a more efficient / better-performing approach. Once for All also uses quite a few tricks, so I recommend everyone read the original paper.

Finally, let me mention one more conclusion from Rethinking the Value of Network Pruning [ICLR’19]: for structured pruning, the accuracy obtained from the Training---Pruning---Fine-tuning pipeline is not as high as training the target network from scratch. In other words, the structural design of the pruned target network is actually the most important thing—which is exactly what NAS is doing. That said, when training from scratch is fairly expensive, the pruning and fine-tuning strategies also matter a lot, striving to maintain the original accuracy with as little compute as possible.

Finally done! I’ll keep updating this whenever I come across interesting papers in the future~