Notes on DARTS and ProxylessNAS

I’ve been studying neural architecture search (NAS) recently, reading through DARTS and ProxylessNAS in detail, so here are my notes. The code for both papers is open source:

Core Idea

The basic idea of DARTS is to use a softmax to relax the discrete choice of operations into a continuous weighted mixture, then differentiate with respect to the architecture parameters α and use gradient descent to optimize the network structure. This turns the originally non-differentiable discrete search problem into a continuous optimization problem.

Operation Set

DARTS has 8 candidate operations in total:

  • 3×3 separable convolution
  • 5×5 separable convolution
  • 3×3 dilated separable convolution
  • 5×5 dilated separable convolution
  • 3×3 max pooling
  • 3×3 average pooling
  • identity (skip connection)
  • zero (no connection)

Cell Structure

DARTS also searches only for a cell, then stacks cells into the full network.

A cell is a directed acyclic graph composed of several nodes, with 2 input nodes and 1 output node. The output node is the concat of all intermediate-node outputs. In the paper, each cell has 7 nodes; removing the input and output nodes leaves 4 intermediate nodes.

During the search phase, a cell is a fully connected directed graph—there is a “mixed operation” edge between any two nodes, where the mixed operation is the weighted sum of the outputs of all candidate operations, with the weights normalized from α via softmax. As a result, the feature maps and parameters of all operations must be kept in memory simultaneously, which incurs a large memory overhead. ProxylessNAS improves on exactly this point: instead of a mixed operation, it directly samples a single concrete operation according to the probabilities.

After the search is complete:

  • For each intermediate node, based on the “strength” of all edges pointing to it (defined as the weight value of the highest-weighted operation on that edge), keep the 2 strongest edges.
  • For each retained edge, take the argmax to keep the single operation with the highest weight, yielding the final discrete structure.

Bilevel Optimization

The gradient iteration alternates between the weights w and the architecture parameters α: w is updated on the training set, and α is updated on the validation set. The paper splits the CIFAR training set evenly into two halves, serving as the train and val sets here.

The gradient computation formula for the architecture parameters is as follows:

αLval(w,α)ξα,w2Ltrain(w,α)wLval(w,α)\nabla_\alpha \mathcal{L}_{val}(w', \alpha) - \xi\, \nabla^2_{\alpha,w}\mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha)

where w=wξwLtrain(w,α)w' = w - \xi \nabla_w \mathcal{L}_{train}(w,\alpha).

Here the Hessian matrix is computed quickly using an approximation:

α,w2Ltrain(w,α)wLval(w,α)αLtrain(w+,α)αLtrain(w,α)2ϵ\nabla^2_{\alpha,w}\mathcal{L}_{train}(w, \alpha)\, \nabla_{w'}\mathcal{L}_{val}(w', \alpha) \approx \frac{\nabla_\alpha \mathcal{L}_{train}(w^{+}, \alpha) - \nabla_\alpha \mathcal{L}_{train}(w^{-}, \alpha)}{2\epsilon}

where w±=w±ϵwLval(w,α)w^{\pm} = w \pm \epsilon\, \nabla_{w'}\mathcal{L}_{val}(w', \alpha).

Alternatively, one could try computing this gradient using the REINFORCE method, estimating the expectation based on Monte Carlo sampling.

Normal Cell and Reduction Cell

The search phase produces only one cell structure, but it is differentiated into two uses, the normal cell and the reduction cell: the normal cell keeps the feature-map size unchanged, while the reduction cell changes the first operation on the cell’s input nodes to stride=2, performing downsampling. In the full network, reduction cells are placed at the 1/3 and 2/3 depth positions, and normal cells everywhere else.

Search Cost and ImageNet Transfer

A full search takes about 1 day (on a single GTX 1080 Ti). For experimental stability, the paper ran it once each with 4 different random seeds, for 4 days in total.

After completing the search on CIFAR, the discovered architecture is transferred to ImageNet and retrained:

ProxylessNAS feels like it mainly builds on DARTS, while also running more comprehensive experiments. A few important contributions:

  1. It addresses DARTS’s excessive memory consumption.
  2. It does not use transfer from CIFAR to ImageNet, but searches directly on ImageNet.
  3. For updating the architecture parameters, it proposes two methods of computing the gradient: the BinaryConnect approach and the REINFORCE approach.
  4. It runs multi-objective search experiments targeting mobile platforms, treating latency as one of the optimization objectives, and predicts latency with a neural network so that it becomes differentiable.

Solving the Memory Problem

DARTS uses mixed operations, so the feature maps after all operations are stored, consuming a lot of memory. The idea of ProxylessNAS is:

  • When updating w: freeze the architecture parameters α, compute the sampling probability p of each operation from α, then sample a deterministic single-path structure and update only the parameters along that path.
  • When updating α: freeze w, compute the gradient of α, and perform gradient descent.

The way nodes are connected is the same as in DARTS: each intermediate node takes two predecessor nodes as input, and the output node is the concat of all intermediate-node outputs. The way the final structure is generated is also the same as DARTS: each operation takes the one with the highest weight.

But there is still a memory problem, because computing the gradient of α requires the following binarized weights:

L/gi\partial L / \partial g_i

The operation on each edge is determined by the formula below, where both g and o are kept in the computation graph—even though it is essentially sampling, the implementation is still mixed weights, just with the weights binarized, so the memory consumption is about the same as DARTS:

mOBinary(x)=i=1Ngioi(x)={o1(x)with probability p1oN(x)with probability pN.m^{\mathrm{Binary}}_{\mathcal{O}}(x) = \sum_{i=1}^{N} g_i\, o_i(x) = \begin{cases} o_1(x) & \text{with probability } p_1 \\ \quad\cdots & \\ o_N(x) & \text{with probability } p_N. \end{cases}

A more efficient scheme is: when updating α, instead of putting all N operations into the computation graph, draw a multinomial sample of 2 operations from the N operations according to α, temporarily compressing N down to 2, and then perform the parameter update with the implementation above. This means α is not updated for all operations at once, but rather updated in groups of 2 operations at a time.

Why not just sample 1 operation? Because the sampling operation itself is not in the computation graph—with only 1 sample there is no way to compute the gradient of α, whereas with 2 samples there is a way to form a comparison and compute the gradient direction of α. After updating the α values corresponding to these 2 operations, a rescaling is applied across all α so that the weights corresponding to the other α (obtained from α via softmax) stay unchanged. Through the accumulation of many pairwise comparisons, if one operation is more suitable than all the others, it will keep surviving the pairwise comparisons.

Searching Directly on ImageNet

Thanks to the memory optimization, ProxylessNAS can search directly on ImageNet, no longer relying on CIFAR as a proxy. Here it does not stack cells directly, but uses MobileNet V2 as the backbone framework and searches over the operation choices for each block within it.

Two Methods for Estimating the Architecture-Parameter Gradient

Method 1: The BinaryConnect approach

Borrowing the idea from BinaryConnect for taking gradients of discrete weights:

Lαi=j=1NLpjpjαij=1NLgjpjαi=j=1NLgj ⁣(exp(αj)kexp(αk))αi=j=1NLgjpj(δijpi)\frac{\partial L}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial p_j}\frac{\partial p_j}{\partial \alpha_i} \approx \sum_{j=1}^{N} \frac{\partial L}{\partial g_j}\frac{\partial p_j}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial g_j}\frac{\partial\!\left(\frac{\exp(\alpha_j)}{\sum_k \exp(\alpha_k)}\right)}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial g_j}\, p_j(\delta_{ij} - p_i)

Method 2: The REINFORCE approach

Use Monte Carlo sampling to estimate the expected gradient:

J(α)=Egα[R(Ng)]=ipiR(N(e=oi)),αJ(α)=iR(N(e=oi))αpi=iR(N(e=oi))piαlog(pi)=Egα[R(Ng)αlog(p(g))]1Mi=1MR(Ngi)αlog(p(gi))\begin{aligned} J(\alpha) &= \mathbb{E}_{g\sim\alpha}[R(\mathcal{N}_g)] = \sum_i p_i\, R(\mathcal{N}(e = o_i)), \\ \nabla_\alpha J(\alpha) &= \sum_i R(\mathcal{N}(e = o_i))\, \nabla_\alpha p_i = \sum_i R(\mathcal{N}(e = o_i))\, p_i\, \nabla_\alpha \log(p_i) \\ &= \mathbb{E}_{g\sim\alpha}[R(\mathcal{N}_g)\, \nabla_\alpha \log(p(g))] \approx \frac{1}{M}\sum_{i=1}^{M} R(\mathcal{N}_{g^i})\, \nabla_\alpha \log(p(g^i)) \end{aligned}

In the formula above, the part that computes the gradient can be computed directly using the analytic form of the softmax.

Bringing Latency into Differentiable Optimization

On a phone, a small neural network is trained to predict latency using a number of features (operation type, input/output feature-map sizes, and other hyperparameters). Latency thus becomes a differentiable function of α, and can be brought into multi-objective optimization together with the accuracy objective.

Experimental results of the multi-objective search:

This experiment used multiple optimization objectives—besides accuracy, also latency—and reached SOTA directly on accuracy. The final model’s performance: