Deriving the SVM (Part 2)

In the previous post (Part 1) we derived the hard-margin SVM and its dual. The dual problem simplifies to the following form:

\begin{align*} \min_{\alpha}\quad &\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}\,x_{i}\cdot x_{j}-\sum_{i=1}^{N}\alpha_{i}\\ \text{s.t.}\quad &\sum_{i=1}^{N}\alpha_{i}y_{i}=0\\ &\alpha_{i}\ge 0,\quad i=1,2,\dots,N \end{align*}

This is a quadratic programming problem with $\alpha$ as the optimization variable. Many mature general-purpose methods exist for solving quadratic programs, but for SVM training one of the most efficient and widely used is the SMO (Sequential Minimal Optimization) algorithm.
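To make the objective and constraints concrete, they can be evaluated numerically. The following is a minimal NumPy sketch rather than part of the original derivation; the function names `dual_objective` and `is_feasible`, the tolerance `tol`, and the two-point toy data set are assumptions made for illustration.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Value of 1/2 * sum_ij a_i a_j y_i y_j (x_i . x_j) - sum_i a_i."""
    K = X @ X.T            # Gram matrix, K[i, j] = x_i . x_j
    u = alpha * y          # elementwise alpha_i * y_i
    return 0.5 * u @ K @ u - alpha.sum()

def is_feasible(alpha, y, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and alpha_i >= 0 (up to a tolerance)."""
    return bool(abs(alpha @ y) <= tol and np.all(alpha >= -tol))

# Toy data (made up for illustration): one point per class on the x-axis.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

value = dual_objective(alpha, X, y)   # evaluates to -0.5 for this alpha
feasible = is_feasible(alpha, y)      # True: 0.5 * 1 + 0.5 * (-1) = 0
```

For this toy problem the separating hyperplane has $w=(1,0)$, so $\frac{1}{2}\lVert w\rVert^{2}=0.5$, which matches the negated dual value at the feasible point above.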

The SMO Algorithm

The SMO algorithm first initializes every component of $\alpha$, for example $\alpha_{1}=\alpha_{2}=\dots=\alpha_{N}=0$. At each step it treats two components of $\alpha$ as variables, say $\alpha_{1},\alpha_{2}$, while keeping the remaining $\alpha_{3},\alpha_{4},\dots,\alpha_{N}$ fixed. (When selecting the two components $\alpha_{i},\alpha_{j}$, one usually first picks as $\alpha_{i}$ the one that most severely violates the KKT conditions from Part 1, and then chooses as the second variable the $\alpha_{j}$ corresponding to the $x_{j}$ that is farthest in margin from $x_{i}$.) Two is the smallest number of variables that can change at once: by the constraint $\sum_{i=1}^{N}\alpha_{i}y_{i}=0$ we have $\alpha_{1}=-y_{1}\sum_{i=2}^{N}\alpha_{i}y_{i}$, so a single component is fully determined once all the others are fixed. The problem above can thus be reduced to a quadratic programming problem in two variables (letting $K_{ij}=x_{i}\cdot x_{j}$ and dropping terms that do not depend on $\alpha_{1},\alpha_{2}$):

\begin{align*} \min_{\alpha_{1},\alpha_{2}}\quad W(\alpha_{1},\alpha_{2})=&\frac{1}{2}K_{11}\alpha_{1}^{2}+\frac{1}{2}K_{22}\alpha_{2}^{2}+y_{1}y_{2}K_{12}\alpha_{1}\alpha_{2}\\ &-(\alpha_{1}+\alpha_{2})+y_{1}\alpha_{1}\sum_{i=3}^{N}y_{i}\alpha_{i}K_{i1}+y_{2}\alpha_{2}\sum_{i=3}^{N}y_{i}\alpha_{i}K_{i2}\\ \text{s.t.}\quad &\alpha_{1}y_{1}+\alpha_{2}y_{2}=-\sum_{i=3}^{N}y_{i}\alpha_{i}=\zeta\\ &\alpha_{1},\alpha_{2}\ge0 \end{align*}
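As a sanity check on this reduction, the two-variable function $W(\alpha_{1},\alpha_{2})$ should differ from the full dual objective only by a constant that does not depend on $\alpha_{1},\alpha_{2}$. The sketch below tests that numerically; the helper names, the random data, and the seed are all assumptions made for illustration.

```python
import numpy as np

def full_objective(alpha, K, y):
    """Full dual objective for a precomputed Gram matrix K."""
    u = alpha * y
    return 0.5 * u @ K @ u - alpha.sum()

def reduced_objective(a1, a2, alpha, K, y):
    """W(alpha_1, alpha_2), with alpha_3..alpha_N read from `alpha`."""
    rest = alpha[2:] * y[2:]          # y_i * alpha_i for i >= 3
    v1 = rest @ K[2:, 0]              # sum_{i>=3} y_i alpha_i K_{i1}
    v2 = rest @ K[2:, 1]              # sum_{i>=3} y_i alpha_i K_{i2}
    return (0.5 * K[0, 0] * a1**2 + 0.5 * K[1, 1] * a2**2
            + y[0] * y[1] * K[0, 1] * a1 * a2
            - (a1 + a2) + y[0] * a1 * v1 + y[1] * a2 * v2)

# Random illustrative data: 5 samples in 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
K = X @ X.T
alpha = rng.uniform(0.0, 1.0, size=5)

def gap(a1, a2):
    """Full objective minus W when only the first two components change."""
    trial = alpha.copy()
    trial[0], trial[1] = a1, a2
    return full_objective(trial, K, y) - reduced_objective(a1, a2, alpha, K, y)

# The gap should be the same number for every choice of (a1, a2).
gaps = [gap(0.0, 0.0), gap(0.3, 0.7), gap(1.2, 0.1)]
```

The check deliberately ignores the equality constraint: the identity between the two objectives holds for any values of the first two components, so minimizing $W$ over the feasible set is equivalent to minimizing the full objective there.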

In the above quadratic programming problem, multiplying the constraint $\alpha_{1}y_{1}+\alpha_{2}y_{2}=\zeta$ by $y_{1}$ and using $y_{1}^{2}=1$ gives $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}$. Substituting this into $W(\alpha_{1},\alpha_{2})$ yields a quadratic programming problem in the single variable $\alpha_{2}$. If we set aside the inequality constraints for the moment, its minimizer has a closed form, so no iterative numerical solver is needed for the subproblem, which greatly improves computational speed.

Letting $v_{i}=\sum_{j=3}^{N}\alpha_{j}y_{j}K_{ij}$ for $i=1,2$, substituting $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}$ into $W(\alpha_{1},\alpha_{2})$ gives:

$$W(\alpha_{2})=\frac{1}{2}K_{11}(\zeta-\alpha_{2}y_{2})^{2}+\frac{1}{2}K_{22}\alpha_{2}^{2}+y_{2}K_{12}(\zeta-\alpha_{2}y_{2})\alpha_{2}-(\zeta-\alpha_{2}y_{2})y_{1}-\alpha_{2}+v_{1}(\zeta-\alpha_{2}y_{2})+y_{2}v_{2}\alpha_{2}$$
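The step from $W(\alpha_{2})$ to the closed-form update is short but easy to get wrong, so here is one way to spell it out. The notation $\alpha^{\text{old}}$ (values before the update) and $f(x_{i})=\sum_{j=1}^{N}\alpha^{\text{old}}_{j}y_{j}K_{ij}+b$ is introduced here for the sketch; it uses $y_{i}^{2}=1$ throughout.

```latex
\begin{align*}
\frac{\partial W}{\partial\alpha_{2}}
  &= -K_{11}y_{2}(\zeta-\alpha_{2}y_{2}) + K_{22}\alpha_{2} + y_{2}K_{12}\zeta - 2K_{12}\alpha_{2}
     + y_{1}y_{2} - 1 - y_{2}v_{1} + y_{2}v_{2} = 0 \\
\Rightarrow\quad (K_{11}+K_{22}-2K_{12})\,\alpha_{2}
  &= y_{2}\bigl(y_{2}-y_{1}+\zeta K_{11}-\zeta K_{12}+v_{1}-v_{2}\bigr) \\
\text{with}\quad \zeta &= \alpha^{\text{old}}_{1}y_{1}+\alpha^{\text{old}}_{2}y_{2},
  \qquad v_{i} = f(x_{i})-\alpha^{\text{old}}_{1}y_{1}K_{i1}-\alpha^{\text{old}}_{2}y_{2}K_{i2}-b \\
\Rightarrow\quad (K_{11}+K_{22}-2K_{12})\,\alpha_{2}
  &= y_{2}\bigl[(f(x_{1})-y_{1})-(f(x_{2})-y_{2})\bigr]
     + (K_{11}+K_{22}-2K_{12})\,\alpha^{\text{old}}_{2}
\end{align*}
```

Dividing both sides by $K_{11}+K_{22}-2K_{12}$ expresses the new $\alpha_{2}$ as the old value plus a correction proportional to the difference of the two prediction errors $f(x_{i})-y_{i}$.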

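Setting $\frac{\partial W}{\partial\alpha_{2}}=0$ and solving, we obtain the unconstrained closed-form solution $\hat\alpha_{2}=\alpha_{2}^{\text{old}}+\frac{y_{2}(E_{1}-E_{2})}{\eta}$, where $\alpha_{2}^{\text{old}}$ is the value of $\alpha_{2}$ before the update, $E_{i}=\sum_{j=1}^{N}\alpha_{j}y_{j}K_{ij}+b-y_{i}$ is the prediction error on $x_{i}$ under the current $\alpha$ and $b$, and $\eta=K_{11}+K_{22}-2K_{12}=\lVert x_{1}-x_{2}\rVert^{2}$, which must be positive for the formula to apply. The $\hat\alpha_{2}$ obtained here does not yet account for the inequality constraints $\alpha_{1},\alpha_{2}\ge 0$. From $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}\ge0$ together with $\alpha_{2}\ge0$, we can solve the inequalities to obtain the lower bound $L$ and upper bound $H$ of $\alpha_2$: if $y_{1}\ne y_{2}$ then $L=\max(0,\alpha_{2}^{\text{old}}-\alpha_{1}^{\text{old}})$ and $H=+\infty$, and if $y_{1}=y_{2}$ then $L=0$ and $H=\alpha_{1}^{\text{old}}+\alpha_{2}^{\text{old}}$. After clipping to $[L,H]$ we obtain the solution for $\alpha_{2}$ as: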

$$\alpha_{2}^{*}= \begin{cases} H, & \hat\alpha_{2}>H\\ \hat\alpha_{2}, & L\le\hat\alpha_{2}\le H\\ L, & \hat\alpha_{2}<L \end{cases}$$
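The closed-form step and the clipping rule can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a full SMO implementation: the function name `smo_pair_update`, the guard `eps` for a non-positive $\eta$, and the toy data are hypothetical, and the bounds are the hard-margin ones that follow from $\alpha_{1},\alpha_{2}\ge0$ (no upper limit when the two labels differ).

```python
import numpy as np

def smo_pair_update(alpha, K, y, b, i=0, j=1, eps=1e-12):
    """One hard-margin SMO step on the pair (alpha_i, alpha_j); returns a new array."""
    f = (alpha * y) @ K + b            # decision values f(x_k), K is symmetric
    E = f - y                          # prediction errors E_k
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= eps:                     # degenerate pair (e.g. x_i == x_j): skip it
        return alpha.copy()
    # Feasible interval [L, H] for alpha_j implied by alpha_i >= 0, alpha_j >= 0.
    if y[i] != y[j]:                   # alpha_i - alpha_j stays constant
        L, H = max(0.0, alpha[j] - alpha[i]), np.inf
    else:                              # alpha_i + alpha_j stays constant
        L, H = 0.0, alpha[i] + alpha[j]
    a_j = alpha[j] + y[j] * (E[i] - E[j]) / eta   # unconstrained minimizer
    a_j = min(max(a_j, L), H)                     # clip into [L, H]
    zeta = alpha[i] * y[i] + alpha[j] * y[j]
    new = alpha.copy()
    new[j] = a_j
    new[i] = (zeta - y[j] * a_j) * y[i]           # keep sum_k alpha_k y_k unchanged
    return new

# Toy data (made up for illustration): one point per class on the x-axis.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
K = X @ X.T
alpha_new = smo_pair_update(np.zeros(2), K, y, b=0.0)   # gives [0.5, 0.5]
```

Starting from $\alpha=0$, a single step on this two-point problem lands on $\alpha=(0.5,0.5)$. The value of `b` does not affect the step, because it cancels in $E_{i}-E_{j}$.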

In addition, from $\alpha_{1}^{*}=(\zeta-y_{2}\alpha_{2}^{*})y_{1}$ we obtain $\alpha_{1}^{*}$, which completes the update of one pair of variables in the SMO algorithm. We repeat the process of variable selection, closed-form solving, and clipping until all components of $\alpha$ satisfy the KKT conditions from Part 1 (in practice, to within a small tolerance), and then use the formulas for $w$ and $b$ given in Part 1 to obtain the trained hyperplane. This completes the mathematical derivation of the hard-margin SVM. Future posts will cover the derivation of the soft-margin SVM and the application of kernel methods. To be continued…

2018 · 08 · 18