Deriving the SVM (Part 2)

In the previous post (Part 1) we discussed the derivation of the hard-margin SVM and its dual form, whose dual problem can be simplified into the following form:

minα12i=1Nj=1Nαiαjyiyjxixji=1Nαis.t.i=1Nαiyi=0αi0i=1,2,...,N\begin{align*} min_ \alpha\quad &\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}\cdot x_{j}-\sum_{i=1}^{N}\alpha_{i}\\ s.t.\quad &\sum_{i=1}^{N}\alpha_{i}y_{i}=0\\ &\alpha_{i}\ge 0\\ &i=1,2,...,N \end{align*}

This problem can be regarded as a quadratic programming problem with α\alpha as the optimization variable. There are many mature methods for solving quadratic programming problems, and for SVM optimization the most efficient one is the SMO (Sequential Minimal Optimization) algorithm.

The SMO Sequential Optimization Algorithm

The SMO sequential optimization algorithm first initializes all the variables in α\alpha, for example by setting α1,α2,...,αN=0\alpha_{1},\alpha_{2},...,\alpha_{N}=0, and then treats two of the components of α\alpha as variables, for example α1,α2\alpha_{1},\alpha_{2} (when selecting the two components αi,αj\alpha_{i},\alpha_{j}, one usually first picks as αi\alpha_{i} the one that most severely violates the KKT conditions mentioned above, and then chooses as the second variable the αj\alpha_{j} corresponding to the xjx_{j} that is farthest in margin from xix_{i}), while keeping the remaining α3,α4,...,αN\alpha_{3},\alpha_{4},...,\alpha_{N} fixed. Then, by the constraint i=1Nαiyi=0\sum_{i=1}^{N}\alpha_{i}y_{i}=0, we obtain α1=y1i=2Nαiyi\alpha_{1}=-y_{1}\sum_{i=2}^{N}\alpha_{i}y_{i}. The above problem can thus be reduced to a quadratic programming problem in two variables (letting Kij=xixjK_{ij}=x_{i}\cdot x_{j}):

minα1,α2W(α1,α2)=12K11α12+12K22α22+y1y2K12α1α2(α1+α2)+y1α1i=3NyiαiKi1+y2α2i=3NyiαiKi2s.t.α1y1+α2y2=i=3Nyiαi=ζα1,α20\begin{align*} min_{\alpha_{1},\alpha_{2}}\quad W(\alpha_{1},\alpha_{2})=&\frac{1}{2}K_{11}\alpha_{1}^{2}+\frac{1}{2}K_{22}\alpha_{2}^{2}+y_{1}y_{2}K_{12}\alpha_{1}\alpha_{2}\\ &-(\alpha_{1}+\alpha_{2})+y_{1}\alpha_{1}\sum_{i=3}^{N}y_{i}\alpha_{i}K_{i1}+y_{2}\alpha_{2}\sum_{i=3}^{N}y_{i}\alpha_{i}K_{i2}\\ s.t.\quad &\alpha_{1}y_{1}+\alpha_{2}y_{2}=-\sum_{i=3}^{N}y_{i}\alpha_{i}=\zeta\\ &\alpha_{1},\alpha_{2}\ge0 \end{align*}

In the above quadratic programming problem, since α1y1+α2y2=ζ\alpha_{1}y_{1}+\alpha_{2}y_{2}=\zeta, we obtain α1=(ζy2α2)y1\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}. Substituting this constraint into W(α1,α2)W(\alpha_{1},\alpha_{2}) yields a single-variable quadratic programming problem. If we set aside the inequality constraints for the moment, we can obtain a closed-form solution directly, without resorting to numerical methods, which greatly improves computational speed.

Letting vi=j=3NαjyjK(xi,xj)v_{i}=\sum_{j=3}^{N}\alpha_{j}y_{j}K(x_{i},x_{j}), substituting α1=(ζy2α2)y1\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1} into W(α1,α2)W(\alpha_{1},\alpha_{2}) gives:

W(α2)=12K11(ζα2y2)2+12K22α22+y2K12(ζα2y2)α2(ζα2y2)y1α2+v1(ζα2y2)+y2v2α2W(\alpha_{2})=\frac{1}{2}K_{11}(\zeta-\alpha_{2}y_{2})^{2}+\frac{1}{2}K_{22}\alpha_{2}^{2}+y_{2}K_{12}(\zeta-\alpha_{2}y_{2})\alpha_{2}-(\zeta-\alpha_{2}y_{2})y_{1}-\alpha_{2}+v_{1}(\zeta-\alpha_{2}y_{2})+y_{2}v_{2}\alpha_{2}

Setting Wα2=0\frac{\partial W}{\partial\alpha_{2}}=0 directly, we obtain the closed-form solution for α2\alpha_{2} as α^2=α2+y2(E1E2)η\hat\alpha_{2}=\alpha_{2}+\frac{y_{2}(E_{1}-E_{2})}{\eta}, where Ei=j=1NαjyjKij+byiE_{i}=\sum_{j=1}^{N}\alpha_{j}y_{j}K_{ij}+b-y_{i} and η=K11+K222K12\eta=K_{11}+K_{22}-2K_{12}. The α^2\hat\alpha_{2} obtained here does not yet account for the inequality constraints α1,α20\alpha_{1},\alpha_{2}\ge 0. From α1=(ζy2α2)y10\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}\ge0 together with α20\alpha_{2}\ge0, we can solve the inequalities to obtain the upper bound HH and lower bound LL of α2\alpha_2; that is, after clipping we obtain the closed-form solution for α2\alpha_{2} as:

α2={H,α^2>Hα^2,Lα^2HL,α^2<L\alpha_{2}^{*}= \begin{cases} H,\quad \hat\alpha_{2}>H\\ \hat\alpha_{2},\quad L\le\hat\alpha_{2}\le H\\ L,\quad \hat\alpha_{2}<L \end{cases}

In addition, from α1=(ζy2α2)y1\alpha_{1}^{*}=(\zeta-y_{2}\alpha_{2}^{*})y_{1} we can obtain α1\alpha_{1}^{*}, which completes the update of one group of variables in the SMO algorithm. Repeating the process of variable selection, closed-form solving, and variable clipping until all variables of α\alpha satisfy the KKT conditions from Part 1, we can then use the formulas for ww and bb given in Part 1 to obtain the trained hyperplane. This completes the mathematical derivation of the hard-margin SVM. Future posts will continue to introduce the derivation of the soft-margin SVM and the application of kernel methods. To be continued…