Derivation of SVM (1)

SVM is a classic machine learning method, with variants including the hard-margin SVM, the soft-margin SVM, and kernelized SVMs obtained via the kernel trick. This article focuses on the derivation of the hard-margin SVM.

Assuming the two classes of sample points can be perfectly separated, a hard-margin SVM can be used for classification. Suppose the equation of the separating hyperplane is $w\cdot x+b=0$. For each sample point $x_{i}$, the quantity $|w\cdot x_{i}+b|$ measures, up to a scaling factor, how far the point lies from the hyperplane. If the points on the side where $w\cdot x+b>0$ are labeled positive, i.e., $y_{i}=+1$, and the points on the side where $w\cdot x+b<0$ are labeled negative, i.e., $y_{i}=-1$, then this (unnormalized) distance from a sample point to the separating hyperplane can be expressed as $\hat{\gamma}_{i}=y_{i}(w\cdot x_{i}+b)$, known as the functional margin.

Let $\hat{\gamma}=\min_{i}\hat{\gamma}_{i}$, the minimum functional margin. Note that when both $w$ and $b$ are scaled by the same factor, the functional margin is scaled by that factor as well, while the hyperplane itself remains unchanged. It is therefore necessary to normalize $w$, for example by requiring $\|w\|=1$. Equivalently, we can define the geometric margin $\gamma_{i}=y_{i}\left(\frac{w}{\|w\|}\cdot x_{i}+\frac{b}{\|w\|}\right)$ and let $\gamma=\min_{i}\gamma_{i}$, which gives $\gamma=\frac{\hat{\gamma}}{\|w\|}$.
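As a quick sanity check on these definitions, the sketch below computes functional and geometric margins with NumPy for a hypothetical hyperplane and a few made-up sample points (the values of `w`, `b`, and the data are illustrative assumptions, not taken from the derivation):

```python
import numpy as np

# Hypothetical hyperplane w·x + b = 0 (values chosen only for illustration)
w = np.array([0.5, 0.5])
b = -2.0

# Made-up sample points and labels
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# Functional margin of each point: gamma_hat_i = y_i * (w·x_i + b)
gamma_hat = y * (X @ w + b)

# Geometric margin: functional margin divided by ||w||
gamma = gamma_hat / np.linalg.norm(w)

print("functional margins:", gamma_hat)          # [1.  1.5 1. ]
print("geometric margins :", gamma)
print("minimum functional margin:", gamma_hat.min())
```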

The optimization problem of maximizing the geometric margin can then be expressed as:

$$
\begin{align*}
&\max_{w,b}\quad \gamma \\
&\text{s.t.}\quad y_{i}\left(\frac{w}{\|w\|}\cdot x_{i}+\frac{b}{\|w\|}\right)\ge\gamma,\ i=1,2,\dots,n
\end{align*}
$$

This is equivalent to:

$$
\begin{align*}
&\max_{w,b}\quad \frac{\hat{\gamma}}{\|w\|} \\
&\text{s.t.}\quad y_{i}(w\cdot x_{i}+b)\ge\hat{\gamma},\ i=1,2,\dots,n
\end{align*}
$$

Note that the particular value of the functional margin $\hat{\gamma}$ does not affect the solution of the optimization problem: rescaling $w$ and $b$ rescales $\hat{\gamma}$ by the same factor without changing the hyperplane, so we are free to set $\hat{\gamma}=1$. Also, maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^{2}$. Thus, the problem can be reformulated as:

$$
\begin{align*}
&\min_{w,b}\quad \frac{1}{2}\|w\|^{2}\\
&\text{s.t.}\quad y_{i}(w\cdot x_{i}+b)-1\ge 0,\ i=1,2,\dots,n
\end{align*}
$$
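This primal problem is a standard convex quadratic program, so it can be handed directly to a generic solver. Below is a minimal sketch using the cvxpy library on a made-up, linearly separable toy dataset (the data and the choice of cvxpy are assumptions for illustration, not part of the derivation):

```python
import cvxpy as cp
import numpy as np

# Made-up, linearly separable toy data (illustrative assumption)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i * (w·x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

# For this toy data the solver should return roughly w* ≈ [0.5, 0.5], b* ≈ -2
print("w* =", w.value)
print("b* =", b.value)
```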

This is an optimization problem with inequality constraints, which can be solved using the Lagrange multiplier method and the KKT conditions. Let $\alpha=[\alpha_{1},\alpha_{2},\alpha_{3},\dots,\alpha_{n}]^{T}$ and construct the Lagrangian function:

$$
L(w,b,\alpha)=\frac{1}{2}\|w\|^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i}
$$

Assume $w^{*},b^{*},\alpha^{*}$ are the optimal solutions. According to the KKT conditions, they must satisfy the following equations:

$$
\begin{align}
&\nabla_{w}L(w^{*},b^{*},\alpha^{*})=w^{*}-\sum_{i=1}^{N}\alpha^{*}_{i}y_{i}x_{i}=0\tag{1}\\
&\nabla_{b}L(w^{*},b^{*},\alpha^{*})=-\sum_{i=1}^{N}\alpha_{i}^{*}y_{i}=0\tag{2}\\
&\alpha_{i}^{*}\left(y_{i}(w^{*}\cdot x_{i}+b^{*})-1\right)=0\tag{3}\\
&y_{i}(w^{*}\cdot x_{i}+b^{*})-1\ge 0\tag{4}\\
&\alpha_{i}^{*}\ge 0\tag{5}
\end{align}
$$

The KKT conditions consist of several parts:

1. The gradient of the Lagrangian with respect to the original optimization variables is zero, as in (1) and (2).
2. Complementary slackness: the product of each Lagrange multiplier and the corresponding inequality constraint (in standard form) is zero, as in (3).
3. The constraints of the original problem, as in (4).
4. The non-negativity of the Lagrange multipliers, as in (5).

In particular, condition (3) implies that $\alpha_{i}^{*}>0$ is only possible when $y_{i}(w^{*}\cdot x_{i}+b^{*})=1$, i.e., for points lying exactly on the margin (the support vectors).

From (1) and (2), we obtain:

$$
w^{*}=\sum_{i=1}^{N}\alpha_{i}^{*}y_{i}x_{i},\qquad \sum_{i=1}^{N}\alpha_{i}^{*}y_{i}=0
$$

Using the Lagrangian, the original constrained problem can be written equivalently as the min-max problem:

$$
\min_{w,b}\ \max_{\alpha}\ \frac{1}{2}\|w\|^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i},\qquad \alpha_{i}\ge 0
$$
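To see why this min-max form reproduces the original problem, note that for fixed $(w,b)$,

$$
\max_{\alpha\ge 0}\ L(w,b,\alpha)=
\begin{cases}
\frac{1}{2}\|w\|^{2}, & \text{if } y_{i}(w\cdot x_{i}+b)-1\ge 0 \text{ for all } i,\\
+\infty, & \text{otherwise,}
\end{cases}
$$

so minimizing over $(w,b)$ recovers exactly the constrained problem above.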

By Lagrangian duality (strong duality holds for this convex problem with affine constraints), this is equivalent to the max-min problem:

$$
\max_{\alpha}\ \min_{w,b}\ \frac{1}{2}\|w\|^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i},\qquad \alpha_{i}\ge 0
$$

For the inner problem $\min_{w,b}\ \frac{1}{2}\|w\|^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i}$, setting the gradients with respect to $w$ and $b$ to zero (as in (1) and (2)) gives $w=\sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}$ and $\sum_{i=1}^{N}\alpha_{i}y_{i}=0$. Substituting these back into the expression above, the problem becomes:

$$
\begin{align*}
\max_{\alpha}\quad&-\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})+\sum_{i}\alpha_{i}\\
\text{s.t.}\quad&\sum_{i}\alpha_{i}y_{i}=0,\\
&\alpha_{i}\ge 0,\ i=1,2,\dots,n
\end{align*}
$$
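For completeness, the substitution works out as follows. Plugging $w=\sum_{i}\alpha_{i}y_{i}x_{i}$ into the Lagrangian and using $\sum_{i}\alpha_{i}y_{i}=0$:

$$
\begin{align*}
L&=\frac{1}{2}\Big\|\sum_{i}\alpha_{i}y_{i}x_{i}\Big\|^{2}
-\sum_{i}\alpha_{i}y_{i}\Big(\sum_{j}\alpha_{j}y_{j}x_{j}\cdot x_{i}+b\Big)
+\sum_{i}\alpha_{i}\\
&=\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})
-\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})
-b\sum_{i}\alpha_{i}y_{i}
+\sum_{i}\alpha_{i}\\
&=-\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})+\sum_{i}\alpha_{i},
\end{align*}
$$

where the $b\sum_{i}\alpha_{i}y_{i}$ term vanishes because $\sum_{i}\alpha_{i}y_{i}=0$; this condition therefore remains as a constraint in the dual problem.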

After solving for $\alpha^{*}$, we can recover $w^{*}$ using $w^{*}=\sum_{i=1}^{N}\alpha_{i}^{*}y_{i}x_{i}$. For $b^{*}$, note that by complementary slackness (3), any index $j$ with $\alpha_{j}^{*}\neq 0$ corresponds to a support vector satisfying $y_{j}(w^{*}\cdot x_{j}+b^{*})=1$, so we can pick any such $j$ and solve for $b^{*}=y_{j}-w^{*}\cdot x_{j}$ (using $y_{j}^{2}=1$). This gives the separating hyperplane $w^{*}\cdot x+b^{*}=0$.
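To tie the pieces together, here is a minimal sketch that solves the dual with cvxpy on the same made-up toy data and then recovers $w^{*}$ and $b^{*}$ from a support vector (again an illustration under assumed data, not a production implementation):

```python
import cvxpy as cp
import numpy as np

# Same made-up toy data as before (illustrative assumption)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)

# Dual objective: sum_i alpha_i - (1/2) ||sum_i alpha_i y_i x_i||^2,
# which equals -(1/2) sum_ij alpha_i alpha_j y_i y_j (x_i·x_j) + sum_i alpha_i
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value                      # expected here: roughly [0.25, 0, 0.25]
w = (a * y) @ X                      # w* = sum_i alpha_i* y_i x_i
j = int(np.argmax(a))                # an index with alpha_j* > 0 (a support vector)
b = y[j] - w @ X[j]                  # b* = y_j - w*·x_j

print("alpha* =", a)
print("w* =", w, "b* =", b)          # roughly w* ≈ [0.5, 0.5], b* ≈ -2
```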

The next section will introduce the SMO algorithm for solving this optimization problem for $\alpha^{*}$. To be continued…