Derivation of SVM (1)
SVM is a classic machine learning method with several variants, including the hard-margin SVM, the soft-margin SVM, and kernelized SVMs. This article focuses on the derivation of the hard-margin SVM.
Assume the two classes of sample points can be perfectly separated by a hyperplane; then a hard-margin SVM can be used for classification. Let the equation of the separating hyperplane be $w \cdot x + b = 0$. For each sample point $x_i$, the quantity $|w \cdot x_i + b|$ indicates, up to the scale of $w$, how far $x_i$ lies from the hyperplane. If points with $w \cdot x_i + b > 0$ are labeled positive, i.e. $y_i = +1$, and points with $w \cdot x_i + b < 0$ are labeled negative, i.e. $y_i = -1$, then the (signed, scale-dependent) distance of a sample point from the separating hyperplane can be expressed as $\hat{\gamma}_i = y_i (w \cdot x_i + b)$, known as the functional margin.
Let $\hat{\gamma} = \min_i \hat{\gamma}_i$ denote the minimum functional margin. Note that scaling $w$ and $b$ by the same factor scales the functional margin by that factor while leaving the hyperplane unchanged. We therefore need to normalize $w$, for example by requiring $||w|| = 1$. Equivalently, we can redefine the distance as $\gamma_i = y_i \left( \frac{w}{||w||} \cdot x_i + \frac{b}{||w||} \right)$, called the geometric margin, and let $\gamma = \min_i \gamma_i$, which gives $\gamma = \frac{\hat{\gamma}}{||w||}$.
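As a quick numerical illustration, the snippet below computes the functional and geometric margins for a few labeled points under a fixed hyperplane; the data points and the values of $w$ and $b$ are hypothetical, chosen only for demonstration.

```python
import numpy as np

# A candidate hyperplane w.x + b = 0 and a few labeled points (hypothetical values).
w = np.array([0.5, 0.5])
b = -2.0
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

functional_margins = y * (X @ w + b)                         # gamma_hat_i = y_i (w . x_i + b)
geometric_margins = functional_margins / np.linalg.norm(w)   # gamma_i = gamma_hat_i / ||w||

gamma_hat = functional_margins.min()   # minimum functional margin
gamma = geometric_margins.min()        # minimum geometric margin = gamma_hat / ||w||
print(gamma_hat, gamma)
```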
The optimization problem of maximizing the separation margin can be expressed as:
$$\max_{w,b} \; \gamma \quad \text{s.t.} \quad y_i \left( \frac{w}{||w||} \cdot x_i + \frac{b}{||w||} \right) \ge \gamma, \quad i = 1, 2, \dots, n$$
This is equivalent to:
$$\max_{w,b} \; \frac{\hat{\gamma}}{||w||} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \hat{\gamma}, \quad i = 1, 2, \dots, n$$
Note that the value of the functional margin $\hat{\gamma}$ does not affect the solution: rescaling $w$ and $b$ rescales $\hat{\gamma}$ by the same factor without changing the hyperplane, so we may fix $\hat{\gamma} = 1$. Also, maximizing $\frac{1}{||w||}$ is equivalent to minimizing $\frac{1}{2}||w||^2$. Thus, the problem can be reformulated as:
$$\min_{w,b} \; \frac{1}{2}||w||^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \dots, n$$
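This primal problem is a small quadratic program, so for intuition it can be solved directly with an off-the-shelf constrained optimizer. The sketch below uses SciPy's SLSQP on a hypothetical three-point dataset; for this data the optimizer should converge to approximately $w^* = (0.5, 0.5)$, $b^* = -2$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example points).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# Decision variables packed as v = [w_1, w_2, b].
def objective(v):
    w = v[:-1]
    return 0.5 * np.dot(w, w)  # (1/2)||w||^2

constraints = [
    # y_i (w . x_i + b) - 1 >= 0 for every sample i
    {"type": "ineq", "fun": lambda v, i=i: y[i] * (X[i] @ v[:-1] + v[-1]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method="SLSQP", constraints=constraints)
w_star, b_star = res.x[:-1], res.x[-1]
print(w_star, b_star)
```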
This is an optimization problem with inequality constraints, which can be solved using the Lagrange multiplier method and the KKT conditions. Let $\alpha = [\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_n]^T$ with $\alpha_i \ge 0$, and construct the Lagrangian function:

$$L(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$$
Assuming $w^*, b^*, \alpha^*$ are the optimal solutions, according to the KKT conditions they must satisfy the following equations:

$$\begin{aligned}
&\nabla_w L(w^*, b^*, \alpha^*) = w^* - \sum_{i=1}^{N} \alpha_i^* y_i x_i = 0 && (1) \\
&\nabla_b L(w^*, b^*, \alpha^*) = -\sum_{i=1}^{N} \alpha_i^* y_i = 0 && (2) \\
&\alpha_i^* \left( y_i (w^* \cdot x_i + b^*) - 1 \right) = 0 && (3) \\
&y_i (w^* \cdot x_i + b^*) - 1 \ge 0 && (4) \\
&\alpha_i^* \ge 0 && (5)
\end{aligned}$$
The KKT conditions include several aspects:
1. The gradient of the Lagrangian with respect to the original optimization variables is zero, as in (1) and (2).
2. Complementary slackness: the product of each Lagrange multiplier and the left side of the corresponding inequality constraint (in standard form) is zero, as in (3).
3. The original problem's constraints, as in (4).
4. The non-negativity of the Lagrange multipliers, as in (5).
From (1) and (2), we obtain:
$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i^* y_i = 0$$
The original constrained problem can be rewritten as a min-max problem over the Lagrangian:
$$\min_{w,b} \; \max_{\alpha} \; \frac{1}{2}||w||^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i, \quad \alpha_i \ge 0$$
By Lagrangian duality (strong duality holds here, since the problem is convex and the separable data make the constraints strictly feasible), this is equivalent to:
$$\max_{\alpha} \; \min_{w,b} \; \frac{1}{2}||w||^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i, \quad \alpha_i \ge 0$$
The inner minimization $\min_{w,b} \; \frac{1}{2}||w||^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$ is solved by setting the gradients with respect to $w$ and $b$ to zero, as in KKT conditions (1) and (2), which gives $w^* = \sum_i^N \alpha_i^* y_i x_i$ and $\sum_i^N \alpha_i^* y_i = 0$. Substituting these back into the Lagrangian eliminates $w$ and $b$, as spelled out below.
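Writing out the substitution explicitly:

$$\begin{aligned}
L(w^*, b, \alpha)
&= \frac{1}{2}\Big\|\sum_i \alpha_i y_i x_i\Big\|^2 - \sum_i \alpha_i y_i \Big( \Big( \sum_j \alpha_j y_j x_j \Big) \cdot x_i + b \Big) + \sum_i \alpha_i \\
&= \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
&= -\frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_i \alpha_i
\end{aligned}$$

where the $-b \sum_i \alpha_i y_i$ term vanishes because $\sum_i \alpha_i y_i = 0$. The original problem therefore becomes: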
$$\max_{\alpha} \; -\frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_i \alpha_i \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0$$
After solving for $\alpha^*$, we can determine $w^*$ using $w^* = \sum_i^N \alpha_i^* y_i x_i$. To find $b^*$, select any $j$ with $\alpha_j^* > 0$; complementary slackness (3) then forces $y_j (w^* \cdot x_j + b^*) = 1$, and multiplying both sides by $y_j$ (using $y_j^2 = 1$) gives $b^* = y_j - w^* \cdot x_j$. Note that $w^*$ and $b^*$, and hence the separating hyperplane, depend only on the samples with $\alpha_i^* > 0$, which are called the support vectors. This gives us the separating hyperplane $w^* \cdot x + b^* = 0$.
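Before turning to SMO, here is a minimal sketch that solves this small dual directly with a generic constrained optimizer and then recovers $w^*$ and $b^*$; the three data points are the same hypothetical ones as above, and SciPy's SLSQP stands in for a dedicated QP or SMO solver.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example points).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
n = len(y)

# P_ij = y_i y_j (x_i . x_j), so the dual objective is -(1/2) a^T P a + sum(a).
P = (y[:, None] * X) @ (y[:, None] * X).T

# Maximize the dual by minimizing its negative.
def neg_dual(a):
    return 0.5 * a @ P @ a - a.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

# Recover w* = sum_i alpha_i y_i x_i and b* = y_j - w* . x_j for some alpha_j > 0.
w_star = (alpha * y) @ X
j = int(np.argmax(alpha))        # index of a support vector (largest alpha)
b_star = y[j] - X[j] @ w_star
print(alpha, w_star, b_star)
```

For this toy data the solution should come out near $\alpha^* \approx (0.25, 0, 0.25)$, i.e. $x_1$ and $x_3$ are the support vectors, giving $w^* \approx (0.5, 0.5)$ and $b^* \approx -2$, matching the primal solution above.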
We will introduce the SMO algorithm for solving this optimization problem for $\alpha^*$ in the next section.
To be continued…