Derivation of SVM (1)
SVM is a classic machine learning method with several variants, including the hard-margin SVM, the soft-margin SVM, and kernelized SVMs. This article focuses on the derivation of the hard-margin SVM.
Assume the two classes of sample points can be perfectly separated by a hyperplane; then a hard-margin SVM can be used for classification. Let the equation of the separating hyperplane be $w \cdot x + b = 0$. For each sample point $x_i$, the quantity $|w \cdot x_i + b|$ indicates, up to the scale of $w$, how far $x_i$ lies from the hyperplane. If points with $w \cdot x_i + b > 0$ are labeled positive, i.e. $y_i = +1$, and points with $w \cdot x_i + b < 0$ are labeled negative, i.e. $y_i = -1$, then the (signed, scale-dependent) distance of a sample point from the separating hyperplane can be expressed as $\hat{\gamma}_i = y_i (w \cdot x_i + b)$, known as the functional margin.
Let $\hat{\gamma} = \min_i \hat{\gamma}_i$ denote the minimum functional margin. Note that scaling $w$ and $b$ by the same factor scales the functional margin by that factor while leaving the hyperplane unchanged. We therefore need to normalize $w$, for example by requiring $||w|| = 1$. Equivalently, we can redefine the distance as $\gamma_i = y_i \left( \frac{w}{||w||} \cdot x_i + \frac{b}{||w||} \right)$, called the geometric margin, and let $\gamma = \min_i \gamma_i$, which gives $\gamma = \frac{\hat{\gamma}}{||w||}$.
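As a quick numerical illustration, the snippet below computes the functional and geometric margins for a few labeled points under a fixed hyperplane; the data points and the values of $w$ and $b$ are hypothetical, chosen only for demonstration.

```python
import numpy as np

# A candidate hyperplane w.x + b = 0 and a few labeled points (hypothetical values).
w = np.array([0.5, 0.5])
b = -2.0
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

functional_margins = y * (X @ w + b)                         # gamma_hat_i = y_i (w . x_i + b)
geometric_margins = functional_margins / np.linalg.norm(w)   # gamma_i = gamma_hat_i / ||w||

gamma_hat = functional_margins.min()   # minimum functional margin
gamma = geometric_margins.min()        # minimum geometric margin = gamma_hat / ||w||
print(gamma_hat, gamma)
```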
The optimization problem of maximizing the separation margin can be expressed as:
$$\max_{w,b} \; \gamma \quad \text{s.t.} \quad y_i \left( \frac{w}{||w||} \cdot x_i + \frac{b}{||w||} \right) \ge \gamma, \quad i = 1, 2, \dots, n$$
This is equivalent to:
$$\max_{w,b} \; \frac{\hat{\gamma}}{||w||} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \hat{\gamma}, \quad i = 1, 2, \dots, n$$
Note that the value of the functional margin $\hat{\gamma}$ does not affect the solution: rescaling $w$ and $b$ rescales $\hat{\gamma}$ by the same factor without changing the hyperplane, so we may fix $\hat{\gamma} = 1$. Also, maximizing $\frac{1}{||w||}$ is equivalent to minimizing $\frac{1}{2}||w||^2$. Thus, the problem can be reformulated as:
$$\min_{w,b} \; \frac{1}{2}||w||^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \dots, n$$
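This primal problem is a small quadratic program, so for intuition it can be solved directly with an off-the-shelf constrained optimizer. The sketch below uses SciPy's SLSQP on a hypothetical three-point dataset; for this data the optimizer should converge to approximately $w^* = (0.5, 0.5)$, $b^* = -2$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example points).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# Decision variables packed as v = [w_1, w_2, b].
def objective(v):
    w = v[:-1]
    return 0.5 * np.dot(w, w)  # (1/2)||w||^2

constraints = [
    # y_i (w . x_i + b) - 1 >= 0 for every sample i
    {"type": "ineq", "fun": lambda v, i=i: y[i] * (X[i] @ v[:-1] + v[-1]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method="SLSQP", constraints=constraints)
w_star, b_star = res.x[:-1], res.x[-1]
print(w_star, b_star)
```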
This is an optimization problem with inequality constraints, which can be solved using the Lagrange multiplier method and the KKT conditions. Let $\alpha = [\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_n]^T$ with $\alpha_i \ge 0$, and construct the Lagrangian function:

$$L(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$$
Assuming $w^*, b^*, \alpha^*$ are the optimal solutions, according to the KKT conditions they must satisfy the following equations:

$$\begin{aligned}
&\nabla_w L(w^*, b^*, \alpha^*) = w^* - \sum_{i=1}^{N} \alpha_i^* y_i x_i = 0 && (1) \\
&\nabla_b L(w^*, b^*, \alpha^*) = -\sum_{i=1}^{N} \alpha_i^* y_i = 0 && (2) \\
&\alpha_i^* \left( y_i (w^* \cdot x_i + b^*) - 1 \right) = 0 && (3) \\
&y_i (w^* \cdot x_i + b^*) - 1 \ge 0 && (4) \\
&\alpha_i^* \ge 0 && (5)
\end{aligned}$$
The KKT conditions include several aspects:
1. The gradient of the Lagrangian with respect to the original optimization variables is zero, as in (1) and (2).
2. Complementary slackness: the product of each Lagrange multiplier and the left side of the corresponding inequality constraint (in standard form) is zero, as in (3).
3. The original problem's constraints, as in (4).
4. The non-negativity of the Lagrange multipliers, as in (5).
From (1) and (2), we obtain:
$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i^* y_i = 0$$
The original constrained problem can be rewritten as a min-max problem over the Lagrangian:
$$\min_{w,b} \; \max_{\alpha} \; \frac{1}{2}||w||^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i, \quad \alpha_i \ge 0$$
By Lagrangian duality (strong duality holds here, since the problem is convex and the separable data make the constraints strictly feasible), this is equivalent to:
$$\max_{\alpha} \; \min_{w,b} \; \frac{1}{2}||w||^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i, \quad \alpha_i \ge 0$$
The inner minimization $\min_{w,b} \; \frac{1}{2}||w||^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$ is solved by setting the gradients with respect to $w$ and $b$ to zero, as in KKT conditions (1) and (2), which gives $w^* = \sum_i^N \alpha_i^* y_i x_i$ and $\sum_i^N \alpha_i^* y_i = 0$. Substituting these back into the Lagrangian eliminates $w$ and $b$, as spelled out below.
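Writing out the substitution explicitly:

$$\begin{aligned}
L(w^*, b, \alpha)
&= \frac{1}{2}\Big\|\sum_i \alpha_i y_i x_i\Big\|^2 - \sum_i \alpha_i y_i \Big( \Big( \sum_j \alpha_j y_j x_j \Big) \cdot x_i + b \Big) + \sum_i \alpha_i \\
&= \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
&= -\frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_i \alpha_i
\end{aligned}$$

where the $-b \sum_i \alpha_i y_i$ term vanishes because $\sum_i \alpha_i y_i = 0$. The original problem therefore becomes: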
$$\max_{\alpha} \; -\frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_i \alpha_i \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0$$
After solving for $\alpha^*$, we can determine $w^*$ using $w^* = \sum_i^N \alpha_i^* y_i x_i$. To find $b^*$, select any $j$ with $\alpha_j^* > 0$; complementary slackness (3) then forces $y_j (w^* \cdot x_j + b^*) = 1$, and multiplying both sides by $y_j$ (using $y_j^2 = 1$) gives $b^* = y_j - w^* \cdot x_j$. Note that $w^*$ and $b^*$, and hence the separating hyperplane, depend only on the samples with $\alpha_i^* > 0$, which are called the support vectors. This gives us the separating hyperplane $w^* \cdot x + b^* = 0$.
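Before turning to SMO, here is a minimal sketch that solves this small dual directly with a generic constrained optimizer and then recovers $w^*$ and $b^*$; the three data points are the same hypothetical ones as above, and SciPy's SLSQP stands in for a dedicated QP or SMO solver.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example points).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
n = len(y)

# P_ij = y_i y_j (x_i . x_j), so the dual objective is -(1/2) a^T P a + sum(a).
P = (y[:, None] * X) @ (y[:, None] * X).T

# Maximize the dual by minimizing its negative.
def neg_dual(a):
    return 0.5 * a @ P @ a - a.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

# Recover w* = sum_i alpha_i y_i x_i and b* = y_j - w* . x_j for some alpha_j > 0.
w_star = (alpha * y) @ X
j = int(np.argmax(alpha))        # index of a support vector (largest alpha)
b_star = y[j] - X[j] @ w_star
print(alpha, w_star, b_star)
```

For this toy data the solution should come out near $\alpha^* \approx (0.25, 0, 0.25)$, i.e. $x_1$ and $x_3$ are the support vectors, giving $w^* \approx (0.5, 0.5)$ and $b^* \approx -2$, matching the primal solution above.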
We will introduce the SMO algorithm for solving this optimization problem for $\alpha^*$ in the next section.
To be continued…