12. Support Vector Machines

Support Vector Machines

Optimization objective

SVM hypothesis:
logistic regression:

h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}

cost function:

\min_\theta C\sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum_{j=1}^n\theta_j^2
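The cost_1 and cost_0 terms replace the logistic loss with piecewise-linear, hinge-style surrogates: flat zero past the margin, linear inside it. A minimal NumPy sketch of those standard shapes (the function names simply mirror the notation above):

```python
import numpy as np

def cost_1(z):
    """Cost for a positive example (y = 1): zero once z >= 1, linear penalty below."""
    return np.maximum(0, 1 - z)

def cost_0(z):
    """Cost for a negative example (y = 0): zero once z <= -1, linear penalty above."""
    return np.maximum(0, 1 + z)
```

Note that cost_1 only vanishes for z ≥ 1 (not just z ≥ 0), which is where the margin requirement below comes from.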

Large Margin Intuition

\min_\theta C\sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac{1}{2}\sum_{j=1}^n\theta_j^2

If y=1, we want \theta^Tx\ge1 (not just \ge0)
If y=0, we want \theta^Tx\le-1 (not just \le0)

If C is too large, the decision boundary will be sensitive to outliers.

The mathematics behind large margin classification (optional)

Vector Inner Product

SVM Decision Boundary

\min\limits_\theta\frac{1}{2}\sum\limits_{j=1}^n\theta_j^2=\frac{1}{2}||\theta||^2
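Filling in the projection step from the lecture makes the margin connection explicit:

\theta^Tx^{(i)} = p^{(i)}\cdot||\theta||

where p^{(i)} is the signed projection of x^{(i)} onto \theta. The constraints then read p^{(i)}||\theta||\ge1 (for y^{(i)}=1) and p^{(i)}||\theta||\le-1 (for y^{(i)}=0); since the objective minimizes ||\theta||, the projections |p^{(i)}| are pushed to be large, which is exactly a large margin.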

Kernels I

Non-linear decision boundary:

Given x, compute new features depending on proximity to manually defined landmarks.

Kernels and Similarity (Gaussian kernel):

f_1=similarity(x,l^{(1)})=\exp (-\frac{||x-l^{(1)}||^2}{2\sigma^2})=\exp(-\frac{\sum_{j=1}^n(x_j-l_j^{(1)})^2}{2\sigma^2})

If x\approx l^{(1)}: f_1\approx 1
If x is far from l^{(1)}: f_1\approx 0
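The similarity formula above translates directly into code; a minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity between example x and landmark l:
    1 when x == l, decaying toward 0 as x moves away."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))
```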


Kernels II

Choosing the landmarks:
Where to get l ?
Given (x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)}),
choose l^{(1)}=x^{(1)},l^{(2)}=x^{(2)},\dots,l^{(m)}=x^{(m)}

For each training example (x^{(i)},y^{(i)}):

f_j^{(i)} = sim(x^{(i)},l^{(j)})\quad (j=1,\dots,m),\qquad f_0^{(i)} = 1
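A sketch of this landmark-to-feature mapping, assuming all m training examples serve as landmarks as described above (the names feature_map and sigma are my own):

```python
import numpy as np

def feature_map(X, sigma):
    """Map each row of X to Gaussian-kernel features against every
    training example used as a landmark.

    Returns an (m, m+1) matrix: column 0 is the intercept feature f_0 = 1,
    column j is sim(x^(i), l^(j)) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    m = X.shape[0]
    # Pairwise squared distances between all examples and all landmarks.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    F = np.exp(-sq / (2 * sigma ** 2))
    return np.hstack([np.ones((m, 1)), F])  # prepend f_0 = 1
```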

SVM with Kernels

Hypothesis: Given x, compute features f\in R^{m+1}
Predict 'y=1' if \theta^Tf\ge0
Training:\min\limits_\theta C\sum\limits_{i=1}^m[y^{(i)}cost_1(\theta^Tf^{(i)})+(1-y^{(i)})cost_0(\theta^Tf^{(i)})]+\frac{1}{2}\sum\limits_{j=1}^n\theta_j^2\quad (n=m)
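Real SVM packages use specialized solvers, but the training objective above can be illustrated with plain subgradient descent on the hinge-style costs. Everything here (function name, learning rate, iteration count) is an illustrative assumption, not how production solvers work:

```python
import numpy as np

def train_kernel_svm(F, y, C=1.0, lr=0.01, iters=2000):
    """Minimize C * sum(hinge costs) + 0.5 * ||theta[1:]||^2 by subgradient descent.

    F : (m, m+1) kernel-feature matrix with intercept column; y in {0, 1}."""
    theta = np.zeros(F.shape[1])
    for _ in range(iters):
        z = F @ theta
        # Subgradient of cost_1 is -1 while z < 1 (y = 1);
        # subgradient of cost_0 is +1 while z > -1 (y = 0).
        g = C * ((y == 1) * (z < 1) * -1.0 + (y == 0) * (z > -1) * 1.0) @ F
        g[1:] += theta[1:]  # regularizer skips the intercept weight theta_0
        theta -= lr * g
    return theta
```

Prediction then follows the rule above: y = 1 whenever theta^T f >= 0.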

Kernels are usually used with SVMs; although they can also be used with logistic regression, that combination runs slowly.

SVM parameters

C :

  • Large C: Lower bias, high variance.
  • Small C: Higher bias, low variance.

\sigma^2 :

  • Large \sigma^2: Features f_i vary more smoothly. Higher bias, lower variance. (Underfit)
  • Small \sigma^2: Features f_i vary less smoothly. Lower bias, higher variance. (Overfit)

Using an SVM

Need to specify:

  • Choice of parameter C
  • Choice of kernel (similarity function)

Note: Do perform feature scaling before using the Gaussian kernel.
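The reason: inside the Gaussian kernel, ||x-l||^2 is dominated by whichever feature has the largest range. A minimal standardization sketch (sklearn's StandardScaler does the same job; the function name here is my own):

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance so that no single
    feature dominates the ||x - l||^2 term inside the Gaussian kernel."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero on constant features
    return (X - mu) / sigma
```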

Other choices of kernel

Not all similarity functions similarity(x,l) make valid kernels. They need to satisfy a technical condition called Mercer's Theorem, which ensures that SVM packages' optimizations run correctly and do not diverge.

Many off-the-shelf kernels available:

  • Polynomial kernel: k(x,l) = (x^Tl+constant)^{degree}
  • String kernel
  • chi-square kernel
  • histogram intersection kernel
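The polynomial kernel from the list is one line of code; a sketch with illustrative parameter names matching the formula above:

```python
import numpy as np

def polynomial_kernel(x, l, constant=1.0, degree=2):
    """k(x, l) = (x^T l + constant)^degree."""
    return (np.dot(x, l) + constant) ** degree
```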

Multi-class classification

Many SVM packages already have built-in multi-class classification functionality.

Logistic regression vs. SVM

n = number of features, m = number of training examples.

  • If n is large (relative to m):
    Use logistic regression, or SVM without a kernel.
  • If n is small and m is intermediate:
    Use SVM with Gaussian kernel.
  • If n is small and m is large:
    Create/add more features, then use logistic regression or SVM without a kernel.