Support Vector Machines
In this part of the exercise, we apply support vector machines (SVMs) to 2D example datasets. Experimenting on these datasets gives an initial intuition for how SVMs work and how to use an SVM with a Gaussian kernel.
Task 1: Example Dataset 1
This task asks us to vary the parameter C and observe how the SVM's decision boundary on the dataset changes. The relevant code is already provided in ex6.m:
%% =============== Part 1: Loading and Visualizing Data ================
% We start the exercise by first loading and visualizing the dataset.
% The following code will load the dataset into your environment and plot
% the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data1:
% You will have X, y in your environment
load('ex6data1.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ==================== Part 2: Training Linear SVM ====================
% The following code will train a linear SVM on the dataset and plot the
% decision boundary learned.
%
% Load from ex6data1:
% You will have X, y in your environment
load('ex6data1.mat');
fprintf('\nTraining Linear SVM ...\n')
% You should try to change the C value below and see how the decision
% boundary varies (e.g., try C = 1000)
C = 1;
model = svmTrain(X, y, C, @linearKernel, 1e-3, 20);
visualizeBoundaryLinear(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
With C = 1, the result is:

With C = 100, the result is:

Task 2: SVM with a Gaussian Kernel
In this part we use an SVM with a Gaussian kernel to learn a non-linear decision boundary for the dataset.
- First, we use the Gaussian kernel to construct new features f_i, i = 1, 2, 3, ... To do this, we complete the Gaussian kernel implementation in gaussianKernel.m.

Reference code:
sim = exp(-(x1 - x2)' * (x1 - x2) / (2 * sigma * sigma));
Running the following part of ex6.m yields the value 0.324652:
%% =============== Part 3: Implementing Gaussian Kernel ===============
% You will now implement the Gaussian kernel to use
% with the SVM. You should complete the code in gaussianKernel.m
%
fprintf('\nEvaluating the Gaussian Kernel ...\n')
x1 = [1 2 1]; x2 = [0 4 -1]; sigma = 2;
sim = gaussianKernel(x1, x2, sigma);
fprintf(['Gaussian Kernel between x1 = [1; 2; 1], x2 = [0; 4; -1], sigma = %f :' ...
'\n\t%f\n(for sigma = 2, this value should be about 0.324652)\n'], sigma, sim);
fprintf('Program paused. Press enter to continue.\n');
pause;
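As an independent cross-check of the kernel formula, here is an illustrative Python/NumPy sketch (not part of the exercise files) that reproduces the sanity-check value above:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    """RBF similarity exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-diff.dot(diff) / (2.0 * sigma ** 2)))

# The exercise's sanity check: for sigma = 2 this should be about 0.324652
sim = gaussian_kernel([1, 2, 1], [0, 4, -1], sigma=2)
```

Identical points give a similarity of exactly 1, and the similarity decays toward 0 as the points move apart, with sigma controlling how quickly.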
- Next we load a new dataset, Example Dataset 2, and use the Gaussian-kernel SVM to learn a non-linear decision boundary for it. The relevant code is already provided in ex6.m:
%% =============== Part 4: Visualizing Dataset 2 ================
% The following code will load the next dataset into your environment and
% plot the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data2:
% You will have X, y in your environment
load('ex6data2.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ========== Part 5: Training SVM with RBF Kernel (Dataset 2) ==========
% After you have implemented the kernel, we can now use it to train the
% SVM classifier.
%
fprintf('\nTraining SVM with RBF Kernel (this may take 1 to 2 minutes) ...\n');
% Load from ex6data2:
% You will have X, y in your environment
load('ex6data2.mat');
% SVM Parameters
C = 1; sigma = 0.1;
% We set the tolerance and max_passes lower here so that the code will run
% faster. However, in practice, you will want to run the training to
% convergence.
model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
The result is:

- To get further practice with the Gaussian-kernel SVM, we load another dataset, Example Dataset 3. The file ex6data3.mat splits it into two parts: a training set and a cross-validation set. We need to complete dataset3Params.m, i.e. search over C ∈ {0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30} and σ ∈ {0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30} for the combination with the lowest cross-validation error, so that the dataset is classified correctly.
Reference code for dataset3Params.m:
C_temp = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];
sigma_temp = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];
error_val = zeros(length(C_temp), length(sigma_temp));
for i = 1 : length(C_temp)
    for j = 1 : length(sigma_temp)
        model = svmTrain(X, y, C_temp(i), @(x1, x2) gaussianKernel(x1, x2, sigma_temp(j)));
        predictions = svmPredict(model, Xval);
        error_val(i, j) = mean(double(predictions ~= yval));
    end
end
[I, J] = find(error_val == min(error_val(:)), 1); % first cell attaining the minimum error
C = C_temp(I)          % 1
sigma = sigma_temp(J)  % 0.100
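The selection step at the end is just an argmin over the error grid. A Python sketch of that logic, using a hypothetical precomputed error_val matrix in place of the SVM training loop:

```python
import numpy as np

C_grid = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
sigma_grid = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]

def best_params(error_val, C_grid, sigma_grid):
    """Pick the (C, sigma) pair minimizing validation error.

    error_val[i][j] is the cross-validation error for C_grid[i], sigma_grid[j].
    """
    E = np.asarray(error_val, dtype=float)
    i, j = np.unravel_index(np.argmin(E), E.shape)  # row/col of the smallest entry
    return C_grid[i], sigma_grid[j]
```

Taking the first index returned by argmin mirrors the `find(..., 1)` call above: with ties, one deterministic winner is chosen.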
The corresponding code in ex6.m:
%% =============== Part 6: Visualizing Dataset 3 ================
% The following code will load the next dataset into your environment and
% plot the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data3:
% You will have X, y in your environment
load('ex6data3.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ========== Part 7: Training SVM with RBF Kernel (Dataset 3) ==========
% This is a different dataset that you can use to experiment with. Try
% different values of C and sigma here.
%
% Load from ex6data3:
% You will have X, y in your environment
load('ex6data3.mat');
% Try different SVM Parameters here
[C, sigma] = dataset3Params(X, y, Xval, yval);
% Train the SVM
model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
The result is:

Spam Classifier
In this exercise we build a spam filter using a support vector machine.
Task 1: Email Preprocessing
First, we load a sample email; its contents are:

Every email contains elements such as URLs and email addresses, but their specific values differ from message to message. To make classification more effective, we normalize these elements, for example replacing every URL with the token "httpaddr". This functionality is already provided in processEmail.m.
Running this part of ex6_spam.m produces:
==== Processed Email ====
anyon know how much it cost to host a web portal well it depend on how mani
visitor you re expect thi can be anywher from less than number buck a month
to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb
if your run someth big to unsubscrib yourself from thi mail list send an
email to emailaddr
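The normalization rules can be sketched with a few regular-expression substitutions. The Python below is an illustrative approximation of what processEmail.m does, not a translation of it; the real file also strips HTML, tokenizes, and stems words:

```python
import re

def normalize_email(text):
    """Replace variable elements (URLs, addresses, numbers, dollar signs)
    with fixed tokens, in the spirit of processEmail.m."""
    text = text.lower()
    text = re.sub(r'(http|https)://\S*', 'httpaddr', text)  # URLs
    text = re.sub(r'\S+@\S+', 'emailaddr', text)            # email addresses
    text = re.sub(r'[$]+', 'dollar', text)                  # dollar signs
    text = re.sub(r'\d+', 'number', text)                   # numbers
    return text
```

After this step, two emails that mention different URLs or prices map to the same tokens, which is exactly what lets the classifier generalize across them.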
Task 2: Vocabulary List
In this part we map the words in the sample email onto a given vocabulary list. We complete processEmail.m to build this mapping: if a word in the sample email appears in the vocabulary, its index is appended to the word_indices variable; otherwise the word is skipped and we move on to the next one.
Reference code for processEmail.m:
for i = 1 : length(vocabList)
    if (strcmp(vocabList{i}, str))
        word_indices = [word_indices; i];
        break;  % vocabulary entries are unique, so we can stop at the first match
    end
end
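With a hash map available, the linear scan over the vocabulary collapses to a dictionary lookup. An illustrative Python equivalent (the function name and the 1-based indices are choices made here to mirror the Octave code):

```python
def map_to_indices(words, vocab_list):
    """Map each word to its 1-based vocabulary index, skipping words
    not in the vocabulary (same behavior as the Octave loop above)."""
    index_of = {w: i for i, w in enumerate(vocab_list, start=1)}
    return [index_of[w] for w in words if w in index_of]
```

The dictionary makes each lookup O(1) instead of O(|vocabulary|), which matters once the vocabulary has thousands of entries.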
Task 3: Extracting Features from the Email
In this part we convert the data in word_indices into a binary feature vector x ∈ R^n, where x_i = 1 if the i-th vocabulary word appears in the sample email and x_i = 0 otherwise.
Reference code for emailFeatures.m:
x(word_indices) = 1;
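The same one-hot construction, sketched in Python. The default n = 1899 is the vocabulary size used in this exercise; treat it as an assumption if your vocabulary differs:

```python
import numpy as np

def email_features(word_indices, n=1899):
    """Binary feature vector: x[i-1] = 1 iff vocabulary word i occurs.

    word_indices uses 1-based indices, as produced by processEmail.m.
    """
    x = np.zeros(n)
    idx = np.asarray(word_indices, dtype=int) - 1  # 1-based -> 0-based
    x[idx] = 1.0
    return x
```

Note that repeated occurrences of a word still yield a single 1; the vector records presence, not counts.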
Task 4: Training the SVM
In this part we train a linear SVM on the training set and then evaluate it on a held-out test set. The code is:
%% =========== Part 3: Train Linear SVM for Spam Classification ========
% In this section, you will train a linear classifier to determine if an
% email is Spam or Not-Spam.
% Load the Spam Email dataset
% You will have X, y in your environment
load('spamTrain.mat');
fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
C = 0.1;
model = svmTrain(X, y, C, @linearKernel);
p = svmPredict(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);
%% =================== Part 4: Test Spam Classification ================
% After training the classifier, we can evaluate it on a test set. We have
% included a test set in spamTest.mat
% Load the test dataset
% You will have Xtest, ytest in your environment
load('spamTest.mat');
fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')
p = svmPredict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
pause;
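The accuracy expression mean(double(p == y)) * 100 translates directly to other languages; a small illustrative Python version:

```python
import numpy as np

def accuracy(pred, y):
    """Percentage of predictions matching labels,
    equivalent to mean(double(p == y)) * 100 in Octave."""
    pred = np.asarray(pred)
    y = np.asarray(y)
    return float(np.mean(pred == y) * 100.0)
```

Comparing training accuracy against test accuracy, as the script above does, is the quickest check for overfitting: a large gap between the two suggests C should be reduced.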
Task 5: Predicting Spam
In this part we use the trained SVM to classify new emails. The code is:
%% =================== Part 6: Try Your Own Emails =====================
% Now that you've trained the spam classifier, you can use it on your own
% emails! In the starter code, we have included spamSample1.txt,
% spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
% The following code reads in one of these emails and then uses your
% learned SVM classifier to determine whether the email is Spam or
% Not Spam
% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different emails types). Try your own emails as well!
filename = 'spamSample1.txt';
% Read and predict
file_contents = readFile(filename);
word_indices = processEmail(file_contents);
x = emailFeatures(word_indices);
p = svmPredict(model, x);
fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');