K均值算法與主成分分析算法
K均值分析算法
在本部分練習(xí)中,你將實(shí)現(xiàn)K均值算法并將該算法用于圖像壓縮。最初,你通過使用2D數(shù)據(jù)集幫助你理解K均值算法。在此之后,你將使用K均值算法以減少顏色數(shù)量的方式壓縮圖像。
任務(wù)一 實(shí)現(xiàn)K均值算法
K均值算法是一種自動(dòng)將相似數(shù)據(jù)聚類在一起的方法,其基本代碼如下:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
% Cluster assignment step: Assign each data point to the
% closest centroid. idx(i) corresponds to c?(i), the index
% of the centroid assigned to example i
idx = findClosestCentroids(X, centroids);
% Move centroid step: Compute means based on centroid
% assignments
centroids = computeMeans(X, idx, K);
end
上述代碼中,循環(huán)分為了兩步:1)將樣本數(shù)據(jù)x(i)聚類至最近的聚類中心;2)計(jì)算每一聚類的均值再將樣本數(shù)據(jù)重新分配。
Step 1:尋找最近的聚類中心

根據(jù)上述公式,我們在findClosestCentroids.m文件中添加如下代碼:
for i = 1 : length(idx)
T = [];
for j = 1 : K
T = [T; X(i, :)];
end
[t, idx(i)] = min(sum((T - centroids) .^ 2, 2)); % sum(A, 2)表示對矩陣A按列求和
end
Step 2:計(jì)算每一聚類的均值

根據(jù)上述公式,我們在computeCentroids.m文件中添加如下代碼:
for k = 1 : K
T = find(idx == k);
centroids(k, :) = sum(X(T, :)) / size(T, 1); % size(A, 1)表示求矩陣A的行數(shù)
end
最后,我們運(yùn)行該任務(wù)部分代碼,其中ex7.m文件中的代碼如下:
%% ================= Part 1: Find Closest Centroids ====================
% To help you implement K-Means, we have divided the learning algorithm
% into two functions -- findClosestCentroids and computeCentroids. In this
% part, you should complete the code in the findClosestCentroids function.
%
fprintf('Finding closest centroids.\n\n');
% Load an example dataset that we will be using
load('ex7data2.mat');
% Select an initial set of centroids
K = 3; % 3 Centroids
initial_centroids = [3 3; 6 2; 8 5];
% Find the closest centroids for the examples using the
% initial_centroids
idx = [];
idx = findClosestCentroids(X, initial_centroids);
fprintf('Closest centroids for the first 3 examples: \n')
fprintf(' %d', idx(1:3));
fprintf('\n(the closest centroids should be 1, 3, 2 respectively)\n');
fprintf('Program paused. Press enter to continue.\n');
pause;
運(yùn)行結(jié)果為:

任務(wù)二 隨機(jī)初始化(該部分任務(wù)不需提交)
在實(shí)際應(yīng)用中,我們更為推薦對訓(xùn)練集進(jìn)行隨機(jī)初始化操作。我們在kMeansInitCentroids.m文件中添加如下代碼:
% Initialize the centroids to be random examples
% Randomly reorder the indices of examples
randidx = randperm(size(X, 1));
% Take the first K examples as centroids
centroids = X(randidx(1:K), :);
上述代碼通過使用randperm()函數(shù)隨機(jī)排列了樣本數(shù)據(jù)的索引,然后其基于索引的隨機(jī)排列選擇最初的K個(gè)樣本數(shù)據(jù)。這種方式避免了出現(xiàn)兩次隨機(jī)選擇相同樣本數(shù)據(jù)的問題。
任務(wù)三 使用K均值算法對圖像壓縮(該部分任務(wù)不需提交)

在24位彩色圖像中,每個(gè)像素表示為紅色、綠色和藍(lán)色強(qiáng)度值的三個(gè)8位無符號(hào)整數(shù)(其取值范圍為0~255)。我們的圖像包括數(shù)千種顏色,在這部分練習(xí)中,你需將顏色的數(shù)量減少至16種。通過這種方式,可以有效地壓縮圖像。在該方式下,你只需存儲(chǔ)16個(gè)所選顏色地RGB值,并且對于圖像中地每個(gè)像素,你只需存儲(chǔ)該位置地顏色索引即可。
在本練習(xí)中,你將使用K均值算法來選擇用于表示壓縮圖像的16中顏色。具體來說,你應(yīng)將原始圖像中的每個(gè)像素視為樣本數(shù)據(jù),通過使用K均值算法找出最佳的16種顏色。
在Octave或 Matlab中,我們使用如下代碼來導(dǎo)入圖像:
% Load 128x128 color image (bird small.png)
A = imread('bird small.png');
% You will need to have installed the image package to used
% imread. If you do not have the image package installed, you
% should instead change the following line to
%
% load('bird small.mat'); % Loads the image into the variable A
這創(chuàng)建了一個(gè)三維矩陣A,其前兩個(gè)索引標(biāo)識(shí)像素位置,最后一個(gè)索引表示紅色、綠色或藍(lán)色。例如:A(50, 33, 3)表示行50和列33處像素的藍(lán)色強(qiáng)度。
該部分代碼如下:
%% ============= Part 4: K-Means Clustering on Pixels ===============
% In this exercise, you will use K-Means to compress an image. To do this,
% you will first run K-Means on the colors of the pixels in the image and
% then you will map each pixel onto its closest centroid.
%
% You should now complete the code in kMeansInitCentroids.m
%
fprintf('\nRunning K-Means clustering on pixels from an image.\n\n');
% Load an image of a bird
A = double(imread('bird_small.png'));
% If imread does not work for you, you can try instead
% load ('bird_small.mat');
A = A / 255; % Divide by 255 so that all values are in the range 0 - 1
% Size of the image
img_size = size(A);
% Reshape the image into an Nx3 matrix where N = number of pixels.
% Each row will contain the Red, Green and Blue pixel values
% This gives us our dataset matrix X that we will use K-Means on.
X = reshape(A, img_size(1) * img_size(2), 3);
% Run your K-Means algorithm on this data
% You should try different values of K and max_iters here
K = 16;
max_iters = 10;
% When using K-Means, it is important the initialize the centroids
% randomly.
% You should complete the code in kMeansInitCentroids.m before proceeding
initial_centroids = kMeansInitCentroids(X, K);
% Run K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ================= Part 5: Image Compression ======================
% In this part of the exercise, you will use the clusters of K-Means to
% compress an image. To do this, we first find the closest clusters for
% each example. After that, we
fprintf('\nApplying K-Means to compress an image.\n\n');
% Find closest cluster members
idx = findClosestCentroids(X, centroids);
% Essentially, now we have represented the image X as in terms of the
% indices in idx.
% We can now recover the image from the indices (idx) by mapping each pixel
% (specified by its index in idx) to the centroid value
X_recovered = centroids(idx,:);
% Reshape the recovered image into proper dimensions
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);
% Display the original image
subplot(1, 2, 1);
imagesc(A);
title('Original');
% Display compressed image side by side
subplot(1, 2, 2);
imagesc(X_recovered)
title(sprintf('Compressed, with %d colors.', K));
fprintf('Program paused. Press enter to continue.\n');
pause;
運(yùn)行結(jié)果為:

主成分分析算法
在部分練習(xí)中,你將使用主成分分析算法實(shí)現(xiàn)降維。最初,你將在2D數(shù)據(jù)集上了解PCA算法,然后在一個(gè)大數(shù)據(jù)集上使用PCA算法。
任務(wù)一 可視化數(shù)據(jù)集
該部分代碼如下:
%% ================== Part 1: Load Example Dataset ===================
% We start this exercise by using a small dataset that is easily to
% visualize
%
fprintf('Visualizing example dataset for PCA.\n\n');
% The following command loads the dataset. You should now have the
% variable X in your environment
load ('ex7data1.mat');
% Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bo');
axis([0.5 6.5 2 8]); axis square;
fprintf('Program paused. Press enter to continue.\n');
pause;
運(yùn)行結(jié)果為:

任務(wù)二 實(shí)現(xiàn)PCA算法
PCA算法由兩部分組成:首先計(jì)算數(shù)據(jù)的協(xié)方差矩陣,然后使用Octave或Matlab的svd()函數(shù)來計(jì)算特征向量U1, ···, Un。
不過在使用PCA算法之前,我們應(yīng)當(dāng)對訓(xùn)練集進(jìn)行均值歸一化操作,其具體實(shí)現(xiàn)代碼可在featureNormalize.m文件中查看。
求取協(xié)方差矩陣的數(shù)學(xué)公式為:

svd()函數(shù)為:[U, S, V] = svd(Sigma)
我們在pca.m文件中添加如下代碼:
Sigma = X' * X / m;
[U, S, V] = svd(Sigma);
該部分在ex7_pac.m文件的代碼如下:
%% =============== Part 2: Principal Component Analysis ===============
% You should now implement PCA, a dimension reduction technique. You
% should complete the code in pca.m
%
fprintf('\nRunning PCA on example dataset.\n\n');
% Before running PCA, it is important to first normalize X
[X_norm, mu, sigma] = featureNormalize(X);
% Run PCA
[U, S] = pca(X_norm);
% Compute mu, the mean of the each feature
% Draw the eigenvectors centered at mean of data. These lines show the
% directions of maximum variations in the dataset.
hold on;
drawLine(mu, mu + 1.5 * S(1,1) * U(:,1)', '-k', 'LineWidth', 2);
drawLine(mu, mu + 1.5 * S(2,2) * U(:,2)', '-k', 'LineWidth', 2);
hold off;
fprintf('Top eigenvector: \n');
fprintf(' U(:,1) = %f %f \n', U(1,1), U(2,1));
fprintf('\n(you should expect to see -0.707107 -0.707107)\n');
fprintf('Program paused. Press enter to continue.\n');
pause;
運(yùn)行結(jié)果為:

任務(wù)三 使用PCA算法降維
這部分你需要將數(shù)據(jù)集X中的每個(gè)數(shù)據(jù)投影到主成分矩陣U中的前K個(gè)分量上。projectData.m文件中的代碼如下:
U_reduce = U(:, 1 : K);
Z = X * U_reduce;
任務(wù)四 重建數(shù)據(jù)的近似值
recoverData.m添加的代碼如下:
U_reduce = U(:, 1 : K);
X_rec = Z * U_reduce';
任務(wù)三、四部分的運(yùn)行結(jié)果如下:

本次編程作業(yè)需要提交的部分到此結(jié)束。文檔中的后續(xù)任務(wù),此處不再敘述。請自行查閱和動(dòng)手實(shí)踐,謝謝!