Getting Started with Machine Learning in Python

Anne Hathaway

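Installing Python on Windows

Official Python website: https://www.python.org/
My computer is 64-bit. For a 3.x release, choose the Windows x86-64 executable installer. The 2.x and 3.x lines are not compatible, and code written for 2.x has to be modified before it runs on 3.x, so I chose the 2.x release: Windows x86-64 MSI installer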


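Make sure pip and "Add python.exe to Path" are selected, then click "Next" through the remaining screens to finish the installation.

By default Python is installed to C:\Python27. Open a Command Prompt window and type python; if the interpreter starts and shows its version banner and the >>> prompt, Python was installed successfully.


If you see instead: 'python' is not recognized as an internal or external command, operable program or batch file

This happens because Windows looks for python.exe in the directories listed in the Path environment variable and reports an error when it cannot find it. If you forgot to tick "Add python.exe to Path" during installation, add the directory that contains python.exe, C:\Python27, to Path manually.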
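The Path lookup described above can be reproduced from Python to see which directory, if any, would supply python.exe. This is only an illustrative sketch: `find_on_path` is a hypothetical helper, not part of Python or Windows, and it simply scans a Path-style string for the first directory that contains a given file name.

```python
import os
import tempfile

def find_on_path(program, path_string, sep=os.pathsep):
    """Return the first directory in path_string that contains program, or None."""
    for directory in path_string.split(sep):
        if directory and os.path.isfile(os.path.join(directory, program)):
            return directory
    return None

# Demonstration with a throwaway directory standing in for the install folder
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "python.exe"), "w").close()

print(find_on_path("python.exe", demo_dir))   # found: prints demo_dir
print(find_on_path("python.exe", ""))         # empty path: prints None
# On a real machine: find_on_path("python.exe", os.environ["PATH"])
```

On Python 3, the standard library's `shutil.which("python")` performs a similar lookup against the real Path.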


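The installer puts Python's entry at the very front of Path, ahead of the Windows system entries, and in my experience it then does not take effect until after a restart. Moving the entry to the end of Path avoids the restart.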

Installing Jupyter Notebook for Python 3

python3 -m pip install --upgrade pip
python3 -m pip install jupyter

Installing Jupyter Notebook for Python 2

python -m pip install --upgrade pip
python -m pip install jupyter

Starting Jupyter Notebook

jupyter notebook

Installing numpy

The examples involve a lot of matrix computation, so the numpy package is needed.
Download page: click to open the link

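  • Choose the package that matches your installed Python version and build; for a 32-bit Python choose the win32 wheel: numpy-1.14.3+mkl-cp27-cp27m-win32.whl
  • Copy the downloaded package to C:\Python27\Scripts under the Python installation directory
  • Copy the path of the Scripts folder, then go to "right-click Computer > Properties > Advanced system settings > Environment Variables > System variables > Path > Edit" and paste the path in.
  • Open a DOS (Command Prompt) window and run pip<version> install <path to the numpy file>\<file name>
    For example, mine is: pip2.7 install C:\Python27\Scripts\numpy-1.14.3+mkl-cp27-cp27m-win32.whl
  • When the installation succeeds, pip prints "Successfully installed"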

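Two unexpected errors came up during installation. The second one is fixed by upgrading pip as the message suggests, but what about the first?
It turned out that the Python I had installed only accepts win32 wheel files. A 64-bit operating system does not mean you should pick the amd64 wheel; the wheel has to match the Python build, so downloading the win32 numpy package again solved it.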



Installing Matplotlib

As with numpy, find the Matplotlib package, download it to C:\Python27\Scripts under the Python installation directory, and install it from cmd: pip2.7 install C:\Python27\Scripts\matplotlib-2.2.2-cp27-cp27m-win32.whl

Installing pandas

pip2.7 install C:\Python27\Scripts\pandas-0.23.0-cp27-cp27m-win32.whl

Installing seaborn

pip install seaborn

Installing scipy

pip2.7 install C:\Python27\Scripts\scipy-1.1.0-cp27-cp27m-win32.whl

Installing sklearn

pip2.7 install C:\Python27\Scripts\scikit_learn-0.19.1-cp27-cp27m-win32.whl
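After these installs, a quick way to confirm that everything is importable is to try each package and print its version. This is a minimal sketch; `installed_versions` is a hypothetical helper, and any failure while importing is treated as "not installed".

```python
import importlib

def installed_versions(module_names):
    """Try to import each module; return {name: version string, or None if it fails}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # any import failure counts as missing for this check
            versions[name] = None
    return versions

packages = ["numpy", "matplotlib", "pandas", "seaborn", "scipy", "sklearn"]
for name, version in installed_versions(packages).items():
    print("%-12s %s" % (name, version if version else "NOT INSTALLED"))
```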

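Applying Euclidean Distance

Sichuan Restaurant Rankings

-------------------------------------------------------------------------------
                  | Braised Pork | Boiled Beef  | Fuqi Feipian | Mapo Tofu    |
-------------------------------------------------------------------------------
  Kitchen God     |              |              |              |              |
-------------------------------------------------------------------------------
  God of Cookery  |              |              |              |              |
-------------------------------------------------------------------------------
  God of Gamblers |              |              |              |              |
-------------------------------------------------------------------------------
  Foodie          |              |              |              |              |
-------------------------------------------------------------------------------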

Loading the data

import numpy as np

Restr_1 = [[3.5, 3.0, 3.0, 4.0],
           [2.0, 2.5, 2.5, 3.5],
           [3.0, 3.5, 3.0, 4.5],
           [4.0, 3.0, 3.5, 4.0]]

Restr_2 = [[4.5, 4.0, 4.0, 4.5],
           [3.0, 3.5, 3.5, 4.5],
           [4.0, 3.5, 4.0, 4.0],
           [4.5, 4.0, 4.5, 4.5]]

Restr_3 = [[1.5, 2.0, 2.0, 2.5],
           [1.0, 1.5, 1.5, 1.5],
           [2.0, 2.5, 2.0, 2.0],
           [1.5, 2.0, 2.5, 2.5]]

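Euclidean distance formula

def euclidean_score(param1, param2):
    # element-wise difference between the two rating matrices
    subtracted_diff = np.subtract(param1, param2)
    # square each difference
    squared_diff = np.square(subtracted_diff)
    # Euclidean distance: square root of the summed squares
    eu_dist = np.sqrt(np.sum(squared_diff))
    # return the distance and a similarity score in (0, 1]
    return eu_dist, 1 / (1 + eu_dist)

R12, r12 = euclidean_score(Restr_1, Restr_2)
R13, r13 = euclidean_score(Restr_1, Restr_3)
R23, r23 = euclidean_score(Restr_2, Restr_3)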

R12=3.4641016151377544
R13=5.916079783099616
R23=8.717797887081348
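
The three distances are the square roots of the summed squared differences, which work out to 12, 35 and 76 for the three pairs. A smaller distance means more similar ratings, which is what the second return value 1 / (1 + d) expresses. The sketch below recomputes that score from the distances printed above, using only the standard library; the dictionary layout is an assumption made for illustration.

```python
import math

# Distances printed above (R12, R13, R23)
distances = {
    ("Restr_1", "Restr_2"): 3.4641016151377544,
    ("Restr_1", "Restr_3"): 5.916079783099616,
    ("Restr_2", "Restr_3"): 8.717797887081348,
}

# Same mapping as the second return value of euclidean_score: 1 / (1 + d)
similarity = {pair: 1.0 / (1.0 + d) for pair, d in distances.items()}

for pair, score in sorted(similarity.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 4))

# The pair with the highest score has the most similar ratings
most_similar = max(similarity, key=similarity.get)
print("most similar pair:", most_similar)
```

Restr_1 and Restr_2 come out as the closest pair, and Restr_2 and Restr_3 as the most different.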

KNN

from numpy import *
import operator
import time
import matplotlib.pyplot as plt

def kNN(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort() 
    """
    print(distances)
    print(diffMat)
    print(sqDiffMat)
    print(sqDistances)
    print('index')
    print(sortedDistIndicies)
    """
    classCount={}          
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
# kNN Example
group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
labels = ['A','A','B','B']

Visualizing the data

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(group[:2,0],group[:2,1], s=70, color='b')
ax.scatter(group[2:4,0],group[2:4,1], s=70, color='r')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
kNN([0.3,0.2],group,labels,3)
# out: 'B', which means the point [0.3, 0.2] belongs to class B
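
To see why the answer is 'B', it helps to list the distance from the query point to each training point and then count the labels of the three nearest ones. The following is a standard-library sketch of the same nearest-neighbour vote, not the kNN function above; the variable names are chosen for illustration.

```python
import math
from collections import Counter

group = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ['A', 'A', 'B', 'B']
query = [0.3, 0.2]

# Distance from the query to every training point, paired with that point's label
neighbours = sorted(
    (math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))), label)
    for point, label in zip(group, labels)
)
for dist, label in neighbours:
    print(round(dist, 3), label)

# Majority vote among the k nearest neighbours
k = 3
votes = Counter(label for _, label in neighbours[:k])
prediction = votes.most_common(1)[0][0]
print(prediction)
```

The two 'B' points are the nearest, so with k = 3 the vote is two 'B' against one 'A'.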

Following the previous example, classify the movie data in the table below with the kNN algorithm:


group = array([[3.0,104.0],[2.0,100.0],[1,81],[101,10.0],[99,5],[98,2.0]])
labels = ['Romance','Romance','Romance','Action','Action','Action']

kNN([18,90],group,labels,3)

#out:'Romance'

Analyzing and classifying data from a file


from numpy import *
import matplotlib.pyplot as plt

def file2matrix(filename):
    with open(filename) as fr:
        lines = fr.readlines()                  #read the file once, then close it
    numberOfLines = len(lines)                  #get the number of lines in the file
    returnMat = zeros((numberOfLines,3))        #prepare matrix to return
    classLabelVector = []                       #prepare labels return
    index = 0
    for line in lines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = [float(x) for x in listFromLine[0:3]]  #three feature columns
        classLabelVector.append(int(listFromLine[-1]))              #last column is the label
        index += 1
    return returnMat,classLabelVector

datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0*array(datingLabels), 15.0*array(datingLabels))
plt.xlabel('Percentage of Time Spent Playing Video Games')
plt.ylabel('Liters of Ice Cream Consumed Per Week')
plt.show()
plt.scatter(datingDataMat[:,0], datingDataMat[:,1], 15.0*array(datingLabels), 15.0*array(datingLabels))
plt.xlabel('Frequent Flyer Miles Earned Per Year')
plt.ylabel('Percentage of Time Spent Playing Video Games')
plt.show()
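
file2matrix expects a tab-separated text file with three numeric feature columns followed by an integer class label. As a hedged alternative, NumPy's `np.loadtxt` can parse such a file in one call. The rows below are made up purely to show the layout and are not taken from datingTestSet2.txt.

```python
import os
import tempfile
import numpy as np

# Three made-up rows in the assumed layout: three numeric features and an
# integer class label, separated by tabs (NOT rows from datingTestSet2.txt)
sample = "40000\t8.5\t0.9\t3\n15000\t7.0\t1.6\t2\n26000\t1.5\t0.8\t1\n"

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write(sample)

# np.loadtxt parses the whole tab-delimited file in one call
data = np.loadtxt(path, delimiter="\t")
features = data[:, :3]                 # first three columns
labels = data[:, -1].astype(int)       # last column as integer labels

print(features.shape)
print(labels.tolist())
```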
import numpy as np
import matplotlib.pyplot as plt

from matplotlib.ticker import NullFormatter  # useful for `logit` scale

# Fixing random state for reproducibility
np.random.seed(19680801)

# make up some data in the interval ]0, 1[
y = np.random.normal(loc=0.5, scale=0.4, size=1000)
y = y[(y > 0) & (y < 1)]
y.sort()
x = np.arange(len(y))

# plot with various axes scales
plt.figure(1)

# linear
plt.subplot(221)
plt.plot(x, y)
plt.yscale('linear')
plt.title('linear')
plt.grid(True)


# log
plt.subplot(222)
plt.plot(x, y)
plt.yscale('log')
plt.title('log')
plt.grid(True)


# symmetric log
plt.subplot(223)
plt.plot(x, y - y.mean())
plt.yscale('symlog', linthreshy=0.01)
plt.title('symlog')
plt.grid(True)

# logit
plt.subplot(224)
plt.plot(x, y)
plt.yscale('logit')
plt.title('logit')
plt.grid(True)
# Format the minor tick labels of the y-axis into empty strings with
# `NullFormatter`, to avoid cumbering the axis with too many labels.
plt.gca().yaxis.set_minor_formatter(NullFormatter())
# Adjust the subplot layout, because the logit one may take more space
# than usual, due to y-tick labels like "1 - 10^{-3}"
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.25,
                    wspace=0.35)

plt.show()

Applying the Apriori Algorithm

Write apriori.py based on the Apriori algorithm

from numpy import *

def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    C1 = []
    for transaction in dataSet:
        #print(transaction)
        for item in transaction:
            #print(item)
            if not [item] in C1:
                #print("C1 before:")
                #print(C1)
                C1.append([item])
                #print("C1 now:")
                #print(C1)
                
    C1.sort()
    return list(map(frozenset, C1)) #use frozenset so we can use it as a key in a dict;
                                    #list() lets scanD iterate over Ck more than once

def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                #print("ssCnt before:")
                #print(ssCnt)
                if not can in ssCnt: ssCnt[can]=1
                else: ssCnt[can] += 1
                #print("ssCnt now:")
                #print(ssCnt)
    numItems = float(len(list(D)))
    print("numItems:")
    print(numItems)
    retList = []
    supportData = {}
    for key in ssCnt:
        print(key)
        support = ssCnt[key]/numItems
        if support >= minSupport:
            retList.insert(0,key)
        supportData[key] = support
        print(support)
    return retList, supportData

def aprioriGen(Lk, k): #creates Ck
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk): 
            #sort before slicing so the first k-2 items are comparable across sets
            L1 = sorted(Lk[i])[:k-2]; L2 = sorted(Lk[j])[:k-2]
            if L1==L2: #if first k-2 elements are equal
                retList.append(Lk[i] | Lk[j]) #set union
    return retList

def apriori(dataSet, minSupport = 0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Ck = aprioriGen(L[k-2], k)
        Lk, supK = scanD(D, Ck, minSupport)#scan DB to get Lk
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData

def generateRules(L, supportData, minConf=0.7):  #supportData is a dict coming from scanD
    bigRuleList = []
    for i in range(1, len(L)):#only get the sets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList         

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = [] #create new list to return
    for conseq in H:
        conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence
        if conf >= minConf: 
            print(freqSet-conseq,'-->',conseq,'conf:',conf)
            brl.append((freqSet-conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if (len(freqSet) > (m + 1)): #try further merging
        Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 1):    #need at least two sets to merge
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
            
def pntRules(ruleList, itemMeaning):
    for ruleTup in ruleList:
        for item in ruleTup[0]:
            print(itemMeaning[item])
        print("           -------->")
        for item in ruleTup[1]:
            print(itemMeaning[item])
        print("confidence: %f" % ruleTup[2])
        print(" ")      #print a blank line

Loading the data

import apriori

dataSet = [["cakes", "beer", "bread"],
           ["cakes", "beer", "bread", "donuts"],
           ["beer", "bread", "pizza"], 
           ["cakes", "bread", "donuts", "pizza"],
           ["donuts", "pizza"]]

C1 = apriori.createC1(dataSet)
list(C1)

C2 = [frozenset({'cakes', 'beer'}),
 frozenset({'cakes', 'beer', 'bread'}),
 frozenset({'cakes', 'beer', 'bread', 'donuts'})]

C3 =[frozenset({'beer', 'bread'}),
 frozenset({'cakes', 'beer', 'bread'}),
 frozenset({'cakes', 'beer', 'bread', 'donuts'}),
 frozenset({'beer', 'bread', 'pizza'})] 

D = list(map(set, dataSet))
D

Computing support counts

L2, suppData = apriori.scanD(D, C2, 0)
L2

numItems:
5.0
frozenset({'beer', 'cakes'})
0.4
frozenset({'beer', 'bread', 'cakes'})
0.4
frozenset({'donuts', 'beer', 'bread', 'cakes'})
0.2
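
The support values printed above are simply the fraction of the five transactions that contain each candidate itemset. The sketch below recomputes them directly, without apriori.py, and also shows how the confidence used in calcConf follows from two support values. The helper names `support` and `confidence` are chosen for illustration.

```python
dataSet = [["cakes", "beer", "bread"],
           ["cakes", "beer", "bread", "donuts"],
           ["beer", "bread", "pizza"],
           ["cakes", "bread", "donuts", "pizza"],
           ["donuts", "pizza"]]
transactions = [set(t) for t in dataSet]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / float(len(transactions))

def confidence(antecedent, consequent):
    """support of both sides together, divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"cakes", "beer"}))                     # 2 of 5 transactions
print(support({"cakes", "beer", "bread", "donuts"}))  # 1 of 5 transactions
print(confidence({"beer"}, {"cakes"}))                # 2 of the 3 beer transactions
```

With minConf=0.7, the default in generateRules, the rule beer --> cakes would be rejected because its confidence is only about 0.67.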

Applying Decision Trees

Write trees.py based on the decision tree algorithm

from math import log
import operator

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: #count the unique labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) #log base 2
    return shannonEnt
    
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
    
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        print("#", i)
        print("infoGain: ", infoGain)
        print(" ")
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  #items() works on Python 2 and 3
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree                            
    
def classify(inputTree,featLabels,testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict): 
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else: classLabel = valueOfFeat
    return classLabel

def storeTree(inputTree,filename):
    import pickle
    with open(filename,'wb') as fw:     #pickle needs binary mode on Python 3
        pickle.dump(inputTree,fw)
    
def grabTree(filename):
    import pickle
    with open(filename,'rb') as fr:     #read back in binary mode as well
        return pickle.load(fr)
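
As a worked check of what calcShannonEnt measures, the sketch below computes the entropy of a made-up list of class labels and the information gain of a split that separates the classes perfectly. It is a standalone illustration working on plain label lists, not the function from trees.py, and the labels are invented.

```python
from math import log

def shannon_entropy(class_labels):
    """Entropy in bits of a list of class labels."""
    counts = {}
    for label in class_labels:
        counts[label] = counts.get(label, 0) + 1
    total = float(len(class_labels))
    return -sum((n / total) * log(n / total, 2) for n in counts.values())

mixed = ['yes', 'yes', 'no', 'no', 'no']   # made-up labels: 2 yes, 3 no
pure = ['no', 'no', 'no']                  # a single class has zero entropy

print(round(shannon_entropy(mixed), 4))

# A split that separates the classes perfectly removes all of the entropy,
# so the information gain equals the entropy of the unsplit list
gain = (shannon_entropy(mixed)
        - (2 / 5.0) * shannon_entropy(['yes', 'yes'])
        - (3 / 5.0) * shannon_entropy(pure))
print(round(gain, 4))
```

chooseBestFeatureToSplit applies the same weighted subtraction to every feature and keeps the one with the largest gain.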
    

Read the data file and build the tree with the decision tree algorithm

import trees

with open('lenses.txt') as fr:
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]

# feature labels
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']

# build the decision tree
lensesTree = trees.createTree(lenses, lensesLabels)
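
createTree returns nested dictionaries of the form {feature: {value: subtree or class label}}. The sketch below walks such a structure to classify a sample. It uses a small hand-written tree in the same format rather than the lenses tree, and the function is a standalone illustration of the traversal done by trees.classify.

```python
def classify(tree, feature_labels, sample):
    """Walk a nested-dict tree of the form {feature: {value: subtree_or_label}}."""
    feature = next(iter(tree))                 # the feature tested at this node
    branches = tree[feature]
    value = sample[feature_labels.index(feature)]
    outcome = branches[value]
    if isinstance(outcome, dict):              # internal node: keep descending
        return classify(outcome, feature_labels, sample)
    return outcome                             # leaf: the class label

# Small hand-written tree in the same nested-dict format
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feature_labels = ['no surfacing', 'flippers']

print(classify(tree, feature_labels, [1, 1]))
print(classify(tree, feature_labels, [1, 0]))
print(classify(tree, feature_labels, [0, 1]))
```

Note that createTree deletes entries from the labels list it is given, so pass a copy such as lensesLabels[:] if the full list is needed afterwards for classification.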

Visualizing the decision tree

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0] ###
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            numLeafs += getNumLeafs(secondDict[key])
        else:   numLeafs +=1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0] ###
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',
             xytext=centerPt, textcoords='axes fraction',
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )
    
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
    numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]     #the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            plotTree(secondDict[key],cntrPt,str(key))        #recursion
        else:   #it's a leaf node print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
#if you do get a dictionary you know it's a tree, and the first element will be another dict

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    #no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    plotTree(inTree, (0.5,1.0), '')
    plt.show()

#def createPlot():
#    fig = plt.figure(1, facecolor='white')
#    fig.clf()
#    createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
#    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
#    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
#    plt.show()

def retrieveTree(i):
    listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                  {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                  ]
    return listOfTrees[i]

#createPlot(thisTree)
import treePlotter
treePlotter.createPlot(lensesTree)
[Figure: tree.png]

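Applying K-Means and KNN

1. Implement the K-Means and KNN algorithms in any programming language:

  1. Use the K-Means algorithm to cluster the first 6 movies in the experimental data above;

  2. Take the final "movie to classify" entry in Table 2 and assign it to a cluster based on the result of the previous step.


    Shot statistics for classifying movies
  3. Write kMeans.py based on the K-Means algorithm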

from numpy import *

def loadDataSet(fileName):      #general function to parse tab -delimited floats
    dataMat = []                #assume last column is target value
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float,curLine)) #map all elements to float()
        dataMat.append(fltLine)
    return dataMat

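def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2))) #la.norm(vecA-vecB)

def randCent(dataSet, k):       #not in the original listing; a sketch so that the
    n = shape(dataSet)[1]       #default createCent=randCent below resolves
    centroids = mat(zeros((k,n)))
    for j in range(n):          #random centroids within the bounds of each column
        minJ = dataSet[:,j].min()
        rangeJ = float(dataSet[:,j].max() - minJ)
        centroids[:,j] = minJ + rangeJ * random.rand(k,1)
    return centroids

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):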
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m,2)))#create mat to assign data points 
                                      #to a centroid, also holds SE of each point
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):#for each data point assign it to the closest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j,:],dataSet[i,:])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i,0] != minIndex: clusterChanged = True
            clusterAssment[i,:] = minIndex,minDist**2
        print(centroids)
        for cent in range(k):#recalculate centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
            centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean 
    return centroids, clusterAssment

2. Loading the data

import kMeans
import numpy as np

dataMat= np.mat([[3,104],[2,100],[1,81],[101,10],[99,5],[98,2],[18,90]])
3. Cluster the experimental data above with the K-Means algorithm
kMeans.distEclud(dataMat[0],dataMat[1])
myCentroids, clustAssing = kMeans.kMeans(dataMat,2)
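
The two exercise steps (cluster the first six movies, then place the pending movie using that result) can also be sketched with plain NumPy arrays. This is not the kMeans.py listing: `k_means` is a hypothetical function, and it starts from two fixed centroids instead of random ones so that the result is repeatable.

```python
import numpy as np

def k_means(points, centroids, max_iter=100):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = np.array(centroids, dtype=float)
    for _ in range(max_iter):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        new_centroids = np.array([points[assignment == c].mean(axis=0)
                                  for c in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignment

# The first six movies only, as the exercise asks
movies = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]], dtype=float)

# Assumption: start from the first and last movie instead of random centroids
centroids, assignment = k_means(movies, movies[[0, -1]])

# Assign the pending movie to the nearest centroid found above
pending = np.array([18, 90], dtype=float)
pending_cluster = int(np.linalg.norm(centroids - pending, axis=1).argmin())

print(assignment.tolist())
print(centroids.tolist())
print(pending_cluster)
```

The pending movie lands in the same cluster as the first three movies, which agrees with the kNN result of 'Romance'.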

4. Displaying the clusters

A = np.asarray(dataMat[:,0])
B = np.asarray(dataMat[:,1])
CX = np.asarray(myCentroids[:,0])
CY = np.asarray(myCentroids[:,1])
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(A, B, s=50, color='b')
ax.scatter(CX, CY, s=1000, marker = '+', color='r')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

5. Write the KNN algorithm to classify the final "movie to classify"

from numpy import *
import operator
import time
import matplotlib.pyplot as plt

def kNN(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort() 
    classCount={}          
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
group = array([[3.0,104.0],[2.0,100.0],[1,81],[101,10.0],[99,5],[98,2.0]])
labels = ['Romance','Romance','Romance','Action','Action','Action']

kNN([18,90],group,labels,3)

Classification result: 'Romance'

Source code
