Deep Learning from Scratch (1) - Pandas Usage and KNN

Background

This chapter is based on Assignment 1 of the Coursera course python-machine-learning.

Contents

  • Pandas usage
  • DataFrame usage
  • Series usage
  • K-nearest neighbors (KNN, k-NearestNeighbor)

0. The breast cancer dataset

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR) # Print the data set description
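Besides the long text in DESCR, the object returned by load_breast_cancer() exposes the arrays used in the rest of this chapter. The following is a minimal sketch for inspecting them; it only uses attributes of the scikit-learn dataset object and prints their shapes rather than assuming them.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# The returned object behaves like a dictionary; these fields are used later on.
print(cancer.data.shape)         # feature matrix: (n_samples, n_features)
print(cancer.target.shape)       # one label per sample
print(cancer.target_names)       # class names: label 0 is the first name, label 1 the second
print(cancer.feature_names[:5])  # the first few column names
```

Checking target_names early is useful because the counts in the later sections rely on knowing which class label 0 stands for.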

1. Pandas.DataFrame

Creating a DataFrame:

dataFrame = pd.DataFrame(data=cancer.data, index=pd.RangeIndex(start=0, stop=569, step=1),
                             columns=cancer.feature_names)
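The call above hardcodes the number of rows (569) in the index. As a sketch of an alternative, the index can be left out so that pandas generates it, and the labels can be attached as an extra column in the same step. The variable name df is only used for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Without an explicit index, pandas creates a RangeIndex of the right length.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Append the labels as an extra column named 'target'.
df['target'] = cancer.target

print(df.shape)
print(df.index)
```

This produces the same kind of table as the join used in the complete code of section 4, with the target as the last column.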

Slicing a DataFrame:

# Select all rows of columns 0-29 (the first 30 columns)
X = dataFrame.iloc[:, :30]
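iloc selects by integer position. For comparison, here is a small sketch of position-based and label-based selection side by side. The name df and the two column names picked for loc are just choices for this example; the column names come from cancer.feature_names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Position-based: all rows, columns 0..29 (the end of the slice is exclusive).
X_by_position = df.iloc[:, :30]

# Name-based: everything except the label column.
X_by_name = df.drop(columns='target')

# Label-based slicing with loc includes the end label, so :4 gives rows 0..4.
first_rows = df.loc[:4, ['mean radius', 'mean texture']]

print(X_by_position.equals(X_by_name))
print(first_rows.shape)
```

Dropping the target column by name is less fragile than counting columns if the layout of the table ever changes.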

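Counting how often a value occurs in a DataFrame column
One way is to convert the column to a list and use list.count (this assumes the DataFrame already has a target column, as finalDataFrame in section 4 does):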

malignant_count = list(dataFrame['target']).count(0)

or

malignant_count = list(dataFrame.target).count(0)
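Converting to a list works, but pandas can also count values directly. The sketch below shows two such ways, value_counts() and a boolean mask. The name df is hypothetical and stands for a DataFrame that already contains the target column.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# value_counts() returns a Series mapping each distinct value to its frequency.
counts = df['target'].value_counts()
malignant_count = int(counts.loc[0])
benign_count = int(counts.loc[1])

# A boolean mask summed up gives the same number for a single value.
malignant_by_mask = int((df['target'] == 0).sum())

print(malignant_count, benign_count, malignant_by_mask)
```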

2. Pandas.Series

    malignant_count = list(dataFrame['target']).count(0)
    benign_count = list(dataFrame['target']).count(1)
    series = pd.Series(data=[malignant_count, benign_count], index=["malignant", "benign"])
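Once the Series exists, its values can be read by label or by position, and arithmetic applies to all entries at once. The following sketch builds the same kind of Series from the raw labels and shows these accesses; building it through value_counts() is an alternative chosen for this example, not the approach used above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Count labels 0 and 1, sort by label, then replace the labels by class names.
series = pd.Series(cancer.target).value_counts().sort_index()
series.index = list(cancer.target_names)

by_label = series['malignant']      # access by index label
by_position = series.iloc[1]        # access by integer position
fractions = series / series.sum()   # element-wise arithmetic keeps the index

print(series)
print(fractions)
```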

3. train_test_split()

The test_size parameter is documented as follows:

    test_size : float, int or None, optional (default=None)
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.25.

Here test_size is the integer 143, so 143 of the 569 samples are held out for testing and the remaining 426 are used for training:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)
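To see what the split produces, the shapes of the returned arrays can be checked. This is a small self-contained sketch; it loads the features and labels with return_X_y=True instead of going through a DataFrame, which is a shortcut taken only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Integer test_size: an absolute number of test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=143, random_state=0)
print(X_train.shape, X_test.shape)

# Float test_size: a proportion of the dataset, rounded up to whole samples.
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train_f.shape, X_test_f.shape)
```

Fixing random_state makes the shuffle reproducible, so repeated runs give the same train and test rows.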

4. KNN

The complete code is as follows:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the breast_cancer dataset: 569 samples with 30 feature dimensions
cancer = load_breast_cancer()
# Convert the cancer dataset to a DataFrame; after the join below its shape is (569, 31), where the last column is target (0/1)
dataFrame = pd.DataFrame(data=cancer.data, index=pd.RangeIndex(start=0, stop=569, step=1),
                             columns=cancer.feature_names)
dataTarget = pd.DataFrame(data=cancer.target, index=pd.RangeIndex(start=0, stop=569, step=1), columns=['target'])
finalDataFrame = dataFrame.join(dataTarget)


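# Split the DataFrame into the features X (first 30 columns) and the labels y
X = finalDataFrame.iloc[:, :30]
y = pd.Series(data=finalDataFrame.target)

# Hold out 143 samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)

# Fit a 1-nearest-neighbor classifier on the training set
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Try a prediction for a sample made of the mean value of each feature
# (drop the trailing target column, then reshape to a single row).
# Newer scikit-learn versions may warn that this array has no feature names.
means = finalDataFrame.mean()[:-1].values.reshape(1, -1)
label = knn.predict(means)
print('label', label)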

    
# Evaluate the performance on the test set
score = knn.score(X_test, y_test)
print(score)
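With n_neighbors=1, the classifier labels each query point with the label of its closest training point. The sketch below spells that out with NumPy and compares the result to KNeighborsClassifier. The helper predict_1nn is a hypothetical name written for this illustration, and it assumes plain Euclidean distance, which matches the default metric of KNeighborsClassifier.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=143, random_state=0)


def predict_1nn(X_train, y_train, X_query):
    """Hypothetical helper: label each query row like its nearest training row."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Index of the closest training sample for every query row.
    nearest = dists.argmin(axis=1)
    return y_train[nearest]


manual = predict_1nn(X_train, y_train, X_test)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
agreement = (manual == knn.predict(X_test)).mean()

print('agreement with scikit-learn:', agreement)
print('manual accuracy:', (manual == y_test).mean())
```

The brute-force distance matrix is fine for a dataset of this size; scikit-learn may use tree-based search internally, which matters mainly for larger data.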