Background
This chapter is based on Assignment 1 of the Coursera course python-machine-learning.
References
- Pandas: summarizing and frequency counts
- Pandas: DataFrame slicing
- Pandas: the Series data structure
- sklearn: train_test_split() explained
Contents
- Pandas usage
- DataFrame usage
- Series usage
- K-nearest neighbors (KNN, k-NearestNeighbor)
0. The breast cancer dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.DESCR) # Print the data set description
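Besides the text description, it can help to check the array shapes and class names before building a DataFrame. The following is a minimal sketch that uses only attributes of the object returned by load_breast_cancer(); the comments describe what each attribute holds.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Feature matrix: one row per sample, one column per feature
print(cancer.data.shape)

# Label vector: one 0/1 label per sample
print(cancer.target.shape)

# Class names, ordered by label: index 0 is malignant, index 1 is benign
print(cancer.target_names)

# Names of the feature columns
print(len(cancer.feature_names))
```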
1. Pandas.DataFrame
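Creating a DataFrame:
dataFrame = pd.DataFrame(data=cancer.data, index=pd.RangeIndex(start=0, stop=569, step=1),
columns=cancer.feature_names)
# Append the labels so that the 'target' column used below exists (0 = malignant, 1 = benign)
dataFrame['target'] = cancer.target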
DataFrame slicing:
# Select all rows of columns 0-29 (the first 30 columns)
X = dataFrame.iloc[:, :30]
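Counting how often a value occurs in a DataFrame column
One approach is to convert the column to a list and call count() on it (label 0 is the malignant class):
malignant_count = list(dataFrame['target']).count(0)
or
malignant_count = list(dataFrame.target).count(0)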
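The list conversion works, but pandas can also count values directly. The sketch below rebuilds the DataFrame so that it runs on its own (the name df is chosen here for illustration) and shows two alternatives: value_counts() and summing a boolean mask.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target  # 0 = malignant, 1 = benign

# value_counts() returns a Series mapping each distinct value to its frequency
counts = df['target'].value_counts()
malignant_count = counts.loc[0]

# Summing a boolean mask counts the rows where the condition holds
malignant_count_alt = (df['target'] == 0).sum()

print(malignant_count, malignant_count_alt)
```

Both alternatives avoid copying the column into a Python list, which matters more as the data grows.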
2. Pandas.Series
malignant_count = list(dataFrame['target']).count(0)
benign_count = list(dataFrame['target']).count(1)
series = pd.Series(data=[malignant_count, benign_count], index=["malignant", "benign"])
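Once the Series has a string index, values can be read by label or by position, and arithmetic applies element-wise. A small sketch, with the two class counts written in as literals (212 malignant, 357 benign, as listed in the dataset description) so that it runs without the DataFrame:

```python
import pandas as pd

# Class counts written as literals; in the chapter they come from the 'target' column
series = pd.Series(data=[212, 357], index=["malignant", "benign"])

# Label-based lookup
print(series["malignant"])

# Position-based lookup
print(series.iloc[1])

# Element-wise arithmetic: the share of each class in the dataset
proportions = series / series.sum()
print(proportions)
```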
3. train_test_split()
test_size: float, int, or None (optional, default None). If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, it represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it is set to 0.25.
Here test_size=143 is an int, so exactly 143 of the 569 samples are held out for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)
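To see how the integer and float forms of test_size relate, the sketch below splits the same data both ways and prints the resulting shapes. It loads the arrays with return_X_y=True rather than going through the DataFrame, purely to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Integer test_size: exactly 143 samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=143, random_state=0)
print(X_train.shape, X_test.shape)

# Float test_size: a fraction of the samples goes to the test set
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train_f.shape, X_test_f.shape)
```

With 569 samples, a quarter of the data works out to 143 test samples, so the two calls produce splits of the same size.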
4. KNN
The complete code is as follows:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the breast_cancer dataset, which contains 569 samples with 30 features each
cancer = load_breast_cancer()
# Convert the cancer dataset to a DataFrame; the result has shape (569, 31), with target (0/1) as the last column
dataFrame = pd.DataFrame(data=cancer.data, index=pd.RangeIndex(start=0, stop=569, step=1),
columns=cancer.feature_names)
dataTarget = pd.DataFrame(data=cancer.target, index=pd.RangeIndex(start=0, stop=569, step=1), columns=['target'])
finalDataFrame = dataFrame.join(dataTarget)
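# Split the DataFrame into features X (the first 30 columns) and labels y
X = finalDataFrame.iloc[:, :30]
y = pd.Series(data=finalDataFrame.target)
# Hold out 143 samples as the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)
# Fit a 1-nearest-neighbor classifier on the training set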
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
# Try a prediction using the mean value of each feature ([:-1] drops the target column)
means = finalDataFrame.mean()[:-1].values.reshape(1, -1)
label = knn.predict(means)
print('label', label)
# Evaluate accuracy on the test set
score = knn.score(X_test, y_test)
print(score)
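The code above fixes n_neighbors=1. As a rough sketch of how the choice of k could be explored, the loop below repeats the same split and scores a few values of k on the test set. The list of k values is arbitrary and chosen only for illustration, and scores is a name introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=143, random_state=0)

# Map each candidate k to its accuracy on the held-out test set
scores = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

for k, acc in scores.items():
    print(f"k={k}: test accuracy {acc:.3f}")
```

Choosing k by test-set accuracy reuses the test data for tuning; a separate validation split or cross-validation is the more careful option.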