
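Today I looked at data preprocessing. Whether or not the data is preprocessed can have a large effect on the resulting score, so it is best to compare the two first and then decide whether to use it. Preprocessing generally involves three steps:
1. Import the relevant preprocessing module and initialize it.
2. Fit it to the data to be processed (the features, i.e. the independent variables; fit on the training set only).
3. Transform the data with the fitted scaler (both the training set and the test set).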
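from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# learn the per-feature minimum and maximum from the training set
scaler.fit(X_train)
# apply the same scaling to the training and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)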
The fit and transform steps can be combined into one: X_scaled_d = scaler.fit_transform(X)
But whoa...
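To make the "compare first, then decide" advice from the top of these notes concrete, here is a minimal sketch. The breast cancer dataset and the default SVC are my own assumptions for illustration; the notes do not say which data or model was used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# assumed example data: the breast cancer dataset that ships with scikit-learn
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# baseline: train and score on the raw features
svc_raw = SVC().fit(X_train, y_train)
score_raw = svc_raw.score(X_test, y_test)

# same model on rescaled features; the scaler is fit on the training set only
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
svc_scaled = SVC().fit(X_train_scaled, y_train)
score_scaled = svc_scaled.score(X_test_scaled, y_test)

print("Test score without scaling: {:.2f}".format(score_raw))
print("Test score with MinMaxScaler: {:.2f}".format(score_scaled))
```

Whichever score comes out higher on your own data is the answer to "should I preprocess"; the sketch only shows how to set up the comparison.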

Another common one:
##preprocessing using zero mean and unit variance scaling
from sklearn.preprocessing import StandardScaler
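The notes stop at the import, so here is a small sketch of how StandardScaler could be used with the same fit/transform steps. The dataset is an assumption on my part; any numeric feature matrix would do.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# assumed example data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                         # learn per-feature mean and std from the training set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test set reuses the training statistics

# on the training set every feature now has mean ~0 and standard deviation ~1
print(np.round(X_train_scaled.mean(axis=0), 2))
print(np.round(X_train_scaled.std(axis=0), 2))
```

The test set will not have exactly zero mean and unit variance, because it is scaled with the statistics learned from the training set.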

Principal Component Analysis (PCA)



Original shape: (569, 30)
Reduced shape: (569, 2)
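The code that printed the two shapes above is missing from these notes. The shape (569, 30) matches scikit-learn's breast cancer dataset, so the following is an assumed reconstruction that keeps the first two principal components; scaling with StandardScaler first is also my assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# PCA is sensitive to feature scale, so standardize first (assumed step)
X_scaled = StandardScaler().fit_transform(cancer.data)

# keep only the first two principal components
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Original shape: {}".format(str(X_scaled.shape)))
print("Reduced shape: {}".format(str(X_pca.shape)))
```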


Ugh, I can't really follow this part.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# generate synthetic two-dimensional data
X, y = make_blobs(random_state=1)
# build the clustering model
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
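After fitting, the cluster assignments and centers can be read off the model. A short sketch follows; fixing n_init and random_state is my addition so that the run is repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# same synthetic data as above: 100 points in 2 dimensions, 3 blobs
X, y = make_blobs(random_state=1)

# n_init and random_state are set explicitly here (an assumption, for repeatability)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

labels = kmeans.labels_            # cluster index assigned to each training point
centers = kmeans.cluster_centers_  # one row per cluster center
predicted = kmeans.predict(X)      # assign points to the nearest center

print("First ten labels: {}".format(labels[:10]))
print("Cluster centers:\n{}".format(centers))
```

The cluster numbers themselves are arbitrary; only the grouping carries meaning.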


import pandas as pd
data_dummies = pd.get_dummies(data)  # generate dummy variables
Encoding features that are stored as numbers:
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
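A sketch of what get_dummies does with this frame: by default only the string column is encoded, and the integer column is left alone unless it is listed in the columns parameter.

```python
import pandas as pd

demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})

# default behaviour: only the string column is turned into dummy variables
default_dummies = pd.get_dummies(demo_df)
print(list(default_dummies.columns))

# listing columns explicitly also encodes the integer column
all_dummies = pd.get_dummies(
    demo_df, columns=['Integer Feature', 'Categorical Feature'])
print(list(all_dummies.columns))
```

This matters when an integer column is really a category code rather than a quantity.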











Model evaluation and improvement
k-fold cross-validation is the most commonly used form of cross-validation.

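The most commonly used function is cross_val_score(). Its first argument is the model, the second is the feature data (X), and the third is the target values (y). When these notes were written the default was 3-fold cross-validation (scikit-learn 0.22 and later default to 5-fold); the number of folds can be changed with the cv parameter.
A common way to summarize the cross-validation accuracy is to compute the mean: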
print("Average cross-validation score: {:.2f}".format(scores.mean()))
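The cross_val_score() call described above can be sketched as follows. The iris data and LogisticRegression are assumptions chosen for illustration; the notes do not name the model used here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)  # assumed model

# arguments: model, features, targets; cv sets the number of folds
scores = cross_val_score(logreg, iris.data, iris.target, cv=5)

print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))
```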


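from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
iris = load_iris()
# param_grid is not defined in these notes; this grid over C and gamma is an assumed example
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# the search cross-validates on the training set only
grid_search.fit(X_train, y_train)
print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))
# output recorded in the original notes: Test set score: 0.97
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))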











Precision-recall curves and ROC curves:
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(
y_test, svc.decision_function(X_test))
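The snippet above uses svc, X_test and y_test without defining them. A self-contained sketch, assuming a default SVC trained on the breast cancer dataset (both are my choices, not from the notes):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# assumed binary classification data and model
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
svc = SVC().fit(X_train, y_train)

# decision_function gives a continuous score per sample;
# the curve is traced by sweeping a threshold over those scores
precision, recall, thresholds = precision_recall_curve(
    y_test, svc.decision_function(X_test))

print("Number of thresholds: {}".format(len(thresholds)))
print("First precision/recall pair: {:.2f} / {:.2f}".format(
    precision[0], recall[0]))
```

precision and recall each have one more entry than thresholds, because the curve ends at a fixed final point with precision 1 and recall 0.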

Receiver operating characteristics (ROC) and AUC
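The notes give only the heading here. A minimal sketch of computing an ROC curve and its AUC with scikit-learn, again assuming a default SVC on the breast cancer dataset for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# assumed binary classification data and model
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
svc = SVC().fit(X_train, y_train)
decision_scores = svc.decision_function(X_test)

# false positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_test, decision_scores)

# area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking
auc = roc_auc_score(y_test, decision_scores)
print("AUC: {:.3f}".format(auc))
```

Unlike accuracy, AUC is computed from the continuous scores, so it does not depend on any single decision threshold.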

