Distribution of the training and test datasets
Before diving into the competition, we should compare the distribution of the test dataset with that of the training dataset and, where possible, see how much they differ. This is very helpful for the modeling work that follows.
First, import the required libraries:
import gc
import itertools
from copy import deepcopy
import numpy as np
import pandas as pd
from tqdm import tqdm
from scipy.stats import ks_2samp
from sklearn.preprocessing import scale, MinMaxScaler
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import FastICA
from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
from sklearn import manifold
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
%matplotlib inline
1. t-SNE distribution overview
First, I take an equal number of samples from the training and test datasets (4459 samples from each, i.e. the whole training set and an equally sized slice of the test set) and run t-SNE on the combined data. I apply mean-variance scaling to all data, but for columns with outliers (> 3x the standard deviation) I also apply a log transform before scaling.
1.0 Data preprocessing
Current preprocessing steps:
- Take 4459 rows from both the training and the test set and concatenate them
- Drop columns that have zero standard deviation in the training set
- Drop duplicated columns in the training set
- Log-transform all columns containing outliers (> 3x standard deviation)
- Create two datasets (a wiring sketch follows the helper functions below):
  - mean-variance scaling of all columns, zeros included!
  - mean-variance scaling of all columns, zeros excluded!
Merging the datasets
def combined_data(train, test):
    """
    Get the combined data
    :param train pandas.dataframe:
    :param test pandas.dataframe:
    :return pandas.dataframe:
    """
    A = set(train.columns.values)
    B = set(test.columns.values)
    # Columns present only in train (e.g. the target) cannot be compared, so drop them
    colToDel = A.difference(B)
    total_df = pd.concat([train.drop(colToDel, axis=1), test], axis=0)
    return total_df
Removing duplicate columns
def remove_duplicate_columns(total_df):
    """
    Removing duplicate columns
    """
    colsToRemove = []
    columns = total_df.columns
    for i in range(len(columns) - 1):
        v = total_df[columns[i]].values
        for j in range(i + 1, len(columns)):
            if np.array_equal(v, total_df[columns[j]].values):
                colsToRemove.append(columns[j])
    colsToRemove = list(set(colsToRemove))
    total_df.drop(colsToRemove, axis=1, inplace=True)
    print(f">> Dropped {len(colsToRemove)} duplicate columns")
    return total_df
Handling outliers
def log_significant_outliers(total_df):
    """
    First fill NaNs, then log-transform all columns which have significant
    outliers (> 3x standard deviation) and scale the non-zero entries
    :return pandas.dataframe:
    """
    total_df_all = deepcopy(total_df).select_dtypes(include=[np.number])
    total_df_all.fillna(0, inplace=True)

    for col in total_df_all.columns:
        data = total_df_all[col].values
        data_mean, data_std = np.mean(data), np.std(data)
        cut_off = data_std * 3
        lower, upper = data_mean - cut_off, data_mean + cut_off
        outliers = [x for x in data if x < lower or x > upper]

        if len(outliers) > 0:
            # Log-transform only the non-zero entries so zeros stay zero
            non_zero_index = data != 0
            total_df_all.loc[non_zero_index, col] = np.log(data[non_zero_index])

        # Mean-variance scale the non-zero rows of every column
        non_zero_rows = total_df[col] != 0
        total_df_all.loc[non_zero_rows, col] = scale(total_df_all.loc[non_zero_rows, col])
        gc.collect()
    return total_df_all
After this step we obtain two slightly different datasets, which differ only in how extreme values were treated.
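The cell that wires these helpers together is not shown in the original. Below is a minimal sketch of a plausible wiring; the file names, the zero-std filter, and the exact construction of the two variants (`total_df`, `total_df_all`) are assumptions based on the step list above:
# Sketch only (assumed, not from the original): build the two dataset variants.
train_df = pd.read_csv('train.csv')  # assumed file names
test_df = pd.read_csv('test.csv')

combined = combined_data(train_df, test_df)
combined = remove_duplicate_columns(combined)

num = combined.select_dtypes(include=[np.number]).fillna(0)
num = num.loc[:, num.std() > 0]  # drop zero-std columns

# Variant 1 ("scaling on non-zeros"): log-transform outliers, scale non-zero entries
total_df = log_significant_outliers(num)

# Variant 2 ("scaling on all entries"): mean-variance scale everything, zeros included
total_df_all = pd.DataFrame(scale(num), index=num.index, columns=num.columns)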
1.1 Running PCA
Since there are a lot of features, I think it is a good idea to run PCA before t-SNE to reduce the dimensionality. Somewhat arbitrarily, I chose to keep 1000 PCA components, which capture roughly 80% of the variance in the dataset; I think this is enough to characterize the distributions while also speeding up t-SNE. Below I only show the plots for the PCA of the datasets.
def test_pca(data, train_idx, test_idx, create_plots=True):
    """
    data, pandas.DataFrame
    train_idx = range(0, len(train_df))
    test_idx = range(len(train_df), len(total_df))
    Run PCA analysis, return embedding
    """
    data = data.select_dtypes(include=[np.number])
    data = data.fillna(0)

    # Create a PCA object, specifying how many components we wish to keep
    pca = PCA(n_components=len(data.columns))

    # Run PCA on scaled numeric dataframe, and retrieve the projected data
    pca_trafo = pca.fit_transform(data)

    # The transformed data is in a numpy matrix. This may be inconvenient if we want to further
    # process the data, and have a more visual impression of what each column is etc. We therefore
    # put transformed/projected data into new dataframe, where we specify column names and index
    pca_df = pd.DataFrame(
        pca_trafo,
        index=data.index,
        columns=['PC' + str(i + 1) for i in range(pca_trafo.shape[1])]
    )

    if create_plots:
        # Create four plots in a 2x2 grid
        _, axes = plt.subplots(2, 2, figsize=(20, 15))
        axes = list(itertools.chain.from_iterable(axes))

        # Plot the explained variance ratio
        axes[0].plot(
            pca.explained_variance_ratio_, "--o", linewidth=2,
            label="Explained variance ratio"
        )
        # Plot the cumulative explained variance ratio
        axes[0].plot(
            pca.explained_variance_ratio_.cumsum(), "--o", linewidth=2,
            label="Cumulative explained variance ratio"
        )
        # Show legend
        axes[0].legend(loc='best', frameon=True)

        # Show biplots
        for i in range(1, 4):
            # Components to be plotted
            x, y = "PC" + str(i), "PC" + str(i + 1)
            # Plot biplots
            settings = {'kind': 'scatter', 'ax': axes[i], 'alpha': 0.2, 'x': x, 'y': y}
            pca_df.iloc[train_idx].plot(label='Train', c='#ff7f0e', **settings)
            pca_df.iloc[test_idx].plot(label='Test', c='#1f77b4', **settings)
    return pca_df
train_idx = range(0, len(train_df))
test_idx = range(len(train_df), len(total_df))
pca_df = test_pca(total_df, train_idx, test_idx)
pca_df_all = test_pca(total_df_all, train_idx, test_idx)
print(">> PCA : (only for np.number)", pca_df.shape, pca_df_all.shape)


This looks interesting: the training data is much more spread out than the test data, which seems to cluster more tightly around the center.
1.2 Running t-SNE
With the dimensionality reduced somewhat, t-SNE now runs in about 5 minutes, after which we can plot the training and test data in the embedded 2D space. Below I do this for both dataset variants to see whether any differences show up.
def test_tsne(data, ax=None, title='t-SNE'):
    """Run t-SNE and return embedding"""
    # Run t-SNE
    tsne = TSNE(n_components=2, init='pca')
    Y = tsne.fit_transform(data)

    # Create plot (fall back to a fresh axis if none is passed in)
    if ax is None:
        _, ax = plt.subplots()
    for name, idx in zip(["Train", "Test"], [train_idx, test_idx]):
        ax.scatter(Y[idx, 0], Y[idx, 1], label=name, alpha=0.2)
    ax.set_title(title)
    ax.xaxis.set_major_formatter(NullFormatter())
    ax.yaxis.set_major_formatter(NullFormatter())
    ax.legend()
    return Y
# Run t-SNE on PCA embedding
_, axes = plt.subplots(1, 2, figsize=(20, 8))

tsne_df = test_tsne(
    pca_df, axes[0],
    title='t-SNE: Scaling on non-zeros'
)
tsne_df_unique = test_tsne(
    pca_df_all, axes[1],
    title='t-SNE: Scaling on all entries'
)

plt.axis('tight')
plt.show()

From this it seems that if scaling is performed only on the non-zero entries, the training and test sets look more similar, whereas if scaling is performed on all entries the two datasets appear more separated. In a previous version of this notebook, where I had not removed the duplicate and zero-standard-deviation columns, the difference was much more pronounced. Of course, in my experience t-SNE interpretations should be treated with caution, and this may be worth studying in more detail, both in terms of t-SNE parameters and of preprocessing.
1.2.1 t-SNE colored by row index or zero count
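The cell that produced these panels is not included here; the following is a minimal sketch of how they could be reproduced, assuming the `tsne_df` embedding and `total_df` from above (the colormap choice is an assumption):
# Sketch (assumed, not from the original): color the t-SNE embedding by
# row index and by the number of zero entries per row.
_, axes = plt.subplots(1, 2, figsize=(16, 6))
colorings = [
    (np.arange(len(tsne_df)), 'Row index'),
    ((total_df == 0).sum(axis=1).values, 'Number of zero entries'),
]
for ax, (c, label) in zip(axes, colorings):
    sc = ax.scatter(tsne_df[:, 0], tsne_df[:, 1], alpha=0.2, c=c, cmap='viridis')
    plt.colorbar(sc, ax=ax).set_label(label)
    ax.set_title(f"t-SNE colored by {label.lower()}")
    ax.xaxis.set_major_formatter(NullFormatter())
    ax.yaxis.set_major_formatter(NullFormatter())
plt.show()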

This looks interesting: rows with higher indices seem to lie near the center of the plot. We also see a small group of rows with almost no zero entries, and a few clusters are visible in the right-hand plot.
1.2.2 t-SNE with different parameters
Depending on its parameters, t-SNE can give somewhat different results, so to be sure I checked a few different values of the perplexity parameter below.
_, axes = plt.subplots(1, 4, figsize=(20, 5))
for i, perplexity in enumerate([5, 30, 50, 100]):
    # Create projection
    Y = TSNE(init='pca', perplexity=perplexity).fit_transform(pca_df)

    # Plot t-SNE
    for name, idx in zip(["Train", "Test"], [train_idx, test_idx]):
        axes[i].scatter(Y[idx, 0], Y[idx, 1], label=name, alpha=0.2)
    axes[i].set_title("Perplexity=%d" % perplexity)
    axes[i].xaxis.set_major_formatter(NullFormatter())
    axes[i].yaxis.set_major_formatter(NullFormatter())
    axes[i].legend()
plt.show()

2. Test vs. Train
Another good approach is to see how well we can classify whether a given row belongs to the test or the training dataset: if this can be done with reasonable accuracy, it is an indication of a difference between the two distributions. I will use a basic random-forest-style model (ExtraTrees) with simple shuffled 10-fold cross-validation and see how well it performs on this task. First, let's try classifying the variant where scaling was performed on all entries:
def test_prediction(data):
    """Try to classify train/test samples from total dataframe"""

    # Create a target which is 1 for training rows, 0 for test rows
    y = np.zeros(len(data))
    y[train_idx] = 1

    # Perform shuffled CV predictions of train/test label
    predictions = cross_val_predict(
        ExtraTreesClassifier(n_estimators=100, n_jobs=4),
        data, y,
        cv=StratifiedKFold(
            n_splits=10,
            shuffle=True,
            random_state=42
        )
    )

    # Show the classification report
    print(classification_report(y, predictions))

# Run classification on the variant scaled on all entries
test_prediction(total_df_all)
On the current data this gives an F1 score of about 0.71, which means the classifier can separate the two sets reasonably well, indicating some significant differences between the datasets. Let's try the dataset where only the non-zero values were scaled:
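The cell for this second run is not shown in the original; presumably it is simply:
# Presumed cell (not in the original): classify the variant scaled on non-zeros only
test_prediction(total_df)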
>> Prediction Train or Test
             precision    recall  f1-score   support

        0.0       0.86      0.46      0.60      4459
        1.0       0.63      0.92      0.75      4459

avg / total       0.75      0.69      0.68      8918
3. Distribution similarity per feature
Next, let's look at the problem feature by feature and run Kolmogorov-Smirnov tests to check whether the distributions in the test and training sets are similar. I will use the `ks_2samp` function from scipy.stats to run the tests. For all features whose distributions are highly distinguishable, we may benefit from ignoring those columns to avoid overfitting the training data. Below I simply identify these columns and plot the distributions of a few of them as a sanity check.
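As a quick illustration before scanning all columns, here is the test applied to a single, arbitrarily chosen column (a sketch, not part of the original):
# Sketch: two-sample KS test on one column. A small p-value together with a
# large statistic suggests the train and test distributions of this column differ.
col = total_df.columns[0]
statistic, pvalue = ks_2samp(
    total_df.iloc[train_idx][col].values,
    total_df.iloc[test_idx][col].values
)
print(f"{col}: KS statistic = {statistic:.3f}, p = {pvalue:.5f}")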
def get_diff_columns(train_df, test_df, show_plots=True, show_all=False, threshold=0.1):
    """Use KS to estimate columns where distributions differ a lot from each other"""

    # Find the columns where the distributions are very different
    diff_data = []
    for col in tqdm(train_df.columns):
        statistic, pvalue = ks_2samp(
            train_df[col].values,
            test_df[col].values
        )
        if pvalue <= 0.05 and np.abs(statistic) > threshold:
            diff_data.append({'feature': col, 'p': np.round(pvalue, 5), 'statistic': np.round(np.abs(statistic), 2)})

    # Put the differences into a dataframe
    diff_df = pd.DataFrame(diff_data).sort_values(by='statistic', ascending=False)

    if show_plots:
        # Let us see the distributions of these columns to confirm they are indeed different
        n_cols = 7
        if show_all:
            n_rows = int(len(diff_df) / 7)
        else:
            n_rows = 2
        _, axes = plt.subplots(n_rows, n_cols, figsize=(20, 3 * n_rows))
        axes = [x for l in axes for x in l]

        # Create plots
        for i, (_, row) in enumerate(diff_df.iterrows()):
            if i >= len(axes):
                break
            extreme = np.max(np.abs(train_df[row.feature].tolist() + test_df[row.feature].tolist()))
            train_df.loc[:, row.feature].apply(np.log1p).hist(
                ax=axes[i], alpha=0.5, label='Train', density=True,
                bins=np.arange(-extreme, extreme, 0.25)
            )
            test_df.loc[:, row.feature].apply(np.log1p).hist(
                ax=axes[i], alpha=0.5, label='Test', density=True,
                bins=np.arange(-extreme, extreme, 0.25)
            )
            axes[i].set_title(f"Statistic = {row.statistic}, p = {row.p}")
            axes[i].set_xlabel(f'Log({row.feature})')
            axes[i].legend()
        plt.tight_layout()
        plt.show()
    return diff_df
# Get the columns which differ a lot between test and train
diff_df = get_diff_columns(total_df.iloc[train_idx], total_df.iloc[test_idx])
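The cell that produced the output below is not shown; presumably the differing columns are dropped and the train/test classifier from section 2 is run again, along these lines:
# Presumed cell (not in the original): drop the differing columns and re-run
# the train/test classification to check whether the sets get harder to separate.
print(f">> Dropping {len(diff_df)} features based on KS tests")
test_prediction(total_df_all.drop(diff_df['feature'].values, axis=1))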
>> Dropping 22 features based on KS tests
             precision    recall  f1-score   support

        0.0       0.85      0.45      0.59      4459
        1.0       0.63      0.92      0.75      4459

avg / total       0.74      0.68      0.67      8918
4. Decomposition features
So far I have only looked at PCA components, but most kernels consider several decomposition methods, so it may be interesting to run t-SNE on 10-50 components from each method instead of on 1000 PCA components. It is also interesting to see how well we can classify test/train rows based on this reduced feature space.
COMPONENTS = 20

# List of decomposition methods to use
methods = [
    TruncatedSVD(n_components=COMPONENTS),
    PCA(n_components=COMPONENTS),
    FastICA(n_components=COMPONENTS),
    GaussianRandomProjection(n_components=COMPONENTS, eps=0.1),
    SparseRandomProjection(n_components=COMPONENTS, dense_output=True)
]

# Run all the methods
embeddings = []
for method in methods:
    name = method.__class__.__name__
    embeddings.append(
        pd.DataFrame(method.fit_transform(total_df), columns=[f"{name}_{i}" for i in range(COMPONENTS)])
    )
    print(f">> Ran {name}")

# Put all components into one dataframe
components_df = pd.concat(embeddings, axis=1)
# Prepare plot (keep the figure handle for the colorbars)
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
cm = plt.cm.viridis  # colormap; undefined in the original, viridis is an assumption

# Run t-SNE on components
tsne_df = test_tsne(
    components_df, axes[0],
    title='t-SNE: with decomposition features'
)

# Color by index
sc = axes[1].scatter(tsne_df[:, 0], tsne_df[:, 1], alpha=0.2, c=range(len(tsne_df)), cmap=cm)
cbar = fig.colorbar(sc, ax=axes[1])
cbar.set_label('Entry index')
axes[1].set_title("t-SNE colored by index")
axes[1].xaxis.set_major_formatter(NullFormatter())
axes[1].yaxis.set_major_formatter(NullFormatter())

# Color by target
sc = axes[2].scatter(tsne_df[train_idx, 0], tsne_df[train_idx, 1], alpha=0.2, c=np.log1p(train_df.target), cmap=cm)
cbar = fig.colorbar(sc, ax=axes[2])
cbar.set_label('Log1p(target)')
axes[2].set_title("t-SNE colored by target")
axes[2].xaxis.set_major_formatter(NullFormatter())
axes[2].yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')
plt.show()


With these decomposition features, the test and training set distributions now look similar.