Data source:
The data comes from the Kaggle competition "Titanic: Machine Learning from Disaster", which provides two files, train.csv and test.csv.
Analysis goal:
Using the available features, we explore how each dimension relates to survival, then clean and process the data and build a model to predict survival for the test set.
Data exploration:
Variable definitions:
PassengerId: passenger ID (sequence number)
Pclass: ticket class
Name: passenger name
Sex: passenger sex
Age: passenger age
SibSp: number of siblings/spouses aboard
Parch: number of parents/children aboard
Ticket: ticket number
Fare: fare paid
Cabin: cabin number
Embarked: port of embarkation
After importing the data and inspecting each variable, we find missing values in the Age, Embarked, Fare and Cabin columns.
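A minimal sketch of this check, assuming both CSV files sit in the working directory (the paths and the name test_data_raw are placeholders); it also shows the imports the plotting snippets below rely on:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train_data = pd.read_csv('train.csv')     # the exploration snippets below use this name
test_data_raw = pd.read_csv('test.csv')   # placeholder name, only used for this check
print(train_data.isnull().sum())          # Age, Cabin and Embarked contain NaNs in train.csv
print(test_data_raw.isnull().sum())       # Age, Cabin and Fare contain NaNs in test.csv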
Exploring the variables in train.csv:
PassengerId:
PassengerId is just a sequence number, so we leave it out of the model for now.
Survived:
train_data['Survived'].value_counts().plot.pie(autopct='%1.1f%%')
plt.title('Survived vs. perished ratio')

Survivors make up 38.4% of passengers and victims 61.6%, a ratio of roughly 1:1.6. The balance between positive and negative samples is acceptable, so no resampling is needed later.
Pclass:
pd.crosstab(train_data.Pclass, train_data.Survived).plot.bar()
plt.title('Passenger class vs. survival')

The chart shows that the chance of perishing rises as the class goes from 1 to 3; passengers in class 1 have the highest survival rate.
Sex:
train_data.groupby(['Sex', 'Survived'])['Survived'].count().plot(kind='bar')
plt.title('Sex vs. survival')

As for sex, women survived at a higher rate than men, which is consistent with the European 'ladies first' norm of the time.
Age:
f, ax = plt.subplots(1, 2, figsize=(12, 9))
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=train_data, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot(x='Sex', y='Age', hue='Survived', data=train_data, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))
plt.show()

Looking at age across classes and sexes: overall, within every passenger class the survivors tend to be younger than the victims. By sex, the surviving men skew younger than the men who died, whereas for women, likely because of 'ladies first', the age distribution of surviving women is older than that of the women who died.
Embarked:
train_data.Embarked.value_counts().plot.bar()
plt.title('Passenger counts by port of embarkation')

pd.crosstab(train_data.Embarked, train_data.Survived).plot.bar()
plt.title('Port of embarkation vs. survival')

f = pd.crosstab(train_data.Embarked, train_data.Survived)
f['R: perished/survived'] = f[0] / f[1]
f

The table shows that passengers who embarked at S had the highest ratio of deaths, while those who embarked at C had the highest survival rate.
Parch & SibSp:
f, ax = plt.subplots(1, 2, figsize=(12, 8))
pd.crosstab(train_data.Parch, train_data.Survived).plot.bar(ax=ax[0])
ax[0].set_title('Parents/children vs. survival')
pd.crosstab(train_data.SibSp, train_data.Survived).plot.bar(ax=ax[1])
ax[1].set_title('Siblings/spouses vs. survival')

Since siblings/spouses and parents/children are all relatives, we try combining them into a single count:
pd.crosstab(train_data.Parch + train_data.SibSp + 1, train_data.Survived).plot.bar()  # +1 to count the passenger themselves


The chart shows that passengers travelling alone have a comparatively low survival rate.
Ticket & Fare:
Inspecting the dataset, we find that Ticket covers both individual and group tickets; for a group ticket, every holder's Fare shows the total price of the whole group ticket.
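One way to spot-check this (an illustrative query, not part of the original notebook) is to look at tickets shared by several passengers and confirm that all holders carry the same Fare:

shared = train_data[train_data.duplicated('Ticket', keep=False)]   # tickets held by more than one passenger
shared.groupby('Ticket')['Fare'].agg(['count', 'nunique']).head()  # nunique of 1 means one group price repeated per holder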
Data cleaning and processing:
Given the Ticket/Fare relationship found above, and because every transformation applied to the train set must also be applied to the test set, we combine the two sets and clean them together:
train_data_temp = pd.read_csv(r'C:\Users\zhang\Desktop\train.csv')
test_data_temp = pd.read_csv(r'C:\Users\zhang\Desktop\test.csv')
test_data_temp['Survived'] = 0  # placeholder value; this column is dropped again later
combined_train_test = pd.concat([train_data_temp, test_data_temp], sort=False, ignore_index=True)
Handling missing values:
Embarked: fill the missing values with the mode; since Embarked takes three values, one-hot encode it after filling.
combined_train_test['Embarked'] = combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode()[0])
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'], prefix='E')
combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)
Sex: map the original male/female values to 1 and 0.
combined_train_test['Sex'] = combined_train_test.Sex.map({'male': 1, 'female': 0})
Age & Fare: fill the missing values with the column mean.
combined_train_test['Age']=combined_train_test['Age'].fillna(combined_train_test.Age.mean())
combined_train_test['Fare']=combined_train_test['Fare'].fillna(combined_train_test.Fare.mean())
SibSp & Parch: following the earlier exploration, combine the two into a new variable Family_Size.
combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
Ticket & Fare: as observed above, group tickets exist and every holder's Fare records the full group price, so we spread the group price evenly over everyone sharing the same ticket.
combined_train_test['Group_Ticket'] = combined_train_test['Fare'].groupby(by=combined_train_test['Ticket']).transform('count')  # holders per ticket
combined_train_test['Fare'] = combined_train_test['Fare'] / combined_train_test['Group_Ticket']  # per-person fare
combined_train_test.drop(['Group_Ticket'], axis=1, inplace=True)
Cabin: too many missing values, so we drop it.
Name: not used in the model.
Drop the redundant and excluded variables:
combined_train_test=combined_train_test.drop(['Cabin','Ticket','Name','Embarked','PassengerId'],axis=1)
Re-check the state of the dataset:
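For example (an illustrative check; the original output is not reproduced here):

combined_train_test.info()           # dtypes and non-null counts after cleaning
combined_train_test.isnull().sum()   # the kept columns should no longer contain missing values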

Cleaning is done. Next we split the combined data back into the train and test sets, and drop the placeholder Survived column that we added to the test set:
train_data = combined_train_test[:891]
test_data = combined_train_test[891:]
f_X_test = test_data.drop(['Survived'], axis=1)  # the original test set, renamed f_X_test
Modeling:
We use cross-validation and split the training data into a training set and a hold-out test set:
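The modeling snippets below assume the following scikit-learn imports, which are not shown in the original write-up:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, accuracy_score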
X = train_data.drop(['Survived'],axis=1)
y = train_data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
Tune the hyper-parameters with GridSearchCV to obtain the best parameters and the related diagnostics:
# Set the parameters by cross-validation
parameter_space = {
    "n_estimators": [10, 15, 20],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
# scores = ['precision', 'recall', 'roc_auc']
scores = ['roc_auc']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = RandomForestClassifier(random_state=14)
    grid = GridSearchCV(clf, parameter_space, cv=5, scoring='%s' % score)
    # scoring='%s_macro' % score: precision_macro and recall_macro are meant for multiclass/multilabel tasks
    grid.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(grid.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = grid.cv_results_['mean_test_score']
    stds = grid.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, grid.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    bclf = grid.best_estimator_
    bclf.fit(X_train, y_train)
    y_true = y_test
    y_pred = bclf.predict(X_test)
    y_pred_pro = bclf.predict_proba(X_test)
    y_scores = pd.DataFrame(y_pred_pro, columns=bclf.classes_.tolist())[1].values
    print(classification_report(y_true, y_pred))
    auc_value = roc_auc_score(y_true, y_scores)
    # Plot the ROC curve
    fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=1.0)
    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', linewidth=lw, label='ROC curve (area = %0.4f)' % auc_value)
    plt.plot([0, 1], [0, 1], color='navy', linewidth=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
Parameter search output:


With an ROC AUC of 0.8869 the result is acceptable, so we refit the model with the parameters found by the search:
clf = RandomForestClassifier(criterion='entropy', min_samples_leaf=6, n_estimators=10, random_state=14)
clf.fit(X_train, y_train)
accuracy_score(y_test, clf.predict(X_test))

Accuracy on the hold-out set: 0.813

On the hold-out set the fitted model achieves:
Precision: 0.81
Recall: 0.81
F1-score: 0.81
Finally, predicting on the test set (renamed f_X_test) gives the survival predictions:
f_y_Pred = clf.predict(f_X_test)
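If the goal is a Kaggle submission, the predictions can be paired with the PassengerId column from the raw test file; a minimal sketch (the output file name is an assumption, not part of the original analysis):

submission = pd.DataFrame({
    'PassengerId': test_data_temp['PassengerId'].values,  # from the raw test.csv read earlier
    'Survived': f_y_Pred.astype(int)
})
submission.to_csv('titanic_submission.csv', index=False)  # hypothetical file name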