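This is a DC practice competition that requires logistic regression: given records of the factors that influence attrition and of whether each employee left, build a logistic regression model to predict which employees are likely to leave.

For a description of the data, see: competition page

1. Exploring the data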

import numpy as np
import pandas as pd

# Read the data
train = pd.read_csv('pfm_train.csv')
test = pd.read_csv('/pfm_test.csv')
print('train size:{}'.format(train.shape))  # train size:(1100, 31)
print('test size:{}'.format(test.shape))  # test size:(350, 30)
# Check whether the dataset contains missing values: there are none
# train.isnull().mean()
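
The missing-value check that is commented out above can be sketched on a small frame. The `toy` DataFrame and its values are invented for illustration and are not the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for pfm_train.csv
toy = pd.DataFrame({
    "Age": [25, 41, np.nan, 33],
    "Department": ["Sales", None, "Human Resources", "Sales"],
    "Attrition": [1, 0, 0, 1],
})

# Share of missing values per column (0.0 means the column is complete)
missing_share = toy.isnull().mean()

# Names of the columns that contain at least one missing value
cols_with_missing = missing_share[missing_share > 0].index.tolist()

print(missing_share)
print(cols_with_missing)
```

On the real data the second list would be empty, which is what the comment in the original code reports.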

1.1 Data analysis

# EmployeeNumber is the employee ID, so drop it
train.drop(['EmployeeNumber'], axis = 1, inplace = True)

# Move Attrition (the label) to the first column for easier indexing
Attrition = train['Attrition']
train.drop(['Attrition'], axis = 1, inplace = True)
train.insert(0, 'Attrition', Attrition)

Use pyecharts to analyse the number of leavers and the attrition rate along each dimension

# Note: these imports use the pyecharts 0.5.x API
from pyecharts import Bar, Line, Overlap, Page

# Chart each feature to see which factors mainly drive attrition
def get_charts(train, col):
    data = train.groupby([col])['Attrition']
    data_sum = data.sum()  # number of leavers
    data_mean = data.mean()  # attrition rate

    bar = Bar(col, title_pos="45%")
    bar.add('Leavers', data_sum.index, data_sum.values, mark_point = ['max'],
            yaxis_formatter = ' people', yaxis_max = 200, legend_pos="40%", legend_orient="vertical",
            legend_top="95%", bar_category_gap = '25%')

    line = Line()
    line.add('Attrition rate', data_mean.index, data_mean.values, mark_point = ['max'], mark_line = ['average'],
             yaxis_max = 0.8)

    # Overlay the rate line on the bar chart, using a second y axis
    overlap = Overlap(width=900, height=400)
    overlap.add(bar)
    overlap.add(line, is_add_yaxis=True, yaxis_index=1)

    return overlap


# One chart per feature; column 0 is the Attrition label itself
page = Page()
for col in train.columns[1:]:
    page.add(get_charts(train, col))
page.render('pages.html')
page

After running this code, the charts showed incorrect data. I checked the code and found no problem; manually refreshing the data in the chart fixed it.

# The company's overall attrition rate is 16.2%
train['Attrition'].mean()
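
The per-category numbers behind each chart can also be inspected as a plain table, which is a useful cross-check when a chart renders incorrectly. This is a minimal sketch on invented data; the `toy` frame and the `rate_vs_overall` column are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data; the real columns come from pfm_train.csv
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Attrition": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Leavers, headcount and attrition rate for each category
summary = toy.groupby("OverTime")["Attrition"].agg(
    leavers="sum", headcount="count", rate="mean"
)

# Compare each category with the overall attrition rate
overall_rate = toy["Attrition"].mean()
summary["rate_vs_overall"] = summary["rate"] / overall_rate

print(summary)
```

A `rate_vs_overall` value above 1 marks a category whose attrition rate exceeds the company-wide rate.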

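Observations from the charts:
Q1: The R&D department has the most leavers, mainly because it is the largest department. Despite its size, R&D has the lowest attrition rate. The highest attrition rate is in HR, which is also the smallest department. Is its staffing structure unstable? (Taking other factors such as 'Education' and 'JobRole' into account also shows that attrition in the HR department is high.)

Department.png

Q2: Employees aged 18-23 have an attrition rate above 40%; for those aged 23-33 it is between 20% and 40%.

Age.png

Q3: Employees with a job involvement level of 1 have an attrition rate of 38%, close to 40%.

JobInvolvement.png

Q4: Employees who work overtime leave at three times the rate of those who do not.

OverTime.png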

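Further exploration

Department & overtime & income

Sales employees with lower incomes who work overtime have an attrition rate of 80%. In that department, employees who work overtime have an attrition rate above the company-wide rate regardless of income level.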
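
A three-way breakdown like this one can be sketched with a pivot table. The data below is invented, and the two income bands with a cut point at 5000 are arbitrary assumptions chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical toy data using the column names from the dataset
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales",
                   "Research & Development", "Research & Development",
                   "Research & Development", "Research & Development"],
    "OverTime": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "MonthlyIncome": [2000, 2500, 2200, 9000, 3000, 12000, 4000, 15000],
    "Attrition": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Two illustrative income bands (the 5000 boundary is an assumption)
toy["IncomeBand"] = pd.cut(
    toy["MonthlyIncome"], bins=[0, 5000, 20000], labels=["low", "high"]
)

# Attrition rate for every department / overtime / income-band combination
rates = toy.pivot_table(
    index=["Department", "OverTime"],
    columns="IncomeBand",
    values="Attrition",
    aggfunc="mean",
    observed=False,
)

print(rates)
```

Cells with no employees come out as NaN, so an empty combination is not mistaken for a 0% attrition rate.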

Job satisfaction by department

One reason HR employees leave: their job satisfaction is relatively low. Could office politics be a factor?

2. Data processing & feature processing

2.1 Data processing

# The analysis suggested that some fields hold a single value; verify this
single_value_feature = []
for col in train.columns:
    length = len(train[col].unique())
    if length == 1:
        single_value_feature.append(col)

single_value_feature  # ['Over18', 'StandardHours']
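
The same check can be written without an explicit loop. This is a sketch on an invented frame; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy frame with two constant columns
toy = pd.DataFrame({
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "Age": [30, 41, 30],
})

# nunique() counts distinct values per column; keep the columns with exactly one
single_value_cols = toy.columns[toy.nunique() == 1].tolist()

print(single_value_cols)  # ['Over18', 'StandardHours']
```

Note that `nunique()` ignores missing values by default, whereas `len(col.unique())` counts NaN as a value. The two agree here because the dataset has no missing values.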

'Over18' and 'StandardHours' each hold a single value, so drop both fields.

# Drop these two fields
train.drop(['Over18', 'StandardHours'], axis = 1, inplace = True)
train.shape  # (1100, 28)

Since the dataset has no missing values, no missing-value handling is needed here.

2.2 Feature processing

This mainly consists of binning some features and one-hot encoding.

# Bin the income
print(train['MonthlyIncome'].min())  # 1009
print(train['MonthlyIncome'].max())  # 19999
print(test['MonthlyIncome'].min())  # 1051
print(test['MonthlyIncome'].max())  # 19973

For the MonthlyIncome bins to be identical in train and test, the two datasets must share the same maximum and minimum MonthlyIncome. Equal-width binning is used here.

In test, the minimum MonthlyIncome is larger than in train and the maximum is smaller, so train's minimum and maximum have to be inserted into test by hand before binning. Only then do the MonthlyIncome bins in test match those in train. The actual steps come later, in section 3.2.

# Use pandas cut to split into 10 bins
train['MonthlyIncome'] = pd.cut(train['MonthlyIncome'], bins=10)
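
An alternative to inserting the extreme values by hand is to let `pd.cut` return the bin edges it computed on train and reuse them for test. This is a sketch of that alternative, not the method used in this write-up; the short series below are invented and only borrow the minimum and maximum values quoted above.

```python
import pandas as pd

# Hypothetical stand-ins for train['MonthlyIncome'] and test['MonthlyIncome']
train_income = pd.Series([1009, 4200, 8800, 19999])
test_income = pd.Series([1051, 5000, 19973])

# retbins=True also returns the bin edges computed from the training data
train_binned, edges = pd.cut(train_income, bins=10, retbins=True)

# Reuse exactly those edges for the test data
test_binned = pd.cut(test_income, bins=edges)

print(edges)
print(test_binned)
```

With explicit edges, any test value outside the training range would become NaN, so it is worth checking `test_binned.isna().sum()` before encoding.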

# Collect the names of the fields with dtype 'object'; these are the ones that get one-hot encoded
col_object = []
for col in train.columns[1:]:
    if train[col].dtype == 'object':
        col_object.append(col)
col_object

One-hot encode the train dataset

train_encode = pd.get_dummies(train)

Save the datasets for later use

train.to_csv('trainwithoutencode.csv')
train_encode.to_csv('train.csv')

2.3 Handling feature collinearity

# Correlation is only defined for numeric columns, so select those first
corr = train.select_dtypes(include='number').corr()

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.show()

Feature collinearity

'TotalWorkingYears' & 'JobLevel' and 'YearsAtCompany' & 'YearsWithCurrManager' are collinear, so it is enough to drop one feature from each pair.

train_encode.drop(['TotalWorkingYears', 'YearsWithCurrManager'], axis = 1, inplace = True)
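
Collinear pairs can also be listed programmatically instead of being read off the heatmap. The sketch below uses synthetic data in which `JobLevel` is deliberately constructed from `TotalWorkingYears`; the 0.8 threshold is an assumption, not a value from this write-up.

```python
import numpy as np
import pandas as pd

# Synthetic data: JobLevel is derived from TotalWorkingYears on purpose
rng = np.random.default_rng(0)
years = rng.integers(0, 30, size=200)
toy = pd.DataFrame({
    "TotalWorkingYears": years,
    "JobLevel": years // 6 + 1,
    "DistanceFromHome": rng.integers(1, 30, size=200),  # unrelated feature
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> |correlation| and filter
pairs = upper.stack()
high = pairs[pairs > 0.8]  # assumed threshold

print(high)
```

Which feature of a pair to drop remains a judgment call; the list only narrows down the candidates.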

3. Modelling and prediction

3.1 Modelling

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = train_encode.iloc[:, 1:]
y = train_encode.iloc[:, 0]

# Split into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)  # 0.8886363636363637

Because the split is random, the score on the training set fluctuates between roughly 0.88 and 0.90.
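
The run-to-run fluctuation comes from the random split. A minimal sketch of making the split repeatable, and of keeping the share of leavers equal in both parts, is shown below on synthetic data; the seed value 42 and the arrays `X_demo` and `y_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1100 rows with roughly 16% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1100, 5))
y_demo = (rng.random(1100) < 0.16).astype(int)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Fixing the seed makes a single score reproducible, but it does not say how much the score depends on the split; cross-validation is the usual way to measure that.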

pred = lr.predict(X_test)
np.mean(pred == y_test)  # 0.8863636363636364

The score on the test set is close to the score on the training set. Next, look at the confusion matrix of the predictions.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

# Confusion matrix over the whole train dataset
y_pred = lr.predict(X)
confmat = confusion_matrix(y_true=y, y_pred=y_pred)  # compute the confusion matrix
fig, ax = plt.subplots(figsize=(2.5, 2.5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
# Write each count into its cell
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()

# Precision, recall, F1
print('precision:%.3f' % precision_score(y_true=y, y_pred=y_pred))
print('recall:%.3f' % recall_score(y_true=y, y_pred=y_pred))
print('F1:%.3f' % f1_score(y_true=y, y_pred=y_pred))

Confusion matrix
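
How the three scores follow from the four cells of the confusion matrix can be worked through on a tiny hand-made example. The labels below are invented and unrelated to the competition results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Invented labels: 4 true leavers, 6 stayers
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of those predicted to leave, how many did
recall = tp / (tp + fn)     # of those who left, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print('precision:%.3f' % precision)  # precision:0.667
print('recall:%.3f' % recall)        # recall:0.500
print('F1:%.3f' % f1)                # F1:0.571
```

With about 16% leavers, a model can reach high accuracy while still missing many of them, which is why recall is worth reading alongside the score.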

Precision and recall both look quite satisfactory. I then tried tuning the parameters; lr has few tunable parameters, and after tuning the model's accuracy did not improve much.
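
One common way to run such a search is a cross-validated grid over the regularization strength `C`. The sketch below is not the tuning actually performed here: the data is synthetic, and the grid values, the `f1` scoring and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the encoded training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=0
)

# Candidate values for the inverse regularization strength (assumed grid)
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

If the best score barely moves across the grid, as reported above, the remaining gains are more likely to come from the features than from the parameters.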

3.2 Prediction

# Process the test dataset
test.drop(['EmployeeNumber', 'Over18', 'StandardHours'], axis = 1, inplace = True)
# Prepend values equal to the min and max of MonthlyIncome in train; they are removed again after binning
test_MonthlyIncome = pd.concat((pd.Series([1009, 19999]), test['MonthlyIncome']))
test['MonthlyIncome'] = pd.cut(test_MonthlyIncome, bins=10)[2:]  # bin, then drop the two inserted values
test_encode = pd.get_dummies(test)
test_encode.drop(['TotalWorkingYears', 'YearsWithCurrManager'], axis = 1, inplace = True)

# Write out the result
sample = pd.DataFrame(lr.predict(test_encode))
sample.to_csv('sample.csv')
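
Prediction only works if the encoded test frame has the same columns, in the same order, as the frame the model was fitted on. If a category happens to be absent from test, `get_dummies` produces fewer columns. A hedged sketch of guarding against that with `reindex`, on invented frames:

```python
import pandas as pd

# Hypothetical training columns after one-hot encoding
train_raw = pd.DataFrame({
    "Age": [30, 41, 25],
    "Department": ["Sales", "Human Resources", "Research & Development"],
})
train_cols = pd.get_dummies(train_raw).columns

# Hypothetical test data in which two departments never occur
test_raw = pd.DataFrame({"Age": [35, 28], "Department": ["Sales", "Sales"]})
test_enc = pd.get_dummies(test_raw)

# Add the missing dummy columns as 0 and put everything in training order
test_aligned = test_enc.reindex(columns=train_cols, fill_value=0)

print(list(test_aligned.columns))
```

Columns that exist only in test are dropped by the same call, since the model has no coefficient for them.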

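Reformat sample as required and upload it.

Ranking

After uploading, the result ranked 60th, in the top 5%, which is a decent result.

4. Reflections

Logistic regression leaves little room for parameter tuning, so the effort should go into feature engineering. Besides one-hot encoding, it is worth trying normalization, standardization and so on, and using cross-validation to check the model's stability.
Random forests, GBDT and other methods could also be tried.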
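
These suggestions can be sketched together: standardize the features inside a pipeline, then compare logistic regression with a random forest under 5-fold cross-validation. The data is synthetic and every setting below (fold count, `n_estimators`, `max_iter`) is an assumption rather than a tuned choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the encoded training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=0
)

models = {
    # Scaling lives inside the pipeline so it is refitted on each training fold
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# One accuracy score per fold for each model
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(name, fold_scores.mean().round(3), fold_scores.std().round(3))
```

The spread across folds is the stability measure mentioned above: a small standard deviation suggests the score does not depend much on which rows land in the held-out part.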
