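Analyzing and Predicting Employee Attrition with R

Author: Anonymous 2018-09-26 19:51:07

After a stretch of grinding away in the lab, we continue our Kaggle data analysis journey. This time I picked one of the more popular human resources datasets on Kaggle, with a focus on analyzing and predicting employee attrition. As before, we go through data loading, preprocessing, EDA, and machine learning modeling, and finish by predicting attrition with random forest, a popular ensemble learning algorithm.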

Data loading

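# set the working directory to wherever HR_comma_sep.csv lives (the author's local path)
setwd("E:/kaggle/human resource") 
library(data.table)    # fread
library(plotly) 
library(corrplot)      # correlation plot
library(randomForest)  # random forest model
library(pROC)          # ROC curve and AUC
library(tidyverse)     # dplyr, ggplot2, tibble
library(caret)         # confusionMatrix
# as.tibble() is deprecated; as_tibble() is the current name
hr<-as_tibble(fread("HR_comma_sep.csv")) 
glimpse(hr) 
# count missing values per column
sapply(hr,function(x){sum(is.na(x))}) 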
———————————————————————————————————————————————————————————————————————————————————— 
Observations: 14,999 
Variables: 10 
$ satisfaction_level    <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42, 0.45, 0.11, 0.84, 0.41, 0.36, 0.38, 0.45, 0.78, 0.45, 0.76, 0.11, 0.3... 
$ last_evaluation       <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53, 0.54, 0.81, 0.92, 0.55, 0.56, 0.54, 0.47, 0.99, 0.51, 0.89, 0.83, 0.5... 
$ number_project        <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2, 2, 2, 2, 4, 2, 5, 6, 2, 6, 2, 2, 5, 4, 2, 2, 2, 6, 2, 2, 2, 4, 6, 2, 2, 6, 2, 5, 2, 2, ... 
$ average_montly_hours  <int> 157, 262, 272, 223, 159, 153, 247, 259, 224, 142, 135, 305, 234, 148, 137, 143, 160, 255, 160, 262, 282, 147, 304, 139, 158, 242,... 
$ time_spend_company    <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3, 3, 3, 3, 6, 3, 5, 4, 3, 4, 3, 3, 5, 5, 3, 3, 3, 4, 3, 3, 3, 6, 4, 3, 3, 4, 3, 5, 3, 3, ... 
$ Work_accident         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 
$ left                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 
$ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 
$ sales                 <chr> "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sa... 
$ salary                <chr> "low", "medium", "medium", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low... 
 
 satisfaction_level       last_evaluation        number_project  average_montly_hours    time_spend_company         Work_accident                  left  
                    0                     0                     0                     0                     0                     0                     0  
promotion_last_5years                 sales                salary  
                    0                     0                     0

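The dataset has 10 variables and 14,999 observations. The variables are:

satisfaction_level--satisfaction score, last_evaluation--most recent evaluation, number_project--number of projects taken part in, average_montly_hours--average monthly working hours (the misspelling is in the dataset itself), time_spend_company--years spent at the company, Work_accident--whether the employee has had a work accident (0/1), left--whether the employee left, promotion_last_5years--whether the employee was promoted in the last five years, sales--department, salary--salary level.

A quick check also shows no missing values, so we can go straight to the analysis.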
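
The missing-value check can also be sketched in base R on a small made-up data frame. The `toy` data below is hypothetical and only stands in for `hr`; `colSums(is.na(...))` gives the same per-column counts as the `sapply` call, and `complete.cases` additionally shows which rows are affected.

```r
# Toy data standing in for hr; the values are invented for illustration.
toy <- data.frame(
  satisfaction_level = c(0.38, NA, 0.11, 0.72),
  salary = c("low", "medium", NA, "low"),
  left = c(1L, 1L, 0L, 0L)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Indices of rows that contain at least one missing value.
incomplete_rows <- which(!complete.cases(toy))
print(incomplete_rows)
```

On the real data every count is zero, so no imputation or row dropping is needed before the analysis.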

Data preprocessing

Judging by the values each feature takes, several of them can be converted to factors, which makes later comparisons between groups easier.

# categorical and 0/1 columns become factors
hr$sales<-as.factor(hr$sales) 
hr$salary<-as.factor(hr$salary) 
hr$left<-as.factor(hr$left) 
hr$Work_accident<-as.factor(hr$Work_accident) 
# dplyr::recode relabels the levels: 1 -> "yes" (left), 0 -> "no" (stayed)
hr$left<-recode(hr$left,'1'="yes",'0'="no") 
hr$promotion_last_5years<-as.factor(hr$promotion_last_5years)
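
As a small sketch of the same idea in base R, `factor()` can set the labels and the level order in one step. The `toy` data is hypothetical, and the assumption that salary takes exactly the values "low", "medium" and "high" goes beyond what the printed output shows (only "low" and "medium" are visible there).

```r
# Toy data standing in for hr; values are invented for illustration.
toy <- data.frame(
  left = c(1L, 0L, 0L, 1L),
  salary = c("low", "high", "medium", "low"),
  stringsAsFactors = FALSE
)

# Relabel 0/1 as "no"/"yes" while converting to a factor.
toy$left <- factor(toy$left, levels = c(0, 1), labels = c("no", "yes"))

# Assumed level set; an explicit order keeps plots and tables from
# falling back to alphabetical order (high, low, medium).
toy$salary <- factor(toy$salary, levels = c("low", "medium", "high"))

print(levels(toy$left))
print(levels(toy$salary))
print(table(toy$salary, toy$left))
```

Setting the order explicitly is optional, but it makes bar charts by salary read from low to high.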

Most of the variables are numeric, so we use the correlation matrix to gauge how strongly they are related to each other:

# keep only the columns that can be treated as numeric
cor.hr<-hr %>% select(-sales,-salary) 
cor.hr$Work_accident<-as.numeric(as.character(cor.hr$Work_accident)) 
cor.hr$promotion_last_5years<-as.numeric(as.character(cor.hr$promotion_last_5years)) 
# left was recoded to "yes"/"no", so as.numeric(as.character(...)) would give NA; map it to 1/0 instead
cor.hr$left<-as.numeric(cor.hr$left=="yes") 
corrplot(corr = cor(cor.hr),type = "lower",method = "square",title="Variable correlations",order="AOE")

At a glance, satisfaction level shows the clearest association with whether an employee left.
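
One detail worth checking when a yes/no factor goes into `cor()` is how it is turned into numbers. The sketch below uses invented values (`toy` is hypothetical) to show that converting the labels through `as.character` yields only `NA`, while a logical comparison gives a clean 0/1 column whose Pearson correlation with a numeric variable is the point-biserial correlation.

```r
# Toy data standing in for hr; values are invented for illustration.
toy <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.92, 0.41),
  left = factor(c("yes", "no", "yes", "no", "no", "yes"), levels = c("no", "yes"))
)

# "yes"/"no" are not numbers, so this conversion produces NA for every row.
bad <- suppressWarnings(as.numeric(as.character(toy$left)))
print(bad)

# A logical test maps "yes" to 1 and "no" to 0.
left_num <- as.numeric(toy$left == "yes")

# Pearson correlation between a numeric column and the 0/1 indicator.
r <- cor(toy$satisfaction_level, left_num)
print(r)
```

In this toy sample the leavers have the lower scores, so the correlation comes out negative.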

EDA

# headcount per department as a polar bar chart
ggplot(hr,aes(x=sales,fill=sales))+geom_bar(width = 1)+coord_polar(theta = "x")+ggtitle("Headcount by department") 
# satisfaction by department; the white point marks the mean (fun replaces fun.y, deprecated since ggplot2 3.3.0)
ggplot(hr,aes(x=sales,y=satisfaction_level,fill=sales))+geom_boxplot()+ggtitle("Satisfaction by department")+stat_summary(fun = mean,size=3,color='white',geom = "point")+ 
  theme(legend.position = "none") 
# the same kind of comparison, split by whether the employee left
ggplot(hr,aes(x=sales,y=satisfaction_level,fill=left))+geom_boxplot()+ggtitle("Satisfaction by department") 
ggplot(hr,aes(x=sales,y=average_montly_hours,fill=left))+geom_boxplot()+ggtitle("Working hours by department") 
ggplot(hr,aes(x=sales,y=number_project,fill=left))+geom_boxplot()+ggtitle("Number of projects by department")

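First, the headcount by department. Sales really has a lot of people. Could some of them be my fellow life-science graduates? (Hahaha, just kidding, though honestly biology is exhausting work.) Sales, support, and technical hold the top three spots by headcount.

The satisfaction distributions are broadly similar across departments, though the folks in accounting seem to score a bit lower. For the other departments, means and medians show no obvious differences. Next, look at satisfaction by department, split by whether the employee left:

The result is almost exactly what you would expect: satisfaction separates those who left from those who stayed very clearly, and department has almost nothing to do with it.

What about average working hours by department? Judging from the plot, employees who stayed have stable working hours, while those who left are polarized. Being too busy and being too idle both look like bad signs, which is a real challenge for HR.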
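
The "too busy or too idle" pattern can be checked numerically by binning the hours and computing the attrition rate per bin. This is only a sketch: the `toy` values are invented, and the cut points 160 and 240 are hypothetical thresholds, not ones taken from the dataset.

```r
# Toy data standing in for hr; values are invented for illustration.
toy <- data.frame(
  average_montly_hours = c(130, 140, 150, 180, 200, 210, 260, 280, 300, 190),
  left = c("yes", "yes", "no", "no", "no", "no", "yes", "yes", "yes", "no")
)

# Bin the hours into three bands; the break points are assumed.
toy$hours_band <- cut(
  toy$average_montly_hours,
  breaks = c(0, 160, 240, Inf),
  labels = c("low", "medium", "high")
)

# Share of employees who left within each band.
rate_by_band <- tapply(toy$left == "yes", toy$hours_band, mean)
print(round(rate_by_band, 2))
```

A U-shaped rate across the bands (high at both ends, low in the middle) is what the polarized boxplots would correspond to.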

Next we look in turn at how individual features relate to attrition:

# density of satisfaction for those who left vs. those who stayed
ggplot(hr,aes(x=satisfaction_level,color=left))+geom_line(stat = "density")+ggtitle("Satisfaction vs. attrition") 
# geom_bar() is the idiomatic way to count rows per category
ggplot(hr,aes(x=salary,fill=left))+geom_bar()+ggtitle("Salary vs. attrition") 
ggplot(hr,aes(x=promotion_last_5years,fill=left))+geom_bar()+ggtitle("Promotion in the last 5 years vs. attrition") 
ggplot(hr,aes(x=last_evaluation,color=left))+geom_point(stat = "count")+ggtitle("Last evaluation vs. attrition") 
# work accidents by department
ggplot(hr,aes(x=sales,fill=Work_accident))+geom_bar()+coord_flip()+ 
  theme(axis.text.x = element_blank(),axis.title.x = element_blank(),axis.title.y = element_blank())+scale_fill_discrete(labels=c("no accident","at least once"))

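On satisfaction, employees who stayed give consistently stable scores, while the scores of those who left are much harder to pin down.

On salary, it is the same old saying: "money makes everything easier."

On promotion: no promotion, and the unhappy employee walks.

The last evaluation looks much like the earlier density plot. HR should also keep an eye on employees with a very high last evaluation: most of them are not planning to leave, but some may offer a "white lie" to spare the old employer's feelings.

On work accidents, mistakes are inevitable, and the number of people with an accident is roughly proportional to each department's headcount, so this is not an issue for attrition.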
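
Stacked count bars make it hard to compare groups of very different size, so a table of within-group shares can help. The sketch below uses invented values (`toy` is hypothetical) and `prop.table` with `margin = 1`, which normalizes each row so that the shares within one salary level sum to 1.

```r
# Toy data standing in for hr; values are invented for illustration.
toy <- data.frame(
  salary = c("low", "low", "low", "low", "medium", "medium", "medium", "high", "high"),
  left = c("yes", "yes", "no", "no", "yes", "no", "no", "no", "no")
)

# Counts of stayed/left per salary level.
counts <- table(toy$salary, toy$left)

# Row-wise shares: each salary level sums to 1.
share <- prop.table(counts, margin = 1)
print(round(share, 2))
```

In ggplot2 the same comparison can be drawn by passing `position = "fill"` to `geom_bar()`, which rescales every bar to the same height.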

Model building and evaluation

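# fix the seed so the split is reproducible (arbitrary value; the output shown below came from an unseeded run)
set.seed(2018) 
# random 70/30 train/test split
index<-sample(2,nrow(hr),replace = TRUE,prob = c(0.7,0.3)) 
train<-hr[index==1,];test<-hr[index==2,] 
model<-randomForest(left~.,data = train) 
predict.hr<-predict(model,test) 
# caret's signature is confusionMatrix(data, reference); the actual labels are passed first here,
# so the rows labeled "Prediction" in the output hold the actual labels
confusionMatrix(test$left,predict.hr) 
 
# class probabilities; column 2 is P(left = "yes")
prob.hr<-predict(model,test,type="prob") 
roc.hr<-roc(test$left,prob.hr[,2],levels=levels(test$left)) 
plot(roc.hr,type="S",col="red",main = paste("AUC=",round(as.numeric(roc.hr$auc),4),sep = ""))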
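
To make the AUC in the plot title less of a black box, here is a minimal base R sketch of what it measures: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The function `auc_by_rank` and the scores are hypothetical, written for illustration only; this is the rank-sum (Mann-Whitney) formulation, not pROC's internal code.

```r
# AUC from ranks: (sum of positive ranks - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg).
auc_by_rank <- function(score, is_positive) {
  r <- rank(score)  # average ranks, so tied scores count as half a win
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented predicted probabilities of leaving and the matching true labels.
score <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
is_positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

auc_value <- auc_by_rank(score, is_positive)
print(auc_value)
```

Of the 9 positive/negative pairs here, 8 are ordered correctly (only 0.35 versus 0.6 is not), so the value is 8/9.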

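Based on the feature analysis above, I did not see any particularly promising features to engineer, so I simply fed everything into the algorithm. The resulting confusion matrix looks excellent: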

Confusion Matrix and Statistics 
 
          Reference 
Prediction   no  yes 
       no  3429    5 
       yes   28 1010 
                                           
               Accuracy : 0.9926           
                 95% CI : (0.9897, 0.9949) 
    No Information Rate : 0.773            
    P-Value [Acc > NIR] : < 2.2e-16        
                                           
                  Kappa : 0.9791           
 Mcnemar's Test P-Value : 0.0001283        
                                           
            Sensitivity : 0.9919           
            Specificity : 0.9951           
         Pos Pred Value : 0.9985           
         Neg Pred Value : 0.9730           
             Prevalence : 0.7730           
         Detection Rate : 0.7668           
   Detection Prevalence : 0.7679           
      Balanced Accuracy : 0.9935           
                                           
       'Positive' Class : no

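Accuracy is 0.9926. Because the call passes test$left first, the rows labeled "Prediction" in the table are the actual labels and the columns are the model's predictions. For the "yes" (left) class this gives precision = 1010/1015 ≈ 0.9951 and recall = 1010/1038 ≈ 0.9730. These are remarkably strong numbers: the Kaggle dataset is evidently already very clean, and random forest delivers as usual. Last comes the ROC curve, drawn by the plot call above.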
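
The headline metrics can be recomputed by hand from the four counts in the confusion matrix. The sketch below copies those counts and treats "yes" (left) as the positive class; it assumes the rows of the printed table are the actual labels, which follows from test$left being the first argument to `confusionMatrix`. The variable names are hypothetical.

```r
# Counts copied from the confusion matrix output; rows = actual, columns = predicted.
cm <- matrix(
  c(3429, 28, 5, 1010),
  nrow = 2,
  dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes"))
)

# Share of all test rows that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Of the employees predicted to leave, the share who actually left.
precision_yes <- cm["yes", "yes"] / sum(cm[, "yes"])

# Of the employees who actually left, the share the model caught.
recall_yes <- cm["yes", "yes"] / sum(cm["yes", ])

print(round(c(accuracy = accuracy, precision = precision_yes, recall = recall_yes), 4))
```

Read this way, the model misses 28 of the 1,038 employees who left and raises 5 false alarms among 3,434 who stayed.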

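Closing remarks

There was not much technique in this analysis, and my ggplot2 skills have hit a plateau that will take sustained work to get past. Knowing only how to call packages without understanding the theory behind the algorithms will not do either, so lately I have been slowly picking probability, linear algebra, and statistics back up. The hands-on R data analysis will not stop, of course. My English is decent enough to chat a little with the foreign professors in the lab, which also counts as real progress.

The road is long and hard; let's keep at it together~

Editor: 未麗燕. Source: 經管人學數據分析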