Employee attrition is rarely driven by a single factor. This analysis decomposes the risk into two distinct components: Role-Based Risk (Department) and Workload-Based Risk (Overtime).
**Key Findings:**

1. **The “Sales Penalty” is Real:** Even after controlling for the other variables, Sales employees face a statistically significant higher risk of attrition than employees in the model's reference department (Human Resources, the first factor level under the coding used in the analysis). This suggests stressors intrinsic to the Sales department.
2. **Overtime is the Multiplier:** While the Sales role is risky, overtime is the accelerant. Employees working overtime have roughly 3.2x the hazard of leaving, meaning they are about 3.2 times as likely to quit at any given point in their tenure.
3. **Strategic Implication:** We should investigate what is driving the attrition risk in the Sales department and develop countermeasures for the overtime problem.
## 1. Loading Packages and Data Preparation

We will use several packages for the analysis.

```r
library(tidyverse)
library(survival)
library(survminer)
library(dplyr)  # already attached by tidyverse; kept from the original

# Load the data
df <- read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Colour palette
my_colour <- c("#FC4E07", "#2E9FDF", "#E7B800")

# Code attrition data, 1 for Yes and 0 for No,
# and convert the categorical predictors to factors
df <- df %>%
  mutate(
    Status = ifelse(Attrition == "Yes", 1, 0),
    Department = as.factor(Department),
    OverTime = as.factor(OverTime),
    EnvironmentSatisfaction = as.factor(EnvironmentSatisfaction),
    JobSatisfaction = as.factor(JobSatisfaction),
    JobLevel = as.factor(JobLevel),
    WorkLifeBalance = as.factor(WorkLifeBalance)
  ) %>%
  as.data.frame()
```
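Because `as.factor()` sorts levels alphabetically, the first level of each factor becomes the baseline in any model fitted later. The following is a minimal sketch of how the event coding and the factor levels could be checked. The `toy` data frame is made up for illustration and only stands in for the HR dataset.

```r
# Toy rows standing in for the HR dataset (values are invented for illustration)
toy <- data.frame(
  Attrition  = c("Yes", "No", "No", "Yes", "No"),
  Department = c("Sales", "Research & Development", "Human Resources",
                 "Sales", "Research & Development"),
  stringsAsFactors = FALSE
)

# Same coding as in the preparation step: 1 = left (event), 0 = still employed (censored)
toy$Status     <- ifelse(toy$Attrition == "Yes", 1, 0)
toy$Department <- as.factor(toy$Department)

# Cross-tabulate to confirm every "Yes" maps to 1 and every "No" maps to 0
coding_check <- table(toy$Attrition, toy$Status)
print(coding_check)

# Levels are sorted alphabetically; the first one is the default reference level
dept_levels <- levels(toy$Department)
print(dept_levels)
```

On the real data, the same two checks would be run on `df` instead of `toy`.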
## 2. Exploratory Data Analysis

Let's start with the overall picture.

```r
# Proportion of employees who left, across the whole company
mean(df$Status)
```

```
[1] 0.1612245
```

The overall attrition rate is 16.1%. Now let's look at the proportions per department.
```r
# Stacked bars scaled to 100% so departments of different sizes are comparable
ggplot(df, aes(x = Department, fill = Attrition)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = my_colour) +
  labs(y = "Proportion", title = "Attrition by Department")
```
Figure 1: Employee Attrition by Department
We can see a small difference in attrition rate between departments. Let's see whether department helps predict attrition.
## 3. Survival Analysis: When do employees leave?
When we look at attrition purely by Department, the data seems to support the common belief that Sales is a high-turnover environment.
```r
# Kaplan-Meier estimate of retention, stratified by department
fit <- survfit(Surv(YearsAtCompany, Status) ~ Department, data = df)

ggsurvplot(
  fit,
  data = df,
  censor = FALSE,
  size = 1.5,
  palette = my_colour,
  pval = TRUE,
  risk.table = TRUE,
  risk.table.height = 0.25,
  legend.title = "Department",
  legend.labs = c("HR", "R&D", "Sales"),
  ggtheme = theme_minimal(),
  xlab = "Time in Years",
  ylab = "Retention Probability",
  title = "Kaplan-Meier Curve: Retention by Department"
)
```
Figure 2: Employee Retention Probability by Department
The yellow curve (Sales) drops noticeably faster than those of the other departments, and the log-rank test reported on the plot supports a difference between departments (*p* = 0.025).
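The *p*-value that `ggsurvplot()` prints with `pval = TRUE` comes, by default, from a log-rank test. As a rough sketch of how that test can be run directly, the example below uses `survdiff()` on the `lung` dataset that ships with the `survival` package. That dataset is only a stand-in, with `sex` playing the role of the grouping variable. It is not the HR data.

```r
library(survival)

# Stand-in data: survival::lung (NOT the HR dataset)
logrank <- survdiff(Surv(time, status) ~ sex, data = survival::lung)
print(logrank)

# Under the null hypothesis of identical survival curves, the statistic is
# chi-square distributed with (number of groups - 1) degrees of freedom
n_groups <- length(logrank$n)
p_value  <- pchisq(logrank$chisq, df = n_groups - 1, lower.tail = FALSE)
print(p_value)
```

For the analysis above, the equivalent call would presumably be `survdiff(Surv(YearsAtCompany, Status) ~ Department, data = df)`, which compares all three departments at once.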
## 4. Changing the Lens: Is Something Else at Play?

Before accepting the conclusion that Sales is “toxic,” I investigated workload distribution. It may be that Sales is not inherently worse, but is affected by a known pressure of the department: the need to meet quotas.

```r
# Kaplan-Meier estimate of retention, stratified by overtime status
fit_ot <- survfit(Surv(YearsAtCompany, Status) ~ OverTime, data = df)

ggsurvplot(
  fit_ot,
  data = df,
  size = 1.5,
  palette = my_colour,
  conf.int = FALSE,
  pval = TRUE,
  risk.table = TRUE,
  risk.table.height = 0.25,
  legend.title = "Working Overtime?",
  legend.labs = c("No", "Yes"),
  ggtheme = theme_minimal(),
  title = "The Workload Lens: Overtime vs. Standard Hours"
)
```
Figure 3: Retention by Overtime (The Real Killer)
The gap between the two curves here is much wider than the gap between departments in the previous plot.
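Whether overtime is concentrated in one department can be checked directly with a cross-tabulation. The sketch below uses a handful of invented rows. The `toy` data frame is hypothetical, and on the real data one would pass `df$Department` and `df$OverTime` instead.

```r
# Invented rows for illustration only (not the HR dataset)
toy <- data.frame(
  Department = c("Sales", "Sales", "Sales",
                 "Research & Development", "Research & Development",
                 "Research & Development",
                 "Human Resources", "Human Resources"),
  OverTime   = c("Yes", "Yes", "No",
                 "No", "No", "Yes",
                 "No", "Yes")
)

# Counts of overtime status within each department
counts <- table(toy$Department, toy$OverTime)

# Row-wise shares: each row sums to 1, so the "Yes" column is the
# proportion of that department's employees who work overtime
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

If the "Yes" share were clearly higher for Sales, part of the department gap in Figure 2 could reflect workload rather than the role itself, which is what the multivariate model is meant to separate.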
## 5. The Multivariate Verdict

Now we fit a multivariate Cox proportional hazards model to account for several variables at once: Department, OverTime, EnvironmentSatisfaction, JobSatisfaction, JobLevel, and WorkLifeBalance.

```r
cox_model <- coxph(
  Surv(YearsAtCompany, Status) ~ Department + OverTime +
    EnvironmentSatisfaction + JobSatisfaction + JobLevel + WorkLifeBalance,
  data = df
)

# Visualize the hazard ratios as a forest plot
ggforest(cox_model, data = df)
```
Figure 4: Hazard Ratios (Controlling for Workload)
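The hazard ratios in a forest plot are the exponentiated model coefficients, each read relative to the reference (first) level of its factor. The sketch below shows one way to pull those numbers out of a fitted model, convert them to a percent change, and switch the reference level. It uses the `lung` dataset from the `survival` package as a stand-in, and the names `lung2`, `fit`, and `fit_ref` are illustrative only.

```r
library(survival)

# Stand-in data: survival::lung (NOT the HR dataset); sex is coded 1 = male, 2 = female
lung2 <- survival::lung
lung2$sex <- factor(lung2$sex, levels = c(1, 2), labels = c("Male", "Female"))

fit <- coxph(Surv(time, status) ~ sex + age, data = lung2)

# Hazard ratios are the exponentiated coefficients
hr <- exp(coef(fit))

# 95% confidence intervals on the hazard-ratio scale
hr_ci <- exp(confint(fit))

# Percent change in hazard relative to the reference level.
# For example, HR = 0.145 reads as (0.145 - 1) * 100 = -85.5%, i.e. 85.5% lower.
pct_change <- (hr - 1) * 100

print(round(cbind(HR = hr, hr_ci, pct_change), 3))

# The reference level is the first factor level; relevel() changes it
lung2$sex <- relevel(lung2$sex, ref = "Female")
fit_ref   <- coxph(Surv(time, status) ~ sex + age, data = lung2)
print(exp(coef(fit_ref)))
```

Applied to the model above, `relevel(df$Department, ref = "Research & Development")` before fitting would make R&D the baseline, if that is the comparison of interest. Without it, the first alphabetical level (Human Resources) is the baseline.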
We can now say that being in the Sales department does significantly increase the risk of leaving the company (*p* = 0.026), relative to the model's reference department. But let's shift our focus a bit, because there are other strong predictors of risk. For example, employees who work overtime have roughly 3.2 times the hazard of leaving compared with those who do not (*p* < 0.001). Employees at Job Level 2 show an 85.5% lower hazard than those at Level 1. **We still need to be cautious, because the Cox model assumes proportional hazards, that is, hazard ratios that stay constant over time.** Let's check that assumption with `cox.zph(cox_model)`, which tests it using scaled Schoenfeld residuals.

All of the chi-square tests return *p*-values above 0.05, both for the individual covariates and globally. There is therefore no evidence that the proportional hazards assumption is violated, and the hazard ratios above can be interpreted as holding across the observed tenure range.
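As a sketch of what the proportional hazards check looks like in practice, the example below runs `cox.zph()` on a small model fitted to the `lung` dataset from the `survival` package. The dataset is a stand-in, not the HR data, and the 0.05 cut-off simply mirrors the threshold used in the text.

```r
library(survival)

# Stand-in model on survival::lung (NOT the HR dataset)
fit <- coxph(Surv(time, status) ~ sex + age, data = survival::lung)

# Test of proportional hazards based on scaled Schoenfeld residuals
ph_test <- cox.zph(fit)
print(ph_test)

# The result table has one row per covariate plus a GLOBAL row;
# the "p" column holds the p-values
p_values <- ph_test$table[, "p"]

# Covariates (or the global test) that would be flagged at the 5% level
flagged <- names(p_values)[p_values < 0.05]
print(flagged)

# plot(ph_test) draws the residuals against time, which helps show
# how a coefficient drifts when a term is flagged
```

A large *p*-value here means the test found no evidence against proportional hazards. It does not prove the assumption holds, so a visual check of the residual plots is a useful complement.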
Source document: "Predicting Employee Attrition: A Survival Analysis" by Hansel Samosir.