Predicting Employee Attrition: A Survival Analysis

Author

Hansel Samosir

Executive Summary

Employee attrition is rarely driven by a single factor. This analysis decomposes the risk into two distinct components: Role-Based Risk (Department) and Workload-Based Risk (Overtime).

Key Findings:

  1. The “Sales Penalty” is Real: Even after controlling for the other variables, Sales employees face a statistically significantly higher risk of attrition than R&D employees. This points to stressors intrinsic to the Sales department.
  2. Overtime is the Multiplier: While the Sales role is risky, Overtime is the accelerant. Employees who work overtime have roughly 3.2 times the hazard of leaving compared with those who do not.
  3. Strategic Implication: We should investigate what drives the attrition risk in the Sales department and develop countermeasures for the overtime problem.

1. Loading Packages and Data Preparation

We will use several packages for the analysis.

Code
library(tidyverse)  # loads dplyr, ggplot2, and readr
library(survival)   # Surv(), survfit(), coxph(), cox.zph()
library(survminer)  # ggsurvplot(), ggforest()

# Load the data
df <- read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Colour palette
my_colour <- c("#FC4E07", "#2E9FDF", "#E7B800")

# Code the event indicator (1 = left, 0 = stayed) and
# convert the categorical predictors to factors
df <- df %>%
  mutate(
    Status = ifelse(Attrition == "Yes", 1, 0),
    Department = as.factor(Department),
    OverTime = as.factor(OverTime),
    EnvironmentSatisfaction = as.factor(EnvironmentSatisfaction),
    JobSatisfaction = as.factor(JobSatisfaction),
    JobLevel = as.factor(JobLevel),
    WorkLifeBalance = as.factor(WorkLifeBalance)
  ) %>%
  as.data.frame()

2. Exploratory Data Analysis

Let’s start with a broad look at the data.

Code
# Overall proportion of employees who left
mean(df$Status)
[1] 0.1612245

The overall attrition rate is 16.1%. Now let’s look at the proportions by department.

Code
ggplot(df, aes(x = Department, fill = Attrition)) + 
  geom_bar(position = "fill") + 
  scale_fill_manual(values = my_colour) + 
  labs(y = "Proportion", title = "Attrition by Department")

Figure 1: Employee Attrition by Department

The attrition rate differs only slightly between departments. Let’s see whether department helps predict attrition.
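The proportions in Figure 1 can also be read off as numbers. Because Status is coded 0/1, the mean of Status within each group is that group's attrition rate. The sketch below uses a small hypothetical data frame (`toy`, with made-up values) so that it runs on its own; on the real data the same call would be `tapply(df$Status, df$Department, mean)`.

```r
# Hypothetical toy data standing in for df; the values are illustrative only.
toy <- data.frame(
  Department = c("Sales", "Sales", "Sales", "Sales",
                 "R&D", "R&D", "R&D", "R&D", "R&D",
                 "HR"),
  Status     = c(1, 1, 0, 0,
                 1, 0, 0, 0, 0,
                 0)
)

# With a 0/1 event indicator, the group mean equals the attrition rate.
rate_by_dept <- tapply(toy$Status, toy$Department, mean)
print(round(rate_by_dept, 2))
```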

3. Survival Analysis: When do employees leave?

When we look at attrition purely by Department, the data seems to support the common belief that Sales is a high-turnover environment.

Code
fit <- survfit(Surv(YearsAtCompany, Status) ~ Department, data = df)
ggsurvplot(
  fit,
  data = df,
  censor = FALSE,
  size = 1.5,
  palette = my_colour,
  pval = TRUE,
  risk.table = TRUE,
  risk.table.height = 0.25,
  legend.title = "Department",
  legend.labs = c("HR", "R&D", "Sales"),
  ggtheme = theme_minimal(),
  xlab = "Time in Years",
  ylab = "Retention Probability",
  title = "Kaplan-Meier Curve: Retention by Department"
)

Figure 2: Employee Retention Probability by Department

The yellow line (Sales) drops noticeably faster than the lines for the other departments, and the log-rank test indicates that the curves differ significantly (p = 0.025).
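The p-value that `ggsurvplot()` prints with `pval = TRUE` comes, by default, from a log-rank test. The sketch below shows how that test could be run directly with `survdiff()`. It uses simulated data (the `toy` data frame and the exit rates are assumptions made up for illustration, not the HR dataset), so its numbers will not match the p-value reported for Figure 2.

```r
library(survival)

# Simulated stand-in for df (hypothetical values, not the HR dataset).
set.seed(42)
n <- 300
toy <- data.frame(
  Department = factor(sample(c("HR", "R&D", "Sales"), n, replace = TRUE))
)

# Give Sales a higher exit rate purely for illustration.
rate <- ifelse(toy$Department == "Sales", 0.15, 0.08)
exit_time <- rexp(n, rate)
toy$YearsAtCompany <- pmin(exit_time, 10)   # follow-up capped at 10 years
toy$Status <- as.integer(exit_time <= 10)   # 1 = left, 0 = censored

# Log-rank test: compares observed and expected exits across the groups.
lr <- survdiff(Surv(YearsAtCompany, Status) ~ Department, data = toy)
df_lr <- length(lr$n) - 1                   # number of groups minus one
p_value <- pchisq(lr$chisq, df = df_lr, lower.tail = FALSE)

print(lr)
print(p_value)
```

On the real data, replacing `toy` with `df` should reproduce the p-value shown on the plot.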

4. Changing the Lens: Is Something Else at Play?

Before accepting the conclusion that Sales is “toxic,” I investigated workload distribution. Sales may not be inherently bad; the risk could instead stem from the well-known pressure on Sales staff to meet quotas.

Code
fit_ot <- survfit(Surv(YearsAtCompany, Status) ~ OverTime, data = df)

ggsurvplot(
  fit_ot,
  data = df,
  size = 1.5,
  palette = my_colour,
  conf.int = FALSE,
  pval = TRUE,
  risk.table = TRUE,
  risk.table.height = 0.25,
  legend.title = "Working Overtime?",
  legend.labs = c("No", "Yes"),
  ggtheme = theme_minimal(),
  title = "The Workload Lens: Overtime vs. Standard Hours"
)

Figure 3: Retention by Overtime (The Real Killer)

The gap between the curves here is much wider than in the department plot.
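One way to put a number on such a gap is to read the estimated retention probability for each group at a fixed point in time. The sketch below does this at year 5 with `summary()` on a `survfit` object. The data are simulated (the `toy` data frame and its exit rates are hypothetical), so the printed values only illustrate the mechanics.

```r
library(survival)

# Simulated stand-in for df (hypothetical values, not the HR dataset).
set.seed(1)
n <- 200
toy <- data.frame(OverTime = factor(rep(c("No", "Yes"), each = n / 2)))

# Assume a higher exit rate for overtime workers, for illustration only.
rate <- ifelse(toy$OverTime == "Yes", 0.20, 0.07)
exit_time <- rexp(n, rate)
toy$YearsAtCompany <- pmin(exit_time, 10)   # follow-up capped at 10 years
toy$Status <- as.integer(exit_time <= 10)   # 1 = left, 0 = censored

fit_toy <- survfit(Surv(YearsAtCompany, Status) ~ OverTime, data = toy)

# Kaplan-Meier retention estimate for each group at year 5.
at_5 <- summary(fit_toy, times = 5)
gap <- data.frame(group = at_5$strata, retention = round(at_5$surv, 3))
print(gap)
```

On the real data, `summary(fit_ot, times = 5)` would give the corresponding figures for the curves in Figure 3.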

5. The Multivariate Verdict

Now we fit a multivariate Cox proportional hazards model to account for several variables at once: Department, OverTime, EnvironmentSatisfaction, JobSatisfaction, JobLevel, and WorkLifeBalance.

Code
cox_model <- coxph(
  Surv(YearsAtCompany, Status) ~ Department + OverTime + EnvironmentSatisfaction +
    JobSatisfaction + JobLevel + WorkLifeBalance,
  data = df
)

# Visualize
ggforest(cox_model, data = df)

Figure 4: Hazard Ratios (Controlling for Workload)
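The hazard ratios in a forest plot are exponentiated Cox coefficients, and a hazard ratio converts to a percentage change in hazard as (HR - 1) × 100. The sketch below shows both steps. The helper `hr_to_pct()` and the simulated `toy` data are assumptions introduced here for illustration; the value 0.145 is simply the hazard ratio implied by an 85.5% lower hazard.

```r
library(survival)

# Hypothetical helper: percentage change in hazard implied by a hazard ratio.
hr_to_pct <- function(hr) (hr - 1) * 100

hr_to_pct(3.2)    # a hazard ratio of 3.2 is about 220% higher hazard
hr_to_pct(0.145)  # a hazard ratio of 0.145 is an 85.5% lower hazard

# Simulated stand-in for df (hypothetical values, not the HR dataset).
set.seed(7)
n <- 400
toy <- data.frame(OverTime = factor(rep(c("No", "Yes"), each = n / 2)))
rate <- ifelse(toy$OverTime == "Yes", 0.24, 0.08)  # assumed true ratio of 3
exit_time <- rexp(n, rate)
toy$YearsAtCompany <- pmin(exit_time, 10)   # follow-up capped at 10 years
toy$Status <- as.integer(exit_time <= 10)   # 1 = left, 0 = censored

fit <- coxph(Surv(YearsAtCompany, Status) ~ OverTime, data = toy)

# Exponentiate coefficients and confidence limits to get hazard ratios.
hr <- exp(coef(fit))
ci <- exp(confint(fit))
print(cbind(HR = hr, ci))
```

Applied to the fitted model, `exp(coef(cox_model))` and `exp(confint(cox_model))` would return the numbers behind Figure 4.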

The model confirms that being in the Sales department significantly increases the risk of leaving the company (p = 0.026). But there are other, larger predictors of risk. For example, employees who work overtime have roughly 3.2 times the hazard of leaving compared with those who do not (p < 0.001), and employees at Job Level 2 show an 85.5% lower hazard than those at Level 1. We still need to be cautious, because the Cox model assumes proportional hazards, meaning that each hazard ratio stays constant over time. Let’s check that assumption.

Code
cox.zph(cox_model)
                         chisq df    p
Department              0.0462  2 0.98
OverTime                0.4674  1 0.49
EnvironmentSatisfaction 1.0407  3 0.79
JobSatisfaction         3.6899  3 0.30
JobLevel                2.2209  4 0.70
WorkLifeBalance         3.5271  3 0.32
GLOBAL                  9.8278 16 0.88

As we can see, every test returns a p-value above 0.05, both for the individual covariates and globally. There is therefore no evidence that the proportional hazards assumption is violated, and the results can be interpreted with reasonable confidence.
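With many covariates it can be handy to flag violations programmatically rather than reading the table by eye. The sketch below pulls the p-values out of the `cox.zph()` result and lists any term below 0.05. It runs on simulated data (the `toy` data frame and its rates are hypothetical), which satisfies proportional hazards by construction, so the flagged list will usually be empty.

```r
library(survival)

# Simulated stand-in for df (hypothetical values, not the HR dataset).
set.seed(123)
n <- 400
toy <- data.frame(
  OverTime   = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Department = factor(sample(c("HR", "R&D", "Sales"), n, replace = TRUE))
)

# Constant multiplicative effects, so proportional hazards holds by design.
rate <- 0.08 *
  ifelse(toy$OverTime == "Yes", 3, 1) *
  ifelse(toy$Department == "Sales", 1.5, 1)
exit_time <- rexp(n, rate)
toy$YearsAtCompany <- pmin(exit_time, 10)   # follow-up capped at 10 years
toy$Status <- as.integer(exit_time <= 10)   # 1 = left, 0 = censored

fit <- coxph(Surv(YearsAtCompany, Status) ~ OverTime + Department, data = toy)
zph <- cox.zph(fit)

# One p-value per model term, plus the GLOBAL test.
p_vals <- zph$table[, "p"]
flagged <- names(p_vals)[p_vals < 0.05]
print(round(p_vals, 3))
print(flagged)
```

For the model in this analysis, the same extraction on `cox.zph(cox_model)` would return an empty list, consistent with the table above.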