INTRODUCTION
Employee attrition is an enormous problem for firms, impacting productivity and driving up hiring costs.
What if we could predict which employees are likely to leave?
In this project, I used machine learning to predict employee attrition, applying techniques such as feature engineering, handling imbalanced data with SMOTE, and training a RandomForest model.
DATA PREPARATION
First, I drew an ER diagram to make it easy to visualize the structure of the dataset, including column names and types.
The dataset consists of 4 CSV files containing employee information. I imported all the files into Excel (Data > Get Data from Text/CSV), used the Power Query Editor to set the first row as headers, and used the Merge Queries function to join (inner join) the data together.
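For reference, a rough pandas equivalent of that inner join might look like the sketch below; the file names and the ‘EmployeeID’ join key are placeholders, since the real names live in the Power Query step.
import pandas as pd
from functools import reduce

# Hypothetical file names and join key standing in for the four real CSVs
files = ['employees.csv', 'ratings.csv', 'history.csv', 'survey.csv']
frames = [pd.read_csv(f) for f in files]

# Inner-join all four frames on the shared key, mirroring Merge Queries
df = reduce(lambda left, right: pd.merge(left, right, on='EmployeeID', how='inner'), frames)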
Next, I check for missing values using Pandas.
# Count missing values per column and keep only the columns that have any
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
print(missing_values)
The columns ‘LeavingYear,’ ‘Reason,’ and ‘RelievingStatus’ each have 38,169 missing values, corresponding to employees who haven’t left the company yet. I decided to drop these three columns because the missing data only applies to current employees and doesn’t contribute useful information for predicting attrition.
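In pandas, that drop is a one-liner:
# Remove the three columns that are populated only for employees who already left
df = df.drop(columns=['LeavingYear', 'Reason', 'RelievingStatus'])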
EDA (EXPLORATORY DATA ANALYSIS)
I create box plots for the numerical features and count plots for the categorical features to observe and identify which features should be selected for training the machine learning model, as sketched below.
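A minimal sketch of the two plot types, assuming seaborn; ‘Age’ and ‘OverTime’ stand in for whichever feature is being inspected:
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot: does the numerical feature separate the two attrition groups?
sns.boxplot(data=df, x='Attrition', y='Age')
plt.show()

# Count plot: how is each category distributed across attrition outcomes?
sns.countplot(data=df, x='OverTime', hue='Attrition')
plt.show()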
The logic I used to choose numerical features from the box plots is based on whether a feature can distinguish between attrition (Yes/No). I looked at the median, IQR (interquartile range), and whiskers to see whether there were significant differences between the two groups.
For example, in the box plot of ‘WorkLifeBalance,’ the shapes for both ‘Attrition: Yes’ and ‘Attrition: No’ are quite similar. Since there is little to no difference between the two groups, I won’t include this feature when training the model, as it is unlikely to provide valuable information for predicting attrition.
For categorical features, I used count plots to observe the distribution of each category in relation to attrition.
For example, in the count plot of ‘Country,’ I noticed that in both the US and Canada, the number of ‘No’ attrition cases was higher than the number of ‘Yes’ cases. Based on this observation, I think this feature might not be particularly important, as it does not show a clear relationship with attrition.
MODEL TRAINING
Before training the model, I need to convert the ‘Attrition’ feature to 1 and 0, since it was originally ‘Yes’ and ‘No’.
import pandas as pd

# Map the string labels to a binary target
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
I create two lists: one containing the numerical features and one containing the categorical features.
numerical_features = [
    'rated_year',
    'rating',
    'JoiningYear',
    'Age',
    'DistanceFromHome',
    'EnvironmentSatisfaction'
]

categorical_features = ['OverTime']
Then, I extract the features according to these lists and create new DataFrames.
numerical_df = df[numerical_features]
categorical_df = df[categorical_features]
For the categorical feature, I use OneHotEncoder to convert it into a format that can be provided to machine learning algorithms.
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # named `sparse=False` before scikit-learn 1.2
encoded_categorical_df = pd.DataFrame(encoder.fit_transform(categorical_df),
                                      columns=encoder.get_feature_names_out(categorical_features))
Next, I combine the numerical features and the already-encoded categorical features into ‘X,’ set the target to ‘y,’ and split the dataset into a training set (80%) and a testing set (20%).
from sklearn.model_selection import train_test_split

# Combine numerical and encoded categorical features
X = pd.concat([df[numerical_features], encoded_categorical_df], axis=1)
y = df['Attrition']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Because different preprocessing steps apply to different subsets of features, I use ColumnTransformer, a class from sklearn, to ensure the preprocessing steps are applied consistently and efficiently. It also integrates with Pipeline, which makes the whole workflow more maintainable.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
Once every part is ready, I train the model with the pipeline, as sketched below. In this project, I use RandomForest as the classifier.
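A minimal sketch of the fit-and-evaluate step. Note that ColumnTransformer selects columns by name, so when fitting the pipeline, X_train should contain the raw ‘OverTime’ column rather than the pre-encoded version.
from sklearn.metrics import classification_report, confusion_matrix

# Fit the full pipeline (scaling + encoding + RandomForest) on the training split
pipeline.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))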
The model performed well in predicting the ‘No Attrition’ class, but since this project focuses on employees who are likely to leave the company, the model’s performance on the ‘Attrition’ class is more important.
As observed from the recall, F1-score, and confusion matrix, the model tends to miss employees who actually left, indicating room for improvement in identifying employees at risk of attrition.
HANDLING IMBALANCED DATA
The key issue is that the dataset is imbalanced!
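A quick way to confirm the imbalance:
# Show the proportion of each class in the target
print(y_train.value_counts(normalize=True))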
So, I apply SMOTE (Synthetic Minority Over-sampling Technique) to address the imbalanced dataset problem.
from imblearn.over_sampling import SMOTE

# Oversample only the training set; the test set must stay untouched
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
In simple terms, SMOTE is a method that creates new minority-class data points by looking at the existing ones and generating new samples in between them.
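A sketch of retraining on the resampled data; for illustration, this uses a bare RandomForestClassifier on the already-encoded matrix, since SMOTE requires numeric inputs.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Train on the oversampled data; evaluate on the original test set
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_res, y_train_res)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # false negatives sit at position [1, 0]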
After applying SMOTE, the recall of the ‘Attrition’ class increased from 0.621 to 0.70, the F1-score increased from 0.705 to 0.728, and the number of false negatives (cases where the model predicts an employee will stay, but they actually leave) in the confusion matrix decreased from 667 to 520.
FEATURE ENGINEERING
Applying SMOTE alone improved the model’s performance, but there was still room for improvement.
So, I revisited the feature engineering process, adding more important features and removing those that weren’t helpful.
The first improvement I focused on was satisfaction-related features. Initially, I used only ‘EnvironmentSatisfaction.’ Then, I added ‘JobSatisfaction’ and ‘RelationshipSatisfaction’ one at a time, and each addition gradually improved the model’s performance.
Next, I added the ‘DailyRate’ and ‘MonthlyIncome’ features, which improved the overall results: recall increased to 0.891, F1-score to 0.926, and false negatives decreased to 192.
I then tested my hypothesis that ‘rated_year’ might not be an important factor in an employee’s decision to stay or leave. After removing this feature, the model’s performance improved even further: recall increased to 0.967, F1-score to 0.978, and false negatives decreased to 57.
Lastly, I considered year-related features. I added the ‘YearsInCurrentRole’ feature, and the results improved again: precision increased to 0.991, recall to 0.975, F1-score to 0.983, and false negatives dropped to just 44.
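Putting the iterations together, the final feature lists look roughly like this (reconstructed from the steps above):
# Final feature set: satisfaction and pay features added,
# 'rated_year' removed, 'YearsInCurrentRole' added last
numerical_features = [
    'rating',
    'JoiningYear',
    'Age',
    'DistanceFromHome',
    'EnvironmentSatisfaction',
    'JobSatisfaction',
    'RelationshipSatisfaction',
    'DailyRate',
    'MonthlyIncome',
    'YearsInCurrentRole'
]
categorical_features = ['OverTime']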