    Machine Learning

    Employee Attrition Prediction with Machine Learning: My First Data Science Portfolio Project | by Kongpop T. | Mar, 2025

    Published by FinanceStarGate, March 5, 2025 · 6 Mins Read
    INTRODUCTION

    Employee attrition is a major problem for companies, hurting productivity and driving up hiring costs.

    What if we could predict which employees are likely to leave?

    In this project, I used machine learning to predict employee attrition, applying techniques like feature engineering, handling imbalanced data with SMOTE, and training a RandomForest model.

    DATA PREPARATION

    First, I drew an ER diagram to make it easy to visualize the structure of the dataset, including table names and column types.

    The dataset consists of four CSV files containing employee information. I imported all the files into Excel (Data > Get Data from Text/CSV), used the Power Query Editor to set the first row as headers, and used the Merge Queries function to join (inner join) the data together.
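The same inner join can also be done directly in pandas. Below is a minimal sketch with two toy tables standing in for the real CSVs; the column names (`EmployeeID`, `rating`) are assumptions for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Toy stand-ins for two of the four CSV files (hypothetical columns).
employees = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Age': [29, 41, 35]})
ratings = pd.DataFrame({'EmployeeID': [1, 2], 'rating': [4, 3]})

# Inner join on the shared key, mirroring Power Query's Merge Queries step:
# only rows whose key appears in both tables survive.
merged = employees.merge(ratings, on='EmployeeID', how='inner')
print(merged.shape)
```

With real files you would start from `pd.read_csv(...)` for each of the four tables and chain the merges on the shared key.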

    Next, I checked for missing values using pandas.

    missing_values = df.isnull().sum()
    missing_values = missing_values[missing_values > 0]
    print(missing_values)

    The columns ‘LeavingYear,’ ‘Reason,’ and ‘RelievingStatus’ each have 38,169 missing values, which correspond to employees who have not left the company yet. I decided to drop these three columns because the missing data only applies to current employees and does not contribute useful information for predicting attrition.
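Dropping the three columns is a one-liner in pandas. A minimal sketch on toy data (the exact column names are taken from the text and may differ slightly in the real dataset):

```python
import pandas as pd

# Toy frame: the three columns are only populated for former employees.
df = pd.DataFrame({
    'Age': [29, 41],
    'LeavingYear': [None, None],
    'Reason': [None, None],
    'RelievingStatus': [None, None],
})

# Drop the columns that carry no signal for predicting attrition.
cols_to_drop = ['LeavingYear', 'Reason', 'RelievingStatus']
df = df.drop(columns=cols_to_drop)
print(df.columns.tolist())
```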

    EDA (EXPLORATORY DATA ANALYSIS)

    I created box plots for the numerical features and count plots for the categorical features to observe and identify which features should be selected for training the machine learning model.

    Box plots for numerical features
    Count plots for categorical features
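A minimal sketch of how such plots can be produced with seaborn, using toy data in place of the real dataset (the feature values here are invented for illustration):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data standing in for the real employee dataset.
df = pd.DataFrame({
    'Attrition': ['Yes', 'No', 'No', 'Yes', 'No', 'No'],
    'Age': [25, 40, 38, 28, 45, 50],
    'OverTime': ['Yes', 'No', 'No', 'Yes', 'No', 'Yes'],
})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
# Box plot: does the numerical feature separate the two attrition groups?
sns.boxplot(data=df, x='Attrition', y='Age', ax=axes[0])
# Count plot: distribution of a categorical feature split by attrition.
sns.countplot(data=df, x='OverTime', hue='Attrition', ax=axes[1])
fig.savefig('eda.png')
```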

    The logic I used to choose the numerical features from the box plots is based on whether a feature can distinguish between attrition (Yes / No). I looked at the mean, IQR (interquartile range), and whiskers to see if there were significant differences between the two groups.

    For example, in the box plot of ‘WorkLifeBalance,’ the shapes for both ‘Attrition: Yes’ and ‘Attrition: No’ are quite similar. Since there is little to no difference between the two groups, I did not include this feature in training the model, as it is unlikely to provide valuable information for predicting attrition.

    For categorical features, I used count plots to observe the distribution of each category in relation to attrition.

    For example, in the count plot of ‘Country,’ I noticed that in both the US and Canada, the number of ‘No’ attrition cases was higher than the number of ‘Yes’ attrition cases. Based on this observation, I think this feature might not be significantly important, since it does not show a clear relationship with attrition.

    MODEL TRAINING

    Before training the model, I need to convert the ‘Attrition’ feature to 1 and 0, because originally it was ‘Yes’ and ‘No’.

    import pandas as pd

    df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

    I created two lists: one containing the numerical features and one containing the categorical features.

    numerical_features = [
        'rated_year',
        'rating',
        'JoiningYear',
        'Age',
        'DistanceFromHome',
        'EnvironmentSatisfaction'
    ]

    categorical_features = ['OverTime']

    Then, I extracted the features according to these lists and created new DataFrames.

    numerical_df = df[numerical_features]
    categorical_df = df[categorical_features]

    For the categorical features, I used OneHotEncoder to convert them into a format that can be provided to machine learning algorithms.

    from sklearn.preprocessing import OneHotEncoder

    # Note: in scikit-learn >= 1.2 the argument is `sparse_output`;
    # older versions use `sparse=False` instead.
    encoder = OneHotEncoder(sparse_output=False)
    encoded_categorical_df = pd.DataFrame(
        encoder.fit_transform(categorical_df),
        columns=encoder.get_feature_names_out(categorical_features)
    )

    Next, I combined the numerical features and the encoded categorical features into ‘X,’ set the target to ‘y,’ and split the dataset into a training set (80%) and a testing set (20%).

    from sklearn.model_selection import train_test_split

    # Combine numerical and encoded categorical features
    X = pd.concat([df[numerical_features], encoded_categorical_df], axis=1)
    y = df['Attrition']

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    Because different preprocessing steps are applied to different subsets of features in the dataset, I used ColumnTransformer, a class from scikit-learn, to ensure the preprocessing steps are applied consistently and efficiently. It also integrates with Pipeline, making the whole workflow more efficient and maintainable.

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_features),
            ('cat', OneHotEncoder(), categorical_features)
        ])

    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    Once everything is ready, I train the model with the pipeline. In this project, I use a RandomForest classifier.

    Model training pipeline
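Putting the steps above together, a self-contained sketch of fitting the pipeline and evaluating it looks like this. The data here is synthetic (random values standing in for the real employee records), so only the workflow, not the numbers, matches the project:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real dataset.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'Age': rng.integers(22, 60, n),
    'DistanceFromHome': rng.integers(1, 30, n),
    'OverTime': rng.choice(['Yes', 'No'], n),
    'Attrition': rng.choice([0, 1], n, p=[0.8, 0.2]),
})

numerical_features = ['Age', 'DistanceFromHome']
categorical_features = ['OverTime']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(), categorical_features),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

X = df[numerical_features + categorical_features]
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the whole pipeline (preprocessing + classifier) in one call.
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```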

    The model performed well in predicting the ‘No Attrition’ class, but since this project focuses on employees who are likely to leave the company, the model’s performance on the ‘Attrition’ class is more important.

    The classification report
    Confusion matrix

    As observed from the recall, F1-score, and confusion matrix, the model tends to miss employees who actually left, indicating room for improvement in identifying employees at risk of attrition.

    HANDLING IMBALANCED DATA

    The key issue is that the dataset is imbalanced!

    So, I applied SMOTE (Synthetic Minority Over-sampling Technique) to address the imbalanced dataset problem.

    from imblearn.over_sampling import SMOTE

    smote = SMOTE(random_state=42)
    X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

    In simple terms, SMOTE is a technique that creates new data points by looking at the existing minority-class samples and generating new ones in between them.

    After applying SMOTE, the recall of the ‘Attrition’ class increased from 0.621 to 0.70, the F1-score increased from 0.705 to 0.728, and the number of false negatives (cases where the model predicts an employee will stay, but they actually leave) in the confusion matrix decreased from 667 to 520.

    FEATURE ENGINEERING

    Applying SMOTE alone improved the model’s performance, but there was still room for improvement.

    So, I revisited the feature engineering process, adding more important features and removing the ones that were not helpful.

    The first improvement I focused on was the satisfaction-related features. Initially, I used only ‘EnvironmentSatisfaction.’ Then I added ‘JobSatisfaction’ and ‘RelationshipSatisfaction’ one at a time, and each addition gradually improved the model’s performance.

    Next, I added the ‘DailyRate’ and ‘MonthlyIncome’ features, which improved the overall results: recall increased to 0.891, F1-score to 0.926, and false negatives decreased to 192.

    I then tested my hypothesis that ‘rated_year’ might not be an important factor in an employee’s decision to stay or leave. After removing this feature, the model’s performance improved even more: recall increased to 0.967, F1-score to 0.978, and false negatives decreased to 57.

    Finally, I considered year-related features. I added the ‘YearsInCurrentRole’ feature, and the results improved further: precision increased to 0.991, recall to 0.975, F1-score to 0.983, and false negatives dropped to just 44.
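This add-one-feature, check-the-metrics loop can be sketched as a small experiment over candidate feature sets. Everything below is synthetic: the column indices and set names are placeholders for the real features, so only the pattern of the iteration, not the reported numbers, carries over:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 is informative, the others are noise.
rng = np.random.default_rng(0)
n = 400
X_all = rng.normal(size=(n, 3))
y = (X_all[:, 0] + 0.1 * rng.normal(size=n) > 0.5).astype(int)

# Candidate feature sets, analogous to the iterations described above.
feature_sets = {
    'baseline': [0, 1],        # e.g. satisfaction features only
    'plus_income': [0, 1, 2],  # e.g. after adding DailyRate / MonthlyIncome
    'drop_noisy': [0],         # e.g. after removing rated_year
}

for name, cols in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_all[:, cols], y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    rec = recall_score(y_te, model.predict(X_te))
    print(f'{name}: recall={rec:.3f}')
```

Keeping the train/test split and random seeds fixed across iterations is what makes the metric changes attributable to the feature set rather than to sampling noise.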

    The classification report
    Confusion matrix


