    Employee Attrition Prediction with Machine Learning: My First Data Science Portfolio Project | by Kongpop T. | Mar, 2025

    March 5, 2025 · 6 min read


    INTRODUCTION

    Employee attrition is a major problem for companies, hurting productivity and driving up hiring costs.

    What if we could predict which employees are likely to leave?

    In this project, I used machine learning to predict employee attrition, applying techniques like feature engineering, handling imbalanced data with SMOTE, and training a RandomForest model.

    DATA PREPARATION

    First, I drew an ER diagram to make it easy to visualize the structure of the dataset, including column names and types.

    The dataset consists of 4 CSV files containing employee information. I imported all files into Excel (Data > Get Data from Text/CSV), used the Power Query Editor to set the first row as headers, and used the Merge Queries function to join (inner join) the data together.
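    For readers who prefer to stay in Python, the same inner join can be sketched in pandas. The table contents and the `EmpID` key below are placeholders for illustration, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical stand-ins for two of the four CSV files.
personal = pd.DataFrame({"EmpID": [1, 2, 3], "Age": [29, 41, 35]})
work = pd.DataFrame({"EmpID": [1, 2, 4], "OverTime": ["Yes", "No", "Yes"]})

# Inner join keeps only employees present in both tables,
# mirroring Power Query's Merge Queries (inner) step.
df = personal.merge(work, on="EmpID", how="inner")
print(df)
```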

    Next, I checked for missing values using Pandas.

    missing_values = df.isnull().sum()
    missing_values = missing_values[missing_values > 0]
    print(missing_values)

    The columns ‘LeavingYear,’ ‘Reason,’ and ‘RelievingStatus’ each have 38,169 missing values, which correspond to employees who haven’t left the company yet. I decided to drop these three columns because the missing data only applies to current employees and doesn’t contribute useful information for predicting attrition.
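    The drop step itself is a one-liner. A minimal sketch, using a toy frame in place of the real merged dataset (the column values here are made up):

```python
import pandas as pd

# Toy frame standing in for the merged employee dataset.
df = pd.DataFrame({
    "Age": [29, 41],
    "LeavingYear": [None, 2020.0],
    "Reason": [None, "Relocation"],
    "RelievingStatus": [None, "Relieved"],
})

# Drop the columns that are only populated for employees who already left.
df = df.drop(columns=["LeavingYear", "Reason", "RelievingStatus"])
print(df.columns.tolist())
```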

    EDA (EXPLORATORY DATA ANALYSIS)

    I created box plots for numerical features and count plots for categorical features to observe and identify which features should be selected for training the machine learning model.

    Box plots for numerical features
    Count plots for categorical features

    The logic I used to choose the numerical features from the box plots is based on whether the feature can distinguish between attrition (Yes / No). I looked at the mean, IQR (interquartile range), and whiskers to see if there were significant differences between the two groups.

    For example, in the box plot of ‘WorkLifeBalance,’ the shapes for both ‘Attrition: Yes’ and ‘Attrition: No’ are quite similar. Since there is little to no difference between the two groups, I did not include this feature for training the model, as it is unlikely to provide valuable information for predicting attrition.

    For categorical features, I used count plots to observe the distribution of each class in relation to attrition.

    For example, in the count plot of ‘Country,’ I noticed that in both the US and Canada, the number of ‘No’ attrition cases was higher than the number of ‘Yes’ attrition cases. Based on this observation, I think this feature might not be significantly important, since it doesn’t show a clear relationship with attrition.
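    The visual comparison can be backed up numerically. A sketch, on synthetic data (not the real dataset), of comparing group medians: a large gap between the ‘Yes’ and ‘No’ groups suggests the feature discriminates attrition, while a near-zero gap suggests it does not:

```python
import pandas as pd

# Synthetic stand-in: one feature that separates the groups, one that doesn't.
df = pd.DataFrame({
    "Attrition": ["Yes"] * 4 + ["No"] * 4,
    "DistanceFromHome": [20, 25, 22, 28, 3, 5, 4, 6],
    "WorkLifeBalance": [3, 2, 3, 2, 3, 2, 3, 2],
})

# Median per attrition group for each candidate feature.
dist_gap = df.groupby("Attrition")["DistanceFromHome"].median()
wlb_gap = df.groupby("Attrition")["WorkLifeBalance"].median()

print(abs(dist_gap["Yes"] - dist_gap["No"]))  # large gap -> keep the feature
print(abs(wlb_gap["Yes"] - wlb_gap["No"]))    # no gap -> drop the feature
```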

    MODEL TRAINING

    Before training the model, I needed to convert the ‘Attrition’ feature to 1 and 0, because it was originally ‘Yes’ and ‘No’.

    import pandas as pd

    # Map the target to binary: 1 = left the company, 0 = stayed.
    df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

    I created two lists: one containing the numerical features, and one containing the categorical features.

    numerical_features = [
        'rated_year',
        'rating',
        'JoiningYear',
        'Age',
        'DistanceFromHome',
        'EnvironmentSatisfaction'
    ]

    categorical_features = ['OverTime']

    Then, I extracted the features according to these lists and created new DataFrames.

    numerical_df = df[numerical_features]
    categorical_df = df[categorical_features]

    For the categorical feature, I used OneHotEncoder to convert it into a format that can be provided to machine learning algorithms.

    from sklearn.preprocessing import OneHotEncoder

    # Note: in scikit-learn >= 1.2 the keyword is `sparse_output`; the old
    # `sparse` argument was removed in 1.4.
    encoder = OneHotEncoder(sparse_output=False)
    encoded_categorical_df = pd.DataFrame(
        encoder.fit_transform(categorical_df),
        columns=encoder.get_feature_names_out(categorical_features)
    )

    Next, I combined the numerical and encoded categorical features into ‘X,’ set the target to ‘y,’ and split the dataset into a training set (80%) and a testing set (20%).

    from sklearn.model_selection import train_test_split

    # Combine numerical and encoded categorical features
    X = pd.concat([df[numerical_features], encoded_categorical_df], axis=1)
    y = df['Attrition']

    # Split the data into training and testing sets (80/20)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    Because different preprocessing steps apply to different subsets of features, I used ColumnTransformer, a class from sklearn, to ensure the preprocessing is applied consistently and efficiently. It also integrates with Pipeline, which makes the whole workflow more maintainable.

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_features),
            ('cat', OneHotEncoder(), categorical_features)
        ])

    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    With everything ready, I trained the model with the pipeline. In this project, I used RandomForest.

    Model training pipeline
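    The fit-and-evaluate step shown in the image above can be sketched end to end. This is a self-contained illustration on synthetic data (the feature names and the label rule are made up, not the article's real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the employee frame.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "Age": rng.integers(20, 60, n),
    "DistanceFromHome": rng.integers(1, 30, n),
    "OverTime": rng.choice(["Yes", "No"], n),
})
# Made-up deterministic label rule so the model has something to learn.
y = (X["OverTime"].eq("Yes") & (X["DistanceFromHome"] > 15)).astype(int)

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Age", "DistanceFromHome"]),
    ("cat", OneHotEncoder(), ["OverTime"]),
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```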

    The model performed well in predicting the ‘No Attrition’ class, but since this project focuses on employees who are likely to leave the company, the model’s performance on the ‘Attrition’ class is more important.

    The classification report
    Confusion matrix

    As seen in the recall, F1-score, and confusion matrix, the model tends to miss employees who actually left, indicating room for improvement in identifying employees at risk of attrition.

    HANDLING IMBALANCED DATA

    The key issue is that the dataset is imbalanced!

    So, I applied SMOTE (Synthetic Minority Over-sampling Technique) to address the imbalanced dataset problem.

    from imblearn.over_sampling import SMOTE

    smote = SMOTE(random_state=42)
    X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

    In simple terms, SMOTE is a method that creates new data points by looking at existing minority-class samples and generating new ones in between them.
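    That “in between” idea can be illustrated with plain NumPy. This is a simplified sketch of what imblearn does internally, not its actual implementation: a synthetic point is a random convex combination of a minority sample and one of its nearest minority neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class samples (e.g. two employees who left).
a = np.array([2.0, 3.0])
b = np.array([4.0, 7.0])  # a's nearest minority-class neighbor

# SMOTE-style synthetic point: a + gap * (b - a), with gap drawn from [0, 1).
gap = rng.random()
synthetic = a + gap * (b - a)
print(synthetic)  # lies on the line segment between a and b
```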

    After applying SMOTE, the recall of the ‘Attrition’ class increased from 0.621 to 0.70, the F1-score increased from 0.705 to 0.728, and the false negatives in the confusion matrix (cases where the model predicts an employee will stay but they actually leave) decreased from 667 to 520.

    FEATURE ENGINEERING

    Applying SMOTE alone improved the model’s performance, but there was still room for improvement.

    So, I revisited the feature engineering process, adding more important features and removing those that weren’t useful.

    The first improvement I focused on was the satisfaction-related features. Initially, I used only ‘EnvironmentSatisfaction.’ Then I added ‘JobSatisfaction’ and ‘RelationshipSatisfaction’ one at a time, and each addition gradually improved the model’s performance.

    Next, I added the ‘DailyRate’ and ‘MonthlyIncome’ features, which improved the overall results: recall increased to 0.891, F1-score to 0.926, and false negatives decreased to 192.

    I then tested my hypothesis that ‘rated_year’ might not be an important factor in an employee’s decision to stay or leave. After removing this feature, the model’s performance improved even more: recall increased to 0.967, F1-score to 0.978, and false negatives decreased to 57.

    Finally, I considered year-related features. I added the ‘YearsInCurrentRole’ feature, and the results improved further: precision increased to 0.991, recall to 0.975, F1-score to 0.983, and false negatives dropped to just 44.

    The classification report
    Confusion matrix



