
    How To Build a Benchmark for Your Models

    By FinanceStarGate | May 15, 2025 | 9 Mins Read


    I’ve been a data science consultant for the past three years, and I’ve had the chance to work on several projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with:

    They rarely have a clear idea of the project objective.

    This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain.

    But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example:

    I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

    Well, now what? Easy, let’s start building some models!

    Wrong!

    If having a clear objective is rare, having a reliable benchmark is even rarer.

    In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.

    In this blog post, I’ll explain:

    • What a benchmark is,
    • Why it is important to have a benchmark,
    • How I would build one using an example scenario and
    • Some potential drawbacks to keep in mind

    What is a benchmark?

    A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared.

    A benchmark needs two key components to be considered complete:

    1. A set of metrics to evaluate the performance
    2. A set of simple models to use as baselines

    The concept at its core is simple: every time I develop a new model, I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked.

    It is essential to understand that this baseline shouldn’t be model- or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.

    If I encounter a new dataset with the same business objective, this benchmark should be a reliable reference point.


    Why building a benchmark is important

    Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark.

    1. Without a Benchmark You’re Aiming for Perfection: if you are working without a clear reference point, any result loses its meaning. “My model has a MAE of 30.000.” Is that good? I don’t know! Maybe with a simple mean you would get a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement.
    2. Improves Communication with Clients: clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks may come directly from the business in different shapes or forms.
    3. Helps in Model Selection: a benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering.
    4. Model Drift Detection and Monitoring: models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines.
    5. Consistency Between Different Datasets: datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time.

    With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven.
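To make the first point concrete, here is a minimal sketch of checking a regression model against a "predict the training mean" baseline. All numbers and variable names are made up for illustration and do not come from the churn example used later.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical training and test targets (illustrative values only)
y_train = np.array([100.0, 120.0, 80.0, 100.0])
y_test = np.array([90.0, 110.0, 130.0])

# Hypothetical predictions from the model under evaluation
model_pred = np.array([70.0, 140.0, 100.0])

# Baseline: always predict the mean of the training targets
baseline_pred = np.full(len(y_test), y_train.mean())

model_mae = mean_absolute_error(y_test, model_pred)
baseline_mae = mean_absolute_error(y_test, baseline_pred)

print(f"model MAE:    {model_mae:.2f}")
print(f"baseline MAE: {baseline_mae:.2f}")
print("model beats baseline:", model_mae < baseline_mae)
```

In this toy case the model is worse than the naive mean, which is exactly the kind of conclusion a bare MAE figure would hide.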


    How I would build a benchmark

    I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one.

    Let’s start from the business question we presented at the very beginning of this blog post:

    I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

    For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist.

    For this example, I’m using this dataset (CC0: Public Domain). The data contains some attributes from a company’s customer base (e.g., age, sex, number of products, …) along with their churn status.

    Now that we have something to work on, let’s build the benchmark:

    1. Defining the metrics

    We are dealing with a churn use case; specifically, this is a binary classification problem. Thus the main metrics that we could use are:

    • Precision: percentage of correctly predicted churners among all predicted churners
    • Recall: percentage of actual churners correctly identified
    • F1 score: balances precision and recall
    • True Positives, False Positives, True Negatives and False Negatives

    These are some of the “simple” metrics that could be used to evaluate the output of a model.
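As a minimal sketch, the four confusion-matrix counts could be written as small metric functions that take `y_true` and `y_pred`. The function names `tp`, `tn`, `fp`, `fn` and the toy labels are assumptions for illustration; precision, recall and F1 are derived from the counts here only to show how the definitions above connect.

```python
import numpy as np

def _to_arrays(y_true, y_pred):
    # Plain arrays avoid surprises from pandas index alignment
    return np.asarray(y_true), np.asarray(y_pred)

def tp(y_true, y_pred):
    """Churners correctly predicted as churners."""
    y_true, y_pred = _to_arrays(y_true, y_pred)
    return int(np.sum((y_pred == 1) & (y_true == 1)))

def tn(y_true, y_pred):
    """Non-churners correctly predicted as non-churners."""
    y_true, y_pred = _to_arrays(y_true, y_pred)
    return int(np.sum((y_pred == 0) & (y_true == 0)))

def fp(y_true, y_pred):
    """Non-churners wrongly predicted as churners."""
    y_true, y_pred = _to_arrays(y_true, y_pred)
    return int(np.sum((y_pred == 1) & (y_true == 0)))

def fn(y_true, y_pred):
    """Churners wrongly predicted as non-churners."""
    y_true, y_pred = _to_arrays(y_true, y_pred)
    return int(np.sum((y_pred == 0) & (y_true == 1)))

# Toy labels (1 = churn, 0 = no churn), made up for illustration
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

counts = {f.__name__: f(y_true=y_true, y_pred=y_pred) for f in (tp, tn, fp, fn)}
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
f1 = 2 * precision * recall / (precision + recall)

print(counts)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Keeping every metric behind the same `(y_true, y_pred)` signature is what lets a benchmark loop over an arbitrary list of metrics.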

    However, this is not an exhaustive list, and standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics.

    Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates:

    • A cost ($250) when offering the discount to a non-churning customer
    • A profit ($1000) when retaining a churning customer

    Following this definition, we can build a custom metric that will be crucial in our scenario:

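    # Defining the business case-specific reference metric
    import numpy as np

    def financial_gain(y_true, y_pred):
        # Plain arrays keep the comparison positional, regardless of pandas indexes
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250   # wasted discounts
        gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000  # retained churners
        return gain_from_tp - loss_from_fp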
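A quick worked example helps to sanity-check a custom metric like this one. The sketch below restates the function so that it runs on its own and applies it to made-up labels, alongside two trivial strategies (contact nobody, contact everyone); the labels and variable names are assumptions for illustration.

```python
import numpy as np

def financial_gain(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250
    gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000
    return gain_from_tp - loss_from_fp

# Toy labels (1 = churn, 0 = no churn), made up for illustration
y_true = np.array([1, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1])

# 2 true positives and 1 false positive: 2 * 1000 - 1 * 250
gain_model = financial_gain(y_true, y_pred)

# Trivial strategies scored with the same metric
gain_nobody = financial_gain(y_true, np.zeros_like(y_true))
gain_everyone = financial_gain(y_true, np.ones_like(y_true))

print(gain_model, gain_nobody, gain_everyone)  # 1750 0 2500
```

On this toy sample, contacting everyone scores higher than the selective predictions, because the metric as defined only penalizes wasted discounts. This is a useful reminder to check with the business that a custom metric captures every relevant cost before optimizing for it.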

    When you build business-driven metrics, these are usually the most relevant ones. Such metrics can take any shape or form: financial goals, minimum requirements, percentage of coverage and more.

    2. Defining the benchmarks

    Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference.

    In this phase, you should define a list of simple-to-implement models in their simplest possible setup. There is no reason at this stage to spend time and resources on the optimization of these models. My mindset is:

    If I had 15 minutes, how would I implement this model?

    In later phases, you can add more baseline models as the project proceeds.

    In this case, I’ll use the following models:

    • Random Model: assigns labels randomly, predicting churn with a probability equal to the churn rate observed in the training set
    • Majority Model: always predicts the most frequent class
    • Simple XGB
    • Simple KNN

    import numpy as np
    import xgboost as xgb
    from sklearn.neighbors import KNeighborsClassifier

    class BinaryMean():
        @staticmethod
        def run_benchmark(df_train, df_test):
            # Random labels, with P(churn) equal to the churn rate in the training set
            np.random.seed(21)
            return np.random.choice(a=[1, 0], size=len(df_test), p=[df_train['y'].mean(), 1 - df_train['y'].mean()])

    class SimpleXbg():
        @staticmethod
        def run_benchmark(df_train, df_test):
            # Default XGBoost classifier trained on the numeric features only
            model = xgb.XGBClassifier()
            model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
            return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

    class MajorityClass():
        @staticmethod
        def run_benchmark(df_train, df_test):
            # Always predict the most frequent class of the training set
            majority_class = df_train['y'].mode()[0]
            return np.full(len(df_test), majority_class)

    class SimpleKNN():
        @staticmethod
        def run_benchmark(df_train, df_test):
            # Default k-nearest-neighbors classifier trained on the numeric features only
            model = KNeighborsClassifier()
            model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
            return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))
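All of these baselines share one interface: a static `run_benchmark(df_train, df_test)` that reads the target from a column named `y` and returns one label per test row. The sketch below restates the two model-free baselines and runs them on a tiny made-up split; the toy data and the `Age` placeholder feature are assumptions for illustration.

```python
import numpy as np
import pandas as pd

class MajorityClass:
    @staticmethod
    def run_benchmark(df_train, df_test):
        majority_class = df_train['y'].mode()[0]
        return np.full(len(df_test), majority_class)

class BinaryMean:
    @staticmethod
    def run_benchmark(df_train, df_test):
        np.random.seed(21)
        churn_rate = df_train['y'].mean()
        return np.random.choice(a=[1, 0], size=len(df_test), p=[churn_rate, 1 - churn_rate])

# Made-up toy split; the target column must be called 'y'
df_train = pd.DataFrame({'Age': [30, 41, 52, 63], 'y': [0, 0, 0, 1]})
df_test = pd.DataFrame({'Age': [35, 58, 47], 'y': [0, 1, 0]})

majority_pred = MajorityClass.run_benchmark(df_train, df_test)
random_pred = BinaryMean.run_benchmark(df_train, df_test)

print(majority_pred)  # every test row gets the majority training class (0 here)
print(random_pred)    # random 0/1 labels, churn drawn with probability 0.25
```

Because every baseline returns an array aligned with `df_test`, any of them can be scored with the same metric functions as the real model.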

    Again, as in the case of the metrics, we can build custom benchmarks.

    Let’s assume that in our business case the marketing team contacts every client who is:

    • At least 50 years old and
    • No longer active

    Following this rule, we can build this model:

    # Defining the business case-specific benchmark
    class BusinessBenchmark():
        @staticmethod
        def run_benchmark(df_train, df_test):
            # Rule-based baseline: no training needed, so df_train is unused
            df = df_test.copy()
            df.loc[:,'y_hat'] = 0
            # Flag inactive customers aged 50 or older as likely churners
            df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1
            return df['y_hat']
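To check that the rule behaves as intended, it can be run on a handful of hand-written rows. This is a minimal sketch with made-up customers; it restates the class so that it runs on its own, and it also shows an equivalent one-line boolean mask for comparison.

```python
import pandas as pd

class BusinessBenchmark:
    @staticmethod
    def run_benchmark(df_train, df_test):
        df = df_test.copy()
        df.loc[:, 'y_hat'] = 0
        df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1
        return df['y_hat']

# Four made-up customers covering both sides of each condition
df_test = pd.DataFrame({
    'Age': [45, 50, 62, 70],
    'IsActiveMember': [0, 0, 1, 0],
    'y': [0, 1, 0, 1],
})

# The rule does not learn anything, so no training data is required
rule_pred = BusinessBenchmark.run_benchmark(df_train=None, df_test=df_test)

# Equivalent formulation as a single boolean mask cast to 0/1
rule_mask = ((df_test['IsActiveMember'] == 0) & (df_test['Age'] >= 50)).astype(int)

print(rule_pred.tolist())  # [0, 1, 0, 1]
print(rule_mask.tolist())  # [0, 1, 0, 1]
```

Only the inactive customers aged 50 and 70 are flagged: the 45-year-old fails the age condition and the 62-year-old is still active.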

    Running the benchmark

    To run the benchmark I’ll use the following class. The entry point is the method compare_pred_with_benchmark() that, given a prediction, runs all the models and calculates all the metrics.

    import numpy as np

    class ChurnBinaryBenchmark():
        def __init__(self, metrics=None, benchmark_models=None):
            # None defaults avoid sharing one mutable list between instances
            self.metrics = metrics if metrics is not None else []
            self.benchmark_models = benchmark_models if benchmark_models is not None else []

        def compare_pred_with_benchmark(self, df_train, df_test, my_predictions):
            # Score the predictions of the model under evaluation
            output_metrics = {
                'Prediction': self._calculate_metrics(df_test['y'], my_predictions)
            }
            dct_benchmarks = {}

            # Run every baseline on the same split and score it with the same metrics
            for model in self.benchmark_models:
                dct_benchmarks[model.__name__] = model.run_benchmark(df_train = df_train, df_test = df_test)
                output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], dct_benchmarks[model.__name__])

            return output_metrics

        def _calculate_metrics(self, y_true, y_pred):
            # One entry per metric, keyed by the metric function's name
            return {getattr(func, '__name__', 'Unknown') : func(y_true = y_true, y_pred = y_pred) for func in self.metrics}

    Now all we need is a prediction. For this example, I did some quick feature engineering and some hyperparameter tuning.

    The last step is just to run the benchmark:

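    # Imports assumed here; the original snippet does not show them
    import pandas as pd
    from sklearn.metrics import f1_score, precision_score, recall_score

    # tp, tn, fp and fn are assumed to be user-defined functions returning the
    # confusion-matrix counts; preds holds the predictions of the model under test
    binary_benchmark = ChurnBinaryBenchmark(
        metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain],
        benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark]
        )

    res = binary_benchmark.compare_pred_with_benchmark(
        df_train=df_train,
        df_test=df_test,
        my_predictions=preds,
    )

    pd.DataFrame(res)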
    Benchmark metrics comparison | Image by Author

    This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the next steps of the process.
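One possible way to read such a table is to put one model per row, sort by the business metric and express each model as an uplift over the current business rule. The sketch below uses a hand-written dictionary with the same nested shape as the benchmark output; every number in it is invented for illustration and is not a result from the dataset.

```python
import pandas as pd

# Invented values, shaped like the benchmark output: {model name: {metric name: value}}
res = {
    'Prediction': {'f1_score': 0.60, 'financial_gain': 120000},
    'Benchmark - MajorityClass': {'f1_score': 0.00, 'financial_gain': 0},
    'Benchmark - BusinessBenchmark': {'f1_score': 0.35, 'financial_gain': 90000},
}

# pd.DataFrame(res) puts models in columns; transpose to get one row per model
table = pd.DataFrame(res).T.sort_values('financial_gain', ascending=False)

# Gain of each model relative to the rule the business already applies
reference = table.loc['Benchmark - BusinessBenchmark', 'financial_gain']
table['uplift_vs_business_rule'] = table['financial_gain'] - reference

print(table)
```

Framing results as an uplift over the existing business rule tends to be easier for stakeholders to interpret than raw metric values.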


    Some drawbacks

    As we’ve seen, there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are extremely useful, there are some pitfalls to watch out for:

    1. Non-Informative Benchmark: when the metrics or models are poorly defined, the marginal impact of having a benchmark decreases. Always define meaningful baselines.
    2. Misinterpretation by Stakeholders: communication with the client is essential, so it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics.
    3. Overfitting to the Benchmark: you might end up trying to create features that are too specific, which might beat the benchmark but don’t generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best possible solution to the problem.
    4. Change of Objective: the defined objectives might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed.

    Final thoughts

    Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without evidence and ensure that every iteration brings real value.

    They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements.

    Here you can find a notebook with a full implementation from this blog post.


