
    How To Build a Benchmark for Your Models

    By Editor Times Featured | May 20, 2025 | 9 Mins Read


    I’ve been a data science consultant for the past three years, and I’ve had the chance to work on multiple projects across various industries. Yet I noticed one common denominator among most of the clients I worked with:

    They rarely have a clear idea of the project objective.

    This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain.

    But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example:

    I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

    Well, now what? Easy, let’s start building some models!

    Wrong!

    If having a clear objective is rare, having a reliable benchmark is even rarer.

    In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.

    In this blog post, I’ll explain:

    • What a benchmark is,
    • Why it is important to have a benchmark,
    • How I would build one using an example scenario and
    • Some potential drawbacks to keep in mind

    What is a benchmark?

    A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared.

    A benchmark needs two key components to be considered complete:

    1. A set of metrics to evaluate the performance
    2. A set of simple models to use as baselines

    The concept at its core is simple: every time I develop a new model, I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked.

    It is essential to understand that this baseline shouldn’t be model- or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.

    If I encounter a new dataset with the same business objective, this benchmark should be a reliable reference point.


    Why building a benchmark is important

    Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark.

    1. Without a Benchmark you’re aiming for perfection: If you are working without a clear reference point, any result loses meaning. “My model has a MAE of 30,000.” Is that good? IDK! Maybe with a simple mean you’d get a MAE of 25,000. By comparing your model to a baseline, you can measure both performance and improvement.
    2. Improves Communication with Clients: Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks could come directly from the business in different shapes or forms.
    3. Helps in Model Selection: A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering.
    4. Model Drift Detection and Monitoring: Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines.
    5. Consistency Between Different Datasets: Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time.
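
    The first point can be made concrete with a few lines of NumPy. The sketch below is illustrative only: the target values and the model predictions are made-up numbers, and mae is a hypothetical helper, not something from the original project.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two arrays."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Hypothetical targets, split into a training part and a test part
y_train = np.array([100.0, 120.0, 80.0, 110.0, 90.0])
y_test = np.array([105.0, 95.0, 115.0, 85.0])

# Hypothetical predictions from "my model" on the test set
model_pred = np.array([90.0, 110.0, 100.0, 100.0])

# Baseline: always predict the mean of the training targets
baseline_pred = np.full(len(y_test), y_train.mean())

model_mae = mae(y_test, model_pred)
baseline_mae = mae(y_test, baseline_pred)
print(f"model MAE: {model_mae:.1f} | mean-baseline MAE: {baseline_mae:.1f}")
```

    On these toy numbers the model’s MAE is higher than the one obtained by always predicting the training mean, which is exactly the situation a bare MAE figure would hide.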

    With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven.


    How I would build a benchmark

    I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one.

    Let’s start from the business question we presented at the very beginning of this blog post:

    I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

    For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist.

    For this example, I’m using this dataset (CC0: Public Domain). The data contains some attributes from a company’s customer base (e.g., age, sex, number of products, …) along with their churn status.

    Now that we have something to work on, let’s build the benchmark:

    1. Defining the metrics

    We are dealing with a churn use case; in particular, this is a binary classification problem. Thus the main metrics that we could use are:

    • Precision: Percentage of correctly predicted churners among all predicted churners
    • Recall: Percentage of actual churners correctly identified
    • F1 score: Harmonic mean of precision and recall, balancing the two
    • True Positives, False Positives, True Negatives and False Negatives

    These are some of the “simple” metrics that could be used to evaluate the output of a model.
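
    As a minimal sketch of what these metrics compute, the helpers below count the four confusion-matrix cells with NumPy and derive precision, recall and F1 from them. The function names (tp, tn, fp, fn, precision, recall, f1) and the toy labels are my own assumptions for illustration; in practice scikit-learn’s precision_score, recall_score and f1_score cover the last three.

```python
import numpy as np

def tp(y_true, y_pred):
    """True positives: predicted churn (1) and actually churned (1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return int(np.sum((y_pred == 1) & (y_true == 1)))

def tn(y_true, y_pred):
    """True negatives: predicted no churn (0) and did not churn (0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return int(np.sum((y_pred == 0) & (y_true == 0)))

def fp(y_true, y_pred):
    """False positives: predicted churn (1) but did not churn (0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return int(np.sum((y_pred == 1) & (y_true == 0)))

def fn(y_true, y_pred):
    """False negatives: predicted no churn (0) but actually churned (1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return int(np.sum((y_pred == 0) & (y_true == 1)))

def precision(y_true, y_pred):
    predicted_pos = tp(y_true, y_pred) + fp(y_true, y_pred)
    return tp(y_true, y_pred) / predicted_pos if predicted_pos else 0.0

def recall(y_true, y_pred):
    actual_pos = tp(y_true, y_pred) + fn(y_true, y_pred)
    return tp(y_true, y_pred) / actual_pos if actual_pos else 0.0

def f1(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Toy labels, made up for illustration (1 = churn, 0 = no churn)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 1])

print(tp(y_true, y_pred), fp(y_true, y_pred), tn(y_true, y_pred), fn(y_true, y_pred))
print(precision(y_true, y_pred), recall(y_true, y_pred), f1(y_true, y_pred))
```

    All helpers accept y_true and y_pred as keyword arguments, so they can be passed around interchangeably as a list of metric functions.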

    However, this is not an exhaustive list, and standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics.

    Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates:

    • A cost ($250) when offering the discount to a non-churning customer
    • A profit ($1000) when retaining a churning customer

    Following this definition, we can build a custom metric that will be crucial in our scenario:

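    # Defining the business case-specific reference metric
    import numpy as np

    def financial_gain(y_true, y_pred):
        # Each false positive (discount offered to a non-churner) costs $250
        loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250
        # Each true positive (churner retained thanks to the discount) earns $1000
        gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000
        return gain_from_tp - loss_from_fp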

    When you are building business-driven metrics, these are usually the most relevant. Such metrics could take any shape or form: financial goals, minimum requirements, percentage of coverage and more.
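
    To see how the $250 cost and the $1000 profit play out, here is a small worked example. The labels are made up for illustration, and the function is restated so the snippet runs on its own.

```python
import numpy as np

def financial_gain(y_true, y_pred):
    # $250 lost per discount given to a non-churner, $1000 gained per retained churner
    loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250
    gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000
    return gain_from_tp - loss_from_fp

# Toy data: 8 customers, 3 of whom actually churn
y_true = np.array([1, 1, 0, 0, 0, 1, 0, 0])

# A hypothetical model: 2 true positives, 2 false positives
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
print(financial_gain(y_true, y_pred))          # 2 * 1000 - 2 * 250

# Two trivial policies for comparison
contact_everyone = np.ones_like(y_true)        # 3 true positives, 5 false positives
contact_nobody = np.zeros_like(y_true)         # no discounts at all
print(financial_gain(y_true, contact_everyone))
print(financial_gain(y_true, contact_nobody))
```

    With these particular amounts, a set of predicted churners is profitable whenever 1000 · TP > 250 · FP, i.e. whenever precision exceeds 0.2. On this toy data, contacting everyone even beats the hypothetical model, which is the kind of insight a standard metric alone would not surface.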

    2. Defining the benchmarks

    Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference.

    In this phase, you should define a list of simple-to-implement models in their simplest possible setup. There is no reason at this stage to spend time and resources on the optimization of these models. My mindset is:

    If I had 15 minutes, how would I implement this model?

    In later phases of the project, you can add more baseline models as the project proceeds.

    In this case, I’ll use the following models:

    • Random Model: Assigns labels randomly, using the training churn rate as the probability of predicting churn
    • Majority Model: Always predicts the most frequent class
    • Simple XGB
    • Simple KNN

    import numpy as np
    import xgboost as xgb
    from sklearn.neighbors import KNeighborsClassifier

    class BinaryMean():
        # Random baseline: predicts churn with probability equal to the training churn rate
        @staticmethod
        def run_benchmark(df_train, df_test):
            np.random.seed(21)  # fixed seed so the baseline is reproducible
            return np.random.choice(a=[1, 0], size=len(df_test), p=[df_train['y'].mean(), 1 - df_train['y'].mean()])

    class SimpleXbg():
        # XGBoost with default hyperparameters, trained on the numeric columns only
        @staticmethod
        def run_benchmark(df_train, df_test):
            model = xgb.XGBClassifier()
            model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
            return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

    class MajorityClass():
        # Always predicts the most frequent class seen in training
        @staticmethod
        def run_benchmark(df_train, df_test):
            majority_class = df_train['y'].mode()[0]
            return np.full(len(df_test), majority_class)

    class SimpleKNN():
        # k-nearest neighbors with default hyperparameters, numeric columns only
        @staticmethod
        def run_benchmark(df_train, df_test):
            model = KNeighborsClassifier()
            model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
            return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))
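
    Trivial baselines like the majority model are also useful for sanity-checking the metrics themselves. The sketch below uses made-up labels with a 20% churn rate and plain NumPy (no DataFrames), so it is an illustration of the idea rather than the classes above.

```python
import numpy as np

# Toy labels, made up for illustration: 2 churners out of 10 in each split
y_train = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
y_test = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0])

# Majority baseline: always predict the most frequent training label
values, counts = np.unique(y_train, return_counts=True)
majority_class = values[np.argmax(counts)]
majority_pred = np.full(len(y_test), majority_class)

# Share of correct predictions vs. share of churners actually found
accuracy = float(np.mean(majority_pred == y_test))
churners_found = int(np.sum((majority_pred == 1) & (y_test == 1)))
recall = churners_found / int(np.sum(y_test == 1))

print(f"accuracy: {accuracy:.2f} | recall: {recall:.2f}")
```

    The majority baseline is right for 8 customers out of 10 while finding no churner at all. Any model that cannot clearly beat it on recall or on the business metric is not adding value, however good its share of correct predictions looks.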

    Again, as in the case of the metrics, we can build custom benchmarks.

    Let’s assume that in our business case the marketing team contacts every client who is:

    • Over 50 y/o and
    • Not active anymore

    Following this rule, we can build this model:

    # Defining the business case-specific benchmark
    class BusinessBenchmark():
        @staticmethod
        def run_benchmark(df_train, df_test):
            df = df_test.copy()
            # Default: nobody is flagged as a likely churner
            df.loc[:,'y_hat'] = 0
            # Flag inactive members aged 50 or more, mirroring the marketing rule
            df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1
            return df['y_hat']
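
    To check that the rule behaves as expected, it can be applied to a handful of rows. The DataFrame below is made up for illustration; only the column names Age and IsActiveMember come from the code above.

```python
import pandas as pd

# Five hypothetical customers
df_test = pd.DataFrame({
    "Age": [62, 45, 50, 71, 38],
    "IsActiveMember": [0, 0, 0, 1, 1],
})

# Same rule as BusinessBenchmark, written as a single boolean expression
y_hat = ((df_test["IsActiveMember"] == 0) & (df_test["Age"] >= 50)).astype(int)

print(y_hat.tolist())
```

    Note that the condition Age >= 50 also flags a customer who is exactly 50, as the third row shows. If the marketing rule really means strictly over 50, the comparison would need to be > instead; this is the kind of detail worth confirming with the business.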

    Running the benchmark

    To run the benchmark I’ll use the following class. The entry point is the method compare_pred_with_benchmark() that, given a prediction, runs all the models and calculates all the metrics.

    import numpy as np

    class ChurnBinaryBenchmark():
        def __init__(
            self,
            metrics = None,
            benchmark_models = None,
            ):
            # Avoid mutable default arguments: fall back to fresh empty lists
            self.metrics = metrics if metrics is not None else []
            self.benchmark_models = benchmark_models if benchmark_models is not None else []

        def compare_pred_with_benchmark(
            self,
            df_train,
            df_test,
            my_predictions,
            ):

            # Metrics for the candidate predictions
            output_metrics = {
                'Prediction': self._calculate_metrics(df_test['y'], my_predictions)
            }
            dct_benchmarks = {}

            # Run every baseline and score it with the same metrics
            for model in self.benchmark_models:
                dct_benchmarks[model.__name__] = model.run_benchmark(df_train = df_train, df_test = df_test)
                output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], dct_benchmarks[model.__name__])

            return output_metrics

        def _calculate_metrics(self, y_true, y_pred):
            # One entry per metric, keyed by the metric function's name
            return {getattr(func, '__name__', 'Unknown') : func(y_true = y_true, y_pred = y_pred) for func in self.metrics}

    Now all we need is a prediction. For this example, I did some quick feature engineering and some hyperparameter tuning.

    The last step is just to run the benchmark:

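    import pandas as pd
    from sklearn.metrics import f1_score, precision_score, recall_score

    # Assumption: tp, tn, fp and fn are small helper functions that count the
    # confusion-matrix cells, and preds holds the tuned model's test-set predictions
    binary_benchmark = ChurnBinaryBenchmark(
        metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain],
        benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark]
        )

    res = binary_benchmark.compare_pred_with_benchmark(
        df_train=df_train,
        df_test=df_test,
        my_predictions=preds,
    )

    pd.DataFrame(res)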

    [Figure: Benchmark metrics comparison | Image by Author]

    This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the next steps of the process.
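
    One way to read such a table is to compare the candidate against the best baseline on each metric. The sketch below rebuilds a tiny version of the results dictionary from made-up labels; the “ContactEveryone” baseline and the “Best baseline” and “Lift” columns are hypothetical additions of mine, and taking the maximum only makes sense for metrics where higher is better.

```python
import numpy as np
import pandas as pd

def recall(y_true, y_pred):
    return float(np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1))

def financial_gain(y_true, y_pred):
    loss_from_fp = np.sum((y_pred == 1) & (y_true == 0)) * 250
    gain_from_tp = np.sum((y_pred == 1) & (y_true == 1)) * 1000
    return int(gain_from_tp - loss_from_fp)

# Toy test labels and predictions, made up for illustration
y_test = np.array([1, 1, 1, 0, 0, 0, 0, 0])
predictions = {
    "Prediction": np.array([1, 1, 1, 1, 0, 0, 0, 0]),
    "Benchmark - MajorityClass": np.zeros(8, dtype=int),
    "Benchmark - ContactEveryone": np.ones(8, dtype=int),
}

# Same layout as the benchmark output: {model name: {metric name: value}}
metrics = [recall, financial_gain]
res = {
    name: {m.__name__: m(y_true=y_test, y_pred=pred) for m in metrics}
    for name, pred in predictions.items()
}

table = pd.DataFrame(res)

# Best baseline per metric, and how far the candidate is above it
baseline_cols = [c for c in table.columns if c.startswith("Benchmark")]
table["Best baseline"] = table[baseline_cols].max(axis=1)
table["Lift"] = table["Prediction"] - table["Best baseline"]
print(table)
```

    Here the candidate only ties the best baseline on recall, because contacting everyone trivially finds every churner, but it is clearly ahead on financial gain. Looking at one metric in isolation would tell a very different story.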


    Some drawbacks

    As we’ve seen, there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to watch out for:

    1. Non-Informative Benchmark: When the metrics or models are poorly defined, the marginal impact of having a benchmark decreases. Always define meaningful baselines.
    2. Misinterpretation by Stakeholders: Communication with the client is essential; it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics.
    3. Overfitting to the Benchmark: You might end up trying to create features that are too specific, that might beat the benchmark but don’t generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best solution possible to the problem.
    4. Change of Objective: Objectives defined might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed.
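
    On the last point, one possible way to keep a business metric flexible is to build it from parameters instead of hard-coding the amounts. This is a sketch of my own, not part of the original implementation: make_financial_gain is a hypothetical factory, and the revised amounts are invented for the example.

```python
import numpy as np

def make_financial_gain(cost_fp=250, gain_tp=1000):
    """Build a financial-gain metric for a given cost/profit assumption."""
    def financial_gain(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        loss_from_fp = np.sum((y_pred == 1) & (y_true == 0)) * cost_fp
        gain_from_tp = np.sum((y_pred == 1) & (y_true == 1)) * gain_tp
        return int(gain_from_tp - loss_from_fp)
    # A distinct name per configuration keeps results apart when metrics
    # are labeled by their __name__
    financial_gain.__name__ = f"financial_gain_fp{cost_fp}_tp{gain_tp}"
    return financial_gain

current = make_financial_gain()                         # amounts used in this post
revised = make_financial_gain(cost_fp=400, gain_tp=800)  # hypothetical new assumptions

# Toy labels: one true positive and one false positive
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0])

print(current.__name__, current(y_true=y_true, y_pred=y_pred))
print(revised.__name__, revised(y_true=y_true, y_pred=y_pred))
```

    If the business revises the discount cost or the value of a retained customer, both versions of the metric can sit side by side in the same list of metrics, so earlier results stay comparable.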

    Final thoughts

    Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without evidence and ensure that every iteration brings real value.

    They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements.

    Here you can find a notebook with a full implementation from this blog post.


