
    Zero-Inflated Data: A Comparison of Regression Models

    By Editor Times Featured · September 6, 2025 · 13 min read


    I was recently working on a regression problem. I knew that the target I wanted to build a predictive model for was a count (i.e. 0, 1, 2, …). Consequently, I immediately thought of choosing a Generalized Linear Model (GLM) with a relevant discrete distribution, like the Poisson distribution or the negative binomial distribution. But everything did not go as well as expected. I mistook haste for speed.

    Zero-inflated data

    First of all, let us look at a dataset suitable for this post. I have chosen the results of the NextGen National Household Travel Survey [1]. The variable of interest, named “BIKETRANSIT”, is the number of “days in last 30 days biking used”, so it is an integer between 0 and 30, with 30 corresponding to daily users. Here is a histogram of the variable in question.

    Histogram of the number of biking days

    We can clearly see that the count data is zero-inflated. Most of the respondents have not used a bike a single day over the last 30 days. I have also noticed some interesting patterns: there tend to be more people reporting bike use on exactly 5, 10, 15, 20, 25, or 30 days compared to the adjacent numbers. This is probably because respondents prefer to choose round numbers when they are unsure of the exact count. Whatever the reason, in this post we will focus primarily on the issue of zero inflation by comparing models designed for zero-inflated count data.

    Several survey fields have been selected as independent variables to explain the number of bike days (e.g., age, gender, worker category, education level, household size, and district characteristics). I intentionally excluded features that count the number of days spent on other activities (such as using taxis or shared bikes), since some of them are highly correlated with the outcome of interest. I want the model to remain realistic: predicting bike usage over 30 days based on taxi, car, or public transport usage over the same period would not provide meaningful insights.

    Poisson regression limits

    Before introducing the zero-inflated model, I want to illustrate the limits of Poisson regression, which I first considered for this dataset. I have not looked at the negative binomial distribution in this section. Poisson regression assumes that the dependent random variable Y follows a Poisson distribution, conditional on the independent variables X and the parameters β.

    Poisson regression distribution model: $Y \mid X, \beta \sim \mathrm{Poisson}(\lambda)$ with $\lambda = \exp(X\beta)$

    So, let us take a look at some empirical distributions of Y∣X,β. Since I included many features, it is difficult to find a large number of observations with exactly the same values of X. To address this, I used a clustering algorithm, AgglomerativeClustering from scikit-learn [2], to group observations with similar feature profiles.
    First, I preprocessed the data so that it can feed the regression models and also the clustering algorithm. I do not want to spend too much time explaining all the preprocessing steps, as this post does not focus on them. The full preprocessing code is available in a repo [8]. Briefly, I encoded the categorical features using one-hot encoding. I also applied several preprocessing steps to the other features: imputing missing values, clipping outliers, and applying transformation functions where appropriate. Finally, I performed clustering on the transformed dataset.

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline(
        [
            # Standardize first: some numerical features, like age, span a much larger
            # range than the one-hot encoded features, and clustering is distance-based.
            ("scaler", StandardScaler()),
            # 100 clusters, so that the biggest groups still contain many observations.
            ("cluster", AgglomerativeClustering(n_clusters=100)),
        ]
    )
    # X_train_preprocessed is a numerical dataframe, after encoding the categorical features.
    cluster_id = pipe.fit_predict(X_train_preprocessed)

    Then, for each cluster, I estimated the parameter λ of the Poisson distribution with its unbiased estimator, the mean of the observed values in that group.

    λ estimator: $\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} y_i$, where $y_1, \dots, y_n$ are the observations of the group
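The per-cluster estimate can be sketched in a few lines of NumPy. This is only an illustration: the function name `poisson_rate_per_cluster` and the toy arrays are made up here, and it assumes cluster labels are non-negative integers, as returned by the clustering step.

```python
import numpy as np

def poisson_rate_per_cluster(y, cluster_id):
    """Return (lam_hat, n_obs): the sample mean of y and the size of each cluster."""
    y = np.asarray(y, dtype=float)
    cluster_id = np.asarray(cluster_id, dtype=int)
    n_obs = np.bincount(cluster_id)                # observations per cluster
    totals = np.bincount(cluster_id, weights=y)    # sum of y per cluster
    # Unused cluster ids (if any) get a rate of 0 instead of a division by zero.
    lam_hat = totals / np.maximum(n_obs, 1)
    return lam_hat, n_obs

# Toy example (made-up numbers, not the survey data).
y_toy = [0, 0, 3, 1, 0, 0, 5]
cluster_toy = [0, 0, 0, 0, 1, 1, 2]
lam_hat, n_obs = poisson_rate_per_cluster(y_toy, cluster_toy)
print(lam_hat, n_obs)
```

A cluster whose estimate is exactly 0 (like cluster 1 in the toy data) cannot be fitted with a Poisson distribution, since λ must be strictly positive.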

    I then plotted the empirical histograms together with the probability mass functions of the fitted Poisson distributions for several groups of observations. To assess the quality of the fit, I computed both the cross-entropy and the entropy, noting that entropy serves as a lower bound for cross-entropy according to Gibbs’ inequality [3]. A good model should produce a cross-entropy value close to the entropy (though slightly larger).
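As a rough illustration of this comparison, the sketch below computes the entropy of an empirical count distribution and its cross-entropy against a fitted Poisson pmf, both in nats. The helper name `entropy_and_cross_entropy` and the toy sample are assumptions made for this example, not the code used for the figures.

```python
import numpy as np
from scipy.stats import poisson

def entropy_and_cross_entropy(y, lam):
    """Entropy of the empirical pmf of y and its cross-entropy against Poisson(lam)."""
    y = np.asarray(y, dtype=int)
    emp = np.bincount(y) / y.size       # empirical pmf on 0..max(y)
    support = np.arange(emp.size)
    seen = emp > 0                      # values never observed contribute nothing
    entropy = -np.sum(emp[seen] * np.log(emp[seen]))
    cross_entropy = -np.sum(emp[seen] * poisson.logpmf(support[seen], lam))
    return entropy, cross_entropy

# Toy zero-inflated sample: eight zeros and two large counts (made-up numbers).
y_toy = np.array([0] * 8 + [5, 10])
h, ce = entropy_and_cross_entropy(y_toy, lam=y_toy.mean())
print(f"entropy={h:.3f}, cross-entropy={ce:.3f}")
```

On this toy sample the cross-entropy is several times larger than the entropy, which is the kind of gap that signals a poor Poisson fit.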

    For this analysis, I focused on three of the biggest groups, since parameter estimation is more reliable with larger sample sizes. This is particularly important here because the data is skewed due to zero inflation, making it necessary to collect many observations. Among the groups, two contain bike users, while one group (228 respondents) reported no bike usage at all. For this last group, no Poisson distribution was fitted, since the Poisson parameter must be strictly greater than zero. Finally, I used a vertical log scale in the plots to account for the zero inflation.

    I find it difficult to evaluate the quality of the fitted distribution by looking at the entropy and the cross-entropy alone. Nevertheless, I can see that the histogram and the probability mass function differ a lot. This is why I then considered the Zero-Inflated Poisson (ZIP) distribution.

    Models adapted to zero-inflated data

    Models designed for zero-inflated data aim to capture both the high probability of zeros and the relatively low probabilities of other events. I explored two main families of such models:

    • “Zero-inflated models […] model the zeros using a two-component mixture model. […] The probability of the variable being zero is determined by both the main distribution and the mixture weight”. “A zero-inflated model can only increase the probability of P(x = 0)” [5]. For notation, I use the following setup (slightly different from Wikipedia and other sources). Let X1 be a hidden variable following a Bernoulli distribution. In my notation, the probability of success is p (whereas Wikipedia uses 1-π). Let X2 be another hidden variable, independent of X1, following a distribution that allows zeros with nonzero probability. For my use case, I assume X2 is discrete. The observed variable is then defined as X=X1*X2, which leads to the following probability mass function: $P(X=0) = (1-p) + p\,P(X_2=0)$ and $P(X=k) = p\,P(X_2=k)$ for $k>0$.
      We can notice that X1 and X2 are partially hidden. When X=0, we cannot know the values of X1 and X2, but as soon as X>0, both variables are known (X1=1 and X2=X).
    • Hurdle models model the observable “random variable […] using two parts, the first of which is the probability of attaining the value 0, and the second part models the probability of the non-zero values” [5]. Unlike zero-inflated models, the second component must follow a distribution in which the probability of zero is exactly zero. Using the same notation as before, X1 models whether the observation is zero or non-zero (typically via a Bernoulli distribution). X2 follows a distribution that assigns no probability mass to zero. Consequently, the probability mass function is: $P(X=0) = 1-p$ and $P(X=k) = p\,P(X_2=k)$ for $k>0$.
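To make the difference between the two families concrete, here is a minimal sketch of both probability mass functions when X2 is Poisson (zero-inflated case) or zero-truncated Poisson (hurdle case). The function names and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

def zero_inflated_pmf(k, p, lam):
    """P(X = k) for X = X1 * X2 with X1 ~ Bernoulli(p) and X2 ~ Poisson(lam)."""
    k = np.asarray(k)
    base = p * poisson.pmf(k, lam)
    return np.where(k == 0, (1 - p) + base, base)

def hurdle_pmf(k, p, lam):
    """P(X = k) when X2 is a zero-truncated Poisson(lam), so P(X = 0) = 1 - p."""
    k = np.asarray(k)
    truncated = poisson.pmf(k, lam) / (1 - np.exp(-lam))
    return np.where(k == 0, 1 - p, p * truncated)

ks = np.arange(200)
print(zero_inflated_pmf(ks, 0.3, 4.0).sum(), hurdle_pmf(ks, 0.3, 4.0).sum())

# With p = 0.95 and lam = 0.5, the hurdle model puts less mass on zero than a
# plain Poisson(0.5), which a zero-inflated model can never do.
print(poisson.pmf(0, 0.5), zero_inflated_pmf(0, 0.95, 0.5), hurdle_pmf(0, 0.95, 0.5))
```

Both functions sum to one over the support, and the last line illustrates the quoted remark that a zero-inflated model can only increase the probability of zero, while a hurdle model is free to decrease it.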

    Zero-Inflated Poisson model

    Let us take a look at the Zero-Inflated Poisson model [4]. The ZIP probability mass function is:

    ZIP probability mass function: $P(X=0) = (1-p) + p\,e^{-\lambda}$ and $P(X=k) = p\,\frac{\lambda^k e^{-\lambda}}{k!}$ for $k \geq 1$

    It is now possible to extend the previous histograms and Poisson-fitted probability mass functions by adding the ZIP-fitted probability mass functions. To do this, estimators of the two parameters, p and λ, are required. I used the method of moments to derive these estimators: the first two moments provide a system of two equations with two unknowns, which can then be solved.

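    Method of moments to get the ZIP parameter estimators: $E[X] = p\lambda$ and $E[X^2] = p\lambda(1 + \lambda)$

    So the parameter estimators are:

    ZIP parameter estimators: $\hat{\lambda} = \frac{m_2}{m_1} - 1$ and $\hat{p} = \frac{m_1}{\hat{\lambda}} = \frac{m_1^2}{m_2 - m_1}$, where $m_1 = \frac{1}{n}\sum_i x_i$ and $m_2 = \frac{1}{n}\sum_i x_i^2$ are the first two empirical moments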
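The method of moments for a ZIP distribution, with E[X] = pλ and E[X²] = pλ(1 + λ), can be checked numerically on simulated data. The sketch below is an assumption-laden illustration: the function name, the sample size and the true parameter values are all made up.

```python
import numpy as np

def zip_method_of_moments(y):
    """Estimate (p, lam) of a ZIP distribution from its first two empirical moments."""
    y = np.asarray(y, dtype=float)
    m1 = y.mean()
    m2 = np.mean(y ** 2)
    if m1 == 0:
        raise ValueError("all observations are zero: lambda cannot be estimated")
    lam_hat = m2 / m1 - 1.0     # from E[X^2] / E[X] = 1 + lambda
    if lam_hat <= 0:
        raise ValueError("sample is not over-dispersed enough for a ZIP fit")
    p_hat = m1 / lam_hat        # from E[X] = p * lambda
    return p_hat, lam_hat

# Simulate X = X1 * X2 with known parameters and try to recover them.
rng = np.random.default_rng(0)
n = 200_000
true_p, true_lam = 0.3, 4.0
y = rng.binomial(1, true_p, size=n) * rng.poisson(true_lam, size=n)
p_hat, lam_hat = zip_method_of_moments(y)
print(p_hat, lam_hat)
```

On small groups the estimates can be noisy, or even invalid when the sample variance is too low, which is why the sketch raises an error instead of returning a negative λ.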

    Finally, I plotted the same two figures with the fitted ZIP probability mass functions added, as well as the cross-entropy measures.

    Both visual inspection and cross-entropy values show that the ZIP model fits the observed data better than the Poisson model. This provides an objective and quantifiable reason to prefer ZIP regression over Poisson regression.

    Model comparison

    Let us now compare several models. I split the data into training and test sets, but it was not immediately clear which evaluation metrics would be most appropriate. For instance, should I rely on Poisson deviance, even though the data is zero-inflated? Or mean squared error, which heavily penalizes outliers? In the end, I chose to use several metrics to better capture model performance: mean absolute error, Poisson deviance, and correlation. The models I evaluated are:

    • A naïve model predicting the mean value of the training set,
    • Linear regression (lr),
    • Poisson regression (pr),
    • Zero-inflated Poisson regression (zip),
    • A chained Logistic–Poisson regression (hurdle model, lr_pr),
    • A chained Logistic–Zero-Truncated Poisson regression (hurdle model, lr_tpr).
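One plausible reading of the chained Logistic–Poisson model (lr_pr) is sketched below with scikit-learn: a logistic regression predicts whether the count is positive, a Poisson regression is fitted on the positive observations only, and the two predictions are multiplied. The class name `ChainedHurdleRegressor`, its `alpha` default and the toy data are assumptions for illustration and may differ from the implementation benchmarked here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, PoissonRegressor

class ChainedHurdleRegressor:
    """Hypothetical two-stage model: P(y > 0 | x) times E[y | y > 0, x]."""

    def __init__(self, alpha=1e-3):
        self.alpha = alpha  # L2 penalty of the Poisson stage

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        positive = y > 0
        # Stage 1: is the count zero or not?
        self.clf_ = LogisticRegression(max_iter=1000).fit(X, positive.astype(int))
        # Stage 2: expected count, learnt from the positive observations only.
        self.reg_ = PoissonRegressor(alpha=self.alpha, max_iter=1000)
        self.reg_.fit(X[positive], y[positive])
        return self

    def predict(self, X):
        p_positive = self.clf_.predict_proba(X)[:, 1]
        return p_positive * self.reg_.predict(X)

# Toy data: counts are zero unless a Bernoulli gate opens (made-up generator).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 2))
gate = rng.random(2000) < 1 / (1 + np.exp(-X_demo[:, 0]))
y_demo = gate * (1 + rng.poisson(np.exp(0.5 + 0.3 * X_demo[:, 1])))
model = ChainedHurdleRegressor().fit(X_demo, y_demo)
pred = model.predict(X_demo)
print(pred[:5])
```

Because the second stage only sees positive counts, its prediction estimates the conditional mean of the non-zero values directly. The zero-truncated variant (lr_tpr) would instead need a custom likelihood for the second stage, which is not shown here.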

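    ZIP model

    Let us look at the ZIP regression implementation. First, the negative log likelihood of the observed data, denoted y, is:

    Negative log likelihood: $-\log L = -\sum_{i:\,y_i = 0} \log\big((1 - p_i) + p_i e^{-\lambda_i}\big) - \sum_{i:\,y_i > 0} \big(\log p_i + y_i \log \lambda_i - \lambda_i - \log y_i!\big)$, where the $\log y_i!$ term does not depend on the parameters and can be dropped when optimizing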

    The marginal likelihood of the observed data, P(Y=y), can be expressed analytically without the integral formulation of the joint distribution, P(Y=y, X1=x1). So it can be optimized directly, without needing the expectation-maximization algorithm [6]. The two distribution parameters p and λ are functions of the features X and the parameters of the model β that will be learnt. I have chosen to define p as the sigmoid of the dot product between X and β, and λ as the exponential of the dot product between X and β. To make the model more flexible, I use separate sets of parameters β: one for p and another for λ.

    ZIP parameter expressions: $p = \sigma(X\beta_p) = \frac{1}{1 + e^{-X\beta_p}}$ and $\lambda = \exp(X\beta_\lambda)$

    Moreover, I have added a prior on the parameters β to regularize the model, which is especially useful for the Poisson part, for which there are few observations because of the zero inflation. I have assumed a Normal prior, hence the L2 regularization terms added to the loss function. I have assumed two different priors, one on the β for the Bernoulli model and one on the β for the Poisson model, hence the two α hyperparameters, exposed as the alpha_b and alpha_p attributes of the model. I have tuned these values through hyperparameter optimization.

    I created a class that inherits from scikit-learn’s BaseEstimator. The Python implementation of the loss function is shown below (implemented within the class, hence the self argument):

    import numpy as np
    from scipy.special import xlogy

    # `sigmoid` and `exp` are link-function helpers exposing val() and jac();
    # they are not shown in this post.

    def _loss(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
        n_feat = X.shape[1]

        # split beta into two parts: one for bernoulli p and one for poisson lambda
        beta_p = beta[:n_feat]
        beta_lam = beta[n_feat:]

        # get bernoulli p and poisson lambda
        p = sigmoid.val(beta_p, X)
        lam = exp.val(beta_lam, X)

        # initialize negative log likelihood
        out = 0

        # y == 0
        y_e0_mask = np.where(y == 0)[0]
        out += np.sum(-np.log((1 - p) + p * np.exp(-lam))[y_e0_mask])

        # y > 0 (the constant log(y!) term is dropped)
        y_gt0_mask = np.where(y > 0)[0]
        out += np.sum(-np.log(p)[y_gt0_mask])
        out += np.sum(-xlogy(y, lam)[y_gt0_mask])
        out += np.sum(lam[y_gt0_mask])

        # prior: L2 penalty on each block of coefficients, intercepts excluded
        mask_b = np.ones_like(beta)
        mask_b[n_feat:] = 0
        mask_p = np.ones_like(beta)
        mask_p[:n_feat] = 0
        if self.fit_intercept:
            mask_b[n_feat - 1] = 0
            mask_p[2 * n_feat - 1] = 0
        out += 0.5 * self.alpha_b * np.sum((beta * mask_b) ** 2)
        out += 0.5 * self.alpha_p * np.sum((beta * mask_p) ** 2)

        return out

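    In order to optimize the loss function, I have also computed the Jacobian of the loss.

    Jacobian of the negative log likelihood, per observation: for $y_i = 0$, $\frac{\partial}{\partial p_i} = \frac{1 - e^{-\lambda_i}}{(1 - p_i) + p_i e^{-\lambda_i}}$ and $\frac{\partial}{\partial \lambda_i} = \frac{p_i e^{-\lambda_i}}{(1 - p_i) + p_i e^{-\lambda_i}}$; for $y_i > 0$, $\frac{\partial}{\partial p_i} = -\frac{1}{p_i}$ and $\frac{\partial}{\partial \lambda_i} = 1 - \frac{y_i}{\lambda_i}$. The chain rule through the sigmoid and exponential links then gives the derivatives with respect to β.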

    The Python implementation is:

    def _jac(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        n_feat = X.shape[1]

        # split beta into two parts: one for bernoulli p and one for poisson lambda
        beta_p = beta[:n_feat]
        beta_lam = beta[n_feat:]

        # get bernoulli p and poisson lambda
        p = sigmoid.val(beta_p, X)
        lam = exp.val(beta_lam, X)

        # y == 0 & beta_p
        jac_e0_p = np.expand_dims(
            np.where(
                y == 0,
                (1 - np.exp(-lam)) / ((1 - p) + p * np.exp(-lam)),
                np.zeros_like(y),
            ),
            axis=1,
        ) * sigmoid.jac(beta_p, X)
        # y == 0 & beta_lam
        jac_e0_lam = np.expand_dims(
            np.where(
                y == 0,
                p * np.exp(-lam) / ((1 - p) + p * np.exp(-lam)),
                np.zeros_like(y),
            ),
            axis=1,
        ) * exp.jac(beta_lam, X)

        # y > 0 & beta_p
        jac_gt0_p = np.expand_dims(
            np.where(y > 0, -1 / p, np.zeros_like(y)), axis=1
        ) * sigmoid.jac(beta_p, X)
        # y > 0 & beta_lam
        jac_gt0_lam = np.expand_dims(
            np.where(y > 0, 1 - y / lam, np.zeros_like(y)), axis=1
        ) * exp.jac(beta_lam, X)

        # per-observation jacobian of the negative log likelihood
        out = np.concatenate((jac_e0_p + jac_gt0_p, jac_e0_lam + jac_gt0_lam), axis=1)

        # jac for prior (intercepts are not penalized)
        mask_b = np.ones_like(beta)
        mask_b[n_feat:] = 0
        mask_p = np.ones_like(beta)
        mask_p[:n_feat] = 0
        if self.fit_intercept:
            mask_b[n_feat - 1] = 0
            mask_p[2 * n_feat - 1] = 0

        # sum over observations, then add the gradient of the L2 penalties
        return (
            np.sum(out, axis=0)
            + self.alpha_b * beta * mask_b
            + self.alpha_p * beta * mask_p
        )

    Unfortunately, the loss function is not convex, so a local minimum is not guaranteed to be a global minimum. I have chosen the limited-memory implementation of Broyden-Fletcher-Goldfarb-Shanno (L-BFGS-B) from scipy because it is faster than the gradient descent methods that I have tested.

    from scipy.optimize import minimize

    res = minimize(
        self._loss,
        np.zeros(2 * n_feat),
        args=(X, y),
        jac=self._jac,
        method="L-BFGS-B",
    )
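The loss, its Jacobian and the optimizer call can be exercised end to end on simulated data. The sketch below is a standalone simplification, not the class from the repo: it drops the priors, replaces the link helpers with `scipy.special.expit` and `np.exp`, and uses made-up true coefficients. It first compares the analytic gradient with finite differences, then fits the model.

```python
import numpy as np
from scipy.optimize import check_grad, minimize
from scipy.special import expit, xlogy

def zip_nll(beta, X, y):
    """Negative log likelihood of a ZIP regression (log(y!) dropped, no prior)."""
    n_feat = X.shape[1]
    p = expit(X @ beta[:n_feat])       # Bernoulli probability of the Poisson branch
    lam = np.exp(X @ beta[n_feat:])    # Poisson rate
    nll_zero = -np.log((1 - p) + p * np.exp(-lam))
    nll_pos = -np.log(p) - xlogy(y, lam) + lam
    return np.sum(np.where(y == 0, nll_zero, nll_pos))

def zip_nll_grad(beta, X, y):
    """Analytic gradient of zip_nll with respect to beta."""
    n_feat = X.shape[1]
    p = expit(X @ beta[:n_feat])
    lam = np.exp(X @ beta[n_feat:])
    denom = (1 - p) + p * np.exp(-lam)
    d_p = np.where(y == 0, (1 - np.exp(-lam)) / denom, -1 / p)
    d_lam = np.where(y == 0, p * np.exp(-lam) / denom, 1 - y / lam)
    # Chain rule through the links: dp/d(eta) = p (1 - p), dlam/d(eta) = lam.
    return np.concatenate([X.T @ (d_p * p * (1 - p)), X.T @ (d_lam * lam)])

# Simulated ZIP data with an intercept and one feature (made-up coefficients).
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.0, 1.0, 1.0, 0.5])    # [beta_p, beta_lam]
y = rng.binomial(1, expit(X @ true_beta[:2])) * rng.poisson(np.exp(X @ true_beta[2:]))

beta0 = np.zeros(4)
grad_error = check_grad(zip_nll, zip_nll_grad, beta0, X, y)
res = minimize(zip_nll, beta0, args=(X, y), jac=zip_nll_grad, method="L-BFGS-B")
print(grad_error, res.x)
```

Since the loss is not convex, a single start from zeros is not guaranteed to reach the global minimum in general; restarting from a few random initial points is a cheap safeguard.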

    The whole class is coded in a file of the shared repo.
    After a hyperparameter tuning phase to get the best regularization hyperparameters, I finally computed the chosen metrics on the test set. The fitting time is displayed along with the metrics.
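For reference, the three metrics can be computed with scikit-learn and NumPy as sketched below. The helper name `count_metrics` and the clipping threshold `eps` are assumptions: the Poisson deviance is only defined for strictly positive predictions, so predictions are clipped here before computing it, which matters for models such as linear regression that can predict negative values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_poisson_deviance

def count_metrics(y_true, y_pred, eps=1e-6):
    """Mean absolute error, mean Poisson deviance and Pearson correlation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        # Clip so that the deviance stays defined for non-positive predictions.
        "poisson_deviance": mean_poisson_deviance(y_true, np.clip(y_pred, eps, None)),
        "correlation": np.corrcoef(y_true, y_pred)[0, 1],
    }

# Toy predictions (made-up numbers).
metrics = count_metrics([0, 0, 2, 5], [0.5, 0.5, 2.0, 5.0])
print(metrics)
```

Note that the correlation is undefined (NaN) for a constant prediction such as the naïve mean model, since its predictions have zero variance.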

    Benchmark results

    Zero-inflated models, both ZIP and hurdle, achieve better metrics than the naïve model, linear regression, and standard Poisson regression. I initially expected a larger performance gap, given that the empirical histogram of the observed Y more closely resembles a ZIP distribution than a Poisson distribution. The improvement, however, comes at the cost of longer fitting times, particularly for the ZIP model. For this use case, hurdle models appear to offer the best compromise, delivering strong performance while keeping training time relatively low.

    One possible reason for the relatively modest improvement may be that the data does not strictly follow a ZIP distribution. To investigate this, I ran another benchmark using the same models on a synthetic dataset specifically generated to follow a ZIP distribution. This dataset was designed to have roughly the same number of observations and features as the original one, but with a target variable that follows a ZIP distribution by design.
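A synthetic ZIP-distributed target can be generated along the following lines. This is a hedged sketch, not the generator used for the benchmark: the function name `make_zip_dataset`, the coefficient scales and the intercepts are all arbitrary choices made for illustration.

```python
import numpy as np
from scipy.special import expit

def make_zip_dataset(n_samples=10_000, n_features=5, seed=0):
    """Draw dense features and a target that follows a ZIP regression model."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    beta_p = rng.normal(scale=0.5, size=n_features)
    beta_lam = rng.normal(scale=0.3, size=n_features)
    p = expit(X @ beta_p - 1.0)        # intercept of -1: most rows are structural zeros
    lam = np.exp(X @ beta_lam + 1.0)   # Poisson rate of the count branch
    y = rng.binomial(1, p) * rng.poisson(lam)
    return X, y, p, lam

X, y, p, lam = make_zip_dataset()
observed_zero_share = np.mean(y == 0)
expected_zero_share = np.mean((1 - p) + p * np.exp(-lam))
print(observed_zero_share, expected_zero_share)
```

Comparing the observed share of zeros with the share implied by the ZIP probability mass function is a quick sanity check that the generated target behaves as intended.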

    Benchmark results for a synthetic ZIP-distributed dataset

    When the target truly follows a ZIP distribution, the ZIP model outperforms all the other models considered. It is also worth noting that, in this synthetic setup, the features are no longer sparse (by design), which may help explain the reduction in fitting time.

    Conclusions

    Before choosing a statistical model, it is crucial to carefully analyze the dataset rather than relying solely on prior assumptions about its characteristics. Inspecting the empirical distribution, for example through histograms, often reveals insights that guide the choice of an appropriate probability model.

    This is particularly important for zero-inflated data, where standard models may struggle. A synthetic example with a zero-inflated Poisson (ZIP) distribution shows how the right model can provide a much better fit compared to alternatives, even when those alternatives are not entirely misguided.

    For zero-inflated datasets, models such as the zero-inflated Poisson or hurdle models are especially useful. While both can capture excess zeros effectively, hurdle models often offer comparable performance with faster training.

    Further reading

    While working on this topic and writing the post, I found this Medium post [7] that I highly recommend.

    References


