on a regression problem. I knew that the target I wanted to build a predictive model for was a count (i.e. 0, 1, 2, …). So I immediately thought of choosing a Generalized Linear Model (GLM) with a suitable discrete distribution, such as the Poisson distribution or the Negative Binomial distribution. But everything did not go as well as expected. I mistook haste for speed.
Zero inflated data
First, let us look at a dataset suitable for the post. I have chosen the results of the NextGen National Household Travel Survey [1]. The variable of interest, named “BIKETRANSIT”, is the number of “days in last 30 days biking used”, so this is an integer value between 0 and 30 for daily users. Here is a histogram of the variable in question.
We can clearly see that the count data is zero inflated. Most of the respondents have not used a bike a single day over the last 30 days. I have also noticed some interesting patterns: there tend to be more people reporting bike use on exactly 5, 10, 15, 20, 25, or 30 days compared to the adjacent numbers. This is probably because respondents prefer to choose round numbers when they are unsure of the precise count. Whatever the reason, in this post we will focus primarily on the issue of zero inflation by comparing models designed for zero-inflated count data.
Several survey fields were selected as independent variables to explain the number of bike days (e.g., age, gender, worker category, education level, household size, and district characteristics). I intentionally excluded features that count the number of days spent on other activities (such as using taxis or shared bikes), since some of them are highly correlated with the outcome of interest. I want the model to remain realistic: predicting bike usage over 30 days based on taxi, car, or public transport usage over the same period would not provide meaningful insights.
Poisson regression limits
Before introducing the zero inflated model, I would like to illustrate the limits of the Poisson regression, which I first considered for this dataset. I have not looked at the Negative Binomial distribution in this section. Poisson regression assumes that the dependent random variable Y follows a Poisson distribution, conditional on the independent variables X and the parameters β.
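Written out, and assuming the usual log link (the link function is not stated in the text above, and the per-observation symbols x_i and λ_i are introduced only for illustration), this assumption reads:

```latex
Y_i \mid x_i, \beta \sim \mathrm{Poisson}(\lambda_i),
\qquad \lambda_i = \exp\!\left(x_i^\top \beta\right),
\qquad
P(Y_i = k \mid x_i, \beta) = \frac{e^{-\lambda_i}\,\lambda_i^{k}}{k!},
\quad k = 0, 1, 2, \dots
```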

So, let us take a look at some empirical distributions of Y∣X,β. Since I included many features, it is difficult to find many observations with exactly the same values of X. To address this, I used a clustering algorithm (AgglomerativeClustering from scikit-learn [2]) to group observations with similar feature profiles.
First, I preprocessed the data so that it can feed the regression models and also the clustering algorithm. I do not want to spend too much time explaining all the preprocessing steps, as this post does not focus on them. The full preprocessing code is available on a repo [8]. Briefly, I encoded the categorical features using one-hot encoding. I also applied several preprocessing steps to the other features: imputing missing values, clipping outliers, and applying transformation functions where appropriate. Finally, I performed clustering on the transformed dataset.
from sklearn.cluster import AgglomerativeClustering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    [
        # I standardized the data: some numerical features, like age, have a wider
        # range than the one-hot encoded features, and clustering relies on distances.
        ("scaler", StandardScaler()),
        # I chose 100 clusters to have many observations in the biggest groups.
        ("cluster", AgglomerativeClustering(n_clusters=100)),
    ]
)
# Here X_train_preprocessed is a numerical dataframe, after encoding the categorical features.
cluster_id = pipe.fit_predict(X_train_preprocessed)
Then I estimated the parameter λ of the Poisson distribution for each cluster, using the unbiased estimator given by the mean of the observed values in that cluster.
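The per-cluster estimation can be sketched as follows. This is a minimal illustration rather than the code from the repo: the synthetic `y_train` and `cluster_id` arrays are assumptions standing in for the survey target and the labels returned by the clustering pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (assumptions): in the post these would be the training
# target and the labels returned by pipe.fit_predict(...).
rng = np.random.default_rng(0)
y_train = rng.poisson(lam=2.0, size=1_000)
cluster_id = rng.integers(0, 5, size=1_000)

groups = pd.DataFrame({"y": y_train, "cluster": cluster_id}).groupby("cluster")["y"]

# The sample mean is an unbiased estimator of the Poisson parameter,
# computed here independently within each cluster.
lambda_hat = groups.mean()
group_size = groups.size()

print(pd.DataFrame({"n": group_size, "lambda_hat": lambda_hat}))
```

A cluster whose mean is exactly zero cannot be described by a Poisson distribution, since the parameter must be strictly positive.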

I then plotted the empirical histograms along with the probability mass functions of the fitted Poisson distributions for several groups of observations. To assess the quality of the fit, I computed both the cross-entropy and the entropy, noting that entropy serves as a lower bound for cross-entropy according to Gibbs’ inequality [3]. A good model should produce a cross-entropy value close to the entropy (though slightly larger).
For this analysis, I focused on three of the largest groups, since parameter estimation is more reliable with larger sample sizes. This is particularly important here because the data is skewed due to zero inflation, making it necessary to collect many observations. Among the groups, two contain bike users, while one group (228 respondents) reported no bike usage at all. For this last group, no Poisson distribution was fitted, since the Poisson parameter must be strictly greater than zero. Finally, I used a vertical log scale in the plots to account for the zero inflation.
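One way to compute the two quantities is sketched below, assuming the empirical distribution q is taken over the observed values and the fitted distribution p is evaluated on the same support. The synthetic sample is an assumption used only to make the example runnable; the exact computation in the repo may differ.

```python
import numpy as np
from scipy.stats import poisson

# Synthetic sample (assumption) standing in for one cluster of observations.
rng = np.random.default_rng(0)
y = rng.poisson(lam=3.0, size=500)

# Empirical distribution q over the observed support.
values, counts = np.unique(y, return_counts=True)
q = counts / counts.sum()

# Fitted Poisson probabilities on the same support.
lam_hat = y.mean()
p = poisson.pmf(values, lam_hat)

entropy = -np.sum(q * np.log(q))        # H(q)
cross_entropy = -np.sum(q * np.log(p))  # H(q, p) >= H(q) by Gibbs' inequality

print(f"entropy={entropy:.3f}, cross_entropy={cross_entropy:.3f}")
```

The gap between the two values is the Kullback-Leibler divergence from the fitted distribution to the empirical one, so a smaller gap indicates a better fit.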

I find it difficult to evaluate the quality of the fitted distribution by looking at the entropy and the cross entropy. However, I can see that the histogram and the probability mass function differ a lot. This is why I then considered the Zero Inflated Poisson (ZIP) distribution.
Zero inflated data adapted models
Models designed for zero-inflated data aim to capture both the high probability of zeros and the relatively low probabilities of other events. I explored two main families of such models:
- “Zero-inflated models […] model the zeros using a two-component mixture model. […] The probability of the variable being zero is determined by both the main distribution and the mixture weight”. “A zero-inflated model can only increase the probability of P(x = 0)” [5]. For notation, I use the following setup (slightly different from Wikipedia and other sources). Let X1 be a hidden variable following a Bernoulli distribution. In my notation, the probability of success is p (while Wikipedia uses 1-π). Let X2 be another hidden variable following a distribution that allows zeros with nonzero probability. For my use case, I assume X2 is discrete. The observed variable is then defined as X=X1*X2, which leads to the following probability mass function:
We can notice that X1 and X2 are partially hidden. When X=0, we cannot know the values of X1 and X2, but as soon as X>0, both variables X1 and X2 are known.
- Hurdle models model the observable “random variable […] using two parts, the first of which is the probability of attaining the value 0, and the second part models the probability of the non-zero values” [5]. Unlike zero-inflated models, the second component must follow a distribution in which the probability of zero is exactly zero. Using the same notation as before, X1 models whether the observation is zero or non-zero (typically via a Bernoulli distribution). X2 follows a distribution that assigns no probability mass to zero. Consequently, the probability mass function is:
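Both probability mass functions can be reconstructed from the definitions above. The following is a sketch under the stated notation (X = X1·X2 with X1 Bernoulli of success probability p), where k denotes a strictly positive count:

```latex
% Zero-inflated model: X_2 may be zero with positive probability
P(X = x) =
\begin{cases}
(1 - p) + p\,P(X_2 = 0) & \text{if } x = 0,\\[2pt]
p\,P(X_2 = k) & \text{if } x = k \ge 1.
\end{cases}

% Hurdle model: P(X_2 = 0) = 0
P(X = x) =
\begin{cases}
1 - p & \text{if } x = 0,\\[2pt]
p\,P(X_2 = k) & \text{if } x = k \ge 1.
\end{cases}
```

In the zero-inflated case the zero probability is always at least P(X2 = 0), which is the sense in which such a model can only increase the probability of zero.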

Zero Inflated Poisson model
Let us have a look at the Zero Inflated Poisson model [4]. The ZIP probability mass function is:
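With the notation of this post (p the Bernoulli success probability and X2 following a Poisson distribution of parameter λ), the mass function can be written as below. This is a reconstruction from the definitions above, so the parametrization differs from sources that use π = 1 − p:

```latex
P(X = x \mid p, \lambda) =
\begin{cases}
(1 - p) + p\,e^{-\lambda} & \text{if } x = 0,\\[4pt]
p\,\dfrac{e^{-\lambda}\,\lambda^{k}}{k!} & \text{if } x = k \ge 1.
\end{cases}
```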

It is now possible to extend the previous histograms and Poisson-fitted probability mass functions by adding the ZIP-fitted probability mass functions. To do that, estimators of the two parameters, p and λ, are required. I used the method of moments to derive these estimators: the first two moments provide a system of two equations with two unknowns, which can then be solved.

So the parameter estimators are:
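Under the notation of this post, the first two moments of a ZIP variable are E[X] = pλ and E[X²] = pλ(1 + λ), which gives λ = E[X²]/E[X] − 1 and p = E[X]/λ. The sketch below applies these formulas to a simulated sample; the function name `zip_moment_estimators` and the synthetic data are illustrative assumptions, not the code from the repo.

```python
import numpy as np


def zip_moment_estimators(y: np.ndarray) -> tuple[float, float]:
    """Method-of-moments estimates (p, lam) for X = X1 * X2,
    with X1 ~ Bernoulli(p) and X2 ~ Poisson(lam)."""
    y = np.asarray(y, dtype=float)
    m1 = y.mean()         # estimates E[X]   = p * lam
    m2 = (y ** 2).mean()  # estimates E[X^2] = p * lam * (1 + lam)
    lam_hat = m2 / m1 - 1.0
    p_hat = m1 / lam_hat
    return float(p_hat), float(lam_hat)


# Simulated ZIP sample with known parameters (assumption, for illustration).
rng = np.random.default_rng(0)
n = 200_000
y = rng.binomial(1, 0.3, size=n) * rng.poisson(4.0, size=n)

p_hat, lam_hat = zip_moment_estimators(y)
print(f"p_hat={p_hat:.3f}, lam_hat={lam_hat:.3f}")
```

These estimators are undefined when the sample mean is zero, and they can return values outside the valid range (for instance p above 1) on small or under-dispersed samples, so a real implementation would need to guard against those cases.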

Finally, I have plotted the same two figures with the fitted ZIP distribution probability mass functions as well as the cross entropy measures.

Both visual inspection and cross-entropy values show that the ZIP model fits the observed data better than the Poisson model. This provides an objective and quantifiable reason to favor ZIP regression over Poisson regression.
Model comparison
Let us now compare several models. I split the data into training and test sets, but it was not immediately clear which evaluation metrics would be most appropriate. For instance, should I rely on Poisson deviance, even though the data is zero-inflated? Or mean squared error, which heavily penalizes outliers? In the end, I chose to use several metrics to better capture model performance: mean absolute error, Poisson deviance, and correlation. The models I evaluated are:
- A naïve model predicting the mean value of the training set,
- Linear regression (lr),
- Poisson regression (pr),
- Zero-inflated Poisson regression (zip),
- A chained Logistic–Poisson regression (hurdle model, lr_pr),
- A chained Logistic–Zero-Truncated Poisson regression (hurdle model, lr_tpr).
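To make the chained hurdle idea and the three metrics concrete, here is a minimal sketch built from scikit-learn estimators. It rests on several assumptions: the data is synthetic, the Poisson step is fitted on the non-zero observations only, and the prediction is taken as P(y > 0 | x) · E[y | y > 0, x]. The actual lr_pr implementation in the repo may chain the two steps differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, PoissonRegressor
from sklearn.metrics import mean_absolute_error, mean_poisson_deviance
from sklearn.model_selection import train_test_split

# Synthetic zero-inflated data (assumption), only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 3))
is_user = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 1.0))))
y = is_user * rng.poisson(np.exp(1.0 + 0.3 * X[:, 1]))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: logistic regression for zero versus non-zero.
clf = LogisticRegression().fit(X_train, y_train > 0)
# Step 2: Poisson regression fitted on the non-zero observations only.
pos = y_train > 0
reg = PoissonRegressor(alpha=1e-3).fit(X_train[pos], y_train[pos])

# Expected count: P(y > 0 | x) * E[y | y > 0, x].
y_pred = clf.predict_proba(X_test)[:, 1] * reg.predict(X_test)
# Naive baseline: the mean of the training target.
y_naive = np.full(y_test.shape, y_train.mean())

for name, pred in [("naive", y_naive), ("lr_pr", y_pred)]:
    mae = mean_absolute_error(y_test, pred)
    dev = mean_poisson_deviance(y_test, pred)
    print(f"{name}: MAE={mae:.3f}, Poisson deviance={dev:.3f}")
print("correlation (lr_pr):", np.corrcoef(y_test, y_pred)[0, 1])
```

The correlation is only reported for the hurdle model because it is undefined for a constant prediction such as the naive baseline.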
ZIP model
Let us look at the ZIP regression implementation. First, the negative log likelihood of the observed data, noted y, is:
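Starting from the ZIP probability mass function in this post's notation, the negative log likelihood can be reconstructed as follows (a sketch, with p_i and λ_i denoting the parameters attached to observation i):

```latex
-\log \mathcal{L}(y)
= -\sum_{i \,:\, y_i = 0} \log\!\Big( (1 - p_i) + p_i\, e^{-\lambda_i} \Big)
\;+\; \sum_{i \,:\, y_i > 0} \Big( -\log p_i + \lambda_i - y_i \log \lambda_i + \log (y_i!) \Big)
```

The term log(y_i!) does not depend on the parameters, so it can be dropped during optimization.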

The marginal likelihood of the observed data, P(Y=y), can be expressed analytically without the integral formulation of the joint distribution, P(Y=y, X1=x1). So it can be optimized directly without needing to use the expectation maximization algorithm [6]. The two distribution parameters p and λ are functions of the features X and the parameters of the model β that will be learnt. I have chosen to define p as the sigmoid of the dot product between X and β, and λ as the exponential of the dot product between X and β. To make the model more flexible, I use separate sets of parameters β: one for p and another for λ.
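In symbols, with the superscripts (p) and (λ) introduced here only to distinguish the two parameter sets, the two link functions read:

```latex
p_i = \sigma\!\left(x_i^\top \beta^{(p)}\right)
    = \frac{1}{1 + \exp\!\left(-x_i^\top \beta^{(p)}\right)},
\qquad
\lambda_i = \exp\!\left(x_i^\top \beta^{(\lambda)}\right)
```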

Moreover, I have added a prior on the parameters β to regularize the model, which is especially useful for the Poisson part, for which there are few observations because of the zero inflation. I have assumed a Normal prior, hence the L2 regularization terms added to the loss function. I have assumed two different priors, one on the β for the Bernoulli model and one on the β for the Poisson model, hence the two α hyperparameters, exposed as the alpha_b and alpha_p attributes of the model. I have optimized these values through a hyperparameter optimization.
I created a class that inherits from scikit-learn’s BaseEstimator. The Python implementation of the loss function is shown below (implemented inside the class, hence the self argument):
# module-level imports used by the method
import numpy as np
from scipy.special import xlogy

def _loss(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    n_feat = X.shape[1]
    # split beta into two parts: one for bernoulli p and one for poisson lambda
    beta_p = beta[:n_feat]
    beta_lam = beta[n_feat:]
    # get bernoulli p and poisson lambda
    p = sigmoid.val(beta_p, X)
    lam = exp.val(beta_lam, X)
    # initialize negative log likelihood
    out = 0
    # y == 0
    y_e0_mask = np.where(y == 0)[0]
    out += np.sum(-np.log((1 - p) + p * np.exp(-lam))[y_e0_mask])
    # y > 0 (the constant log(y!) term is omitted)
    y_gt0_mask = np.where(y > 0)[0]
    out += np.sum(-np.log(p)[y_gt0_mask])
    out += np.sum(-xlogy(y, lam)[y_gt0_mask])
    out += np.sum(lam[y_gt0_mask])
    # prior: L2 penalties, with the intercepts (last column of each part) excluded
    mask_b = np.ones_like(beta)
    mask_b[n_feat:] = 0
    mask_p = np.ones_like(beta)
    mask_p[:n_feat] = 0
    if self.fit_intercept:
        mask_b[n_feat - 1] = 0
        mask_p[2 * n_feat - 1] = 0
    out += 0.5 * self.alpha_b * np.sum((beta * mask_b) ** 2)
    out += 0.5 * self.alpha_p * np.sum((beta * mask_p) ** 2)
    return out
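The loss relies on two helpers, `sigmoid` and `exp`, whose definitions are not shown in this post. The sketch below is a hypothetical reconstruction inferred only from how they are called (`val(beta, X)` returning one value per observation, `jac(beta, X)` returning one row per observation and one column per feature); the actual helpers in the repo may be written differently.

```python
import numpy as np
from scipy.special import expit


class sigmoid:
    """Hypothetical helper: p_i = sigmoid(x_i . beta)."""

    @staticmethod
    def val(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
        return expit(X @ beta)  # shape (n_samples,)

    @staticmethod
    def jac(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
        s = expit(X @ beta)
        # d sigmoid(z) / d beta = sigmoid(z) * (1 - sigmoid(z)) * x
        return (s * (1.0 - s))[:, None] * X  # shape (n_samples, n_feat)


class exp:
    """Hypothetical helper: lam_i = exp(x_i . beta)."""

    @staticmethod
    def val(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
        return np.exp(X @ beta)

    @staticmethod
    def jac(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
        # d exp(z) / d beta = exp(z) * x
        return np.exp(X @ beta)[:, None] * X


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
beta = rng.normal(size=3)
print(sigmoid.val(beta, X).shape, sigmoid.jac(beta, X).shape)  # (4,) (4, 3)
```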
In order to optimize the loss objective function, I have also computed the Jacobian of the loss.

The Python implementation is:
def _jac(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    n_feat = X.shape[1]
    # split beta into two parts: one for bernoulli p and one for poisson lambda
    beta_p = beta[:n_feat]
    beta_lam = beta[n_feat:]
    # get bernoulli p and poisson lambda
    p = sigmoid.val(beta_p, X)
    lam = exp.val(beta_lam, X)
    # y == 0 & beta_p
    jac_e0_p = np.expand_dims(
        np.where(
            y == 0,
            (1 - np.exp(-lam)) / ((1 - p) + p * np.exp(-lam)),
            np.zeros_like(y),
        ),
        axis=1,
    ) * sigmoid.jac(beta_p, X)
    # y == 0 & beta_lam
    jac_e0_lam = np.expand_dims(
        np.where(
            y == 0,
            p * np.exp(-lam) / ((1 - p) + p * np.exp(-lam)),
            np.zeros_like(y),
        ),
        axis=1,
    ) * exp.jac(beta_lam, X)
    # y > 0 & beta_p
    jac_gt0_p = np.expand_dims(
        np.where(y > 0, -1 / p, np.zeros_like(y)), axis=1
    ) * sigmoid.jac(beta_p, X)
    # y > 0 & beta_lam
    jac_gt0_lam = np.expand_dims(
        np.where(y > 0, 1 - y / lam, np.zeros_like(y)), axis=1
    ) * exp.jac(beta_lam, X)
    # initialize jac (one row per observation)
    out = np.concatenate((jac_e0_p + jac_gt0_p, jac_e0_lam + jac_gt0_lam), axis=1)
    # jac for prior
    mask_b = np.ones_like(beta)
    mask_b[n_feat:] = 0
    mask_p = np.ones_like(beta)
    mask_p[:n_feat] = 0
    if self.fit_intercept:
        mask_b[n_feat - 1] = 0
        mask_p[2 * n_feat - 1] = 0
    return (
        np.sum(out, axis=0)
        + self.alpha_b * beta * mask_b
        + self.alpha_p * beta * mask_p
    )
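A hand-derived Jacobian is easy to get wrong, so it is worth comparing it with a finite-difference approximation. The sketch below does this with `scipy.optimize.check_grad` on a standalone, unregularized version of the ZIP loss. The function names `zip_nll` and `zip_nll_grad` and the synthetic data are assumptions made for this illustration, not part of the class described in this post.

```python
import numpy as np
from scipy.optimize import check_grad
from scipy.special import expit, xlogy


def zip_nll(beta, X, y):
    """Unregularized ZIP negative log likelihood (constant log(y!) dropped)."""
    n_feat = X.shape[1]
    p = expit(X @ beta[:n_feat])
    lam = np.exp(X @ beta[n_feat:])
    nll_zero = -np.log((1 - p) + p * np.exp(-lam))
    nll_pos = -np.log(p) - xlogy(y, lam) + lam
    return np.sum(np.where(y == 0, nll_zero, nll_pos))


def zip_nll_grad(beta, X, y):
    """Analytical gradient of zip_nll with respect to beta."""
    n_feat = X.shape[1]
    p = expit(X @ beta[:n_feat])
    lam = np.exp(X @ beta[n_feat:])
    denom = (1 - p) + p * np.exp(-lam)
    d_p = np.where(y == 0, (1 - np.exp(-lam)) / denom, -1 / p)       # dNLL/dp
    d_lam = np.where(y == 0, p * np.exp(-lam) / denom, 1 - y / lam)  # dNLL/dlam
    grad_p = (d_p * p * (1 - p)) @ X  # chain rule through the sigmoid
    grad_lam = (d_lam * lam) @ X      # chain rule through the exponential
    return np.concatenate((grad_p, grad_lam))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.binomial(1, 0.4, size=200) * rng.poisson(3.0, size=200)
beta0 = rng.normal(scale=0.1, size=6)

# Norm of the difference between analytical and finite-difference gradients.
err = check_grad(zip_nll, zip_nll_grad, beta0, X, y)
print(f"gradient check error: {err:.2e}")
```

A value that is small relative to the gradient norm suggests that the analytical derivatives match the loss.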
Unfortunately the loss function is not convex, so a local minimum is not guaranteed to be a global minimum. I have chosen the limited-memory implementation of Broyden-Fletcher-Goldfarb-Shanno (L-BFGS-B) from scipy because it is faster than the gradient descent methods that I have tested.
from scipy.optimize import minimize

res = minimize(
    self._loss,
    np.zeros(2 * n_feat),
    args=(X, y),
    jac=self._jac,
    method="L-BFGS-B",
)
The whole class is coded in this file from the shared repo.
After performing a hyperparameter tuning phase to get the best regularization hyperparameters, I have finally computed the selected metrics on the test set. The fitting time is displayed along with the metrics.

Zero-inflated models, both ZIP and hurdle, achieve better metrics than the naïve model, linear regression, and standard Poisson regression. I initially expected a larger performance gap, given that the empirical histogram of the observed Y more closely resembles a ZIP distribution than a Poisson distribution. The improvement, however, comes at the cost of longer fitting times, particularly for the ZIP model. For this use case, hurdle models appear to offer the best compromise, delivering strong performance while keeping training time relatively low.
One possible reason for the relatively modest improvement may be that the data does not strictly follow a ZIP distribution. To investigate this, I ran another benchmark using the same models on a synthetic dataset specifically generated to follow a ZIP distribution. This dataset was designed to have roughly the same number of observations and features as the original one, but with a target variable that follows a ZIP distribution by design.
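A dataset of this kind can be generated along the following lines. This is a minimal sketch: the sizes, the coefficient scales, and the Gaussian features are placeholder assumptions, not the values used for the benchmark.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(42)
n_obs, n_feat = 10_000, 20  # placeholder sizes, not those of the survey

X = rng.normal(size=(n_obs, n_feat))           # dense features
beta_p = rng.normal(scale=0.5, size=n_feat)    # "true" Bernoulli coefficients
beta_lam = rng.normal(scale=0.2, size=n_feat)  # "true" Poisson coefficients

p = expit(X @ beta_p)       # P(X1 = 1 | x)
lam = np.exp(X @ beta_lam)  # Poisson rate of X2 given x

x1 = rng.binomial(1, p)
x2 = rng.poisson(lam)
y = x1 * x2                 # observed ZIP target

print("share of zeros:", np.mean(y == 0))
```

Because the target is generated from the same sigmoid and exponential links as the ZIP regression, that model is correctly specified on this data by construction.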

When the target truly follows a ZIP distribution, the ZIP model outperforms all the other models considered. It is also worth noting that, in this synthetic setup, the features are no longer sparse (by design), which may help explain the reduction in fitting time.
Conclusions
Before choosing a statistical model, it is essential to carefully analyze the dataset rather than relying solely on prior assumptions about its characteristics. Inspecting the empirical distribution (for example through histograms) often reveals insights that guide the choice of an appropriate probability model.
This is particularly important for zero-inflated data, where standard models may struggle. A synthetic example with a zero-inflated Poisson (ZIP) distribution shows how the right model can provide a much better fit compared to alternatives, even when those alternatives are not entirely misguided.
For zero-inflated datasets, models such as the zero-inflated Poisson or hurdle models are especially useful. While both can capture excess zeros effectively, hurdle models often offer comparable performance with faster training.
Further readings
When working on this topic and writing the post, I found this Medium post [7] that I highly recommend.

