Building new GLMs
Over the next few sections we will learn more about GLMs, but we will discuss only a few more combinations of distribution and link function in depth. Many other combinations must remain unexplored for now. Still, one of the great advantages of GLMs is their flexibility — their “plug and play” construction — and I hope that readers who become familiar with logistic regression, probit regression, and count models will feel emboldened to try new GLM distributions and link functions on their own.
Choosing a distributional assumption for Y
Our first use case for GLMs involved Bernoulli data — 1s and 0s. This data cannot be easily confused with samples from other distributions. However, in the broader world of GLM uses, we will sometimes see data of uncertain distribution, especially if the distribution is continuous. Before we begin modeling, how would we know if the data are conditionally normal, conditionally gamma, conditionally exponential, etc.?
We have two ways forward. One is to rely upon prior scholarship and context about the data. Many use cases we might encounter have been studied before. The gamma distribution, for example, is uncommon in introductory textbooks but used more often in finance, insurance, medical, and environmental disciplines to model times and amounts which aggregate multiple exponentially distributed components. If we were modeling rainfall collections in a reservoir or reinsurer liabilities in the wake of a hurricane, we might naturally start with a gamma assumption, based purely on our understanding of the world and the context of our data.1
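As a quick illustration of that aggregation idea (a simulation sketch with made-up parameters, not data from this text): a sum of independent exponential waiting times follows a gamma distribution, which we can check numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
k, rate = 3, 2.0  # hypothetical number of components and exponential rate

# The sum of k independent Exponential(rate) draws follows Gamma(shape=k, rate=rate)
sums = rng.exponential(scale=1 / rate, size=(100_000, k)).sum(axis=1)

print(sums.mean())  # close to the gamma mean k / rate = 1.5
print(sums.var())   # close to the gamma variance k / rate**2 = 0.75
```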
The second way forward is to simply use what works. Examine the data closely during the EDA phase, and build a set of testable assumptions about your data’s distribution. Notice whether the values are symmetric around the mean or skewed. Notice whether the data show heteroskedasticity or not. Pick one or more candidate distributions and compare their fits using likelihood-based metrics such as AIC. Recall George Box’s adage: all models are wrong; some are useful.
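To make “use what works” concrete, here is a sketch of comparing candidate distributions by AIC during EDA. The data and the candidate set are hypothetical; the fitting uses SciPy's maximum-likelihood `fit` method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.gamma(shape=2.0, scale=3.0, size=1_000)  # hypothetical right-skewed, positive data

# Fit each candidate by maximum likelihood, then compare AIC = 2k - 2 log L
candidates = {"normal": stats.norm, "gamma": stats.gamma, "exponential": stats.expon}
aics = {}
for name, dist in candidates.items():
    params = dist.fit(y)
    loglik = dist.logpdf(y, *params).sum()
    aics[name] = 2 * len(params) - 2 * loglik

for name, aic in sorted(aics.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} AIC = {aic:10.1f}")  # lower is better
```

Here the gamma should win, since the data were simulated from a gamma — with real data, the ranking is an empirical question.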
Choosing a link function for the linear predictor
After we pick a distribution, we next need to choose the link function \(g(\mu)\) which relates the linear predictor \(\mathbf{X} \boldsymbol{\beta}\) to the mean response. We are not always provided guidance on the “right” link function. When I described the logistic function above, I tried to motivate why we might choose it for modeling Bernoulli data: it produces an S-shaped curve demonstrating diminishing returns, and it transforms a mean located on the [0, 1] interval into a boundless range better-suited for linear modeling. But many functions will do those things for us (we will study another one in the next section). Why choose the logistic function specifically?
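The two properties mentioned above — boundedness and diminishing returns — are easy to see numerically. This is a small sketch of the logistic (inverse-logit) function on a few hand-picked linear-predictor values:

```python
import numpy as np

def logistic(eta):
    """Inverse of the logit link: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

eta = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])  # hypothetical linear-predictor values
mu = logistic(eta)
print(mu)  # always strictly between 0 and 1

# Diminishing returns: a one-unit step in eta moves mu far less in the tails
center_step = logistic(1.0) - logistic(0.0)  # large change near the center
tail_step = logistic(6.0) - logistic(5.0)    # tiny change in the tail
print(center_step, tail_step)
```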
There are two ways to pick a link function, one from theory and one from expediency. The theoretical choice follows from a close study of probability distributions and their PDFs/PMFs. Many of the distributions we have studied can be shown to be specific forms of a more general pattern called an exponential family of distributions. (Which is different from the exponential distribution… in fact, the exponential distribution is a specific example of an exponential family.) Past research on this more general form shows that each such distribution can be associated with a “canonical” link function. Using this canonical link function in a GLM brings some nice properties, including:
- The average of the residuals (on the response scale) will be zero, provided the model includes an intercept.
- Two common methods for solving for the unknown parameters, Newton’s Method and Fisher Scoring, can be shown to be mathematically identical.
- The vector \(\mathbf{X}^T \boldsymbol{y}\) becomes a sufficient statistic for the model, meaning that it contains all the information needed to recover the best estimates of the parameters.
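The first of those properties is easy to check numerically. Below is a minimal sketch — synthetic data and a hand-rolled iteratively reweighted least squares (IRLS) loop, not code from this text — fitting a Poisson GLM with its canonical log link. With an intercept in the model, the raw residuals sum to (numerically) zero at convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
beta_true = np.array([0.5, 0.3])                       # hypothetical coefficients
y = rng.poisson(np.exp(X @ beta_true))                 # Poisson response, log link

# IRLS for Poisson regression with the canonical log link
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)            # mean function (inverse of the log link)
    z = eta + (y - mu) / mu     # working response
    W = mu                      # IRLS weights: Var(Y) = mu for the Poisson
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

residual_sum = (y - np.exp(X @ beta)).sum()
print(beta)          # close to beta_true
print(residual_sum)  # numerically zero: the canonical-link property above
```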
Because of these properties, statisticians often privilege the canonical link function for each distribution above other choices. The canonical link function seems like it was made to fit the distribution best, like a key and a keyhole.
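To see why the canonical link feels made for its distribution, consider the Bernoulli case. This is the standard exponential-family derivation, not something specific to this text. Rewriting the Bernoulli PMF:

\[
p(y; \mu) = \mu^{y} (1 - \mu)^{1 - y} = \exp\!\left( y \log \frac{\mu}{1 - \mu} + \log(1 - \mu) \right).
\]

The “natural parameter” multiplying \(y\) is \(\log(\mu/(1 - \mu))\) — exactly the logit. The canonical link is, by definition, the function mapping the mean onto this natural parameter, which is why the logit and the Bernoulli distribution fit together like key and keyhole.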
The expedient choice, however, notes that our predictor variables are not required to relate to the mean response in the manner described by any particular link function. Sometimes they do, sometimes they don’t, sometimes they kinda-sorta do. When you read the advantages of the canonical link listed above, you might have thought, “but I don’t really care about these supposed benefits…” You wouldn’t be alone! In many cases, experienced practitioners will pick a different link function. Reasons for choosing non-canonical link functions include:
- Better model fit to the data
- Greater interpretability for the betas, or simply a different interpretation
- Avoiding predictions for the mean response which are not possible under our distributional assumption for \(Y\)2
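The third point is worth a quick numeric illustration (with hypothetical linear-predictor values, not data from this text). For the exponential distribution, compare its canonical negative-inverse link with a log link:

```python
import numpy as np

xb = np.array([-2.0, -0.5, 0.5, 2.0])  # hypothetical linear-predictor values

mu_neg_inverse = -1.0 / xb  # canonical link's mean function: mu = -1 / (x beta)
mu_log = np.exp(xb)         # log link's mean function: mu = exp(x beta)

print(mu_neg_inverse)  # contains negative "means" -- impossible for exponential data
print(mu_log)          # always positive, so every prediction is at least possible
```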
The table below describes some of the more common distribution-link pairings, including some non-canonical links. The canonical link for each distribution is listed first:
| Distribution | Link name | Link function | Mean function |
|---|---|---|---|
| Normal | Identity | \(\boldsymbol{x \beta} = \mu\) | \(\mu = \boldsymbol{x \beta}\) |
| | Log | \(\boldsymbol{x \beta} = \log \mu\) | \(\mu = e^{\boldsymbol{x \beta}}\) |
| Bernoulli | Logit | \(\boldsymbol{x \beta} = \log (\mu/(1 - \mu))\) | \(\mu = e^{\boldsymbol{x \beta}}/(1 + e^{\boldsymbol{x \beta}})\) |
| | Probit | \(\boldsymbol{x \beta} = \Phi^{-1} (\mu)\) | \(\mu = \Phi (\boldsymbol{x \beta})\) |
| | Complementary log-log | \(\boldsymbol{x \beta} = \log(-\log(1 - \mu))\) | \(\mu = 1-e^{-e^{\boldsymbol{x \beta}}}\) |
| | Identity | \(\boldsymbol{x \beta} = \mu\) | \(\mu = \boldsymbol{x \beta}\) |
| Poisson | Log | \(\boldsymbol{x \beta} = \log \mu\) | \(\mu = e^{\boldsymbol{x \beta}}\) |
| | Square root | \(\boldsymbol{x \beta} = \sqrt\mu\) | \(\mu = (\boldsymbol{x \beta})^2\) |
| | Identity | \(\boldsymbol{x \beta} = \mu\) | \(\mu = \boldsymbol{x \beta}\) |
| Exponential | Negative inverse | \(\boldsymbol{x \beta} = -1/\mu\) | \(\mu = -1/(\boldsymbol{x \beta})\) |
| Negative binomial | Log | \(\boldsymbol{x \beta} = \log \mu\) | \(\mu = e^{\boldsymbol{x \beta}}\) |
| Gamma | Inverse | \(\boldsymbol{x \beta} = 1/\mu\) | \(\mu = 1/(\boldsymbol{x \beta})\) |
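In each row, the link function and the mean function are inverses of one another, which is easy to verify numerically. Here is a sketch for the three Bernoulli links, using a few arbitrary means in (0, 1):

```python
import numpy as np
from scipy.stats import norm

mu = np.array([0.05, 0.25, 0.5, 0.75, 0.95])  # means in (0, 1) for the Bernoulli rows

# Logit -> logistic
xb = np.log(mu / (1 - mu))
logit_roundtrip = np.allclose(np.exp(xb) / (1 + np.exp(xb)), mu)

# Probit -> standard normal CDF (Phi^{-1} is norm.ppf, Phi is norm.cdf)
probit_roundtrip = np.allclose(norm.cdf(norm.ppf(mu)), mu)

# Complementary log-log -> 1 - exp(-exp(.))
xb = np.log(-np.log(1 - mu))
cloglog_roundtrip = np.allclose(1 - np.exp(-np.exp(xb)), mu)

print(logit_roundtrip, probit_roundtrip, cloglog_roundtrip)
```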
1. At the risk of repeating myself, this is both the weakness and the strength of parametric modeling vis-à-vis many machine learning models: parametric models require (and provide) an understanding of how the world works.
2. For example, the canonical link of the exponential distribution has a mean function of \(\mu = -1/\boldsymbol{x \beta}\), which can produce impossible estimates of negative means (the exponential is always positive-valued).