Poisson models

So far, our discussion of GLMs has focused on binary response variables, which take on only the values 1 and 0. We began with these examples because they were relatively simple to model and because they expose and address many of the shortcomings of linear regression. Let’s apply this new model-fitting method to a different type of response variable: count data. Count data take non-negative integer values, typically recording the number of events observed. For example:

  • The number of customers arriving at a shop each hour.

  • The number of traffic accidents occurring at a busy intersection each year.

  • The number of Prussian cavalry soldiers killed by horse kicks in each army corps each year.

That last example is in fact drawn from real life; the statistician Ladislaus Bortkiewicz studied these same records in 1898 and noticed that the totals closely matched a mathematical formula which had been studied earlier by the mathematicians Abraham de Moivre (in 1711) and Siméon Denis Poisson (in 1837). Bortkiewicz’s work led to the recognition of the Poisson distribution, which serves as an excellent entry point for studying count regressions.

Poisson distributional assumption

Imagine an experienced freshwater fisher who lives by a large lake and loves to catch lake trout. From prior experience, the fisher knows that trout are often found in deeper waters rather than shallower waters, and now wishes to study this trend with some data. During one summer month, the fisher travels out to different spots on the lake and fishes at different depths each day, for thirty days:

Depth (m) 0.50 0.75 1.75 3.00 4.50 5.50 5.75 6.25 7.00 7.25
Caught 0 1 0 0 3 0 0 1 3 1
Depth (m) 7.25 7.50 7.75 8.50 9.25 10.00 10.75 10.75 10.75 11.50
Caught 2 3 2 6 6 7 6 7 5 12
Depth (m) 12.25 14.25 14.50 14.75 15.75 15.75 17.00 17.25 17.50 17.75
Caught 6 8 15 11 18 21 29 27 30 23

Plotting these records, the fisher notices a direct but nonlinear trend:

Code
set.seed(1210)
# Simulate 30 days of fishing: depths (in meters) spread across shallow,
# mid-range, and deep water, sorted from shallowest to deepest
depth <- sort(floor(c(runif(10,0,40),runif(10,20,60),runif(10,40,80)))/4)
# Daily trout counts drawn from a Poisson distribution whose rate increases with depth
trout <- rpois(30,exp(-1+0.25*depth))
plot(depth,trout,main='Number of lake trout caught each day',
     xlab='Line depth (m)',ylab='Trout caught')

Illustration of Poisson count data

How should we model this data? We face some difficult constraints: nonlinearity, heteroskedasticity (note how the range of count values widens as depth increases), and non-normality (counts can only be integer-valued and non-negative, which creates extremely non-normal error distributions, especially at shallow depths).

Let’s assume that the counts come from a Poisson distribution. We can feel comfortable with this starting point because, at least on the surface, fishing resembles a Poisson process. There are a large (essentially infinite) number of possible events (fish in the lake), and a low probability of catching any specific one of them. You can’t catch two fish at exactly the same moment, and the number of fish caught might be roughly proportional to the time spent fishing. Fish are probably not caught independently of one another, but the Poisson still seems like a fine starting place.

As detailed in the appendices, the Poisson distribution PMF takes the form:

\[P(Y=y) = \frac{\lambda^y e^{-\lambda}}{y!} \]

Where \(\lambda\), the rate or intensity of the Poisson process, is both the mean and the variance of the distribution. Since \(\mu = \lambda\), if we use the Poisson as the distributional assumption of a GLM, then we must choose some link function \(g\) such that \(g(\mu) = \mathbf{X}\boldsymbol{\beta}\). Then for each observation \(i\) we would have \(\lambda_i = \mu_i = g^{-1}(\boldsymbol{x}_i\boldsymbol{\beta})\), and our GLM would have the form,

\[P(Y_i=y_i) = \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!}\]

Put another way, in a Poisson regression, each observation \(y_i\) is drawn from a Poisson distribution with rate \(\lambda_i\), which is also the mean and the variance, and this rate is a (possibly nonlinear) function of the linear predictor \(\boldsymbol{x}_i\boldsymbol{\beta}\).
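
As a quick illustration (a minimal sketch, with trout.fit as an arbitrary name for the fitted model object), such a model can be fit in R to the simulated depth and trout data from the code above, using the log link to keep the fitted mean positive; given how the data were simulated, the fitted coefficient for depth should land near the true value of 0.25.

Code
# Sketch: Poisson GLM with a log link for the simulated catch data above
trout.fit <- glm(trout~depth,family=poisson(link='log'))
summary(trout.fit)
exp(coef(trout.fit))  # multiplicative change in expected catch per extra meter of depth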

Model offset terms

Before we consider extensions to the Poisson model, I want to cover one particular proportional effect seen in many count regressions: a model offset term, which enters the log link model as a predictor with a known beta of exactly 1.

We don’t often know the “true” beta for a predictor, so what is this mysterious offset term? Consider again the idea of a Poisson process: an event-generating process observed over a specific window or interval. The number of observed events is usually Poisson distributed and the time between events is usually exponentially distributed. One of the assumptions of any Poisson process is that the expected number of observed events is directly proportional to the size of the window or interval:

  • If we were counting annual instances of wildfires within large forests, we would expect a forest twice as large to produce twice as many fires.

  • If we were counting the number of buses that arrive at a bus stop, we would expect to count twice as many buses if we observed for twice as long a period of time.

  • If we were counting annual chicken pox cases at different elementary schools, we would expect schools with twice as many students to have twice as many cases.

The window or interval in each of these cases might be measured in units of time, area, people, or some other quantity, but in every case it can be measured, and if we change the size of this interval then the expected number of Poisson process events should scale proportionally.

Sometimes the counts we observe come from Poisson processes with different interval sizes. It would be an “apples-to-oranges” comparison if we used these different observations without any adjustment. Any difference in their counts would be wrongly attributed to differences in their predictor values when really the difference was being driven by the window size.
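
To see where that fixed beta of exactly 1 comes from: if the expected count for observation \(i\) is proportional to its exposure \(t_i\), say \(\mu_i = t_i \cdot e^{\boldsymbol{x}_i\boldsymbol{\beta}}\), then taking logs gives

\[\log \mu_i = \boldsymbol{x}_i\boldsymbol{\beta} + 1 \cdot \log t_i\]

so \(\log t_i\) enters the log link model like any other predictor, except that its coefficient is fixed at 1 rather than estimated.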

Let’s suppose that our fishing guide returns to the question of an ideal depth at which to fish for lake trout. They take 20 more trips out onto the lake, fishing at different depths and for different times each trip:

Depth (m) 10.50 10.50 10.75 11.75 12.00 12.25 12.75 13.00 13.50 14.75
Trip time (hrs) 2.5 2.75 2.75 2.75 2.75 3.50 3.50 3.75 4.25 4.25
Caught 1 0 1 1 2 4 2 2 2 9
Depth (m) 14.75 16.00 18.00 18.00 18.25 18.25 18.50 18.50 19.00 19.75
Trip time (hrs) 4.25 4.50 4.75 5.50 5.75 6.00 6.25 6.50 7.25 7.25
Caught 7 8 19 19 29 36 23 32 42 28

Notice that the fishing guide spends longer at the deeper fishing sites. This makes sense: they’re likely to stay longer when the fish are biting (in the deeper waters) and they’re likely to move on when nothing’s happening (in the shallow waters). So not only do we have Poisson intervals of different lengths, but also multicollinearity with our other predictor, line depth. Predictably, this would cause problems if we entered both into a log link regression, with the two predictors explaining away each other’s effects:

Code
set.seed(1209)
# Simulate 20 trips: depths (m) and trip lengths (hours), each sorted so that
# longer trips coincide with deeper water
(depth2 <- sort(floor(runif(20,40,80))/4))
(hours <- sort(floor(runif(20,10,30))/4))
# Trout counts: Poisson rate proportional to effective fishing hours (hours - 1)
(trout3 <- rpois(20,(hours-1)/8*exp(-1+0.25*depth2)))
# Naive model: depth and trip time both entered as ordinary predictors
trout.nooff <- glm(trout3~depth2+hours,family=poisson)
summary(trout.nooff)
exp(trout.nooff$coefficients)
Log-link Poisson regression without offset
Parameter Estimate p-value
\(\beta_0\) (Intercept) -3.763 <0.0001
\(\beta_1\) (Line depth) 0.356 <0.0001
\(\beta_2\) (Trip time) 0.070 0.562
\(\,\)
Null deviance 264.935 (19 df)
Model deviance 19.162 (17 df)
AIC 97.622

The model above isn’t a bad model; the model deviance shows that it fits the data reasonably well. But the model is fundamentally misspecified. Trip time isn’t even found to be significant, which is clearly wrong, and the exponentiated coefficient suggests that every extra hour on the lake increases trout catch totals by 7.2%, which does not seem reasonable: the model implies that each additional hour multiplies the expected catch by a fixed percentage, when really the catch should scale in direct proportion to the time spent fishing.

Let’s assume our doughty guide revises their initial analysis. As they think about how each fishing trip unfolds, they estimate that the first half-hour and last half-hour of most trips are not “quality fishing time”, but prep time and transit away from shore and out into the deeper lake waters. They therefore add an offset of \(\log (\textrm{time} - 1)\) to the linear predictor. Since they are using a log link, it’s necessary to log-transform the time offset so that it directly modifies the Poisson intensity \(\lambda\):

\[\begin{aligned} \hat{\lambda}_i = \hat{\mu}_i &= e^{\mathbf{X}\boldsymbol{\beta} + \log (\textrm{time} - 1)} \\ &= e^{\mathbf{X}\boldsymbol{\beta}} \cdot e^{\log (\textrm{time} - 1)} \\ &= e^{\mathbf{X}\boldsymbol{\beta}} \cdot (\textrm{time} - 1) \end{aligned}\]

The use of an offset term has now changed our interpretation of the betas in the model. Instead of modeling the total count of the response variable, we are now modeling the rate of response counts for a single unit of the offset variable. In this example, we are now estimating the number of trout caught per hour at different line depths:

Code
# Offset model: log(effective hours) enters with its coefficient fixed at 1
trout.off <- glm(trout3~offset(log(hours-1))+depth2,family=poisson)
summary(trout.off)
exp(trout.off$coefficients)
Log-link Poisson regression with offset
Parameter Estimate p-value
\(\beta_0\) (Intercept) -3.112 <0.0001
\(\beta_1\) (Line depth) 0.255 <0.0001
\(\,\)
Null deviance 108.139 (19 df)
Model deviance 18.135 (18 df)
AIC 94.596

Notice how the coefficient for line depth has changed from 0.356 to 0.255. I actually simulated this data (and the depth data seen previously) using a true beta of 0.25, so adding an offset has helped us recover the true relationship much better than we could without the offset. Exponentiating the coefficient for line depth, we find that increasing line depth by 1 meter increases the trout-per-hour catch rate by 29.1%, which closely matches the 26.7% we estimated earlier from the previous dataset.
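
To spell out that arithmetic: under the log link, a one-meter increase in line depth multiplies the estimated catch rate by

\[e^{\hat{\beta}_1} = e^{0.255} \approx 1.291\]

which is the 29.1% increase per meter quoted above.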

Compare the deviances in the two regression summaries printed above. The null deviance without an offset was approximately 265; with an offset, 108. By manually adjusting the counts so that they now become “apples-to-apples” rates, we have removed a large source of variation from the data, and the remaining space for modeling improvements is now smaller. We can’t perform a deviance test between the model with an offset and the model without an offset because of this difference in starting points. However, we can still compare the two models using their AICs, since there’s no change to the response variable of fish caught. By this measure, we further confirm that the model with an offset (AIC=94.6) is a better fit to the data than the model without an offset (AIC=97.6).
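
For reference, the AIC comparison can be pulled directly from the two fitted model objects created above; base R’s AIC accepts multiple fitted models and reports their degrees of freedom and AIC values side by side.

Code
# Side-by-side AIC comparison of the no-offset and offset models
AIC(trout.nooff, trout.off)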


  1. Remember that the Poisson distribution measures non-negative integer counts, so the mean \(\lambda\) can never be negative. Moreover, the PMF includes the term \(y!\), which is not defined for the negative values of \(y\) implied by a negative mean.↩︎

  2. The p-value of this hypothesis test would be how often a chi-squared distribution with \(29-28=1\) degrees of freedom would generate a value of at least \(283.426-29.039 \approx 254.387\), which is essentially… never.↩︎

  3. The p-value of this hypothesis test would be how often a chi-squared distribution with 28 degrees of freedom would generate a value of at least 29.039, which is roughly 41% of the time.↩︎