Poisson Regression with R

As a reminder, our model uses the natural logarithm (\(\log (x)\)) as the link function so:

\[ \log{[\hat{\mu}(\textbf{x})]} = \beta_{0}+\beta_{1}x_{1} + \ldots \]

We use the notation \(\hat{\mu}(\textbf{x})\) to denote the expected value of \(y\) given \(\textbf{x}\).
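Exponentiating both sides recovers the mean on its original scale, which is what makes the coefficients multiplicative:

\[ \hat{\mu}(\textbf{x}) = e^{\beta_{0}+\beta_{1}x_{1} + \ldots} = e^{\beta_{0}} \times e^{\beta_{1}x_{1}} \times \ldots \]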


Example data: PHMC Fruits question

In 2018, the Public Health Management Corporation conducted a home health survey sampling residents of the southeastern PA region (Bucks, Montgomery, Chester, Delaware, and Philadelphia Counties).

One question they asked was:

How many servings of fruits and vegetables do you eat on a typical day?

Responses were recorded as integers: \(y=\{0, \dots, 15\}\)

Here’s a histogram of the responses

d <- foreign::read.spss("https://rickhass.github.io/PHMC_count_tutorial_2.sav?raw=true", to.data.frame = T)

hist(d$FRUITS,  breaks = 12, main = "", xlab = "Number of Fruits", ylab = "Freq")

Since we have numeric, possibly unbounded counts, and not a lot of zeros, we’ll start by trying an ordinary linear model with lm()

We’ll use the following predictors:

  1. Whether or not the person has insurance
  2. Their biological sex
  3. Their BMI

lm.fruit <- lm(FRUITS ~ INSUREDA2 + SEX01 + BMI, data=d)
summary(lm.fruit)
## 
## Call:
## lm(formula = FRUITS ~ INSUREDA2 + SEX01 + BMI, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9551 -1.1369 -0.3328  0.6685  8.7147 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.623202   0.167370  15.673  < 2e-16 ***
## INSUREDA2No -0.222155   0.146952  -1.512   0.1307    
## SEX01Female  0.473712   0.071179   6.655 3.61e-11 ***
## BMI         -0.009467   0.005611  -1.687   0.0917 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.598 on 2081 degrees of freedom
##   (173 observations deleted due to missingness)
## Multiple R-squared:  0.02356,    Adjusted R-squared:  0.02215 
## F-statistic: 16.74 on 3 and 2081 DF,  p-value: 9.586e-11
qqnorm(residuals(lm.fruit))

plot(predict(lm.fruit), residuals(lm.fruit), xlab = "Predicted value", ylab = "Residual")

As we can see, the residuals are not normally distributed, and their variance is not constant. Perhaps we can do better with a Poisson model.

Coefficients in Poisson Models

glm.fruit <- glm(FRUITS ~ INSUREDA2 + SEX01 + BMI, family = "poisson", data=d)
summary(glm.fruit)
## 
## Call:
## glm(formula = FRUITS ~ INSUREDA2 + SEX01 + BMI, family = "poisson", 
##     data = d)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.957420   0.065044  14.720  < 2e-16 ***
## INSUREDA2No -0.088984   0.059421  -1.498   0.1343    
## SEX01Female  0.184059   0.028081   6.555 5.58e-11 ***
## BMI         -0.003588   0.002175  -1.649   0.0991 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1979.1  on 2084  degrees of freedom
## Residual deviance: 1929.6  on 2081  degrees of freedom
##   (173 observations deleted due to missingness)
## AIC: 7525.7
## 
## Number of Fisher Scoring iterations: 5
# halfnorm() comes from the faraway package
faraway::halfnorm(residuals(glm.fruit))

It looks like we may have some outliers, but we’ll proceed with interpreting coefficients before looking into the model fit.

Exponentiated Coefficients Interpretation

A good source for interpretation of Poisson coefficients is here: https://stats.idre.ucla.edu/r/dae/poisson-regression/

  • When we exponentiate, we get incidence rates, which are essentially expected counts. Let’s look at the indicator for Female: the raw coefficient is \(\hat{\beta} = 0.184\), meaning that, on average, the log of the expected fruit count is 0.184 higher for females than for males.

Let’s compare with the exponentiated coefficient: the IRR

exp(coef(glm.fruit))
## (Intercept) INSUREDA2No SEX01Female         BMI 
##   2.6049669   0.9148600   1.2020871   0.9964187

So \(e^{\hat{\beta}} = 1.202\), meaning that females eat about \(20\%\) more fruit on average; equivalently, the expected number of fruits consumed by a female is \(1.20\) times that consumed by a male.
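To see where the \(1.20\) comes from, compare two people who differ only on the female indicator; all the other terms cancel in the ratio of expected counts:

\[ \frac{\hat{\mu}(\text{female})}{\hat{\mu}(\text{male})} = \frac{e^{\beta_{0} + \hat{\beta} \cdot 1 + \ldots}}{e^{\beta_{0} + \hat{\beta} \cdot 0 + \ldots}} = e^{\hat{\beta}} = e^{0.184} \approx 1.202 \]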

As with logistic regression, we can exponentiate the confidence limits to get CI’s for the incidence rate ratios:

cbind(exp(coef(glm.fruit)), exp(confint(glm.fruit)))
## Waiting for profiling to be done...
##                           2.5 %   97.5 %
## (Intercept) 2.6049669 2.2936279 2.959752
## INSUREDA2No 0.9148600 0.8126132 1.025837
## SEX01Female 1.2020871 1.1378646 1.270277
## BMI         0.9964187 0.9921598 1.000657

Overdispersion and other count regression models

As we see above, the dispersion parameter is taken to be equal to 1. We can relax that assumption and estimate it using the quasipoisson family.

  • The Poisson model assumes that the dispersion parameter, \(\phi\) is equal to 1 in the below equation

\[ Var(y \mid \textbf{x}) = \phi \times V[\mu(\textbf{x})] \]
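For the Poisson family the variance function is \(V[\mu] = \mu\), and a common estimate of \(\phi\) is the Pearson statistic divided by the residual degrees of freedom. Here is a minimal sketch on simulated data (the variables are illustrative, not from the PHMC data):

```r
set.seed(1)
x <- rnorm(500)
y <- rpois(500, lambda = exp(0.5 + 0.3 * x))  # truly Poisson, so phi should be near 1

fit <- glm(y ~ x, family = poisson)

# Pearson-based dispersion estimate: sum of squared Pearson residuals / residual df
phi.hat <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
phi.hat
```

This is the same quantity that summary() reports as the dispersion parameter when you refit with family = quasipoisson.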

qp.fruit <- glm(FRUITS ~ INSUREDA2 + SEX01 + BMI, family = quasipoisson, data=d)
summary(qp.fruit)
## 
## Call:
## glm(formula = FRUITS ~ INSUREDA2 + SEX01 + BMI, family = quasipoisson, 
##     data = d)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.957420   0.064138  14.928  < 2e-16 ***
## INSUREDA2No -0.088984   0.058593  -1.519   0.1290    
## SEX01Female  0.184059   0.027689   6.647  3.8e-11 ***
## BMI         -0.003588   0.002145  -1.673   0.0946 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.9723222)
## 
##     Null deviance: 1979.1  on 2084  degrees of freedom
## Residual deviance: 1929.6  on 2081  degrees of freedom
##   (173 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Here, we see that \(\hat{\phi} = 0.972\), so the original fit is likely to be OK. Indeed, we can test its fit directly by getting a p-value for the residual deviance of the original model:

with(glm.fruit, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance   df         p
## [1,]     1929.629 2081 0.9916986

Overdispersion and Zero-inflation

For an example of overdispersed and zero-inflated data, let’s look at a snippet of data collected by Jefferson Health on patients’ health-related social needs. To protect PHI, the individual questions are not included here, only the number of needs each patient reported (from 0 to 8).

load(url("https://rickhass.github.io/HRSN.RData?raw=true"))

summary(HRSN.example)
##      Sex                Age                 race4                   SVI4     
##  Female:5910   19-44      :3476   White        :6666   1 - Low        :3378  
##  Male  :4090   45-64      :3274   BlackAA      :1872   2 - Low Medium :2678  
##                65-84      :2963   Asian        : 600   3 - Medium High:1800  
##                85 And Over: 287   Other/Unknown: 862   4 - High       :1371  
##                                                        NA's           : 773  
##                                                                              
##     SumScore         race3          
##  Min.   :0.0000   Length:10000      
##  1st Qu.:0.0000   Class :character  
##  Median :0.0000   Mode  :character  
##  Mean   :0.2177                     
##  3rd Qu.:0.0000                     
##  Max.   :8.0000
xtabs(~SumScore, data = HRSN.example)
## SumScore
##    0    1    2    3    4    5    6    7    8 
## 8786  723  250  109   70   38   13    9    2
mean(HRSN.example$SumScore)
## [1] 0.2177
var(HRSN.example$SumScore)
## [1] 0.5337601

So we can see that \(\text{Var}(Y) > E(Y)\) in this case.
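A quick way to quantify this: under a Poisson model, \(\text{Var}(Y) = E(Y)\), so the ratio of the sample variance to the sample mean should be near 1. Here it is well above that:

\[ \frac{\widehat{\text{Var}}(Y)}{\bar{Y}} = \frac{0.534}{0.218} \approx 2.45 \]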

First, let’s look at a negative binomial regression model; this ignores the fact that about \(94\%\) of patients reported no needs…

Remember:

  • NB regression will often have very similar coefficients to Poisson
  • Standard errors will be larger for NB regression, if \(\phi > 1\)
  • If overdispersion is the “truth” the SEs for NB regression will be less biased

Interpretation of \(e^{\beta}\) is the same in NB and Poisson regression
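The extra flexibility comes from the NB variance function, which (in the parameterization used by MASS::glm.nb) adds a quadratic term governed by \(\theta\):

\[ \text{Var}(Y) = \mu + \frac{\mu^{2}}{\theta} \]

Small \(\theta\) means variance well above the mean; as \(\theta \to \infty\), the model approaches the Poisson.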

We use the glm.nb function from the MASS package. The syntax is the same as before, and we don’t have to specify a family because the function just fits NB models.

library(MASS)
hrsn.nb <- glm.nb(SumScore ~ Sex + Age + SVI4 + race3, data = HRSN.example)
summary(hrsn.nb)
## 
## Call:
## glm.nb(formula = SumScore ~ Sex + Age + SVI4 + race3, data = HRSN.example, 
##     init.theta = 0.1689023943, link = log)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.22624    0.11726 -10.458  < 2e-16 ***
## SexMale             -0.18436    0.07389  -2.495 0.012590 *  
## Age45-64            -0.22775    0.08254  -2.759 0.005794 ** 
## Age65-84            -0.80505    0.09405  -8.560  < 2e-16 ***
## Age85 And Over      -0.90897    0.25519  -3.562 0.000368 ***
## SVI42 - Low Medium   0.40764    0.09397   4.338 1.44e-05 ***
## SVI43 - Medium High  0.57892    0.10308   5.616 1.95e-08 ***
## SVI44 - High         0.92585    0.11327   8.174 2.98e-16 ***
## race3Other          -0.49160    0.11768  -4.177 2.95e-05 ***
## race3White          -0.57691    0.09381  -6.150 7.76e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1689) family taken to be 1)
## 
##     Null deviance: 3639.7  on 9226  degrees of freedom
## Residual deviance: 3322.3  on 9217  degrees of freedom
##   (773 observations deleted due to missingness)
## AIC: 9155.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.16890 
##           Std. Err.:  0.00953 
## 
##  2 x log-likelihood:  -9133.06500

This suggests there is actually *under*-dispersion. That is not exactly true: the problem is that there are many more zeros than there ought to be given the model.

Zero-inflated models

Remember, Zero-inflated models are Mixture Models

  • Mixture distribution is built out of combinations of other distributions

In zero-inflated models we fit separate distributions for “zero vs. not zero” and for the counts \(1\) to \(\infty\)

\[ P(Y = 0) \sim \text{Logistic} \]

\[ P(Y = y_{i}) \sim \text{Count} \]

Where “Count” is either Poisson or Negative Binomial

  • Technically, each has a “mixture” proportion, but we’ll ignore that
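Written out with the mixture proportion \(\pi\) that the bullet above sets aside, the zero-inflated model is:

\[ P(Y = 0) = \pi + (1 - \pi)\, f(0) \]

\[ P(Y = y_{i}) = (1 - \pi)\, f(y_{i}), \quad y_{i} = 1, 2, \ldots \]

where \(\pi\) comes from the logistic part and \(f\) is the Poisson or negative binomial probability mass function.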

Fitting a Zero-Inflated Negative Binomial (ZINB) model requires the pscl package

library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
ZINB.1 <- zeroinfl(SumScore ~ Sex + race3 + SVI4 | SVI4 + Sex,
  data = HRSN.example, dist = "negbin")

summary(ZINB.1)
## 
## Call:
## zeroinfl(formula = SumScore ~ Sex + race3 + SVI4 | SVI4 + Sex, data = HRSN.example, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4224 -0.3173 -0.2721 -0.2198 17.5424 
## 
## Count model coefficients (negbin with log link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.39269    0.24793  -1.584 0.113230    
## SexMale              0.12044    0.10722   1.123 0.261302    
## race3Other          -0.41252    0.11255  -3.665 0.000247 ***
## race3White          -0.58888    0.08957  -6.575 4.88e-11 ***
## SVI42 - Low Medium  -0.01276    0.14831  -0.086 0.931425    
## SVI43 - Medium High  0.27838    0.15436   1.803 0.071315 .  
## SVI44 - High         0.31952    0.15006   2.129 0.033225 *  
## Log(theta)          -0.64406    0.36460  -1.767 0.077312 .  
## 
## Zero-inflation model coefficients (binomial with logit link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.7370     0.3650   2.019  0.04349 *  
## SVI42 - Low Medium   -0.6833     0.2101  -3.253  0.00114 ** 
## SVI43 - Medium High  -0.5381     0.1954  -2.754  0.00588 ** 
## SVI44 - High         -1.1535     0.2572  -4.485 7.29e-06 ***
## SexMale               0.4981     0.1602   3.110  0.00187 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.5252 
## Number of iterations in BFGS optimization: 45 
## Log-likelihood: -4587 on 13 Df
exp(cbind(coef(ZINB.1), confint(ZINB.1)))
##                                         2.5 %    97.5 %
## count_(Intercept)         0.6752389 0.4153512 1.0977399
## count_SexMale             1.1279935 0.9142019 1.3917816
## count_race3Other          0.6619785 0.5309322 0.8253701
## count_race3White          0.5549508 0.4656012 0.6614467
## count_SVI42 - Low Medium  0.9873187 0.7382721 1.3203780
## count_SVI43 - Medium High 1.3209873 0.9761330 1.7876738
## count_SVI44 - High        1.3764721 1.0257433 1.8471244
## zero_(Intercept)          2.0896773 1.0217890 4.2736331
## zero_SVI42 - Low Medium   0.5049381 0.3345206 0.7621729
## zero_SVI43 - Medium High  0.5838432 0.3980924 0.8562656
## zero_SVI44 - High         0.3155179 0.1905896 0.5223345
## zero_SexMale              1.6456525 1.2023081 2.2524778

The counterintuitive part is that the zero-inflation model is a logistic model for a zero count, i.e., for the probability of reporting no needs. So we see that, compared to low SVI, those in high SVI neighborhoods had smaller odds of NOT having any needs.
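Since the zero-inflation part is on the logit scale, its coefficients can be converted to probabilities of a structural zero with plogis(). A sketch using the coefficients reported in the summary above (reference group: female, low SVI):

```r
# Probability of a structural zero for the reference group (female, low SVI)
plogis(0.7370)           # about 0.68

# Same, but for the high-SVI group (add the SVI44 coefficient on the logit scale)
plogis(0.7370 - 1.1535)  # about 0.40
```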

The count part of the model can be interpreted the same way as ordinary Poisson or NB regression:

  • The raw coefficient is a difference in log expected counts
  • The exponentiated coefficient is an IRR, just as before