Here is an example of fitting a linear model with an interaction. Then we have some examples of figures that illustrate the interaction.

First we read in the data:

# load the foreign package
library(foreign)

# read the data directly from SPSS
lou_data <- read.spss("PBH504_Homework_LinearRegression_cardio_key.sav", to.data.frame = T)

Re-leveling Factors

Notice that our data are read in as all numeric because in SPSS we did not specify labels for the levels. So, as with last week, we need to tell R what our numeric codes mean. We’ll again use an indicator for African American and one for Sex. As the data are coded, 1 = female and 0 = male, but what if we needed an indicator variable for males? Here are two ways to do that:

First, we’ll just use the factor function as we normally would with numerically coded variables. All we need to do is put our intended reference group first in the levels and labels arguments:

lou_data$Sex <- factor(lou_data$Sex, levels = c(1,0), labels = c("female","male"))

# confirm that we have an indicator for male
contrasts(lou_data$Sex)
##        male
## female    0
## male      1

Note, above, that in the levels argument, the numeric code for females comes first (1), same with the labels argument. When you use the factor function on a numeric variable and specify the levels, whichever level comes first becomes the reference group by default.

Imagine, however, that you’ve already got a factor and want to change the order of the levels. You can’t use the factor function anymore, or at least that’s awkward. A better way is to simply use the relevel function from base R. There are other ways to do this, but I find this the most straightforward. Importantly, when specifying ref, the reference group, you must use the same spelling as the pre-existing factor level (here "male").

# set the reference level of Sex to "male", making the factor an indicator for female
lou_data$Female <- relevel(lou_data$Sex, ref = "male")
# check that it is correct
contrasts(lou_data$Female)
##        female
## male        0
## female      1

We also need to change the African American indicator to a factor. We’ll just use the standard factor function.

lou_data$AfrAm <- factor(lou_data$AfrAm, levels = c(0,1), labels = c("NonAA","AA"))
# confirm the coding
contrasts(lou_data$AfrAm)
##       AA
## NonAA  0
## AA     1

Specifying the interaction term(s)

R has special syntax for formulas to make interactions easier. In a linear model, an interaction variable is simply the product of two (or more) other variables. We can specify this in two ways: using the asterisk (*) or using the colon (:).

First, let’s imagine we want to include Sex, Race, and SBP as predictors of Cholesterol like last week. An important question is whether the association between blood pressure and cholesterol depends on Race. Our regression model would be:

\[E(Y|X) = \beta_{0} + \beta_{1}Female + \beta_{2}AfrAm + \beta_{3}SBP1 + \beta_{4}(SBP1 \times AfrAm)\]

So in the standard interaction model, we have main effects for all 3 predictors and an additional variable representing the interaction. We then test the hypothesis of an interaction by setting \(H_{0}\): \(\beta_{4} = 0\).
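To see what the interaction coefficient represents, substitute the two values of the AfrAm indicator into the model above:

\[E(Y|X) = \beta_{0} + \beta_{1}Female + \beta_{3}SBP1 \qquad (AfrAm = 0)\]

\[E(Y|X) = (\beta_{0} + \beta_{2}) + \beta_{1}Female + (\beta_{3} + \beta_{4})SBP1 \qquad (AfrAm = 1)\]

So \(\beta_{4}\) is the difference in SBP1 slopes between the two Race groups, and \(H_{0}\): \(\beta_{4} = 0\) says the two slopes are equal.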

Below we have 2 options for including the interaction term:

SBP1*AfrAm and SBP1 + AfrAm + SBP1:AfrAm

They both mean the same thing! Generally, the * form is preferred because, in larger models, it’s easy to forget the main-effect terms, which are important for interpreting the interaction.
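As a quick check (no data needed), the base R terms() function shows that the * form expands to both main effects plus the interaction, exactly matching the longhand : version:

```r
# * generates the main effects and the product term
attr(terms(Chol ~ SBP1*AfrAm), "term.labels")
## "SBP1" "AfrAm" "SBP1:AfrAm"

# : adds only the product term, so the main effects must be listed explicitly
attr(terms(Chol ~ SBP1 + AfrAm + SBP1:AfrAm), "term.labels")
## "SBP1" "AfrAm" "SBP1:AfrAm"
```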

# fit the model using *
Chol_mod1 <- lm(Chol ~ Female + SBP1*AfrAm, data = lou_data)

# See results
summary(Chol_mod1)
## 
## Call:
## lm(formula = Chol ~ Female + SBP1 * AfrAm, data = lou_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -142.268  -34.052   -4.421   27.407  297.858 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  116.83337    7.43590  15.712  < 2e-16 ***
## Femalefemale  13.64166    2.81115   4.853 1.37e-06 ***
## SBP1           0.29235    0.04982   5.868 5.64e-09 ***
## AfrAmAA       11.77542   14.51009   0.812    0.417    
## SBP1:AfrAmAA  -0.02921    0.09048  -0.323    0.747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.96 on 1252 degrees of freedom
## Multiple R-squared:  0.07076,    Adjusted R-squared:  0.06779 
## F-statistic: 23.83 on 4 and 1252 DF,  p-value: < 2.2e-16
# use : instead
Chol_mod1a <- lm(Chol ~ Female + SBP1 + AfrAm + SBP1:AfrAm, data = lou_data)
summary(Chol_mod1a)
## 
## Call:
## lm(formula = Chol ~ Female + SBP1 + AfrAm + SBP1:AfrAm, data = lou_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -142.268  -34.052   -4.421   27.407  297.858 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  116.83337    7.43590  15.712  < 2e-16 ***
## Femalefemale  13.64166    2.81115   4.853 1.37e-06 ***
## SBP1           0.29235    0.04982   5.868 5.64e-09 ***
## AfrAmAA       11.77542   14.51009   0.812    0.417    
## SBP1:AfrAmAA  -0.02921    0.09048  -0.323    0.747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.96 on 1252 degrees of freedom
## Multiple R-squared:  0.07076,    Adjusted R-squared:  0.06779 
## F-statistic: 23.83 on 4 and 1252 DF,  p-value: < 2.2e-16

As before, you can also test the hypotheses on all coefficients with F-tests, since each predictor uses only 1 degree of freedom. Note that Anova (with a capital A) comes from the car package, so load it first.

library(car)
Anova(Chol_mod1)
## Anova Table (Type II tests)
## 
## Response: Chol
##             Sum Sq   Df F value    Pr(>F)    
## Female       56458    1 23.5487 1.371e-06 ***
## SBP1        110331    1 46.0192 1.801e-11 ***
## AfrAm        12376    1  5.1622   0.02325 *  
## SBP1:AfrAm     250    1  0.1042   0.74687    
## Residuals  3001679 1252                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
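Because each predictor here uses a single degree of freedom, each F statistic is simply the square of the corresponding t statistic from summary(). A quick check using the interaction term’s values copied from the output above:

```r
# values copied from summary(Chol_mod1) and Anova(Chol_mod1) above
t_int <- -0.323    # t value for SBP1:AfrAmAA
F_int <- 0.1042    # F value for SBP1:AfrAm

t_int^2            # 0.104329 -- equals F up to rounding
```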

To report confidence intervals for the coefficients, use Confint (also from the car package). Note whether the confidence interval for the interaction agrees with the result of the t-test and the F-test of the interaction.

round(Confint(Chol_mod1), 4)
##              Estimate    2.5 %   97.5 %
## (Intercept)  116.8334 102.2452 131.4216
## Femalefemale  13.6417   8.1266  19.1567
## SBP1           0.2924   0.1946   0.3901
## AfrAmAA       11.7754 -16.6913  40.2422
## SBP1:AfrAmAA  -0.0292  -0.2067   0.1483
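One payoff of this layout: the SBP1 slope within each Race group can be read straight off the coefficients. Using the rounded estimates printed above (a back-of-the-envelope check, not a substitute for refitting):

```r
# rounded estimates copied from the Confint() output above
b_SBP <- 0.2924    # SBP1 slope in the non-AA (reference) group
b_int <- -0.0292   # SBP1:AfrAmAA: change in slope for the AA group

b_SBP + b_int      # 0.2632, the SBP1 slope within the AA group
```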

A more complex model

In PHS 605, you fit a more complex interaction model:

\[E(Chol|X) = \beta_{0} + \beta_{1}Female + \beta_{2}AfrAm + \beta_{3}SBP1 + \beta_{4}(SBP1 \times AfrAm) + \beta_{5}(Female \times AfrAm)\]

Here we’re going to specify the interactions using the : operator to make the regressors explicit. The important thing is to make sure we have all 3 main effects along with the two interactions. An equivalent, more compact way to write the same model uses * :

Chol ~ AfrAm*(SBP1 + Sex)
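Again, terms() confirms (without any data) that this compact form generates all five regressors:

```r
# term.labels lists every regressor the formula generates
attr(terms(Chol ~ AfrAm*(SBP1 + Sex)), "term.labels")
# 3 main effects (AfrAm, SBP1, Sex) plus the 2 interactions
```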

Chol_mod2 <- lm(Chol ~ SBP1 + Sex + AfrAm + SBP1:AfrAm + Sex:AfrAm, data = lou_data)

# chol_mod2 <- lm(Chol ~ SBP1 + Sex*AfrAm + SBP1*AfrAm, data = lou_data)

summary(Chol_mod2)
## 
## Call:
## lm(formula = Chol ~ SBP1 + Sex + AfrAm + SBP1:AfrAm + Sex:AfrAm, 
##     data = lou_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -143.162  -34.001   -4.101   27.217  296.620 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     132.38822    7.85303  16.858  < 2e-16 ***
## SBP1              0.28682    0.04995   5.742 1.17e-08 ***
## Sexmale         -16.05385    3.27843  -4.897 1.10e-06 ***
## AfrAmAA           5.35301   15.18506   0.353    0.725    
## SBP1:AfrAmAA     -0.01258    0.09119  -0.138    0.890    
## Sexmale:AfrAmAA   9.09029    6.36428   1.428    0.153    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.94 on 1251 degrees of freedom
## Multiple R-squared:  0.07227,    Adjusted R-squared:  0.06857 
## F-statistic: 19.49 on 5 and 1251 DF,  p-value: < 2.2e-16

Again, we can get F-tests, which will agree with the t-tests, as well as confidence intervals.

Anova(Chol_mod2)
## Anova Table (Type II tests)
## 
## Response: Chol
##             Sum Sq   Df F value    Pr(>F)    
## SBP1        109884    1 45.8707 1.939e-11 ***
## Sex          56458    1 23.5682 1.358e-06 ***
## AfrAm        12376    1  5.1665    0.0232 *  
## SBP1:AfrAm      46    1  0.0190    0.8903    
## Sex:AfrAm     4887    1  2.0401    0.1534    
## Residuals  2996792 1251                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(Confint(Chol_mod2), 4)
##                 Estimate    2.5 %   97.5 %
## (Intercept)     132.3882 116.9817 147.7948
## SBP1              0.2868   0.1888   0.3848
## Sexmale         -16.0539 -22.4857  -9.6220
## AfrAmAA           5.3530 -24.4380  35.1440
## SBP1:AfrAmAA     -0.0126  -0.1915   0.1663
## Sexmale:AfrAmAA   9.0903  -3.3955  21.5761

Plotting interactions

There are many ways to plot interaction effects. The ggplot2 package has become the standard for graphing in R when there’s a mix of categorical and continuous data, and it is covered in Hadley Wickham’s book.

Before running the plot, you will need to install the ggplot2 package. It may already be installed, but running the install.packages("ggplot2") command will either install it or update it.

Below you’ll create 2 plots. In the first, you’ll see a scatterplot with dots in one color for African Americans and another color for non-African Americans.

Based on what is in Hadley’s Chapter 3, see if you can “customize” this plot (i.e., simply change the colors of the dots).

Also, in the ggsave command, you should change the filename. That command saves the last plot you made (i.e., the one in this chunk) to your working directory. There are other options (like resizing) if you want to dig deeper (not required).

library(ggplot2)

# a simple scatter with different colors for groups
ggplot(data = lou_data, aes(x = SBP1, y = Chol, col = AfrAm)) +
  geom_point(size = 1, alpha = .7, position = "jitter") +
  theme_minimal()

# save as PDF
ggsave("ChangeMe.pdf")
## Saving 7 x 5 in image
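As a hint for the color exercise: one approach (among several in Hadley’s Chapter 3) is to add a scale_colour_manual layer. The two color names below are just placeholder choices; this assumes ggplot2 and lou_data are loaded as above.

```r
# same scatterplot, but with manually chosen colors for the two groups
ggplot(data = lou_data, aes(x = SBP1, y = Chol, col = AfrAm)) +
  geom_point(size = 1, alpha = .7, position = "jitter") +
  scale_colour_manual(values = c("NonAA" = "steelblue", "AA" = "darkorange")) +
  theme_minimal()
```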

To make the interaction in regression more visible, you can instead create a plot with regression lines for each group, omitting the dots.

If you want to save this version, you can also use ggsave().

# instead of a scatterplot, plot the least squares regression lines for the groups in different colors
ggplot(data = lou_data, aes(x = SBP1, y = Chol, color = AfrAm)) +
  theme_minimal() +
  geom_smooth(method = lm,
              se        = TRUE,
              linewidth = .4,
              alpha     = .7) # to add the regression lines
## `geom_smooth()` using formula = 'y ~ x'

Below is code for a combined figure. This is the strength of ggplot: all I’ve done is include both “geoms” in one plot (geom_point() for the dots, geom_smooth() for the lines).

# Scatter with lm lines
ggplot(data = lou_data, aes(x = SBP1, y = Chol, color = AfrAm)) +
  theme_minimal() +
  geom_point(size = 1, alpha = .7, position = "jitter") +
  geom_smooth(method = lm,
              se        = FALSE,
              linewidth = .4,
              alpha     = .7) # to add the regression lines
## `geom_smooth()` using formula = 'y ~ x'

# save as PDF
# ggsave("ChangeMe.pdf")

Extra fun!

Try to alter the code above to see separate lines (colors) for males and females.