Here is an example of fitting a linear model with an interaction. Then we have some examples of figures that illustrate the interaction.
First we read in the data:
# load the foreign package
library(foreign)
# read the data directly from SPSS
lou_data <- read.spss("PBH504_Homework_LinearRegression_cardio_key.sav", to.data.frame = T)
Notice that our data are read in as all numeric because in SPSS, we did not specify labels for the levels. So as with last week, we need to tell SPSS what our numeric codes mean. We’ll again use an indicator for African American, and one for one of the Sexes. As the data are coded, 1 = female, 0 = male, but what if we needed an indicator variable for males? Here’s two ways to do that:
First, we’ll just use the factor
function as we normally
would with numeric coded variables. All we need to do is to put our
intended reference group first in the list of levels and labels:
lou_data$Sex <- factor(lou_data$Sex, levels = c(1,0), labels = c("female","male"))
# confirm that we have an indicator for male
contrasts(lou_data$Sex)
## male
## female 0
## male 1
Note, above, that in the levels
argument, the numeric
code for females comes first (1
), same with the
labels
argument. When you use the factor function on a
numeric variable and specify the levels, whichever level comes first
becomes the reference group by default.
Imagine, however, that you’ve already got a factor
and
want to change the order of the levels. You can’t use the factor command
anymore, or at least, that’s awkward. A better way is to simply use the
relevel
function from base R. There are other ways to do
this, but I find this the most straightforward. Importantly, when
specifying ref
, the reference group, you must use the same
spelling as the pre-existing factor level (here "male"
)
# set reference for Sex to "Male" and save it as a new indicator for female
lou_data$Female <- relevel(lou_data$Sex, ref = "male")
# check that it is correct
contrasts(lou_data$Female)
## female
## male 0
## female 1
We also need to change the African American indicator to a factor as
well. We’ll just use the standard factor
function.
lou_data$AfrAm <- factor(lou_data$AfrAm, levels = c(0,1), labels = c("NonAA","AA"))
# confirm the coding
contrasts(lou_data$AfrAm)
## AA
## NonAA 0
## AA 1
R has special syntax for formals to make interactions easier. In a
linear model, an interaction variable is simply the product of two (or
more) other variables. We can do this in two ways, using the asterisk
for *
or using the :
First, let’s imagine we want to include Sex, Race, and SBP as predictors of Cholesterol like last week. An important question is whether the association between blood pressure and cholesterol depends on Race. Our regression model would be:
\[E(Y|X) = \beta_{0} + \beta_{1}Female + \beta_{2}AfrAm + \beta_{3}SBP1 + \beta_{4}(SBP1 \times AfrAm)\]
So in the standard interaction model, we have main effects for all 3 predictors and then an additional variable representing the interaction. then, we test the hypothesis of an interaction by setting \(H_{0}\): \(\beta_{4} = 0\)
Below we have 2 options for including the interaction term:
SBP1*AfrAm
and
SBP1 + AfrAm + SBP1:AfrAm
They both mean the same thing! Generally, the use of the
*
is preferred, because in larger models, it’s easy to
forget the main effects terms, which are important for interpreting the
interaction.
# fit the model using *
Chol_mod1 <- lm(Chol ~ Female + SBP1*AfrAm, data = lou_data)
# See results
summary(Chol_mod1)
##
## Call:
## lm(formula = Chol ~ Female + SBP1 * AfrAm, data = lou_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.268 -34.052 -4.421 27.407 297.858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.83337 7.43590 15.712 < 2e-16 ***
## Femalefemale 13.64166 2.81115 4.853 1.37e-06 ***
## SBP1 0.29235 0.04982 5.868 5.64e-09 ***
## AfrAmAA 11.77542 14.51009 0.812 0.417
## SBP1:AfrAmAA -0.02921 0.09048 -0.323 0.747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.96 on 1252 degrees of freedom
## Multiple R-squared: 0.07076, Adjusted R-squared: 0.06779
## F-statistic: 23.83 on 4 and 1252 DF, p-value: < 2.2e-16
# use : instead
Chol_mod1a <- lm(Chol ~ Female + SBP1 + AfrAm + SBP1:AfrAm, data = lou_data)
summary(Chol_mod1a)
##
## Call:
## lm(formula = Chol ~ Female + SBP1 + AfrAm + SBP1:AfrAm, data = lou_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.268 -34.052 -4.421 27.407 297.858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.83337 7.43590 15.712 < 2e-16 ***
## Femalefemale 13.64166 2.81115 4.853 1.37e-06 ***
## SBP1 0.29235 0.04982 5.868 5.64e-09 ***
## AfrAmAA 11.77542 14.51009 0.812 0.417
## SBP1:AfrAmAA -0.02921 0.09048 -0.323 0.747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.96 on 1252 degrees of freedom
## Multiple R-squared: 0.07076, Adjusted R-squared: 0.06779
## F-statistic: 23.83 on 4 and 1252 DF, p-value: < 2.2e-16
As before, you can also test the hypotheses on all coefficents with F-tests as each predictor only uses 1 degree of freedom
## Anova Table (Type II tests)
##
## Response: Chol
## Sum Sq Df F value Pr(>F)
## Female 56458 1 23.5487 1.371e-06 ***
## SBP1 110331 1 46.0192 1.801e-11 ***
## AfrAm 12376 1 5.1622 0.02325 *
## SBP1:AfrAm 250 1 0.1042 0.74687
## Residuals 3001679 1252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To report confidence intervals for the interaction use. Note whether the confidence interval agrees with the result of either the t-test or F-test of the interaction
## Estimate 2.5 % 97.5 %
## (Intercept) 116.8334 102.2452 131.4216
## Femalefemale 13.6417 8.1266 19.1567
## SBP1 0.2924 0.1946 0.3901
## AfrAmAA 11.7754 -16.6913 40.2422
## SBP1:AfrAmAA -0.0292 -0.2067 0.1483
In PHS 605, you fit a more complex interaction model:
\[E(Chol|X) = \beta_{0} + \beta_{1}Female + \beta_{2}AfrAm + \beta_{3}SBP1 + \beta_{4}(SBP1 \times AfrAm) + \beta_{5}(Female \times AfrAm)\]
Here we’re going to specify the interactions using the :
operator to make the regressors explicit. The important thing to
remember is that we make sure we have all 3 main effects along with the
two interactions. Here’s one way to do this using *
:
Chol ~ AfrAm*(SBP1 + Sex)
Chol_mod2 <- lm(Chol ~ SBP1 + Sex + AfrAm + SBP1:AfrAm + Sex:AfrAm, data = lou_data)
# chol_mod2 <- lm(Chol ~ SBP1 + Sex*AfrAm + SBP1*AfrAm, data = lou_data)
summary(Chol_mod2)
##
## Call:
## lm(formula = Chol ~ SBP1 + Sex + AfrAm + SBP1:AfrAm + Sex:AfrAm,
## data = lou_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -143.162 -34.001 -4.101 27.217 296.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.38822 7.85303 16.858 < 2e-16 ***
## SBP1 0.28682 0.04995 5.742 1.17e-08 ***
## Sexmale -16.05385 3.27843 -4.897 1.10e-06 ***
## AfrAmAA 5.35301 15.18506 0.353 0.725
## SBP1:AfrAmAA -0.01258 0.09119 -0.138 0.890
## Sexmale:AfrAmAA 9.09029 6.36428 1.428 0.153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.94 on 1251 degrees of freedom
## Multiple R-squared: 0.07227, Adjusted R-squared: 0.06857
## F-statistic: 19.49 on 5 and 1251 DF, p-value: < 2.2e-16
Again, we can get F-tests which will agree with the t-tests, as well as confidence intervals.
## Anova Table (Type II tests)
##
## Response: Chol
## Sum Sq Df F value Pr(>F)
## SBP1 109884 1 45.8707 1.939e-11 ***
## Sex 56458 1 23.5682 1.358e-06 ***
## AfrAm 12376 1 5.1665 0.0232 *
## SBP1:AfrAm 46 1 0.0190 0.8903
## Sex:AfrAm 4887 1 2.0401 0.1534
## Residuals 2996792 1251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Estimate 2.5 % 97.5 %
## (Intercept) 132.3882 116.9817 147.7948
## SBP1 0.2868 0.1888 0.3848
## Sexmale -16.0539 -22.4857 -9.6220
## AfrAmAA 5.3530 -24.4380 35.1440
## SBP1:AfrAmAA -0.0126 -0.1915 0.1663
## Sexmale:AfrAmAA 9.0903 -3.3955 21.5761
There are many methods to plot interaction effects. The ggplot2 package has become the standard for graphing in R when there’s a mix of categorical and continuous data, and is covered in: Hadley Wickham’s book
Before running the plot, you will need to install the
ggplot2
package. It may already be installed, but running
the install.packages("ggplot2")
command will either install
it or update it.
Below you’ll create 2 plots. In the first, you’ll see that you get a scatterplot with dots that are one color for African Americans, and another color for non-AfrAm.
Based on what is in Hadley’s Chapter 3, see if you can “customize” this plot (i.e., simply change the colors of the dots.)
Also, in the ggsave
command, you should change the
filename. That command does saves the last plot you made (i.e., the one
in this chunk) to your working directory. There are other options (like
resizing it) if you want to dig deeper (not required)
library(ggplot2)
# a simple scatter with different colors for groups
ggplot(data = lou_data, aes(x = SBP1, y = Chol, col = AfrAm)) +
geom_point(size = 1, alpha = .7, position = "jitter") +
theme_minimal()
## Saving 7 x 5 in image
To make the interaction in regression more visible, you can instead create a plot with regression lines for each group, omitting the dots.
If you want to save this version, you can also use ‘ggsave()’
# instead of a scatterplot, plot the least squares regression lines for the groups in different colors
ggplot(data = lou_data, aes(x = SBP1, y = Chol, color = AfrAm)) +
theme_minimal() +
geom_smooth(method = lm,
se = TRUE,
size = .4,
alpha = .7)# to add regression line
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
Below is code for a combined figure. This is the strength of ggplot. All I’ve done is include both “geoms” in one plot (geom_point() for the dots, geom_smooth() for the lines)
# Scatter with lm lines
ggplot(data = lou_data, aes(x = SBP1, y = Chol, color = AfrAm)) +
theme_minimal() +
geom_point(size = 1, alpha = .7, position = "jitter") +
geom_smooth(method = lm,
se = FALSE,
size = .4,
alpha = .7) # to add regression line
## `geom_smooth()` using formula = 'y ~ x'
Try to alter the code below to see the separate lines (colors) for males and females