2 Ordinary Linear Regression
Note: make sure to load the faraway package with library(faraway).
R uses the formula interface, which consists of three parts: the left-hand side, the ~ (tilde), and the right-hand side.
- The left-hand side of the model formula specifies the response variable
  - Can be a variable name or a function like log(variable)
- The tilde is a separator: think of it as “regressed on”
- The right-hand side is the most complex part of the formula. It is a special expression, including the names of the predictors, that R evaluates to produce the regressors for the model
  - The arithmetic operators +, -, *, /, and ^ have special meaning on the right-hand side of a model formula (see the sketches below)
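For instance, a few minimal sketches of the interface (y, x1, x2, and dat are hypothetical placeholder names, not variables used later in this tutorial):
## hypothetical examples of the formula interface
lm(y ~ x1 + x2, data = dat)       # y regressed on x1 and x2
lm(log(y) ~ x1, data = dat)       # a function of the response on the left-hand side
lm(y ~ x1 + I(x2^2), data = dat)  # I() makes ^ ordinary arithmetic rather than a formula operator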
2.1 Our data
For the tutorial, we’ll use the teengamb data that is in the faraway package.
From the help page (type ?teengamb at the console):
- Data has 47 rows and 5 columns. It is from a survey studying teenage gambling in the UK
- Variables:
- sex: 0 = male, 1 = female
- status: Socioeconomic status score based on parents’ occupation
- income: in pounds per week
- verbal: verbal score in words out of 12 correctly defined
- gamble: expenditure on gambling in pounds per year
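Here's the structure of the data, presumably printed with str() (note that the output also shows the sex_fac factor we create in the next subsection):
str(teengamb)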
## 'data.frame': 47 obs. of 6 variables:
## $ sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ status : int 51 28 37 28 65 61 28 27 43 18 ...
## $ income : num 2 2.5 2 7 2 3.47 5.5 6.42 2 6 ...
## $ verbal : int 8 8 6 4 8 6 7 5 6 7 ...
## $ gamble : num 0 0 0 7.3 19.6 0.1 1.45 6.6 1.7 0.1 ...
## $ sex_fac: Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 2 2 2 2 ...
2.1.1 Converting from numeric to factor
Many times we have data that are numerically coded and we need R to interpret the codes as factors. The teengamb data is a good example. Here, sex = 0 corresponds to male, and sex = 1 to female, so this is an indicator variable for female.
You have two options in the binary situation. You can leave the codes as 0 / 1 and simply remember what they stand for, but that is not good practice since it will likely lead to errors. Also, if you want to change the reference group, you then have to change the data instead of the formatting. The better option is to convert the variable to a factor. Here’s how to make a numeric variable a factor:
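This is a sketch consistent with the sex_fac variable shown in the str() output above:
## convert the 0 / 1 codes to a labeled factor (0 = Male is the reference level)
teengamb$sex_fac <- factor(teengamb$sex, levels = c(0, 1), labels = c("Male", "Female"))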
Let’s look at how this affects the model output.
## model with numeric variable
gamb.mod <- lm(gamble ~ sex + status + income + verbal, data = teengamb)
summary(gamb.mod)
##
## Call:
## lm(formula = gamble ~ sex + status + income + verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.082 -11.320 -1.451 9.452 94.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.55565 17.19680 1.312 0.1968
## sex -22.11833 8.21111 -2.694 0.0101 *
## status 0.05223 0.28111 0.186 0.8535
## income 4.96198 1.02539 4.839 1.79e-05 ***
## verbal -2.95949 2.17215 -1.362 0.1803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
## F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
And now with the factor:
## model with factor variable
gamb.mod.fac <- lm(gamble ~ sex_fac + status + income + verbal, data = teengamb)
summary(gamb.mod.fac)
##
## Call:
## lm(formula = gamble ~ sex_fac + status + income + verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.082 -11.320 -1.451 9.452 94.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.55565 17.19680 1.312 0.1968
## sex_facFemale -22.11833 8.21111 -2.694 0.0101 *
## status 0.05223 0.28111 0.186 0.8535
## income 4.96198 1.02539 4.839 1.79e-05 ***
## verbal -2.95949 2.17215 -1.362 0.1803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
## F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
Results from all regression models in R using factors will print the name of the indicator group appended to the variable name in the output.
So above, sex_facFemale is shorthand for the indicator variable Female constructed using the sex factor. When there are more than 2 categories, you will get multiple indicator variables from the same factor (\(k-1\) for \(k\) categories), as the sketch below illustrates.
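For example, with a hypothetical 3-level factor, R builds \(k - 1 = 2\) indicator columns:
## hypothetical factor with levels "a", "b", "c"
f <- factor(c("a", "b", "c", "b"))
model.matrix(~ f)  # columns fb and fc are the indicators for levels "b" and "c"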
2.1.2 Confidence Intervals v. p-values
- Reminder: A confidence interval is an estimated set of values that should “trap” the true parameter some percentage of the time in repeated replications. We can be 95% confident in our interval, but that doesn’t mean there’s a 0.95 probability that the parameter is in that range.
confint() returns 95% confidence intervals by default, but you can specify the level manually.
Note that only the confidence limits are given; with some finagling, we can also get the coefficients to print too:
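The calls below are a plausible way to produce the two blocks of output, using the gamb.mod.fac fit from above:
confint(gamb.mod.fac)                             # default 95% intervals; e.g., level = 0.90 for 90%
cbind(coef(gamb.mod.fac), confint(gamb.mod.fac))  # coefficients alongside the limits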
## 2.5 % 97.5 %
## (Intercept) -12.1489038 57.2602050
## sex_facFemale -38.6890301 -5.5476301
## status -0.5150722 0.6195399
## income 2.8926538 7.0313047
## verbal -7.3430703 1.4240833
## 2.5 % 97.5 %
## (Intercept) 22.55565063 -12.1489038 57.2602050
## sex_facFemale -22.11833009 -38.6890301 -5.5476301
## status 0.05223384 -0.5150722 0.6195399
## income 4.96197922 2.8926538 7.0313047
## verbal -2.95949350 -7.3430703 1.4240833
2.1.3 Adding interaction terms
- Interaction terms can be added in several ways:
  - using the * operator
  - forming the actual product using the : operator
  - multiplying two regressors together in the data frame and adding that term with the + operator
- The principle of marginality says that the interaction is best interpreted when the two main effects are in the model
2.1.4 Interaction model for Gambling
Here we use the * operator: sex_fac * income is equivalent to sex_fac + income + sex_fac:income. A sketch of the fit follows.
Since sex is binary, we’d only need one interaction term if specifying it with the : operator. This is another reason to use the asterisk, as it sets up the correct number of indicator interactions for multiple-category predictors.
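A fit consistent with the summary output in the next subsection (the object name gamb.mod.int is an assumption):
## interaction model using the * operator
gamb.mod.int <- lm(gamble ~ sex_fac * income + status + verbal, data = teengamb)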
2.1.5 Interaction Output
Coefficients and confidence intervals
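Presumably produced with summary() and confint() on the interaction fit (using the assumed name gamb.mod.int from above):
summary(gamb.mod.int)
confint(gamb.mod.int)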
##
## Call:
## lm(formula = gamble ~ sex_fac * income + status + verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.109 -6.162 -0.938 2.267 86.503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.25943 15.79635 1.219 0.22972
## sex_facFemale 4.06362 11.51612 0.353 0.72600
## income 6.19885 1.02591 6.042 3.77e-07 ***
## status -0.04876 0.25978 -0.188 0.85203
## verbal -2.60864 1.99386 -1.308 0.19805
## sex_facFemale:income -6.43683 2.14337 -3.003 0.00454 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.79 on 41 degrees of freedom
## Multiple R-squared: 0.6121, Adjusted R-squared: 0.5647
## F-statistic: 12.94 on 5 and 41 DF, p-value: 1.417e-07
## 2.5 % 97.5 %
## (Intercept) -12.6419453 51.1608038
## sex_facFemale -19.1936466 27.3208960
## income 4.1269732 8.2707189
## status -0.5734002 0.4758751
## verbal -6.6353263 1.4180392
## sex_facFemale:income -10.7654601 -2.1081980
2.2 ANOVA and drop1 commands
As illustrated in the Faraway book, it’s often better to examine model-comparison (likelihood-ratio) F-tests from nested models for inference about factors with multiple levels. This is an alternative to using the \(t\)-tests for the individual \(\beta\)’s, and it applies to numeric predictors as well.
First, if we simply want to test the hypothesis that the addition of the interaction reduced the model residual \(SS_{resid}\) (i.e., improved fit), we can test that directly using anova().
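Assuming the interaction fit is stored as gamb.mod.int, the comparison is:
anova(gamb.mod, gamb.mod.int)  # F-test on the change in residual sum of squares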
## Analysis of Variance Table
##
## Model 1: gamble ~ sex + status + income + verbal
## Model 2: gamble ~ sex_fac * income + status + verbal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 42 21624
## 2 41 17725 1 3898.9 9.0188 0.004538 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, it’s often convenient and helpful to have a test for each predictor of the reduction in \(SS_{resid}\), and the drop1 function does just that. The output is similar to what SPSS and SAS give for their general linear model commands:
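A sketch of the call, again assuming the interaction fit is named gamb.mod.int:
drop1(gamb.mod.int, test = "F")  # F-test for dropping each eligible term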
## Single term deletions
##
## Model:
## gamble ~ sex_fac * income + status + verbal
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 17725 290.83
## status 1 15.2 17740 288.87 0.0352 0.852032
## verbal 1 740.0 18465 290.75 1.7117 0.198046
## sex_fac:income 1 3898.9 21624 298.18 9.0188 0.004538 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since all of our predictors use up only one degree of freedom each, the \(F\)-test results are identical to the individual \(t\)-test results in the model itself. This will not be the case when we introduce factors with more than 2 categories and/or interactions involving such factors.
Notice also that, since there is an interaction in the model, drop1 will not provide a test of the main effects of income or sex_fac. To get those, you’d have to use the additive (no-interaction) model:
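For the output below, that is presumably:
drop1(gamb.mod.fac, test = "F")  # F-tests for each main effect in the additive model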
## Single term deletions
##
## Model:
## gamble ~ sex_fac + status + income + verbal
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 21624 298.18
## sex_fac 1 3735.8 25360 303.67 7.2561 0.01011 *
## status 1 17.8 21642 296.21 0.0345 0.85349
## income 1 12056.2 33680 317.00 23.4169 1.792e-05 ***
## verbal 1 955.7 22580 298.21 1.8563 0.18031
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.3 Plots
Again, we’ll talk more about checking residual assumptions in later weeks, but R has built-in methods for making plots:
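A sketch that would produce the standard diagnostic plots (which model was plotted is an assumption; here we use the interaction fit):
## arrange the four default lm diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(gamb.mod.int)
par(mfrow = c(1, 1))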
Finally, we have a few options for making “marginal plots”: plots of the effect of one predictor holding the others constant. We’ll hold off on these for now, but the effects package has some nice features and we’ll explore that in subsequent weeks.
For now, we’ll make a simple scatterplot of income and gamble, adding the line implied by the model’s intercept and income coefficients (not an ideal way to make the plot…)
plot(gamble ~ income, teengamb, xlab = "Income in pounds per week",
     ylab = "Gambling expenditure in pounds per year")
## add the line using the intercept and the income slope from gamb.mod
abline(coef(gamb.mod)[c(1, 4)], lty = 1)