This is a tutorial on data exploration and linear modeling syntax in R, using Lou’s data.

Data from 1257 individuals with various vitals, demographics, and measures of total cholesterol.

You will need the car, psych, and foreign packages.

# load the packages we'll need
library(foreign)
library(car)
library(psych)

# read the data directly from SPSS
lou_data <- read.spss("PBH504_Homework_LinearRegression_cardio_key.sav", to.data.frame = TRUE)
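
Before cleaning anything, it’s worth a quick look at the structure of the imported data frame to confirm the import worked:

# quick structural check: the type of each variable plus a few values
str(lou_data)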

Data cleaning

The first issue we have is with factor labeling. In R, factors are the name for categorical variables. When we import from SPSS, we lose some of that information. What we want to do is set the attributes of the factors in the data set we’re using so they work well for regression (and other analyses). Let’s see how to do this:

In the Cardio dataset, the numeric codes for the existing indicator variable for African American are 1 = African American, 0 = non-African American.

The sex variable is similar: The numeric codes are 1 = Female, 0 = Male.

Below, we use the factor() function to tell R that the 0’s and 1’s are actually codes. We use the levels argument to tell R what the data look like, and we use the labels argument to label the levels.

Note: the order of the labels must match the order of the levels.

# change AfrAm to a factor
lou_data$AfrAm <- factor(lou_data$AfrAm, levels = c(0,1), labels = c("nonAfrAm","AfrAm"))
# check the reference group (it should be nonAfrAm)
contrasts(lou_data$AfrAm)
##          AfrAm
## nonAfrAm     0
## AfrAm        1
# change Sex to factor
lou_data$Sex <- factor(lou_data$Sex, levels = c(0,1), labels = c("male","female"))
# check ref group
contrasts(lou_data$Sex)
##        female
## male        0
## female      1
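
If you ever need a different reference group, relevel() changes which level comes first. Here we apply it to a copy of Sex so the outputs below are unaffected:

# make female the reference level (on a copy, for illustration only)
sex_releveled <- relevel(lou_data$Sex, ref = "female")
contrasts(sex_releveled)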

Exploring the data

A vital step in all data analysis is actually looking at the distributions of the variables, including factors. It’s important to do this first, before running any inferential analyses, so we can identify problems in the data that might cause trouble in modeling.

Let’s do this with some tools from the car and psych packages

First, let’s use the psych package to summarize all the numeric variables. We’ll also get the quantiles of Age for comparison.

describe(lou_data[, c("Age","SBP1","DBP1","Pulse1","Resp1","Chol")])
##        vars    n   mean    sd median trimmed   mad min max range  skew kurtosis
## Age       1 1257  73.63 10.36     74   74.33 10.38  31  99    68 -0.67     0.70
## SBP1      2 1257 150.29 33.98    147  148.89 34.10  55 266   211  0.42     0.03
## DBP1      3 1257  79.99 19.55     79   79.17 16.31  20 156   136  0.43     0.76
## Pulse1    4 1257  89.95 23.69     87   88.29 22.24  20 224   204  0.78     1.25
## Resp1     5 1257  23.23  6.35     21   22.34  4.45   4  80    76  1.96     7.97
## Chol      6 1257 170.26 50.71    166  167.10 44.48  22 464   442  0.82     2.13
##          se
## Age    0.29
## SBP1   0.96
## DBP1   0.55
## Pulse1 0.67
## Resp1  0.18
## Chol   1.43
quantile(lou_data$Age)
##   0%  25%  50%  75% 100% 
##   31   68   74   81   99
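
describe() covers the numeric variables; for the factors, simple frequency tables do the job (this is base R’s xtabs(), which we’ll use again later):

# frequency tables for the two factors
xtabs(~ Sex, data = lou_data)
xtabs(~ AfrAm, data = lou_data)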

It’s often better to visualize the distribution of numeric variables, especially the outcome variable, to see if we need to perform any transformations or use a special regression method.

First, we make a histogram of Chol:

hist(lou_data$Chol)

hist(lou_data$Chol, breaks = 25)
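
hist() also accepts the usual labeling arguments if we want a cleaner figure; the labels below are just illustrative choices:

# the same histogram with an axis label and a title
hist(lou_data$Chol, breaks = 25, xlab = "Total cholesterol", main = "Distribution of Chol")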

Histograms are a bit awkward at times, so we can also use what’s called a kernel density smoother. Below, we use densityPlot() from the car package.

densityPlot(~ Chol, data = lou_data, xlab = "Cholesterol")
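
densityPlot() also accepts a grouping factor on the right-hand side of the formula, which overlays one density curve per group:

# overlay the density of Chol for each level of Sex
densityPlot(Chol ~ Sex, data = lou_data)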

When we want an exploratory look at the relationship between a numeric outcome and a categorical variable, the best way to go is boxplots. As a reminder, these are non-parametric plots: the thick line represents the median (50th percentile), the borders of the box are the 25th and 75th percentiles, and the “whiskers” mark the boundary beyond which points are flagged as outliers.

Again, the car package has a nice function, Boxplot() (note the capital “B”); outliers are labeled by row number in the console.

Boxplot(Chol ~ Sex, data = lou_data)

##  [1] "677"  "234"  "334"  "473"  "691"  "789"  "832"  "841"  "1068" "1254"
## [11] "427"  "482"  "1022" "344"  "1081" "1161" "1031" "1213" "1188" "78"  
## [21] "615"  "54"
# drop two flagged outliers (rows 344 and 1022) and remove the 9th column
lou_data_no_outliers <- lou_data[-c(344, 1022), -9]

Boxplot(Chol ~ Sex, data = lou_data_no_outliers)

##  [1] "677"  "234"  "334"  "473"  "691"  "789"  "832"  "841"  "1068" "1254"
## [11] "427"  "482"  "54"   "78"   "615"  "797"  "1031" "1081" "1161" "1188"
## [21] "1213"

Finally, what if we want to visualize the linear relationships we hope to model?

Let’s first make a neat scatterplot of Systolic Blood Pressure and Cholesterol

scatterplot(Chol ~ SBP1, data = lou_data)

Now let’s condition on African American vs. non-African American and smooth the line a bit:

scatterplot(Chol ~ SBP1 | AfrAm, data=lou_data,
    legend=list(coords="bottomright", inset=0.1),
    smooth=list(span=0.9))
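
To scan several pairwise relationships at once, the car package also offers scatterplotMatrix(); the variable selection below is just one reasonable choice:

# all pairwise scatterplots among the outcome and two numeric predictors
scatterplotMatrix(~ Chol + SBP1 + Age, data = lou_data)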

There are many other tools for exploratory data analysis; see, for example, R for Data Science, Chapter 7.

Fitting the Models

In R, the syntax for linear models relies on the lm() function. That function uses the formula interface, a concise way of organizing data objects into dependent and independent variables.
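
A few common formula patterns, shown purely for reference (only additive models like the first two appear in this tutorial):

Chol ~ SBP1                # simple regression
Chol ~ Sex + AfrAm + SBP1  # multiple regression with additive terms
Chol ~ Sex * SBP1          # main effects plus the Sex:SBP1 interaction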

Let’s first fit a model with only categorical predictors: Sex and African American

chol_mod_1 <- lm(Chol ~ Sex + AfrAm, data = lou_data)

summary(chol_mod_1) # this gives the Betas
## 
## Call:
## lm(formula = Chol ~ Sex + AfrAm, data = lou_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -152.533  -33.297   -3.655   26.467  289.467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  158.174      2.250  70.294  < 2e-16 ***
## Sexfemale     16.359      2.831   5.779 9.48e-09 ***
## AfrAmAfrAm    11.123      3.172   3.506  0.00047 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.82 on 1254 degrees of freedom
## Multiple R-squared:  0.03653,    Adjusted R-squared:  0.03499 
## F-statistic: 23.77 on 2 and 1254 DF,  p-value: 7.367e-11
residualPlots(chol_mod_1)

##            Test stat Pr(>|Test stat|)
## Sex                                  
## AfrAm                                
## Tukey test   -1.4946            0.135

summary() prints out the information about each \(\beta\), and residualPlots() provides plots of the residuals against the predictors and against the fitted values, for diagnostic purposes.
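
The fitted model object stores everything these functions report, and standard extractor functions pull out the pieces if you need them programmatically:

coef(chol_mod_1)            # the estimated betas as a named vector
head(fitted(chol_mod_1))    # fitted values
head(residuals(chol_mod_1)) # residuals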

Next, let’s get confidence intervals for all coefficients

Confint(chol_mod_1) # this is from the car package and gives the betas along with confidence intervals
##              Estimate      2.5 %    97.5 %
## (Intercept) 158.17400 153.759502 162.58851
## Sexfemale    16.35890  10.805233  21.91256
## AfrAmAfrAm   11.12251   4.899226  17.34579

We can also return F-tests as in SPSS: we use the car package’s Anova() command (note the capital A).

Anova(chol_mod_1)
## Anova Table (Type II tests)
## 
## Response: Chol
##            Sum Sq   Df F value    Pr(>F)    
## Sex         82882    1  33.395  9.48e-09 ***
## AfrAm       30513    1  12.294 0.0004704 ***
## Residuals 3112261 1254                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now let’s add a numeric predictor: Systolic Blood Pressure

chol_mod_2 <- lm(Chol ~ Sex + AfrAm + SBP1, data = lou_data)

# or run
# chol_mod_2 <- update(chol_mod_1, . ~ . + SBP1)

summary(chol_mod_2)
## 
## Call:
## lm(formula = Chol ~ Sex + AfrAm + SBP1, data = lou_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -141.947  -33.843   -4.287   27.325  297.636 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 118.10494    6.30484  18.732  < 2e-16 ***
## Sexfemale    13.65954    2.80960   4.862 1.31e-06 ***
## AfrAmAfrAm    7.20421    3.16968   2.273   0.0232 *  
## SBP1          0.28360    0.04179   6.786 1.77e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.95 on 1253 degrees of freedom
## Multiple R-squared:  0.07068,    Adjusted R-squared:  0.06846 
## F-statistic: 31.77 on 3 and 1253 DF,  p-value: < 2.2e-16

Now let’s see a different way to test the “significance” of SBP1: via a nested model comparison. The F-test returned is a test of the reduction in the residual SS achieved by adding SBP1 to the model. Note that the p-value will be identical to the p-value for the t-test of the \(\beta\) for SBP1 in the full model; when the models differ by a single term, the F statistic is just the square of that t statistic.

anova(chol_mod_1, chol_mod_2)
## Analysis of Variance Table
## 
## Model 1: Chol ~ Sex + AfrAm
## Model 2: Chol ~ Sex + AfrAm + SBP1
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   1254 3112261                                  
## 2   1253 3001929  1    110331 46.052 1.772e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
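
As a quick arithmetic check, the t statistic for SBP1 reported by summary(chol_mod_2) was 6.786, and squaring it recovers this F statistic (up to rounding of the printed t):

6.786^2
## [1] 46.0498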

Nested models

The previous anova() ran an F-test between two nested models. This is essentially what the Anova() command does on a single model: it runs all the different model comparisons needed to test the significance of each predictor.

Here’s a demonstration:

## run all model comparisons necessary
Anova(chol_mod_2)
## Anova Table (Type II tests)
## 
## Response: Chol
##            Sum Sq   Df F value    Pr(>F)    
## Sex         56628    1 23.6365 1.311e-06 ***
## AfrAm       12376    1  5.1659    0.0232 *  
## SBP1       110331    1 46.0521 1.772e-11 ***
## Residuals 3001929 1253                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The top line is a test of Sex as a predictor. The F-value (23.6365) is obtained by comparing the model with Sex to a model without it, keeping the other predictors:

mod_no_sex <- lm(Chol ~ AfrAm + SBP1, data = lou_data)

anova(mod_no_sex, chol_mod_2)
## Analysis of Variance Table
## 
## Model 1: Chol ~ AfrAm + SBP1
## Model 2: Chol ~ Sex + AfrAm + SBP1
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   1254 3058557                                  
## 2   1253 3001929  1     56628 23.637 1.311e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the anova output, we see the same F-ratio and p-value.

Here’s how we get the test of AfrAm:

mod_no_AA <- lm(Chol ~ Sex + SBP1, data = lou_data)

anova(mod_no_AA, chol_mod_2)
## Analysis of Variance Table
## 
## Model 1: Chol ~ Sex + SBP1
## Model 2: Chol ~ Sex + AfrAm + SBP1
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)  
## 1   1254 3014306                             
## 2   1253 3001929  1     12376 5.1659 0.0232 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Anova() command is especially useful when we have factors with 3 or more categories as predictors. Again, it systematically runs the necessary anova() model comparisons by omitting all regressors associated with a factor. Here’s an example we can run by making Age into a categorical variable.

# bin Age into three groups; right = FALSE makes intervals closed on the left, e.g. [60, 75)
lou_data$Age_cat <- cut(lou_data$Age, breaks = c(30, 60, 75, 100), labels = c("below 60","60-75","above 75"), right = FALSE)

xtabs(~ Age_cat, data = lou_data)
## Age_cat
## below 60    60-75 above 75 
##      115      515      627
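
A quick sanity check: the three counts sum to 1257, and Age runs from 31 to 99 (inside the 30–100 breaks), so cut() produced no NA values:

# every observation should fall into one of the three bins
sum(is.na(lou_data$Age_cat))
## [1] 0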

We’ll run a model with Age_cat, Sex, AfrAm, and SBP1, and view the results from both lm() and Anova().

age_mod <- lm(Chol ~ Age_cat + Sex + AfrAm + SBP1, data = lou_data)

summary(age_mod)
## 
## Call:
## lm(formula = Chol ~ Age_cat + Sex + AfrAm + SBP1, data = lou_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -139.985  -32.150   -4.189   26.910  293.043 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     127.98734    7.43761  17.208  < 2e-16 ***
## Age_cat60-75     -8.76306    5.10351  -1.717 0.086215 .  
## Age_catabove 75 -17.01641    5.11115  -3.329 0.000896 ***
## Sexfemale        15.94878    2.85812   5.580 2.94e-08 ***
## AfrAmAfrAm        5.20772    3.20992   1.622 0.104974    
## SBP1              0.29331    0.04167   7.038 3.20e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.7 on 1251 degrees of freedom
## Multiple R-squared:  0.08154,    Adjusted R-squared:  0.07787 
## F-statistic: 22.21 on 5 and 1251 DF,  p-value: < 2.2e-16
contrasts(lou_data$Age_cat)
##          60-75 above 75
## below 60     0        0
## 60-75        1        0
## above 75     0        1

Above we see that we need 2 regressors to include the Age_cat variable in the model. These are dummy regressors, and their coefficients give the difference in Chol between the comparison group (below 60) and each of the two other age groups, holding Sex, AfrAm, and SBP1 constant. This isn’t always what we want. If instead we simply want to test for some main effect of age, i.e., that at least one age group differs from the others, we can get that with Anova().

Anova(age_mod)
## Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
##      arithmetic operators in their names;
##   the printed representation of the hypothesis will be omitted
## Anova Table (Type II tests)
## 
## Response: Chol
##            Sum Sq   Df F value    Pr(>F)    
## Age_cat     35087    2  7.3973 0.0006401 ***
## Sex         73847    1 31.1383 2.943e-08 ***
## AfrAm        6242    1  2.6321 0.1049737    
## SBP1       117486    1 49.5391 3.197e-12 ***
## Residuals 2966842 1251                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F and p-values for Age_cat are those obtained with this model comparison:

anova(chol_mod_2, age_mod)
## Analysis of Variance Table
## 
## Model 1: Chol ~ Sex + AfrAm + SBP1
## Model 2: Chol ~ Age_cat + Sex + AfrAm + SBP1
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   1253 3001929                                  
## 2   1251 2966842  2     35087 7.3973 0.0006401 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summing it up

The lm() command is the workhorse of both regression and ANOVA in R, and its syntax is pretty standard across other R functions. Later we will use lmer() (from the lme4 package), which allows for multiple levels (random effects), and we’ll see that the syntax is similar.
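
As a preview, a mixed-model call only changes the formula by adding a random-effect term in parentheses. The grouping variable clinic below is hypothetical and not part of this dataset:

# hypothetical sketch only: "clinic" is NOT in lou_data
# library(lme4)
# mixed_mod <- lmer(Chol ~ Sex + AfrAm + SBP1 + (1 | clinic), data = lou_data)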

Don’t forget to try to run all of this code yourself to gain practice!