This is a tutorial on data exploration and linear modeling syntax in R, using Lou’s Data: data from 1257 individuals with various vitals, demographics, and measures of total cholesterol.
You will need the car, psych, and foreign packages.
# load the foreign package
library(foreign)
# read the data directly from SPSS
lou_data <- read.spss("PBH504_Homework_LinearRegression_cardio_key.sav", to.data.frame = T)
The first issue we have is with factor labeling. In R, categorical variables are called factors. When we import from SPSS, we lose some of that information, so we want to set the attributes of the factors in our data set so that they work well for regression (and other analyses). Let’s see how to do this:
In the Cardio dataset, the numeric codes for the existing African American indicator variable are 1 = African American, 0 = non-African American.
The Sex variable is similar: the numeric codes are 1 = Female, 0 = Male.
Below, we use the factor() function to tell R that the 0’s and 1’s are actually codes. We use the levels argument to tell R what values appear in the data, and the labels argument to label those levels.
Note: the order of the labels must match the order of the levels.
# change AfrAm to a factor
lou_data$AfrAm <- factor(lou_data$AfrAm, levels = c(0,1), labels = c("nonAfrAm","AfrAm"))
# check the reference group (it should be nonAfrAm)
contrasts(lou_data$AfrAm)
## AfrAm
## nonAfrAm 0
## AfrAm 1
# change Sex to factor
lou_data$Sex <- factor(lou_data$Sex, levels = c(0,1), labels = c("male","female"))
# check ref group
contrasts(lou_data$Sex)
## female
## male 0
## female 1
A vital step in all data analysis is actually looking at the distributions of variables, including factors. It’s very important to do this first, before running any inferential analyses, so that we can identify issues with the data that might cause problems in modeling.
Let’s do this with some tools from the car and psych packages.
First, let’s use the psych package to summarize all the numeric variables. We’ll also get the quantiles of Age for comparison.
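The chunk that produced the output below isn’t shown; here is a minimal sketch that should reproduce it, assuming the six numeric columns are named as they appear in the output:
# load psych and summarize the numeric variables
library(psych)
describe(lou_data[, c("Age","SBP1","DBP1","Pulse1","Resp1","Chol")])
# quantiles of Age for comparison
quantile(lou_data$Age)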
## vars n mean sd median trimmed mad min max range skew kurtosis
## Age 1 1257 73.63 10.36 74 74.33 10.38 31 99 68 -0.67 0.70
## SBP1 2 1257 150.29 33.98 147 148.89 34.10 55 266 211 0.42 0.03
## DBP1 3 1257 79.99 19.55 79 79.17 16.31 20 156 136 0.43 0.76
## Pulse1 4 1257 89.95 23.69 87 88.29 22.24 20 224 204 0.78 1.25
## Resp1 5 1257 23.23 6.35 21 22.34 4.45 4 80 76 1.96 7.97
## Chol 6 1257 170.26 50.71 166 167.10 44.48 22 464 442 0.82 2.13
## se
## Age 0.29
## SBP1 0.96
## DBP1 0.55
## Pulse1 0.67
## Resp1 0.18
## Chol 1.43
## 0% 25% 50% 75% 100%
## 31 68 74 81 99
It’s often better to visualize the distributions of numeric variables, especially the outcome variable, to see whether we need to perform any transformations or use a special regression method.
First, we make a histogram of Chol.
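The plotting code isn’t shown; a minimal sketch using base R’s hist():
# histogram of total cholesterol
hist(lou_data$Chol, xlab = "Chol", main = "Histogram of Chol")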
Histograms are a bit awkward at times, so we can also use what’s called a kernel density smoother. Below, we use densityPlot() from the car package.
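A sketch of the call, assuming the default settings:
# load car and draw a kernel density estimate of Chol
library(car)
densityPlot(lou_data$Chol)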
When we want to have an exploratory look at any relationship between a numeric outcome and a categorical variable, the best way to go is to use boxplots. As a reminder, these are non-parametric plots: the borders of the box are the 25th and 75th percentiles, the thick line in the middle is the median (50th percentile), and the “whiskers” mark the boundary beyond which points are flagged as outliers.
Again, the car package has a nice function for this, Boxplot() (note the capital “B”); outliers are labeled by row number in the console.
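The call that produced the outlier labels below isn’t shown; judging from the follow-up call after the outliers are removed, it was presumably:
# boxplot of Chol by Sex; Boxplot() returns the row labels of outliers
Boxplot(Chol ~ Sex, data = lou_data)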
## [1] "677" "234" "334" "473" "691" "789" "832" "841" "1068" "1254"
## [11] "427" "482" "1022" "344" "1081" "1161" "1031" "1213" "1188" "78"
## [21] "615" "54"
# drop rows 344 and 1022 (flagged as extreme outliers above) and the 9th column
lou_data_no_outliers <- lou_data[-c(344,1022) , -9]
Boxplot(Chol ~ Sex, data = lou_data_no_outliers)
## [1] "677" "234" "334" "473" "691" "789" "832" "841" "1068" "1254"
## [11] "427" "482" "54" "78" "615" "797" "1031" "1081" "1161" "1188"
## [21] "1213"
Finally, what if we want to visualize the linear relationships we hope to model?
Let’s first make a neat scatterplot of Systolic Blood Pressure and Cholesterol.
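A sketch of the call; scatterplot() is from the car package and adds regression and smooth lines by default:
# Chol against systolic blood pressure
scatterplot(Chol ~ SBP1, data = lou_data)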
Now let’s condition on African American vs. non-African American and smooth the line a bit:
scatterplot(Chol ~ SBP1 | AfrAm, data=lou_data,
legend=list(coords="bottomright", inset=0.1),
smooth=list(span=0.9))
There are many other tools for exploratory data analysis; see, for example, R for Data Science, Ch. 7.
In R, the syntax for linear models relies on the lm() function. That function makes use of the formula interface, a concise way of organizing data objects into dependent and independent variables.
Let’s first fit a model with only categorical predictors: Sex and African American.
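The fitting code isn’t shown here, but the Call line in the output gives the formula, and the model is referred to as chol_mod_1 later on:
# fit the model with only categorical predictors and summarize it
chol_mod_1 <- lm(Chol ~ Sex + AfrAm, data = lou_data)
summary(chol_mod_1)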
##
## Call:
## lm(formula = Chol ~ Sex + AfrAm, data = lou_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.533 -33.297 -3.655 26.467 289.467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 158.174 2.250 70.294 < 2e-16 ***
## Sexfemale 16.359 2.831 5.779 9.48e-09 ***
## AfrAmAfrAm 11.123 3.172 3.506 0.00047 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.82 on 1254 degrees of freedom
## Multiple R-squared: 0.03653, Adjusted R-squared: 0.03499
## F-statistic: 23.77 on 2 and 1254 DF, p-value: 7.367e-11
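The next block of output comes from car’s residualPlots(); the call was presumably:
# plots of residuals vs. each predictor and the fitted values, plus curvature tests
residualPlots(chol_mod_1)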
## Test stat Pr(>|Test stat|)
## Sex
## AfrAm
## Tukey test -1.4946 0.135
summary() prints out the information about each \(\beta\), and residualPlots() provides plots of the residuals against the predictors and the fitted values for diagnostic purposes.
Next, let’s get confidence intervals for all coefficients
# Confint() is from the car package; it gives the betas along with confidence intervals
Confint(chol_mod_1)
## Estimate 2.5 % 97.5 %
## (Intercept) 158.17400 153.759502 162.58851
## Sexfemale 16.35890 10.805233 21.91256
## AfrAmAfrAm 11.12251 4.899226 17.34579
We can also return F-tests as in SPSS: we use the car package’s Anova() command (note the capital “A”).
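A sketch of the call:
# Type II F-tests for each predictor
Anova(chol_mod_1)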
## Anova Table (Type II tests)
##
## Response: Chol
## Sum Sq Df F value Pr(>F)
## Sex 82882 1 33.395 9.48e-09 ***
## AfrAm 30513 1 12.294 0.0004704 ***
## Residuals 3112261 1254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now let’s add a numeric predictor: Systolic Blood Pressure
chol_mod_2 <- lm(Chol ~ Sex + AfrAm + SBP1, data = lou_data)
# or run
# chol_mod_2 <- update(chol_mod_1, . ~ . + SBP1)
summary(chol_mod_2)
##
## Call:
## lm(formula = Chol ~ Sex + AfrAm + SBP1, data = lou_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.947 -33.843 -4.287 27.325 297.636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.10494 6.30484 18.732 < 2e-16 ***
## Sexfemale 13.65954 2.80960 4.862 1.31e-06 ***
## AfrAmAfrAm 7.20421 3.16968 2.273 0.0232 *
## SBP1 0.28360 0.04179 6.786 1.77e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.95 on 1253 degrees of freedom
## Multiple R-squared: 0.07068, Adjusted R-squared: 0.06846
## F-statistic: 31.77 on 3 and 1253 DF, p-value: < 2.2e-16
Now let’s see a different way to test the “significance” of SBP1: via a nested model comparison. The F-test returned is a test of the reduction in the residual SS that is achieved by adding SBP1 to the model. Note that the p-value will be identical to the p-value for the t-test of the \(\beta\) for SBP1 in the full model.
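The comparison below was presumably produced by passing both models to anova():
# F-test of the reduction in residual SS from adding SBP1
anova(chol_mod_1, chol_mod_2)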
## Analysis of Variance Table
##
## Model 1: Chol ~ Sex + AfrAm
## Model 2: Chol ~ Sex + AfrAm + SBP1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1254 3112261
## 2 1253 3001929 1 110331 46.052 1.772e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The previous anova() ran an F-test between two nested models. This is essentially what the Anova() command does on a single model: it runs all the different model comparisons that are needed to test the significance of each predictor.
Here’s a demonstration:
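A sketch of the call:
# Type II tests: each predictor is tested by dropping it from the full model
Anova(chol_mod_2)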
## Anova Table (Type II tests)
##
## Response: Chol
## Sum Sq Df F value Pr(>F)
## Sex 56628 1 23.6365 1.311e-06 ***
## AfrAm 12376 1 5.1659 0.0232 *
## SBP1 110331 1 46.0521 1.772e-11 ***
## Residuals 3001929 1253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The top line is a test of Sex as a predictor. The F-value (23.6365) is obtained by comparing models with and without Sex as a predictor, both including the other predictors:
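A sketch of that comparison; chol_mod_noSex is a hypothetical name for the reduced model:
# refit without Sex, then compare to the full model
chol_mod_noSex <- update(chol_mod_2, . ~ . - Sex)
anova(chol_mod_noSex, chol_mod_2)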
## Analysis of Variance Table
##
## Model 1: Chol ~ AfrAm + SBP1
## Model 2: Chol ~ Sex + AfrAm + SBP1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1254 3058557
## 2 1253 3001929 1 56628 23.637 1.311e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the anova() output, we see the same F-ratio and p-value.
Here’s how we get the test of AfrAm:
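A sketch of the comparison, dropping AfrAm this time:
# refit without AfrAm, then compare to the full model
anova(update(chol_mod_2, . ~ . - AfrAm), chol_mod_2)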
## Analysis of Variance Table
##
## Model 1: Chol ~ Sex + SBP1
## Model 2: Chol ~ Sex + AfrAm + SBP1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1254 3014306
## 2 1253 3001929 1 12376 5.1659 0.0232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Anova() command is especially useful when we have factors with 3 or more categories as predictors. Again, it systematically runs the necessary anova() model comparisons by omitting all regressors associated with a factor. Here’s an example we can run by making Age into a categorical variable:
# bin Age into three categories; right = F makes the intervals closed on the left
lou_data$Age_cat <- cut(lou_data$Age, breaks = c(30, 60, 75, 100),
                        labels = c("below 60","60-75","above 75"), right = F)
# check the category counts
xtabs(~ Age_cat, data = lou_data)
## Age_cat
## below 60 60-75 above 75
## 115 515 627
We’ll run a model with Age_cat, Sex, AfrAm, and SBP1 and view the results from both lm() and Anova().
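The fitting code isn’t shown; the Call line gives the formula, and chol_mod_3 is a hypothetical name for the model:
# add the 3-level age factor to the model
chol_mod_3 <- lm(Chol ~ Age_cat + Sex + AfrAm + SBP1, data = lou_data)
summary(chol_mod_3)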
##
## Call:
## lm(formula = Chol ~ Age_cat + Sex + AfrAm + SBP1, data = lou_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -139.985 -32.150 -4.189 26.910 293.043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 127.98734 7.43761 17.208 < 2e-16 ***
## Age_cat60-75 -8.76306 5.10351 -1.717 0.086215 .
## Age_catabove 75 -17.01641 5.11115 -3.329 0.000896 ***
## Sexfemale 15.94878 2.85812 5.580 2.94e-08 ***
## AfrAmAfrAm 5.20772 3.20992 1.622 0.104974
## SBP1 0.29331 0.04167 7.038 3.20e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.7 on 1251 degrees of freedom
## Multiple R-squared: 0.08154, Adjusted R-squared: 0.07787
## F-statistic: 22.21 on 5 and 1251 DF, p-value: < 2.2e-16
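The dummy-coding matrix below shows how R expands the 3-level Age_cat factor into regressors; it was presumably produced by:
# how R codes Age_cat as two dummy regressors
contrasts(lou_data$Age_cat)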
## 60-75 above 75
## below 60 0 0
## 60-75 1 0
## above 75 0 1
Above we see that we need 2 regressors to include the Age_cat variable in the model. These regressors are dummy regressors, and the coefficients give the difference in Chol between the comparison group (below 60) and each of the two other age groups, adjusting for Sex, AfrAm, and SBP1. This isn’t always what we want. If instead we simply want to test for some main effect of age, i.e., that at least one age group differs from the others, we can get that with Anova():
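A sketch of the call (using the chol_mod_3 name from above); the warning that follows is triggered by the “-” in the “60-75” level name and is harmless:
# Type II tests: the 2-df F-test for Age_cat tests the main effect of age
Anova(chol_mod_3)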
## Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
## arithmetic operators in their names;
## the printed representation of the hypothesis will be omitted
## Anova Table (Type II tests)
##
## Response: Chol
## Sum Sq Df F value Pr(>F)
## Age_cat 35087 2 7.3973 0.0006401 ***
## Sex 73847 1 31.1383 2.943e-08 ***
## AfrAm 6242 1 2.6321 0.1049737
## SBP1 117486 1 49.5391 3.197e-12 ***
## Residuals 2966842 1251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F and p-values for Age_cat are those obtained with this model comparison:
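A sketch of the comparison (again using the hypothetical chol_mod_3 name):
# 2-df comparison: models with and without Age_cat
anova(chol_mod_2, chol_mod_3)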
## Analysis of Variance Table
##
## Model 1: Chol ~ Sex + AfrAm + SBP1
## Model 2: Chol ~ Age_cat + Sex + AfrAm + SBP1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1253 3001929
## 2 1251 2966842 2 35087 7.3973 0.0006401 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The lm() command is the workhorse of both regression and ANOVA in R, and its syntax is pretty standard across other R functions. We will later use lmer(), which allows for multiple levels (random effects), and we’ll see that the syntax is similar.
Don’t forget to try to run all of this code yourself to gain practice!