lavaan

The R package lavaan was developed by Yves Rosseel to better enable students, researchers, and instructors to perform structural equation modeling in a non-commercial, open-source environment. The leading commercial software options include Mplus, AMOS, LISREL, and EQS. Though lavaan does not provide a graphical user interface (i.e., point and click) as some other options do, it is free and makes it possible to fit complex SEM models with very simple, easy-to-understand syntax.
This lesson has 5 goals:

- install the lavaan package
- learn the basics of lavaan model syntax
- fit a confirmatory factor analysis model
- evaluate model fit using residuals and modification indices
- compare nested models
If you have not already done so, please install lavaan using either the install-packages tool in the Packages tab, or by typing

install.packages("lavaan")

in your console. Remember, you must also load the package with the

library(lavaan)

command.
lavaan syntax

To specify a model in lavaan, there is a special syntax based on the LISREL syntax. We will begin with the example from the lavaan tutorial.
In the R environment, there is a formula interface that underpins all regression models:

Y ~ x1 + x2 + ... + xk
For example, let’s say I want to fit a regression model with health
related quality of life as the dependent variable, and treatment type,
age, and baseline pain as predictors. In R
I’d use the
linear model command lm()
with the following formula:
HRQOL ~ treatment + age + baseline
We read this as “Health related quality of life is regressed on treatment, age, and baseline pain.”
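As a minimal sketch, here is that regression fit on simulated data (the data frame qol, its column values, and the effect sizes are made up purely for illustration):

```r
# hypothetical data standing in for a real HRQOL study
set.seed(42)
n <- 100
qol <- data.frame(
  treatment = rbinom(n, 1, 0.5),             # 0/1 treatment indicator
  age       = rnorm(n, mean = 50, sd = 10),  # years
  baseline  = rnorm(n, mean = 5, sd = 2)     # baseline pain score
)
# generate an outcome so the regression has something to find
qol$HRQOL <- 60 + 5 * qol$treatment - 0.2 * qol$age - 2 * qol$baseline + rnorm(n)

# "HRQOL is regressed on treatment, age, and baseline pain"
fit <- lm(HRQOL ~ treatment + age + baseline, data = qol)
coef(fit)  # an intercept plus one slope per predictor
```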
In latent variable models, we’ll need a number of different ways to relate variables to one another besides simply “is regressed on.”
From the lavaan
tutorial:
Formula Type | Operator | Mnemonic |
---|---|---|
Latent variable definition | =~ | is measured as |
Regression | ~ | is regressed on |
Variances and covariances | ~~ | is correlated with |
Intercept | ~ 1 | setting the intercept |
Suppose I have the following model, with the relationships shown in the path diagram. We would use the lavaan interface in the following way:
# in lavaan
mod1.lav <- '
# regressions
Alienation67 ~ SES
Alienation71 ~ Alienation67 + SES
# latent variables
SES =~ Education + SEI
Alienation67 =~ Anomia67 + Powerless67
Alienation71 =~ Anomia71 + Powerless71
# optional variance or covariance specifications
Anomia67 ~~ Anomia71
Powerless67 ~~ Powerless71
# optional intercepts
SES ~ 1
'
Note the use of comments (#) inside the lavaan model string.
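With raw data on the six observed variables in hand, we would fit the model above with lavaan’s sem() function. A sketch (alien is a hypothetical data frame name, assumed for illustration):

```r
# fit the model string mod1.lav defined above;
# `alien` is a hypothetical data frame with columns Education, SEI,
# Anomia67, Powerless67, Anomia71, Powerless71
fit1 <- sem(mod1.lav, data = alien)
summary(fit1, fit.measures = TRUE)
```

If only summary statistics are available, sem() can instead take a covariance matrix via its sample.cov, sample.nobs, and sample.mean arguments.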
It is even easier to specify and fit confirmatory factor analysis models in lavaan, as the basic versions include only the latent variable equations, what we usually call the measurement model.
For example, let’s look at the data packaged with lavaan from Holzinger and Swineford’s (1939) well-known and oft-used study of mental tests in children. From the help file ?HolzingerSwineford1939:
The classic Holzinger and Swineford (1939) dataset consists of mental ability test scores of seventh- and eighth-grade children from two different schools (Pasteur and Grant-White). In the original dataset (available in the MBESS package), there are scores for 26 tests. However, a smaller subset with 9 variables is more widely used in the literature (for example in Joreskog’s 1969 paper, which also uses the 145 subjects from the Grant-White school only).
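The data-frame structure listed below comes from R’s str() function; a call like this reproduces it:

```r
library(lavaan)              # the data set ships with lavaan
str(HolzingerSwineford1939)  # 301 children, 15 variables
```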
## 'data.frame': 301 obs. of 15 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 11 ...
## $ sex : int 1 2 2 1 2 2 1 2 2 2 ...
## $ ageyr : int 13 13 13 13 12 14 12 12 13 12 ...
## $ agemo : int 1 7 1 2 2 1 1 2 0 5 ...
## $ school: Factor w/ 2 levels "Grant-White",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ grade : int 7 7 7 7 7 7 7 7 7 7 ...
## $ x1 : num 3.33 5.33 4.5 5.33 4.83 ...
## $ x2 : num 7.75 5.25 5.25 7.75 4.75 5 6 6.25 5.75 5.25 ...
## $ x3 : num 0.375 2.125 1.875 3 0.875 ...
## $ x4 : num 2.33 1.67 1 2.67 2.67 ...
## $ x5 : num 5.75 3 1.75 4.5 4 3 6 4.25 5.75 5 ...
## $ x6 : num 1.286 1.286 0.429 2.429 2.571 ...
## $ x7 : num 3.39 3.78 3.26 3 3.7 ...
## $ x8 : num 5.75 6.25 3.9 5.3 6.3 6.65 6.2 5.15 4.65 4.55 ...
## $ x9 : num 6.36 7.92 4.42 4.86 5.92 ...
Key: x1 = visual perception, x2 = cubes, x3 = lozenges, x4 = paragraph comprehension, x5 = sentence completion, x6 = word meaning, x7 = speeded addition, x8 = speeded counting of dots, x9 = speeded discrimination of straight and curved capitals.
Let’s look at the correlation matrix of the 9 indicators. Can you see any patterns? Try looking at the intercorrelations between tests 1,2,3 and between 4,5,6, and 7,8,9:
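The matrix below can be reproduced with a call along these lines (rounded to two decimals):

```r
library(lavaan)  # provides the HolzingerSwineford1939 data
round(cor(HolzingerSwineford1939[, paste0("x", 1:9)]), 2)
```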
## x1 x2 x3 x4 x5 x6 x7 x8 x9
## x1 1.00 0.30 0.44 0.37 0.29 0.36 0.07 0.22 0.39
## x2 0.30 1.00 0.34 0.15 0.14 0.19 -0.08 0.09 0.21
## x3 0.44 0.34 1.00 0.16 0.08 0.20 0.07 0.19 0.33
## x4 0.37 0.15 0.16 1.00 0.73 0.70 0.17 0.11 0.21
## x5 0.29 0.14 0.08 0.73 1.00 0.72 0.10 0.14 0.23
## x6 0.36 0.19 0.20 0.70 0.72 1.00 0.12 0.15 0.21
## x7 0.07 -0.08 0.07 0.17 0.10 0.12 1.00 0.49 0.34
## x8 0.22 0.09 0.19 0.11 0.14 0.15 0.49 1.00 0.45
## x9 0.39 0.21 0.33 0.21 0.23 0.21 0.34 0.45 1.00
Though a more complicated model was tested in the original paper, these data were used by Joreskog in his seminal paper on confirmatory factor analysis and are often used as a standard to look at ways of estimating simple CFAs. The figure below shows the hypothesized factor structure:
All we need to do to fit this model is specify the latent structure using lavaan’s syntax and run the following sequence of commands (we assume the library(lavaan) command has already been run):
# specify the factor structure using lavaan syntax and save it
holzmod <- "
# latent variables
Visual =~ x1 + x2 + x3
Textual =~ x4 + x5 + x6
Speed =~ x7 + x8 + x9
"
# using the cfa command to fit the model
holzfit <- cfa(holzmod, data = HolzingerSwineford1939)
We’ve already seen how the syntax works. Here we’re only using the
=~
and +
operators as that’s all we need to
define latent variables. There are no “structural” regressions.
Then, the cfa()
command takes that text and fits a model
using the data we specify. A few things are important to mention about
this function:
- cfa() uses the “scale by indicator” specification to make the model identifiable
- cfa() is actually a wrapper for the more flexible sem() function in the same package, which is used to fit more elaborate structural equation models.

Now, let’s see what the results look like. Fortunately, there is a convenient summary() method for this model class. It breaks the results of the model fit into several parts:
summary(holzfit, fit.measures = TRUE, standardized = TRUE)

## lavaan 0.6-19 ended normally after 35 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 21
##
## Number of observations 301
##
## Model Test User Model:
##
## Test statistic 85.306
## Degrees of freedom 24
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 918.852
## Degrees of freedom 36
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.931
## Tucker-Lewis Index (TLI) 0.896
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -3737.745
## Loglikelihood unrestricted model (H1) -3695.092
##
## Akaike (AIC) 7517.490
## Bayesian (BIC) 7595.339
## Sample-size adjusted Bayesian (SABIC) 7528.739
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.092
## 90 Percent confidence interval - lower 0.071
## 90 Percent confidence interval - upper 0.114
## P-value H_0: RMSEA <= 0.050 0.001
## P-value H_0: RMSEA >= 0.080 0.840
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.065
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Visual =~
## x1 1.000
## x2 0.554 0.100 5.554 0.000
## x3 0.729 0.109 6.685 0.000
## Textual =~
## x4 1.000
## x5 1.113 0.065 17.014 0.000
## x6 0.926 0.055 16.703 0.000
## Speed =~
## x7 1.000
## x8 1.180 0.165 7.152 0.000
## x9 1.082 0.151 7.155 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Visual ~~
## Textual 0.408 0.074 5.552 0.000
## Speed 0.262 0.056 4.660 0.000
## Textual ~~
## Speed 0.173 0.049 3.518 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 0.549 0.114 4.833 0.000
## .x2 1.134 0.102 11.146 0.000
## .x3 0.844 0.091 9.317 0.000
## .x4 0.371 0.048 7.779 0.000
## .x5 0.446 0.058 7.642 0.000
## .x6 0.356 0.043 8.277 0.000
## .x7 0.799 0.081 9.823 0.000
## .x8 0.488 0.074 6.573 0.000
## .x9 0.566 0.071 8.003 0.000
## Visual 0.809 0.145 5.564 0.000
## Textual 0.979 0.112 8.737 0.000
## Speed 0.384 0.086 4.451 0.000
You should note that in the model output we did not receive unstandardized estimates of the factor loadings for items x1, x4, or x7. This is because we used them to set the scale of measurement of the latent variables. There is another way to do this: set each latent variable to have a variance of 1. The consequence is that we then no longer have an estimate of the latent-variable variance. The two methods will yield identical fits, but one may be preferred over the other for different reasons.
holzmod2 <- "
# three-factor model
visual =~ NA*x1 + x2 + x3
textual =~ NA*x4 + x5 + x6
speed =~ NA*x7 + x8 + x9
# fix variance of the factors
visual ~~ 1*visual
textual ~~ 1*textual
speed ~~ 1*speed
"
holzfit2 <- cfa(holzmod2, data = HolzingerSwineford1939)
# first model
fitmeasures(holzfit)[c("chisq","cfi","tli")]
##      chisq        cfi        tli
## 85.3055218  0.9305597  0.8958395
# second model
fitmeasures(holzfit2)[c("chisq","cfi","tli")]
##      chisq        cfi        tli
## 85.3055218  0.9305597  0.8958395
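The full parameter table below appears to come from lavaan’s parameterEstimates() function; a sketch that produces it (values rounded to three decimals for display):

```r
# full parameter table for the fixed-variance model
pe <- parameterEstimates(holzfit2)
num <- c("est", "se", "z", "pvalue", "ci.lower", "ci.upper")
pe[num] <- round(pe[num], 3)
pe
```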
## lhs op rhs est se z pvalue ci.lower ci.upper
## 1 visual =~ x1 0.900 0.081 11.128 0 0.741 1.058
## 2 visual =~ x2 0.498 0.077 6.429 0 0.346 0.650
## 3 visual =~ x3 0.656 0.074 8.817 0 0.510 0.802
## 4 textual =~ x4 0.990 0.057 17.474 0 0.879 1.101
## 5 textual =~ x5 1.102 0.063 17.576 0 0.979 1.224
## 6 textual =~ x6 0.917 0.054 17.082 0 0.811 1.022
## 7 speed =~ x7 0.619 0.070 8.903 0 0.483 0.756
## 8 speed =~ x8 0.731 0.066 11.090 0 0.602 0.860
## 9 speed =~ x9 0.670 0.065 10.305 0 0.543 0.797
## 10 visual ~~ visual 1.000 0.000 NA NA 1.000 1.000
## 11 textual ~~ textual 1.000 0.000 NA NA 1.000 1.000
## 12 speed ~~ speed 1.000 0.000 NA NA 1.000 1.000
## 13 x1 ~~ x1 0.549 0.114 4.833 0 0.326 0.772
## 14 x2 ~~ x2 1.134 0.102 11.146 0 0.934 1.333
## 15 x3 ~~ x3 0.844 0.091 9.317 0 0.667 1.022
## 16 x4 ~~ x4 0.371 0.048 7.779 0 0.278 0.465
## 17 x5 ~~ x5 0.446 0.058 7.642 0 0.332 0.561
## 18 x6 ~~ x6 0.356 0.043 8.277 0 0.272 0.441
## 19 x7 ~~ x7 0.799 0.081 9.823 0 0.640 0.959
## 20 x8 ~~ x8 0.488 0.074 6.573 0 0.342 0.633
## 21 x9 ~~ x9 0.566 0.071 8.003 0 0.427 0.705
## 22 visual ~~ textual 0.459 0.064 7.189 0 0.334 0.584
## 23 visual ~~ speed 0.471 0.073 6.461 0 0.328 0.613
## 24 textual ~~ speed 0.283 0.069 4.117 0 0.148 0.418
There are several ways of critiquing a model. First, one can look at
the local and global fits. The global fit is given via the
summary()
command, but for small models, we can also look
at the residual matrix, which should have small entries if the fit is
good. Remember, all we’re doing is comparing between our actual
(observed) covariance matrix among the x’s and the one “implied” by the
model.
We’ll look just at the normalized residuals. These are like z-scores, and we will look for residuals greater than or equal to \(\pm 1.96\); though, since these tests are not independent, we would expect some values to exceed that threshold even in well-fitting models.
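lavaan computes these for us; a call like the following produces the matrix shown below:

```r
# normalized residuals: scaled differences between the observed and
# model-implied covariance matrices
resid(holzfit, type = "normalized")
```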
## x1 x2 x3 x4 x5 x6 x7 x8 x9
## x1 0.000
## x2 -1.996 0.000
## x3 -0.997 2.689 0.000
## x4 2.679 -0.284 -1.899 0.000
## x5 -0.359 -0.591 -4.157 1.545 0.000
## x6 2.155 0.681 -0.711 -2.588 0.942 0.000
## x7 -3.773 -3.654 -1.858 0.865 -0.842 -0.326 0.000
## x8 -1.380 -1.119 -0.300 -2.021 -1.099 -0.641 4.823 0.000
## x9 4.077 1.606 3.518 1.225 1.701 1.423 -2.325 -4.132 0.000
A more efficient method for figuring out how we might update our model is to use the so-called modification indices. Remember, we currently have 24 degrees of freedom, which means our model is over-identified and we could free some more parameters. Modification indices are a series of tests in which a single parameter, currently fixed to \(0\), is allowed to be estimated. Each test tells us how much model fit would improve, via the change in the \(\chi^{2}\), if we added that parameter to the model and refit it.
For very large models, there are many parameters, and often many small modification indices. Here, we’ll ask only for those \(\geq 10\).
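A call like this yields the table below (note that sort., with the trailing dot, is lavaan’s argument name):

```r
# modification indices, keeping only those with MI >= 10, largest first
modindices(holzfit, minimum.value = 10, sort. = TRUE)
```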
## lhs op rhs mi epc sepc.lv sepc.all sepc.nox
## 30 Visual =~ x9 36.411 0.577 0.519 0.515 0.515
## 76 x7 ~~ x8 34.145 0.536 0.536 0.859 0.859
## 28 Visual =~ x7 18.631 -0.422 -0.380 -0.349 -0.349
## 78 x8 ~~ x9 14.946 -0.423 -0.423 -0.805 -0.805
The first row shows us what we saw in the residual matrix, that perhaps x9 should be made to load on the visual factor as well as the speed factor. Let’s refit our model and see what happens:
# specify the factor structure using lavaan syntax and save it
holzmod3 <- "
# latent variables
Visual =~ x1 + x2 + x3 + x9
Textual =~ x4 + x5 + x6
Speed =~ x7 + x8 + x9
"
# fit the model
holzfit3 <- cfa(holzmod3, data = HolzingerSwineford1939)
# see results
summary(holzfit3)
## lavaan 0.6-19 ended normally after 34 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 22
##
## Number of observations 301
##
## Model Test User Model:
##
## Test statistic 52.382
## Degrees of freedom 23
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Visual =~
## x1 1.000
## x2 0.578 0.098 5.918 0.000
## x3 0.754 0.103 7.291 0.000
## x9 0.437 0.081 5.367 0.000
## Textual =~
## x4 1.000
## x5 1.115 0.066 17.016 0.000
## x6 0.926 0.056 16.685 0.000
## Speed =~
## x7 1.000
## x8 1.207 0.185 6.540 0.000
## x9 0.675 0.112 6.037 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Visual ~~
## Textual 0.396 0.072 5.506 0.000
## Speed 0.177 0.055 3.239 0.001
## Textual ~~
## Speed 0.136 0.051 2.675 0.007
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 0.576 0.100 5.731 0.000
## .x2 1.120 0.100 11.153 0.000
## .x3 0.830 0.087 9.515 0.000
## .x9 0.558 0.060 9.336 0.000
## .x4 0.373 0.048 7.800 0.000
## .x5 0.444 0.058 7.602 0.000
## .x6 0.357 0.043 8.285 0.000
## .x7 0.740 0.086 8.595 0.000
## .x8 0.375 0.094 3.973 0.000
## Visual 0.783 0.134 5.842 0.000
## Textual 0.978 0.112 8.728 0.000
## Speed 0.444 0.097 4.567 0.000
OK, we see that x9 loads significantly on the Visual factor, albeit with a smaller loading than the others. Let’s test whether we improved the fit.
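The chi-squared difference test below comes from comparing the two fitted models with lavaan’s anova() method (a wrapper around lavTestLRT()):

```r
# likelihood-ratio / chi-squared difference test of the nested models
anova(holzfit3, holzfit)
```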
##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## holzfit3 23 7486.6 7568.1 52.382
## holzfit 24 7517.5 7595.3 85.305 32.923 0.32567 1 9.586e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the original model is nested in the new one, we can compare them using the difference between the \(\chi^{2}\) values. What this test actually does is test the null hypothesis that the model does not get worse when we set the new parameter to zero. The test is significant, which says that constraining x9 to load only on Speed worsens the fit.
We have now learned the basic steps to perform confirmatory factor analysis. There is more to the story, but this is a big chunk of it!
Next week, we will tackle the thorny issue of measurement scale of our indicators, and whether we need to be careful when we switch from normally distributed indicators to discrete variables.