What is lavaan?

The R package lavaan was developed by Yves Rosseel to better enable students, researchers, and instructors to perform structural equation modeling in a non-commercial, open-source environment.

The leading commercial software options include Mplus, AMOS, LISREL, and EQS.

Though it doesn’t provide a graphical user interface (i.e., point and click) as some other options do, it is free and fits complex SEM models with simple, easy-to-understand syntax.

This lesson has 5 goals:

  1. Learn lavaan syntax
  2. Fit simple confirmatory factor analysis models
  3. Understand the output
  4. Use the output for inference
  5. Learn how to critique and compare models

Installing lavaan

If you have not already done so, please install lavaan using either the Install Packages tool in the Packages tab, or by typing install.packages("lavaan") in your console.

Remember, you must also load the package with the library(lavaan) command.

1. lavaan syntax

The Formula Interface

To specify a model in lavaan there is a special syntax, based on the LISREL syntax. We will begin with the example from the lavaan tutorial.

In the R environment, there is a formula interface that underpins all regression models:

Y ~ x1 + x2 + x3 + ... + xk

For example, let’s say I want to fit a regression model with health related quality of life as the dependent variable, and treatment type, age, and baseline pain as predictors. In R I’d use the linear model command lm() with the following formula:

HRQOL ~ treatment + age + baseline

We read this as “Health related quality of life is regressed on treatment, age, and baseline pain.”
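As a concrete, self-contained illustration, here is that model fit to simulated data; the data frame dat and its coefficients are invented for this example:

```r
# simulate a small hypothetical dataset for illustration
set.seed(1)
dat <- data.frame(
  treatment = rbinom(100, 1, 0.5),   # 0/1 treatment indicator
  age       = rnorm(100, 50, 10),
  baseline  = rnorm(100, 5, 2)       # baseline pain
)
dat$HRQOL <- 2 + 0.5 * dat$treatment - 0.02 * dat$age +
  0.3 * dat$baseline + rnorm(100)

# HRQOL is regressed on treatment, age, and baseline pain
fit <- lm(HRQOL ~ treatment + age + baseline, data = dat)
coef(fit)   # intercept plus three slopes
```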

Augmenting the tilde

In latent variable models, we’ll need a number of different ways to relate variables to one another besides simply “is regressed on.”

From the lavaan tutorial:

Formula type                  Operator   Mnemonic
Latent variable definition    =~         is measured by
Regression                    ~          is regressed on
Variances and covariances     ~~         is correlated with
Intercept                     ~ 1        setting the intercept
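To see the operators side by side, here is a toy model string; the names f and y1 through y4 are made up for illustration:

```r
# toy lavaan model string, one line per operator (variable names invented)
toymod <- '
  f  =~ y1 + y2 + y3   # latent variable definition: f is measured by y1-y3
  y4 ~  f              # regression: y4 is regressed on f
  y1 ~~ y2             # covariance: residuals of y1 and y2 are correlated
  y4 ~  1              # intercept of y4 is estimated
'
```

Note that the model is just a character string; lavaan parses it only when a fitting function is called.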

Full Example

Suppose I have the following model:

  1. Three latent variables: Alienation in 1967, Alienation in 1971, SES
  2. Six indicators: Education, SEI, Anomia in 1967, Anomia in 1971, Powerlessness in 1967, Powerlessness in 1971

And the following relationships in the model:

We would use the lavaan interface in the following way:

# in lavaan
mod1.lav <- '
# regressions
  Alienation67 ~ SES
  Alienation71 ~ Alienation67 + SES

# latent variables
  SES =~ Education + SEI
  Alienation67 =~ Anomia67 + Powerless67
  Alienation71 =~ Anomia71 + Powerless71

# optional variance or covariance specifications
  Anomia67 ~~ Anomia71
  Powerless67 ~~ Powerless71

# optional intercepts
  SES ~ 1

'

Note that lavaan model strings can include comments, marked with #.

Specifying a basic CFA

It is even easier to specify and fit confirmatory factor analysis models in lavaan, since the basic versions include only the latent variable equations: what we usually call the measurement model.

For example, let’s look at the data that are packaged with lavaan from Holzinger and Swineford’s (1939) well-known and oft-used study of mental tests in children. From the help file `?HolzingerSwineford1939`:

The classic Holzinger and Swineford (1939) dataset consists of mental ability test scores of seventh- and eighth-grade children from two different schools (Pasteur and Grant-White). In the original dataset (available in the MBESS package), there are scores for 26 tests. However, a smaller subset with 9 variables is more widely used in the literature (for example in Joreskog’s 1969 paper, which also uses the 145 subjects from the Grant-White school only).

str(HolzingerSwineford1939)
## 'data.frame':    301 obs. of  15 variables:
##  $ id    : int  1 2 3 4 5 6 7 8 9 11 ...
##  $ sex   : int  1 2 2 1 2 2 1 2 2 2 ...
##  $ ageyr : int  13 13 13 13 12 14 12 12 13 12 ...
##  $ agemo : int  1 7 1 2 2 1 1 2 0 5 ...
##  $ school: Factor w/ 2 levels "Grant-White",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ grade : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ x1    : num  3.33 5.33 4.5 5.33 4.83 ...
##  $ x2    : num  7.75 5.25 5.25 7.75 4.75 5 6 6.25 5.75 5.25 ...
##  $ x3    : num  0.375 2.125 1.875 3 0.875 ...
##  $ x4    : num  2.33 1.67 1 2.67 2.67 ...
##  $ x5    : num  5.75 3 1.75 4.5 4 3 6 4.25 5.75 5 ...
##  $ x6    : num  1.286 1.286 0.429 2.429 2.571 ...
##  $ x7    : num  3.39 3.78 3.26 3 3.7 ...
##  $ x8    : num  5.75 6.25 3.9 5.3 6.3 6.65 6.2 5.15 4.65 4.55 ...
##  $ x9    : num  6.36 7.92 4.42 4.86 5.92 ...

Key:

  1. x1 = Visual Perception
  2. x2 = Cubes
  3. x3 = Lozenges
  4. x4 = Paragraph Comprehension
  5. x5 = Sentence Completion
  6. x6 = Word Meaning
  7. x7 = Speeded addition
  8. x8 = Speeded counting of dots
  9. x9 = Speeded discrimination of straight and curved capitals

Let’s look at the correlation matrix of the 9 indicators. Can you see any patterns? Try looking at the intercorrelations among tests 1, 2, and 3; among 4, 5, and 6; and among 7, 8, and 9:

##      x1    x2   x3   x4   x5   x6    x7   x8   x9
## x1 1.00  0.30 0.44 0.37 0.29 0.36  0.07 0.22 0.39
## x2 0.30  1.00 0.34 0.15 0.14 0.19 -0.08 0.09 0.21
## x3 0.44  0.34 1.00 0.16 0.08 0.20  0.07 0.19 0.33
## x4 0.37  0.15 0.16 1.00 0.73 0.70  0.17 0.11 0.21
## x5 0.29  0.14 0.08 0.73 1.00 0.72  0.10 0.14 0.23
## x6 0.36  0.19 0.20 0.70 0.72 1.00  0.12 0.15 0.21
## x7 0.07 -0.08 0.07 0.17 0.10 0.12  1.00 0.49 0.34
## x8 0.22  0.09 0.19 0.11 0.14 0.15  0.49 1.00 0.45
## x9 0.39  0.21 0.33 0.21 0.23 0.21  0.34 0.45 1.00
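A correlation matrix like the one above can be produced with base R; the indicator columns are x1 through x9 in the packaged data:

```r
library(lavaan)  # provides the HolzingerSwineford1939 data

# correlations of the nine indicators, rounded to two decimals
round(cor(HolzingerSwineford1939[, paste0("x", 1:9)]), 2)
```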

2. Fitting the model

Though a more complicated model was tested in the original paper, these data were used by Joreskog in his seminal paper on confirmatory factor analysis, and they are often used as a standard example for estimating simple CFAs. The figure below shows the hypothesized factor structure:

All we need to do to fit this model is specify the latent structure using lavaan’s syntax and run the following commands (we assume library(lavaan) has already been run):

# specify the factor structure using lavaan syntax and save it

holzmod <- "
  # latent variables
  Visual =~ x1 + x2 + x3
  Textual =~ x4 + x5 + x6
  Speed =~ x7 + x8 + x9

"

# using the cfa command to fit the model

holzfit <- cfa(holzmod, data = HolzingerSwineford1939)

We’ve already seen how the syntax works. Here we’re only using the =~ and + operators as that’s all we need to define latent variables. There are no “structural” regressions.

Then, the cfa() command takes that text and fits a model using the data we specify. A few things are important to mention about this function:

  1. By default, cfa() uses the “marker indicator” specification (fixing the first loading of each factor to 1) to make the model identifiable
  2. All relevant variances and covariances are automatically added. If we wanted to constrain them (i.e., make the factors uncorrelated, set variances = 1, etc.) we would need more explicit statements in the model text
  3. The default estimation is maximum likelihood, which is all we’ll consider this week. There are many others, two of which we’ll talk about next week
  4. cfa() is actually a wrapper function for the more flexible sem() function in the same package, which is used to fit more elaborate structural equation models.
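Since cfa() wraps sem(), fitting the same model string with sem() gives an identical solution here; a quick self-contained check:

```r
library(lavaan)

# same three-factor measurement model as in the text
holzmod <- "
  Visual  =~ x1 + x2 + x3
  Textual =~ x4 + x5 + x6
  Speed   =~ x7 + x8 + x9
"

fit_cfa <- cfa(holzmod, data = HolzingerSwineford1939)
fit_sem <- sem(holzmod, data = HolzingerSwineford1939)

# both apply the same defaults for this model, so the test statistics match
fitmeasures(fit_cfa, "chisq")
fitmeasures(fit_sem, "chisq")
```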

3 and 4. Interpreting and using the output

Now, let’s see what the results look like. Fortunately, there is a convenient summary() method for this model class. It breaks the results of the model fit into several parts:

  1. Information about the fitting process
  2. The test statistic (a \(\chi^{2}\) value) for the specified model (Test User Model)
  3. A test statistic for the baseline (null) model, used mainly in computing the comparative fit indices
  4. Comparative fit indices
  5. Loglikelihood-based info
  6. RMSEA and SRMR
  7. Parameter estimates (unstandardized in this output; to see standardized estimates, add the argument standardized = TRUE)

summary(holzfit, fit.measures = TRUE)
## lavaan 0.6-19 ended normally after 35 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        21
## 
##   Number of observations                           301
## 
## Model Test User Model:
##                                                       
##   Test statistic                                85.306
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               918.852
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.931
##   Tucker-Lewis Index (TLI)                       0.896
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3737.745
##   Loglikelihood unrestricted model (H1)      -3695.092
##                                                       
##   Akaike (AIC)                                7517.490
##   Bayesian (BIC)                              7595.339
##   Sample-size adjusted Bayesian (SABIC)       7528.739
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.092
##   90 Percent confidence interval - lower         0.071
##   90 Percent confidence interval - upper         0.114
##   P-value H_0: RMSEA <= 0.050                    0.001
##   P-value H_0: RMSEA >= 0.080                    0.840
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.065
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Visual =~                                           
##     x1                1.000                           
##     x2                0.554    0.100    5.554    0.000
##     x3                0.729    0.109    6.685    0.000
##   Textual =~                                          
##     x4                1.000                           
##     x5                1.113    0.065   17.014    0.000
##     x6                0.926    0.055   16.703    0.000
##   Speed =~                                            
##     x7                1.000                           
##     x8                1.180    0.165    7.152    0.000
##     x9                1.082    0.151    7.155    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Visual ~~                                           
##     Textual           0.408    0.074    5.552    0.000
##     Speed             0.262    0.056    4.660    0.000
##   Textual ~~                                          
##     Speed             0.173    0.049    3.518    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.549    0.114    4.833    0.000
##    .x2                1.134    0.102   11.146    0.000
##    .x3                0.844    0.091    9.317    0.000
##    .x4                0.371    0.048    7.779    0.000
##    .x5                0.446    0.058    7.642    0.000
##    .x6                0.356    0.043    8.277    0.000
##    .x7                0.799    0.081    9.823    0.000
##    .x8                0.488    0.074    6.573    0.000
##    .x9                0.566    0.071    8.003    0.000
##     Visual            0.809    0.145    5.564    0.000
##     Textual           0.979    0.112    8.737    0.000
##     Speed             0.384    0.086    4.451    0.000

Changing or respecifying a model

You should note that in the model output, we did not receive unstandardized estimates of factor loadings for items x1, x4, or x7. This is because we used them to set the scale of measurement of the latent variables. There is another way to do this: fix each latent variable’s variance to 1. The consequence is that we will then no longer have an estimate of the latent-variable variance. The two methods yield identical fits, but one may be preferred over the other for different reasons.

holzmod2 <- "
# three-factor model
  visual =~ NA*x1 + x2 + x3
  textual =~ NA*x4 + x5 + x6
  speed   =~ NA*x7 + x8 + x9
# fix variance of the factors
    visual ~~ 1*visual
    textual ~~ 1*textual
    speed ~~ 1*speed
    
"

holzfit2 <- cfa(holzmod2, data = HolzingerSwineford1939)

# second model
fitmeasures(holzfit2)[c("chisq","cfi","tli")]
##      chisq        cfi        tli 
## 85.3055218  0.9305597  0.8958395
# first model
fitmeasures(holzfit)[c("chisq","cfi","tli")]
##      chisq        cfi        tli 
## 85.3055218  0.9305597  0.8958395
parameterestimates(holzfit2)
##        lhs op     rhs   est    se      z pvalue ci.lower ci.upper
## 1   visual =~      x1 0.900 0.081 11.128      0    0.741    1.058
## 2   visual =~      x2 0.498 0.077  6.429      0    0.346    0.650
## 3   visual =~      x3 0.656 0.074  8.817      0    0.510    0.802
## 4  textual =~      x4 0.990 0.057 17.474      0    0.879    1.101
## 5  textual =~      x5 1.102 0.063 17.576      0    0.979    1.224
## 6  textual =~      x6 0.917 0.054 17.082      0    0.811    1.022
## 7    speed =~      x7 0.619 0.070  8.903      0    0.483    0.756
## 8    speed =~      x8 0.731 0.066 11.090      0    0.602    0.860
## 9    speed =~      x9 0.670 0.065 10.305      0    0.543    0.797
## 10  visual ~~  visual 1.000 0.000     NA     NA    1.000    1.000
## 11 textual ~~ textual 1.000 0.000     NA     NA    1.000    1.000
## 12   speed ~~   speed 1.000 0.000     NA     NA    1.000    1.000
## 13      x1 ~~      x1 0.549 0.114  4.833      0    0.326    0.772
## 14      x2 ~~      x2 1.134 0.102 11.146      0    0.934    1.333
## 15      x3 ~~      x3 0.844 0.091  9.317      0    0.667    1.022
## 16      x4 ~~      x4 0.371 0.048  7.779      0    0.278    0.465
## 17      x5 ~~      x5 0.446 0.058  7.642      0    0.332    0.561
## 18      x6 ~~      x6 0.356 0.043  8.277      0    0.272    0.441
## 19      x7 ~~      x7 0.799 0.081  9.823      0    0.640    0.959
## 20      x8 ~~      x8 0.488 0.074  6.573      0    0.342    0.633
## 21      x9 ~~      x9 0.566 0.071  8.003      0    0.427    0.705
## 22  visual ~~ textual 0.459 0.064  7.189      0    0.334    0.584
## 23  visual ~~   speed 0.471 0.073  6.461      0    0.328    0.613
## 24 textual ~~   speed 0.283 0.069  4.117      0    0.148    0.418
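As an aside, cfa() also provides the std.lv = TRUE argument, which applies this unit-variance identification without the NA* and 1* syntax; a sketch:

```r
library(lavaan)

# same three-factor measurement model as in the text
holzmod <- "
  Visual  =~ x1 + x2 + x3
  Textual =~ x4 + x5 + x6
  Speed   =~ x7 + x8 + x9
"

# std.lv = TRUE frees all loadings and fixes latent variances to 1
holzfit2b <- cfa(holzmod, data = HolzingerSwineford1939, std.lv = TRUE)

# the overall fit is unchanged by the identification choice
fitmeasures(holzfit2b)[c("chisq", "cfi", "tli")]
```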

5. Critiquing and comparing models

There are several ways of critiquing a model. First, one can look at local and global fit. Global fit is given via the summary() command, but for small models we can also look at the residual matrix, which should have small entries if the fit is good. Remember, all we’re doing is comparing our actual (observed) covariance matrix among the x’s with the one “implied” by the model.

We’ll look just at the normalized residuals. These are like z-scores, and we will look for residuals at or beyond \(\pm 1.96\). Since these tests are not independent, though, we would expect some values to exceed that threshold even in well-fitting models.

lavResiduals(holzfit)$cov.z
##        x1     x2     x3     x4     x5     x6     x7     x8     x9
## x1  0.000                                                        
## x2 -1.996  0.000                                                 
## x3 -0.997  2.689  0.000                                          
## x4  2.679 -0.284 -1.899  0.000                                   
## x5 -0.359 -0.591 -4.157  1.545  0.000                            
## x6  2.155  0.681 -0.711 -2.588  0.942  0.000                     
## x7 -3.773 -3.654 -1.858  0.865 -0.842 -0.326  0.000              
## x8 -1.380 -1.119 -0.300 -2.021 -1.099 -0.641  4.823  0.000       
## x9  4.077  1.606  3.518  1.225  1.701  1.423 -2.325 -4.132  0.000

A more efficient method for figuring out how we might update our model is to use the so-called modification indices. Remember, we currently have 24 degrees of freedom, which means our model is overidentified, so we have room to free some more parameters. Modification indices are a series of tests in which a single parameter, currently fixed to \(0\), is allowed to be estimated. Each test tells us the improvement in model fit (via the change in \(\chi^{2}\)) we would get if we added that parameter to the model and refit it.

For very large models, there are many parameters, and often many small modification indices. Here, we’ll ask only for those \(\geq 10\).

modificationindices(holzfit, minimum.value = 10, sort = TRUE)
##       lhs op rhs     mi    epc sepc.lv sepc.all sepc.nox
## 30 Visual =~  x9 36.411  0.577   0.519    0.515    0.515
## 76     x7 ~~  x8 34.145  0.536   0.536    0.859    0.859
## 28 Visual =~  x7 18.631 -0.422  -0.380   -0.349   -0.349
## 78     x8 ~~  x9 14.946 -0.423  -0.423   -0.805   -0.805

The first row shows us what we saw in the residual matrix, that perhaps x9 should be made to load on the visual factor as well as the speed factor. Let’s refit our model and see what happens:

# specify the factor structure using lavaan syntax and save it

holzmod3 <- "
  # latent variables
  Visual =~ x1 + x2 + x3 + x9
  Textual =~ x4 + x5 + x6
  Speed =~ x7 + x8 + x9

"

# fit the model

holzfit3 <- cfa(holzmod3, data = HolzingerSwineford1939)

# see results

summary(holzfit3)
## lavaan 0.6-19 ended normally after 34 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        22
## 
##   Number of observations                           301
## 
## Model Test User Model:
##                                                       
##   Test statistic                                52.382
##   Degrees of freedom                                23
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Visual =~                                           
##     x1                1.000                           
##     x2                0.578    0.098    5.918    0.000
##     x3                0.754    0.103    7.291    0.000
##     x9                0.437    0.081    5.367    0.000
##   Textual =~                                          
##     x4                1.000                           
##     x5                1.115    0.066   17.016    0.000
##     x6                0.926    0.056   16.685    0.000
##   Speed =~                                            
##     x7                1.000                           
##     x8                1.207    0.185    6.540    0.000
##     x9                0.675    0.112    6.037    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Visual ~~                                           
##     Textual           0.396    0.072    5.506    0.000
##     Speed             0.177    0.055    3.239    0.001
##   Textual ~~                                          
##     Speed             0.136    0.051    2.675    0.007
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.576    0.100    5.731    0.000
##    .x2                1.120    0.100   11.153    0.000
##    .x3                0.830    0.087    9.515    0.000
##    .x9                0.558    0.060    9.336    0.000
##    .x4                0.373    0.048    7.800    0.000
##    .x5                0.444    0.058    7.602    0.000
##    .x6                0.357    0.043    8.285    0.000
##    .x7                0.740    0.086    8.595    0.000
##    .x8                0.375    0.094    3.973    0.000
##     Visual            0.783    0.134    5.842    0.000
##     Textual           0.978    0.112    8.728    0.000
##     Speed             0.444    0.097    4.567    0.000

OK, we see that x9 loads significantly on the Visual factor, albeit with a smaller loading than the other indicators. Let’s test whether we improved the fit:

anova(holzfit, holzfit3)
## 
## Chi-Squared Difference Test
## 
##          Df    AIC    BIC  Chisq Chisq diff   RMSEA Df diff Pr(>Chisq)    
## holzfit3 23 7486.6 7568.1 52.382                                          
## holzfit  24 7517.5 7595.3 85.305     32.923 0.32567       1  9.586e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the original model is nested in the new one, we can compare the two using the difference between their \(\chi^{2}\) values. This test evaluates the null hypothesis that the model does not get worse when we fix the new parameter to zero. The test is significant, which says that constraining the model so that x9 loads only on Speed worsens the fit.
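The difference test can be reproduced by hand from the two test statistics reported above (using the rounded printed values, so the result is approximate):

```r
# chi-square difference test by hand
chisq_diff <- 85.306 - 52.382   # test statistics of the two models
df_diff    <- 24 - 23           # difference in degrees of freedom

# p-value for the difference under the null hypothesis of equal fit
pchisq(chisq_diff, df = df_diff, lower.tail = FALSE)
```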

Wrapping Up

We have now learned the basic steps to perform confirmatory factor analysis. There is more to the story, but this is a big chunk of it!

Next week, we will tackle the thorny issue of measurement scale of our indicators, and whether we need to be careful when we switch from normally distributed indicators to discrete variables.