Recently, I wrote a summary of some illustrative IRT analyses for my students. I quickly realized that this might be of interest to others as well, so I am posting a tutorial for the Rasch model and the 2PL model in R here. It is meant for people with a basic understanding of these models, who have heard terms like ICC or item difficulty before, and who would like to see a practical, worked example. The code may be copied and applied to your own data. Let me know if anything needs further clarification.
Setup
Herein, we will use the following three R packages: eRm (Mair & Hatzinger, 2007), ltm (Rizopoulos, 2006), and difR (Magis, Béland, Tuerlinckx, & De Boeck, 2010). These need to be installed first (if necessary) and then loaded via library().
# install.packages("eRm")
# install.packages("ltm")
# install.packages("difR")
library("eRm")
library("ltm")
library("difR")
Data
An illustrative data set on verbal aggression will be used, which comprises 24 dichotomized items answered by 316 persons. The data set, called verbal, is included in the difR package and can be loaded using the function data(). Documentation for the data is available via ?verbal.
data(verbal, package = "difR")
# ?verbal
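Before fitting any models, it may be worth a quick look at the data. Besides the 24 items, verbal contains an Anger score and a Gender variable (see ?verbal); the latter will serve as the grouping variable for the DIF analyses below.
dim(verbal)           # 316 rows; 24 items plus the Anger and Gender columns
table(verbal$Gender)  # Gender = 1 identifies the men (the later focal group)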
Rasch Model
A Rasch model is fit to the data using conditional maximum likelihood (CML) estimation of the item parameters, as provided by the function RM() of the eRm package.
The item difficulty parameters are returned, and the output shows that S2WantCurse is the easiest item and S3DoShout the most difficult item, with which only people with high levels of verbal aggression are expected to agree.
dat_1 <- verbal[, 1:24]
res_rm_1 <- RM(dat_1)
# res_rm_1 # Short summary of item parameters
# summary(res_rm_1) # Longer summary of item parameters
betas <- -coef(res_rm_1) # Item difficulty parameters
round(sort(betas), 2)
#> beta S2WantCurse beta S1DoCurse beta S1wantCurse beta S4WantCurse
#> -1.91 -1.38 -1.38 -1.25
#> beta S2DoCurse beta S4DoCurse beta S2WantScold beta S1WantScold
#> -1.04 -0.87 -0.87 -0.73
#> beta S3WantCurse beta S1DoScold beta S1WantShout beta S2WantShout
#> -0.70 -0.56 -0.25 -0.18
#> beta S2DoScold beta S3DoCurse beta S4WantScold beta S4DoScold
#> -0.11 0.04 0.18 0.21
#> beta S3WantScold beta S1DoShout beta S4WantShout beta S2DoShout
#> 0.51 0.70 0.87 1.31
#> beta S3DoScold beta S3WantShout beta S4DoShout beta S3DoShout
#> 1.33 1.36 1.84 2.87
Plots
We may plot the ICC of a single item or of all items, which is just a graphical illustration of the estimated item parameters.
plotICC(res_rm_1, item.subset = "S2WantShout")
abline(v = -0.18, col = "grey")
abline(h = .5, col = "grey")
plotjointICC(res_rm_1, item.subset = 1:12, cex = .6)
plotjointICC(res_rm_1, item.subset = 13:24, cex = .6)
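As a small illustration of what plotICC() draws, the Rasch ICC can also be computed by hand: the probability of endorsing an item is the logistic function of the difference between person parameter and item difficulty, so it equals .5 exactly at \(\theta = \beta\) (the grey lines above for S2WantShout with \(\beta = -0.18\)).
theta <- seq(-4, 4, length.out = 101)
p <- plogis(theta - (-0.18))  # P(X = 1) = exp(theta - beta) / (1 + exp(theta - beta))
plot(theta, p, type = "l", xlab = "Latent dimension", ylab = "P(X = 1)")
abline(v = -0.18, h = 0.5, col = "grey")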
From the plots of the ICCs, it seems that the “want” items (items 1:12) are easier than the “do” items (items 13:24). We may further investigate this by comparing the means of the item difficulties, which shows that this is indeed the case. (The two means sum to 0 because of the identification constraint; see below.)
mean(betas[ 1:12]) # Want-items
#> [1] -0.3622058
mean(betas[13:24]) # Do-items
#> [1] 0.3622058
Furthermore, a very informative plot called a person-item map is available in eRm:
A person-item map displays the location of item (and threshold) parameters as well as the distribution of person parameters along the latent dimension. Person-item maps are useful to compare the range and position of the item measure distribution (lower panel) to the range and position of the person measure distribution (upper panel). Items should ideally be located along the whole scale to meaningfully measure the ‘ability’ of all persons. (From ?plotPImap)
plotPImap(res_rm_1, cex.gen = .55)
plotPImap(res_rm_1, cex.gen = .55, sorted = TRUE)
Model Identification
The model needs an identification constraint, because otherwise the location of the latent scale is undetermined. When looking at the help (?RM), we can see that the identification constraint in RM() is a sum-to-zero constraint by default (sum0 = TRUE). We can verify that by calculating the sum of the item parameters.
round(sum(betas), 10)
#> [1] 0
When setting sum0 = FALSE, the first item parameter is set to zero. These are the only two options implemented in RM().
tmp1 <- RM(dat_1, sum0 = FALSE)
round(coef(tmp1), 2)
#> beta S1wantCurse beta S1WantScold beta S1WantShout beta S2WantCurse
#> 0.00 -0.65 -1.13 0.53
#> beta S2WantScold beta S2WantShout beta S3WantCurse beta S3WantScold
#> -0.51 -1.20 -0.69 -1.90
#> beta S3WantShout beta S4WantCurse beta S4WantScold beta S4WantShout
#> -2.74 -0.14 -1.56 -2.25
#> beta S1DoCurse beta S1DoScold beta S1DoShout beta S2DoCurse
#> 0.00 -0.83 -2.08 -0.35
#> beta S2DoScold beta S2DoShout beta S3DoCurse beta S3DoScold
#> -1.27 -2.70 -1.42 -2.72
#> beta S3DoShout beta S4DoCurse beta S4DoScold beta S4DoShout
#> -4.25 -0.51 -1.60 -3.22
Note on Item Parameters in eRm Package
Note that the eRm package works with easiness parameters, which are called beta; these can be transformed into difficulty parameters, which are called eta, via \(\beta = -\eta\).
Note further that often only \(I-1\) item parameters are returned (here 23), which illustrates the fact that only 23 item parameters can be estimated and that one has to be fixed, either as \(\beta_1 = -\sum_{i=2}^{I} \beta_i\) (if sum0 = TRUE) or as \(\beta_1 = 0\) (if sum0 = FALSE). The first item parameter can then be calculated (not estimated) accordingly.
# No difficulty (eta) parameter is returned for the first item S1wantCurse
summary(res_rm_1)
#>
#> Results of RM estimation:
#>
#> Call: RM(X = dat_1)
#>
#> Conditional log-likelihood: -3049.923
#> Number of iterations: 31
#> Number of parameters: 23
#>
#> Item (Category) Difficulty Parameters (eta): with 0.95 CI:
#> Estimate Std. Error lower CI upper CI
#> S1WantScold -0.731 0.131 -0.987 -0.475
#> S1WantShout -0.249 0.128 -0.501 0.003
#> S2WantCurse -1.909 0.153 -2.210 -1.608
#> S2WantScold -0.873 0.132 -1.132 -0.614
#> S2WantShout -0.181 0.128 -0.433 0.070
#> S3WantCurse -0.696 0.130 -0.951 -0.440
#> S3WantScold 0.514 0.132 0.254 0.773
#> S3WantShout 1.358 0.149 1.065 1.650
#> S4WantCurse -1.245 0.137 -1.514 -0.976
#> S4WantScold 0.178 0.129 -0.076 0.432
#> S4WantShout 0.871 0.138 0.601 1.141
#> S1DoCurse -1.383 0.140 -1.658 -1.109
#> S1DoScold -0.557 0.129 -0.810 -0.303
#> S1DoShout 0.698 0.135 0.434 0.963
#> S2DoCurse -1.037 0.134 -1.300 -0.774
#> S2DoScold -0.113 0.128 -0.365 0.138
#> S2DoShout 1.312 0.148 1.022 1.602
#> S3DoCurse 0.040 0.129 -0.212 0.293
#> S3DoScold 1.335 0.149 1.044 1.626
#> S3DoShout 2.871 0.222 2.436 3.306
#> S4DoCurse -0.873 0.132 -1.132 -0.614
#> S4DoScold 0.213 0.130 -0.042 0.467
#> S4DoShout 1.840 0.165 1.516 2.165
#>
#> Item Easiness Parameters (beta) with 0.95 CI:
#> Estimate Std. Error lower CI upper CI
#> beta S1wantCurse 1.383 0.140 1.109 1.658
#> beta S1WantScold 0.731 0.131 0.475 0.987
#> beta S1WantShout 0.249 0.128 -0.003 0.501
#> beta S2WantCurse 1.909 0.153 1.608 2.210
#> beta S2WantScold 0.873 0.132 0.614 1.132
#> beta S2WantShout 0.181 0.128 -0.070 0.433
#> beta S3WantCurse 0.696 0.130 0.440 0.951
#> beta S3WantScold -0.514 0.132 -0.773 -0.254
#> beta S3WantShout -1.358 0.149 -1.650 -1.065
#> beta S4WantCurse 1.245 0.137 0.976 1.514
#> beta S4WantScold -0.178 0.129 -0.432 0.076
#> beta S4WantShout -0.871 0.138 -1.141 -0.601
#> beta S1DoCurse 1.383 0.140 1.109 1.658
#> beta S1DoScold 0.557 0.129 0.303 0.810
#> beta S1DoShout -0.698 0.135 -0.963 -0.434
#> beta S2DoCurse 1.037 0.134 0.774 1.300
#> beta S2DoScold 0.113 0.128 -0.138 0.365
#> beta S2DoShout -1.312 0.148 -1.602 -1.022
#> beta S3DoCurse -0.040 0.129 -0.293 0.212
#> beta S3DoScold -1.335 0.149 -1.626 -1.044
#> beta S3DoShout -2.871 0.222 -3.306 -2.436
#> beta S4DoCurse 0.873 0.132 0.614 1.132
#> beta S4DoScold -0.213 0.130 -0.467 0.042
#> beta S4DoShout -1.840 0.165 -2.165 -1.516
# Calculate eta for item S1wantCurse
-sum(res_rm_1$etapar)
#> [1] -1.383377
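We can also verify the relation \(\beta = -\eta\) for all items directly, using the betapar and etapar elements of the fitted object (see the value section of ?RM); the sum of the easiness and difficulty parameters should be zero for every item.
etas <- c(-sum(res_rm_1$etapar), res_rm_1$etapar)  # eta for all 24 items
round(res_rm_1$betapar + etas, 10)                 # beta + eta = 0 throughout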
MML Estimation
In contrast to the CML estimation above, the model is now fit via marginal maximum likelihood (MML) estimation using the function rasch() from the ltm package. Therein, the model is identified by assuming a standard normal distribution for the person parameters. Thus, all item parameters can be freely estimated because the mean of the person parameters is fixed to 0 (which was not the case with CML).
res_rm_2 <- rasch(dat_1)
res_rm_2
#>
#> Call:
#> rasch(data = dat_1)
#>
#> Coefficients:
#> Dffclt.S1wantCurse Dffclt.S1WantScold Dffclt.S1WantShout
#> -0.889 -0.409 -0.055
#> Dffclt.S2WantCurse Dffclt.S2WantScold Dffclt.S2WantShout
#> -1.274 -0.513 -0.005
#> Dffclt.S3WantCurse Dffclt.S3WantScold Dffclt.S3WantShout
#> -0.383 0.505 1.120
#> Dffclt.S4WantCurse Dffclt.S4WantScold Dffclt.S4WantShout
#> -0.787 0.259 0.767
#> Dffclt.S1DoCurse Dffclt.S1DoScold Dffclt.S1DoShout
#> -0.888 -0.281 0.641
#> Dffclt.S2DoCurse Dffclt.S2DoScold Dffclt.S2DoShout
#> -0.634 0.045 1.087
#> Dffclt.S3DoCurse Dffclt.S3DoScold Dffclt.S3DoShout
#> 0.158 1.103 2.176
#> Dffclt.S4DoCurse Dffclt.S4DoScold Dffclt.S4DoShout
#> -0.513 0.285 1.464
#> Dscrmn
#> 1.368
#>
#> Log.Lik: -4037.402
Note further that rasch() not only fixed the mean of the person-parameter distribution but also its variance. Thus, the variability, that is, the units of the latent scale, is determined by this. With CML estimation, this was instead fixed by assuming discrimination parameters equal to 1. However, imposing both constraints (the variance of the person-parameter distribution as well as the discrimination parameter) would be overly restrictive, and thus a common discrimination parameter is freely estimated by rasch(), which here equals 1.37. Don’t get distracted by these technical issues; the software takes care of them.
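As an aside, if one preferred to fix the common discrimination to 1 under MML as well (so that the prior variance alone defines the scale), rasch() offers a constraint argument for this; the row index ncol(dat_1) + 1 refers to the discrimination parameter (see ?rasch).
# Rasch model via MML with the discrimination parameter fixed to 1
res_rm_2b <- rasch(dat_1, constraint = cbind(ncol(dat_1) + 1, 1))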
The difficulty parameters estimated via MML and CML are nearly identical here, highlighting the fact that the normal-distribution assumption in MML did no harm.
cor(coef(res_rm_2)[, 1], betas)
#> [1] 0.9999381
2PL Model
A 2PL model cannot be estimated with eRm, but it can be fit with the ltm package, for example.
The estimated difficulty parameters (Dffclt) seem roughly comparable to those from the Rasch model. The discrimination parameters (Dscrmn), which were all equal in the Rasch model, show some but not massive heterogeneity. This is also evident from the plot of the ICCs.
Note that res_2pl_1 and res_rm_1 were estimated using two different packages. Thus, the structure of these objects is completely different, as is their printed output and the functions you may use with them (e.g., plot() vs. plotICC()).
# ltm() takes a model formula with a tilde (~) as its first argument. The
# left-hand side must be the data, and the right-hand side specifies the
# latent variables (only one, z1, here).
res_2pl_1 <- ltm(dat_1 ~ z1)
res_2pl_1
#>
#> Call:
#> ltm(formula = dat_1 ~ z1)
#>
#> Coefficients:
#> Dffclt Dscrmn
#> S1wantCurse -0.913 1.358
#> S1WantScold -0.409 1.525
#> S1WantShout -0.078 1.340
#> S2WantCurse -1.243 1.468
#> S2WantScold -0.500 1.567
#> S2WantShout -0.026 1.249
#> S3WantCurse -0.531 0.880
#> S3WantScold 0.475 1.409
#> S3WantShout 1.455 0.912
#> S4WantCurse -0.906 1.129
#> S4WantScold 0.214 1.588
#> S4WantShout 0.946 0.967
#> S1DoCurse -0.821 1.666
#> S1DoScold -0.251 2.272
#> S1DoShout 0.605 1.420
#> S2DoCurse -0.630 1.475
#> S2DoScold 0.009 1.987
#> S2DoShout 0.968 1.625
#> S3DoCurse 0.164 1.089
#> S3DoScold 1.098 1.342
#> S3DoShout 2.457 1.126
#> S4DoCurse -0.537 1.363
#> S4DoScold 0.255 1.430
#> S4DoShout 1.592 1.180
#>
#> Log.Lik: -4017.413
# ?plot.ltm
plot(res_2pl_1, items = 1:12)
plot(res_2pl_1, items = 13:24)
Model Fit
Relative Fit of Rasch and 2PL Model
To compare the Rasch and the 2PL model, both need to be fit using the same assumptions and the same algorithm. Therefore, the comparison is made here based on the models estimated via MML.
Likelihood-Ratio Test
Since the two models are nested, they can be compared using a Likelihood-Ratio (LR) test. Here, the test is significant (p = .015), indicating that the two models do not fit equally well and that the more complex 2PL model is preferred.
Note that a model with more parameters always leads to a higher likelihood, which is also the case here. Note further that the degrees of freedom (df) for the LR test equal the number of constrained parameters: in the 2PL model, 24 discrimination parameters were estimated, but only 1 in the Rasch model, which gives df = 23.
anova(res_rm_2, res_2pl_1)
#>
#> Likelihood Ratio Table
#> AIC BIC log.Lik LRT df p.value
#> res_rm_2 8124.80 8218.7 -4037.40
#> res_2pl_1 8130.83 8311.1 -4017.41 39.98 23 0.015
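The LR statistic and its p value can be reproduced by hand from the two log-likelihoods shown in the table:
lrt <- 2 * (-4017.41 - (-4037.40))        # = 39.98
pchisq(lrt, df = 23, lower.tail = FALSE)  # = .015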
AIC and BIC
To compare two models descriptively, they need not necessarily be nested. With respect to AIC and BIC, lower values indicate better fit. Herein, the Rasch model is preferred by both criteria.
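Both criteria can likewise be computed by hand from the log-likelihoods: the Rasch model (as fit via MML) used 25 parameters (24 difficulties plus 1 common discrimination), the 2PL used 48.
n  <- nrow(dat_1)                            # 316 persons
ll <- c(rasch = -4037.40, twopl = -4017.41)  # log-likelihoods from above
k  <- c(rasch = 25, twopl = 48)              # number of estimated parameters
-2 * ll + 2 * k       # AIC: 8124.8 and 8130.8
-2 * ll + log(n) * k  # BIC: 8218.7 and 8311.1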
Comparison of Item Parameters
To further evaluate differences between the two models, the difficulty parameters can be correlated across models. Again, the extremely high correlation indicates that the difference between the models is small.
cor(coef(res_rm_2)[, 1], coef(res_2pl_1)[, 1])
#> [1] 0.9942499
Absolute Fit of the Rasch Model
In the remainder of this subsection, a few procedures for assessing the fit of the Rasch model will be illustrated.
Several of them are based on the idea that the item parameters—estimated separately in two groups—should be equal if the model holds. For illustrative purposes, the variable sex is used here but other/additional variables can of course be used.
Andersen’s Likelihood-Ratio Test and Graphical Model Check
The item parameters should be equal if the model holds and should thus lie close to the line in the following plot. The ellipses indicate the confidence intervals, and many of them overlap the line, indicating good fit. A few items stand out, for example item 6 (S2WantShout), which is easier for women than for men.
The significance test that formalizes this idea is Andersen’s Likelihood-Ratio test (Andersen, 1973). Herein, it is significant, indicating bad fit due to too many items with diverging item difficulties.
lrt_1 <- LRtest(res_rm_1, splitcr = verbal$Gender)
plotGOF(lrt_1, conf = list(), tlab = "number",
xlab = "Women", ylab = "Men")
lrt_1
#>
#> Andersen LR-test:
#> LR-value: 70.693
#> Chi-square df: 23
#> p-value: 0
Removing badly fitting items should improve fit, and this is indeed the case here when item 6 is removed: The test statistic is now smaller, even though still significant.
tmp1 <- RM(dat_1[, -6])
LRtest(tmp1, splitcr = verbal$Gender)
#>
#> Andersen LR-test:
#> LR-value: 59.533
#> Chi-square df: 22
#> p-value: 0
Wald Test
The Wald test is also based on the comparison of two groups; however, it is carried out for each individual item. Again, misfit is indicated for item 6, among others. It is interesting to note that the z-statistic is mostly negative for do-items and positive for want-items. Negative values indicate that the item is easier for the second group, namely, men. Thus, it seems easier for men compared to women to “do”, that is, to show, verbal aggression.
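Concretely, the statistic for each item takes the usual Wald form, based on the group-specific CML estimates and their standard errors: \[ z_i = \frac{\hat{\beta}_i^{(1)} - \hat{\beta}_i^{(2)}}{\sqrt{SE\big(\hat{\beta}_i^{(1)}\big)^2 + SE\big(\hat{\beta}_i^{(2)}\big)^2}}. \]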
Waldtest(res_rm_1, splitcr = verbal$Gender)
#>
#> Wald test on item level (z-values):
#>
#> z-statistic p-value
#> beta S1wantCurse 1.402 0.161
#> beta S1WantScold 1.306 0.192
#> beta S1WantShout 1.408 0.159
#> beta S2WantCurse 2.028 0.043
#> beta S2WantScold 1.595 0.111
#> beta S2WantShout 3.260 0.001
#> beta S3WantCurse 0.518 0.604
#> beta S3WantScold -1.304 0.192
#> beta S3WantShout 1.579 0.114
#> beta S4WantCurse 1.518 0.129
#> beta S4WantScold -0.146 0.884
#> beta S4WantShout 2.038 0.042
#> beta S1DoCurse -0.675 0.500
#> beta S1DoScold -1.963 0.050
#> beta S1DoShout 0.789 0.430
#> beta S2DoCurse -2.610 0.009
#> beta S2DoScold -2.898 0.004
#> beta S2DoShout -0.368 0.713
#> beta S3DoCurse -2.645 0.008
#> beta S3DoScold -2.218 0.027
#> beta S3DoShout -0.773 0.440
#> beta S4DoCurse -1.315 0.189
#> beta S4DoScold -1.547 0.122
#> beta S4DoShout 1.046 0.295
Infit and Outfit
Infit and outfit statistics are not based on the comparison of groups but on residuals. Descriptively, the MSQ values should be, for instance, within the range from 0.7 to 1.3 (Wright & Linacre, 1994). A t-test is also reported for infit and outfit, and the t-value should conventionally be within the range from -1.96 to +1.96. Here, the previously flagged item 6 performs quite well, but other items do not, for example, item 14 (S1DoScold). The eRm package also provides a plot of the infit t-values, called a Pathway Map, which is also available for persons instead of items (i.e., plotPWmap(res_rm_1, pmap = TRUE, imap = FALSE)).
# To calculate infit/outfit via itemfit(), the person parameters have to be
# estimated first
pp_ml_1 <- person.parameter(res_rm_1)
itemfit(pp_ml_1)
#>
#> Itemfit Statistics:
#> Chisq df p-value Outfit MSQ Infit MSQ Outfit t Infit t
#> S1wantCurse 333.752 306 0.132 1.087 0.973 0.58 -0.36
#> S1WantScold 285.433 306 0.795 0.930 0.959 -0.58 -0.69
#> S1WantShout 303.008 306 0.538 0.987 0.981 -0.09 -0.33
#> S2WantCurse 231.914 306 0.999 0.755 0.976 -1.18 -0.25
#> S2WantScold 274.211 306 0.904 0.893 0.950 -0.85 -0.83
#> S2WantShout 296.628 306 0.639 0.966 1.001 -0.30 0.04
#> S3WantCurse 370.936 306 0.006 1.208 1.138 1.73 2.27
#> S3WantScold 267.323 306 0.946 0.871 0.956 -1.10 -0.72
#> S3WantShout 399.960 306 0.000 1.303 1.097 1.58 1.19
#> S4WantCurse 298.519 306 0.609 0.972 1.051 -0.14 0.77
#> S4WantScold 297.046 306 0.633 0.968 0.929 -0.27 -1.27
#> S4WantShout 366.629 306 0.010 1.194 1.082 1.35 1.21
#> S1DoCurse 256.352 306 0.982 0.835 0.895 -1.04 -1.53
#> S1DoScold 226.531 306 1.000 0.738 0.837 -2.59 -3.01
#> S1DoShout 294.109 306 0.677 0.958 0.956 -0.29 -0.68
#> S2DoCurse 301.914 306 0.555 0.983 0.951 -0.08 -0.77
#> S2DoScold 246.872 306 0.994 0.804 0.891 -2.01 -2.02
#> S2DoShout 267.965 306 0.943 0.873 0.910 -0.69 -1.14
#> S3DoCurse 345.909 306 0.058 1.127 1.069 1.21 1.23
#> S3DoScold 263.468 306 0.962 0.858 1.006 -0.77 0.10
#> S3DoShout 1001.107 306 0.000 3.261 0.986 3.70 -0.04
#> S4DoCurse 285.345 306 0.796 0.929 0.970 -0.54 -0.49
#> S4DoScold 289.620 306 0.741 0.943 0.996 -0.50 -0.05
#> S4DoShout 312.841 306 0.382 1.019 1.035 0.16 0.38
plotPWmap(res_rm_1)
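If you want to apply the 0.7 to 1.3 rule of thumb programmatically, the MSQ values can be extracted from the object returned by itemfit(). (The element names i.outfitMSQ and i.infitMSQ are an assumption based on the structure of eRm’s “ifit” objects; check str(itemfit(pp_ml_1)) if in doubt.)
ifit_1 <- itemfit(pp_ml_1)
names(which(ifit_1$i.outfitMSQ < 0.7 | ifit_1$i.outfitMSQ > 1.3))  # e.g., S3DoShout
names(which(ifit_1$i.infitMSQ  < 0.7 | ifit_1$i.infitMSQ  > 1.3))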
DIF
Herein, differential item functioning (DIF), also known as item bias or a violation of measurement invariance, is treated as a topic within model fit, but it could also be treated as a topic of its own given its high relevance in the IRT literature.
Above, we already compared item parameters across groups from a fit perspective. However, if the two groups are regarded as focal and reference group, these procedures can also be interpreted from a DIF perspective. Thus, the Wald test indicated that item 6 showed DIF, namely, uniform DIF.
Mantel-Haenszel
The Mantel-Haenszel method is a non-parametric approach to DIF. Herein, item 6 is again flagged as having significant DIF (p = .002) with a large effect size of \(\Delta_{MH} = -2.49\). The negative effect size indicates that the item is harder for the focal group, namely, men, mirroring the results from above. The plot gives a compact summary across all items.
tmp1 <- difMH(dat_1, group = verbal$Gender, focal.name = 1)
tmp1
#>
#> Detection of Differential Item Functioning using Mantel-Haenszel method
#> with continuity correction and without item purification
#>
#> Results based on asymptotic inference
#>
#> Matching variable: test score
#>
#> No set of anchor items was provided
#>
#> No p-value adjustment for multiple comparisons
#>
#> Mantel-Haenszel Chi-square statistic:
#>
#> Stat. P-value
#> S1wantCurse 1.7076 0.1913
#> S1WantScold 2.1486 0.1427
#> S1WantShout 0.9926 0.3191
#> S2WantCurse 1.9302 0.1647
#> S2WantScold 2.9540 0.0857 .
#> S2WantShout 9.6032 0.0019 **
#> S3WantCurse 0.0013 0.9711
#> S3WantScold 0.6752 0.4112
#> S3WantShout 0.8185 0.3656
#> S4WantCurse 1.6292 0.2018
#> S4WantScold 0.0152 0.9020
#> S4WantShout 4.1188 0.0424 *
#> S1DoCurse 0.1324 0.7160
#> S1DoScold 2.7501 0.0972 .
#> S1DoShout 0.0683 0.7938
#> S2DoCurse 6.3029 0.0121 *
#> S2DoScold 6.8395 0.0089 **
#> S2DoShout 0.2170 0.6414
#> S3DoCurse 5.7817 0.0162 *
#> S3DoScold 3.8880 0.0486 *
#> S3DoShout 0.2989 0.5846
#> S4DoCurse 1.1220 0.2895
#> S4DoScold 1.4491 0.2287
#> S4DoShout 0.8390 0.3597
#>
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Detection threshold: 3.8415 (significance level: 0.05)
#>
#> Items detected as DIF items:
#>
#> S2WantShout
#> S4WantShout
#> S2DoCurse
#> S2DoScold
#> S3DoCurse
#> S3DoScold
#>
#>
#> Effect size (ETS Delta scale):
#>
#> Effect size code:
#> 'A': negligible effect
#> 'B': moderate effect
#> 'C': large effect
#>
#> alphaMH deltaMH
#> S1wantCurse 1.7005 -1.2476 B
#> S1WantScold 1.7702 -1.3420 B
#> S1WantShout 1.4481 -0.8701 A
#> S2WantCurse 1.9395 -1.5567 C
#> S2WantScold 1.9799 -1.6052 C
#> S2WantShout 2.8804 -2.4861 C
#> S3WantCurse 0.9439 0.1358 A
#> S3WantScold 0.7194 0.7741 A
#> S3WantShout 1.5281 -0.9965 A
#> S4WantCurse 1.6849 -1.2260 B
#> S4WantScold 1.0901 -0.2028 A
#> S4WantShout 2.3458 -2.0036 C
#> S1DoCurse 0.7967 0.5340 A
#> S1DoScold 0.4995 1.6313 C
#> S1DoShout 1.1765 -0.3821 A
#> S2DoCurse 0.3209 2.6709 C
#> S2DoScold 0.3746 2.3072 C
#> S2DoShout 0.7931 0.5447 A
#> S3DoCurse 0.4616 1.8165 C
#> S3DoScold 0.4727 1.7606 C
#> S3DoShout 0.6373 1.0585 B
#> S4DoCurse 0.6444 1.0327 B
#> S4DoScold 0.6385 1.0541 B
#> S4DoShout 1.6053 -1.1123 B
#>
#> Effect size codes: 0 'A' 1.0 'B' 1.5 'C'
#> (for absolute values of 'deltaMH')
plot(tmp1)
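As an aside, the deltaMH effect size is just a rescaling of the Mantel-Haenszel common odds ratio \(\alpha_{MH}\) onto the ETS delta scale, \(\Delta_{MH} = -2.35\,\ln(\alpha_{MH})\), which we can verify for item 6:
-2.35 * log(2.8804)  # alphaMH of S2WantShout from above; approx. -2.49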
Lord’s Chi-Square Test
Lord’s \(\chi^2\) test allows testing for non-uniform (crossing) DIF, because a 2PL (or even 3PL) model is estimated in each group, rather than the Rasch model that was used for the Wald test.
Here, item 6 still has a relatively high \(\chi^2\) value, but it is non-significant (p = .075). The ICCs indicate that item 6 is a little bit harder for men and also a little bit less discriminating (uniform and non-uniform DIF), though neither effect is significant.
tmp1 <- difLord(dat_1, group = verbal$Gender, focal.name = 1, model = "2PL")
# tmp1 <- difLord(dat_1, group = verbal$Gender, focal.name = 1, model = "1PL", discr = NULL)
tmp1
#>
#> Detection of Differential Item Functioning using Lord's method
#> with 2PL model and without item purification
#>
#> Engine 'ltm' for item parameter estimation
#>
#> No set of anchor items was provided
#>
#> No p-value adjustment for multiple comparisons
#>
#> Lord's chi-square statistic:
#>
#> Stat. P-value
#> S1wantCurse 1.3458 0.5102
#> S1WantScold 2.3185 0.3137
#> S1WantShout 0.8788 0.6444
#> S2WantCurse 4.5812 0.1012
#> S2WantScold 3.3121 0.1909
#> S2WantShout 5.1932 0.0745 .
#> S3WantCurse 1.1636 0.5589
#> S3WantScold 2.2611 0.3229
#> S3WantShout 1.0015 0.6061
#> S4WantCurse 2.8886 0.2359
#> S4WantScold 0.6412 0.7257
#> S4WantShout 2.1526 0.3409
#> S1DoCurse 1.8026 0.4060
#> S1DoScold 3.7448 0.1538
#> S1DoShout 0.3376 0.8447
#> S2DoCurse 7.6852 0.0214 *
#> S2DoScold 6.9684 0.0307 *
#> S2DoShout 2.7932 0.2474
#> S3DoCurse 5.6040 0.0607 .
#> S3DoScold 5.6378 0.0597 .
#> S3DoShout 0.9704 0.6156
#> S4DoCurse 3.5649 0.1682
#> S4DoScold 4.2514 0.1193
#> S4DoShout 0.4827 0.7856
#>
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Detection threshold: 5.9915 (significance level: 0.05)
#>
#> Items detected as DIF items:
#>
#> S2DoCurse
#> S2DoScold
plot(tmp1)
plot(tmp1, plot = "itemCurve", item = 6)
Person Parameters
An ability parameter for every person can be estimated using different methods, which will be illustrated and compared in the following.
ML
Maximum likelihood (ML) estimation without further assumptions/information is available in the eRm package. ML estimation does not yield an estimate for persons who endorsed none or all of the items. By default, estimates for those persons are calculated via extrapolation (i.e., coef(extrapolated = TRUE)), but this can be disabled.
tmp1 <- person.parameter(res_rm_1)
pp_ml <- coef(tmp1)
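To see which persons are affected, the extrapolation can be switched off via the extrapolated argument of the coef() method mentioned above, in which case persons with an extreme raw score (0 or 24) should receive NA:
pp_ml_raw <- coef(tmp1, extrapolated = FALSE)
sum(is.na(pp_ml_raw))  # number of persons without a proper ML estimate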
MAP and EAP
Maximum a Posteriori (MAP) estimates, also known as Empirical Bayes (EB) estimates, are available in the ltm package, as are the closely related Expected a Posteriori (EAP) estimates. Both assume a normal prior distribution for the person parameters.
The extremely high correlations among the estimates indicate that the three methods lead to the same conclusions for the present data set, and this will often be the case.
pp_map <- factor.scores(res_rm_2, method = "EB", resp.patterns = dat_1)
pp_eap <- factor.scores(res_rm_2, method = "EAP", resp.patterns = dat_1)
tmp1 <- data.frame(ML = pp_ml, MAP = pp_map$score.dat$z1, EAP = pp_eap$score.dat$z1)
round(cor(tmp1), 4)
#> ML MAP EAP
#> ML 1.0000 0.9950 0.9954
#> MAP 0.9950 1.0000 0.9999
#> EAP 0.9954 0.9999 1.0000
Furthermore, the scatter plot of the ML and the MAP estimates illustrates the effect of the prior. The relationship is linear for intermediate person parameters. However, shrinkage towards 0 is observed for extreme person parameters such that the MAP estimates are less extreme than the ML estimates.
plot(tmp1[, 1:2])
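Adding the identity line to this plot makes the shrinkage directly visible: points for persons with extreme ML estimates depart from the line towards 0.
abline(0, 1, col = "grey")  # identity line: MAP estimate = ML estimate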
Furthermore, we may compare the estimates for the Rasch and the 2PL model as a supplement to the model comparison from above. The person parameters are almost identical, highlighting once more that the difference between the two models is not large for the present data set.
pp_2pl <- factor.scores(res_2pl_1, method = "EB", resp.patterns = dat_1)
cor(pp_map$score.dat$z1, pp_2pl$score.dat$z1)
#> [1] 0.9956129
Item and Test Information
Item information curves (IICs) are available in both the eRm (plotINFO()) and the ltm (plot(type = "IIC")) package. Those curves are an illustration of the difficulty and discrimination parameters. Therefore, it is expected that the maximum information is highest for the items with the highest discrimination parameters (e.g., item 14, S1DoScold). It is interesting to see that the do-items generally show larger discrimination parameters, which is often interpreted as their being a closer or better measure of the latent variable.
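For the 2PL, the item information function is \(a_i^2\,P_i(\theta)\,(1 - P_i(\theta))\), which reaches its maximum of \(a_i^2/4\) at \(\theta = b_i\). The maximum information therefore depends only on the discrimination parameter and can be computed directly from the coefficients:
a <- coef(res_2pl_1)[, 2]                        # discrimination (Dscrmn) parameters
round(sort(a^2 / 4, decreasing = TRUE), 2)[1:3]  # S1DoScold provides the most information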
res_2pl_1
#>
#> Call:
#> ltm(formula = dat_1 ~ z1)
#>
#> Coefficients:
#> Dffclt Dscrmn
#> S1wantCurse -0.913 1.358
#> S1WantScold -0.409 1.525
#> S1WantShout -0.078 1.340
#> S2WantCurse -1.243 1.468
#> S2WantScold -0.500 1.567
#> S2WantShout -0.026 1.249
#> S3WantCurse -0.531 0.880
#> S3WantScold 0.475 1.409
#> S3WantShout 1.455 0.912
#> S4WantCurse -0.906 1.129
#> S4WantScold 0.214 1.588
#> S4WantShout 0.946 0.967
#> S1DoCurse -0.821 1.666
#> S1DoScold -0.251 2.272
#> S1DoShout 0.605 1.420
#> S2DoCurse -0.630 1.475
#> S2DoScold 0.009 1.987
#> S2DoShout 0.968 1.625
#> S3DoCurse 0.164 1.089
#> S3DoScold 1.098 1.342
#> S3DoShout 2.457 1.126
#> S4DoCurse -0.537 1.363
#> S4DoScold 0.255 1.430
#> S4DoShout 1.592 1.180
#>
#> Log.Lik: -4017.413
plot(res_2pl_1, items = 1:12, type = "IIC", ylim = c(0, 1.3))
plot(res_2pl_1, items = 13:24, type = "IIC", ylim = c(0, 1.3))
plot(res_2pl_1, items = 0, type = "IIC")
References
Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123–140. https://doi.org/10.1007/BF02291180
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862. https://doi.org/10.3758/BRM.42.3.847
Mair, P., & Hatzinger, R. (2007). Extended Rasch modeling: The eRm package for the application of IRT models in R. Journal of Statistical Software, 20(9), 1–20. https://doi.org/10.18637/jss.v020.i09
Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and item response theory analyses. Journal of Statistical Software, 17(5), 1–25. Retrieved from http://www.jstatsoft.org/v17/i05/
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370. Retrieved from https://www.rasch.org/rmt/rmt83b.htm