Part 2: CTT and CRT Test and Item Analysis

Overview

Teaching: 30 min
Exercises: 25 min
Questions
  • How do I conduct basic CTT/CRT item analyses?

  • How do I investigate the reliability/dependability of a test?

  • How do I extract indices of interest for reporting and analysis?

Objectives
  • Conduct classical test theory item and test analysis using psych.

  • Use rcrtan to carry out criterion-referenced test and item analyses.

  • Use functions from dplyr and tidyr to carry out analyses on results.

We start by loading the required packages.

library(tidyverse)
library(psych)
library(CTT)
library(rcrtan)

If it is not still in your workspace, load the data we saved in the previous lesson.

test_results_1 <- read_csv('data/placement_1.csv') %>%
  mutate(country = as.factor(country)) # changes the country variable to a factor
Parsed with column specification:
cols(
  .default = col_double(),
  names = col_character(),
  country = col_character(),
  admin_date = col_datetime(format = "")
)
See spec(...) for full column specifications.

Preparing the data

Our skills with dplyr and tidyr will be useful for prepping the data for analysis. Two of the three test analysis packages we are working with (psych and CTT) require only item-level data. The third (rcrtan) sometimes requires a column of total test scores in addition to the item-level data.

Let’s prep the data for the former two packages first.

place_ctt <- test_results_1 %>%
  select(q1_list_mi:q70_read_det_an) # keep only the item-level columns
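
Because rcrtan sometimes needs a column of total test scores, we could compute one by summing the item columns. A minimal sketch (the column name total is our own choice, and the step is only necessary if the data does not already contain a total):

place_totals <- test_results_1 %>%
  mutate(total = rowSums(select(., q1_list_mi:q70_read_det_an))) # row-wise sum of the item columns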

Classical test theory analysis

Now that we have the items, we are ready to carry out some analyses. We will start with a CTT analysis using the psych package. First, let’s take a look at what arguments our function requires. The three we are most concerned with are x, keys, and delete.

?psych::alpha
ctt_res <- psych::alpha(place_ctt, delete = FALSE) # we want to retain the items even if the indices cannot be estimated
Some items ( q25_list_prag ) were negatively correlated with the total scale and 
probably should be reversed.  
To do this, run the function again with the 'check.keys=TRUE' option
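
If we wanted to act on this warning, we could rerun the function with check.keys = TRUE, which tells psych::alpha to flag (and reverse-score) items that correlate negatively with the total; here that would likely just be q25_list_prag:

ctt_res_checked <- psych::alpha(place_ctt, delete = FALSE, check.keys = TRUE)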

The output of psych::alpha is a list. A list is a data structure that contains objects of the same or different types. In the case of this output, the objects differ. Running str(ctt_res) or glimpse(ctt_res) shows that there are three dataframes of different dimensions and 11 vectors of differing data types. Let’s take a peek into the first three dataframes: total, alpha.drop, and item.stats.
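
To inspect those components yourself:

names(ctt_res)              # names of the objects in the list
str(ctt_res, max.level = 1) # structure of the list, one level deep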

Exercise

With the person next to you, take a look at the help documentation for the alpha command (hint: ?). Read over the items listed under Value (this is what is returned by the command).

  • How would you access information about the reliability of the test? Which object would you extract?

Solution

# 1. 
ctt_res$total
 raw_alpha std.alpha   G6(smc) average_r      S/N        ase      mean
 0.8970782 0.8989688 0.9816376 0.1127778 8.897934 0.01533376 0.5996753
        sd  median_r
 0.1592408 0.1132277
# or 2.
ctt_res[['total']]
 raw_alpha std.alpha   G6(smc) average_r      S/N        ase      mean
 0.8970782 0.8989688 0.9816376 0.1127778 8.897934 0.01533376 0.5996753
        sd  median_r
 0.1592408 0.1132277
  • How would you find information about the item difficulty and discrimination parameters? Which object would you extract?

Solution

# 1. 
head(ctt_res$item.stats) # print the first six rows
             n     raw.r     std.r     r.cor    r.drop      mean        sd
q1_list_mi  88 0.3121197 0.3084994 0.3009708 0.2716729 0.6136364 0.4897059
q2_list_det 88 0.3509734 0.3555070 0.3537122 0.3150069 0.7272727 0.4479140
q3_list_det 88 0.3428502 0.3454095 0.3407921 0.3024045 0.5568182 0.4996080
q4_list_det 88 0.5537371 0.5537423 0.5552338 0.5242060 0.7045455 0.4588614
q5_list_det 88 0.4458832 0.4445266 0.4378579 0.4092890 0.4090909 0.4944837
q6_list_det 88 0.4588281 0.4719024 0.4715676 0.4321825 0.8409091 0.3678569
# or 2.
head(ctt_res[['item.stats']]) # print the first six rows
             n     raw.r     std.r     r.cor    r.drop      mean        sd
q1_list_mi  88 0.3121197 0.3084994 0.3009708 0.2716729 0.6136364 0.4897059
q2_list_det 88 0.3509734 0.3555070 0.3537122 0.3150069 0.7272727 0.4479140
q3_list_det 88 0.3428502 0.3454095 0.3407921 0.3024045 0.5568182 0.4996080
q4_list_det 88 0.5537371 0.5537423 0.5552338 0.5242060 0.7045455 0.4588614
q5_list_det 88 0.4458832 0.4445266 0.4378579 0.4092890 0.4090909 0.4944837
q6_list_det 88 0.4588281 0.4719024 0.4715676 0.4321825 0.8409091 0.3678569
  • How would you find information about what the reliability of the test would be if individual items were removed? Which object would you extract?

Solution

# 1. 
head(ctt_res$alpha.drop) # print the first six rows
            raw_alpha std.alpha   G6(smc) average_r      S/N   alpha se
q1_list_mi  0.8962040 0.8980962 0.9809285 0.1132607 8.813174 0.01545914
q2_list_det 0.8957416 0.8976417 0.9800693 0.1127639 8.769603 0.01552716
q3_list_det 0.8958922 0.8977396 0.9805334 0.1128706 8.778958 0.01550373
q4_list_det 0.8936735 0.8956855 0.9801878 0.1106689 8.586398 0.01585415
q5_list_det 0.8947543 0.8967713 0.9808946 0.1118231 8.687226 0.01568184
q6_list_det 0.8948733 0.8965010 0.9801727 0.1115338 8.661928 0.01565662
                 var.r     med.r
q1_list_mi  0.01523910 0.1132277
q2_list_det 0.01513273 0.1132277
q3_list_det 0.01523350 0.1138670
q4_list_det 0.01474208 0.1112424
q5_list_det 0.01517983 0.1113556
q6_list_det 0.01505777 0.1119945
# or 2. 
head(ctt_res[['alpha.drop']]) # print the first six rows
            raw_alpha std.alpha   G6(smc) average_r      S/N   alpha se
q1_list_mi  0.8962040 0.8980962 0.9809285 0.1132607 8.813174 0.01545914
q2_list_det 0.8957416 0.8976417 0.9800693 0.1127639 8.769603 0.01552716
q3_list_det 0.8958922 0.8977396 0.9805334 0.1128706 8.778958 0.01550373
q4_list_det 0.8936735 0.8956855 0.9801878 0.1106689 8.586398 0.01585415
q5_list_det 0.8947543 0.8967713 0.9808946 0.1118231 8.687226 0.01568184
q6_list_det 0.8948733 0.8965010 0.9801727 0.1115338 8.661928 0.01565662
                 var.r     med.r
q1_list_mi  0.01523910 0.1132277
q2_list_det 0.01513273 0.1132277
q3_list_det 0.01523350 0.1138670
q4_list_det 0.01474208 0.1112424
q5_list_det 0.01517983 0.1113556
q6_list_det 0.01505777 0.1119945

Sometimes we want to do further analyses of the item-level data (e.g., summaries by subtest or objective). In order to do this with the output from psych::alpha, we need to massage the dataframe. We can read the separate function below as “separate the question_info column into question, skill, objective, and anchor_status; split at each _; do not remove the original column; if a question name yields fewer than four pieces, fill the rightmost column(s) with NA”.

ctt_items <- ctt_res[['item.stats']] %>%
  rownames_to_column(var = 'question_info') %>% # makes the rownames a variable in the dataframe
  separate(question_info, into = c('question', 'skill', 'objective', 'anchor_status'), sep = "_", remove = FALSE, fill = 'right') %>%
  select(question_info:n, r.drop, mean) %>% # select the columns of interest
  rename('discrimination' = r.drop, 'difficulty' = mean) %>% # rename the columns
  as_tibble() # so it prints nicely

ctt_items
# A tibble: 70 x 8
   question_info question skill objective anchor_status     n
   <chr>         <chr>    <chr> <chr>     <chr>         <dbl>
 1 q1_list_mi    q1       list  mi        <NA>             88
 2 q2_list_det   q2       list  det       <NA>             88
 3 q3_list_det   q3       list  det       <NA>             88
 4 q4_list_det   q4       list  det       <NA>             88
 5 q5_list_det   q5       list  det       <NA>             88
 6 q6_list_det   q6       list  det       <NA>             88
 7 q7_list_det   q7       list  det       <NA>             88
 8 q8_list_det   q8       list  det       <NA>             88
 9 q9_list_det   q9       list  det       <NA>             88
10 q10_list_mi   q10      list  mi        <NA>             88
# … with 60 more rows, and 2 more variables: discrimination <dbl>,
#   difficulty <dbl>

One analysis we might be interested in is how the item indices differ across the two subskills (we will need our handy dplyr and tidyr skills):

# with wide data

skill_summary_wide <- ctt_items %>%
  select(skill, difficulty, discrimination) %>%
  group_by(skill) %>%
  summarise(n = n(),
            'Mean p' = mean(difficulty),
            'SD p' = sd(difficulty),
            'Mean d' = mean(discrimination),
            'SD d' = sd(discrimination)) %>%
  mutate_if(is.double, round, 2) # conditionally rounds all columns that are doubles to the nearest hundredth

# with long data
skill_summary_long <- ctt_items %>%
  select(skill, difficulty, discrimination) %>%
  gather(key = 'index', value, -skill) %>%
  group_by(skill, index) %>%
  summarise(n = n(),
            Mean = mean(value),
            SD = sd(value)) %>%
  mutate_if(is.double, round, 2)

# note: the kableExtra package provides a nice set of tools for creating nested tables
# for HTML and PDF documents (interesting for the results with the long data)
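
For example, a minimal sketch of such a nested table, assuming knitr and kableExtra are installed:

library(knitr)
library(kableExtra)

skill_summary_long %>%
  kable() %>%
  kable_styling() %>%
  collapse_rows(columns = 1) # merge the repeated skill labels into a nested layout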

Exercise

Carry out the same type of summary analysis grouped by skill and objective.

Solution

obj_summary <- ctt_items %>%
  select(skill, objective, difficulty, discrimination) %>%
  group_by(skill, objective) %>%
  summarise(n = n(),
            'Mean p' = mean(difficulty),
            'SD p' = sd(difficulty),
            'Mean d' = mean(discrimination),
            'SD d' = sd(discrimination)) %>%
  mutate_if(is.double, round, 2)

obj_summary
# A tibble: 10 x 7
# Groups:   skill [2]
   skill objective     n `Mean p` `SD p` `Mean d` `SD d`
   <chr> <chr>     <int>    <dbl>  <dbl>    <dbl>  <dbl>
 1 list  det          20     0.61   0.16    0.38    0.14
 2 list  inf           3     0.69   0.13    0.21    0.14
 3 list  mi            7     0.61   0.18    0.28    0.1 
 4 list  prag          5     0.47   0.17    0.19    0.26
 5 read  det          10     0.68   0.22    0.36    0.07
 6 read  inf           3     0.45   0.09    0.39    0.14
 7 read  mi            8     0.73   0.18    0.28    0.11
 8 read  purp          4     0.39   0.1     0.290   0.12
 9 read  torg          6     0.45   0.11    0.31    0.13
10 read  voc           4     0.74   0.23    0.290   0.18
  • How would you do this analysis by skill on only the anchor data? (hint: filter)

Solution

an_summary <- ctt_items %>%
  filter(anchor_status == 'an') %>%
  select(skill, difficulty, discrimination) %>%
  group_by(skill) %>%
  summarise(n = n(),
            'Mean p' = mean(difficulty),
            'SD p' = sd(difficulty),
            'Mean d' = mean(discrimination),
            'SD d' = sd(discrimination)) %>%
  mutate_if(is.double, round, 2)

an_summary
# A tibble: 2 x 6
  skill     n `Mean p` `SD p` `Mean d` `SD d`
  <chr> <int>    <dbl>  <dbl>    <dbl>  <dbl>
1 list      9     0.55   0.19    0.31    0.16
2 read     11     0.4    0.16    0.290   0.12

Criterion-referenced test analysis

In criterion-referenced test theory, the focus of the analysis is the dependability of classifications (e.g., master vs. non-master) when evaluating the whole test, and the extent to which item indices “agree” with whole-test decisions.

To evaluate the dependability of classifications, there is a function in the rcrtan package called subkoviak. It implements Subkoviak’s single-administration kappa and agreement coefficients. It requires three arguments (all visible in the call below):

  • the dataframe of test results

  • items: the columns that contain the item-level data

  • raw_cut_score: the raw cut score used to classify examinees as masters or non-masters

We will use the placement_1.csv dataset for this analysis. The function returns three indices (z, z_rounded, KR_est) that are used to look up the agreement (agree_coef.r_*) and kappa (kappa_coef.r_*) coefficients.

depend <- subkoviak(test_results_1, items = 5:74, raw_cut_score = 49) # cut-score = 70%

depend
     z z_rounded    KR_est agree_coef.r_0.9 kappa_coef.r_0.9
1 0.59       0.6 0.8970782             0.88              0.7
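
Since the output is a one-row dataframe, individual indices can be extracted for reporting (using the column names printed above):

depend$agree_coef.r_0.9 # agreement coefficient at the cut score
depend$kappa_coef.r_0.9 # kappa coefficient at the cut score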

There are also functions for carrying out CRT item analyses. The omnibus function that returns results for a number of these analyses is crt_iteman. It takes arguments similar to subkoviak’s, with one difference: the cut score can be supplied in raw or percent form (via the scale argument).

crt_res <- crt_iteman(test_results_1, items = 5:74, cut_score = 49, scale = 'raw') # cut-score = 70%

crt_res
# A tibble: 70 x 7
   items       if_pass if_fail if_total b_index agree   phi
   <chr>         <dbl>   <dbl>    <dbl>   <dbl> <dbl> <dbl>
 1 q1_list_mi    0.692   0.581    0.614   0.112 0.500 0.105
 2 q2_list_det   0.885   0.661    0.727   0.223 0.500 0.229
 3 q3_list_det   0.769   0.468    0.557   0.301 0.602 0.277
 4 q4_list_det   1       0.581    0.705   0.419 0.591 0.419
 5 q5_list_det   0.731   0.274    0.409   0.457 0.727 0.424
 6 q6_list_det   1       0.774    0.841   0.226 0.455 0.282
 7 q7_list_det   1       0.581    0.705   0.419 0.591 0.419
 8 q8_list_det   0.885   0.629    0.705   0.256 0.523 0.256
 9 q9_list_det   0.962   0.758    0.818   0.203 0.455 0.241
10 q10_list_mi   0.923   0.710    0.773   0.213 0.477 0.232
# … with 60 more rows
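
Because crt_res is a tibble, our dplyr skills apply directly here too. For instance, we could flag items with low discrimination (the 0.2 cut-off below is illustrative, not a published standard):

crt_res %>%
  filter(b_index < 0.2) # keep only items whose B-index is below 0.2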

Exercise

Carry out a summary analysis of the item indices if_total, b_index, agree, and phi.

Solution

crt_summary <- crt_res %>%
  separate(items, into = c('question', 'skill', 'objective', 'anchor_status'), sep = "_", remove = FALSE, fill = 'right') %>%
  select(skill, if_total, b_index, agree, phi) %>%
  gather(key = index, value, -skill) %>%
  group_by(skill, index) %>%
  summarise(n = n(),
            'Mean' = mean(value),
            'SD' = sd(value)) %>%
  mutate_if(is.double, round, 2)

crt_summary
# A tibble: 8 x 5
# Groups:   skill [2]
  skill index        n  Mean    SD
  <chr> <chr>    <int> <dbl> <dbl>
1 list  agree       35 0.570  0.1 
2 list  b_index     35 0.27   0.15
3 list  if_total    35 0.6    0.17
4 list  phi         35 0.27   0.14
5 read  agree       35 0.56   0.12
6 read  b_index     35 0.24   0.13
7 read  if_total    35 0.6    0.22
8 read  phi         35 0.25   0.12

Key Points

  • psych and CTT are two packages that facilitate classical test theory analysis.

  • rcrtan facilitates criterion-referenced test and item analyses.

  • dplyr and tidyr can be used to analyze the output of psychometric analyses.