Part 2: CTT and CRT Test and Item Analysis
Overview
Teaching: 30 min
Exercises: 25 min
Questions
How do I conduct basic CTT/CRT item analyses?
How do I investigate the reliability/dependability of a test?
How do I extract indices of interest for reporting and analysis?
Objectives
Conduct classical test theory item and test analysis using psych.
Use rcrtan to carry out criterion-referenced test and item analyses.
Use functions from dplyr and tidyr to carry out analyses on results.
We start by loading the required packages.
library(tidyverse)
library(psych)
library(CTT)
library(rcrtan)
If it is not still in the workspace, load the data we saved in the previous lesson.
test_results_1 <- read_csv('data/placement_1.csv') %>%
mutate(country = as.factor(country)) # changes the country variable to a factor
Parsed with column specification:
cols(
.default = col_double(),
names = col_character(),
country = col_character(),
admin_date = col_datetime(format = "")
)
See spec(...) for full column specifications.
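The message above is read_csv reporting the column types it guessed. If we wanted to make those types explicit (and silence the message), a minimal sketch mirroring the reported spec might look like this:
test_results_1 <- read_csv('data/placement_1.csv',
col_types = cols(.default = col_double(),
names = col_character(),
country = col_character(),
admin_date = col_datetime(format = ""))) %>%
mutate(country = as.factor(country)) # changes the country variable to a factor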
Preparing the data
Our skills with dplyr and tidyr will be useful for prepping the data for analysis.
Two of the three test analysis packages we are working with (psych and CTT) require only item-level data.
The third (rcrtan) sometimes requires a column of total test scores in addition to the item-level data.
Let’s prep the data for the former two packages first.
place_ctt <- test_results_1 %>%
select(q1_list_mi:q70_read_det_an) # keep only the item-level columns
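If we later need a total-score column for the rcrtan functions that expect one, we could compute it ourselves. A minimal sketch, assuming the items are scored 0/1 (the object name place_crt and the column name total are our own choices):
place_crt <- test_results_1 %>%
mutate(total = rowSums(select(., q1_list_mi:q70_read_det_an))) # sum the 70 item scores for each test taker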
Classical test theory analysis
Now that we have the items, we are ready to carry out some analyses. We will start with a CTT analysis
using the psych package. First let’s take a look at what arguments our function requires. The three
we are most concerned with are x, keys, and delete:
x: A data.frame or matrix of data, or a covariance or correlation matrix
keys: If some items are to be reversed keyed, then either specify the direction of all items or just a vector of which items to reverse
delete: Delete items with no variance and issue a warning
?psych::alpha
ctt_res <- psych::alpha(place_ctt, delete = FALSE) # we want to retain the items even if the indices cannot be estimated
Some items ( q25_list_prag ) were negatively correlated with the total scale and
probably should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
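The warning points to the check.keys argument of psych::alpha, which reverse-scores items that correlate negatively with the total. We keep the original scoring here, but the call would look something like this:
ctt_res_checked <- psych::alpha(place_ctt, delete = FALSE, check.keys = TRUE) # reverse-score flagged items automatically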
The output of psych::alpha is a list. A list is a data structure that can contain objects of the same or different
types. In the case of this output, the objects are different. Running str(ctt_res)
or glimpse(ctt_res) shows that there are three dataframes of different dimensions and 11 vectors of differing data types.
Let’s take a peek into the first three dataframes: total, alpha.drop, and item.stats.
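For example, to get a quick overview of the structure without printing every element, we might run (output not shown):
str(ctt_res, max.level = 1) # list only the top-level components of ctt_res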
Exercise
With the person next to you, take a look at the help documentation for the alpha
command (hint: ?). Read over the items listed under Value (this is what is returned by the command).
- How would you access information about the reliability of the test? Which object would you extract?
Solution
# 1.
ctt_res$total
  raw_alpha std.alpha   G6(smc) average_r      S/N        ase      mean        sd  median_r
  0.8970782 0.8989688 0.9816376 0.1127778 8.897934 0.01533376 0.5996753 0.1592408 0.1132277
# or 2.
ctt_res[['total']]
  raw_alpha std.alpha   G6(smc) average_r      S/N        ase      mean        sd  median_r
  0.8970782 0.8989688 0.9816376 0.1127778 8.897934 0.01533376 0.5996753 0.1592408 0.1132277
- How would you find information about the item difficulty and discrimination parameters? Which object would you extract?
Solution
# 1.
head(ctt_res$item.stats) # print the first six rows
             n     raw.r     std.r     r.cor    r.drop      mean        sd
q1_list_mi  88 0.3121197 0.3084994 0.3009708 0.2716729 0.6136364 0.4897059
q2_list_det 88 0.3509734 0.3555070 0.3537122 0.3150069 0.7272727 0.4479140
q3_list_det 88 0.3428502 0.3454095 0.3407921 0.3024045 0.5568182 0.4996080
q4_list_det 88 0.5537371 0.5537423 0.5552338 0.5242060 0.7045455 0.4588614
q5_list_det 88 0.4458832 0.4445266 0.4378579 0.4092890 0.4090909 0.4944837
q6_list_det 88 0.4588281 0.4719024 0.4715676 0.4321825 0.8409091 0.3678569
# or 2.
head(ctt_res[['item.stats']]) # print the first six rows
             n     raw.r     std.r     r.cor    r.drop      mean        sd
q1_list_mi  88 0.3121197 0.3084994 0.3009708 0.2716729 0.6136364 0.4897059
q2_list_det 88 0.3509734 0.3555070 0.3537122 0.3150069 0.7272727 0.4479140
q3_list_det 88 0.3428502 0.3454095 0.3407921 0.3024045 0.5568182 0.4996080
q4_list_det 88 0.5537371 0.5537423 0.5552338 0.5242060 0.7045455 0.4588614
q5_list_det 88 0.4458832 0.4445266 0.4378579 0.4092890 0.4090909 0.4944837
q6_list_det 88 0.4588281 0.4719024 0.4715676 0.4321825 0.8409091 0.3678569
- How would you find information about what the reliability of the test would be if individual items were removed from the test? Which object would you extract?
Solution
# 1.
head(ctt_res$alpha.drop) # print the first six rows
            raw_alpha std.alpha   G6(smc) average_r      S/N   alpha se
q1_list_mi  0.8962040 0.8980962 0.9809285 0.1132607 8.813174 0.01545914
q2_list_det 0.8957416 0.8976417 0.9800693 0.1127639 8.769603 0.01552716
q3_list_det 0.8958922 0.8977396 0.9805334 0.1128706 8.778958 0.01550373
q4_list_det 0.8936735 0.8956855 0.9801878 0.1106689 8.586398 0.01585415
q5_list_det 0.8947543 0.8967713 0.9808946 0.1118231 8.687226 0.01568184
q6_list_det 0.8948733 0.8965010 0.9801727 0.1115338 8.661928 0.01565662
                 var.r     med.r
q1_list_mi  0.01523910 0.1132277
q2_list_det 0.01513273 0.1132277
q3_list_det 0.01523350 0.1138670
q4_list_det 0.01474208 0.1112424
q5_list_det 0.01517983 0.1113556
q6_list_det 0.01505777 0.1119945
# or 2.
head(ctt_res[['alpha.drop']]) # print the first six rows
            raw_alpha std.alpha   G6(smc) average_r      S/N   alpha se
q1_list_mi  0.8962040 0.8980962 0.9809285 0.1132607 8.813174 0.01545914
q2_list_det 0.8957416 0.8976417 0.9800693 0.1127639 8.769603 0.01552716
q3_list_det 0.8958922 0.8977396 0.9805334 0.1128706 8.778958 0.01550373
q4_list_det 0.8936735 0.8956855 0.9801878 0.1106689 8.586398 0.01585415
q5_list_det 0.8947543 0.8967713 0.9808946 0.1118231 8.687226 0.01568184
q6_list_det 0.8948733 0.8965010 0.9801727 0.1115338 8.661928 0.01565662
                 var.r     med.r
q1_list_mi  0.01523910 0.1132277
q2_list_det 0.01513273 0.1132277
q3_list_det 0.01523350 0.1138670
q4_list_det 0.01474208 0.1112424
q5_list_det 0.01517983 0.1113556
q6_list_det 0.01505777 0.1119945
Sometimes we want to do further analyses of the item-level data (e.g., summaries by subtest or objective). In order to do this
with the output from psych::alpha, we need to massage the dataframe. We can read the separate
function below as “separate the question_info column into question, skill, objective, and anchor_status;
separate at _; do not remove the original column; if any of the four new columns have missing data, fill the rightmost column with NA”.
ctt_items <- ctt_res[['item.stats']] %>%
rownames_to_column(var = 'question_info') %>% # makes the rownames a variable in the dataframe
separate(question_info, into = c('question', 'skill', 'objective', 'anchor_status'), sep = "_", remove = FALSE, fill = 'right') %>%
select(question_info:n, r.drop, mean) %>% # select the columns of interest
rename('discrimination' = r.drop, 'difficulty' = mean) %>% # rename the columns
as_tibble() # convert to a tibble so it prints nicely
ctt_items
# A tibble: 70 x 8
question_info question skill objective anchor_status n
<chr> <chr> <chr> <chr> <chr> <dbl>
1 q1_list_mi q1 list mi <NA> 88
2 q2_list_det q2 list det <NA> 88
3 q3_list_det q3 list det <NA> 88
4 q4_list_det q4 list det <NA> 88
5 q5_list_det q5 list det <NA> 88
6 q6_list_det q6 list det <NA> 88
7 q7_list_det q7 list det <NA> 88
8 q8_list_det q8 list det <NA> 88
9 q9_list_det q9 list det <NA> 88
10 q10_list_mi q10 list mi <NA> 88
# … with 60 more rows, and 2 more variables: discrimination <dbl>,
# difficulty <dbl>
One analysis we might be interested in is how the item indices differ across the two subskills (we will need our
handy dplyr
and tidyr
skills):
# with wide data
skill_summary_wide <- ctt_items %>%
select(skill, difficulty, discrimination) %>%
group_by(skill) %>%
summarise(n = n(),
'Mean p' = mean(difficulty),
'SD p' = sd(difficulty),
'Mean d' = mean(discrimination),
'SD d' = sd(discrimination)) %>%
mutate_if(is.double, round, 2) # conditionally rounds all columns that are doubles to the nearest hundredth
# with long data
skill_summary_long <- ctt_items %>%
select(skill, difficulty, discrimination) %>%
gather(key = 'index', value, -skill) %>%
group_by(skill, index) %>%
summarise(n = n(),
Mean = mean(value),
SD = sd(value)) %>%
mutate_if(is.double, round, 2)
# note: the kableExtra package provides a nice set of tools for creating nested tables
# for HTML and PDF documents (interesting for the results with the long data)
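As a rough sketch of what that could look like (assuming knitr and kableExtra are installed; collapse_rows nests the repeated skill labels in HTML output):
library(kableExtra)
skill_summary_long %>%
knitr::kable(caption = 'Item indices by skill and index') %>%
kable_styling(full_width = FALSE) %>%
collapse_rows(columns = 1, valign = 'top')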
Exercise
Carry out the same type of summary analysis grouped by skill and objective.
Solution
obj_summary <- ctt_items %>%
select(skill, objective, difficulty, discrimination) %>%
group_by(skill, objective) %>%
summarise(n = n(),
'Mean p' = mean(difficulty),
'SD p' = sd(difficulty),
'Mean d' = mean(discrimination),
'SD d' = sd(discrimination)) %>%
mutate_if(is.double, round, 2)
obj_summary
# A tibble: 10 x 7
# Groups:   skill [2]
   skill objective     n `Mean p` `SD p` `Mean d` `SD d`
   <chr> <chr>     <int>    <dbl>  <dbl>    <dbl>  <dbl>
 1 list  det          20     0.61   0.16     0.38   0.14
 2 list  inf           3     0.69   0.13     0.21   0.14
 3 list  mi            7     0.61   0.18     0.28   0.1
 4 list  prag          5     0.47   0.17     0.19   0.26
 5 read  det          10     0.68   0.22     0.36   0.07
 6 read  inf           3     0.45   0.09     0.39   0.14
 7 read  mi            8     0.73   0.18     0.28   0.11
 8 read  purp          4     0.39   0.1      0.290  0.12
 9 read  torg          6     0.45   0.11     0.31   0.13
10 read  voc           4     0.74   0.23     0.290  0.18
- How would you do this analysis by skill on only the anchor data? (hint: filter)
Solution
an_summary <- ctt_items %>%
filter(anchor_status == 'an') %>%
select(skill, difficulty, discrimination) %>%
group_by(skill) %>%
summarise(n = n(),
'Mean p' = mean(difficulty),
'SD p' = sd(difficulty),
'Mean d' = mean(discrimination),
'SD d' = sd(discrimination)) %>%
mutate_if(is.double, round, 2)
an_summary
# A tibble: 2 x 6
  skill     n `Mean p` `SD p` `Mean d` `SD d`
  <chr> <int>    <dbl>  <dbl>    <dbl>  <dbl>
1 list      9     0.55   0.19    0.31    0.16
2 read     11     0.4    0.16    0.290   0.12
Criterion-referenced test analysis
In criterion-referenced test theory, the focus of the analysis is the dependability of classifications (e.g., master vs. non-master) when evaluating the whole test, and the extent to which item indices “agree” with whole-test decisions.
To evaluate the dependability of classifications, there is a function in the rcrtan package
called subkoviak. It implements Subkoviak’s single administration kappa and agreement coefficients.
It requires three arguments:
data: A dataframe of dichotomously scored items
items: The column indices that can be used to locate the items in the dataframe
raw_cut_score: The raw cut-score of the test
We will use the placement_1.csv dataset for this analysis. The function returns three indices
(z, z_rounded, KR_est) that were used to look up the agreement (agree_coef.r_*) and
kappa (kappa_coef.r_*) coefficients.
depend <- subkoviak(test_results_1, items = 5:74, raw_cut_score = 49) # cut-score = 70%
depend
z z_rounded KR_est agree_coef.r_0.9 kappa_coef.r_0.9
1 0.59 0.6 0.8970782 0.88 0.7
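Because the result prints like a one-row data frame, individual indices can be pulled out by column name for reporting (assuming the object behaves like a standard data frame):
depend$agree_coef.r_0.9 # single-administration agreement coefficient
depend$kappa_coef.r_0.9 # single-administration kappa coefficient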
There are also functions for carrying out CRT item analyses. The omnibus function that will return results
for a number of these analyses is crt_iteman. It takes similar arguments to subkoviak, with one difference
being that the cut-score can be given in raw or percent form.
crt_res <- crt_iteman(test_results_1, items = 5:74, cut_score = 49, scale = 'raw') # cut-score = 70%
crt_res
# A tibble: 70 x 7
items if_pass if_fail if_total b_index agree phi
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 q1_list_mi 0.692 0.581 0.614 0.112 0.500 0.105
2 q2_list_det 0.885 0.661 0.727 0.223 0.500 0.229
3 q3_list_det 0.769 0.468 0.557 0.301 0.602 0.277
4 q4_list_det 1 0.581 0.705 0.419 0.591 0.419
5 q5_list_det 0.731 0.274 0.409 0.457 0.727 0.424
6 q6_list_det 1 0.774 0.841 0.226 0.455 0.282
7 q7_list_det 1 0.581 0.705 0.419 0.591 0.419
8 q8_list_det 0.885 0.629 0.705 0.256 0.523 0.256
9 q9_list_det 0.962 0.758 0.818 0.203 0.455 0.241
10 q10_list_mi 0.923 0.710 0.773 0.213 0.477 0.232
# … with 60 more rows
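Because crt_res is a tibble, our dplyr skills apply directly. For example, one way to flag items whose B-index suggests weak agreement with the pass/fail decision (the 0.2 threshold is just an illustrative choice, not a rule from rcrtan):
crt_res %>%
filter(b_index < 0.2) %>% # keep items that poorly separate masters from non-masters
arrange(b_index) # worst-performing items first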
Exercise
Carry out a summary analysis of the item indices if_total, b_index, agree, and phi.
Solution
crt_summary <- crt_res %>%
separate(items, into = c('question', 'skill', 'objective', 'anchor_status'), sep = "_", remove = FALSE, fill = 'right') %>%
select(skill, if_total, b_index, agree, phi) %>%
gather(key = index, value, -skill) %>%
group_by(skill, index) %>%
summarise(n = n(),
'Mean' = mean(value),
'SD' = sd(value)) %>%
mutate_if(is.double, round, 2)
crt_summary
# A tibble: 8 x 5
# Groups:   skill [2]
  skill index        n  Mean    SD
  <chr> <chr>    <int> <dbl> <dbl>
1 list  agree       35 0.570  0.1
2 list  b_index     35 0.27   0.15
3 list  if_total    35 0.6    0.17
4 list  phi         35 0.27   0.14
5 read  agree       35 0.56   0.12
6 read  b_index     35 0.24   0.13
7 read  if_total    35 0.6    0.22
8 read  phi         35 0.25   0.12
Key Points
psych and CTT are two packages that facilitate classical test theory analysis.
rcrtan facilitates criterion-referenced test and item analyses.
dplyr and tidyr can be used to analyze the output of psychometric analyses.