Part 4: Test Equating

Overview

Teaching: 30 min
Exercises: 20 min

Questions

How do I prepare data for test equating?

How do I conduct test equating?

How do I extract indices of interest for comparison, reporting, and analysis?

How do I visualize equated relationships?

Objectives

Prepare frequency tables of test scores.

Use equate to conduct equating studies.

Extract estimates of error for presenting in figures.

We start by loading the required packages.

library(tidyverse)
library(equate)

If the data we saved in the previous lesson is no longer in the workspace, load it again. We also need to calculate the raw total and raw anchor scores for the reading and listening sections of the second placement test.

# load test_1_raw_totals
test_1_raw <- read_csv('data_output/test_1_raw_totals.csv') %>% # read in our total score data from earlier
  select(-country)

# load placement 1 and 2 data
test_results_1 <- read_csv('data/placement_1.csv')

test_results_2 <- read_csv('data/placement_2.csv')

# create total raw and anchor scores
test_2_raw <- test_results_2 %>%
  mutate(raw_total = rowSums(.[5:69], na.rm = TRUE)) %>%
  mutate(read_raw_total = rowSums(select(., contains("_read_")), na.rm = TRUE)) %>%
  mutate(list_raw_total = rowSums(select(., contains("_list_")), na.rm = TRUE)) %>%
  mutate(list_an_raw = rowSums(select(., matches("q\\d+_list_\\w{2,4}_an")), na.rm = TRUE)) %>% # ugly regex for summing listening anchor items
  mutate(read_an_raw = rowSums(select(., matches("q\\d+_read_\\w{2,4}_an")), na.rm = TRUE)) %>%
  select(ID, contains('raw'))

# write_csv(test_2_raw, 'data_output/test_2_raw_totals.csv')

Preparing the data

The equate function requires frequency tables of the test scores that are to be equated. Our data were collected under the non-equivalent groups with anchor test (NEAT) design, so we need bivariate frequency tables: tables of the score combinations on the total test and the anchor test. The equate package provides the freqtab function to compute these. We will use two of its arguments: x, the scores, and scales, the possible score points on each test. We will start with the listening test.

One quick way to figure out how many items are on the total test and how many are on the anchor test is to select the item columns and then count them:

listen_1_total_q <- test_results_1 %>%
  select(contains('list'))

listen_1_an_q <- test_results_1 %>%
  select(contains('list')) %>%
  select(contains('an'))

ncol(listen_1_total_q)

[1] 35

ncol(listen_1_an_q)

[1] 9
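
Counting columns with contains('an') works here, but it matches the letters "an" anywhere in a column name. The sketch below uses base R and made-up column names (they are assumptions, not the real placement data) to show how a loose match can overcount, and how anchoring the pattern to the "_an" suffix avoids that.

```r
# Hypothetical item column names; the "_an" suffix marks an anchor item.
item_names <- c("q1_list_gist_an", "q2_list_mean", "q3_list_det_an", "q4_list_plan")

# Loose match: any name containing "an" anywhere (what contains('an') does).
loose <- grepl("an", item_names, fixed = TRUE)

# Strict match: only names that end in the "_an" suffix.
strict <- grepl("_an$", item_names)

sum(loose)   # counts all four names, because "mean" and "plan" contain "an"
sum(strict)  # counts only the two names that end in "_an"
```

If the counts from the loose and strict patterns differ on your own data, inspect the extra column names before setting the score scales.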

listen_1_freq <- freqtab(test_1_raw[c('list_raw_total', 'list_an_raw')], scales = list(0:35, 0:9))
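
To make the idea of a bivariate frequency table concrete, here is a minimal base R sketch on toy data. The scores and item counts are invented for illustration, and table() with explicit factor levels stands in for freqtab with its scales argument; it is not how the equate package builds its tables internally.

```r
# Toy data: hypothetical total and anchor scores for six test takers
# on a 5-item test that includes 2 anchor items.
scores <- data.frame(total  = c(3, 5, 3, 0, 4, 3),
                     anchor = c(1, 2, 1, 0, 2, 2))

n_total  <- 5  # items on the total test
n_anchor <- 2  # items on the anchor test

# Explicit factor levels keep score points that nobody obtained
# (for example a total of 1) in the table as zero counts.
biv <- table(total  = factor(scores$total,  levels = 0:n_total),
             anchor = factor(scores$anchor, levels = 0:n_anchor))

biv
dim(biv)  # (n_total + 1) rows by (n_anchor + 1) columns
```

Building the scales as 0:n_total and 0:n_anchor from the item counts, rather than typing the numbers by hand, keeps the table in step with the data if the forms change.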

Exercise

Figure out the total and anchor scales for the test_2_raw data and then create a frequency table for it.

Solution

# How many questions are on the total and anchor forms?
listen_2_total_q <- test_results_2 %>%
  select(contains('list'))

listen_2_an_q <- test_results_2 %>%
  select(contains('list')) %>%
  select(contains('an'))

ncol(listen_2_total_q)

[1] 30

ncol(listen_2_an_q)

[1] 9

# Create the frequency table

listen_2_freq <- freqtab(test_2_raw[c('list_raw_total', 'list_an_raw')], scales = list(0:30, 0:9))

Now that we have the frequency tables, we are ready to equate. There are a number of considerations in choosing an equating method. Kolen & Brennan’s Test Equating, Scaling, and Linking is a good resource for those who want to take a deep dive into them.

We will use the circle-arc method to equate form 1 scores onto the form 2 scale:

list_ca <- equate(listen_1_freq, listen_2_freq, type = 'circle-arc', lowp = c(0, 0), highp = c(35, 30))

Exercise

What score on form 2 does a score of 32 on form 1 correspond to? (Hint: use str or glimpse to inspect list_ca.)

Solution

list_ca$concordance

scale         yx
    0  0.0000000
    1  0.3920108
    2  0.8166926
    3  1.2729186
    4  1.7596900
    5  2.2761212
    6  2.8214275
    7  3.3949150
    8  3.9959714
    9  4.6240594
   10  5.2787099
   11  5.9595173
   12  6.6661353
   13  7.3982727
   14  8.1556914
   15  8.9382033
   16  9.7456691
   17 10.5779963
   18 11.4351392
   19 12.3170976
   20 13.2239176
   21 14.1556914
   22 15.1125584
   23 16.0947067
   24 17.1023745
   25 18.1358527
   26 19.1954879
   27 20.2816857
   28 21.3949150
   29 22.5357132
   30 23.7046926
   31 24.9025471
   32 26.1300615
   33 27.3881212
   34 28.6777250
   35 30.0000000
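
Rather than scanning the printed table, a single equated score can be pulled out by filtering on the scale column. The sketch below rebuilds a few rows of the output above as a small data frame so that it runs on its own; it assumes the concordance has the two columns shown, scale and yx.

```r
# A few rows copied from the concordance output above.
conc <- data.frame(scale = c(30, 31, 32, 33),
                   yx    = c(23.7046926, 24.9025471, 26.1300615, 27.3881212))

# Form 2 equivalent of a form 1 score of 32.
form2_equiv <- conc$yx[conc$scale == 32]
form2_equiv

# Rounded to the nearest whole score point for reporting.
round(form2_equiv)
```

With the real object, the same filter would presumably be applied to list_ca$concordance in place of the toy conc data frame.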

There are three types of error associated with equating: random error, systematic error (or bias), and total error. To calculate the latter two, we would need a criterion, or gold standard, equating relationship between the two test forms to compare our circle-arc results against. We don't have one, so we will estimate only the random error, the standard error of equating, by bootstrapping the equating relationship. We can print the standard errors and then plot them:

list_ca_see <- bootstrap(list_ca, reps = 100)

list_ca_see$se

 [1] 3.071556e-15 1.028392e-01 1.960803e-01 2.805879e-01 3.570847e-01
 [6] 4.261793e-01 4.883871e-01 5.441462e-01 5.938296e-01 6.377548e-01
[11] 6.761917e-01 7.093678e-01 7.374740e-01 7.606676e-01 7.790755e-01
[16] 7.927970e-01 8.019048e-01 8.064469e-01 8.064469e-01 8.019048e-01
[21] 7.927970e-01 7.790755e-01 7.606676e-01 7.374740e-01 7.093678e-01
[26] 6.761917e-01 6.377548e-01 5.938296e-01 5.441462e-01 4.883871e-01
[31] 4.261793e-01 3.570847e-01 2.805879e-01 1.960803e-01 1.028392e-01
[36] 3.071556e-15

plot(list_ca_see, out = 'se')
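
The bootstrap idea described above can be sketched in base R. This is a simplified illustration, not what the bootstrap function in equate does internally: it uses simulated scores, a random groups design without an anchor, and plain linear equating instead of circle-arc. The function name linear_equate and all data are made up for the example.

```r
set.seed(42)  # make the resampling reproducible

# Hypothetical raw scores on two forms (simulated, not the lesson data).
form_x <- rbinom(200, size = 35, prob = 0.60)
form_y <- rbinom(200, size = 30, prob = 0.65)

# Linear equating of form X scores onto the form Y scale.
linear_equate <- function(x_scores, y_scores, at) {
  sd(y_scores) / sd(x_scores) * (at - mean(x_scores)) + mean(y_scores)
}

score_points <- 0:35
reps <- 100

# Each replicate resamples both groups with replacement and re-equates.
boot_equated <- replicate(reps, {
  x_star <- sample(form_x, replace = TRUE)
  y_star <- sample(form_y, replace = TRUE)
  linear_equate(x_star, y_star, at = score_points)
})

# Standard error at each score point: SD across the bootstrap replicates.
boot_se <- apply(boot_equated, 1, sd)
round(boot_se, 2)
```

Each row of boot_equated holds the equated values for one score point across replicates, so the row-wise standard deviation is the bootstrap standard error at that score.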

Usually, in carrying out a full equating study, multiple equating relationships are estimated and compared. Below is a demonstration of how this can be done in a few lines of code.

neat_args <- list(identity = list(type = "identity"),
                  mean_tuck = list(type = "mean", method = "tucker"),
                  mean_nomi = list(type = "mean", method = "nominal weights"),
                  line_tuck = list(type = "linear", method = "tucker"),
                  line_chai = list(type = "linear", method = "chained"),
                  circ_tuck = list(type = "circle-arc", method = "tucker"),
                  circ_chai = list(type = "circle-arc", method = "chained", chainmidp = "linear"))

comp_meth <- bootstrap(x = listen_1_freq, y = listen_2_freq, reps = 100, args = neat_args)

plot(comp_meth, out = "se", addident = FALSE, legendplace = 'top')

round(summary(comp_meth), 2)

            se se_w
identity  0.00 0.00
mean_tuck 0.44 0.44
mean_nomi 0.52 0.52
line_tuck 0.74 0.54
line_chai 0.78 0.58
circ_tuck 0.34 0.40
circ_chai 0.35 0.41

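# using ggplot and dplyr/tidyr

# standard errors at every fifth score point, used to place the point markers
fiver <- comp_meth$se %>%
  as_tibble() %>%
  select(-identity) %>% # the identity standard errors are all 0, so drop them
  mutate(score = 0:35) %>%
  slice(seq(1, 36, by = 5)) %>% # rows 1, 6, ..., 36 hold scores 0, 5, ..., 35
  gather(key = method, value, -score) # long format: one row per score and method

# line plot of the standard error at each score point, one line per method
comp_plot <- comp_meth$se %>%
  as_tibble() %>%
  select(-identity) %>%
  mutate(score = 0:35) %>%
  gather(key = method, value, -score) %>%
  ggplot(., aes(x = score, y = value, colour = method, shape = method)) +
  geom_point(data = fiver, aes(size = value), show.legend = FALSE) + # markers sized by standard error
  geom_line() +
  theme_gray(base_size = 18) +
  scale_colour_viridis_d() +
  scale_x_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30, 35), limits = c(0, 35)) +
  theme(panel.grid.minor = element_blank())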

comp_plot

comp_table <- comp_meth$se %>%
  as_tibble() %>%
  summarise_all(., 'mean') %>%
  mutate_all(., 'round', 2)

comp_table

# A tibble: 1 x 7
  identity mean_tuck mean_nomi line_tuck line_chai circ_tuck circ_chai
     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1        0      0.44      0.52      0.69      0.74      0.31      0.32
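
Once the standard errors sit in a matrix with one column per method, ranking the methods by average error takes one line of base R. The matrix below is hypothetical (three score points, three methods, invented values) and only illustrates the pattern.

```r
# Hypothetical standard errors: rows are score points, columns are methods.
se_mat <- cbind(mean_tuck = c(0.44, 0.44, 0.44),
                line_tuck = c(0.90, 0.50, 0.85),
                circ_tuck = c(0.10, 0.60, 0.12))

# Average standard error per method, smallest first.
mean_se <- sort(colMeans(se_mat))
round(mean_se, 2)

# Name of the method with the smallest average standard error.
names(mean_se)[1]
```

A small standard error is not the whole story: identity equating has a standard error of 0 in the table above, yet it may be badly biased, which is why a criterion equating is needed to judge total error.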

Key Points

equate can be used for small- and large-sample test equating.

R for Assessment Specialists
