
Introduction

tmatta edited this page Jul 8, 2017 · 4 revisions

This vignette is a three-part introductory walk-through that illustrates a potential workflow for generating large-scale assessment data.

  1. Generating background questionnaire data
  2. Generating achievement data
  3. Combining background questionnaire data with achievement data

Generating background questionnaire data

The function questionnaire_gen generates correlated ordinal and continuous data that resemble background questionnaire data. The required arguments are:

  • n_obs: the number of observations (e.g., test takers).
  • cat_prop: a list of vectors, one per item, each giving the cumulative proportions of responses up to and including each category.
  • cor_matrix: a possibly heterogeneous correlation matrix, consisting of polyserial correlations between continuous and ordinal variables and polychoric correlations between ordinal variables.

The arguments c_mean and c_sd are scaling parameters for continuous variables and will be addressed in future tutorials. If the logical argument theta is TRUE then the first continuous variable generated will be labeled 'theta'.

backgroud_q <- questionnaire_gen(n_obs, cat_prop, cor_matrix, c_mean = NULL, 
                                   c_sd = NULL, theta = FALSE)

The proportion_gen function

The cat_prop argument of questionnaire_gen requires a list of vectors, where each vector contains the cumulative proportions for the categories of a given item. When the actual cumulative proportions do not matter, lsasim provides a function for generating random cumulative proportions, proportion_gen.

Two arguments are required for proportion_gen:

  • cat_options: a vector containing all the types of category options for the background questionnaire items.
  • n_cat_options: a vector containing the number of items that correspond to each element of cat_options.

Below, we specify resp_typs as a length-3 vector, meaning there will be 3 different item types: 1-category items (continuous), 3-category items, and 5-category items. Likewise, n_typs is a length-3 vector that indicates the number of items for each response type in resp_typs. That is, there will be one 1-category item, five 3-category items, and five 5-category items, 11 items in all. We then pass those objects as the proportion_gen arguments.

cat_pr is a length-11 list of vectors. Each vector contains the cumulative proportions for a given item. The length of each vector corresponds to the number of response categories for that item.

resp_typs <- c(1, 3, 5)
n_typs    <- c(1, 5, 5)

cat_pr <- proportion_gen(cat_options = resp_typs, n_cat_options = n_typs)
print(cat_pr)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 0.25 0.45 1.00
## 
## [[3]]
## [1] 0.35 0.90 1.00
## 
## [[4]]
## [1] 0.16 0.95 1.00
## 
## [[5]]
## [1] 0.11 0.83 1.00
## 
## [[6]]
## [1] 0.36 0.80 1.00
## 
## [[7]]
## [1] 0.44 0.70 0.81 0.95 1.00
## 
## [[8]]
## [1] 0.40 0.58 0.71 1.00 1.00
## 
## [[9]]
## [1] 0.28 0.41 0.74 0.79 1.00
## 
## [[10]]
## [1] 0.52 0.79 0.80 0.99 1.00
## 
## [[11]]
## [1] 0.32 0.85 0.93 1.00 1.00
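
Because each vector holds cumulative proportions, the proportion of responses in each individual category is the difference between successive entries. The following is a minimal base R sketch of that conversion, using values copied from the printed output for items [[2]] and [[8]]. The helper name cum_to_prob is hypothetical and is not part of lsasim.

```r
# Hypothetical helper (not part of lsasim): convert cumulative
# proportions into per-category proportions.
cum_to_prob <- function(cum_prop) diff(c(0, cum_prop))

# Values copied from the printed cat_pr output above
p2 <- cum_to_prob(c(0.25, 0.45, 1.00))
p8 <- cum_to_prob(c(0.40, 0.58, 0.71, 1.00, 1.00))

print(p2)  # three per-category proportions that sum to 1
print(p8)  # five per-category proportions; the last one is 0
```

A repeated 1.00 at the end of a vector, as in item [[8]], means the final category has proportion zero, so that category cannot appear in the generated data.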

The cor_gen function

The cor_matrix argument of questionnaire_gen requires a correlation matrix with dimensions equal to the length of cat_prop. As with the cumulative proportions above, if the actual correlations are not important, lsasim has a function to generate random correlation matrices, cor_gen. The only argument required for cor_gen is n_var, the number of variables.

Below, n_vars is the number of variables from cat_pr. This ensures dim(cor_mat)[1] will equal length(cat_pr).

n_vars  <- length(rep(resp_typs, n_typs))  
cor_mat <- cor_gen(n_var = n_vars)
round(cor_mat, 2)
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
##  [1,]  1.00 -0.05 -0.16 -0.07 -0.13 -0.35  0.50 -0.42  0.56  0.02  0.41
##  [2,] -0.05  1.00  0.54 -0.23  0.08 -0.43  0.02  0.16  0.51  0.12 -0.05
##  [3,] -0.16  0.54  1.00  0.01 -0.04 -0.44  0.00  0.26  0.06 -0.52  0.20
##  [4,] -0.07 -0.23  0.01  1.00  0.24 -0.26 -0.42 -0.16 -0.31 -0.47 -0.23
##  [5,] -0.13  0.08 -0.04  0.24  1.00 -0.21 -0.37 -0.11 -0.11 -0.17 -0.58
##  [6,] -0.35 -0.43 -0.44 -0.26 -0.21  1.00 -0.08  0.23 -0.21  0.44  0.04
##  [7,]  0.50  0.02  0.00 -0.42 -0.37 -0.08  1.00 -0.02  0.52  0.26  0.26
##  [8,] -0.42  0.16  0.26 -0.16 -0.11  0.23 -0.02  1.00 -0.13 -0.17 -0.09
##  [9,]  0.56  0.51  0.06 -0.31 -0.11 -0.21  0.52 -0.13  1.00  0.23 -0.01
## [10,]  0.02  0.12 -0.52 -0.47 -0.17  0.44  0.26 -0.17  0.23  1.00  0.17
## [11,]  0.41 -0.05  0.20 -0.23 -0.58  0.04  0.26 -0.09 -0.01  0.17  1.00
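
If you supply your own matrix instead of using cor_gen, it can help to check that it is a plausible correlation matrix first: symmetric, with a unit diagonal, and with no negative eigenvalues. The sketch below shows one way to do that in base R. The function is_valid_cor and both example matrices are hypothetical illustrations, and the checks that questionnaire_gen itself performs are not described here.

```r
# Hypothetical helper (not part of lsasim): basic sanity checks
# for a candidate correlation matrix.
is_valid_cor <- function(m, tol = 1e-8) {
  isSymmetric(m) &&
    all(abs(diag(m) - 1) < tol) &&
    min(eigen(m, symmetric = TRUE, only.values = TRUE)$values) > -tol
}

# A small matrix that passes all three checks
good <- matrix(c( 1.0,  0.5,  0.2,
                  0.5,  1.0, -0.3,
                  0.2, -0.3,  1.0), nrow = 3, byrow = TRUE)

# Symmetric with unit diagonal, but the pairwise correlations are
# mutually inconsistent, so one eigenvalue is negative
bad <- matrix(c( 1.0,  0.9, -0.9,
                 0.9,  1.0,  0.9,
                -0.9,  0.9,  1.0), nrow = 3, byrow = TRUE)

print(is_valid_cor(good))
print(is_valid_cor(bad))
```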

The questionnaire_gen function

Now that we have cat_pr and cor_mat specified, all that is left is to specify the number of subjects we want to generate data for, which we set to 100. Because we will generate a continuous variable, we can specify theta = TRUE.

nn <- 100
backgroud_q <- questionnaire_gen(n_obs = nn, cat_prop = cat_pr, 
                                   cor_matrix = cor_mat, theta = TRUE)
str(backgroud_q)
## 'data.frame':    100 obs. of  12 variables:
##  $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ theta  : num  0.88 0.19 1.458 0.592 -0.06 ...
##  $ q1     : Factor w/ 3 levels "1","2","3": 1 3 2 3 2 3 2 3 3 1 ...
##  $ q2     : Factor w/ 3 levels "1","2","3": 1 2 2 2 3 1 1 2 3 1 ...
##  $ q3     : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 1 2 2 ...
##  $ q4     : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 3 2 1 3 1 ...
##  $ q5     : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 1 2 3 1 2 ...
##  $ q6     : Factor w/ 5 levels "1","2","3","4",..: 5 4 4 1 2 2 1 4 1 3 ...
##  $ q7     : Factor w/ 4 levels "1","2","3","4": 1 4 1 1 1 1 3 4 4 3 ...
##  $ q8     : Factor w/ 5 levels "1","2","3","4",..: 5 4 3 3 1 3 1 1 3 2 ...
##  $ q9     : Factor w/ 5 levels "1","2","3","4",..: 3 1 2 1 1 2 2 4 1 2 ...
##  $ q10    : Factor w/ 4 levels "1","2","3","4": 2 2 4 2 2 2 2 3 2 2 ...
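
One common way to obtain ordinal responses with given cumulative proportions is to cut a standard normal variable at the matching normal quantiles. The base R sketch below illustrates that idea for a single 3-category item. It is offered as an assumption for illustration only and is not necessarily how questionnaire_gen is implemented.

```r
set.seed(123)

# Cumulative proportions for one 3-category item (copied from item [[2]])
cum_prop <- c(0.25, 0.45, 1.00)

# Latent standard normal scores for a large illustrative sample
z <- rnorm(1e5)

# Thresholds are the normal quantiles of the cumulative proportions;
# qnorm(1) is Inf, so the last category is open-ended.
breaks <- c(-Inf, qnorm(cum_prop))
q <- cut(z, breaks = breaks, labels = seq_along(cum_prop))

# Observed cumulative proportions should be close to cum_prop
obs_cum <- cumsum(prop.table(table(q)))
print(round(obs_cum, 2))
```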

Generating achievement data

The lsasim package comes with five functions to help generate so-called achievement data.

The item_gen function

The function item_gen is a flexible tool for generating item parameters from a range of item response models.

The arguments n_2pl and n_3pl specify how many of each item type are to be generated. For this example, we will generate 10 2PL items and 20 3PL items. The argument thresholds specifies the number of thresholds for the 2PL items. Because thresholds = 2, those 10 2PL items come from a generalized partial credit model with 2 thresholds. Finally, the arguments b_bounds, a_bounds, and c_bounds specify the bounds of the b, a, and c parameters, respectively. Note that c_bounds applies only to the 20 3PL items.

item_pool <- item_gen(n_2pl = 10, n_3pl = 20, thresholds = 2, 
                        b_bounds = c(-2, 2), a_bounds = c(.75, 1.25), 
                        c_bounds = c(0, .25))
print(item_pool)
##    item     b    d1   d2    a    c k p
## 1     1 -0.50 -0.26 0.26 0.76 0.00 2 2
## 2     2 -1.14 -0.34 0.35 0.85 0.00 2 2
## 3     3  1.12 -0.74 0.74 0.78 0.00 2 2
## 4     4 -0.34 -0.50 0.49 1.05 0.00 2 2
## 5     5 -0.96 -0.63 0.64 1.01 0.00 2 2
## 6     6  0.81 -0.62 0.62 1.22 0.00 2 2
## 7     7 -0.59 -0.74 0.74 1.01 0.00 2 2
## 8     8  0.32 -0.28 0.29 0.96 0.00 2 2
## 9     9 -0.25 -0.66 0.67 1.02 0.00 2 2
## 10   10  0.44 -0.63 0.62 0.86 0.00 2 2
## 11   11  0.72  0.00 0.00 1.13 0.11 1 3
## 12   12 -1.61  0.00 0.00 0.92 0.00 1 3
## 13   13 -1.55  0.00 0.00 0.90 0.06 1 3
## 14   14 -0.15  0.00 0.00 1.21 0.07 1 3
## 15   15 -0.35  0.00 0.00 1.09 0.11 1 3
## 16   16  0.91  0.00 0.00 0.84 0.02 1 3
## 17   17  1.45  0.00 0.00 0.76 0.08 1 3
## 18   18 -1.59  0.00 0.00 1.04 0.24 1 3
## 19   19 -0.56  0.00 0.00 0.91 0.02 1 3
## 20   20  0.53  0.00 0.00 0.78 0.02 1 3
## 21   21  0.82  0.00 0.00 0.80 0.10 1 3
## 22   22  1.39  0.00 0.00 0.78 0.12 1 3
## 23   23  0.28  0.00 0.00 0.80 0.03 1 3
## 24   24 -0.41  0.00 0.00 1.10 0.18 1 3
## 25   25 -1.01  0.00 0.00 0.95 0.21 1 3
## 26   26  0.50  0.00 0.00 1.13 0.11 1 3
## 27   27 -0.87  0.00 0.00 0.99 0.08 1 3
## 28   28 -0.29  0.00 0.00 0.84 0.04 1 3
## 29   29  1.94  0.00 0.00 0.85 0.14 1 3
## 30   30  1.98  0.00 0.00 1.23 0.12 1 3
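
To see what the a, b, and c columns mean for the dichotomous items, the sketch below evaluates the standard 3PL response function in base R. The function name p_3pl is hypothetical, and the logistic form without a scaling constant is an assumption; the exact parameterization used inside lsasim is not stated in this vignette.

```r
# Hypothetical helper (not part of lsasim): 3PL probability of a
# correct response, assuming a logistic metric with no scaling constant.
p_3pl <- function(theta, a_par, b_par, c_par) {
  c_par + (1 - c_par) / (1 + exp(-a_par * (theta - b_par)))
}

# Parameters copied from item 11 in the printed item pool
a11 <- 1.13; b11 <- 0.72; c11 <- 0.11

# When theta equals b, the probability is halfway between c and 1
p_at_b <- p_3pl(theta = b11, a_par = a11, b_par = b11, c_par = c11)

# The probability increases with theta and never drops below c
p_grid <- p_3pl(theta = seq(-3, 3, by = 1), a_par = a11, b_par = b11, c_par = c11)

print(p_at_b)
print(round(p_grid, 3))
```

With c_par = 0 the same function reduces to the 2PL, which is why the c column is 0.00 for every item that does not use a guessing parameter.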

The multi-matrix sampling functions

The lsasim package has three functions that facilitate multi-matrix sampling designs.

The first function, block_design, takes a set of item parameters and distributes them across a set of item blocks (sometimes referred to as clusters). The n_blocks argument specifies the number of blocks, while the item_parameters argument takes a data frame of item parameters. The default item-block allocation is a spiraling design.

The result of block_design is a length-2 list containing block_assignment and block_descriptives.

blocks <- block_design(n_blocks = 5, item_parameters = item_pool)

print(blocks$block_assignment)
##    b1 b2 b3 b4 b5
## i1  1  2  3  4  5
## i2  6  7  8  9 10
## i3 11 12 13 14 15
## i4 16 17 18 19 20
## i5 21 22 23 24 25
## i6 26 27 28 29 30
print(blocks$block_descriptives)
##    block length average difficulty
## b1            6              0.543
## b2            6             -0.228
## b3            6             -0.285
## b4            6              0.038
## b5            6              0.105
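
The spiraling allocation printed above deals items to blocks in order, one at a time, which is equivalent to filling a matrix row by row. The base R sketch below reproduces the printed block_assignment for this example and recomputes the average difficulty of block b1 from the rounded b values in the item pool. It mirrors the displayed output only and makes no claim about how block_design is implemented.

```r
n_items  <- 30
n_blocks <- 5

# Fill row by row: item 1 -> b1, item 2 -> b2, ..., item 6 -> b1, ...
block_assignment <- matrix(
  seq_len(n_items), ncol = n_blocks, byrow = TRUE,
  dimnames = list(paste0("i", 1:6), paste0("b", 1:5))
)
print(block_assignment)

# Rounded b values for items 1, 6, 11, 16, 21, 26 (block b1),
# copied from the printed item pool
b_block1 <- c(-0.50, 0.81, 0.72, 0.91, 0.82, 0.50)
avg_b1 <- round(mean(b_block1), 3)
print(avg_b1)
```

The recomputed mean agrees with the 0.543 reported for b1 in block_descriptives, although the inputs here are already rounded to two decimals.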

The second function, booklet_design, assigns the item blocks to test booklets. The only argument used below is item_block_assignment, which takes an item-block matrix. Since we are using booklet_design in tandem with block_design, we can use the output from block_design.

#--- Assign blocks to booklets
booklets <- booklet_design(item_block_assignment = blocks$block_assignment)
print(booklets)
##     B1 B2 B3 B4 B5
## i1   1  2  3  4  1
## i2   6  7  8  9  6
## i3  11 12 13 14 11
## i4  16 17 18 19 16
## i5  21 22 23 24 21
## i6  26 27 28 29 26
## i7   2  3  4  5  5
## i8   7  8  9 10 10
## i9  12 13 14 15 15
## i10 17 18 19 20 20
## i11 22 23 24 25 25
## i12 27 28 29 30 30
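
In the printed design, each booklet combines two blocks: B1 to B4 pair each block with the next one, and B5 pairs block 1 with block 5. The base R sketch below rebuilds that matrix from the block pairs. The pairing table block_pairs is read off the output above and is an assumption about this example only, not a description of the general booklet_design algorithm.

```r
# Item-block matrix from the spiraling allocation (6 items x 5 blocks)
blocks <- matrix(seq_len(30), ncol = 5, byrow = TRUE)

# Block pairs per booklet, read off the printed booklet design
block_pairs <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(1, 5))

# Stack the two blocks of each pair to form a 12-item booklet
booklets <- apply(block_pairs, 1, function(p) c(blocks[, p[1]], blocks[, p[2]]))
dimnames(booklets) <- list(paste0("i", 1:12), paste0("B", 1:5))
print(booklets)

# Each block appears in exactly two booklets
print(table(as.vector(block_pairs)))
```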

The final function for generating item responses under a multi-matrix design is booklet_sample, which facilitates the distribution of test booklets to test takers. The argument n_subj is the number of test takers, while the argument book_item_design is the output from booklet_design. The default is to distribute test booklets equally among test takers.

#--- Assign booklets to subjects 
subj_booklets <- booklet_sample(n_subj = nn, book_item_design = booklets)
head(subj_booklets, 10)
##    subject book item
## 1        1    1    1
## 2        1    1    6
## 3        1    1   11
## 4        1    1   16
## 5        1    1   21
## 6        1    1   26
## 7        1    1    2
## 8        1    1    7
## 9        1    1   12
## 10       1    1   17
tail(subj_booklets, 10)
##      subject book item
## 1191     100    4   14
## 1192     100    4   19
## 1193     100    4   24
## 1194     100    4   29
## 1195     100    4    5
## 1196     100    4   10
## 1197     100    4   15
## 1198     100    4   20
## 1199     100    4   25
## 1200     100    4   30
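
The output has 1200 rows because each of the 100 test takers receives one 12-item booklet. The base R sketch below builds a long table of the same shape by handing out 5 booklets in equal numbers. Shuffling the booklets with sample is an assumption made for illustration, so the individual assignments will not match the output above.

```r
set.seed(42)
n_subj <- 100

# Rebuild the 12 x 5 booklet matrix from the example design
blocks      <- matrix(seq_len(30), ncol = 5, byrow = TRUE)
block_pairs <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(1, 5))
booklets    <- apply(block_pairs, 1, function(p) c(blocks[, p[1]], blocks[, p[2]]))

# Equal numbers of each booklet, handed out in random order (assumption)
book <- sample(rep(seq_len(ncol(booklets)), length.out = n_subj))

# One row per subject-item pair
long <- data.frame(
  subject = rep(seq_len(n_subj), each = nrow(booklets)),
  book    = rep(book, each = nrow(booklets)),
  item    = as.vector(booklets[, book])
)

print(nrow(long))
print(table(book))
```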

The response_gen function

The final function to generate achievement data is the response_gen function. The function is designed to take information from three distinct sources:

  1. The assignment of test booklets to test takers, which comes from the booklet_sample function.
  2. The latent true score for each student, which may come from the questionnaire_gen function.
  3. The item parameters, which come from the item_gen function.

Notice that d_par takes a list, with one vector per threshold. These are the threshold deviations from the mean item difficulty (b_par) for partial credit items.

item_responses <- response_gen(subject = subj_booklets$subject, item = subj_booklets$item, 
                                 theta = backgroud_q$theta, b_par = item_pool$b, 
                                 a_par = item_pool$a, c_par = item_pool$c, 
                                 d_par = list(item_pool$d1, item_pool$d2))
str(item_responses)
## 'data.frame':    100 obs. of  31 variables:
##  $ i001   : num  0 NA NA NA NA 1 NA 2 NA NA ...
##  $ i002   : num  2 2 NA 0 1 NA NA 0 NA NA ...
##  $ i003   : num  NA 1 1 1 2 NA 1 NA NA NA ...
##  $ i004   : num  NA NA 2 NA NA NA 1 NA 0 1 ...
##  $ i005   : num  NA NA NA NA NA 2 NA NA 0 1 ...
##  $ i006   : num  2 NA NA NA NA 1 NA 0 NA NA ...
##  $ i007   : num  0 2 NA 2 1 NA NA 1 NA NA ...
##  $ i008   : num  NA 1 2 1 1 NA 1 NA NA NA ...
##  $ i009   : num  NA NA 2 NA NA NA 0 NA 1 1 ...
##  $ i010   : num  NA NA NA NA NA 0 NA NA 1 1 ...
##  $ i011   : num  1 NA NA NA NA 0 NA 1 NA NA ...
##  $ i012   : num  1 0 NA 1 1 NA NA 1 NA NA ...
##  $ i013   : num  NA 1 1 1 1 NA 1 NA NA NA ...
##  $ i014   : num  NA NA 1 NA NA NA 0 NA 0 0 ...
##  $ i015   : num  NA NA NA NA NA 1 NA NA 1 0 ...
##  $ i016   : num  1 NA NA NA NA 1 NA 0 NA NA ...
##  $ i017   : num  0 0 NA 1 0 NA NA 0 NA NA ...
##  $ i018   : num  NA 1 1 1 1 NA 1 NA NA NA ...
##  $ i019   : num  NA NA 0 NA NA NA 0 NA 0 0 ...
##  $ i020   : num  NA NA NA NA NA 1 NA NA 0 0 ...
##  $ i021   : num  0 NA NA NA NA 1 NA 0 NA NA ...
##  $ i022   : num  0 1 NA 1 0 NA NA 0 NA NA ...
##  $ i023   : num  NA 0 1 1 0 NA 1 NA NA NA ...
##  $ i024   : num  NA NA 0 NA NA NA 0 NA 1 1 ...
##  $ i025   : num  NA NA NA NA NA 1 NA NA 0 0 ...
##  $ i026   : num  0 NA NA NA NA 0 NA 0 NA NA ...
##  $ i027   : num  1 0 NA 1 1 NA NA 1 NA NA ...
##  $ i028   : num  NA 1 1 0 0 NA 0 NA NA NA ...
##  $ i029   : num  NA NA 1 NA NA NA 0 NA 0 0 ...
##  $ i030   : num  NA NA NA NA NA 0 NA NA 0 0 ...
##  $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
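
The first ten items take the values 0, 1, and 2 because they are partial credit items with two thresholds. The base R sketch below computes category probabilities under one common form of the generalized partial credit model, using the parameters of item 1. The function gpcm_prob is hypothetical, and the sign convention for the d deviations (thresholds entering as theta - b + d) is an assumption; response_gen may parameterize the model differently.

```r
# Hypothetical helper (not part of lsasim): GPCM category probabilities.
# Assumes step terms of the form a * (theta - b + d_v); the sign of d
# is a convention that may differ from the one used by response_gen.
gpcm_prob <- function(theta, a_par, b_par, d_par) {
  # Category 0 has a cumulative sum of 0; higher categories add one step each
  z <- c(0, cumsum(a_par * (theta - b_par + d_par)))
  exp(z) / sum(exp(z))
}

# Parameters copied from item 1 in the printed item pool
p_item1 <- gpcm_prob(theta = 0, a_par = 0.76, b_par = -0.50,
                     d_par = c(-0.26, 0.26))
names(p_item1) <- c("score0", "score1", "score2")
print(round(p_item1, 3))
```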

Combine survey data and cognitive data

Finally, we use the merge function to combine backgroud_q and item_responses into a single data frame.

final_data <- merge(backgroud_q, item_responses, by = "subject")
str(final_data)
## 'data.frame':    100 obs. of  42 variables:
##  $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ theta  : num  0.88 0.19 1.458 0.592 -0.06 ...
##  $ q1     : Factor w/ 3 levels "1","2","3": 1 3 2 3 2 3 2 3 3 1 ...
##  $ q2     : Factor w/ 3 levels "1","2","3": 1 2 2 2 3 1 1 2 3 1 ...
##  $ q3     : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 1 2 2 ...
##  $ q4     : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 3 2 1 3 1 ...
##  $ q5     : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 1 2 3 1 2 ...
##  $ q6     : Factor w/ 5 levels "1","2","3","4",..: 5 4 4 1 2 2 1 4 1 3 ...
##  $ q7     : Factor w/ 4 levels "1","2","3","4": 1 4 1 1 1 1 3 4 4 3 ...
##  $ q8     : Factor w/ 5 levels "1","2","3","4",..: 5 4 3 3 1 3 1 1 3 2 ...
##  $ q9     : Factor w/ 5 levels "1","2","3","4",..: 3 1 2 1 1 2 2 4 1 2 ...
##  $ q10    : Factor w/ 4 levels "1","2","3","4": 2 2 4 2 2 2 2 3 2 2 ...
##  $ i001   : num  0 NA NA NA NA 1 NA 2 NA NA ...
##  $ i002   : num  2 2 NA 0 1 NA NA 0 NA NA ...
##  $ i003   : num  NA 1 1 1 2 NA 1 NA NA NA ...
##  $ i004   : num  NA NA 2 NA NA NA 1 NA 0 1 ...
##  $ i005   : num  NA NA NA NA NA 2 NA NA 0 1 ...
##  $ i006   : num  2 NA NA NA NA 1 NA 0 NA NA ...
##  $ i007   : num  0 2 NA 2 1 NA NA 1 NA NA ...
##  $ i008   : num  NA 1 2 1 1 NA 1 NA NA NA ...
##  $ i009   : num  NA NA 2 NA NA NA 0 NA 1 1 ...
##  $ i010   : num  NA NA NA NA NA 0 NA NA 1 1 ...
##  $ i011   : num  1 NA NA NA NA 0 NA 1 NA NA ...
##  $ i012   : num  1 0 NA 1 1 NA NA 1 NA NA ...
##  $ i013   : num  NA 1 1 1 1 NA 1 NA NA NA ...
##  $ i014   : num  NA NA 1 NA NA NA 0 NA 0 0 ...
##  $ i015   : num  NA NA NA NA NA 1 NA NA 1 0 ...
##  $ i016   : num  1 NA NA NA NA 1 NA 0 NA NA ...
##  $ i017   : num  0 0 NA 1 0 NA NA 0 NA NA ...
##  $ i018   : num  NA 1 1 1 1 NA 1 NA NA NA ...
##  $ i019   : num  NA NA 0 NA NA NA 0 NA 0 0 ...
##  $ i020   : num  NA NA NA NA NA 1 NA NA 0 0 ...
##  $ i021   : num  0 NA NA NA NA 1 NA 0 NA NA ...
##  $ i022   : num  0 1 NA 1 0 NA NA 0 NA NA ...
##  $ i023   : num  NA 0 1 1 0 NA 1 NA NA NA ...
##  $ i024   : num  NA NA 0 NA NA NA 0 NA 1 1 ...
##  $ i025   : num  NA NA NA NA NA 1 NA NA 0 0 ...
##  $ i026   : num  0 NA NA NA NA 0 NA 0 NA NA ...
##  $ i027   : num  1 0 NA 1 1 NA NA 1 NA NA ...
##  $ i028   : num  NA 1 1 0 0 NA 0 NA NA NA ...
##  $ i029   : num  NA NA 1 NA NA NA 0 NA 0 0 ...
##  $ i030   : num  NA NA NA NA NA 0 NA NA 0 0 ...
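
The merged data frame has 42 columns because the 12 questionnaire columns and the 31 response columns share one key, subject, which appears only once in the result. The small base R sketch below shows the same behavior on two toy data frames. The objects bq and resp are hypothetical stand-ins holding a few values copied from the output above.

```r
# Toy stand-ins for the questionnaire and response data (hypothetical)
bq   <- data.frame(subject = 1:3, theta = c(0.88, 0.19, 1.458))
resp <- data.frame(i001 = c(0, NA, NA), i002 = c(2, 2, NA), subject = 1:3)

# The key column is kept once, so ncol is 2 + 3 - 1 = 4
merged <- merge(bq, resp, by = "subject")
print(merged)
print(ncol(merged))
```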