Introduction
This vignette is a three-part introductory walk-through that illustrates a potential workflow for generating large-scale assessment data:
- Generating background questionnaire data
- Generating achievement data
- Combining background questionnaire data with achievement data
The function questionnaire_gen generates correlated ordinal and continuous data that resemble background questionnaire data. The required arguments are
- n_obs: the number of observations (e.g., test takers).
- cat_prop: a list of vectors, each holding the cumulative proportions up to each category of one item.
- cor_matrix: a possibly heterogeneous correlation matrix, consisting of polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables.
The arguments c_mean and c_sd are scaling parameters for continuous variables and will be addressed in future tutorials. If the logical argument theta is TRUE, then the first continuous variable generated will be labeled 'theta'.
backgroud_q <- questionnaire_gen(n_obs, cat_prop, cor_matrix, c_mean = NULL,
c_sd = NULL, theta = FALSE)
The cat_prop argument of questionnaire_gen requires a list of vectors, where each vector contains the cumulative proportions for the categories of a given item. When the actual cumulative proportions do not matter, lsasim provides a function for generating random cumulative proportions, proportion_gen.
Two arguments are required for proportion_gen:
- cat_options: a vector containing all the types of category options for the background questionnaire items.
- n_cat_options: a vector containing the number of items that correspond to each element of cat_options.
Below, we specify resp_typs as a length-3 vector, meaning there will be 3 different item types: 1-category items (continuous), 3-category items, and 5-category items. Correspondingly, n_typs is a length-3 vector that indicates the number of items for each response type in resp_typs. That is, there will be one 1-category item, five 3-category items, and five 5-category items, 11 items in all. We then pass those objects to proportion_gen.
cat_pr is a length-11 list of vectors. Each vector contains the cumulative proportions for a given item. The length of each vector corresponds to the number of response categories for that item.
resp_typs <- c(1, 3, 5)
n_typs <- c(1, 5, 5)
cat_pr <- proportion_gen(cat_options = resp_typs, n_cat_options = n_typs)
print(cat_pr)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 0.25 0.45 1.00
##
## [[3]]
## [1] 0.35 0.90 1.00
##
## [[4]]
## [1] 0.16 0.95 1.00
##
## [[5]]
## [1] 0.11 0.83 1.00
##
## [[6]]
## [1] 0.36 0.80 1.00
##
## [[7]]
## [1] 0.44 0.70 0.81 0.95 1.00
##
## [[8]]
## [1] 0.40 0.58 0.71 1.00 1.00
##
## [[9]]
## [1] 0.28 0.41 0.74 0.79 1.00
##
## [[10]]
## [1] 0.52 0.79 0.80 0.99 1.00
##
## [[11]]
## [1] 0.32 0.85 0.93 1.00 1.00
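Because each vector stores cumulative proportions, the expected share of responses in each category is the difference between consecutive entries. The following is a minimal base R sketch using item 8 from the output above; the object names cum_prop and cat_share are illustrative and not part of lsasim.

```r
# Cumulative proportions for item 8, copied from the printed output above
cum_prop <- c(0.40, 0.58, 0.71, 1.00, 1.00)

# Differencing (with a leading 0) recovers the per-category proportions
cat_share <- diff(c(0, cum_prop))
print(round(cat_share, 2))
```

The last category has a share of zero because the cumulative proportion already reaches 1.00 at the fourth category. An item like this can therefore end up with fewer observed levels than nominal categories in the generated data.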
The cor_matrix argument of questionnaire_gen requires a correlation matrix with dimensions equal to the length of cat_prop. As with the cumulative proportions above, if the actual correlations are not important, lsasim has a function to generate random correlation matrices, cor_gen. The only argument required for cor_gen is n_var, the number of variables.
Below, n_vars is the number of variables in cat_pr. This ensures dim(cor_mat)[1] will equal length(cat_pr).
n_vars <- length(rep(resp_typs, n_typs))
cor_mat <- cor_gen(n_var = n_vars)
round(cor_mat, 2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 1.00 -0.05 -0.16 -0.07 -0.13 -0.35 0.50 -0.42 0.56 0.02 0.41
## [2,] -0.05 1.00 0.54 -0.23 0.08 -0.43 0.02 0.16 0.51 0.12 -0.05
## [3,] -0.16 0.54 1.00 0.01 -0.04 -0.44 0.00 0.26 0.06 -0.52 0.20
## [4,] -0.07 -0.23 0.01 1.00 0.24 -0.26 -0.42 -0.16 -0.31 -0.47 -0.23
## [5,] -0.13 0.08 -0.04 0.24 1.00 -0.21 -0.37 -0.11 -0.11 -0.17 -0.58
## [6,] -0.35 -0.43 -0.44 -0.26 -0.21 1.00 -0.08 0.23 -0.21 0.44 0.04
## [7,] 0.50 0.02 0.00 -0.42 -0.37 -0.08 1.00 -0.02 0.52 0.26 0.26
## [8,] -0.42 0.16 0.26 -0.16 -0.11 0.23 -0.02 1.00 -0.13 -0.17 -0.09
## [9,] 0.56 0.51 0.06 -0.31 -0.11 -0.21 0.52 -0.13 1.00 0.23 -0.01
## [10,] 0.02 0.12 -0.52 -0.47 -0.17 0.44 0.26 -0.17 0.23 1.00 0.17
## [11,] 0.41 -0.05 0.20 -0.23 -0.58 0.04 0.26 -0.09 -0.01 0.17 1.00
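Whatever its source, a matrix passed as cor_matrix should be a valid correlation matrix: symmetric, with a unit diagonal, and positive definite. The sketch below shows how these properties could be checked in base R. The stand-in matrix r is built with cor() on random data purely for illustration; this is an assumption and not how cor_gen works internally.

```r
set.seed(1)

# Stand-in for cor_mat: a random 4 x 4 correlation matrix built in base R
# (illustration only, not the method used by cor_gen)
x <- matrix(rnorm(40), nrow = 10, ncol = 4)
r <- cor(x)

# Three basic validity checks for a correlation matrix
is_symmetric  <- isSymmetric(r)
unit_diagonal <- all(abs(diag(r) - 1) < 1e-12)
min_eigen     <- min(eigen(r, symmetric = TRUE, only.values = TRUE)$values)

print(c(symmetric = is_symmetric, unit_diag = unit_diagonal,
        pos_def = min_eigen > 0))
```

The same three checks can be applied to cor_mat before it is handed to questionnaire_gen.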
Now that we have cat_pr and cor_mat specified, all that is left is to specify the number of subjects we want to generate data for, which we set to 100. Because we will generate a continuous variable, we can specify theta = TRUE.
nn <- 100
backgroud_q <- questionnaire_gen(n_obs = nn, cat_prop = cat_pr,
cor_matrix = cor_mat, theta = TRUE)
str(backgroud_q)
## 'data.frame': 100 obs. of 12 variables:
## $ subject: int 1 2 3 4 5 6 7 8 9 10 ...
## $ theta : num 0.88 0.19 1.458 0.592 -0.06 ...
## $ q1 : Factor w/ 3 levels "1","2","3": 1 3 2 3 2 3 2 3 3 1 ...
## $ q2 : Factor w/ 3 levels "1","2","3": 1 2 2 2 3 1 1 2 3 1 ...
## $ q3 : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 1 2 2 ...
## $ q4 : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 3 2 1 3 1 ...
## $ q5 : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 1 2 3 1 2 ...
## $ q6 : Factor w/ 5 levels "1","2","3","4",..: 5 4 4 1 2 2 1 4 1 3 ...
## $ q7 : Factor w/ 4 levels "1","2","3","4": 1 4 1 1 1 1 3 4 4 3 ...
## $ q8 : Factor w/ 5 levels "1","2","3","4",..: 5 4 3 3 1 3 1 1 3 2 ...
## $ q9 : Factor w/ 5 levels "1","2","3","4",..: 3 1 2 1 1 2 2 4 1 2 ...
## $ q10 : Factor w/ 4 levels "1","2","3","4": 2 2 4 2 2 2 2 3 2 2 ...
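The internals of questionnaire_gen are not shown in this vignette. One common way to obtain ordinal responses that match a set of cumulative proportions is to cut a latent standard normal variable at the corresponding normal quantiles. The sketch below illustrates that general idea for a single item in base R; it is an assumption about the technique, not lsasim's code, and the object names are illustrative.

```r
set.seed(123)

# Cumulative proportions for item 2, copied from cat_pr above
cum_prop <- c(0.25, 0.45, 1.00)

# Latent standard normal scores for 10,000 hypothetical respondents
z <- rnorm(10000)

# Thresholds are the normal quantiles of the interior cumulative proportions
cuts <- c(-Inf, qnorm(cum_prop[-length(cum_prop)]), Inf)

# Cutting the latent scores at the thresholds yields the ordinal responses
resp <- cut(z, breaks = cuts, labels = seq_along(cum_prop))

# Observed cumulative proportions should be close to cum_prop
observed <- cumsum(prop.table(table(resp)))
print(round(observed, 2))
```

This single-item sketch ignores the correlations between variables, which questionnaire_gen additionally imposes through cor_matrix.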
The lsasim package comes with five functions to help generate so-called achievement data.
The function item_gen is a flexible tool for generating item parameters from a range of item response models.
The arguments n_2pl and n_3pl specify how many of each item type are to be generated. For this example, we will generate 10 2PL items and 20 3PL items. The argument thresholds specifies the number of thresholds for the 2PL items. Because thresholds = 2, the 10 2PL items come from a generalized partial credit model with 2 thresholds. Finally, the arguments b_bounds, a_bounds, and c_bounds specify the bounds of the b, a, and c parameters, respectively. Note that c_bounds applies only to the 20 3PL items.
item_pool <- item_gen(n_2pl = 10, n_3pl = 20, thresholds = 2,
b_bounds = c(-2, 2), a_bounds = c(.75, 1.25),
c_bounds = c(0, .25))
print(item_pool)
## item b d1 d2 a c k p
## 1 1 -0.50 -0.26 0.26 0.76 0.00 2 2
## 2 2 -1.14 -0.34 0.35 0.85 0.00 2 2
## 3 3 1.12 -0.74 0.74 0.78 0.00 2 2
## 4 4 -0.34 -0.50 0.49 1.05 0.00 2 2
## 5 5 -0.96 -0.63 0.64 1.01 0.00 2 2
## 6 6 0.81 -0.62 0.62 1.22 0.00 2 2
## 7 7 -0.59 -0.74 0.74 1.01 0.00 2 2
## 8 8 0.32 -0.28 0.29 0.96 0.00 2 2
## 9 9 -0.25 -0.66 0.67 1.02 0.00 2 2
## 10 10 0.44 -0.63 0.62 0.86 0.00 2 2
## 11 11 0.72 0.00 0.00 1.13 0.11 1 3
## 12 12 -1.61 0.00 0.00 0.92 0.00 1 3
## 13 13 -1.55 0.00 0.00 0.90 0.06 1 3
## 14 14 -0.15 0.00 0.00 1.21 0.07 1 3
## 15 15 -0.35 0.00 0.00 1.09 0.11 1 3
## 16 16 0.91 0.00 0.00 0.84 0.02 1 3
## 17 17 1.45 0.00 0.00 0.76 0.08 1 3
## 18 18 -1.59 0.00 0.00 1.04 0.24 1 3
## 19 19 -0.56 0.00 0.00 0.91 0.02 1 3
## 20 20 0.53 0.00 0.00 0.78 0.02 1 3
## 21 21 0.82 0.00 0.00 0.80 0.10 1 3
## 22 22 1.39 0.00 0.00 0.78 0.12 1 3
## 23 23 0.28 0.00 0.00 0.80 0.03 1 3
## 24 24 -0.41 0.00 0.00 1.10 0.18 1 3
## 25 25 -1.01 0.00 0.00 0.95 0.21 1 3
## 26 26 0.50 0.00 0.00 1.13 0.11 1 3
## 27 27 -0.87 0.00 0.00 0.99 0.08 1 3
## 28 28 -0.29 0.00 0.00 0.84 0.04 1 3
## 29 29 1.94 0.00 0.00 0.85 0.14 1 3
## 30 30 1.98 0.00 0.00 1.23 0.12 1 3
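To make the roles of the b, a, and c parameters concrete, the sketch below evaluates the usual three-parameter logistic (3PL) response function for one item from the table above. The function name p_3pl is hypothetical, and the logistic form without a 1.7 scaling constant is an assumption; the vignette does not state which parameterization lsasim uses.

```r
# Probability of a correct response under a 3PL model
# (logistic form without the 1.7 scaling constant; an assumption)
p_3pl <- function(theta, a, b, c) {
  c + (1 - c) / (1 + exp(-a * (theta - b)))
}

# Item 11 from the table above: b = 0.72, a = 1.13, c = 0.11
# At theta = b the probability is halfway between c and 1
p_at_b <- p_3pl(theta = 0.72, a = 1.13, b = 0.72, c = 0.11)

# For very low ability the probability approaches the guessing parameter c
p_low <- p_3pl(theta = -10, a = 1.13, b = 0.72, c = 0.11)

print(c(p_at_b = p_at_b, p_low = p_low))
```

With c = 0 the same function reduces to the two-parameter logistic form for a dichotomous item.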
The lsasim package has three functions that facilitate multi-matrix sampling designs.
The first function, block_design, takes a set of item parameters and distributes them across a set of item blocks (sometimes referred to as clusters). The n_blocks argument specifies the number of blocks, while the item_parameters argument takes a data frame of item parameters. The default item-block allocation is a spiraling design.
The result of block_design is a length-2 list.
blocks <- block_design(n_blocks = 5, item_parameters = item_pool)
print(blocks$block_assignment)
## b1 b2 b3 b4 b5
## i1 1 2 3 4 5
## i2 6 7 8 9 10
## i3 11 12 13 14 15
## i4 16 17 18 19 20
## i5 21 22 23 24 25
## i6 26 27 28 29 30
print(blocks$block_descriptives)
## block length average difficulty
## b1 6 0.543
## b2 6 -0.228
## b3 6 -0.285
## b4 6 0.038
## b5 6 0.105
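The average difficulty reported for each block appears to be the mean of the b parameters of the items assigned to it. The sketch below checks this for block b1 using the rounded b values printed in the item pool above; the object names are illustrative.

```r
# b values of the six items in block b1 (items 1, 6, 11, 16, 21, 26),
# copied from the printed item pool above (rounded to two decimals)
b_block1 <- c(-0.50, 0.81, 0.72, 0.91, 0.82, 0.50)

# Mean difficulty of the block, rounded as in block_descriptives
avg_b1 <- round(mean(b_block1), 3)
print(avg_b1)
```

The result agrees with the value shown for b1 in blocks$block_descriptives.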
The second function, booklet_design, assigns the item blocks to test booklets. The only argument used below is item_block_assignment, which takes an item-block matrix. Since we are using booklet_design in tandem with block_design, we can use the output from block_design.
#--- Assign blocks to booklets
booklets <- booklet_design(item_block_assignment = blocks$block_assignment)
print(booklets)
## B1 B2 B3 B4 B5
## i1 1 2 3 4 1
## i2 6 7 8 9 6
## i3 11 12 13 14 11
## i4 16 17 18 19 16
## i5 21 22 23 24 21
## i6 26 27 28 29 26
## i7 2 3 4 5 5
## i8 7 8 9 10 10
## i9 12 13 14 15 15
## i10 17 18 19 20 20
## i11 22 23 24 25 25
## i12 27 28 29 30 30
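In the printed output, each booklet combines two blocks: booklets B1 to B4 pair consecutive blocks, and B5 pairs the first and last block. The base R sketch below reproduces that layout from a 6 x 5 item-block matrix. It only mirrors the output shown here; it is not lsasim's implementation, and the object names are illustrative.

```r
# Item-block matrix as printed earlier: 6 rows, 5 blocks, items filled by row
block_mat <- matrix(1:30, nrow = 6, ncol = 5, byrow = TRUE)
n_blocks <- ncol(block_mat)

# Booklet j pairs block j with block j + 1; the last booklet wraps around
# and pairs block 5 with block 1. Sorting puts the lower block first.
pairs <- lapply(seq_len(n_blocks), function(j) sort(c(j, j %% n_blocks + 1)))

# Stack the items of the two blocks to form each booklet column
booklet_mat <- sapply(pairs, function(p) as.vector(block_mat[, p]))
colnames(booklet_mat) <- paste0("B", seq_len(n_blocks))
print(booklet_mat)
```

Each booklet contains 12 of the 30 items, so every test taker is administered only part of the item pool.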
The final function for generating item responses under a multi-matrix design is booklet_sample, which facilitates the distribution of test booklets to test takers. The argument n_subj is the number of test takers, while the argument book_item_design is the output from booklet_design. The default is to distribute test booklets equally among test takers.
#--- Assign booklets to subjects
subj_booklets <- booklet_sample(n_subj = nn, book_item_design = booklets)
head(subj_booklets, 10)
## subject book item
## 1 1 1 1
## 2 1 1 6
## 3 1 1 11
## 4 1 1 16
## 5 1 1 21
## 6 1 1 26
## 7 1 1 2
## 8 1 1 7
## 9 1 1 12
## 10 1 1 17
tail(subj_booklets, 10)
## subject book item
## 1191 100 4 14
## 1192 100 4 19
## 1193 100 4 24
## 1194 100 4 29
## 1195 100 4 5
## 1196 100 4 10
## 1197 100 4 15
## 1198 100 4 20
## 1199 100 4 25
## 1200 100 4 30
The final function to generate achievement data is the response_gen function. The function is designed to take information from three distinct sources:
- The test booklet and test taker information, which come from the booklet_sample function.
- The latent true score for each student, which may come from the questionnaire_gen function.
- The item parameters, which come from the item_gen function.
Notice that d_par takes a list. These are the deviations from the mean item difficulty (b_par) for partial credit items.
item_responses <- response_gen(subject = subj_booklets$subject, item = subj_booklets$item,
theta = backgroud_q$theta, b_par = item_pool$b,
a_par = item_pool$a, c_par = item_pool$c,
d_par = list(item_pool$d1, item_pool$d2))
str(item_responses)
## 'data.frame': 100 obs. of 31 variables:
## $ i001 : num 0 NA NA NA NA 1 NA 2 NA NA ...
## $ i002 : num 2 2 NA 0 1 NA NA 0 NA NA ...
## $ i003 : num NA 1 1 1 2 NA 1 NA NA NA ...
## $ i004 : num NA NA 2 NA NA NA 1 NA 0 1 ...
## $ i005 : num NA NA NA NA NA 2 NA NA 0 1 ...
## $ i006 : num 2 NA NA NA NA 1 NA 0 NA NA ...
## $ i007 : num 0 2 NA 2 1 NA NA 1 NA NA ...
## $ i008 : num NA 1 2 1 1 NA 1 NA NA NA ...
## $ i009 : num NA NA 2 NA NA NA 0 NA 1 1 ...
## $ i010 : num NA NA NA NA NA 0 NA NA 1 1 ...
## $ i011 : num 1 NA NA NA NA 0 NA 1 NA NA ...
## $ i012 : num 1 0 NA 1 1 NA NA 1 NA NA ...
## $ i013 : num NA 1 1 1 1 NA 1 NA NA NA ...
## $ i014 : num NA NA 1 NA NA NA 0 NA 0 0 ...
## $ i015 : num NA NA NA NA NA 1 NA NA 1 0 ...
## $ i016 : num 1 NA NA NA NA 1 NA 0 NA NA ...
## $ i017 : num 0 0 NA 1 0 NA NA 0 NA NA ...
## $ i018 : num NA 1 1 1 1 NA 1 NA NA NA ...
## $ i019 : num NA NA 0 NA NA NA 0 NA 0 0 ...
## $ i020 : num NA NA NA NA NA 1 NA NA 0 0 ...
## $ i021 : num 0 NA NA NA NA 1 NA 0 NA NA ...
## $ i022 : num 0 1 NA 1 0 NA NA 0 NA NA ...
## $ i023 : num NA 0 1 1 0 NA 1 NA NA NA ...
## $ i024 : num NA NA 0 NA NA NA 0 NA 1 1 ...
## $ i025 : num NA NA NA NA NA 1 NA NA 0 0 ...
## $ i026 : num 0 NA NA NA NA 0 NA 0 NA NA ...
## $ i027 : num 1 0 NA 1 1 NA NA 1 NA NA ...
## $ i028 : num NA 1 1 0 0 NA 0 NA NA NA ...
## $ i029 : num NA NA 1 NA NA NA 0 NA 0 0 ...
## $ i030 : num NA NA NA NA NA 0 NA NA 0 0 ...
## $ subject: int 1 2 3 4 5 6 7 8 9 10 ...
Finally, we use the merge function to combine backgroud_q and item_responses into a single data frame.
final_data <- merge(backgroud_q, item_responses, by = "subject")
str(final_data)
## 'data.frame': 100 obs. of 42 variables:
## $ subject: int 1 2 3 4 5 6 7 8 9 10 ...
## $ theta : num 0.88 0.19 1.458 0.592 -0.06 ...
## $ q1 : Factor w/ 3 levels "1","2","3": 1 3 2 3 2 3 2 3 3 1 ...
## $ q2 : Factor w/ 3 levels "1","2","3": 1 2 2 2 3 1 1 2 3 1 ...
## $ q3 : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 1 2 2 ...
## $ q4 : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 3 2 1 3 1 ...
## $ q5 : Factor w/ 3 levels "1","2","3": 2 1 1 1 1 1 2 3 1 2 ...
## $ q6 : Factor w/ 5 levels "1","2","3","4",..: 5 4 4 1 2 2 1 4 1 3 ...
## $ q7 : Factor w/ 4 levels "1","2","3","4": 1 4 1 1 1 1 3 4 4 3 ...
## $ q8 : Factor w/ 5 levels "1","2","3","4",..: 5 4 3 3 1 3 1 1 3 2 ...
## $ q9 : Factor w/ 5 levels "1","2","3","4",..: 3 1 2 1 1 2 2 4 1 2 ...
## $ q10 : Factor w/ 4 levels "1","2","3","4": 2 2 4 2 2 2 2 3 2 2 ...
## $ i001 : num 0 NA NA NA NA 1 NA 2 NA NA ...
## $ i002 : num 2 2 NA 0 1 NA NA 0 NA NA ...
## $ i003 : num NA 1 1 1 2 NA 1 NA NA NA ...
## $ i004 : num NA NA 2 NA NA NA 1 NA 0 1 ...
## $ i005 : num NA NA NA NA NA 2 NA NA 0 1 ...
## $ i006 : num 2 NA NA NA NA 1 NA 0 NA NA ...
## $ i007 : num 0 2 NA 2 1 NA NA 1 NA NA ...
## $ i008 : num NA 1 2 1 1 NA 1 NA NA NA ...
## $ i009 : num NA NA 2 NA NA NA 0 NA 1 1 ...
## $ i010 : num NA NA NA NA NA 0 NA NA 1 1 ...
## $ i011 : num 1 NA NA NA NA 0 NA 1 NA NA ...
## $ i012 : num 1 0 NA 1 1 NA NA 1 NA NA ...
## $ i013 : num NA 1 1 1 1 NA 1 NA NA NA ...
## $ i014 : num NA NA 1 NA NA NA 0 NA 0 0 ...
## $ i015 : num NA NA NA NA NA 1 NA NA 1 0 ...
## $ i016 : num 1 NA NA NA NA 1 NA 0 NA NA ...
## $ i017 : num 0 0 NA 1 0 NA NA 0 NA NA ...
## $ i018 : num NA 1 1 1 1 NA 1 NA NA NA ...
## $ i019 : num NA NA 0 NA NA NA 0 NA 0 0 ...
## $ i020 : num NA NA NA NA NA 1 NA NA 0 0 ...
## $ i021 : num 0 NA NA NA NA 1 NA 0 NA NA ...
## $ i022 : num 0 1 NA 1 0 NA NA 0 NA NA ...
## $ i023 : num NA 0 1 1 0 NA 1 NA NA NA ...
## $ i024 : num NA NA 0 NA NA NA 0 NA 1 1 ...
## $ i025 : num NA NA NA NA NA 1 NA NA 0 0 ...
## $ i026 : num 0 NA NA NA NA 0 NA 0 NA NA ...
## $ i027 : num 1 0 NA 1 1 NA NA 1 NA NA ...
## $ i028 : num NA 1 1 0 0 NA 0 NA NA NA ...
## $ i029 : num NA NA 1 NA NA NA 0 NA 0 0 ...
## $ i030 : num NA NA NA NA NA 0 NA NA 0 0 ...
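The merged data frame has 42 variables because the two inputs share only the key column subject: 12 + 31 - 1 = 42. The toy example below shows the same behavior of base R's merge on two small, made-up data frames; the objects bq, ir, and toy are hypothetical stand-ins for backgroud_q, item_responses, and final_data.

```r
# Toy stand-ins for the background and item response data
bq <- data.frame(subject = 1:3, theta = c(0.88, 0.19, 1.458))
ir <- data.frame(i001 = c(0, NA, NA), i002 = c(2, 2, NA), subject = 1:3)

# Merge on the shared key; the key appears once in the result
toy <- merge(bq, ir, by = "subject")
str(toy)
```

The key column comes first in the result, followed by the remaining columns of the first data frame and then those of the second, which matches the column order of final_data above.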