Browsing by Subject "Computerized adaptive testing"
Now showing 1 - 4 of 4
Item: A comparison of item selection procedures using different ability estimation methods in computerized adaptive testing based on the generalized partial credit model (2010-05)
Ho, Tsung-Han; Dodd, Barbara Glenzing; Powers, Daniel A.; Whittaker, Tiffany A.; Vaughn, Brandon K.
Computerized adaptive testing (CAT) provides a highly efficient alternative to the paper-and-pencil test. By selecting items that match examinees' ability levels, CAT can not only shorten test length and administration time but also increase measurement precision and reduce measurement error. In CAT, maximum information (MI) is the most widely used item selection procedure. The major challenge with MI, however, is the attenuation paradox: the MI algorithm may select items that are not well targeted at an examinee's true ability level, producing larger errors in subsequent ability estimates. The solution is to find an alternative item selection procedure or an appropriate ability estimation method. CAT studies have not investigated the association between these two components of a CAT system based on polytomous IRT models. The present study compared the performance of four item selection procedures (MI, MPWI, MEI, and MEPV) across four ability estimation methods (MLE, WLE, EAP-N, and EAP-PS) in a mixed-format CAT based on the generalized partial credit model (GPCM). The test-unit pool and generated responses were based on test units calibrated from an operational national test that included both independent dichotomous items and testlets. Several test conditions were manipulated: an unconstrained CAT as well as a constrained CAT in which the CCAT procedure was used for content balancing and the progressive-restricted procedure with a maximum exposure rate of 0.19 (PR19) served as the exposure control. The CAT conditions were evaluated in terms of measurement precision, exposure control properties, and the extent of selected-test-unit overlap. Results suggested that all item selection procedures, regardless of ability estimation method, performed equally well on all evaluation indices across the two CAT conditions. The MEPV procedure, however, was favorable in terms of a slightly lower maximum exposure rate, better pool utilization, and reduced test and selected-test-unit overlap relative to the other three item selection procedures when both the CCAT and PR19 procedures were implemented. It is therefore not necessary to implement the sophisticated, computing-intensive Bayesian item selection procedures across ability estimation methods in GPCM-based CAT. With respect to ability estimation, MLE, WLE, and the two EAP methods, regardless of item selection procedure, did not produce practically meaningful differences on any evaluation index across the two CAT conditions. The WLE method, however, generated significantly fewer non-convergent cases than the MLE method. It was concluded that the WLE method should be considered instead of MLE because non-convergence is less of an issue. The EAP estimation method, on the other hand, should be used with caution unless an appropriate prior θ distribution is specified.
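As a rough illustration of the MI selection procedure described in this abstract, the sketch below computes GPCM category probabilities and item information at an interim ability estimate and picks the unadministered item with the highest information. It is a minimal Python sketch under simplifying assumptions: the function names and the toy item pool are hypothetical, and content balancing, exposure control, and testlet structure are omitted.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category response probabilities for one GPCM item.
    theta: ability; a: discrimination; b: step parameters (length m for m+1 categories)."""
    # cumulative logits: sum of a*(theta - b_v) up to each category,
    # with the category-0 term fixed at zero by convention
    z = np.cumsum(np.concatenate(([0.0], a * (theta - np.asarray(b, dtype=float)))))
    ez = np.exp(z - z.max())          # subtract max for numerical stability
    return ez / ez.sum()

def gpcm_info(theta, a, b):
    """Fisher information of a GPCM item: a^2 times Var(category score | theta)."""
    p = gpcm_probs(theta, a, b)
    k = np.arange(len(p))
    return a ** 2 * (np.sum(k ** 2 * p) - np.sum(k * p) ** 2)

def select_max_info(theta_hat, pool, administered):
    """Maximum-information (MI) selection: next item = argmax of information
    at the interim ability estimate, among items not yet administered."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: gpcm_info(theta_hat, *pool[i]))

# Toy pool of three 4-category items: (discrimination, step parameters).
pool = [(1.2, [-0.8, 0.1, 0.9]), (0.9, [-1.5, -0.2, 1.3]), (1.6, [0.2, 0.7, 1.4])]
print(select_max_info(theta_hat=0.5, pool=pool, administered={1}))
```

Under the GPCM, item information at a given θ reduces to the squared discrimination times the conditional variance of the category score, which is what `gpcm_info` computes; the Bayesian procedures compared in the study would replace this point-information criterion with posterior-weighted or expected quantities.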
Item: A comparison of three statistical testing procedures for computerized classification testing with multiple cutscores and item selection methods (2014-05)
Haring, Samuel Heard; Dodd, Barbara Glenzing
Computerized classification tests (CCT) have been used in high-stakes assessment settings where the express purpose of testing is to assign a classification decision (e.g., pass/fail). One key feature of sequential probability ratio test (SPRT)-type procedures is that items are selected to maximize information around the cutscore region of the examinee ability distribution, as opposed to the common practice in computerized adaptive tests (CAT) of selecting items to maximize information at examinees' interim ability estimates. Previous research has examined the effectiveness of CATs using classification testing procedures with a single cutscore as well as with multiple cutscores (e.g., below basic/proficient/advanced). Several variations of the SPRT procedure have been advanced recently, including a generalized likelihood ratio (GLR) procedure. While the GLR procedure has shown evidence of improved average test length while reasonably maintaining classification accuracy, it also introduces unnecessary error. The purpose of this dissertation was to propose and investigate the functionality of a modified GLR procedure that does not incorporate the unnecessary error inherent in the GLR procedure. Additionally, this dissertation explored the use of multiple cutscores and of ability-based item selection. The dissertation investigated the performance of three classification procedures (SPRT, GLR, and modified GLR), multiple cutscores, and two test lengths. An additional set of conditions was developed in which an ability-based item selection method was used with the modified GLR. A simulation study was performed to gather evidence of the effectiveness and efficiency of the modified GLR procedure by comparing it to the SPRT and GLR procedures. The study found that the GLR and mGLR procedures yielded shorter test lengths, as anticipated. Additionally, the mGLR procedure using ability-based item selection produced even shorter test lengths than the cutscore-based mGLR method. Overall, the classification accuracy of the procedures was reasonably close. Examination of conditional classification accuracy in the multiple-cutscore conditions showed unexpectedly low values for each of the procedures. Implications and future research are discussed herein.
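To make the sequential classification logic concrete, here is a minimal sketch of an SPRT decision for a single cutscore. The 2PL response model, the indifference-region half-width delta, the error rates, and the function names are illustrative assumptions rather than the procedures as implemented in the dissertation; roughly speaking, the GLR variant replaces the two fixed evaluation points at the cutscore plus or minus delta with maximized likelihoods on each side of the indifference region.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response (illustrative model choice)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a dichotomous response pattern at a fixed theta."""
    ll = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        ll += u * np.log(p) + (1 - u) * np.log(1.0 - p)
    return ll

def sprt_decision(responses, items, cutscore, delta=0.3, alpha=0.05, beta=0.05):
    """SPRT for one cutscore: compare the log-likelihood ratio of
    theta_c + delta versus theta_c - delta against the usual bounds."""
    llr = (log_likelihood(cutscore + delta, responses, items)
           - log_likelihood(cutscore - delta, responses, items))
    upper = np.log((1.0 - beta) / alpha)     # classify "above" past this bound
    lower = np.log(beta / (1.0 - alpha))     # classify "below" past this bound
    if llr >= upper:
        return "above"
    if llr <= lower:
        return "below"
    return "continue"                        # otherwise administer another item

items = [(1.0, 0.0), (1.2, 0.4), (0.8, -0.3)]
print(sprt_decision(responses=[1, 1, 0], items=items, cutscore=0.0))
```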
Item: Domain score estimation in adaptive test assembly for single-subject multiple-domain content (2023-08)
Lim, Sangdon; Choi, Seung-weon, 1965-; Keng, Leslie; Whittaker, Tiffany A; Kang, Hyeon-Ah
Educational assessments require that scores have good reliability. Based on item response theory, computerized adaptive testing allows for constructing tests that provide scores with higher reliability than their counterparts based on classical test theory. Test construction in computerized adaptive testing involves assembling a test from a large collection of items, subject to various test specifications and an optimality criterion. One example of an optimality criterion is maximum test information, which is closely associated with reliability. Constructs assessed in educational settings often have multiple domains under a single content subject.
This gives rise to two types of scores to be measured and reported: (1) overall scores and (2) domain scores. One approach for obtaining these, taken in a real-world adaptive testing program, is the separate-models approach: a correlated-factors model serves as the main test assembly model to obtain domain scores, and after a test is completed a bifactor model is fitted separately to obtain overall scores. Alternatively, overall and domain scores may be obtained from a single model. The single-model approach uses a bifactor model as the main test assembly model, and weighted composites of general and specific factor scores can be taken as overall and domain scores. The choice between the separate-models approach and the single-model approach is essentially the choice between using a correlated-factors model or a bifactor model for the main test assembly. The model choice is important because it determines the structure of the main test assembly: how many dimensions the interim ability estimates will have and how many dimensions the item parameters should have. The choice of the main test assembly model also has implications for the recovery of ability parameters and for score reliability. One advantage of the separate-models approach is that it allows between-domain correlations to be used as priors to aid ability estimation. In practice, however, this can also introduce estimation bias in the between-domain correlations. Because the correlation estimates are obtained at the calibration stage, before an adaptive testing system is deployed, estimation errors in between-domain correlations can propagate into subsequent steps, which may have detrimental effects on the recovery of true ability in adaptive tests under the separate-models approach. In contrast, the single-model approach can be less susceptible to this problem, because between-factor correlations can be assumed to be zero when a bifactor model is used as the main model. A drawback of the single-model approach, however, is that it cannot benefit from between-domain correlations obtained at the calibration stage, because a bifactor model, rather than a correlated-factors model, is used for calibration and its between-factor correlations are assumed to be zero.
Test assembly for educational assessments also requires satisfying a test content blueprint; this is referred to as the content balancing problem in the test assembly literature. There are two main frameworks for content balancing: (1) heuristic approaches and (2) optimal test design approaches. Heuristic approaches are currently more widely adopted, but they do not ensure that all content requirements are satisfied. Optimal test design approaches offer the advantage of ensuring that all content requirements are strictly satisfied. For multidimensional tests, one problem makes optimal test design approaches difficult to use compared with heuristic approaches: these approaches require (1) a scalar-valued information quantity for each item and (2) that this quantity be additive between item- and test-level values, whereas Fisher information is matrix-valued rather than scalar-valued in multidimensional cases. One scalar-valued alternative is directional information, which meets both requirements of optimal test design approaches.
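For readers unfamiliar with directional information, the sketch below shows the basic computation under an illustrative multidimensional 2PL model: each item's Fisher information matrix is projected onto a unit direction of interest, yielding a scalar that is additive across items, which are the two properties optimal test design approaches require. The model, parameter values, and function names are assumptions for illustration only, not the dissertation's assembly models.

```python
import numpy as np

def m2pl_prob(theta, a, d):
    """Multidimensional 2PL probability of a correct response (illustrative model)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

def item_fisher_info(theta, a, d):
    """Fisher information MATRIX of one M2PL item: p(1 - p) * a a^T."""
    a = np.asarray(a, dtype=float)
    p = m2pl_prob(theta, a, d)
    return p * (1.0 - p) * np.outer(a, a)

def directional_info(theta, a, d, direction):
    """Scalar information along a unit direction u: u^T I(theta) u."""
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)
    return float(u @ item_fisher_info(theta, a, d) @ u)

theta = np.array([0.3, -0.2])                       # interim estimates on two domains
items = [([1.1, 0.4], -0.2), ([0.3, 1.3], 0.5)]     # (slope vector, intercept) pairs
u = [1.0, 1.0]                                      # equally weighted composite direction
# Additivity: test-level directional information is the sum of item-level values.
print(sum(directional_info(theta, a, d, u) for a, d in items))
```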
The current study simulated adaptive tests to compare the separate-models and single-model approaches in terms of domain score recovery, and to compare content balancing methods in terms of satisfying content requirements. To allow a neutral comparison between the two scoring approaches, simulation input was generated from a higher-order model first and then converted to correlated-factors and bifactor formats for use in the two scoring approaches. Calibration error was simulated, and the calibration sample size was varied. For content balancing, an optimal test design method was implemented using directional information and compared with three heuristic content balancing methods. Among the heuristic methods, a multidimensional extension of the weighted deviation method was not available in the literature, so such an extension was developed in the current study. In the simulation, the single-model approach had better domain score recovery than the separate-models approach when calibration error was present. Estimated between-domain correlations had a small negative calibration bias of -0.15 in the correlated-factors models used by the separate-models approach. These results suggest that estimation error in between-domain correlations may lead to less accurate domain scores when the separate-models approach is used. Recovery of overall scores was similar between the two scoring approaches. For content balancing, the optimal test design approach satisfied all content requirements in every assembled test, with the weighted penalty method following closely with near-perfect satisfaction rates; the weighted deviation method had the lowest satisfaction rates. These results provide evidence of how correlation estimation error can have a detrimental effect on domain score estimation under the separate-models approach. The main finding of the current study was that the single-model approach may offer an alternative that provides more accurate domain score estimates than the separate-models approach. Another finding was that the optimal test design approach to content balancing ensures all content requirements are strictly satisfied in multidimensional contexts. Considering that the weighted penalty method is one of the content balancing methods currently in wide use, the optimal test design method may offer a viable alternative to it.
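As a small illustration of what strictly satisfying a content blueprint means in this context, the check below verifies minimum and maximum item counts per content domain for an assembled test. The domain labels and bounds are invented for the example; actual blueprints, and the assembly optimization itself, are considerably richer.

```python
from collections import Counter

def meets_blueprint(selected_domains, blueprint):
    """Return True if the assembled test satisfies the min/max item count
    for every content domain in the blueprint."""
    counts = Counter(selected_domains)
    return all(lo <= counts[dom] <= hi for dom, (lo, hi) in blueprint.items())

# Invented blueprint: domain -> (minimum items, maximum items).
blueprint = {"algebra": (3, 5), "geometry": (2, 4), "statistics": (1, 3)}
assembled = ["algebra", "algebra", "geometry", "algebra", "statistics", "geometry"]
print(meets_blueprint(assembled, blueprint))   # True: 3 algebra, 2 geometry, 1 statistics
```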
Item: Extension of the item pocket method allowing for response review and revision to a computerized adaptive test using the generalized partial credit model (2017-08-14)
Jensen, Mishan G. B.; Whittaker, Tiffany A.; Beretvas, Susan N; Dodd, Barbara G; Hersh, Matthew A; Pituch, Keenan A
The use of computerized adaptive testing (CAT) has increased in the last few decades, due in part to the increased use and availability of personal computers, but also due to the benefits of CATs. CATs provide increased measurement precision of ability estimates while decreasing the demand on examinees through shorter tests. This is accomplished by tailoring the test to each examinee, selecting items that are neither too difficult nor too easy based on the examinee's interim ability estimate and responses to previous items. These benefits come at the cost of the flexibility to move through the test as an examinee would with a paper-and-pencil (P&P) test.
The algorithms used in CATs for item selection and ability estimation require restrictions on response review and revision; however, a large portion of examinees desire options for reviewing and revising their responses (Vispoel, Clough, Bleiler, Hendrickson, and Ihrig, 2002). Previous research has examined response review and revision in CATs with limited review and revision options, typically restricted to after all items had been administered. The development of the item pocket (IP) method (Han, 2013) allows for response review and revision during the test, relaxing these restrictions while maintaining an acceptable level of measurement precision. This is achieved by creating an item pocket into which items are placed; pocketed items are excluded from the interim ability estimation and item selection procedures. The initial simulation study was conducted by Han (2013), who investigated the IP method using a dichotomously scored fixed-length test. The findings indicated that the IP method does not substantially decrease measurement precision and that bias in the ability estimates was within acceptable ranges for operational tests. The present simulation study extended the IP method to a CAT with polytomously scored items based on the generalized partial credit model, with exposure control and content balancing. The IP method was implemented in tests with three IP sizes (2, 3, and 4), two termination criteria (fixed and variable length), two test lengths (15 and 20 items), and two item completion conditions (forced to answer and ignored) for items remaining in the IP at the end of the test. Additionally, four traditional CAT conditions without the IP method were included in the design. The longer, 20-item IP conditions using the forced-answer method had higher measurement precision, with higher mean correlations between known and estimated theta and lower mean bias and RMSE, and measurement precision increased as IP size increased. The two item completion conditions (forced to answer and ignored) resulted in similar measurement precision, and the variable-length IP conditions resulted in measurement precision comparable to the corresponding fixed-length IP conditions. The implications of the findings and the limitations, with suggestions for future research, are also discussed.
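The core mechanic of the IP method, setting items aside so they are excluded from interim ability estimation and item selection until the examinee revisits them, can be sketched as follows. This simplified illustration uses a dichotomous 2PL stand-in and a grid-based EAP estimate rather than the GPCM-based procedures studied in the dissertation; the function names and toy values are assumptions.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response (dichotomous stand-in for the GPCM)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def interim_theta(responses, pocket, grid=np.linspace(-4, 4, 161)):
    """Grid-based EAP-style interim estimate that ignores items in the pocket.
    responses: dict item_id -> (a, b, scored response); pocket: item_ids set aside."""
    prior = np.exp(-0.5 * grid ** 2)          # standard normal prior (unnormalized)
    like = np.ones_like(grid)
    for item_id, (a, b, u) in responses.items():
        if item_id in pocket:
            continue                          # pocketed items are excluded here and
                                              # from item selection (not shown)
        p = p_correct(grid, a, b)
        like = like * (p if u == 1 else 1.0 - p)
    post = prior * like
    return float(np.sum(grid * post) / np.sum(post))

responses = {1: (1.2, 0.0, 1), 2: (0.9, -0.5, 0), 3: (1.4, 0.6, 1)}
print(interim_theta(responses, pocket={3}))   # item 3 sits in the pocket pending revision
```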