Comparing item selection methods in computerized adaptive testing using the rating scale model

Butterfield, Meredith Sibley

Date

2016-08

Authors

Butterfield, Meredith Sibley

Abstract

Computerized adaptive testing (CAT), a form of computer-based testing that selects and administers items matched to the examinee’s trait level, can be shorter than traditional fixed-length paper-and-pencil testing while maintaining comparable or greater measurement precision. Administration of computer-based patient-reported outcome (PRO) measures has increased recently in the medical field. Because PRO measures often have small item pools, administer few items, and serve populations in poor health, CATs are especially advantageous. In CAT, Maximum Fisher information (MFI) is the most commonly used item selection procedure because it is easy to use and computationally simple. However, its main drawback is the attenuation paradox: if the estimated trait level of the examinee is not the true trait level, the items selected will not maximize information at the true trait level, and the measurement is less precise. To address this issue, alternative item selection methods have been proposed, but in prior studies these alternatives have not performed better than MFI. Recently, the Gradual Maximum Information Ratio (GMIR) item selection method was proposed, and previous findings suggest GMIR could be beneficial for a short CAT. This simulation study compared the GMIR and MFI item selection methods under conditions specific to the constraints of PRO measures. GMIR and MFI were compared under Andrich’s Rating Scale Model (ARSM) across two polytomous item pool sizes (41 and 82), two population latent trait distributions (normal and negatively skewed), and three stopping rules that combine a maximum number of items with a minimum standard error (5/0.54, 7/0.46, 9/0.40). The conditions were fully crossed. Performance was evaluated in terms of descriptive statistics of the final trait estimates, measurement precision, conditional measurement precision, and administration efficiency.
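To make the MFI procedure and the combined stopping rule concrete, the following is a minimal Python sketch, not the dissertation's implementation. It assumes the standard rating scale parameterization (one location per item, thresholds shared across items), uses the fact that item information in this model family equals the conditional variance of the item score, and assumes the combined rule stops at whichever limit is reached first. All function names, item locations, and thresholds are hypothetical.

```python
import math

def rsm_probs(theta, b, taus):
    """Category probabilities for one item under the rating scale model.
    theta: trait level; b: item location; taus: thresholds shared by all items."""
    exps = [0.0]  # category 0 has exponent 0
    for tau in taus:
        exps.append(exps[-1] + (theta - b - tau))
    top = max(exps)  # subtract the max for numerical stability
    weights = [math.exp(e - top) for e in exps]
    total = sum(weights)
    return [w / total for w in weights]

def rsm_info(theta, b, taus):
    """Item information at theta: the conditional variance of the item score."""
    p = rsm_probs(theta, b, taus)
    mean = sum(k * pk for k, pk in enumerate(p))
    return sum((k - mean) ** 2 * pk for k, pk in enumerate(p))

def select_mfi(theta_hat, locations, taus, administered):
    """Index of the unused item with maximum information at the current estimate."""
    candidates = [i for i in range(len(locations)) if i not in administered]
    return max(candidates, key=lambda i: rsm_info(theta_hat, locations[i], taus))

def standard_error(theta_hat, locations, taus, administered):
    """SE of the trait estimate from the test information of administered items."""
    test_info = sum(rsm_info(theta_hat, locations[i], taus) for i in administered)
    return 1.0 / math.sqrt(test_info)

def should_stop(n_items, se, max_items, min_se):
    """Assumed combined rule: stop at max_items or once SE falls to min_se."""
    return n_items >= max_items or se <= min_se

# Hypothetical three-item pool with four response categories (three thresholds).
locations = [-2.0, 0.1, 2.0]
taus = [-1.0, 0.0, 1.0]
first = select_mfi(0.0, locations, taus, administered=set())
```

Because the selection depends on the provisional estimate `theta_hat`, an inaccurate early estimate steers the choice toward items that are not the most informative at the true trait level, which is the drawback described above.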
Results showed that GMIR had better measurement precision when the test length was 5 items, with higher mean correlations between known and estimated trait levels, smaller mean bias, and smaller mean RMSE. No effect of item pool size or population latent trait distribution was found. Across item selection methods, measurement precision increased as test length increased, but with diminishing returns from 7 to 9 items.
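The precision measures named here (correlation between known and estimated trait levels, bias, and RMSE) can be computed in a simulation roughly as in the sketch below. It assumes bias is defined as the mean of estimate minus true value; the function names and the three data points are made up for illustration and are not results from the study.

```python
import math

def bias(true, est):
    """Mean signed error (assumed convention: estimate minus true value)."""
    return sum(e - t for t, e in zip(true, est)) / len(true)

def rmse(true, est):
    """Root mean squared error of the trait estimates."""
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true, est)) / len(true))

def pearson(x, y):
    """Pearson correlation between known and estimated trait levels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative values only: known trait levels and hypothetical CAT estimates.
true_theta = [-1.0, 0.0, 1.0]
est_theta = [-0.5, 0.0, 1.5]
summary = (bias(true_theta, est_theta),
           rmse(true_theta, est_theta),
           pearson(true_theta, est_theta))
```

In a full simulation these statistics would be computed within each replication and then averaged across replications for each condition.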