Domain score estimation in adaptive test assembly for single-subject multiple-domain content


Educational assessments require scores with good reliability. Computerized adaptive testing, which is based on item response theory, allows for constructing tests whose scores are more reliable than those of counterparts based on classical test theory. Test construction in computerized adaptive testing involves assembling a test from a large collection of items, subject to various test specifications and an optimality criterion. One example of an optimality criterion is maximum test information, which is closely associated with reliability.
Constructs assessed in educational settings often have multiple domains under a single content subject. This gives rise to two types of scores to be measured and reported: (1) overall scores and (2) domain scores. One approach to obtaining these, taken in a real-world adaptive testing program, is the separate-models approach. The separate-models approach uses a correlated-factors model as the main test assembly model to obtain domain scores. After a test is completed, a bifactor model is fitted separately to obtain overall scores. Alternatively, overall and domain scores may be obtained from a single model. The single-model approach uses a bifactor model as the main test assembly model, and weighted composites of general and specific factor scores are taken as overall and domain scores. The choice between the two approaches is essentially the choice between a correlated-factors model and a bifactor model for the main test assembly. This choice is important because it determines the structure of the main test assembly: how many dimensions the interim ability estimates have, and how many dimensions the item parameters must have. It also has implications for the recovery of ability parameters and for score reliability.
One advantage of the separate-models approach is that between-domain correlations can be used as priors to aid ability estimation. In practice, however, relying on these correlations can also introduce estimation bias. Because the correlation estimates are obtained at the calibration stage, before the adaptive testing system is deployed, estimation errors in between-domain correlations can propagate into subsequent steps and may harm the recovery of true ability in adaptive tests under the separate-models approach. The single-model approach can be less susceptible to this problem, because between-factor correlations can be assumed to be zero when a bifactor model is used as the main model. Its drawback is that it cannot benefit from between-domain correlations estimated at the calibration stage: a bifactor model, rather than a correlated-factors model, is used for calibration, and its between-factor correlations are assumed to be zero.
Test assembly for educational assessments also requires satisfying a test content blueprint. This is referred to as the content balancing problem in the test assembly literature. There are two main frameworks for content balancing: (1) heuristic approaches and (2) optimal test design approaches. Heuristic approaches are currently more widely adopted, but they do not ensure that all content requirements are satisfied. Optimal test design approaches ensure that all content requirements are strictly satisfied. For multidimensional tests, however, one problem makes optimal test design approaches harder to use than heuristic approaches: they require (1) a scalar-valued information quantity for each item, and (2) that the test-level quantity be the sum of the item-level quantities. Fisher information is matrix-valued, not scalar-valued, in multidimensional cases. One scalar-valued alternative is directional information, which meets both requirements.
The current study simulated adaptive tests to compare the separate-models and single-model approaches in terms of domain score recovery, and to compare content balancing methods in terms of satisfying content requirements. To allow for a neutral comparison between the two scoring approaches, simulation input was first generated from a higher-order model and then converted to correlated-factors and bifactor formats for use in the respective approaches. Calibration error was simulated, and the calibration sample size was varied. For content balancing, an optimal test design method was implemented using directional information and compared with three heuristic content balancing methods. Among the heuristic methods, a multidimensional extension of the weighted deviation method was not available in the literature, and hence one was developed in the current study.
In the simulation, the single-model approach recovered domain scores better than the separate-models approach when calibration error was present. Estimated between-domain correlations had a small negative calibration bias of -0.15 in the correlated-factors models used for the separate-models approach. These results suggest that estimation error in between-domain correlations may lead to less accurate domain scores when the separate-models approach is used. Recovery of overall scores was similar between the two scoring approaches.
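The two requirements that directional information must meet for optimal test design (a scalar value per item, and additivity from item level to test level) can be illustrated with a minimal sketch. It assumes a multidimensional two-parameter logistic model and the common definition of directional information as u'I(θ)u for a unit direction vector u; the abstract does not state which model or direction the study used, and the item parameters, ability point, and direction below are hypothetical values chosen only for illustration.

```python
import math

def m2pl_prob(a, d, theta):
    """Probability of a correct response under a multidimensional 2PL model."""
    z = sum(ak * tk for ak, tk in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))

def fisher_matrix(a, d, theta):
    """Item Fisher information matrix: P(1 - P) * a a' (matrix-valued)."""
    p = m2pl_prob(a, d, theta)
    w = p * (1.0 - p)
    return [[w * ai * aj for aj in a] for ai in a]

def directional_info(a, d, theta, u):
    """Scalar information in unit direction u: u' I(theta) u = P(1 - P)(a.u)^2."""
    p = m2pl_prob(a, d, theta)
    a_dot_u = sum(ak * uk for ak, uk in zip(a, u))
    return p * (1.0 - p) * a_dot_u ** 2

# Hypothetical two-dimensional items: (discriminations a, intercept d).
items = [((1.2, 0.3), -0.5), ((0.4, 1.1), 0.2), ((0.8, 0.8), 0.0)]
theta = (0.5, -0.25)                                   # assumed interim ability
u = (math.cos(math.pi / 4), math.sin(math.pi / 4))     # assumed direction

# Requirement 1: each item gets a single scalar value.
item_values = [directional_info(a, d, theta, u) for a, d in items]

# Requirement 2: the test-level value is the sum of the item-level values.
test_value_from_items = sum(item_values)

# Cross-check: sum the matrices first, then project onto u.
k = len(theta)
matrices = [fisher_matrix(a, d, theta) for a, d in items]
test_matrix = [[sum(m[r][c] for m in matrices) for c in range(k)] for r in range(k)]
test_value_from_matrix = sum(
    u[r] * test_matrix[r][c] * u[c] for r in range(k) for c in range(k)
)
```

Both routes give the same test-level value because the quadratic form u'I(θ)u is linear in I(θ), which is what lets a linear objective in item selection variables represent test information.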
For content balancing, the optimal test design approach satisfied all content requirements in every assembled test, and the weighted penalty method followed closely with near-perfect rates. The weighted deviation method had the lowest satisfaction rates on content requirements.
These results provide evidence that correlation estimation error can harm domain score estimation when the separate-models approach is used. The main finding of the current study is that the single-model approach may provide more accurate domain score estimates than the separate-models approach. A second finding is that the optimal test design approach to content balancing ensures that all content requirements are strictly satisfied in multidimensional contexts. Given that the weighted penalty method is currently among the most widely adopted content balancing methods, the optimal test design method may offer a viable alternative to it.
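The contrast between strict constraint satisfaction and purely information-driven selection can be sketched as follows. This is only an illustration under stated assumptions: optimal test design is typically formulated as a 0-1 integer program and solved with a mixed-integer solver, whereas the sketch below substitutes exhaustive search over a tiny pool so that it stays self-contained. The pool, domain names, blueprint bounds, and information values are all hypothetical, and the information values stand in for a precomputed scalar quantity such as directional information.

```python
from itertools import combinations

# Hypothetical pool: (item id, content domain, precomputed scalar information).
pool = [
    ("i1", "algebra", 0.42), ("i2", "algebra", 0.35), ("i3", "algebra", 0.30),
    ("i4", "geometry", 0.50), ("i5", "geometry", 0.20),
    ("i6", "statistics", 0.28), ("i7", "statistics", 0.25),
]
# Hypothetical blueprint: domain -> (minimum, maximum) number of items.
blueprint = {"algebra": (1, 2), "geometry": (1, 2), "statistics": (1, 1)}
test_length = 4

def total_info(items):
    """Test-level information as the sum of item-level scalar values."""
    return sum(info for _, _, info in items)

def satisfies_blueprint(items):
    """True if every domain count falls within its blueprint bounds."""
    counts = {}
    for _, domain, _ in items:
        counts[domain] = counts.get(domain, 0) + 1
    return all(lo <= counts.get(dom, 0) <= hi for dom, (lo, hi) in blueprint.items())

def assemble_optimal(pool, length):
    """Most informative item set among those meeting all content constraints."""
    feasible = [c for c in combinations(pool, length) if satisfies_blueprint(c)]
    if not feasible:
        return None  # the blueprint cannot be met with this pool
    return max(feasible, key=total_info)

# Constrained selection: content requirements are met by construction.
selected = assemble_optimal(pool, test_length)
selected_ids = sorted(item_id for item_id, _, _ in selected)

# Unconstrained selection: the most informative items regardless of content.
greedy = sorted(pool, key=lambda item: item[2], reverse=True)[:test_length]
greedy_ids = sorted(item_id for item_id, _, _ in greedy)
```

In this toy pool the unconstrained selection takes three algebra items and no statistics item, so it violates the blueprint, while the constrained selection gives up a small amount of information to satisfy every requirement.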

