Automated Difficulty Prediction of Large-Scale Assessments
- Item Difficulty Modeling Using Fine-Tuned Small and Large Language Models
Abstract: This study investigates methods for item difficulty modeling in large-scale assessments using both small and large language models. We introduce novel data augmentation strategies, including on-the-fly augmentation and distribution balancing, that surpass benchmark results, demonstrating their effectiveness in mitigating data imbalance and improving model performance. Our results show that fine-tuning small language models such as BERT and RoBERTa yields competitive results, while domain-specific models like BioClinicalBERT and PubMedBERT do not provide significant improvements, likely because of distributional gaps between their pretraining corpora and the assessment items. Majority voting among small language models enhances robustness, reinforcing the benefits of ensemble learning.
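As a rough illustration of the fine-tuning and ensembling setup described above, the sketch below fine-tunes a BERT-style encoder for difficulty regression and combines several models' predictions by majority vote over coarse difficulty bands. The toy items, difficulty scores, hyperparameters, and three-band voting scheme are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical items with difficulty scores scaled to [0, 1].
items = {
    "text": ["Solve 2x + 3 = 7.",
             "Prove the Cauchy-Schwarz inequality.",
             "Name the capital of France.",
             "Derive the quadratic formula."],
    "labels": [0.2, 0.9, 0.1, 0.6],
}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

# Tokenize once; fixed-length padding keeps the default collator simple.
dataset = Dataset.from_dict(items).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="difficulty-bert",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
)
trainer.train()

def majority_vote(per_model_preds, band_edges=(0.33, 0.66)):
    """Combine continuous predictions from several fine-tuned models by
    voting over coarse difficulty bands (easy / medium / hard)."""
    banded = np.stack([np.digitize(p, band_edges) for p in per_model_preds])
    return np.array([np.bincount(col).argmax() for col in banded.T])
```

Casting difficulty as a single regression target (num_labels=1) lets the same pipeline swap in RoBERTa or a domain-specific encoder by changing only the checkpoint name.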
Large language models (LLMs), such as GPT-4, exhibit strong generalization capabilities but struggle with item difficulty prediction, likely due to limited training data and the absence of explicit difficulty-related context. Chain-of-thought prompting and rationale generation approaches were explored but did not yield substantial improvements, suggesting that additional training data or more sophisticated reasoning techniques may be necessary. Embedding-based methods, particularly using NV-Embed-v2, showed promise but did not outperform our best augmentation strategies, indicating that capturing nuanced difficulty-related features remains a challenge.
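The embedding-based route can be pictured as encoding each item once with a frozen text encoder and regressing difficulty on the resulting vectors. In this sketch, a small sentence-transformers model stands in for NV-Embed-v2 (which requires additional loading steps), and the items, scores, and ridge regressor are illustrative assumptions rather than the study's pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical items and difficulty scores.
texts = ["Solve 2x + 3 = 7.",
         "Prove the Cauchy-Schwarz inequality.",
         "Name the capital of France.",
         "Derive the quadratic formula."]
difficulty = np.array([0.2, 0.9, 0.1, 0.6])

# Stand-in encoder; the study's NV-Embed-v2 would replace this checkpoint.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)                      # shape: (n_items, embedding_dim)

# Regress difficulty on the item embeddings and estimate error by cross-validation.
scores = cross_val_score(Ridge(alpha=1.0), X, difficulty, cv=2,
                         scoring="neg_root_mean_squared_error")
print("cross-validated RMSE:", -scores.mean())
```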
Review of Data-Driven Approaches for Item Difficulty Prediction in High-Stakes Assessment Settings (ongoing)
Abstract: Accurate modeling of item difficulty is essential in high-stakes assessment settings, where test outcomes can have significant consequences. Recent research has investigated data-driven approaches (e.g., machine learning, language models) to address the limitations of traditional expert-based methods, which are often subjective, costly, and time-consuming. This systematic review synthesized findings from 23 studies on automated item difficulty prediction for assessments in language proficiency, mathematics, science, social studies, and medicine. For each study, we delineated the dataset/assessment, item type, input information, prediction models, features, and evaluation criteria. Results indicated that data-driven methods effectively leverage linguistic, semantic, and psycholinguistic features to evaluate item difficulty with greater efficiency and objectivity. We concluded by discussing implications for practice and outlining future research directions for automated item difficulty modeling.
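To make the reviewed feature-based pipelines concrete, the sketch below derives a few surface-level linguistic proxies and fits a regressor on them; the studies surveyed typically use far richer linguistic, semantic, and psycholinguistic features (e.g., word-frequency or readability measures). The items, scores, and feature set here are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def surface_features(text):
    """A few surface-level proxies; reviewed studies add semantic and
    psycholinguistic features such as frequency or concreteness norms."""
    words = text.split()
    return [
        len(words),                                        # item length
        float(np.mean([len(w) for w in words])),           # mean word length
        len({w.lower() for w in words}) / len(words),      # type-token ratio
    ]

# Hypothetical items and difficulty scores.
texts = ["Solve 2x + 3 = 7.",
         "Prove the Cauchy-Schwarz inequality.",
         "Name the capital of France.",
         "Derive the quadratic formula."]
difficulty = np.array([0.2, 0.9, 0.1, 0.6])

X = np.array([surface_features(t) for t in texts])
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, difficulty)
print(model.predict(X))                                    # in-sample check
```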