Assessment Research Center (MARC)

Automated Scoring

  • Automated scoring of reading constructed-response items (Exploration of the Stacking Ensemble Learning Algorithm for Automated Scoring of Constructed-Response Items in Reading Assessment)
    Abstract: This chapter explores the stacking ensemble learning algorithm for automated scoring
    of constructed-response items in reading assessment. To develop a meta-model using stacking, multiple supervised machine learning algorithms were used as base models, including multinomial logistic regression, K-nearest neighbors, decision tree, support vector machine, Gaussian Naïve Bayes, linear discriminant analysis, a multilayer perceptron (MLPC) neural network, random forest, and gradient boosting. Item-specific models were developed for several constructed-response reading items using hand-engineered
    features. The study revealed that, using only responses with perfectly matched scores and adding similarity measures to the reference responses for the different score categories alongside other linguistic features, the best-performing meta-model reached a quadratic weighted kappa of 0.88 or higher for some items with small sample sizes of about 500. This supports the use of automated scoring in small testing programs for high-stakes decisions and in classroom assessment. Further, the study showed that the extracted features are critical and that features bearing strong validity evidence help enhance the accuracy of the automated scorers. A sketch of the stacking setup appears below.
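
    A minimal sketch of the stacking setup described above, using scikit-learn's StackingClassifier with the nine base learners named in the abstract and a logistic-regression meta-model. The feature matrix and score labels below are placeholders for the hand-engineered linguistic and similarity features and human scores; the hyperparameters and the meta-model actually used in the study are not reproduced here.

    # Sketch of a stacking ensemble for item-specific automated scoring.
    # X stands in for hand-engineered features (e.g., linguistic and
    # similarity-to-reference measures); y holds human score categories.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))      # placeholder feature matrix (~500 responses)
    y = rng.integers(0, 4, size=500)    # placeholder score categories 0-3

    base_models = [
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier()),
        ("svm", SVC(probability=True)),
        ("gnb", GaussianNB()),
        ("lda", LinearDiscriminantAnalysis()),
        ("mlp", MLPClassifier(max_iter=2000)),
        ("rf", RandomForestClassifier()),
        ("gb", GradientBoostingClassifier()),
    ]

    # The meta-model (final estimator) is trained on the base models'
    # cross-validated predictions, controlled by the cv argument.
    stack = StackingClassifier(estimators=base_models,
                               final_estimator=LogisticRegression(max_iter=1000),
                               cv=5)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    stack.fit(X_tr, y_tr)
    pred = stack.predict(X_te)
    print("QWK:", cohen_kappa_score(y_te, pred, weights="quadratic"))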
     
  • Automated scoring of math constructed-response items
     
  • Automated scoring of essays
    -Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory (Paper to appear at 2025 NCME)
    Abstract: Recently, large language models (LLMs) have gained traction in automated scoring due to
    their extensive knowledge bases and contextual adaptability (Latif & Zhai, 2024). However,
    despite decades of research on automated essay grading and short answer scoring, accurately
    evaluating essays remains a challenge. Key parameters such as content relevance, idea
    development, cohesion, and coherence are difficult to assess consistently across all tasks (Ramesh
    & Sanampudi, 2022). This study examines score differences and consistency between human
    raters and AI raters (i.e., ChatGPT), guided by the following research question: How reliable are
    scores produced by LLMs in scoring large-scale writing assessments compared to human raters?
        To address this question, the study employs generalizability theory (GT) and conducts a
    series of multivariate GT analyses using data from AP Chinese writing tasks. The data consist of
    ratings for 30 students on two types of writing tasks, evaluated by two human raters and three
    versions of ChatGPT (3.5, 4.0, & 4o). Various D-studies are used to compare results across human
    vs. AI raters, different writing tasks, different versions of ChatGPT, and combinations of human
    and AI raters. Preliminary findings indicate that AI raters produce less reliable scores than human
    raters, with automated scores for email response tasks being less reliable than those for story
    narration tasks.
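
    A minimal sketch of the G-study/D-study logic behind the analyses above, reduced to a single-facet persons-by-raters crossed design with placeholder ratings. The study itself uses multivariate GT with tasks and human/AI rater types as additional facets; this sketch only illustrates how variance components feed the D-study projections.

    # One-facet (persons x raters) G-study and D-study on placeholder data.
    import numpy as np

    def g_study(scores):
        """scores: array of shape (n_persons, n_raters), fully crossed design."""
        n_p, n_r = scores.shape
        grand = scores.mean()
        person_means = scores.mean(axis=1)
        rater_means = scores.mean(axis=0)

        ms_p = n_r * ((person_means - grand) ** 2).sum() / (n_p - 1)
        ms_r = n_p * ((rater_means - grand) ** 2).sum() / (n_r - 1)
        resid = scores - person_means[:, None] - rater_means[None, :] + grand
        ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

        var_pr_e = ms_res                        # person x rater interaction + error
        var_p = max((ms_p - ms_res) / n_r, 0.0)  # person (universe score) variance
        var_r = max((ms_r - ms_res) / n_p, 0.0)  # rater (severity) variance
        return var_p, var_r, var_pr_e

    def d_study(var_p, var_r, var_pr_e, n_raters):
        """Projected reliability when averaging over n_raters raters."""
        g_coef = var_p / (var_p + var_pr_e / n_raters)         # relative decisions
        phi = var_p / (var_p + (var_r + var_pr_e) / n_raters)  # absolute decisions
        return g_coef, phi

    # Placeholder ratings: 30 examinees scored by 3 raters on a 0-6 scale.
    rng = np.random.default_rng(1)
    truth = rng.integers(1, 6, size=30).astype(float)
    scores = np.clip(truth[:, None] + rng.normal(0, 0.8, size=(30, 3)), 0, 6)

    var_p, var_r, var_pr_e = g_study(scores)
    for n in (1, 2, 3, 5):
        print(n, d_study(var_p, var_r, var_pr_e, n))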
  • Automated scoring of essays
    -Estimating Human and AI Rater Effects Using the Many-Facet Rasch Model (Paper to appear at 2025 NCME)
    Abstract: Using LLMs for automated scoring of essays in writing assessments has become the state
    of the art. In low-stakes assessment programs, generative AI models or chatbots have been explored for this purpose. Three commonly used models are ChatGPT by OpenAI, Claude by Anthropic, and Google Gemini. Each model has its own strengths and features that may cater to different needs and applications. Like human raters, AI raters may exhibit different degrees of rater effects. Thus, it is worthwhile to compare the ratings from each scoring engine and to understand potential AI rater effects.
        This study employs a Many-Facet Rasch measurement model (Linacre, 1994, 2024; Wright & Masters, 1982; Wright & Stone, 1979) to investigate and compare rater effects between AI and human raters in a large-scale writing assessment. A dataset of 30 student essays on two types of writing tasks, each scored by two human raters and five AI systems (i.e., ChatGPT 3.5, ChatGPT 4.0, ChatGPT 4o, Claude, and Gemini), will be analyzed using a three-facet Rasch model (student ability, task difficulty, and rater severity). Results will reveal the differences in severity and internal consistency between human and AI raters. Rater internal consistency will be evaluated in terms of infit mean square, while rater severity will be evaluated in terms of rater effect estimates. The interaction between essay difficulty and rater effects will be evaluated using a construct map. The findings of this study will provide insights into the strengths and limitations of AI raters compared to human raters.
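
    A minimal sketch of the three-facet rating-scale formulation underlying the analysis above, with hypothetical parameter values; in practice the ability, difficulty, severity, and threshold parameters are estimated from the rating data (e.g., with the Facets program) rather than fixed as below.

    # Three-facet Rasch (rating scale) model: student ability, task
    # difficulty, and rater severity share a common set of category
    # thresholds. Parameter values here are hypothetical.
    import numpy as np

    def category_probs(theta, task_difficulty, rater_severity, thresholds):
        """P(score = k) for k = 0..K under the rating-scale MFRM.

        thresholds: K step thresholds tau_1..tau_K in logits.
        """
        # Each step logit is theta - difficulty - severity - tau_j; the
        # k = 0 term of the cumulative sum is defined as 0.
        steps = theta - task_difficulty - rater_severity - np.asarray(thresholds)
        numerators = np.exp(np.concatenate(([0.0], np.cumsum(steps))))
        return numerators / numerators.sum()

    def expected_score(theta, task_difficulty, rater_severity, thresholds):
        probs = category_probs(theta, task_difficulty, rater_severity, thresholds)
        return np.dot(np.arange(len(probs)), probs)

    # Hypothetical 0-6 scale (6 thresholds): compare a lenient rater
    # (negative severity) with a severe one for the same student and task.
    thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
    for severity in (-0.5, 0.8):
        print(severity, expected_score(theta=0.6, task_difficulty=0.2,
                                       rater_severity=severity, thresholds=thresholds))

    Holding ability and task difficulty fixed, the difference in expected scores across the two severity values shows how rater severity translates into raw-score differences; infit statistics would additionally require the observed ratings.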
     
  • Linking Writing Processes to Writing Quality
    Abstract: Process data may bring added value to assessment. This study explores the linkage between writing process data and the writing quality of an argumentative writing task. Four SAT writing prompts were used to collect keystroke data and writing scores for 5000 participants randomly assigned to the prompts; essays were scored on a scale of 0 to 6. The instructions required a minimum essay length of 200 words in three paragraphs. Keystroke information was collected during the writing task by a keystroke logging program written in vanilla JavaScript and embedded in the website's script. The program listened to keystroke and mouse events and logged the time stamp and cursor position for each operation. The identified operations included input, delete, paste, replace, and other text changes. The dataset included only information about text-change events; the exact text entered was not available. Keystroke measures extracted from the log data included production rate, essay length, average word and sentence length, total response time, pauses, warm-up time, revision types and frequencies, and bursts. Different base models were compared, including LGBMRegressor, CatBoostRegressor, SVR, and XGBRegressor, and a VotingRegressor was used to ensemble the model outputs. Both the root mean squared error (RMSE) of the predicted scores and the quadratic weighted kappa (QWK) were computed to quantify prediction accuracy. The best-performing model yielded RMSE = 0.5861 and QWK = 0.75. Feature importance was also examined. A sketch of the ensemble step appears below.
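
    A minimal sketch of the model-comparison step described above: the four named regressors are ensembled with scikit-learn's VotingRegressor and evaluated with RMSE and QWK (predictions rounded and clipped to the 0-6 scale for the kappa). The keystroke feature matrix and scores are placeholders, the lightgbm, catboost, and xgboost packages are assumed to be installed, and the study's actual features and hyperparameters are not shown.

    # Ensemble of regressors predicting essay scores from keystroke features.
    # X stands in for the extracted keystroke measures (production rate,
    # pauses, bursts, revisions, etc.).
    import numpy as np
    from catboost import CatBoostRegressor
    from lightgbm import LGBMRegressor
    from sklearn.ensemble import VotingRegressor
    from sklearn.metrics import cohen_kappa_score, mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 30))                     # placeholder keystroke features
    y = np.clip(np.round(3 + X[:, 0] + rng.normal(0, 1, 5000)), 0, 6)  # placeholder 0-6 scores

    ensemble = VotingRegressor([
        ("lgbm", LGBMRegressor()),
        ("cat", CatBoostRegressor(verbose=0)),
        ("svr", SVR()),
        ("xgb", XGBRegressor()),
    ])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    ensemble.fit(X_tr, y_tr)
    pred = ensemble.predict(X_te)

    rmse = mean_squared_error(y_te, pred) ** 0.5
    # QWK requires discrete categories: round and clip predictions to 0-6.
    qwk = cohen_kappa_score(y_te.astype(int),
                            np.clip(np.round(pred), 0, 6).astype(int),
                            weights="quadratic")
    print(f"RMSE = {rmse:.4f}, QWK = {qwk:.3f}")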