Assessment Research Center (MARC)

Automated Scoring

  • Automated scoring of reading constructed-response items (Exploration of the Stacking Ensemble Learning Algorithm for Automated Scoring of Constructed-Response Items in Reading Assessment)
    Abstract: This chapter explores the stacking ensemble learning algorithm for automated scoring
    of constructed-response items in reading assessment. To develop a meta-model using stacking, multiple supervised machine learning algorithms were used as base models, including multinomial logistic regression, K-nearest neighbors, decision tree, support vector machine, Gaussian Naïve Bayes, linear discriminant analysis, a multilayer perceptron (MLPC) neural network, random forest, and gradient boosting. Item-specific models were developed for several constructed-response reading items using hand-engineered
    features. The study revealed that, using only responses with perfectly matched scores and adding similarity measures to the reference responses for the different score categories alongside other linguistic features, the best-performing meta-model reached a quadratic weighted kappa of 0.88 or higher for some items with small sample sizes of about 500. This supports the use of automated scoring in small testing programs for high-stakes decisions and in classroom assessment. Further, the study showed that the extracted features are critical and that features bearing strong validity evidence help enhance the accuracy of the automated scorers. A sketch of the stacking setup appears below.
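
    A minimal sketch of the stacking setup described above, using scikit-learn's StackingClassifier with the nine base learners named in the abstract and a logistic-regression meta-model. The feature matrix and score labels below are placeholders for the hand-engineered linguistic and similarity features and human scores; the hyperparameters and the meta-model actually used in the study are not reproduced here.

    # Sketch of a stacking ensemble for item-specific automated scoring.
    # X stands in for hand-engineered features (e.g., linguistic and
    # similarity-to-reference measures); y holds human score categories.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))      # placeholder feature matrix (~500 responses)
    y = rng.integers(0, 4, size=500)    # placeholder score categories 0-3

    base_models = [
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier()),
        ("svm", SVC(probability=True)),
        ("gnb", GaussianNB()),
        ("lda", LinearDiscriminantAnalysis()),
        ("mlp", MLPClassifier(max_iter=2000)),
        ("rf", RandomForestClassifier()),
        ("gb", GradientBoostingClassifier()),
    ]

    # The meta-model (final estimator) is trained on the base models'
    # cross-validated predictions, controlled by the cv argument.
    stack = StackingClassifier(estimators=base_models,
                               final_estimator=LogisticRegression(max_iter=1000),
                               cv=5)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    stack.fit(X_tr, y_tr)
    pred = stack.predict(X_te)
    print("QWK:", cohen_kappa_score(y_te, pred, weights="quadratic"))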
     
  • Automated scoring of math constructed-response items
     
  • Automated scoring of essays
    -Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory (Paper to appear at 2025 NCME)
    Abstract: Recently, large language models (LLMs) have gained traction in automated scoring due to
    their extensive knowledge bases and contextual adaptability (Latif & Zhai, 2024). However,
    despite decades of research on automated essay grading and short answer scoring, accurately
    evaluating essays remains a challenge. Key parameters such as content relevance, idea
    development, cohesion, and coherence are difficult to assess consistently across all tasks (Ramesh
    & Sanampudi, 2022). This study examines score differences and consistency between human
    raters and AI raters (i.e., ChatGPT), guided by the following research question: How reliable are
    scores produced by LLMs in scoring large-scale writing assessments compared to human raters?
        To address this question, the study employs generalizability theory (GT) and conducts a
    series of multivariate GT analyses using data from AP Chinese writing tasks. The data consist of
    ratings for 30 students on two types of writing tasks, evaluated by two human raters and three
    versions of ChatGPT (3.5, 4.0, & 4o). Various D-studies are used to compare results across human
    vs. AI raters, different writing tasks, different versions of ChatGPT, and combinations of human
    and AI raters. Preliminary findings indicate that AI raters produce less reliable scores than human
    raters, with automated scores for email response tasks being less reliable than those for story
    narration tasks.
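
    A minimal sketch of the G-study/D-study logic behind the analyses above, reduced to a single-facet persons-by-raters crossed design with placeholder ratings. The study itself uses multivariate GT with tasks and human/AI rater types as additional facets; this sketch only illustrates how variance components feed the D-study projections.

    # One-facet (persons x raters) G-study and D-study on placeholder data.
    import numpy as np

    def g_study(scores):
        """scores: array of shape (n_persons, n_raters), fully crossed design."""
        n_p, n_r = scores.shape
        grand = scores.mean()
        person_means = scores.mean(axis=1)
        rater_means = scores.mean(axis=0)

        ms_p = n_r * ((person_means - grand) ** 2).sum() / (n_p - 1)
        ms_r = n_p * ((rater_means - grand) ** 2).sum() / (n_r - 1)
        resid = scores - person_means[:, None] - rater_means[None, :] + grand
        ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

        var_pr_e = ms_res                        # person x rater interaction + error
        var_p = max((ms_p - ms_res) / n_r, 0.0)  # person (universe score) variance
        var_r = max((ms_r - ms_res) / n_p, 0.0)  # rater (severity) variance
        return var_p, var_r, var_pr_e

    def d_study(var_p, var_r, var_pr_e, n_raters):
        """Projected reliability when averaging over n_raters raters."""
        g_coef = var_p / (var_p + var_pr_e / n_raters)         # relative decisions
        phi = var_p / (var_p + (var_r + var_pr_e) / n_raters)  # absolute decisions
        return g_coef, phi

    # Placeholder ratings: 30 examinees scored by 3 raters on a 0-6 scale.
    rng = np.random.default_rng(1)
    truth = rng.integers(1, 6, size=30).astype(float)
    scores = np.clip(truth[:, None] + rng.normal(0, 0.8, size=(30, 3)), 0, 6)

    var_p, var_r, var_pr_e = g_study(scores)
    for n in (1, 2, 3, 5):
        print(n, d_study(var_p, var_r, var_pr_e, n))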
  • Automated scoring of essays
    -Estimating Human and AI Rater Effects Using the Many-Facet Rasch Model (Paper to appear at 2025 NCME)
    Abstract: Using LLMs for automated scoring of essays in writing assessments has become the state
    of the art. In low-stakes assessment programs, generative AI models or chatbots have been explored for this purpose. Three commonly used models are ChatGPT by OpenAI, Claude by Anthropic, and Google Gemini. Each model has its own strengths and features that may cater to different needs and applications. Like human raters, AI raters may exhibit different degrees of rater effects. Thus, it is worthwhile to compare the ratings from each scoring engine and to understand potential AI rater effects.
        This study employs a Many-Facet Rasch measurement model (Linacre, 1994, 2024; Wright & Masters, 1982; Wright & Stone, 1979) to investigate and compare rater effects between AI and human raters in a large-scale writing assessment. A dataset of 30 student essays on two types of writing tasks, each scored by two human raters and five AI systems (i.e., ChatGPT 3.5, ChatGPT 4.0, ChatGPT 4o, Claude, and Gemini), will be analyzed using a three-facet Rasch model (student ability, task difficulty, and rater severity). Results will reveal the differences in severity and internal consistency between human and AI raters. Rater internal consistency will be evaluated in terms of infit mean square, while rater severity will be evaluated in terms of rater effect estimates. The interaction between essay difficulty and rater effects will be evaluated using a construct map. The findings of this study will provide insights into the strengths and limitations of AI raters compared to human raters.
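
    A minimal sketch of the three-facet rating-scale formulation underlying the analysis above, with hypothetical parameter values; in practice the ability, difficulty, severity, and threshold parameters are estimated from the rating data (e.g., with the Facets program) rather than fixed as below.

    # Three-facet Rasch (rating scale) model: student ability, task
    # difficulty, and rater severity share a common set of category
    # thresholds. Parameter values here are hypothetical.
    import numpy as np

    def category_probs(theta, task_difficulty, rater_severity, thresholds):
        """P(score = k) for k = 0..K under the rating-scale MFRM.

        thresholds: K step thresholds tau_1..tau_K in logits.
        """
        # Each step logit is theta - difficulty - severity - tau_j; the
        # k = 0 term of the cumulative sum is defined as 0.
        steps = theta - task_difficulty - rater_severity - np.asarray(thresholds)
        numerators = np.exp(np.concatenate(([0.0], np.cumsum(steps))))
        return numerators / numerators.sum()

    def expected_score(theta, task_difficulty, rater_severity, thresholds):
        probs = category_probs(theta, task_difficulty, rater_severity, thresholds)
        return np.dot(np.arange(len(probs)), probs)

    # Hypothetical 0-6 scale (6 thresholds): compare a lenient rater
    # (negative severity) with a severe one for the same student and task.
    thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
    for severity in (-0.5, 0.8):
        print(severity, expected_score(theta=0.6, task_difficulty=0.2,
                                       rater_severity=severity, thresholds=thresholds))

    Holding ability and task difficulty fixed, the difference in expected scores across the two severity values shows how rater severity translates into raw-score differences; infit statistics would additionally require the observed ratings.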
     
  • Linking Writing Processes to Writing Quality
    Abstract: Process data may bring added value to assessment. This study explores the linkage between writing process data and the writing quality of an argumentative writing task. Four SAT writing prompts were used to collect keystroke data and writing scores for 5000 participants randomly assigned to the prompts; essays were scored on a scale of 0 to 6. The instructions required a minimum essay length of 200 words in three paragraphs. Keystroke information was collected during the writing task by a keystroke logging program written in vanilla JavaScript and embedded in the website's script. The program listened to keystroke and mouse events and logged the time stamp and cursor position for each operation. The identified operations included input, delete, paste, replace, and other text changes. The dataset included only information about text-change events; the exact text entered was not available. Keystroke measures extracted from the log data included production rate, essay length, average word and sentence length, total response time, pauses, warm-up time, revision types and frequencies, and bursts. Different base models were compared, including LGBMRegressor, CatBoostRegressor, SVR, and XGBRegressor, and a VotingRegressor was used to ensemble the model outputs. Both the root mean squared error (RMSE) of the predicted scores and the quadratic weighted kappa (QWK) were computed to quantify prediction accuracy. The best-performing model yielded RMSE = 0.5861 and QWK = 0.75. Feature importance was also examined. A sketch of the ensemble step appears below.
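
    A minimal sketch of the model-comparison step described above: the four named regressors are ensembled with scikit-learn's VotingRegressor and evaluated with RMSE and QWK (predictions rounded and clipped to the 0-6 scale for the kappa). The keystroke feature matrix and scores are placeholders, the lightgbm, catboost, and xgboost packages are assumed to be installed, and the study's actual features and hyperparameters are not shown.

    # Ensemble of regressors predicting essay scores from keystroke features.
    # X stands in for the extracted keystroke measures (production rate,
    # pauses, bursts, revisions, etc.).
    import numpy as np
    from catboost import CatBoostRegressor
    from lightgbm import LGBMRegressor
    from sklearn.ensemble import VotingRegressor
    from sklearn.metrics import cohen_kappa_score, mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 30))                     # placeholder keystroke features
    y = np.clip(np.round(3 + X[:, 0] + rng.normal(0, 1, 5000)), 0, 6)  # placeholder 0-6 scores

    ensemble = VotingRegressor([
        ("lgbm", LGBMRegressor()),
        ("cat", CatBoostRegressor(verbose=0)),
        ("svr", SVR()),
        ("xgb", XGBRegressor()),
    ])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    ensemble.fit(X_tr, y_tr)
    pred = ensemble.predict(X_te)

    rmse = mean_squared_error(y_te, pred) ** 0.5
    # QWK requires discrete categories: round and clip predictions to 0-6.
    qwk = cohen_kappa_score(y_te.astype(int),
                            np.clip(np.round(pred), 0, 6).astype(int),
                            weights="quadratic")
    print(f"RMSE = {rmse:.4f}, QWK = {qwk:.3f}")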