Automated assessment of student scientific reasoning presents a compelling use case for NLP in education, yet the distribution of responses across rubric categories is rarely uniform. In this investigation, researchers tackled the challenge of scoring 1,466 high school responses against an 11-category NGSS-aligned rubric in which advanced reasoning categories are severely underrepresented. The fundamental problem: standard fine-tuning of SciBERT on imbalanced data prioritizes majority classes at the expense of rare but pedagogically critical categories, degrading recall for the categories that capture sophisticated conceptual understanding.
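One common mitigation for this imbalance during fine-tuning is a class-weighted loss, where rare rubric categories receive inverse-frequency weights. The sketch below is illustrative only: the label distribution is hypothetical (sized to 1,466 responses for flavor, not the study's actual counts), and the weighting scheme is a standard technique, not necessarily the authors' setup.

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Inverse-frequency class weights, normalized to mean 1.0.

    Rare rubric categories get proportionally larger weights, so a
    weighted cross-entropy loss does not ignore them during fine-tuning.
    """
    counts = Counter(labels)
    # Count floor of 1 guards against division by zero for empty classes.
    raw = [len(labels) / (num_classes * max(counts.get(c, 0), 1))
           for c in range(num_classes)]
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]

# Hypothetical imbalanced distribution over an 11-category rubric:
# category 0 dominates, the advanced-reasoning categories are rare.
labels = ([0] * 800 + [1] * 300 + [2] * 200 + [3] * 80 + [4] * 40
          + [5] * 10 + [6] * 8 + [7] * 6 + [8] * 15 + [9] * 5 + [10] * 2)
weights = inverse_frequency_weights(labels, num_classes=11)
```

Such a weight vector would typically be passed to a weighted cross-entropy loss during SciBERT fine-tuning, so gradients from rare-category examples are amplified rather than drowned out.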
The authors evaluated three augmentation strategies beyond conventional oversampling (SMOTE). GPT-4 synthetic generation produced coherent student responses that improved both precision and recall across imbalanced categories. EASE (word-level extraction and filtering) demonstrated broad effectiveness, substantially improving alignment with human scoring across all 11 categories by preserving authentic linguistic patterns from existing responses. Most notably, ALP (Augmentation using Lexicalized Probabilistic Context-Free Grammars) achieved perfect F1 scores on the most severely imbalanced categories (5, 6, 7, and 9) through phrase-level composition, suggesting that grammatically constrained generation better captures domain-specific scientific discourse than unconstrained synthesis.
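Per-category F1 against human scores is the comparison metric here. A dependency-free sketch of how such per-class scores are computed (the toy labels below are illustrative, not the study's data):

```python
def per_class_f1(y_true, y_pred, num_classes):
    """Per-class F1 from raw label lists, one-vs-rest per category."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return scores

# Toy example: class 2 is rare and partly missed by the model.
y_true = [0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0]
f1 = per_class_f1(y_true, y_pred, num_classes=3)
```

A "perfect F1" on a rare category, as reported for ALP, means both precision and recall for that one-vs-rest comparison reach 1.0.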
A critical methodological insight emerged: while GPT-4 augmentation boosted metrics, EASE and ALP better preserved novice-level responses essential for learning progression modeling. This distinction matters: overfitting to synthetic advanced reasoning could obscure intermediate developmental stages. The comparative analysis reveals a trade-off between metric optimization and conceptual fidelity that traditional imbalance-handling approaches such as SMOTE fail to address. By maintaining authentic novice exemplars alongside augmented advanced examples, the framework achieves both statistical balance and pedagogical validity, offering a replicable solution for automated assessment systems in science education.
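The balancing principle described above, keep every authentic response and top up only the deficient categories with synthetic ones, can be sketched as follows. The function and mini-corpus are hypothetical illustrations, not the authors' implementation:

```python
def balance_with_augmentation(authentic, synthetic_pool, target_per_class):
    """Top up underrepresented categories with synthetic examples while
    keeping every authentic response, so novice-level exemplars are
    never displaced by generated text."""
    combined = {c: list(examples) for c, examples in authentic.items()}
    for c, examples in combined.items():
        deficit = target_per_class - len(examples)
        if deficit > 0:
            examples.extend(synthetic_pool.get(c, [])[:deficit])
    return combined

# Hypothetical mini-corpus: the advanced category is underrepresented.
authentic = {"novice": ["a1", "a2", "a3"], "advanced": ["b1"]}
synthetic_pool = {"advanced": ["s1", "s2", "s3"]}
balanced = balance_with_augmentation(authentic, synthetic_pool,
                                     target_per_class=3)
```

Because authentic examples are copied in full before any synthetic text is added, the novice exemplars that anchor learning-progression modeling survive the balancing step untouched.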