CHiLL Grader

Calibrated Human-in-the-Loop Short-Answer Grading

A fine-tuned language model grades student responses and emits a temperature-scaled confidence score. High-confidence predictions are auto-graded; low-confidence ones are flagged for human review. Attribution highlights the answer tokens that most influenced the grade.
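The confidence-and-routing step described above can be sketched as follows. This is a minimal illustration, not the demo's actual implementation: it assumes the model outputs one logit per grade class, that confidence is the largest temperature-scaled softmax probability, and that a prediction is auto-graded when confidence is at least τ. The function names and the `"auto"` / `"human_review"` labels are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 softens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def grade_with_routing(logits, temperature, tau=0.60):
    """Return (predicted grade index, confidence, route).

    Assumption: confidence is the top class probability, and predictions
    with confidence >= tau are accepted automatically.
    """
    probs = softmax_with_temperature(logits, temperature)
    confidence = max(probs)
    grade = probs.index(confidence)
    route = "auto" if confidence >= tau else "human_review"
    return grade, confidence, route


if __name__ == "__main__":
    # Hypothetical logits for grades 0, 1, 2 and a hypothetical fitted temperature.
    print(grade_with_routing([2.0, 0.0, 0.0], temperature=1.0))
    print(grade_with_routing([2.0, 0.0, 0.0], temperature=4.0))
```

In this sketch a larger temperature lowers the top probability, so the same logits can move from the auto-graded route to human review. In a calibrated system the temperature would typically be fitted on held-out data rather than chosen by hand.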

Question

Student Response

Maximum Grade

Acceptance Threshold (τ): 0.60

Range: 0.30 to 0.99

Predicted Grade

Confidence

Token Attribution (Gradient × Input): tokens most influential to this grade

Attribution scale: low to high
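To illustrate how a Gradient × Input score might be computed per token, here is a small self-contained sketch. It is not the demo's code: the scoring function `toy_score`, its `WEIGHTS`, and the example embeddings are hypothetical stand-ins for the fine-tuned model, and the gradient is approximated with central finite differences so the example needs no autograd library. A real system would more likely obtain gradients from its deep learning framework.

```python
import math

# Hypothetical weights of a toy scoring model (stand-in for the grader).
WEIGHTS = [1.0, -0.5]


def toy_score(embeddings):
    """Toy differentiable score: tanh of the summed weighted embeddings."""
    return math.tanh(
        sum(w * x for vec in embeddings for w, x in zip(WEIGHTS, vec))
    )


def gradient_x_input(score_fn, embeddings, eps=1e-5):
    """Per-token attribution: sum over dimensions of d(score)/d(x) * x.

    The gradient is estimated numerically (central differences), which is
    an assumption made to keep this sketch dependency-free.
    """
    attributions = []
    for i, vec in enumerate(embeddings):
        total = 0.0
        for d, value in enumerate(vec):
            plus = [list(v) for v in embeddings]
            minus = [list(v) for v in embeddings]
            plus[i][d] = value + eps
            minus[i][d] = value - eps
            grad = (score_fn(plus) - score_fn(minus)) / (2 * eps)
            total += grad * value
        attributions.append(total)
    return attributions


if __name__ == "__main__":
    tokens = ["the", "mitochondria", "maybe"]  # hypothetical answer tokens
    embeddings = [[0.1, 0.1], [0.8, -0.2], [0.0, 0.4]]
    for token, attr in zip(tokens, gradient_x_input(toy_score, embeddings)):
        print(f"{token:>14s}  {attr:+.4f}")
```

Tokens with larger absolute scores would be highlighted more strongly; the sign indicates whether the token pushed the toy score up or down.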

Model Feedback