Controlled student-teacher feedback evaluation

What Drives Interactive Improvement from Feedback?

In multi-turn tasks, a model can get another attempt after receiving feedback. The hard question is whether it improved because the feedback was useful, or simply because it got more inference time. We isolate that difference across four verifiable reasoning environments and thirteen open-weight model families.

4 environments
13 x 13 student-teacher matrices
K=10 interactive attempts
Figure: Student and teacher role means. Student identity explains substantially more gain variation than teacher identity across environments.

Feedback is becoming a training signal, but the loop itself is under-measured.

Interactive feedback is attractive because it resembles how people debug math solutions, programs, and plans: try, observe what failed, revise, and try again. That same loop could become a way to train models that follow instructions and recover from mistakes in context.

The missing piece is diagnostic. Most work focuses on the final student, the stronger model being distilled, or the verifier. We instead ask what happens inside the feedback step: when does a teacher's message contain information the student can actually use?

Evaluation setting

We evaluate a simple interaction: a student attempts a task, a verifier checks the attempt, a teacher gives feedback, and the student retries the same task. The study separates three effects that are easy to conflate.

  1. Repeated attempts without real feedback.
  2. The student's ability to act on feedback.
  3. The teacher's ability to diagnose the failed attempt.

Multi-turn improvement is not always feedback use.

We introduce a controlled student-teacher protocol across Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1. We compare external feedback, self-feedback, and unguided self-refinement while varying interaction history, task difficulty, and teacher access to privileged task information.

Across settings, self-generated feedback often adds little beyond retrying from scratch, while the strongest external teachers produce substantially larger feedback-specific gains. Dense model matrices show that interactive gains are driven more by the student's ability to use feedback than by the teacher's identity.

What matters for improvement from feedback?

Six targeted findings separate repeated attempts, student feedback uptake, teacher quality, history length, and privileged task information.

RF1

Feedback-specific gains require better-than-generic guidance.

Self-feedback is inconsistent relative to self-refinement: it helps in some environments and slightly trails on BBEH Linguini (10.8 vs. 12.0 acc@10). The best external teachers add 9.2 to 16.6 acc@10 points over self-refinement across environments.

RF2

First-turn accuracy and feedback adherence are distinct.

A strong first attempt does not guarantee strong recovery. Models with similar acc@1 can differ sharply in normalized gain from feedback.

RF3

Interactive performance is mostly student-dependent.

Student fixed effects explain 77.1% to 96.5% of pair-level gain variation, while teacher identity adds only a small increment after conditioning on the student.

RF4

Task competence is not the same as teaching competence.

A teacher that solves more tasks on turn one is not always the better interactive teacher. Diagnosing the student's specific error is a separate capability.

RF5

Longer histories are mostly flat in the completed Gemma4 grid.

Extra visible turns can expose repeated failures, but the completed Gemma4 ablation shows near-flat average AUC from max history 1 to 5 across the four environments.

RF6

Privileged teacher information is not uniformly useful.

Answer access strongly helps BBEH Linguini, barely moves ARC-AGI1, and gives moderate gains on Omni-MATH and Codeforces execution-feedback variants.

A verifier-grounded feedback loop.

In each episode, a student attempts a task. A task-specific verifier checks the answer. If the answer is wrong, a teacher sees the latest attempt, selected interaction history, and optional privileged information, then writes natural-language feedback. The student tries again until success or the interaction budget is exhausted.
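The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `run_episode`, `EpisodeLog`, `max_history`, and `privileged`, the callable signatures, and the choice to show the student the same truncated history as the teacher are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (failed attempt, teacher feedback on that attempt)


@dataclass
class EpisodeLog:
    solved_at: Optional[int] = None          # 1-indexed turn of first success
    attempts: List[str] = field(default_factory=list)


def run_episode(task: Any,
                student: Callable[[Any, List[Turn]], str],
                teacher: Callable[[Any, str, List[Turn], Any], str],
                verifier: Callable[[Any, str], bool],
                k: int = 10,
                max_history: int = 1,
                privileged: Any = None) -> EpisodeLog:
    """Run one student-teacher episode with at most k attempts (hypothetical API)."""
    history: List[Turn] = []
    log = EpisodeLog()
    for turn in range(1, k + 1):
        # Assumption: both roles see only the last `max_history` turns.
        visible = history[-max_history:] if max_history > 0 else []
        attempt = student(task, visible)
        log.attempts.append(attempt)
        if verifier(task, attempt):
            log.solved_at = turn
            break
        if turn == k:          # budget exhausted, no further feedback needed
            break
        feedback = teacher(task, attempt, visible, privileged)
        history.append((attempt, feedback))
    return log


# Toy roles (made up for illustration): the task is a target number as a string.
def toy_student(task, history):
    if not history:
        return "0"
    last_attempt, feedback = history[-1]
    return str(int(last_attempt) + 1) if feedback == "higher" else last_attempt


def toy_teacher(task, attempt, visible, privileged):
    return "higher" if int(attempt) < int(task) else "lower"


def toy_verifier(task, attempt):
    return attempt == task


log = run_episode("3", toy_student, toy_teacher, toy_verifier, k=10)
print(log.solved_at, log.attempts)
```

Swapping `toy_teacher` for a function that ignores the attempt gives a retry-only control in the same harness, which is the comparison the study is built around.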

Environments: Omni-MATH, Codeforces, BBEH Linguini, ARC-AGI1.
Figure: Diagram of the student, teacher, verifier, and environment feedback loop.

Repeated attempts are a strong baseline.

We report acc@1, acc@10, gain, normalized gain, and cumulative accuracy AUC from episode logs. The summary below separates retry, self-feedback, and best external feedback; the additional plots show the latest completed Gemma4 ablations.
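As a rough sketch of how these metrics could be derived from episode logs, the snippet below takes, for each episode, the turn at which the verifier first accepted an attempt (or `None`). The page does not give formulas, so two definitions here are assumptions: normalized gain is taken as gain divided by the headroom `1 - acc@1`, and AUC as the mean of the cumulative accuracy curve over the K turns. The function names are hypothetical.

```python
from typing import Dict, List, Optional


def cumulative_accuracy(solved_at: List[Optional[int]], k: int) -> List[float]:
    """Fraction of episodes solved by turn t, for t = 1..k."""
    n = len(solved_at)
    return [sum(1 for s in solved_at if s is not None and s <= t) / n
            for t in range(1, k + 1)]


def summarize(solved_at: List[Optional[int]], k: int = 10) -> Dict[str, float]:
    curve = cumulative_accuracy(solved_at, k)
    acc_1, acc_k = curve[0], curve[-1]
    gain = acc_k - acc_1
    # Assumed definition: share of first-turn failures that are later recovered.
    normalized_gain = gain / (1.0 - acc_1) if acc_1 < 1.0 else 0.0
    # Assumed definition: average height of the cumulative accuracy curve.
    auc = sum(curve) / k
    return {"acc@1": acc_1, "acc@k": acc_k, "gain": gain,
            "normalized_gain": normalized_gain, "auc": auc}


# Four toy episodes: two solved on turn 1, one on turn 3, one never solved.
print(summarize([1, 1, 3, None], k=4))
```

Under these assumed definitions, two models with the same acc@1 can still differ in normalized gain, because the denominator only fixes the headroom and not how much of it is recovered.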

Values: acc@10 (gain)

Environment      Self-refinement   Self-feedback   Best feedback
Omni-MATH        43.9 (+20.7)      48.6 (+24.7)    60.5 (+36.9)
Codeforces       52.8 (+17.8)      56.9 (+23.8)    68.7 (+35.1)
BBEH Linguini    12.0 (+7.6)       10.8 (+7.7)     21.2 (+17.5)
ARC-AGI1         18.2 (+11.6)      26.9 (+17.1)    33.2 (+23.1)
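One way to read the table is to subtract the self-refinement column from the other two, which isolates what feedback adds beyond retrying. The short script below does that arithmetic on the acc@10 values copied from the table; the function name `feedback_specific_gains` is just a label for this example.

```python
from typing import Dict, Tuple

# acc@10 values copied from the summary table:
# (self-refinement, self-feedback, best external feedback)
ACC_AT_10: Dict[str, Tuple[float, float, float]] = {
    "Omni-MATH": (43.9, 48.6, 60.5),
    "Codeforces": (52.8, 56.9, 68.7),
    "BBEH Linguini": (12.0, 10.8, 21.2),
    "ARC-AGI1": (18.2, 26.9, 33.2),
}


def feedback_specific_gains(table):
    """Points added over self-refinement by self-feedback and by the best teacher."""
    return {
        env: {
            "self_feedback": round(self_fb - refine, 1),
            "best_feedback": round(best - refine, 1),
        }
        for env, (refine, self_fb, best) in table.items()
    }


for env, deltas in feedback_specific_gains(ACC_AT_10).items():
    print(f"{env:14s} self: {deltas['self_feedback']:+.1f}  best: {deltas['best_feedback']:+.1f}")
```

The best-teacher differences come out between 9.2 (BBEH Linguini) and 16.6 (Omni-MATH) points, matching the range quoted in RF1, while the self-feedback difference is negative on BBEH Linguini.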
Figure: Scatter plot of first-turn accuracy versus normalized gain. First-turn accuracy and feedback uptake are separate capabilities.
Figure: Marginal gain by turn across environments. Most gains arrive in the first few turns.
Figure: Cumulative accuracy curves for student models. Interactive trajectories reveal model-specific recovery behavior.
Figure: Gemma4 history ablation showing AUC at K=10 across Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1. Max-history 1 to 5 is largely flat in the completed Gemma4 ablation.
Figure: Teacher information access ablation showing large BBEH Linguini benefit, small Codeforces and Omni-MATH effects, and no ARC-AGI1 effect. Teacher-side information helps selectively: large on BBEH Linguini, essentially zero on ARC-AGI1.

Every model acts as both student and teacher.

The dense 13 x 13 matrices make the role asymmetry visible: rows often dominate columns, showing that the student receiving feedback explains most interactive gain variation.
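To make "rows dominate columns" concrete, here is a small sketch of one way to split the variation in a student-by-teacher gain matrix into a student share and a teacher share. It uses a balanced two-way sum-of-squares decomposition, which is an assumption: the page reports fixed-effects results but does not spell out the exact regression, so this may differ from the authors' estimator. The 2 x 2 matrix is made-up toy data, and `role_variance_shares` is a hypothetical name.

```python
from typing import Dict, List


def role_variance_shares(gains: List[List[float]]) -> Dict[str, float]:
    """gains[i][j] = gain of student i when taught by teacher j (complete matrix)."""
    n_s, n_t = len(gains), len(gains[0])
    grand = sum(sum(row) for row in gains) / (n_s * n_t)
    row_means = [sum(row) / n_t for row in gains]
    col_means = [sum(gains[i][j] for i in range(n_s)) / n_s for j in range(n_t)]

    ss_total = sum((x - grand) ** 2 for row in gains for x in row)
    if ss_total == 0:
        raise ValueError("all gains are identical; shares are undefined")
    ss_student = n_t * sum((m - grand) ** 2 for m in row_means)
    ss_teacher = n_s * sum((m - grand) ** 2 for m in col_means)

    student = ss_student / ss_total
    teacher = ss_teacher / ss_total
    return {"student": student, "teacher": teacher,
            "residual": 1.0 - student - teacher}


# Toy data (not from the paper): rows differ a lot, columns only slightly.
toy_gains = [
    [10.0, 12.0],
    [30.0, 32.0],
]
print(role_variance_shares(toy_gains))
```

In a complete, balanced matrix the student and teacher sums of squares are orthogonal, so the teacher share here equals the increment from adding teacher effects after student effects.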

Figure: Omni-MATH student-teacher matrix.
Figure: Codeforces student-teacher matrix.
Figure: BBEH Linguini student-teacher matrix.
Figure: ARC-AGI1 student-teacher matrix.

Evaluate feedback against retry.

Multi-turn success is not enough evidence that feedback was useful. A controlled evaluation should compare against unguided self-refinement, measure recovery from failed first attempts, and separate the teacher's ability to diagnose from the student's ability to act on feedback.

  • Report repeated-attempt baselines before claiming feedback use.
  • Track student and teacher effects separately.
  • Treat privileged answers, solutions, and execution traces as upper-bound ablations.
  • Use shorter high-quality trajectories when most gain is front-loaded.

Pair-level gain variation

77.1%-96.5%: explained by student fixed effects
1.3%-12.4%: explained by teacher fixed effects
70.8%-86.8%: gain recovered by K=5
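The "gain recovered by K=5" figure can be read as the share of the full K=10 gain that is already present after five attempts. The sketch below assumes that reading, which the page does not state explicitly, and runs on a made-up cumulative accuracy curve; `gain_recovered` is a hypothetical helper name.

```python
from typing import List


def gain_recovered(curve: List[float], k: int) -> float:
    """Share of the total gain (last turn minus first turn) reached by turn k.

    `curve[t - 1]` is cumulative accuracy after t attempts.
    """
    if not 1 <= k <= len(curve):
        raise ValueError("k must be between 1 and the number of turns")
    total_gain = curve[-1] - curve[0]
    if total_gain <= 0:
        raise ValueError("no positive gain to recover")
    return (curve[k - 1] - curve[0]) / total_gain


# Toy front-loaded curve over 10 turns (illustrative numbers only).
toy_curve = [0.20, 0.30, 0.36, 0.40, 0.42, 0.44, 0.45, 0.46, 0.47, 0.48]
print(f"{gain_recovered(toy_curve, 5):.1%} of the K=10 gain is reached by K=5")
```

When this share is high, most of the interaction budget after turn five buys little, which is the motivation for preferring shorter trajectories when gain is front-loaded.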

Paper artifacts

Project page for the publication, with the paper, code, evaluation protocol, and main result figures in one place.