Controlled student-teacher feedback evaluation

What Drives Interactive Improvement from Feedback?

In multi-turn tasks, a model can get another attempt after receiving feedback. The hard question is whether it improved because the feedback was useful, or simply because it got more inference time. We isolate that difference across four verifiable reasoning environments and thirteen open-weight model families.


4: environments
13 x 13: student-teacher matrices
K=10: interactive attempts

[Figure: Student and teacher role means.] Student identity explains substantially more gain variation than teacher identity across environments.

Motivation

Feedback is becoming a training signal, but the loop itself is under-measured.

Interactive feedback is attractive because it resembles how people debug math solutions, programs, and plans: try, observe what failed, revise, and try again. That same loop could become a way to train models that follow instructions and recover from mistakes in context.

The missing piece is diagnostic. Most work focuses on the final student, the stronger model being distilled, or the verifier. We instead ask what happens inside the feedback step: when does a teacher's message contain information the student can actually use?

Evaluation setting

We evaluate a simple interaction: a student attempts a task, a verifier checks the attempt, a teacher gives feedback, and the student retries the same task. The study separates three effects that are easy to conflate.

Repeated attempts without real feedback.
The student's ability to act on feedback.
The teacher's ability to diagnose the failed attempt.

Abstract

Multi-turn improvement is not always feedback use.

We introduce a controlled student-teacher protocol across Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1. We compare external feedback, self-feedback, and unguided self-refinement while varying interaction history, task difficulty, and teacher access to privileged task information.

Across settings, self-generated feedback often adds little beyond retrying from scratch, while the strongest external teachers produce substantially larger feedback-specific gains. Dense model matrices show that interactive gains are driven more by the student's ability to use feedback than by the teacher's identity.

Research findings

What matters for improvement from feedback?

Six targeted findings separate repeated attempts, student feedback uptake, teacher quality, history length, and privileged task information.

RF1

Feedback-specific gains require better-than-generic guidance.

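Self-feedback does not consistently beat self-refinement: it trails on BBEH Linguini (10.8 vs. 12.0 acc@10) and leads by 4.1 to 8.7 points in the other three environments. The best external teachers add 9.2 to 16.6 acc@10 points over self-refinement across environments.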

RF2

First-turn accuracy and feedback adherence are distinct.

A strong first attempt does not guarantee strong recovery. Models with similar acc@1 can differ sharply in normalized gain from feedback.
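To make the distinction concrete, here is a minimal sketch of normalized gain. The page does not spell out the formula, so the definition below (gain divided by the headroom left after the first turn) is an assumption, and the two students are hypothetical.

```python
def normalized_gain(acc_first: float, acc_final: float) -> float:
    """Fraction of the remaining headroom closed after feedback.

    Assumed definition: (acc@K - acc@1) / (1 - acc@1), accuracies in [0, 1].
    """
    headroom = 1.0 - acc_first
    if headroom <= 0.0:
        return 0.0  # a perfect first turn leaves nothing to recover
    return (acc_final - acc_first) / headroom


# Two hypothetical students with the same first-turn accuracy (acc@1 = 0.40).
student_a = normalized_gain(0.40, 0.70)  # closes about half of the headroom
student_b = normalized_gain(0.40, 0.46)  # closes about a tenth of the headroom
print(round(student_a, 2), round(student_b, 2))
```

Under this definition, identical acc@1 values say nothing about how much of the remaining headroom a student recovers, which is the separation RF2 describes.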

RF3

Interactive performance is mostly student-dependent.

Student fixed effects explain 77.1% to 96.5% of pair-level gain variation, while teacher identity adds only a small increment (1.3% to 12.4%) after conditioning on the student.

RF4

Task competence is not the same as teaching competence.

A teacher that solves more tasks on turn one is not always the better interactive teacher. Diagnosing the student's specific error is a separate capability.

RF5

Longer histories barely change results in the completed Gemma4 grid.

Extra visible turns can expose repeated failures, but the completed Gemma4 ablation shows near-flat average AUC from max history 1 to 5 across the four environments.

RF6

Privileged teacher information is not uniformly useful.

Answer access strongly helps BBEH Linguini, barely moves ARC-AGI1, and gives moderate gains on Omni-MATH and Codeforces execution-feedback variants.

Protocol

A verifier-grounded feedback loop.

In each episode, a student attempts a task. A task-specific verifier checks the answer. If the answer is wrong, a teacher sees the latest attempt, selected interaction history, and optional privileged information, then writes natural-language feedback. The student tries again until success or the interaction budget is exhausted.
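The loop described above can be sketched as follows. This is an illustrative sketch, not the project's code: `run_episode`, the callable signatures, `max_history`, and the toy number-guessing task are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

Turn = Tuple[str, Optional[str]]  # (student attempt, teacher feedback or None)


@dataclass
class Episode:
    solved_at: Optional[int] = None  # 1-indexed turn of first success
    history: List[Turn] = field(default_factory=list)


def run_episode(task: Any, student: Callable, teacher: Callable,
                verifier: Callable, max_turns: int = 10,
                max_history: int = 5, privileged: Any = None) -> Episode:
    episode = Episode()
    for turn in range(1, max_turns + 1):
        # Only the most recent turns are shown to the student and teacher.
        visible = episode.history[-max_history:] if max_history > 0 else []
        attempt = student(task, visible)
        if verifier(task, attempt):
            episode.history.append((attempt, None))
            episode.solved_at = turn
            break
        # No feedback is needed after the final attempt in the budget.
        feedback = (teacher(task, attempt, visible, privileged)
                    if turn < max_turns else None)
        episode.history.append((attempt, feedback))
    return episode


# Toy stand-ins (hypothetical): guess an integer, teacher says which way to go.
def toy_student(task, history):
    if not history:
        return "0"
    last_attempt, feedback = history[-1]
    return str(int(last_attempt) + (1 if feedback == "higher" else -1))


def toy_teacher(task, attempt, history, privileged):
    return "higher" if int(attempt) < privileged else "lower"


def toy_verifier(task, attempt):
    return attempt == str(task["answer"])


episode = run_episode({"answer": 3}, toy_student, toy_teacher, toy_verifier,
                      privileged=3)
print(episode.solved_at, len(episode.history))
```

Passing the student itself as `teacher` would give a self-feedback condition, and a teacher that returns a fixed generic message would approximate unguided self-refinement; both mappings are assumptions about how the baselines could be wired into such a loop.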

Environments: Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1.

[Figure: Diagram of the student, teacher, verifier, and environment feedback loop.]

Results

Repeated attempts are a strong baseline.

We report acc@1, acc@10, gain, normalized gain, and cumulative accuracy AUC from episode logs. The summary below separates retry, self-feedback, and best external feedback; the additional plots show the latest completed Gemma4 ablations.

Environment	Self-refinement, acc@10 (gain)	Self-feedback, acc@10 (gain)	Best external feedback, acc@10 (gain)
Omni-MATH	43.9 (+20.7)	48.6 (+24.7)	60.5 (+36.9)
Codeforces	52.8 (+17.8)	56.9 (+23.8)	68.7 (+35.1)
BBEH Linguini	12.0 (+7.6)	10.8 (+7.7)	21.2 (+17.5)
ARC-AGI1	18.2 (+11.6)	26.9 (+17.1)	33.2 (+23.1)
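As a quick check on the summary, the short script below subtracts the self-refinement column from the other two. It reads the first number in each cell as acc@10, an assumption that reproduces the 9.2 to 16.6 point range quoted in RF1; the variable names are illustrative.

```python
# First number of each table cell, read as acc@10 in points:
# (self-refinement, self-feedback, best external feedback)
ACC_AT_10 = {
    "Omni-MATH": (43.9, 48.6, 60.5),
    "Codeforces": (52.8, 56.9, 68.7),
    "BBEH Linguini": (12.0, 10.8, 21.2),
    "ARC-AGI1": (18.2, 26.9, 33.2),
}

self_delta = {}  # self-feedback minus self-refinement
best_delta = {}  # best external feedback minus self-refinement
for env, (refine, self_fb, best) in ACC_AT_10.items():
    self_delta[env] = round(self_fb - refine, 1)
    best_delta[env] = round(best - refine, 1)
    print(f"{env}: self-feedback {self_delta[env]:+.1f}, "
          f"best feedback {best_delta[env]:+.1f}")

print(min(best_delta.values()), max(best_delta.values()))
```

The retry-relative view shows self-feedback falling below self-refinement on BBEH Linguini, while the best external teacher stays well above it in every environment.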

[Figure: Scatter plot of first-turn accuracy versus normalized gain.] First-turn accuracy and feedback uptake are separate capabilities.

[Figure: Marginal gain by turn across environments.] Most gains arrive in the first few turns.

[Figure: Cumulative accuracy curves for student models.] Interactive trajectories reveal model-specific recovery behavior.
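The curve-based quantities can be derived from per-episode solve turns, as in the sketch below. Several details are assumptions rather than the paper's definitions: AUC is taken as the mean of cumulative accuracy over turns 1 to K, the "gain recovered by turn k" ratio is (acc@k - acc@1) / (acc@K - acc@1), and the episode log is made up for illustration.

```python
from typing import List, Optional


def cumulative_accuracy(solved_at: List[Optional[int]],
                        max_turns: int = 10) -> List[float]:
    """acc@k for k = 1..max_turns; None marks an unsolved episode."""
    n = len(solved_at)
    return [sum(1 for t in solved_at if t is not None and t <= k) / n
            for k in range(1, max_turns + 1)]


def marginal_gains(curve: List[float]) -> List[float]:
    """Accuracy added at each turn from turn 2 onward."""
    return [later - earlier for earlier, later in zip(curve, curve[1:])]


def auc(curve: List[float]) -> float:
    """Assumed normalization: mean cumulative accuracy over all turns."""
    return sum(curve) / len(curve)


def gain_recovered_by(curve: List[float], k: int) -> float:
    """Share of the final gain already reached at turn k."""
    total = curve[-1] - curve[0]
    if total <= 0.0:
        return 0.0  # no gain to recover
    return (curve[k - 1] - curve[0]) / total


# Hypothetical log: turn at which each of ten episodes was first solved.
solved_at = [1, 1, 2, 3, 3, 6, None, None, None, None]
curve = cumulative_accuracy(solved_at)
print([round(a, 2) for a in curve])
print(round(auc(curve), 3), round(gain_recovered_by(curve, 5), 3))
```

In this toy log, three quarters of the final gain is already in place by turn 5, the kind of front-loading the marginal-gain plot describes.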

[Figure: Gemma4 history ablation showing AUC at K=10 across Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1.] Max-history 1 to 5 is largely flat in the completed Gemma4 ablation.

[Figure: Teacher information access ablation showing large BBEH Linguini benefit, small Codeforces and Omni-MATH effects, and no ARC-AGI1 effect.] Teacher-side information helps selectively: large on BBEH Linguini, essentially zero on ARC-AGI1.

Dense matrices

Every model acts as both student and teacher.

The dense 13 x 13 matrices make the role asymmetry visible: gains typically vary more across rows (students) than across columns (teachers), showing that the student receiving feedback explains most of the interactive gain variation.
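One simple way to quantify this asymmetry is a balanced two-way sum-of-squares split of the gain matrix. The sketch below is an assumption about how such a decomposition could be computed, not the paper's exact fixed-effects regression; the 3 x 3 matrix is made up, and the diagonal (self-feedback) cells are treated like any other pair.

```python
from typing import List, Tuple


def role_variance_shares(gain: List[List[float]]) -> Tuple[float, float]:
    """Share of gain variance explained by student rows and teacher columns.

    gain[s][t] is the gain of student s paired with teacher t. In a complete
    (balanced) matrix the row and column effects are orthogonal, so the
    teacher share equals the increment after conditioning on the student.
    """
    n_students, n_teachers = len(gain), len(gain[0])
    grand = sum(map(sum, gain)) / (n_students * n_teachers)
    row_means = [sum(row) / n_teachers for row in gain]
    col_means = [sum(row[t] for row in gain) / n_students
                 for t in range(n_teachers)]
    ss_total = sum((x - grand) ** 2 for row in gain for x in row)
    if ss_total == 0.0:
        return 0.0, 0.0  # constant matrix: nothing to explain
    ss_student = n_teachers * sum((m - grand) ** 2 for m in row_means)
    ss_teacher = n_students * sum((m - grand) ** 2 for m in col_means)
    return ss_student / ss_total, ss_teacher / ss_total


# Hypothetical gains: rows differ by 10 points, columns by about 1 point.
toy_gain = [
    [10.0, 12.0, 11.0],
    [20.0, 22.0, 21.0],
    [30.0, 32.0, 31.0],
]
student_share, teacher_share = role_variance_shares(toy_gain)
print(round(student_share, 3), round(teacher_share, 3))
```

When rows differ far more than columns, as in this toy matrix, nearly all of the variance lands in the student share.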

[Figure: Omni-MATH student-teacher matrix.]

[Figure: Codeforces student-teacher matrix.]

[Figure: BBEH Linguini student-teacher matrix.]

[Figure: ARC-AGI1 student-teacher matrix.]

Takeaway

Evaluate feedback against retry.

Multi-turn success is not enough evidence that feedback was useful. A controlled evaluation should compare against unguided self-refinement, measure recovery from failed first attempts, and separate the teacher's ability to diagnose from the student's ability to act on feedback.

Report repeated-attempt baselines before claiming feedback use.
Track student and teacher effects separately.
Treat privileged answers, solutions, and execution traces as upper-bound ablations.
Use shorter high-quality trajectories when most gain is front-loaded.

Pair-level gain variation

77.1%-96.5%: Explained by student fixed effects
1.3%-12.4%: Explained by teacher fixed effects
70.8%-86.8%: Gain recovered by K=5

Resources

Paper artifacts

Project page for the publication, with the paper, code, evaluation protocol, and main result figures in one place.
