Stepwise Verification and Remediation of Student Reasoning Errors
with Large Language Model Tutors

Nico Daheim∗1  Jakub Macina∗2,3
Manu Kapur4  Iryna Gurevych1  Mrinmaya Sachan2

1 Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science
and Hessian Center for AI (hessian.AI), TU Darmstadt
2 Department of Computer Science, ETH Zurich  3 ETH AI Center
4 Professorship for Learning Sciences and Higher Education, ETH Zurich
Abstract

Large language models (LLMs) present an opportunity to scale high-quality personalized education to all. A promising approach toward this end is to build dialog tutoring models that scaffold students’ problem-solving. However, even though existing LLMs perform well in solving reasoning questions, they struggle to precisely detect students’ errors and tailor their feedback to these errors. Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions and show how grounding to such verification improves the overall quality of tutor response generation. We collect a dataset of 1K stepwise math reasoning chains with the first error step annotated by teachers. We show empirically that finding the mistake in a student solution is challenging for current models. We propose and evaluate several verifiers for detecting these errors. Using both automatic and human evaluation, we show that the student solution verifiers steer the generation model towards highly targeted responses to student errors, which are more often correct and contain fewer hallucinations than those of existing baselines.


1 Introduction

Figure 1: Directly generating a tutor response based on the conversation history can lead to hallucinations (bottom left). To alleviate this, we split this process into two sequential tasks (right): 1) A model identifies the student’s mistake. 2) A different response generation model communicates the identified mistake. We use different verifiers: providing the Error Reason (Wang et al., 2024c), Classification-based Verification, providing a more detailed Error Description and a Step Alignment of student and reference solution. Especially the latter two reduce hallucinations and make tutor models more targeted at the student error when verification and generation are combined (Section 6).

The field of dialog tutoring aims to build systems that can teach students by holding a conversation with them (Wollny et al., 2021; Jurenka et al., 2024). Dialog tutors hold the potential to make personalized teaching available to learners anywhere, anytime. The increasing capabilities of LLMs have brought renewed hope to this field (Thoppilan et al., 2022; Jurenka et al., 2024). However, real-time tutoring is quite complex, and human teachers bring various intricate capabilities when teaching, such as identifying student errors in problem solving, picking a pedagogical strategy, and communicating it (Wang et al., 2024c). The same requirements hold for dialog tutoring models, which need all these abilities to be effective.

Yet, although research on effective human tutors shows they perform these steps sequentially by first reasoning about the error, then picking a strategy, and then responding (Lepper and Woolverton, 2002), many tutoring models perform all of them in one forward pass. Recent studies (Macina et al., 2023b, a) have shown that this can lead to several deficiencies that can be detrimental to student learning, for example, in math tutoring. Despite impressive performance on math reasoning benchmarks (Cobbe et al., 2021; Hendrycks et al., 2021), dialog tutors often generate hallucinated outputs and present erroneous information to students, for example, because they assess an incorrect solution as correct. We show an example of this in Figure 1.

In this paper, we alleviate this problem by decoupling the verification of student solutions from response generation with a modular approach. In the common approach, the model directly generates the tutor response from the students’ utterances, so that solution assessment is done only implicitly in the model’s activations. In our approach, the model instead receives the output of an additional verification model that assesses solutions and can therefore also be more specialized. We hypothesize that this increases the correctness of the model and makes the response more targeted to the error, because the response generation module is already aware of the exact student error. Furthermore, this architecture more closely mimics human tutors.

To test our approach and train verifiers, we collect a dataset of ca. 1k student solutions and their stepwise reasoning chains in the domain of multi-step math problem-solving, which will be released publicly. This dataset augments the math dialog tutoring corpus MathDial (Macina et al., 2023a), which we use for evaluating dialog tutoring models, by teacher-annotated verifications of the first erroneous step in the student solution (Section 4).

We propose three verification approaches based on prompting and finetuning language models. Besides a simple classification-based approach for verification, we also generate a textual verification and notably align student solution steps to steps of a reference solution (Section 3.1) to verify the student solution. We find that using our data for finetuning helps smaller LLMs surpass prompted state-of-the-art LLMs. Furthermore, incorporating the verification output in the response generation step (Section 3.2) clearly improves the performance of tutor models in both extensive automatic evaluation (Section 6) and human evaluation with real teachers (Section 6.3): the generated responses are more targeted to the exact student error, there are fewer hallucinations, and there is more actionable scaffolding feedback for the student. In general, we find that such improvements are much stronger when the verification output is correct (Section 7.1), indicating a large potential for the community to improve dialog tutors by adding verifiers.

2 Background & Related Work

Dialog tutoring aims at building models that can tutor human students through a conversation. For example, dialog tutoring has been proposed for second-language acquisition (Stasaski et al., 2020; Caines et al., 2020; Kwon et al., 2024), to answer questions in science (Chevalier et al., 2024), or to solve math word problems (MWPs) (Macina et al., 2023a). In each case, the model should guide the learner to solving a problem (e.g. the MWP or translation of a phrase) not by telling the solution outright, but rather by using scaffolding techniques that give students space for guided exploration and self-correction. For example, the tutor might elicit the students’ thinking by asking a question that challenges their understanding of the problem (Reiser, 2004; Anghileri, 2006).

Capturing such intricate tutoring strategies is hard and takes teachers years to master. Due to this complexity, most previous dialog tutoring systems were human-authored, notably the AutoTutor family (Nye et al., 2014), the LISP tutor (Anderson et al., 1985), which uses a large set of rules to verify student programming solutions, and systems built using CTAT, which requires enumerating all possible solutions or writing complex production rules (Aleven et al., 2016). However, scaling such human-authored systems can quickly explode in both complexity and human effort (Macina et al., 2023b). Due to this and the rapid progress in language generation with large models trained on large amounts of data, LLMs such as LearnLM (Jurenka et al., 2024) have recently gained popularity over human-authored, rule-based systems.

Problem formulation

Formally, the goal of dialog tutoring is to continue a tutoring dialog consisting of a sequence of $T-1$ turns $\mathcal{H} \coloneqq (\mathbf{u}_1, \dots, \mathbf{u}_{T-1})$ taken by either student or teacher, where each $\mathbf{u}_t \in \mathcal{V}^{\ast}$ is constructed from a fixed model vocabulary $\mathcal{V}$. Continuation then means generating a new utterance $\mathbf{u}_T \in \mathcal{V}^{\ast}$ that follows the above principles. Often there is also background knowledge that is either required or helpful to tutor a certain concept, such as grammar rules (Stasaski et al., 2020) or textbook knowledge (Wang et al., 2024b), and can be used to ground $\mathbf{u}_T$. In this work, we deal with teaching math word problem-solving and therefore use textual knowledge $\mathbf{k} \in \mathcal{V}^{\ast}$, for instance, the math word problem and background knowledge like math rules.

Tutor models

Using such data, a simple approach is generating the teacher response directly by pairing the following model, parameterized by weights 𝜽𝜽\boldsymbol{\theta}bold_italic_θ, with a decoding algorithm, such as beam search or greedy decoding:

$$p_{\boldsymbol{\theta}}(\mathbf{y} \mid \mathcal{H}, \mathbf{k}) = \prod_{i=1}^{|\mathbf{y}|} p_{\boldsymbol{\theta}}(\mathbf{y}_i \mid \mathbf{y}_{<i}, \mathcal{H}, \mathbf{k})\text{.} \tag{1}$$

This model is straightforward to implement and learn from data but prior work has shown that it suffers from generating factually incorrect outputs (Macina et al., 2023a). Therefore, in this work we break down response generation into two steps: verification, where it is first assessed whether the student solution is correct, and generation.
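As a minimal illustration of the factorization in Equation 1, the following sketch scores a response token by token under a toy next-token model. The names `sequence_log_prob` and `uniform_model` are hypothetical and only serve the example; a real tutor model would replace the toy distribution with LLM probabilities.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: a model maps (response prefix, context) to a
# probability distribution over the next token.
NextTokenModel = Callable[[Sequence[str], str], Dict[str, float]]


def sequence_log_prob(model: NextTokenModel, response: Sequence[str], context: str) -> float:
    """Sum the log-probabilities of each token given its prefix and the context.

    This is the log of the product in Equation 1, where `context` stands in
    for the dialog history and the background knowledge.
    """
    log_p = 0.0
    for i, token in enumerate(response):
        distribution = model(response[:i], context)
        log_p += math.log(distribution[token])
    return log_p


def uniform_model(prefix: Sequence[str], context: str) -> Dict[str, float]:
    """Toy stand-in for an LLM: a uniform distribution over a tiny vocabulary."""
    vocabulary = ["what", "is", "2+2", "?"]
    return {token: 1.0 / len(vocabulary) for token in vocabulary}


score = sequence_log_prob(uniform_model, ["what", "is"], "toy context")
```

Greedy decoding or beam search would then search for a response with a high score under this quantity.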

Verification

Verification is challenging in its own right and has recently been tackled for general reasoning problems. For example, ROSCOE (Golovneva et al., 2023) presents unsupervised metrics to assess the correctness of a model’s chain-of-thought (CoT) reasoning, and Jacovi et al. (2024) evaluate open-domain question answering for logical errors. The outputs of verifiers have subsequently also been used for self-refinement of LLMs (Madaan et al., 2023; Shinn et al., 2024) and allow targeted feedback for the training of student LLMs with teacher LLMs (Saha et al., 2023; Wang et al., 2024a). Closely related to our work, Wang et al. (2024c) define broad error categories, such as miscalculation, to understand the cause of incorrect reasoning by students and condition on them to generate teacher responses. We compare to this baseline and call it Error Reason.

3 Verification-based Response Generation

We first introduce the task of verification and different verifiers in Section 3.1. Afterwards, in Section 3.2, we combine verification and response generation for modular tutor response generation.

3.1 Verification

We deal with the verification of student solutions to a given math word problem $\mathbf{q} \in \mathcal{V}^{\ast}$. The solutions can be described by a sequence of substep solutions $\mathbf{s}_{\mathbf{q}} = \{\mathbf{s}_1, \dots, \mathbf{s}_N\}$, where each $\mathbf{s}_n \in \mathcal{V}^{\ast}$ and $\mathbf{s}_N$ is the final solution. Usually, $\mathbf{s}_{\mathbf{q}}$ is described by the student in one of the student utterances $\mathbf{u}_t$. The task of the model is to assess whether $\mathbf{s}_N$ is the correct solution to $\mathbf{q}$ and if not, potentially, to identify which step $\mathbf{s}_n$ caused the error.
Oftentimes, this can be done by comparing to a reference solution $\widehat{\mathbf{s}}_{\mathbf{q}} = \{\widehat{\mathbf{s}}_1, \dots, \widehat{\mathbf{s}}_M\}$ that is either given or model-generated and might differ in length. All verifiers which we discuss next can then be described by a learned function $v_{\boldsymbol{\theta}'}(\mathbf{v} \mid \mathbf{s}_{\mathbf{q}}, \widehat{\mathbf{s}}_{\mathbf{q}})$, usually an LLM. Here, $\mathbf{v}$ is the verification output and $\widehat{\mathbf{s}}_{\mathbf{q}}$ may be an empty string if no reference solution is given. In the following, we introduce different verifiers.

Classification-based Verification

A comparably simple approach to verification is classifying whether the student solution $\mathbf{s}_{\mathbf{q}}$ is correct using a binary classifier. We call this Overall Verification. Similarly, identifying the first error step $\mathbf{s}_n$ can be framed as multi-class classification with labels $\{0, \dots, N\}$, where $0$ means no mistake. We call this Stepwise Verification. Alternatively, Stepwise Verification (iterative) can be framed as a binary classification of whether each step $\mathbf{s}_n$ is correct. The first error step is then the first step classified as an error.
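The iterative variant can be sketched as a loop over per-step binary decisions. The sketch below is only illustrative: `first_error_step` and `toy_step_checker` are hypothetical names, and the toy checker merely re-computes simple arithmetic, whereas the verifiers in this work are prompted or finetuned language models that also see the problem and the previous steps.

```python
from typing import Callable, List


def first_error_step(steps: List[str], is_step_correct: Callable[[str], bool]) -> int:
    """Return the 1-based index of the first step judged incorrect, or 0 if none is.

    The return value follows the label set {0, ..., N}, where 0 means no mistake.
    """
    for n, step in enumerate(steps, start=1):
        if not is_step_correct(step):
            return n
    return 0


def toy_step_checker(step: str) -> bool:
    """Toy stand-in for a learned binary classifier (an assumption of this sketch).

    It only handles steps of the form "a op b = c" with op in {+, -, *}.
    """
    left, right = step.split("=")
    a, op, b = left.split()
    operations = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
    }
    return operations[op](float(a), float(b)) == float(right)


student_steps = ["3 * 4 = 12", "12 + 5 = 18", "18 - 2 = 16"]
label = first_error_step(student_steps, toy_step_checker)
```

In this toy example the second step is flagged, and the third step is never inspected because only the first error is of interest.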

Error Description

While conceptually easy, classification-based approaches locate the first error without explaining the exact issue. Therefore, we propose to use an LLM to directly describe the error, and the concrete first error step, in a textual format, and call this Error Description. For this, we prompt the LLM with the prompt outlined in Appendix G. In comparison to Wang et al. (2024c), this error description is allowed to be free-form and does not map to predefined error types. The LLM-generated error step description can then be passed to a tutor response generation model.

Algorithm 1 Modified Needleman-Wunsch.
Input: Solution attempt $\mathbf{s}_{\mathbf{q}} = \{\mathbf{s}_1, \dots, \mathbf{s}_N\}$
Input: Reference solution $\widehat{\mathbf{s}}_{\mathbf{q}} = \{\widehat{\mathbf{s}}_1, \dots, \widehat{\mathbf{s}}_M\}$
Input: Gap cost $c$, similarity threshold $t$
Output: Optimal alignment of $\mathbf{s}_{\mathbf{q}}$ and $\widehat{\mathbf{s}}_{\mathbf{q}}$
$F \leftarrow \text{zeros\_matrix}(M+1, N+1)$  ▷ initialize
$F[1:M+1, 0] \leftarrow [i \cdot c \text{ for } i \text{ in } 1 \ldots M]$
$F[0, 1:N+1] \leftarrow [i \cdot c \text{ for } i \text{ in } 1 \ldots N]$
for $i \leftarrow 1$ to $M$ do
    $\mathbf{e}_{\widehat{\mathbf{s}}_i} \leftarrow \text{embed}(\widehat{\mathbf{s}}_i)$
    for $j \leftarrow 1$ to $N$ do
        $\mathbf{e}_{\mathbf{s}_j} \leftarrow \text{embed}(\mathbf{s}_j)$
        $F[i,j] \leftarrow \text{cosine\_similarity}(\mathbf{e}_{\widehat{\mathbf{s}}_i}, \mathbf{e}_{\mathbf{s}_j})$
        if $F[i,j] \geq t$ then  ▷ exact match
            $F[i,j] \leftarrow F[i-1,j-1] + F[i,j]$
        else  ▷ near match or gap
            $F[i,j] \leftarrow \max(F[i-1,j-1] - 1 + F[i,j],\ F[i-1,j] + c,\ F[i,j-1] + c)$
        end if
    end for
end for
$\mathbf{a} = \{(\mathbf{a}_1, \widehat{\mathbf{a}}_1), \dots, (\mathbf{a}_L, \widehat{\mathbf{a}}_L)\} \leftarrow \text{backtrack}(F, \mathbf{s}_{\mathbf{q}}, \widehat{\mathbf{s}}_{\mathbf{q}})$
return Globally-optimal alignment $\mathbf{a}$

(Step) Alignment

As our third verification approach, we align the steps in the student’s solution with a reference solution, and compare the steps in the student and reference solution to localize errors. We call this approach Step Alignment. As the order of steps in the solutions matters, a greedy algorithm that finds the most similar steps across the two solutions is insufficient. Thus, we frame verification as a sequence alignment problem.

The input to the alignment algorithm is the student solution $\mathbf{s}_{\mathbf{q}}$ with $N$ steps and the reference solution $\widehat{\mathbf{s}}_{\mathbf{q}}$ with $M$ steps. Note that here we are aligning solution steps, which can be long strings. This is different from other sequence alignment problems in NLP, where typically tokens are aligned (Paolini et al., 2021, inter alia). The output is a sequence of tuples $\{(\mathbf{a}_1, \widehat{\mathbf{a}}_1), \dots, (\mathbf{a}_L, \widehat{\mathbf{a}}_L)\}$ of length $L$, where each $\mathbf{a}_l$ and $\widehat{\mathbf{a}}_l$ can be either a step of $\mathbf{s}_{\mathbf{q}}$ and $\widehat{\mathbf{s}}_{\mathbf{q}}$, respectively, or a special symbol $\oslash$.
Here, $\mathbf{a}_l = \oslash$ indicates a missing step in the student solution ($-$) and $\widehat{\mathbf{a}}_l = \oslash$ indicates an additional step ($+$).

In our implementation, we use the Needleman-Wunsch (NW) algorithm (Needleman and Wunsch, 1970) as it guarantees an optimal alignment with respect to a chosen cost function. We use a modified version of the algorithm for semantic sequence alignment and use sentence embeddings (Reimers and Gurevych, 2019) to measure the similarity between steps. We detail our adaptation of the NW algorithm in Algorithm 1 and describe each step in the following. The NW algorithm iterates over all possible pairs of substeps from $\mathbf{s}_{\mathbf{q}}$ and $\widehat{\mathbf{s}}_{\mathbf{q}}$ and calculates a score for each pair. Since each substep is a string, we use semantic string similarity measured by the cosine similarity of the contextual embeddings of the substeps. In our experiments, this performed better than just matching the final numerical solution of the substeps (cf. Section 7.1 for results and a comparison of embedding models). As not all high sentence embedding scores indicate a significant match, we introduce a threshold $t$ to differentiate between exact and near matches. If the similarity is at least $t$, the pair is deemed an exact match and its similarity is added to the score of its predecessors. If it is smaller than $t$, the pair can still be a near match if the accumulated score remains high enough after incurring a penalty ($-1$). The last option is a gap, which is chosen when adding a predefined gap cost $c$ to the score of either the pair of the previous student and current reference solution step, or the pair of the current student and previous reference solution step, yields a higher score than the near match.
Altogether, this forms a similarity matrix $F$ of size $(M+1) \times (N+1)$. The alignment is finally found by backtracking (moving only to adjacent entries with each step) from the bottom-right entry $F[M, N]$.
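The procedure in Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words embedding standing in for a sentence-embedding model, the values of `c` and `t` are arbitrary illustrative choices, and backtracking is implemented by storing the move taken at each cell, which is one common way to recover the alignment. Rows of `F` index reference steps and columns index student steps.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

GAP = None  # plays the role of the special gap symbol in the alignment tuples


def embed(step: str) -> Counter:
    """Toy bag-of-words embedding (assumption); a sentence encoder would go here."""
    return Counter(step.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def align(student: List[str], reference: List[str], c: float = -0.5,
          t: float = 0.8) -> List[Tuple[Optional[str], Optional[str]]]:
    """Align student steps to reference steps; returns (student, reference) tuples."""
    M, N = len(reference), len(student)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    move = [[""] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):  # first column: reference steps without a partner
        F[i][0], move[i][0] = i * c, "up"
    for j in range(1, N + 1):  # first row: student steps without a partner
        F[0][j], move[0][j] = j * c, "left"
    # Embeddings are computed once up front instead of inside the loops.
    ref_emb = [embed(s) for s in reference]
    stu_emb = [embed(s) for s in student]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sim = cosine_similarity(ref_emb[i - 1], stu_emb[j - 1])
            if sim >= t:  # exact match
                F[i][j], move[i][j] = F[i - 1][j - 1] + sim, "diag"
            else:  # near match (with penalty -1) or gap
                options = [
                    (F[i - 1][j - 1] - 1 + sim, "diag"),
                    (F[i - 1][j] + c, "up"),
                    (F[i][j - 1] + c, "left"),
                ]
                F[i][j], move[i][j] = max(options, key=lambda option: option[0])
    # Backtrack from the bottom-right entry, moving only to adjacent entries.
    alignment = []
    i, j = M, N
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            alignment.append((student[j - 1], reference[i - 1]))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":  # step missing from the student solution
            alignment.append((GAP, reference[i - 1]))
            i -= 1
        else:  # additional step in the student solution
            alignment.append((student[j - 1], GAP))
            j -= 1
    return alignment[::-1]


reference = [
    "3 apples cost 3 * 2 = 6 dollars",
    "4 pears cost 4 * 1 = 4 dollars",
    "total is 6 + 4 = 10 dollars",
]
student = [reference[0], reference[2]]  # the student skips the second step
result = align(student, reference)
```

With these toy inputs, the skipped reference step is paired with the gap symbol, which corresponds to a missing step in the student solution.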

Similar to the classification-based approach, the alignment output cannot directly be used in a response generation model but has to be converted to a formatted verification output string. For this, we use a preformatted template shown in Appendix G. The template groups together the missing, additional, and matching steps to produce $\mathbf{v}$ from the alignment produced by the NW algorithm.
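To illustrate the grouping, the sketch below turns an alignment into a verification string. The wording and layout of the template here are purely hypothetical and are not the template from Appendix G; `format_alignment` is likewise an illustrative name. An alignment is assumed to be a list of (student step, reference step) tuples with `None` marking a gap.

```python
from typing import List, Optional, Tuple

Alignment = List[Tuple[Optional[str], Optional[str]]]


def format_alignment(alignment: Alignment) -> str:
    """Group aligned steps into matching, missing, and additional steps."""
    matching = [f"{s} <-> {r}" for s, r in alignment if s is not None and r is not None]
    missing = [r for s, r in alignment if s is None and r is not None]
    additional = [s for s, r in alignment if r is None and s is not None]

    def section(title: str, items: List[str]) -> str:
        body = "\n".join(f"- {item}" for item in items) if items else "- none"
        return f"{title}:\n{body}"

    return "\n".join([
        section("Matching steps", matching),
        section("Missing steps", missing),
        section("Additional steps", additional),
    ])


example = [
    ("3 * 2 = 6", "3 * 2 = 6"),
    (None, "4 * 1 = 4"),
    ("6 + 1 = 7", None),
]
verification = format_alignment(example)
```

The resulting string can be placed into the prompt of the response generation model in the same way as any other textual verification output.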

3.2 Response Generation

Direct generation of tutor responses can be challenging because one model has to reason over the student solution, pick a teaching strategy, and generate a response in one step. This has been shown to produce hallucinations (Macina et al., 2023a). We tackle this by incorporating an additional verification step that informs the response generation model, as previously discussed. Our aim is to split the task into two less complex tasks, which should reduce errors if each task can be performed well enough. Similar decompositions have been shown to reduce hallucinations in document-grounded dialog (Adolphs et al., 2022) and question answering (Press et al., 2023).

The verifier and response generation model are combined in a two-stage approach. First, the verifier outputs a verification $\mathbf{v}$ of the student solution $\mathbf{s}_{\mathbf{q}}$ based on a reference solution $\widehat{\mathbf{s}}_{\mathbf{q}}$. Then, the response generation model is conditioned on $\mathbf{v}$, the dialog history $\mathcal{H}$, and background knowledge $\mathbf{k}$. In our work, $\mathbf{k}$ consists of the student solution $\mathbf{s}_{\mathbf{q}}$, optionally the reference solution $\widehat{\mathbf{s}}_{\mathbf{q}}$, and the math word problem $\mathbf{q}$. If $v_{\boldsymbol{\theta}'}$ is a distribution over verification labels, the overall model is:

$$p(\mathbf{y}, \mathbf{v} \mid \mathcal{H}, \mathbf{k}) = \underbrace{v_{\boldsymbol{\theta}'}(\mathbf{v} \mid \mathbf{s}_{\mathbf{q}}, \widehat{\mathbf{s}}_{\mathbf{q}})}_{\text{verification}} \cdot \underbrace{p_{\boldsymbol{\theta}}(\mathbf{y} \mid \mathcal{H}, \mathbf{k}, \mathbf{v})}_{\text{generation}} \tag{2}$$

The full model provides us with a verification output and the generated response which makes the internal reasoning of the tutor model in terms of student errors more explicit and controllable.
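The two-stage combination can be sketched as a small pipeline in which the verifier runs first and its output becomes part of the generator input. Everything below is a hedged sketch: the function name, the prompt layout, and the toy verifier and generator are assumptions made for illustration, with plain callables standing in for the verification and generation models.

```python
from typing import Callable, List, Tuple


def generate_with_verification(
    history: List[str],
    question: str,
    student_solution: str,
    reference_solution: str,
    verifier: Callable[[str, str], str],
    generator: Callable[[str], str],
) -> Tuple[str, str]:
    """Run verification first, then condition response generation on its output."""
    # Stage 1: the verifier sees the student solution and the (possibly empty) reference.
    verification = verifier(student_solution, reference_solution)
    # Stage 2: the generator is conditioned on history, knowledge, and verification.
    prompt = "\n".join([
        f"Problem: {question}",
        f"Student solution: {student_solution}",
        f"Verification: {verification}",
        "Dialog history:",
        *history,
        "Teacher:",
    ])
    return verification, generator(prompt)


def toy_verifier(student_solution: str, reference_solution: str) -> str:
    """Toy stand-in for a verifier: compares the two solutions as strings."""
    if student_solution == reference_solution:
        return "The solution is correct."
    return "The solution contains an error."


prompts_seen: List[str] = []


def toy_generator(prompt: str) -> str:
    """Toy stand-in for a response generation model; records the prompt it receives."""
    prompts_seen.append(prompt)
    return "Can you walk me through your second step again?"


verification, response = generate_with_verification(
    history=["Student: I got 18."],
    question="What is 3 * 4 + 5?",
    student_solution="3 * 4 = 12, 12 + 5 = 18",
    reference_solution="3 * 4 = 12, 12 + 5 = 17",
    verifier=toy_verifier,
    generator=toy_generator,
)
```

Returning the verification alongside the response keeps the intermediate assessment inspectable, which is what makes the modular setup more controllable than direct generation.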

4 Data Collection

We propose and evaluate various verifiers in this work. Since some of them require training data and all of them need to be evaluated, we collect a dataset of 1,002 human-produced verification outputs. This is similar in size to related corpora (Jacovi et al., 2024). In this section, we describe the annotation task and data collection.

Incorrect Student Solutions Source

Our work extends MathDial (Macina et al., 2023a) by having teachers annotate incorrect student solutions from the dataset with their first error step. There, these incorrect student solutions were used to condition a student model (InstructGPT) to generate responses in a dialogue with a human teacher.

Specifically, these problems are based on the GSM8k (Cobbe et al., 2021) dataset of multi-step math word problems. In MathDial, the reasoning chains are generated using a 2-shot CoT prompt with gpt-3.5-turbo, and temperature sampling ($T=0.7$) is used to get multiple reasoning paths ($n=50$). Finally, the most common incorrect solution is chosen. Subsequently, their student model is prompted to respond to a human teacher as a student who tries to solve a problem with a particular incorrect solution.
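The selection of the most common incorrect solution can be sketched as a simple vote over sampled reasoning paths. The sketch assumes that sampled solutions are compared by their final answer, which is an assumption of this example rather than a detail given here, and `most_common_incorrect` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Optional


def most_common_incorrect(sampled_answers: List[str], correct_answer: str) -> Optional[str]:
    """Return the most frequent sampled answer that differs from the correct one.

    Returns None if every sample is correct.
    """
    incorrect = [answer for answer in sampled_answers if answer != correct_answer]
    if not incorrect:
        return None
    return Counter(incorrect).most_common(1)[0][0]


# Toy stand-in for the final answers of several sampled reasoning paths.
samples = ["10", "12", "12", "8", "10", "12"]
chosen = most_common_incorrect(samples, correct_answer="10")
```

Choosing the most frequent wrong answer favors errors that the sampling model makes systematically over one-off slips.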

To not skew our dataset to errors, we balance it with rephrased reference solutions from the student model. We reproduce the student model prompt from MathDial to generate student responses using the reference solutions. All reference solutions and student responses with incorrect solutions are part of the dataset. Details are in Appendix A.

Student Solution Annotation

The objective of the annotation is to mark the exact step of the first error in the student solution. We do not annotate error steps after the first one to decrease ambiguity, as they frequently stem from the first error. We recruit teachers through Prolific, who first read the problem and then mark the precise step of the first error in the student solution. Teachers can access the reference solution to reduce task complexity. Details of the task, the user interface, and examples of collected data are in Appendix A. To compute agreement, 10% of the samples are annotated by one additional annotator with an inter-rater reliability of Cohen’s $\kappa=0.75$, indicating substantial agreement (Cohen, 1960). We show the distribution of incorrect student solution steps in Figure 3.
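
For reference, Cohen's kappa between two annotators can be computed as in the sketch below. It is a plain implementation of the standard definition; treating each annotation as an integer label (the index of the first error step) is an assumption about the label encoding, and the function name is ours.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)
```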

5 Experiments

We evaluate different verifiers on our dataset and use them to inform response generation models to improve their correctness. Since we extend MathDial with additional annotations, we use MathDial dialogues for evaluating tutor response generation. Besides the math problem and the student solution in a dialog, models receive as input either a model-generated CoT reference solution (marked by “solution” in the results) or no reference solution. Next, we detail metrics and models.

5.1 Metrics

For teacher response generation, we evaluate the generated output $\mathbf{u}_T$ of each model by comparing it to a human-annotated response $\widehat{\mathbf{u}}_T$ from MathDial. We report standard text generation metrics: the sacrebleu (Post, 2018) implementation of BLEU (sBLEU) to measure word overlap and BERTScore (Zhang et al., 2020) (BF1, using the all-MiniLM-L6-v2 checkpoint) to measure semantic similarity. Moreover, we report the knowledge F1 (KF1) score with respect to the grounding information (the correct solution in the case of MathDial), which has been used as a proxy for faithfulness in prior work (Daheim et al., 2024). Similar to Zheng et al. (2024) and Jurenka et al. (2024), we also prompt LLAMA3-70B as a judge, complementary to human evaluation (the same task and instructions are used in both), to assess how targeted, correct, and actionable a response is. Details about the LLM-based evaluation are found in Appendix E.
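
To make the KF1 metric concrete, the sketch below computes a unigram-overlap F1 between a response and the grounding text. Lowercasing and whitespace tokenization are simplifying assumptions, so the numbers it produces need not match the exact implementation used for the reported scores.

```python
from collections import Counter


def unigram_f1(response: str, knowledge: str) -> float:
    """Unigram F1 between a response and the grounding information."""
    # Assumption: lowercased whitespace tokens; real setups may normalize differently.
    r_tokens = response.lower().split()
    k_tokens = knowledge.lower().split()
    if not r_tokens or not k_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(r_tokens) & Counter(k_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(r_tokens)
    recall = overlap / len(k_tokens)
    return 2 * precision * recall / (precision + recall)
```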

5.2 Models

For both verification and response generation, we use different prompted or finetuned models. For verification, we compare the closed-source model GPT-3.5 to the open models LLAMA2 and LLAMA3. For the latter, we prompt the 70B version of the models and finetune LLAMA2-7B using LoRA. For response generation, we evaluate prompting GPT-3.5 and finetuning the encoder-decoder model Flan-T5 with 3B parameters. We finetune Flan-T5 again using LoRA for both the direct modeling and verify-then-generate approach.

| Setting | Model | Overall Verification: Corr. F1 | Overall Verification: Err. F1 | Overall Verification: F1 | Stepwise: micro F1 |
|---|---|---|---|---|---|
| Few-shot | GPT3.5 | 0.66 | 0.52 | 0.59 | 0.42 |
| Few-shot | GPT3.5 + solution | 0.97 | 0.97 | 0.97 | 0.61 |
| Few-shot | Llama2-70B | 0.69 | 0.38 | 0.54 | 0.17 |
| Few-shot | Llama2-70B + solution | 0.78 | 0.59 | 0.68 | 0.48 |
| Few-shot | Llama3-70B | 0.74 | 0.58 | 0.66 | 0.56 |
| Few-shot | Llama3-70B + solution | 0.90 | 0.87 | 0.89 | 0.70 |
| Finetuned | Llama2-7B | 0.89 | 0.67 | 0.78 | 0.20 |
| Finetuned | Llama2-7B + solution | 0.81 | 0.80 | 0.80 | 0.28 |

Table 1: Verifying student solutions can be challenging even for strong LLMs. Models are worse at verifying erroneous responses (Err. F1) than correct responses (Corr. F1). Providing a reference solution improves all models. Fine-tuning using our data can make models more robust when no such solution is present and make small models outperform larger prompted ones.

| Setting | Model | Variant | sBLEU | KF1 | BF1 | LLM Judge: Targeted (%) | LLM Judge: Correct (%) | LLM Judge: Actionable (%) |
|---|---|---|---|---|---|---|---|---|
| - | - | Human | 100.0 | 100.0 | 100.0 | 27 | 82 | 87 |
| Few-shot | GPT-3.5 | Baseline | 2.0 | 27.0 | 51.2 | 29 | 37 | 27 |
| Few-shot | GPT-3.5 | Error Reason | 1.5 | 22.5 | 46.7 | 34 | 40 | 56 |
| Few-shot | GPT-3.5 | Error Description | 2.8 | 30.3 | 52.6 | 62 | 66 | 45 |
| Few-shot | GPT-3.5 | Step Alignment | 2.3 | 29.8 | 53.3 | 42 | 61 | 26 |
| Finetuned | Flan-T5-3B | Baseline | 2.6 | 27.6 | 56.0 | 1 | 89 | 76 |
| Finetuned | Flan-T5-3B | Error Description | 3.0 | 26.7 | 56.0 | 2 | 92 | 84 |

Table 2: Adding an additional verification stage to ground tutor response generation models leads to responses that are more targeted at the student error, less frequently hallucinated, and more actionable for the student, both for finetuned and prompted models. Providing a textual Error Description of the student solution performs better than Step Alignment of student and reference solution, as well as providing a shorter Error Reason.

6 Results

We first show the performance of different verification models in Section 6.1 and then use verification models in response generation in Section 6.2.

6.1 Verification

In this section, we benchmark LLMs on their ability to evaluate the correctness of student solutions using the Overall Verification and Stepwise Verification approaches from Section 3.1. For Stepwise Verification we use the multi-class classification approach, because it performed better than iterative classification in our experiments. A comparison is found in Appendix B. We report F1 scores for Overall Verification, where the dataset is balanced, and micro F1 for Stepwise Verification, where the distribution of error steps is imbalanced (see Figure 3). We find in Table 1 that Overall Verification can be challenging even for state-of-the-art LLMs. All prompted models show comparably low performance when prompted without a reference solution and especially struggle with identifying incorrect responses. Providing a reference solution improves results significantly. However, for Stepwise Verification even the reference solution does not improve micro F1 beyond 0.70. This result is consistent with expert educator-based assessment (Yen and Hsu, 2023) and LLM self-correction results (Huang et al., 2024).
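
The metrics above can be sketched in a few lines. The label convention (0 for a correct solution, n for a first error in step n) follows the multi-class setup described in Appendix B; the function names are ours. In a single-label multi-class setting, micro F1 coincides with accuracy, which is why `micro_f1` reduces to a simple match rate.

```python
from typing import Sequence


def class_f1(gold: Sequence[int], pred: Sequence[int], label: int) -> float:
    """F1 of a single class, e.g. the 'correct' or the 'erroneous' class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Micro F1 for single-label multi-class prediction (equals accuracy)."""
    if not gold:
        raise ValueError("gold labels must be non-empty")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

For Overall Verification, the step labels can be collapsed to a binary label (0 versus any n > 0) before calling `class_f1` for each of the two classes.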

Interestingly, our dataset can be used effectively for finetuning. Even a smaller LLAMA2-7B model can outperform larger prompted models on Overall Verification, especially when no solution is provided. Potentially, the additional finetuning steps make it easier for the model to also solve the problem before verification. The finetuned Stepwise Verification model outperforms its larger prompted counterpart LLAMA2-70B when no solution is provided. Finetuning results are obtained with ten-fold cross-validation. Further details are in Appendix H.

6.2 Response Generation

Next, we show in Table 2 that combining verification and tutor response generation models can improve the quality of the generated responses. We compare the Error Description and Step Alignment verifiers to direct response generation and to using the Error Reason (Wang et al., 2024c), where the error is categorized as one of: guess, misinterpret, right-idea, imprecise, not sure, or careless. We use a subset of MathDial, where the student describes their solution to the teacher in the dialog, and generate the following teacher utterance.

First, we prompt GPT-3.5 using the prompt templates from Section 3.1 for comparability. We find that providing only the Error Reason does not improve over the direct baseline in simpler automatic metrics (sBLEU, KF1, BF1) but only in terms of the LLM-based judging. Using the more detailed Error Description which provides the exact mistake of the student gives larger improvements, both in terms of automatic metrics and LLM-based judging. Similarly, we find Step Alignment to be helpful, but to provide less actionable responses. When finetuning with the Error Description, we obtain improvements over the finetuned baseline but they are smaller and do not hold for each metric.

Our qualitative analysis shows that both Step Alignment and Error Description result in responses that better localize the exact student error. For example, the baseline often assesses the solution wrongly or skips the first error step and instead asks for the solution of a later step. Examples are shown in Table 10 and Table 11. Section 6.3 confirms our results by human evaluation.

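| Task | Setting | Model | Variant | Targeted ↑ | Correct ↑ | Actionable ↑ |
|---|---|---|---|---|---|---|
| Verification | - | GPT3.5 | Error Description | - | 70.6 | - |
| Verification | - | Llama3-70B | Error Description | - | 82.4 | - |
| Response Generation | - | - | Human | 35.0 | 45.0 | 42.5 |
| Response Generation | Few-shot | GPT3.5 | Baseline | 30.0 | 37.5 | 30.0 |
| Response Generation | Few-shot | GPT3.5 | Error Reason | 27.5 | 22.5 | 37.5 |
| Response Generation | Few-shot | GPT3.5 | Error Description | 57.5 | 62.5 | 45.0 |
| Response Generation | Few-shot | GPT3.5 | Step Alignment | 57.5 | 60.0 | 27.5 |
| Response Generation | Finetuned | Flan-T5-3B | Baseline | 7.5 | 20.0 | 25.0 |
| Response Generation | Finetuned | Flan-T5-3B | Error Description | 20.0 | 35.0 | 35.0 |

Table 3: Human evaluation with four expert annotators shows that verification before generation improves along the targetedness, correctness, and actionability (without telling the solution) of responses. We find that Error Description works best and improves both prompted and finetuned models.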

6.3 Human Evaluation

We conduct a human evaluation using teachers as expert annotators. All annotators are recruited on Prolific after manual screening. We assess whether the generated responses are targeted, correct, and actionable without outright telling the solution. Annotators are blind to the model source. The exact questions are as follows. 1) Targeted (T): Does the Teacher point out the root cause of the student’s mistake? 2) Correctness (C): Is the Teacher’s response factually correct with respect to the reference solution? 3) Actionable (A): Does the Teacher provide actionable steps to let the Student correct the mistake without giving away the full answer? More details are in Appendix C.

Responses from 6 models and one human response from MathDial were annotated for a random set of 40 conversations. To compute inter-rater reliability, 9 conversations were annotated with at least 2 raters for each response. Cohen’s kappa is 0.21, 0.25, and 0.13 for targeted, correctness, and actionable, respectively. For Error Description correctness it is $\kappa=0.30$. Next, we describe the results, first on verification and then for response generation.

Verification

Annotators assess the Error Description as correct if the exact mistake of the student is found. They assess it as incorrect when the model says that the solution is correct when it is not (or vice versa), misses the step of the error, or is generic. Results in Table 3 show that LLAMA3-70B outperforms GPT-3.5, finding 82.4% of the errors correctly compared to 70.6%.

Response Generation

Next, in Table 3, we evaluate how targeted, correct, and actionable the responses generated by different models are. We find that providing the Error Reason improves over the baseline only in terms of how actionable responses are. We hypothesize that conditioning on only the reason is insufficient for a targeted response. Error Description and Step Alignment provide more information regarding the exact error and therefore improve strongly over the baseline in both targetedness and correctness. Step Alignment does not improve actionability, whereas Error Description does. The same improvements also hold when using the Error Description with a finetuned model. All in all, we find strong evidence that using our verify-then-generate approach improves teacher response generation.

7 Ablation Studies

Next, we provide further ablations, first on the cost function used in the NW algorithm (Section 7.1) and then on the impact of verification before response generation based on its correctness and problem difficulty (Section 7.2).

7.1 Alignment

| Step Alignment cost function | $t^{\ast}$ | $c^{\ast}$ | Accuracy (%) |
|---|---|---|---|
| Random | 0.6 | -0.2 | 43.6 |
| Solution Match | 0.6 | -0.3 | 51.9 |
| SBERT | 0.8 | -0.1 | 58.2 |
| Roscoe | 0.9 | -0.1 | 61.4 |

Table 4: We compare different cost functions for Step Alignment with the Needleman-Wunsch algorithm based on 30 human-annotated alignments. Semantic-similarity-based cost functions (SBERT, Roscoe) perform better than a random cost or an indicator function of whether the numerical substep solutions match.

We compare different cost functions used for the NW Step Alignment algorithm in Table 4. For the comparison, 30 alignments between a student and reference solution were produced by humans and the accuracy of student solution step alignment is measured. As cost functions we use the cosine similarity of Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) embeddings and embeddings from a model trained on Roscoe (Golovneva et al., 2023), as well as a random cost and an indicator function that is 1 when two substeps have the same numerical solution and 0 otherwise. Similarity threshold $t$ and gap cost $c$ are optimized via a hyperparameter grid search, as indicated by $t^{\ast}$ and $c^{\ast}$. We find that cosine similarity works best and that fine-tuning on relevant math data, as in Roscoe, further improves performance.
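
A minimal Needleman-Wunsch sketch for step alignment is given below. Two parts are assumptions for illustration: a Jaccard token overlap stands in for the embedding-based cosine similarity, and the threshold `t` is interpreted as the minimum similarity at which two steps may be matched. The gap cost `c` is negative, as in Table 4.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def jaccard(a: str, b: str) -> float:
    """Toy similarity (stand-in for cosine similarity of sentence embeddings)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def nw_align(
    student: List[str],
    reference: List[str],
    sim: Callable[[str, str], float] = jaccard,
    t: float = 0.5,
    c: float = -0.1,
) -> List[Pair]:
    """Globally align student and reference steps.

    Returns (student_idx, reference_idx) pairs; None marks a gap.
    """
    n, m = len(student), len(reference)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * c, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * c, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim(student[i - 1], reference[j - 1])
            # Assumption: a match is only allowed when similarity reaches t.
            diag = score[i - 1][j - 1] + s if s >= t else float("-inf")
            up = score[i - 1][j] + c      # student step left unmatched
            left = score[i][j - 1] + c    # reference step left unmatched
            best = max(diag, up, left)
            score[i][j] = best
            back[i][j] = "diag" if best == diag else ("up" if best == up else "left")
    # Trace back from the bottom-right cell to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]
```

A pair `(i, None)` marks a student step without a reference counterpart, and `(None, j)` marks a reference step that the student solution lacks.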

7.2 Verification

Verification correctness is important

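| Setting | Model | Error Desc. | T | C | A |
|---|---|---|---|---|---|
| Few-shot | GPT3.5 | incorrect | 50.0 | 62.5 | 50.0 |
| Few-shot | GPT3.5 | correct | 62.5 | 62.5 | 41.7 |
| Few-shot | Llama3-70B | incorrect | 41.7 | 33.3 | 25.0 |
| Few-shot | Llama3-70B | correct | 82.1 | 71.4 | 21.4 |
| Finetuned | Flan-T5-3B | incorrect | 8.3 | 25.0 | 41.7 |
| Finetuned | Flan-T5-3B | correct | 25.0 | 39.3 | 32.1 |

Table 5: We find that tutor responses are much more often correct and targeted if the Error Description is correct. Data from human evaluation.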

We find in Table 5 that correct verification is important for subsequent response generation. If it is correct, both targetedness and correctness are strongly improved compared to when it is incorrect. However, actionability appears to decrease, which indicates less scaffolding and more "telling the solution" by the teacher.

Problem difficulty influences verification

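| Setting | Model | Steps | T | C | A | Error Desc. |
|---|---|---|---|---|---|---|
| Few-shot | GPT3.5 | 3 | 62.5 | 75.0 | 37.5 | 50.0 |
| Few-shot | GPT3.5 | 4 | 61.1 | 55.6 | 55.6 | 72.2 |
| Few-shot | GPT3.5 | 5 | 50.0 | 64.3 | 50.0 | 50.0 |
| Few-shot | Llama3-70B | 3 | 87.5 | 75.0 | 12.5 | 87.5 |
| Few-shot | Llama3-70B | 4 | 72.2 | 72.2 | 72.2 | 72.2 |
| Few-shot | Llama3-70B | 5 | 57.1 | 35.7 | 57.1 | 50.0 |
| Finetuned | Flan-T5-3B | 3 | 12.5 | 25.0 | 25.0 | 87.5 |
| Finetuned | Flan-T5-3B | 4 | 27.8 | 50.0 | 50.0 | 72.2 |
| Finetuned | Flan-T5-3B | 5 | 14.3 | 21.4 | 14.3 | 50.0 |

Table 6: For prompted models, responses for problems with shorter solution lengths are more often correct and targeted, because such problems are likely less complex. For finetuned models we do not find this trend. More steps can decrease description correctness (Error Desc.). Data from human evaluation of Error Description.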

Finally, we show in Table 6 that the performance of both verification and our verify-then-generate approach is heavily correlated with the number of reasoning steps that are used in the reference solution of a given math word problem. We use this as a proxy for problem difficulty. First of all, the performance of the LLAMA3-70B Error Description decreases with the number of steps. This is reflected in the decreased correctness and targetedness of the responses of the few-shot prompted LLAMA3 model. For GPT-3.5 we do not find a similar trend for the Error Description model, but targetedness still decreases with the number of steps. For the finetuned model we do not see similar trends but instead find the best performance for problems with four steps, likely because these are more common in the training data.

8 Discussion & Conclusion

Student errors are key learning opportunities. Tutors should recognize them and precisely guide students with targeted feedback without telling the full solution. Motivated by effective teaching practice, we split the task of tutor response generation into two separate steps of verifying the student solution and generating a response.

To evaluate our approach, we collect a dataset of around 1k teacher-annotated solutions to augment an existing math tutoring corpus. Our results show that splitting response generation into two steps can result in more targeted and correct responses that better scaffold human learning. We showcase this using both automatic evaluation and human evaluation annotated by teachers, both for prompted and finetuned models.

9 Limitations

Focus on scaffolding problem-solving

The tutoring scenarios that we consider are centered around the student problem-solving stage. In this stage, students have prior knowledge, mostly understand the learning topics, and practice them. However, different learning scenarios such as direct instruction, building rapport with students, or open-ended discussions are not considered in this work.

Evaluating student solutions and responding appropriately to a student’s mistakes is inherently challenging, even for human teachers. Furthermore, teachers should ideally give adaptive feedback depending on the problem-solving strategy chosen by the student and treat different errors in different ways to uncover any misconceptions (Nye et al., 2014). For example, in math, productive errors present important learning opportunities for students to learn from them (Kapur, 2016; Shaughnessy et al., 2021; Sinha and Kapur, 2021), e.g. by teacher-guided self-correction or targeted instruction, while unproductive errors, such as numerical miscalculations, could be easily resolved using a calculator (Lepper and Woolverton, 2002).

Difficulty of obtaining student reasoning chains

Model-generated reasoning chains might contain the same biases as human students (Opedal et al., 2024). On the other hand, there might be many additional differences from human student reasoning, e.g. students might not always stick to exact math notations or skip some steps in the explanations. However, because such data from students is sensitive, we work with model-generated reasoning solutions and responses.

Focus on multi-step problems

Procedural or multi-step problems are the basis of most scientific disciplines. Therefore, we believe our approach should be general enough to work across any science subject, especially by including retrieval-augmented generation (RAG) from textbooks. However, it is still an open research question whether a similar solution would work for language learning or fact-based problems, and how models perform in languages other than English.

Evaluation is teacher-centered and complemented with an LLM-judge

Future work should focus on student user studies with AI tutors. However, this requires careful experimental consideration and safety mechanisms. Moreover, assessment of the responses is done exclusively by teachers and therefore future work should consider running assessments of the responses by students.

10 Acknowledgements

This research work has been funded by a Swiss National Science Foundation award (#201009), a Responsible AI grant by the Haslerstiftung, by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. Nico Daheim acknowledges travel support from ELISE (GA no 951847). Jakub Macina acknowledges funding from the ETH AI Center Doctoral Fellowship, Asuera Stiftung, and the ETH Zurich Foundation. We thank Sankalan Pal Chowdhury, Kumar Shridhar, Shehzaad Dhuliawala, and Justus-Jonas Erker for valuable feedback and discussions.

11 Ethics Statement

Intended usage

The benefits of our dataset are in understanding and designing AI technology to assist teachers and students during the problem-solving stage. Most importantly, the goal of such systems is not to replace human teachers, but rather to enhance their capabilities and let them focus on important and human aspects of teaching. We will release the dataset under the CC-BY-4.0 license (https://creativecommons.org/licenses/by/4.0/deed.en) for further usage and exploration by the community. This also adheres to the licensing of MathDial, which we extend.

Data Anonymization and Privacy

As data in education are strictly confidential, we obtained approval of the collection interface, the questions, and how long the data will be stored (the study was approved by the ETH Zurich Ethics Committee (IRB) under EK-2024-N-97). All participants give informed consent at the beginning of the annotation and may withdraw without reason at any time. We store only the necessary data and do not store any personally identifiable information. The collected data are stored anonymously and securely. Moreover, no student data are used in this work.

Accessibility and Potential Misuse

Our work focuses on addressing hallucinations of LLM tutors and their generation of factually incorrect responses. This directly addresses one of the important aspects of responsible use of AI, namely not spreading incorrect information, especially in education. We encourage the community to work on this important topic by open-sourcing the dataset, the code for running the benchmarks, and the methods used in this work. These are primarily intended for research purposes. As with any AI technology, the methods and dataset could be misused. However, we believe that by open-sourcing the work we inform a wider research community about the risks and capabilities of the technology, which in turn leads to further improvements.

References

Appendix A Data Collection Details

The annotators are screened through Prolific to be teachers who are native English speakers with an overall acceptance rate of more than 98% and at least 500 submissions. We paid a minimum of $20 per hour. Annotators are from the US, Canada, and the UK, with a balanced gender ratio, and their age range is from 25 to 53 years. All annotators have K12 experience and on average they have 12 years of experience in teaching.

The annotators are first trained for the task with an interactive practice problem and then annotate student solutions. In one session one annotator performs 5 stepwise error verifications where they first pick the exact step with the error and then classify the error into 8 categories, with separate descriptions of the error for each category. We filter out all error descriptions not following the prescribed format to remove low-quality annotations.

The interface is shown in Figure 2. The categories are: missing or incorrect factual knowledge, misunderstanding of the question, the reference solution reached but proceed further, missing quantity, extra quantity, unit conversion error, numerical calculation, other.

Figure 2: User interface for annotating the step of the first error, their categorization, and description of the error.

A.1 Dataset Details

The collected dataset is in English and from the domain of K12 math word problem-solving. Examples from the dataset are shown in Table 7. The dataset consists of 1002 data points with 612 unique math problems. The distribution of total student steps and the location of the first incorrect steps are shown in Figure 3. Notice that the student solutions contain up to 11 steps with a mean of ca. 6 steps. The location of the first error ranges from step 1 to step 8, with the majority of the errors between the first and third steps.

The incorrect student solutions and reference solutions are part of the MathDial dataset (Macina et al., 2023a). The prompt used to generate correct student responses, which balance the dataset, is based on the student model from Macina et al. (2023a).

Figure 3: Dataset Distribution. The index of the step with the first error annotated by teachers and the total student solution steps.
Problem: Carl has been selling watermelons on the side of the road for $3 each. This evening he went home with $105 in profit and 18 watermelons. How many watermelons did he start out with this morning?
Reference solution: Carl sells his watermelons for $3 each so today he sold $105 / $3 per watermelon = 35 watermelons. He had 18 watermelons left over, so this morning he started with 18 + 35 = 53 watermelons.
Incorrect solution: Let’s start by finding out how much money Carl made selling the 18 watermelons. Since he sold each watermelon for $3, he made 18 x $3 = $54. We know that his total profit for the day was $105, so he must have started with $105 - $54 = $51 worth of watermelons. Since each watermelon costs $3, he must have started with 51 / $3 = 17 watermelons.
Student response: I started by finding out how much money Carl made selling the 18 watermelons. Since he sold each watermelon for $3, he made 18 x $3 = $54. Then I realized that his total profit for the day was $105, so he must have started with $105 - $54 = $51 worth of watermelons. Since each watermelon costs $3, I concluded that he must have started with 51 / $3 = 17 watermelons.
Error category: Misunderstanding of a question
Error description: Carl did not sell 18 watermelons, but 18 watermelons are left unsold.
Problem: Dakota gets hit by a bus and has to spend 3 days in the hospital. The hospital charges her $900/day for her bed, $250/hour for two specialists to look at her 15 minutes each, and $1800 for the ambulance ride. How much is Dakota’s medical bill?
Reference solution: First find the total cost of the hospital bed by multiplying the cost per day by the number of days: $900/day * 3 days = $2700. Then figure out how many minutes total the two specialists looked at Dakota: 15 minutes/specialist * 2 specialists = 30 minutes Then divide the number of minutes by the number of minutes per hour and multiply it by the specialists’ hourly rate to find their total charge: 30 minutes / 60 minutes/hour * $250/hour = $125. Finally, add up the cost of the specialists, ambulance and bed to find Dakota’s total bill: $2700 + $1800 + $125 = $4625
Incorrect solution: The hospital charges Dakota $900 x 3 = $2700 for her bed. Each specialist charged her $250/hour x 2 = $500 for their 15 minutes each. So, Dakota was charged $500 x 2 = $1000 for the two specialists. Therefore, her medical bill is $2700 + $1000 + $1800 = $5500
Student response: I started by calculating the cost of the bed, which was $900 x 3 days = $2700. Then I calculated the cost of the two specialists, which was $250/hour x 2 specialists x 15 minutes each = $500. Then I added all the costs together to get the total cost of $2700 + $1000 + $1800 = $5500
Error category: Misunderstanding of a question
Error description: Student computes charges for a full hour of 2 specialists, not just 15 minutes as indicated in the question.
Table 7: Examples from the collected dataset. The annotated error lines are in red.

Appendix B Details of Overall Verification and Stepwise Verification

For Stepwise Verification, we compare the multi-class classification and iterative approaches on our dataset; the results are in Table 8. The iterative approach classifies for each step $\mathbf{s}_n$ whether it is correct and is therefore more resource-intensive than multi-class classification. The multi-class classification directly predicts a label in $\{0,\dots,N\}$, where 0 indicates that the solution is correct. The results indicate no improvements (with the exception of Llama2-70B) from using the iterative approach, and in the main paper we therefore report multi-class classification results.
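
The iterative approach can be sketched as a loop over steps that stops at the first step judged incorrect. The step checker below is a hypothetical arithmetic-only stand-in for an LLM call, and passing the previous steps as context is an assumption. The return value uses the same label space as the multi-class approach (0 for a fully correct solution).

```python
from typing import Callable, List


def iterative_first_error(
    steps: List[str],
    step_is_correct: Callable[[List[str], str], bool],
) -> int:
    """Return the 1-based index of the first incorrect step, or 0 if none."""
    for n, step in enumerate(steps, start=1):
        # One binary decision per step, hence up to N classifier calls.
        if not step_is_correct(steps[: n - 1], step):
            return n
    return 0


def toy_step_checker(previous: List[str], step: str) -> bool:
    """Toy checker: verifies steps of the form 'a op b = c' arithmetically.

    It only catches calculation errors and ignores the previous steps.
    """
    left, right = step.split("=")
    a, op, b = left.split()
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "/": lambda x, y: x / y,
    }
    return ops[op](float(a), float(b)) == float(right)
```

The multi-class approach replaces the loop with a single call that predicts the label directly, which is why it needs fewer model calls.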

Moreover, to confirm the quality of our collected dataset, we run the same models on the smaller and simpler Roscoe human evaluation set (Golovneva et al., 2023), which contains 105 correct and 95 incorrect solutions. The results are shown in Table 9 and the conclusions are identical to those on our dataset.

| Model | Approach | micro F1 |
|---|---|---|
| GPT3.5 | multi-class | 0.42 |
| GPT3.5 + solution | multi-class | 0.61 |
| GPT3.5 | iterative | 0.39 |
| GPT3.5 + solution | iterative | 0.36 |
| Llama2-70B | multi-class | 0.17 |
| Llama2-70B + solution | multi-class | 0.48 |
| Llama2-70B | iterative | 0.36 |
| Llama2-70B + solution | iterative | 0.58 |
| Llama3-70B | multi-class | 0.56 |
| Llama3-70B + solution | multi-class | 0.70 |
| Llama3-70B | iterative | 0.56 |
| Llama3-70B + solution | iterative | 0.58 |

Table 8: Results of two approaches for Stepwise verification and their micro F1 score. Multi-class classification directly predicts the incorrect step $N$. On the other hand, the iterative approach iterates over each step and runs a binary prediction of whether the step is correct until the first incorrect step is found.

| Setting | Model | Overall Verification: Corr. F1 | Overall Verification: Err. F1 | Overall Verification: F1 | Stepwise: micro F1 |
|---|---|---|---|---|---|
| Few-shot | GPT3.5 | 0.75 | 0.65 | 0.70 | 0.50 |
| Few-shot | GPT3.5 + solution | 0.91 | 0.88 | 0.89 | 0.63 |
| Few-shot | Llama2-70B | 0.07 | 0.62 | 0.34 | 0.11 |
| Few-shot | Llama2-70B + solution | 0.80 | 0.82 | 0.81 | 0.59 |

Table 9: Stepwise verification results on the small existing dataset from human evaluation of Roscoe (Golovneva et al., 2023). The Stepwise verification contains multi-class classification results. The results of the models are consistent with our dataset.

Figure 4: User interface for explaining the error of the student and evaluation of two error descriptions from models. Afterwards, annotators evaluate the quality of the model responses.

Figure 5: User interface for evaluation of the quality of the model responses. Some responses contain attention checks (second question in this case).

Appendix C Guidelines for Human Evaluation

The user interface used in the human evaluation is shown in Figure 4 and Figure 5. All annotators had to complete a training for the task in which each of their responses was evaluated and feedback was provided to them. We used a subset of the annotators from Appendix A with the same selection conditions and the same payment. Before evaluating the quality of responses, annotators are asked to analyze the math problem and the conversation and to explain the student error in open-ended text. To not bias their understanding of the student solution, the error descriptions from verifiers are shown only afterwards and are annotated for correctness using these instructions: Does the text above correctly describe the root cause of the first student’s mistake? Answer "No" if the correct part of the student solution is identified as incorrect. Answer "No" if is too general without any further details e.g. ’There is a small mistake’.

The exact wording of the annotation questions for evaluating the quality of responses is the following:

Targeted

Does the Teacher point out to the root cause of the student’s mistake? Answer ’No’ if the Teacher gives the right answer without pointing out the mistake. Answer ’No’ if the Student’s statement is wrong and the Teacher does not point out the mistake directly. Answer ’Yes’ if the Teacher correctly describes the mistake in the student’s solution. Answer ’No’ if the Teacher addresses the correct part of the student solution. Answer ’No’ if response is too general and could be applied to any mistake e.g. ’You made a small mistake’.

Correctness

Is the Teacher’s response factually correct with respect to the reference solution? The teacher should NOT say incorrect information or provide parts of the solution that are NOT correct with respect to the reference solution. Answer ’No’ if the Teacher provides parts of a solution that is incorrect or does not guide a student towards the reference solution. Carefully compare the reference solution and the Teacher’s response.

Actionable

Does the Teacher provide actionable steps to let the Student correct the mistake without giving away the full answer? The teacher should provide actionable hints or steps WITHOUT revealing the full reference solution. Instead, the Tutor should give hints or ask questions to help the Student find the solution by themselves. Answer ’No’ if the Teacher simply just reveals the full reference solution.

Appendix D Alignment Details

To find the best hyperparameters for the Alignment algorithm we run a grid search using values of similarity threshold $t = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]$ and gap costs $c = [-0.1, -0.2, -0.3, -0.5, -0.7, -1.0, -1.2]$. The best hyperparameters are reported in Table 4. The exact models which are used for semantic similarity are SBERT (sentence-transformers/all-mpnet-base-v2) and Roscoe (facebook/roscoe-512-roberta-base).

We use a template to transform the output of the alignment algorithm into a textual prompt. The template uses all steps from both the student solution and the reference solution. Furthermore, the cost of the alignment could be used to filter out student solutions that differ completely from the reference solution, which we leave for future work. The template is the following:

Missing steps in student solution: {missing steps}
Unnecessary steps in the student solution: {unnecessary steps}
Matching steps: {matching steps}
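The following is a minimal sketch of how an alignment result could be turned into the template above. It assumes a Needleman-Wunsch-style global alignment in which two steps may only be matched if their similarity is at least $t$, and each unmatched step incurs the gap cost $c$. The token-overlap similarity `jaccard` is a toy stand-in for the SBERT or Roscoe embedding similarity, and all function names, the placeholder names with underscores, and the `"; "` separator are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

TEMPLATE = (
    "Missing steps in student solution: {missing_steps}\n"
    "Unnecessary steps in the student solution: {unnecessary_steps}\n"
    "Matching steps: {matching_steps}"
)


def jaccard(a: str, b: str) -> float:
    """Toy token-overlap similarity; a stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def align(
    student: List[str],
    reference: List[str],
    sim: Callable[[str, str], float] = jaccard,
    t: float = 0.5,
    c: float = -0.5,
) -> Tuple[List[Tuple[str, str]], List[str], List[str], float]:
    """Global alignment: steps match only if sim >= t; every gap costs c (< 0)."""
    n, m = len(student), len(reference)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = score[i - 1][0] + c, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = score[0][j - 1] + c, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim(student[i - 1], reference[j - 1])
            options = []
            if s >= t:  # a match is only allowed above the threshold
                options.append((score[i - 1][j - 1] + s, "diag"))
            options.append((score[i - 1][j] + c, "up"))    # skip a student step
            options.append((score[i][j - 1] + c, "left"))  # skip a reference step
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Follow the back-pointers from the bottom-right cell to recover the alignment.
    matching, unnecessary, missing = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            matching.append((student[i - 1], reference[j - 1]))
            i, j = i - 1, j - 1
        elif move == "up":
            unnecessary.append(student[i - 1])
            i -= 1
        else:
            missing.append(reference[j - 1])
            j -= 1
    return matching[::-1], unnecessary[::-1], missing[::-1], score[n][m]


def to_prompt(matching, unnecessary, missing) -> str:
    """Fill the template with the three groups of steps."""
    return TEMPLATE.format(
        missing_steps="; ".join(missing),
        unnecessary_steps="; ".join(unnecessary),
        matching_steps="; ".join(s for s, _ in matching),
    )


student_steps = ["6 birds left", "6 puppies left", "i used a calculator"]
reference_steps = ["6 birds left", "6 puppies left", "8 spiders left"]
matching, unnecessary, missing, cost = align(student_steps, reference_steps)
print(to_prompt(matching, unnecessary, missing))
```

The returned alignment score is the quantity that could be thresholded to filter out student solutions that differ completely from the reference.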

Appendix E Details on LLM-based Evaluation

A response is targeted if it addresses the student's mistake, correct if it does not conflict with the grounding information, and actionable if it provides the student with useful guidance to progress in their solution attempt. For each quality dimension, we provide the model with three examples (3-shot). We use LLAMA3-70B (meta-llama/Meta-Llama-3-70B-Instruct) with temperature $T=0$ for reproducibility. The task description and the examples are the same as those given to the annotators in the human evaluation described in Section 6.3. The prompt also includes the reference solution for more reliable judging (Zheng et al., 2024; Jurenka et al., 2024).

Appendix F Qualitative examples

In this section, we show qualitative examples to better understand the behavior of verification and verification-based response generation. We first show examples for prompted models in Table 10 and then show examples for finetuned models in Table 11.

Problem: At a cafe, the breakfast plate has two eggs and twice as many bacon strips as eggs. If 14 customers order breakfast plates, how many bacon strips does the cook need to fry?
Correct Solution: Each breakfast plate has 2 x 2 = 4 bacon strips. The cook needs to fry 14 x 4 = 56 bacon strips.
Error Description Verification: The student incorrectly wrote the number of bacon strips per plate as 2 + 2x, where x is the number of eggs, instead of simply 2 x 2 = 4 bacon strips per plate. This led to an incorrect total of 84 bacon strips.
Dialogue
Teacher: Hi Claire could you please walk me through your solution
Student: Sure! For each breakfast plate, there are 2 eggs and twice as many bacon strips as eggs. So, for each breakfast plate, there are 2 + 2x bacon strips, where x is the number of eggs. Since each breakfast plate has two eggs, x = 2. So, for each breakfast plate, there are 2 + 2x bacon strips = 2 + 2(2) = 6 bacon strips. Thus, for 14 breakfast plates, the cook needs to fry 6 x 14 = 84 bacon strips.
Generated Responses
Human Teacher: Let’s work on simpler problem. If I have 2 dogs for every cat and I have 5 cats, I should have double the number of dogs. How many dogs will I have
Baseline: It looks like you made a small mistake in your solution. You correctly found the number of bacon strips for one breakfast plate, but you used the wrong value for x. Since each breakfast plate has 2 eggs, x should be 2, not 2 eggs. So the correct number of bacon strips for one breakfast plate is 2 + 2(2) = 6, and for 14 plates, the cook needs to fry 6 x 14 = 84 bacon strips. Great effort!
Error reason: It looks like you made a small mistake in your calculation. You correctly found that for each breakfast plate, there are 6 bacon strips, but when you multiplied by 14, you should have gotten 84, not 6 x 14 = 84. So the correct answer is 6 bacon strips per plate multiplied by 14 plates equals 84 bacon strips.
Error description: It looks like there was a mistake in your calculation for the number of bacon strips for each breakfast plate. Remember that the breakfast plate has twice as many bacon strips as eggs, so for each plate, there are 2 eggs and 4 bacon strips. Can you try recalculating the total number of bacon strips needed for 14 breakfast plates?
Alignment: It looks like you made a small mistake in your solution. The breakfast plate has 2 eggs, so it has 2 x 2 = 4 bacon strips. If 14 customers order breakfast plates, the cook needs to fry 14 x 4 = 56 bacon strips. The answer is 56.
Table 10: Examples of responses generated by GPT-3.5 prompted models for the same problem.
Problem: A pet store owner had 12 birds, 9 puppies, 5 cats, and 15 spiders. Half the birds were sold and 3 puppies were adopted. Then, someone left the spider enclosure open and 7 of them went loose. How many animals does the pet store owner have left?
Correct Solution: There are 12/2 = 6 birds left. There are 9-3= 6 puppies left. There are 15 - 7 = 8 spiders left. The pet store owner has 5+6 + 6 + 8 = 25 animals left.
Verification: The student failed to include the 5 cats in their final count, resulting in an incorrect total of 20 animals instead of 25.
Dialogue
Teacher: Hi , could you please walk me through your solution?
Student: Sure, I started by taking half of the 12 birds, which is 6 birds. Then, I subtracted 6 from 12 to get the number of birds left, which is 6. After that, I subtracted 3 from 9 to get the number of puppies left, which is 6. Finally, I subtracted 7 from 15 to get the number of spiders left, which is 8. So, the pet store owner has 6 birds, 6 puppies, and 8 spiders left, which adds up to 20 animals.
Generated Responses
Human Teacher: i think you missed something. Cats?
Baseline: How many birds were sold and how many puppies were adopted?
Ours: How many cats did the pet store owner have?
Problem: Every year, Tabitha adds a new color to her hair. She started this tradition when she was 15 years old, which was the year she added her second hair color. In three years, Tabitha will have 8 different colors in the hair. Currently, how old is Tabitha?
Correct solution: Since, three years from now, Tabitha will have 8 different colors in her hair, this year she has 8 - 3 = 5 colors in her hair. Since she had 2 colors in her hair at the age of 15, she has added 5-2=3 colors since she was 15 years old. Since she ads one color per year, 3 added colors = 3 added years. Thus, currently, she is 15+3=18 years old.
Verification: The student incorrectly assumed that Tabitha is currently 21 years old, without considering the information that in three years, she will have 8 different colors in her hair. The student failed to work backwards to find Tabitha’s current age.
Dialogue
Teacher: Hi Jordy. Tell me about your answer
Student: I think my answer is correct. I used a calculator and I didn’t make any calculation errors. I figured that since Tabitha started adding a new color to her hair when she was 15, and she has added a new color every year since then, she must be 21 years old now.
Generated Response
Human: Let’s approach this working backward. If in three years, Tabitha will have 8 colors in her hair, how many colors does she have in her hair now?
Baseline: How many years has Tabitha been adding a new color to her hair?
Ours: How many different colors does Tabitha have in her hair now?
Table 11: Qualitative examples of finetuned response generation models.

Appendix G Prompts

This section provides the exact prompts used in our work. First, we show the prompt used for the baseline, error description-based, and alignment-based response generation models in Fig. 6 and Fig. 9. Verification prompts for Error Description are in Fig. 8 and for Error Reason in Fig. 7. The prompt with 5 examples for the CoT solution generation is in Fig. 10. Then, we show the prompts used for targeted LLM-based evaluation in Fig. 11, correctness evaluation in Fig. 12, and evaluation of how actionable responses are in Fig. 13. To sample responses from models by prompting we use temperature $T=0$ for reproducibility.

You are an experienced teacher and you are going to respond to a student. The problem your student is solving is on topic: {topic}. Problem: {problem} {conversation} Teacher (maximum two sentences):
Figure 6: Response generation prompt for the direct baseline. {problem} is a placeholder for the problem the student is solving, {topic} is the learning topic, and {conversation} is a conversation history.
You are an experienced teacher. Your task is to read a conversation snippet of a tutoring session between a student and tutor, and determine what type of error the student makes in the conversation. We have a list of common errors that students make in math, which you can pick from. We also give you the option to write in your own error type if none of the options apply. Error list: 0. Student does not seem to understand or guessed the answer. 1. Student misinterpreted the question. 2. Student made a careless mistake. 3. Student has the right idea, but is not quite there. 4. Student’s answer is not precise enough or the tutor is being too picky about the form of the student’s answer. 5. None of the above, but I have a different description (please specify in your reasoning). 6. Not sure, but I’m going to try to diagnose the student. Here is the conversation snippet: Lesson topic: {topic}. Problem: {problem} {conversation} Why do you think the student made this mistake? Pick an option number from the error list and provide the reason behind your choice. Format your answer as: {"answer": #, "reason": "write out your reason for picking # here"}
Figure 7: Verification for Error reason baseline (Wang et al., 2024c). {topic} is the learning topic, {problem} is a placeholder for the problem the student is solving, and {conversation} is a conversation history.
You are an experienced math teacher. Your goal is to identify the correctness of the Student’s Solution to a Problem. Problem: {problem} Expected reference solution: {solution} {conversation} Q: Find the first error in the student solution compared to the expected reference solution and write a one line description. If no error, write "Student’ solution is Correct". A:
Figure 8: Verification prompt for Error description of the first student error. {problem} is a placeholder for the problem the student is solving, {solution} is a solution generated from the same model using CoT prompt in Figure 10, and {conversation} is a conversation history.
You are an experienced teacher and you are going to respond to a student. The problem your student is solving is on topic: {topic}. Problem: {problem} Assessment of student solution: {description} {conversation} Teacher (maximum two sentences):
Figure 9: Response generation for Error reason baseline, Error description, and Alignment generation. {problem} is a placeholder for the problem the student is solving, {topic} is the learning topic, {conversation} is a conversation history, {description} is the result of the particular verification step.
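To make the two-stage setup concrete, the sketch below fills the Figure 9 prompt with the output of a verification step. The template wording is copied from Figure 9; the helper names, the one-line-per-turn serialization of the conversation, and the example topic string are assumptions, as the exact formatting of {conversation} is not specified here.

```python
from typing import List, Tuple

# Prompt wording copied from Figure 9.
RESPONSE_TEMPLATE = (
    "You are an experienced teacher and you are going to respond to a student. "
    "The problem your student is solving is on topic: {topic}. "
    "Problem: {problem} "
    "Assessment of student solution: {description} "
    "{conversation} "
    "Teacher (maximum two sentences):"
)


def format_conversation(turns: List[Tuple[str, str]]) -> str:
    """Assumed serialization: one 'Speaker: utterance' line per turn."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


def build_response_prompt(
    topic: str, problem: str, description: str, turns: List[Tuple[str, str]]
) -> str:
    """Insert the verifier output ({description}) into the generation prompt."""
    return RESPONSE_TEMPLATE.format(
        topic=topic,
        problem=problem,
        description=description,
        conversation=format_conversation(turns),
    )


prompt = build_response_prompt(
    topic="Math word problems",  # hypothetical topic label
    problem="A pet store owner had 12 birds, 9 puppies, 5 cats, and 15 spiders. ...",
    description="The student failed to include the 5 cats in their final count.",
    turns=[
        ("Teacher", "Hi, could you please walk me through your solution?"),
        ("Student", "The pet store owner has 6 birds, 6 puppies, and 8 spiders left."),
    ],
)
print(prompt)
```

The same function covers the Error reason, Error description, and Alignment variants, since they differ only in what is passed as `description`.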
You are a highly intelligent question answering assistant. Solve the question step-by-step. Always finish the answer by providing your final answer after ’The answer is’. Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. The answer is 72 Question: Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read? Answer: Maila read 12 x 2 = <<12*2=24>>24 pages today. So she was able to read a total of 12 + 24 = <<12+24=36>>36 pages since yesterday. There are 120 - 36 = <<120-36=84>>84 pages left to be read. Since she wants to read half of the remaining pages tomorrow, then she should read 84/2 = <<84/2=42>>42 pages. The answer is 42 Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Answer: Weng earns 12/60 = <<12/60=0.2>>0.2 per minute. Working 50 minutes, she earned 0.2 x 50 = <<0.2*50=10>>10. The answer is 10 Question: The profit from a business transaction is shared among 2 business partners, Mike and Johnson in the ratio 2:5 respectively. If Johnson got $2500, how much will Mike have after spending some of his share on a shirt that costs $200? Answer: According to the ratio, for every 5 parts that Johnson gets, Mike gets 2 parts Since Johnson got $2500, each part is therefore $2500/5 = $<<2500/5=500>>500 Mike will get 2*$500 = $<<2*500=1000>>1000. After buying the shirt he will have $1000-$200 = $<<1000-200=800>>800 left. The answer is 800 Question: Ralph is going to practice playing tennis with a tennis ball machine that shoots out tennis balls for Ralph to hit. 
He loads up the machine with 175 tennis balls to start with. Out of the first 100 balls, he manages to hit 2/5 of them. Of the next 75 tennis balls, he manages to hit 1/3 of them. Out of all the tennis balls, how many did Ralph not hit? Answer: Out of the first 100 balls, Ralph was able to hit 2/5 of them and not able to hit 3/5 of them, 3/5 x 100 = 60 tennis balls Ralph didn’t hit. Out of the next 75 balls, Ralph was able to hit 1/3 of them and not able to hit 2/3 of them, 2/3 x 75 = 50 tennis balls that Ralph didn’t hit. Combined, Ralph was not able to hit 60 + 50 = <<60+50=110>>110 tennis balls Ralph didn’t hit. The answer is 110 Question: {problem} Answer:
Figure 10: Prompt for the chain-of-thought (CoT) reference solution generation. {problem} is a placeholder for the problem the student is solving.
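Because the prompt in Figure 10 asks the model to end with 'The answer is', the final answer of a generated reference solution can be recovered with a simple pattern match. The sketch below is one possible way to do this and is not taken from the paper; the function names and the handling of currency symbols and thousands separators are assumptions. It also removes the `<<...>>` calculator annotations that appear in the few-shot examples.

```python
import re
from typing import Optional

# Matches e.g. "The answer is 72", "The answer is $1,000", "The answer is -3.5".
_ANSWER_RE = re.compile(r"The answer is\s*\$?\s*(-?\d[\d,]*(?:\.\d+)?)")
# Matches calculator annotations such as "<<48/2=24>>".
_CALC_RE = re.compile(r"<<[^<>]*>>")


def strip_calculator_annotations(solution: str) -> str:
    """Remove <<...>> annotations so only the readable steps remain."""
    return _CALC_RE.sub("", solution)


def extract_final_answer(solution: str) -> Optional[float]:
    """Return the number after the last 'The answer is', or None if absent."""
    matches = _ANSWER_RE.findall(solution)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


generated = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May. "
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. "
    "The answer is 72"
)
print(extract_final_answer(generated))
print(strip_calculator_annotations(generated))
```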
You are a Critic giving feedback on the Teacher’s tutoring skills. As the Critic you assess whether the Teacher points out the root cause of the Student’s mistake. Answer ’No’ if the Student’s statement is wrong and the Teacher does not point out the mistake directly. Answer ’No’ if the Teacher gives the right answer without pointing out the mistake. Answer ’Yes’ if the Teacher correctly describes the mistake in the student’s solution. Answer ’No’ if the Teacher addresses the correct part of the student solution. Answer ’No’ if response is too general and could be applied to any mistake e.g. ’You made a small mistake’. Does the Teacher points out to the root cause of the student’s mistake? Output a ’Yes’ or ’No’ decision and a short rationale in the format Decision: <decision>. Rationale: <rationale>. [Examples] Problem: The profit from a business transaction is shared among 2 business partners, Mike and Johnson in the ratio 2:5 respectively. If Johnson got $2500, how much will Mike have after spending some of his share on a shirt that costs $200? Solution: According to the ratio, for every 5 parts that Johnson gets, Mike gets 2 parts. Since Johnson got $2500, each part is therefore $2500/5 = $500. Mike will get 2*$500 = $1000. After buying the shirt he will have $1000-$200 = $800 left [Dialogue] Teacher: Could you please walk me through your solution? Student: Sure. I figured that Jonsons’ profit of $2500 is equivalent to 7 parts, so each part is worth $357.14. Since Mike got 2 parts, his total portion is 714.28, of which he spent 200, leaving him with 514.28 [Example 1] Teacher response: Great job on correctly calculating the value of each part and determining Mike’s remaining amount after spending on the shirt. Your final answer is correct. Request: Does the Teacher point out the root cause of the student’s mistake? Critic: Decision: No. Rationale: The teacher does not identify any error in the student’s solution. 
Moreover, it acknowledges the student’s incorrect solution as correct. [Example 2] Teacher response: Your calculation is incorrect. Can you try to double-check the parts each gets? Request: Does the Teacher point out the root cause of the student’s mistake? Critic: Decision: No. Rationale: The teacher just generally state your solution is incorrect and this is not specific to the student error. [Example 3] Teacher response: Johnson’s $2500 represents 5 parts, not 7, so each part is $500, and Mike’s share before buying the shirt is $1000, not $714.28. So the solution is $800. Request: Does the Teacher point out the root cause of the student’s mistake? Critic: Decision: Yes. Rationale: The Teacher provides very specific identification by directly stating the problem is in using the wrong number of parts. [The End of Examples] Problem: {problem} Reference solution: {correct answer} [Dialogue] {dialog history} Teacher response: {response} Request: Does the Teacher point out the root cause of the student’s mistake? Critic:
Figure 11: Prompt for targeted evaluation.
You are a Critic giving feedback on the correctness of the Teacher who is interacting with a Student. The teacher should NOT say incorrect information or provide parts of the solution that are NOT correct with respect to the reference solution. Answer ’No’ if the Teacher provides parts of a solution that is incorrect or does not guide a student towards the reference solution. Is the Teacher’s response factually correct with respect to the reference solution? Output a ’Yes’ or ’No’ decision and a short rationale in the format Decision: <decision>. Rationale: <rationale>. Carefully compare the reference solution and the Teacher’s response. [Examples] Problem: The profit from a business transaction is shared among 2 business partners, Mike and Johnson in the ratio 2:5 respectively. If Johnson got $2500, how much will Mike have after spending some of his share on a shirt that costs $200? Solution: According to the ratio, for every 5 parts that Johnson gets, Mike gets 2 parts. Since Johnson got $2500, each part is therefore $2500/5 = $500. Mike will get 2*$500 = $1000. After buying the shirt he will have $1000-$200 = $800 left [Dialogue] Teacher: Could you please walk me through your solution? Student: Sure. I figured that Jonsons’ profit of $2500 is equivalent to 7 parts, so each part is worth $357.14. Since Mike got 2 parts, his total portion is 714.28, of which he spent 200, leaving him with 514.28 [Example 1] Teacher response: Your calculation is incorrect. Can you try to double-check the parts each gets? Request: Is the Teacher’s response factually correct with respect to the reference solution? Critic: Decision: Yes. Rationale: The Teacher’s response correctly states there is a mistake in the student’s calculation and ask a question. Nothing factually incorrect is said. [Example 2] Teacher response: Johnson’s $2500 represents 5 parts, not 7, so each part is $500, and Mike’s share before buying the shirt is $1000, not $714.28. So the solution is $800. 
Request: Is the Teacher’s response factually correct with respect to the reference solution? Critic: Decision: Yes. Rationale: The Teacher’s response is stating part of the reference solution and it is factually correct. [Example 3] Teacher response: Great job on correctly calculating the value of each part and determining Mike’s remaining amount after spending on the shirt. Your final answer is correct. Request: Is the Teacher’s response factually correct with respect to the reference solution? Critic: Decision: No. Rationale: The Teacher acknowledges the student’s incorrect solution as correct, which is not factually correct given the correct answer. [The End of Examples] Problem: {problem} Reference solution: {correct answer} [Dialogue] {dialog history} Teacher response: {response} Request: Is the Teacher’s response factually correct with respect to the reference solution? Critic:
Figure 12: Prompt for correctness evaluation.
You are a Critic giving feedback on the responses of the Teacher who is interacting with a Student. Your task is to gauge if the Teacher’s Response provides actionable hints or steps without revealing the full solution. The Student could use this response to move closer to the final correct answer. A good response could also be a follow-up question that makes the user think about how to solve the problem. Does the Teacher provide actionable steps without giving away the full answer? Output a ’Yes’ or ’No’ decision and a short rationale in the format Decision: <decision>. Rationale: <rationale>. Answer ’No’ if the Teacher simply just reveals the full reference solution. [Examples] Problem: The profit from a business transaction is shared among 2 business partners, Mike and Johnson in the ratio 2:5 respectively. If Johnson got $2500, how much will Mike have after spending some of his share on a shirt that costs $200? Solution: According to the ratio, for every 5 parts that Johnson gets, Mike gets 2 parts. Since Johnson got $2500, each part is therefore $2500/5 = $500. Mike will get 2*$500 = $1000. After buying the shirt he will have $1000-$200 = $800 left [Dialogue] Teacher: Could you please walk me through your solution? Student: Sure. I figured that Jonsons’ profit of $2500 is equivalent to 7 parts, so each part is worth $357.14. Since Mike got 2 parts, his total portion is 714.28, of which he spent 200, leaving him with 514.28 [Example 1] Teacher response: Your calculation is incorrect. Can you try to double-check the parts each gets? Request: Does the Teacher provide actionable steps without giving away the full answer? Critic: Decision: Yes. Rationale: the teacher asks a question or ask for action from the student to double-check the answer. [Example 2] Teacher response: Johnson’s $2500 represents 5 parts, not 7, so each part is $500, and Mike’s share before buying the shirt is $1000, not $714.28. So the solution is $800. 
Request: Does the Teacher provide actionable steps without giving away the full answer? Critic: Decision: No. Rationale: The teacher states the reference solution at the end. [Example 3] Teacher response: Johnson’s $2500 represents 5 parts, not 7, so each part is $500, and Mike’s share before buying the shirt is $1000, not $714.28. Request: Does the Teacher provide actionable steps without giving away the full answer? Critic: Decision: Yes. Rationale: The teacher points out what is wrong with the student’s solution but do not tell the full correct answer. [The End of Examples] Problem: {problem} Reference solution: {correct answer} [Dialogue] {dialog history} Teacher response: {response} Request: Does the Teacher provide actionable steps without giving away the full answer? Critic:
Figure 13: Prompt for actionable evaluation.
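The three evaluation prompts request output in the format 'Decision: <decision>. Rationale: <rationale>.', so a judge's completion can be parsed into a binary label and aggregated into a per-dimension rate. The following sketch shows one way to do so; the function names and the choice to skip unparseable outputs are assumptions rather than a description of the authors' pipeline.

```python
import re
from typing import Iterable, Optional, Tuple

_JUDGMENT_RE = re.compile(
    r"Decision:\s*(Yes|No)\b\.?\s*(?:Rationale:\s*(.*))?",
    re.IGNORECASE | re.DOTALL,
)


def parse_judgment(output: str) -> Optional[Tuple[bool, str]]:
    """Return (decision, rationale), or None if no decision can be found."""
    match = _JUDGMENT_RE.search(output)
    if match is None:
        return None
    decision = match.group(1).lower() == "yes"
    rationale = (match.group(2) or "").strip()
    return decision, rationale


def yes_rate(outputs: Iterable[str]) -> float:
    """Fraction of 'Yes' decisions among outputs that could be parsed."""
    decisions = [p[0] for p in map(parse_judgment, outputs) if p is not None]
    return sum(decisions) / len(decisions) if decisions else float("nan")


outputs = [
    " Decision: No. Rationale: The teacher does not identify any error.",
    "Decision: Yes. Rationale: The Teacher states the wrong number of parts.",
    "unparseable output",
]
print(parse_judgment(outputs[0]))
print(yes_rate(outputs))
```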

Appendix H Finetuning Details

We finetune all models by extending the huggingface transformers library (Wolf et al., 2020) and using the checkpoints from the huggingface hub in accordance with the corresponding license agreements.

For verification, we finetune LLAMA2 with 7B parameters using LoRA. We use a learning rate of $1 \cdot 10^{-5}$, linear learning rate decay with 32 warmup steps, and a batch size of 2, and train for 6 epochs in total.
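For intuition, the learning rate schedule described above can be written as a function of the optimizer step. This sketch assumes a linear warmup from zero over the first 32 steps followed by a linear decay to zero, which is how the linear schedule with warmup in the transformers library behaves; the total number of steps depends on the dataset size and is a free parameter here, and the function name is hypothetical.

```python
def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 32,
) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 to base_lr.
        return base_lr * (step / max(1, warmup_steps))
    # Decay: scale linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


# Example with a hypothetical run of 1000 optimizer steps.
for s in (0, 16, 32, 516, 1000):
    print(s, linear_schedule_lr(s, total_steps=1000))
```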

For response generation, we finetune Flan-T5 3B with LoRA with a learning rate of $1 \cdot 10^{-5}$, a batch size of 2 and a total of 10 training epochs.

For both tasks, we use an NVIDIA A100 80GB GPU, and training takes around 3-6 hours for 5- or 10-fold cross-validation.