Critique-in-the-loop Self-Improvement for Better Reasoning Models
Motivated by the test-time findings above, namely that critique models significantly aid in solving challenging problems and substantially raise the reasoning performance ceiling as computation scales up, we integrate critique-based supervision into the actor model's iterative exploration and learning process. We introduce a critique-in-the-loop self-improvement method, which scales up exploration computation on challenging queries and leads to stronger reasoning models.
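To make the procedure concrete, here is a minimal sketch of one critique-in-the-loop self-improvement round. It is not the paper's actual implementation: the interface names (`actor.generate`, `critic.critique`, `actor.refine`, `is_correct`, `finetune`), the per-query "difficulty" field, and the specific sampling budgets are assumptions made only for illustration.

```python
def explore_with_critique(actor, critic, queries, is_correct, n_easy=4, n_hard=16):
    """Exploration stage: collect correct solutions, spending more samples on hard queries."""
    training_data = []
    for query in queries:
        # Assumed: each query dict carries a "difficulty" estimate; harder queries
        # receive a larger sampling budget (the scaled-up exploration computation).
        n_samples = n_hard if query.get("difficulty", 0) >= 3 else n_easy
        for _ in range(n_samples):
            solution = actor.generate(query["question"])             # initial attempt
            feedback = critic.critique(query["question"], solution)  # critique feedback
            if not feedback.is_correct:
                # The critique guides a refined attempt before filtering.
                solution = actor.refine(query["question"], solution, feedback.text)
            if is_correct(solution, query["answer"]):
                training_data.append({"question": query["question"], "solution": solution})
    return training_data


def critique_in_the_loop_self_improve(actor, critic, queries, is_correct, finetune, rounds=3):
    """Alternate exploration (with critique feedback) and learning (fine-tuning)."""
    for _ in range(rounds):
        data = explore_with_critique(actor, critic, queries, is_correct)
        actor = finetune(actor, data)  # learning stage: fine-tune on self-collected data
    return actor
```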
Figure 6: Evaluation results of critique-in-the-loop self-improvement ("SI" in the figure denotes self-improvement). Compared to the vanilla self-improvement approach, our method achieves significant performance improvements, particularly at larger N values.
Here are our main findings:
Critique-in-the-loop self-improvement consistently improves reasoning performance.
The evaluation results of our method are shown in Figure 6. We observe that: (1) increasing the number of samples during exploration improves performance, with the performance upper bound rising accordingly, underscoring the benefits of additional exploration computation; (2) our method consistently outperforms vanilla self-improvement with stable and significant performance gains, especially when the sample number N is larger.
Figure 7: Difference in the proportion of training data at each difficulty level between the exploration data collected by critique-in-the-loop self-improvement and by vanilla self-improvement. On both datasets, our method increases the proportion of difficult problems in the training set while reducing the proportion of simpler ones.
Figure 8: Performance differences between our method and vanilla self-improvement on test-set subsets of varying difficulty. Our method slightly outperforms the vanilla approach on simpler problems and achieves significantly greater improvements on harder problems.
Critique-in-the-loop self-improvement balances the solution distribution across difficulty levels and enhances performance on challenging queries in the test set.
As shown in Figure 7, our approach samples a higher proportion of solutions for challenging queries during the exploration stage. This substantially balances the training data distribution for the learning stage, effectively mitigating the tail-narrowing issue. Figure 8 presents the model's performance on the test set across different difficulty levels: our method performs significantly better than the vanilla approach on harder problems, further demonstrating its potential.
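As a concrete illustration of the distribution analysis above, the sketch below computes the per-difficulty proportion of collected training data and the difference between two exploration strategies (the Figure 7-style comparison). The data format and the "difficulty" field are assumptions for illustration, not the paper's actual analysis code.

```python
from collections import Counter


def difficulty_proportions(training_data):
    """Fraction of collected solutions at each difficulty level.

    `training_data` is assumed to be a list of dicts with a "difficulty" field,
    e.g. the output of the exploration sketch above.
    """
    counts = Counter(example["difficulty"] for example in training_data)
    total = sum(counts.values())
    return {level: counts[level] / total for level in sorted(counts)}


def proportion_difference(critique_data, vanilla_data):
    """Per-level proportion difference (critique-in-the-loop minus vanilla)."""
    ours, vanilla = difficulty_proportions(critique_data), difficulty_proportions(vanilla_data)
    levels = sorted(set(ours) | set(vanilla))
    # Positive values mean critique-in-the-loop exploration keeps relatively
    # more training data at that difficulty level.
    return {level: ours.get(level, 0.0) - vanilla.get(level, 0.0) for level in levels}
```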
Combining test-time supervision with training-time supervision yields further performance gains.
| Training-time | Test-time | GSM8K Acc | GSM8K Pass@5 | GSM8K MV@5 | MATH Acc | MATH Pass@5 | MATH MV@5 |
|---|---|---|---|---|---|---|---|
| Supervised Fine-tuning | response only | 54.8 | 75.2 | 54.5 | 17.2 | 35.0 | 15.6 |
| Supervised Fine-tuning | w/ critique model | 63.3 | 87.6 | 75.4 | 24.3 | 47.4 | 30.7 |
| Self-Correction Fine-tuning | response only | 54.2 | 73.1 | 53.4 | 18.1 | 32.4 | 16.6 |
| Self-Correction Fine-tuning | self-correction | 60.1 | 81.5 | 67.2 | 24.2 | 41.7 | 26.1 |
| Vanilla Self-Improve | response only | 64.6 | 83.4 | 70.6 | 20.2 | 38.5 | 23.0 |
| Vanilla Self-Improve | w/ critique model | 70.2 | 90.8 | 78.2 | 27.0 | 48.8 | 31.4 |
| Critique-in-the-loop Self-Improve | response only | 75.5 | 89.1 | 80.1 | 31.3 | 51.0 | 35.1 |
| Critique-in-the-loop Self-Improve | w/ critique model | **75.8** | **91.8** | **82.8** | **31.4** | **53.1** | **36.8** |
Table 3: Evaluation results of combining different training-time and test-time methods. During training, "Self-Correction Fine-tuning" refers to training a model with both reasoning and correction capabilities. For test-time methods, "response only" means the actor model generates a response without additional correction or critique; "w/ critique model" means a critique model provides feedback at test time, enabling the actor to refine its response; and "self-correction" means the model generates a response and then corrects it by itself. The best performance in each column is shown in bold. From the results, we observe that both test-time and training-time critique supervision provide consistent improvements, and combining the two achieves the best performance.
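As a concrete reading of the two critique-based test-time settings in the caption, the following sketch contrasts "w/ critique model" with "self-correction". The interface names (`generate`, `critique`, `refine`) and the single-round refinement budget are assumptions for illustration, not the paper's evaluation code.

```python
def refine_with_external_critique(actor, critic, question):
    """Test-time "w/ critique model": a separate critique model supplies the feedback."""
    response = actor.generate(question)             # "response only" would stop here
    feedback = critic.critique(question, response)  # external feedback
    return actor.refine(question, response, feedback.text)


def refine_with_self_correction(actor, question):
    """Test-time "self-correction": the same model critiques and corrects its own response."""
    response = actor.generate(question)
    feedback = actor.critique(question, response)   # self-generated feedback, no external critic
    return actor.refine(question, response, feedback.text)
```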
Evaluation results in Table 3 reveal that: (1) Integrating critique models at test time consistently enhances performance under identical training conditions, particularly when critique supervision is not used during training. For example, applying critique models at test time increases the MV@5 performance of SFT on GSM8K and MATH by 10.9 and 15.1 points, respectively. (2) When critique models are used during training, the additional benefit of test-time critique supervision becomes marginal, suggesting a successful "distillation" of the critique model into the actor during training. (3) The self-correction baseline underperforms the use of a separate critique model, aligning with findings in prior work that models struggle to accurately evaluate and refine their outputs without external feedback. Moreover, training a single model to handle both reasoning and correction may introduce conflicts, leading to performance degradation. (4) Compared with the traditional strategy of vanilla self-improvement + response only, which increases training computation, the combination of supervised fine-tuning + test-time critique supervision reduces training computation while increasing test-time computation, and it achieves better performance, particularly on the more challenging MATH dataset. This aligns with prior work highlighting the benefits of scaling test-time computation.
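For reference, the metrics in Table 3 can be computed roughly as sketched below: Acc uses a single sample, Pass@5 counts a query as solved if any of five sampled final answers is correct, and MV@5 takes a majority vote over the five final answers. The answer-extraction step and exact sampling setup are assumptions, not the paper's evaluation code.

```python
from collections import Counter


def pass_at_k(sampled_answers, reference, k=5):
    """1.0 if any of the first k sampled final answers matches the reference (Pass@5)."""
    return float(any(ans == reference for ans in sampled_answers[:k]))


def majority_vote_at_k(sampled_answers, reference, k=5):
    """1.0 if the most frequent final answer among the first k samples matches the reference (MV@5)."""
    majority_answer, _ = Counter(sampled_answers[:k]).most_common(1)[0]
    return float(majority_answer == reference)
```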