Training LLM agents to critique their own actions improves performance and generalization

TL;DR

Researchers introduced Agentic Critical Training (ACT), a reinforcement‑learning method that rewards large‑language‑model agents for correctly judging which of two candidate actions is better, teaching them to evaluate actions rather than merely imitate demonstrations. Across navigation, dialogue, and game benchmarks, ACT improved average performance by about five points over imitation learning and outperformed standard RL and reflection‑distillation approaches. It also generalized better to unseen environments and improved scores on general reasoning benchmarks. The results suggest that rewarding autonomous critique can improve agent decision‑making and reasoning, a promising path toward more trustworthy LLM agents.

ACT trains large‑language‑model agents to judge which of two actions is better, and this simple change yields measurable gains across multiple benchmarks.

Large‑language‑model agents are typically taught by imitation learning: they copy pre‑written demonstrations of what to do. That approach teaches correct behavior but offers no insight into why an action succeeds or fails. In a preprint posted to arXiv, Liu, Liu, Ho, and colleagues identify this shortfall and propose Agentic Critical Training (ACT), a reinforcement‑learning framework that rewards the model when it correctly ranks an action against a suboptimal alternative. In contrast to earlier self‑reflection methods that merely imitate reflection text, ACT requires the agent to develop its own reasoning about action quality.

In ACT, the agent is presented with a state and two candidate actions, one of which is known to be better than the other. It generates a judgment (“action A is better than action B,” or vice versa) and receives a reward only if that judgment matches the ground‑truth comparison. Because the reward depends on the correctness of the judgment itself, the model learns to internalize a critical evaluation process rather than to copy pre‑written reflections. The researchers evaluated ACT on three demanding agent benchmarks: a navigation task, a dialogue management scenario, and a problem‑solving game. When combined with existing post‑training methods, ACT improved average performance by 5.07 points over pure imitation learning and by 4.62 points over standard reinforcement‑learning baselines. Compared with approaches that inject reflection via knowledge distillation, ACT delivered a 2.42‑point advantage.
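The judgment‑and‑reward step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the article does not specify the prompt format, the exact reward values, or the RL algorithm, so the binary 0/1 reward, the "Better action: A/B" verdict line, and every name here (`ComparisonExample`, `make_example`, `parse_verdict`, `judgment_reward`) are hypothetical. Shuffling which slot holds the better action is likewise an assumed precaution so that position alone cannot reveal the answer.

```python
import random
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComparisonExample:
    """One training item: a state, two candidate actions, and the known better slot."""
    state: str
    action_a: str
    action_b: str
    better: str  # "A" or "B" (ground-truth label)


def make_example(state: str, good_action: str, worse_action: str,
                 rng: random.Random) -> ComparisonExample:
    """Place the better action in slot A or B at random (assumed precaution)."""
    if rng.random() < 0.5:
        return ComparisonExample(state, good_action, worse_action, "A")
    return ComparisonExample(state, worse_action, good_action, "B")


# Hypothetical output convention: the model ends its reasoning with
# a line such as "Better action: A".
_VERDICT = re.compile(r"Better action:\s*([AB])\b")


def parse_verdict(model_output: str) -> Optional[str]:
    """Return the last stated verdict ("A" or "B"), or None if none is found."""
    matches = _VERDICT.findall(model_output)
    return matches[-1] if matches else None


def judgment_reward(model_output: str, example: ComparisonExample) -> float:
    """Assumed binary reward: 1.0 if the verdict matches the label, else 0.0."""
    return 1.0 if parse_verdict(model_output) == example.better else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    ex = make_example("You are in the kitchen; the goal is to heat the mug.",
                      "put mug in microwave", "open fridge", rng)
    output = f"The microwave heats objects, the fridge does not. Better action: {ex.better}"
    print(judgment_reward(output, ex))
```

In a full training loop this scalar would be passed to a policy‑gradient update of whatever kind the practitioner chooses. The point of the sketch is only that the reward depends on the correctness of the model's own verdict, not on how closely its text matches a reference reflection.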

Beyond these head‑to‑head comparisons, ACT also showed stronger out‑of‑distribution generalization. In tests where the agent faced environments or prompts not seen during training, ACT‑trained models maintained higher success rates than their imitation‑learning counterparts. The method also improved performance on general reasoning benchmarks, such as logical deduction and commonsense inference, even though the training data contained no reasoning‑specific examples. This suggests that the critical‑evaluation skill learned through ACT transfers beyond the narrow scope of agentic tasks.

The significance of these findings lies in the shift from imitation to autonomous critique. By rewarding correct judgments, ACT ties the learning signal directly to the skill the agent needs at decision time: telling a better action from a worse one. This addresses a persistent bottleneck in building reliable LLM agents, namely their lack of awareness of whether an action is likely to work. The reported gains, while modest, are consistent across multiple benchmarks and indicate that a reward on judgment correctness can outperform imitation learning, standard RL, and reflection injected via knowledge distillation.

Future work should probe the limits of ACT’s generalization and explore how the critical‑evaluation skill scales with larger models and more complex action spaces. Whether the approach can be combined with other forms of meta‑learning or curriculum design remains an open question.