Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis



This work studies the application of Process Reward Models (PRMs) to dynamic data analysis tasks.

An empirical study finds that general-domain PRMs struggle to supervise data analysis agents: they fail to detect "silent errors" (logical flaws that produce wrong results without raising interpreter exceptions) and wrongly penalize exploratory actions, mistaking necessary trial-and-error for failures.
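To make "silent error" concrete, here is a minimal illustrative example (not taken from the paper): code that runs without any interpreter exception yet yields a logically wrong answer, which is exactly the failure class an outcome-only or general-domain verifier tends to miss.

```python
# A minimal illustration of a "silent error": the code executes cleanly,
# no exception is raised, but the result is logically wrong.
scores = [90, 80, 70]
weights = [3, 1]  # one weight is missing from the data

# zip() silently truncates to the shorter sequence: the score 70 is
# dropped, yet the program still returns a plausible-looking number.
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
# weighted_mean == (90*3 + 80*1) / 4 == 87.5 — wrong, and nothing errored
```

Catching such flaws requires inspecting intermediate state (e.g. checking that `len(scores) == len(weights)`), which motivates an environment-aware verifier.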

To address this, the team proposes DataPRM, an environment-aware generative process reward model that actively verifies environment state and distinguishes correctable errors from irrecoverable ones.
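The abstract describes this as a "reflection-aware ternary reward strategy". A hypothetical sketch of what such a three-way verdict-to-reward mapping could look like is below; the verdict names and reward values are illustrative assumptions, not the paper's actual design.

```python
from enum import Enum

class StepVerdict(Enum):
    """Hypothetical ternary verdicts a process reward model might assign to one agent step."""
    CORRECT = "correct"                  # step advances the analysis correctly
    GROUNDING_ERROR = "grounding_error"  # correctable, e.g. a wrong column name the agent can retry
    IRRECOVERABLE = "irrecoverable"      # a mistake that corrupts all downstream results

# Illustrative reward values (the paper's actual values are not given in this summary).
REWARDS = {
    StepVerdict.CORRECT: 1.0,
    StepVerdict.GROUNDING_ERROR: 0.0,  # neutral: trial-and-error exploration is not punished
    StepVerdict.IRRECOVERABLE: -1.0,   # penalize only truly unrecoverable mistakes
}

def step_reward(verdict: StepVerdict) -> float:
    """Map a ternary verdict to a scalar step-level reward."""
    return REWARDS[verdict]
```

The key design point is the middle class: a binary correct/incorrect signal would collapse correctable grounding errors into failures, which is precisely the behavior the authors report in general-domain PRMs.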

DataPRM is trained on over 8K high-quality instances built via diverse trajectory generation and knowledge-augmented step-level annotation. Experiments show it substantially improves downstream policy LLMs, generalizes well across test conditions, and also yields significant gains in reinforcement learning settings.
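The abstract reports these policy gains under Best-of-N inference: sample N candidate trajectories, score each with the reward model, and keep the best. A minimal sketch of this selection rule, with a stand-in scorer (mean of per-step scores) in place of the real model, is:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate trajectory with the highest reward-model score."""
    return max(candidates, key=score)

def mean_step_score(trajectory: Sequence[float]) -> float:
    """Stand-in scorer: average the per-step scores a PRM would emit."""
    return sum(trajectory) / len(trajectory)

# Toy usage: three candidate trajectories, each a list of step scores.
trajs = [[0.2, 0.9], [0.8, 0.7], [0.5, 0.5]]
best = best_of_n(trajs, mean_step_score)  # picks [0.8, 0.7]
```

In practice the scorer would be the trained PRM itself; aggregating step scores by mean is just one common choice (min or product are alternatives).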




Abstract: Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors (logical flaws that yield incorrect results without triggering interpreter exceptions) and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at this https URL.

Note: only the abstract is quoted here for copyright reasons; see the original paper for the full text.
