📚 Weekly Papers

2025-09-08 ~ 2025-09-14
A Survey of Reinforcement Learning for Large Reasoning Models
Authors: Kaiyan Zhang, Biqing Qi, Ning Ding, Bowen Zhou
Affiliation: Tsinghua University.
**TLDR:** A systematic survey of reinforcement learning for large reasoning models (LRMs): it organizes reward design (verifiable/generative/dense/unsupervised), policy optimization (policy gradient/critic-based/offline/regularized), sampling strategies, and training resources; reviews application progress since DeepSeek-R1; and lays out future directions such as continual RL, memory, and model-based approaches, along with scalability challenges.
The Majority is not always right: RL training for solution aggregation
Authors: Wenting Zhao
Affiliation: FAIR at Meta.
**TLDR:** Treats solution aggregation itself as a learnable reasoning skill: first generate multiple solutions, then use AggLM, trained with RL under verifiable rewards, to reconcile, correct, and merge them. On math benchmarks it outperforms both majority voting and reward-model selection while using fewer tokens, showing that the majority is not always right.
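A minimal sketch of the contrast, assuming a hypothetical `llm.generate` API (the aggregation prompt and helper names are illustrative, not the paper's exact setup):

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Baseline: return the most frequent final answer."""
    return Counter(final_answers).most_common(1)[0][0]

def aggregate_solutions(question: str, solutions: list[str], llm) -> str:
    """Learned aggregation: an RL-trained aggregator model reconciles and
    corrects the candidates instead of merely counting votes."""
    candidates = "\n\n".join(
        f"Candidate {i + 1}:\n{s}" for i, s in enumerate(solutions)
    )
    prompt = (
        f"Question: {question}\n\n{candidates}\n\n"
        "Cross-check the candidates, fix any mistakes, and give one final answer."
    )
    return llm.generate(prompt)
```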
Talk Isn’t Always Cheap: Understanding Failure Modes in Multi-Agent Debate
Authors: Andrea Wynn, Harsh Satija
Affiliation: Johns Hopkins University.
**TLDR:** A systematic study of failure modes in multi-agent debate: with heterogeneous abilities, adding weaker agents often makes accuracy decline as rounds increase; weak agents can derail strong ones, and overly long discussions also degrade results. The paper characterizes the contributing factors and argues that debate protocols and incentives must be designed carefully to suppress sycophantic error propagation.
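For reference, a minimal sketch of the debate protocol being studied (the agent interface is an assumption):

```python
def debate(question: str, agents: list, rounds: int) -> list[str]:
    """Minimal multi-agent debate loop: every agent answers, then revises
    after reading the other agents' latest answers."""
    answers = [a.answer(question) for a in agents]
    for _ in range(rounds):
        answers = [
            agent.revise(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return answers  # the paper finds accuracy can drop as rounds grow
```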
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
Authors: Zhiheng Xi, Tao Gui, Qi Zhang
Affiliation: Fudan University.
**TLDR:** Introduces AgentGym-RL, a unified interactive RL framework, together with ScalingInter-RL, which progressively expands the number of interaction turns, for training multi-turn decision-making LLM agents from scratch. It covers diverse scenarios, supports mainstream algorithms, matches or beats commercial models on 27 tasks, and the code and data are slated for open-source release.
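A tiny sketch of the progressive turn-budget idea behind ScalingInter-RL (the schedule shape and names are assumptions):

```python
def turn_budget(step: int, start_turns: int = 2,
                max_turns: int = 20, grow_every: int = 500) -> int:
    """Hypothetical curriculum: start with short horizons and progressively
    allow more interaction turns as RL training advances."""
    return min(max_turns, start_turns + step // grow_every)

# inside an RL loop (env/policy/collect_episode are placeholders):
# horizon = turn_budget(global_step)
# rollout = collect_episode(env, policy, max_turns=horizon)
```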
ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
Authors: Hao Wen, Yuanchun Li
Affiliation: Institute for AI Industry Research (AIR), Tsinghua University.
**TLDR:** Proposes native parallel thinking (ParaThinker): the model generates multiple reasoning paths concurrently and aggregates them into a better answer, sidestepping the "tunnel vision" bottleneck of sequential reasoning. It substantially improves accuracy on reasoning benchmarks with little added latency, pointing to a test-time scaling paradigm where "width" beats pure "depth".
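A rough sketch of "width over depth" under a fixed token budget, assuming a hypothetical `llm.generate(prompt, max_tokens=...)` API:

```python
def parallel_then_fuse(llm, question: str, token_budget: int, k: int = 4) -> str:
    """Spend a fixed budget on k shorter parallel reasoning paths instead of
    one long sequential chain, then fuse them into a single answer."""
    per_path = token_budget // k
    paths = [llm.generate(question, max_tokens=per_path) for _ in range(k)]
    fuse = (
        f"{question}\n\nDraft reasoning paths:\n" + "\n---\n".join(paths)
        + "\n\nSynthesize the drafts into one final answer."
    )
    return llm.generate(fuse)
```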
ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning
Authors: Jianghao Chen, Jiajun Zhang
Affiliation: Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, Zhongguancun Academy
**TLDR:** Proposes ACE-RL: decompose each instruction into verifiable fine-grained constraints, build the reward from how well they are satisfied, and optimize long-form generation with RL, with no need for preference-pair data. On WritingBench it improves over SFT and RL baselines by 20.7% and 7.3% respectively, surpasses GPT-4o on multi-scenario writing, and markedly improves long-form quality and consistency.
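A minimal sketch of a constraint-satisfaction reward in this spirit (the `judge.check` helper is hypothetical):

```python
def constraint_reward(response: str, constraints: list[str], judge) -> float:
    """Reward = fraction of fine-grained, verifiable constraints satisfied."""
    satisfied = sum(bool(judge.check(response, c)) for c in constraints)
    return satisfied / max(len(constraints), 1)
```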
REFRAG: Rethinking RAG based Decoding
Authors: Xiaoqiang Lin, Aritra Ghosh
Affiliation: Meta Superintelligence Labs, National University of Singapore
**TLDR:** Proposes REFRAG decoding for RAG inference: replace raw retrieved tokens with precomputed compressed chunk embeddings and exploit the sparse attention structure via "compress, sense, expand" to cut KV-cache and TTFT costs. Across tasks it preserves accuracy while accelerating time-to-first-token by 30.85x (vs. 3.75x for prior work) and extending context length by 16x.
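A loose sketch of the compressed-context idea (tensor shapes and helper names are assumptions, not the paper's architecture):

```python
import torch

def compressed_context(query_ids, chunks, embed_tokens, chunk_encoder):
    """Each retrieved chunk contributes ONE precomputed embedding instead of
    its full token sequence, shrinking the KV cache the decoder attends over."""
    q = embed_tokens(query_ids)                            # (q_len, d)
    c = torch.stack([chunk_encoder(ch) for ch in chunks])  # (n_chunks, d)
    return torch.cat([c, q], dim=0)                        # (n_chunks + q_len, d)
```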
Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
Authors: Haozhe Wang, Wenhu Chen
Affiliation: Hong Kong University of Science and Technology, University of Waterloo
**TLDR:** Empirically reveals a two-phase hierarchy in RL fine-tuning of LLMs: models first consolidate low-level execution, then shift toward high-level strategic planning. Building on this, HICRA concentrates optimization on high-impact planning tokens and uses semantic entropy as an exploration compass. It beats strong baselines such as GRPO across benchmarks and helps explain phenomena like "aha moments" and length scaling.
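A minimal sketch of a planning-token-weighted policy-gradient loss in this spirit (the weighting scheme is an assumption, not the paper's exact objective):

```python
import torch

def planning_weighted_pg_loss(logprobs: torch.Tensor,
                              advantages: torch.Tensor,
                              is_planning: torch.Tensor,
                              boost: float = 2.0) -> torch.Tensor:
    """Tokens flagged as high-level planning moves (boolean mask) get their
    advantage upweighted relative to low-level execution tokens."""
    weights = 1.0 + (boost - 1.0) * is_planning.float()
    return -(weights * advantages * logprobs).mean()
```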
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
Authors: Yihao Wang, Donglin Wang
Affiliation: Beijing University of Posts and Telecommunications, Westlake University, OpenHelix Team
**TLDR:** To address the high cost of VLAs, proposes VLA-Adapter: it systematically analyzes which conditions best bridge perception and action, then uses Bridge Attention to adaptively inject the best conditions on the policy side. With only a 0.5B backbone and no robot pretraining it reaches SOTA with faster inference, and a strong VLA can be trained on a single consumer GPU in 8 hours, sharply lowering the barrier to entry.
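A generic cross-attention conditioning block as a loose stand-in for Bridge Attention (the paper's actual design may differ):

```python
import torch.nn as nn

class BridgeBlock(nn.Module):
    """Action queries attend to the perception features selected as
    conditioning; a residual connection plus LayerNorm keeps training stable."""
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, action_q, percept_kv):
        out, _ = self.attn(action_q, percept_kv, percept_kv)
        return self.norm(action_q + out)
```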
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Authors: Jeffrey Amico, Gabriel Passamani Andrade
Affiliation: Gensyn
**TLDR:** Proposes SAPO, a decentralized asynchronous post-training algorithm: each node keeps its own policy while sharing rollouts across nodes and sampling from the shared pool ("experience sharing"), with no synchronization or homogeneity assumptions. Controlled experiments show up to 94% higher cumulative reward, and a real large-scale testnet demonstrates feasibility and scalability.
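A small sketch of the experience-sharing batch assembly (the 50/50 mixing ratio is an assumption):

```python
import random

def build_training_batch(local_rollouts: list, shared_pool: list,
                         batch_size: int, share_frac: float = 0.5) -> list:
    """Mix a node's own rollouts with rollouts sampled from the
    cross-node shared pool; no synchronization between nodes is needed."""
    n_shared = int(batch_size * share_frac)
    batch = random.sample(shared_pool, min(n_shared, len(shared_pool)))
    batch += random.sample(local_rollouts,
                           min(batch_size - len(batch), len(local_rollouts)))
    return batch
```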
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Authors: Tong Zheng (remaining authors not listed)
Affiliation: Tencent AI Lab Seattle, University of Maryland, College Park
**TLDR:** Proposes Parallel-R1, an RL framework for eliciting and internalizing parallel thinking: first instill the format with SFT on easy problems, then explore with RL on hard ones, using an alternating reward to trade off accuracy against structural parallelism. It delivers clear gains on MATH/AMC/AIME, and the learned strategy is observed to shift from "exploration" to "verification" over training.
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Authors: Haozhan Li, Ning Ding
Affiliation: Tsinghua University
**TLDR:** For robot VLAs, builds efficient online RL on top of veRL with parallel multi-environment rendering and VLA-specific trajectory sampling and optimization, enabling OpenVLA-OFT to surpass SFT on LIBERO and RoboTwin and to behave more robustly on real-world tasks; it also reports "pushcut", a new action pattern that emerges during training.
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Authors: Jiawei Wang, Yang Wang
Affiliation: ByteDance, ByteDance Seed
**TLDR:** Proposes Entropy-Modulated Policy Gradients (EMPG): scale gradients by per-step uncertainty, amplifying confident correct actions, strongly penalizing confident errors, and damping high-entropy exploratory noise, and adds a "future clarity" bonus that guides agents toward predictable trajectories. It clearly outperforms strong baselines on WebShop, ALFWorld, and Deep Search.
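A minimal sketch of one plausible functional form for entropy-modulated step weights (the exact modulation in the paper may differ):

```python
import torch

def empg_step_weights(entropy: torch.Tensor, advantage: torch.Tensor,
                      scale: float = 1.0) -> torch.Tensor:
    """Low-entropy (confident) steps keep nearly their full advantage, so
    confident successes are reinforced and confident errors strongly
    penalized, while high-entropy exploratory steps are attenuated."""
    confidence = torch.exp(-scale * entropy)  # in (0, 1], near 1 when certain
    return confidence * advantage
```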
So let's replace this phrase with insult... Lessons learned from generation of toxic texts with LLMs
Authors: Sergey Pletenev
Affiliation: AIRI, Skoltech
**TLDR:** Explores using LLMs to synthesize "toxic" data from neutral text for training detoxification models. Purely synthetic data causes a clear degradation, rooted in poor lexical diversity: the models reuse a handful of swear words and fail to cover the fine-grained variation of human toxic language, so human-annotated data remains necessary for now.
Reverse-Engineered Reasoning for Open-Ended Generation
Authors: Haozhe Wang, Ge Zhang
Affiliation: ByteDance Seed, Hong Kong University of Science and Technology
**TLDR:** Proposes REER: start from a high-quality open-ended answer and search backward for a step-by-step thinking trajectory that explains it, synthesizing "deep thinking" data at scale in a gradient-free way. The authors build DeepWriting-20K, train DeepWriter-8B, and approach or partially surpass commercial models on several writing benchmarks, showing deep reasoning can be acquired without RL or distillation.
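A sketch of the gradient-free backward search, with all helpers (`propose`, `explain_score`) hypothetical:

```python
def reer_search(answer: str, propose, explain_score, steps: int = 50):
    """Iteratively mutate a candidate reasoning trajectory and keep it when
    it better 'explains' the known high-quality answer, e.g. via lower
    perplexity of the answer conditioned on the trajectory."""
    traj = propose(answer)                 # initial draft trajectory
    best = explain_score(traj, answer)
    for _ in range(steps):
        cand = propose(answer, seed=traj)  # local edit of current trajectory
        score = explain_score(cand, answer)
        if score > best:
            traj, best = cand, score
    return traj
```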
Interview with Shunyu Yao