📚 Weekly Papers

2025-10-05
2025-09-29 ~ 2025-10-05
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
Authors: Adrian Kosowski.
Affiliation: Pathway, Palo Alto, USA.
Proposes BDH, a biologically inspired architecture that implements attention and memory via local graph dynamics and interpretable ReLU low-rank modules, combining interpretability with Transformer-level performance. Theoretically, it bridges large language models and models of the brain; empirically, it approaches same-size GPT-2 performance on translation and other tasks at 10M–1B parameter scales.
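The "ReLU low-rank module" mentioned above can be sketched as a down-projection, a sparsifying ReLU, and an up-projection; the shapes and naming here are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def relu_lowrank(x, A, B):
    """Sketch of a ReLU low-rank unit: project to a low-rank space,
    apply ReLU (yielding sparse, interpretable activations), project back.
    A: (d, r) down-projection, B: (r, d) up-projection, with r << d."""
    return np.maximum(x @ A, 0.0) @ B

# Toy usage: d=4, rank r=2.
x = np.ones((1, 4))
A = np.ones((4, 2))
B = np.ones((2, 4))
y = relu_lowrank(x, A, B)  # shape (1, 4)
```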
Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
Authors: Yuandong Tian.
Affiliation: Meta Superintelligence Labs (FAIR).
Based on gradient dynamics, proposes the Li² framework, which divides grokking training into lazy learning, independent feature learning, and an interaction stage. Proves that hidden-layer features are local maxima of an energy function E, derives scaling laws for how sample size, weight decay, and other hyperparameters determine feature emergence and the memorization/generalization boundary, and explains some optimizer effects.
Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought
Authors: Hanlin Zhu.
Affiliation: UC Berkeley.
A theoretical analysis of the training dynamics of continuous chain-of-thought: proposes a two-stage decomposition into thought generation and prediction, and proves that exponentially matched logits remain bounded, balancing local search against multi-path exploration and producing "superposed" reasoning. The mechanism and its advantages are validated on graph-reachability tasks and empirical training traces.
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Authors: Xinyu Tian, Zhaoyuan Yang.
Affiliation: Australian National University.
Identifies a "double-edged sword" in multimodal reasoning: longer chains of thought often weaken image perception, causing recognition errors. The authors define this "visual forgetting" and propose VAPO, which uses visual anchors and a perception reward to enforce attention to the image during reasoning, improving accuracy on multiple VLM benchmarks and mitigating reasoning collapse.
RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
Authors: Yuxiao Qu, Ankait Singh.
Affiliation: Carnegie Mellon University.
Proposes RLAD: first generate "abstractions" that summarize intermediate steps, then use reinforcement learning to solve under abstraction guidance; this decoupling reduces verbose, degenerate chain exploration. Experiments show that under the same test-time compute, prioritizing high-quality abstractions beats generating more samples, and generalization to harder problems improves.
Variational Reasoning for Language Models
Authors: Xiangxin Zhou, Tianyu Pang.
Affiliation: Sea AI Lab, UCAS, CASIA.
Treats "thought traces" as latent variables and builds a variational reasoning framework: from an ELBO to a multi-trace IWAE-style bound, with forward-KL-stabilized training; also interprets RFT and binary-reward RL (e.g., GRPO) as implicitly weighted forward KL. The method delivers consistent gains on multiple reasoning tasks with the Qwen series, unifying the probabilistic view with RL practice.
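The ELBO-to-multi-trace step above follows the standard importance-weighted (IWAE) construction; with $z$ the latent thought trace, $q_\phi$ the variational posterior, and $K$ sampled traces (notation assumed here, not taken from the paper):

```latex
\log p_\theta(y \mid x)
\;\ge\;
\mathbb{E}_{z_{1:K} \sim q_\phi(z \mid x,\, y)}
\left[ \log \frac{1}{K} \sum_{k=1}^{K}
\frac{p_\theta(y, z_k \mid x)}{q_\phi(z_k \mid x,\, y)} \right]
```

Setting $K = 1$ recovers the ordinary ELBO; the bound tightens as $K$ grows.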
Quantile Advantage Estimation for Entropy-Safe Reasoning
Authors: Junkang Wu, Xiangnan He.
Affiliation: University of Science and Technology of China.
Observes that RLVR training risks both entropy collapse and entropy explosion. Proposes QAE, which replaces the mean baseline with a per-group K-quantile baseline to adaptively control the exploration–exploitation trade-off, with a theoretical "two-sided entropy safety" guarantee. Experiments show stable pass@1 gains on Qwen3-8B/14B and fewer ineffective updates.
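A minimal sketch of the quantile-baseline idea, assuming grouped scalar rewards; the quantile level `q` and the exact advantage form are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def quantile_advantages(rewards, q=0.5):
    """Advantage = reward minus the group's q-quantile baseline.

    Replacing the group-mean baseline with a quantile makes the sign of
    each advantage depend on rank within the group, which gates how many
    responses receive positive vs. negative updates."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = np.quantile(rewards, q)
    return rewards - baseline
```

With sparse verifiable rewards such as `[0, 0, 0, 1]`, a median baseline leaves the failed responses with zero advantage instead of the small negative advantage a mean baseline would assign, which is one way a quantile choice can modulate entropy pressure.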
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Authors: Fang Wu, Yejin Choi.
Affiliation: Stanford University.
Integrates MCTS directly into RLVR training, proposing global frontier selection, entropy-based supervision, and an adaptive replay buffer to systematically expand search coverage and enable fine-grained backtracking. Achieves a new SOTA on mathematical reasoning with a 1.5B model while saving 5.7× GPU-hours compared with extended training, suggesting that training-time exploration beats brute-force step scaling.
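For orientation, the node-scoring primitive underneath any such tree search is a UCT-style rule; the sketch below is the generic formula, not DeepSearch's specific frontier-selection criterion, and the constant `c` is an assumed hyperparameter.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT score: mean value (exploitation) plus an
    exploration bonus that shrinks as a node is visited more.
    A frontier-selection scheme would rank candidate nodes by
    scores like this across the whole tree rather than per parent."""
    if visits == 0:
        return float("inf")  # unvisited nodes are expanded first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)
```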
The Era of Real-World Human Interaction: RL from User Conversations
Authors: Chuanyang Jin, Jason Weston.
Affiliation: FAIR at Meta, Johns Hopkins University.
Proposes reinforcement learning from real user conversations (RLHI) as an alternative to static-annotation post-training, with two methods: "user-guided rewriting" and "rewards based on long-term user personas." On WildChat and similar data it significantly improves personalization, instruction following, and reasoning benchmarks, suggesting that organic interaction can serve as scalable supervision.
In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners
Authors: Jaehoon Kim, Dongha Lee.
Affiliation: Yonsei University.
Directly distilling a large model's reasoning chains into a small model degrades performance due to distribution mismatch. Proposes Reverse Speculative Decoding (RSD): the teacher proposes tokens and the student accepts them according to its own probabilities, filtering out low-probability tokens to produce "student-friendly" reasoning traces; yields significant gains on AIME and other benchmarks with small Qwen models.
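The teacher-proposes / student-filters loop can be sketched as below; the hard probability threshold is an illustrative assumption standing in for the paper's actual acceptance criterion.

```python
def rsd_filter(teacher_tokens, student_prob, threshold=0.05):
    """Reverse speculative decoding, sketched: the teacher proposes each
    token in order; the student keeps it only if its own probability of
    that token, given the kept prefix, is high enough. The result is a
    trace that stays on-distribution for the student."""
    kept = []
    for tok in teacher_tokens:
        p = student_prob(kept, tok)  # student's P(tok | kept prefix)
        if p >= threshold:
            kept.append(tok)
    return kept
```

In practice `student_prob` would be one forward pass of the small model; here any callable with that signature works.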
PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning
Authors: Xueliang Zhao, Wei Wu.
Affiliation: The University of Hong Kong, Ant Group.
Incorporates reasoning chains into prompt synthesis and jointly optimizes them with EM, supporting both self-play and SFT training. Compared with PromptCoT 1.0 and open-source corpora, PromptCoT 2.0 delivers large gains on math and coding benchmarks; with synthetic problems alone, it pushes a 7B model's accuracy from single digits to over 60%.
Training Agents Inside of Scalable World Models
Authors: Danijar Hafner, Timothy Lillicrap.
Affiliation: Google DeepMind.
Proposes Dreamer 4: "training in imagination" inside a scalable world model, combining shortcut forcing with an efficient transformer to support real-time interaction on a single GPU. Using only offline data, it achieves diamond collection in Minecraft, significantly surpassing prior world models and demonstrating the potential of learning general knowledge from large volumes of unlabeled video.
Aristotle: IMO-level Automated Theorem Proving
Authors: The Harmonic Team.
Affiliation: Harmonic.
Introduces Aristotle, which combines Lean formal proving, informal reasoning, and a geometry solver for IMO-level automated theorem proving; it reaches gold-medal-equivalent performance on the 2025 IMO problems (formal solutions to five of six). By generating and formalizing lemmas and verifying them with Lean, it demonstrates scalable mathematical reasoning.
Why Do We Need Warm-up? A Theoretical Perspective
Authors: Foivos Alimisis, Aurelien Lucchi.
Affiliation: University of Basel, Switzerland.
Proposes (H0, H1)-smoothness, which upper-bounds curvature linearly in loss suboptimality, and proves upper and lower bounds showing that under this condition gradient descent with a warm-up step-size schedule converges faster; validated on language and vision models, yielding a theoretical account of when and why warm-up helps.
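A condition of this shape ("curvature bounded linearly by loss suboptimality") can be written as follows; the exact notation here is an assumption reconstructed from the summary, with $\mathcal{L}^*$ the infimum of the loss:

```latex
\left\| \nabla^2 \mathcal{L}(x) \right\|
\;\le\;
H_0 + H_1 \bigl( \mathcal{L}(x) - \mathcal{L}^* \bigr)
```

Intuitively, curvature may be large early in training when the loss is far from optimal, which is exactly where a small warm-up step size is protective; as $\mathcal{L}(x) \to \mathcal{L}^*$, the bound relaxes toward ordinary $H_0$-smoothness and larger steps become safe.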
Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft
Authors: Junchao Huang, Li Jiang.
Affiliation: The Chinese University of Hong Kong, Shenzhen, Shenzhen Loop Area Institute.
Proposes Memory Forcing for interactive Minecraft scene generation: hybrid training and chained-forward training adaptively trade off exploration against revisiting, while a geometry-indexed spatial memory with point-to-frame retrieval and incremental 3D reconstruction efficiently recalls history; significantly improves long-horizon spatial consistency and generation quality while keeping compute bounded.
Interview with Tri Dao: predictions for AI development over the next 2–3 years