1
Dynamic Latent Routing
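Proposes General Dijkstra Search and the Dynamic Latent Routing post-training method, which achieve optimal temporal composition of sub-policies in MDPs and offer a new paradigm for reinforcement learning with language models.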
arXiv:2605.14323v1 Announce Type: cross Abstract: We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with ti…