Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
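An adaptive layerwise perturbation method that unifies off-policy corrections, offering a more efficient training strategy for LLM reinforcement learning.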
arXiv:2603.19470v3 Announce Type: replace Abstract: Off-policy problems such as policy staleness and training–inference mismatch have become a major …