Niu Ge's Picks · Past three months

1

🤖 AI · LLMs ByteByteGo 2026-07-14

How LLMs Learn to Be Helpful (RLHF vs DPO)

A single article comparing the core differences and use cases of RLHF and DPO, the two mainstream LLM training methods.

In this article, we will look at how that learning actually happens, starting with why instruction-following alone falls short, then walking through t…

rlhf, dpo, LLM training, reinforcement learning from human feedback

2

📝 Deep Tech arXiv Machine Learning 2026-07-10

When Implausible Tokens Get Reinforced: Tail-Aware Credit Calibration for LLM Reinforcement Learning

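Addresses the problem of implausible tokens being wrongly reinforced in LLM reinforcement learning, proposing a new tail-aware credit calibration method.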

arXiv:2607.07976v1 Announce Type: cross Abstract: Reinforcement learning (RL) has achieved remarkable success in enhancing the reasoning capabilities …

llm, reinforcement learning, rlhf, credit calibration, tail tokens

3

📝 Deep Tech Dev.to 2026-07-04

DPO vs RLHF: The Alignment Tax You Pay Without Knowing

Compares the alignment cost of DPO and RLHF, revealing a hidden bias in how LLMs answer philosophical questions.

Ask yourself one question. When you talk to ChatGPT or Claude, do you feel like you talk to something that thinks — or something that agrees with you …

dpo, rlhf, alignment cost, LLMs, philosophical questions

4

📝 Deep Tech arXiv NLP 2026-06-23

What are Key Factors for Updates in RL for LLM Reasoning?

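A study dissecting the key factors in RL updates for LLM reasoning, identifying the core variables and training strategies that affect performance.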

arXiv:2606.22570v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhanci…

llm reasoning, reinforcement learning, model updates, key factors, reasoning ability

5

📝 Deep Tech arXiv AI 2026-06-11

Certifiable Safe RLHF: Semantic Grounding and Fixed Penalty Constraint Optimization for Safer LLM Alignment

Semantic grounding plus fixed-penalty constrained optimization gives LLM alignment certifiable safety guarantees.

arXiv:2510.03520v2 Announce Type: replace-cross Abstract: Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an…

safe rlhf, llm alignment, semantic grounding, constrained optimization, safety

6

📝 Deep Tech arXiv AI 2026-06-10

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

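Non-uniform token-level trust-region optimization that moves past the conventional uniform constraint to improve the stability of LLM reinforcement learning training.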

arXiv:2606.10968v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasonin…

llm, reinforcement learning, trust region, token-level optimization, rlhf

7

📝 Deep Tech arXiv Machine Learning 2026-06-09

A Unifying Lens on Reward Uncertainty in RLHF

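Examines reward uncertainty in RLHF through a unified lens, offering new insights for alignment via reinforcement learning from human feedback.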

arXiv:2606.09073v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the …

rlhf, reward uncertainty, reinforcement learning, human feedback, alignment techniques

8

📝 Deep Tech arXiv NLP 2026-06-08

What Do People Actually Want From AI? Mapping Preference Plurality

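A top-conference paper exposing a fundamental flaw in how RLHF aggregates preferences, systematically mapping the plural needs people actually have of AI.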

arXiv:2606.06674v1 Announce Type: new Abstract: Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (…

large language models, rlhf, plurality, facct2026

9

🤖 AI · LLMs arXiv Machine Learning 2026-06-02

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

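Uses active learning to select high-value preference data, substantially reducing RLHF annotation cost; a more efficient approach to LLM preference alignment.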

arXiv:2603.09692v2 Announce Type: replace Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Langu…

active lea, preference data generation, rlhf, efficiency improvement, llm

10

📝 Deep Tech arXiv AI 2026-06-02

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Proposes a curation-free triangulated metric that isolates LLM lexical bias at the preference-learning stage.

arXiv:2606.00334v1 Announce Type: cross Abstract: Various language domains have undergone remarkable changes in recent years; these shifts are largely…

llm, lexical bias, preference learning, rlhf, triangulated metric

11

📝 Deep Tech arXiv AI 2026-05-26

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

Uni-DPO, a unified DPO paradigm that dynamically optimizes LLM preferences and addresses differences in data quality.

arXiv:2506.10054v4 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning …

uni-dpo, preference optimization, llm, rlhf, dynamic optimization

12

📝 Deep Tech arXiv Computer Vision 2026-05-21

Leveraging Verifier-Based Reinforcement Learning in Image Editing

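A new paradigm that brings RLHF to image editing, proposing verifier-based reinforcement learning to address the bottleneck of a missing reward model.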

arXiv:2604.27505v2 Announce Type: replace Abstract: While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-…

image editing, reinforcement learning, rlhf, reward model, verifier

13

🤖 AI · LLMs arXiv NLP 2026-05-20

Reinforcement Learning for LLM Post-Training: A Survey

A survey that systematically reviews reinforcement learning in LLM post-training, covering RLHF, DPO, RLVR, and other recent methods.

arXiv:2407.16216v4 Announce Type: replace Abstract: Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still pr…

reinforcement learning, llm, post-training, rlhf, dpo

14

🤖 AI · LLMs arXiv Machine Learning 2026-05-20

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

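The study finds that multi-agent systems "caving" under peer disagreement is not specific to RLHF; base models show the same weakness, which challenges the conventional view of alignment.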

arXiv:2605.12991v2 Announce Type: replace Abstract: LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagr…

multi-agent, llm alignment, sycophancy, rlhf, base models

15

📝 Deep Tech arXiv Machine Learning 2026-05-20

Beyond RLHF: A Unified Theoretical Framework of Alignment

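A unified theoretical framework of alignment that goes beyond RLHF, abstractly formalizing multiple alignment algorithms and revealing how they relate, with a new perspective for AI safety.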

arXiv:2506.01523v2 Announce Type: replace Abstract: Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm f…

rlhf, alignment theory, unified framework, ai safety, algorithm formalization

16

📝 Deep Tech arXiv Machine Learning 2026-05-20

General Preference Reinforcement Learning

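A NeurIPS 2026 submission proposing a general preference reinforcement learning method that gives RLHF and related areas a firmer theoretical foundation.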

arXiv:2605.18721v1 Announce Type: new Abstract: Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Onl…

general preference reinforcement learning, reinforcement learning, preference learning, rlhf, neurips 20

17

🤖 AI · LLMs Hacker News AI 2026-05-20

The Information Theory Behind Why AI Writing Sucks

An information-theoretic look at why AI writing all sounds the same, uncovering the "annotator-consensus dialect" that RLHF produces.

Article URL: https://www.pangram.com/blog/joe-stech-information-theory-why-ai-writing-sucks Comments URL: https://news.ycombinator.com/item?id=4819646…

information theory, ai writing, rlhf, annotator-consensus dialect

18

📝 Deep Tech arXiv Machine Learning 2026-05-20

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

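An adaptive layerwise perturbation method that unifies off-policy corrections, offering a more efficient training strategy for LLM reinforcement learning.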

arXiv:2603.19470v3 Announce Type: replace Abstract: Off-policy problems such as policy staleness and training-inference mismatch have become a major …

llm, reinforcement learning, rlhf, off-policy correction, adaptive perturbation
