Niu Ge's Picks · Last 3 Months

1

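🤖 AI · Large Models arXiv Machine Learning 2026-07-15 NEW

SlimPer: Make Personalization Model Slim and Smart

SlimPer proposes a new approach to making personalization models both compact and capable, balancing performance with efficiency.

arXiv:2607.12281v1 Announce Type: cross Abstract: Transformer-style architectures are increasingly adopted for industrial recommendation systems, yet …

slimper · personalization models · model compression · efficient inference · deep learning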

2

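🤖 AI · Large Models arXiv Machine Learning 2026-07-10

KronQ: LLM Quantization via Kronecker-Factored Hessian

Uses a Kronecker factorization of the Hessian to improve LLM quantization accuracy efficiently, targeting a bottleneck in post-training quantization.

arXiv:2607.07964v1 Announce Type: new Abstract: Post-training quantization (PTQ) is a widely adopted technique for compressing large language models (…

LLM quantization · Kronecker factorization · Hessian matrix · model compression · post-training quantization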

3

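📝 Deep Tech arXiv Machine Learning 2026-07-09

PALS: Percentile-Aware Layerwise Sparsity for LLM Pruning

Moving beyond one-size-fits-all pruning, PALS adapts the sparsity of each layer according to per-layer percentiles to improve LLM compression.

arXiv:2607.07557v1 Announce Type: cross Abstract: One-shot pruning methods like Wanda and SparseGPT apply the same sparsity ratio to every layer of a …

LLM pruning · layerwise sparsity · percentile-aware · model compression · wanda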

4

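📝 Deep Tech arXiv AI 2026-07-07

Nemotron-Labs-3-Puzzle-75B-A9B: Compressing Hybrid MoE LLMs

More than 70 authors jointly release a compression approach for the Nemotron hybrid MoE LLM that substantially lowers inference cost while preserving performance.

arXiv:2607.04371v1 Announce Type: new Abstract: We present Nemotron-Labs-3-Puzzle-75B-A9B, a compressed variant of Nemotron-3-Super optimized for inte…

mixture-of-experts · model compression · large language models · inference optimization · nemotron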

5

📝 Deep Tech arXiv NLP 2026-07-07

Fair-GPTQ: Bias-Aware Quantization for Large Language Models

Fair-GPTQ is a new method that accounts for both efficiency and fairness when quantizing large language models, reducing bias.

arXiv:2509.15206v3 Announce Type: replace Abstract: The high memory demands of generative language models have drawn attention to quantization, which …

fair-gptq · large language models · quantization · bias-aware · fairness

6

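📝 Deep Tech arXiv NLP 2026-07-07

Token-level Response-visual Attention Guidance for Multimodal LLMs Knowledge Distillation

Addresses the difficulty of compressing multimodal LLMs by proposing token-level response-visual attention guidance to improve distillation.

arXiv:2607.02593v1 Announce Type: cross Abstract: While knowledge distillation (KD) is widely adopted for training lightweight models by leveraging su…

knowledge distillation · multimodal large language models · attention guidance · model compression · token-level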

7

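🤖 AI · Large Models arXiv Machine Learning 2026-07-07

Effective Distillation to Hybrid xLSTM Architectures

Applies knowledge distillation to hybrid xLSTM architectures, exploring a new direction for efficient model compression.

arXiv:2603.15590v2 Announce Type: replace Abstract: There have been numerous attempts to distill quadratic attention-based large language models (LLMs…

knowledge distillation · xlstm · hybrid architecture · model compression · deep learning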

8

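🤖 AI · Large Models arXiv AI 2026-07-07

Trust Region Policy Distillation

Brings trust-region optimization to policy distillation to address policy drift during model compression.

arXiv:2607.04751v1 Announce Type: cross Abstract: Big goals are hard to achieve all at once; breaking them into small steps is wiser. We present Trust…

trust region · policy distillation · reinforcement learning · model compression · deep tech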

9

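🤖 AI · Large Models arXiv Machine Learning 2026-07-02

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache

A KV cache quantization scheme that reaches sub-1-bit compression, substantially reducing inference memory overhead without sacrificing accuracy.

arXiv:2607.01065v1 Announce Type: new Abstract: The deployment of Large Language Models (LLMs) with extended context windows is increasingly constrain…

kv cache · quantization · gsrq · model compression · inference acceleration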

10

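🤖 AI · Large Models arXiv AI 2026-07-02

Post-Training Pruning for Diffusion Transformers

Diffusion Transformers are powerful but computationally expensive. This paper proposes a new post-training pruning approach to make them more efficient.

arXiv:2607.00927v1 Announce Type: cross Abstract: Diffusion Transformers (DiTs) have demonstrated impressive performance in image generation but suffe…

diffusion · post-training pruning · image generation · compute optimization · model compression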

11

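📝 Deep Tech arXiv Machine Learning 2026-07-02

TallyTrain: Communication-Efficient Federated Distillation

Proposes a federated distillation method that eases both bandwidth bottlenecks, model size and number of classes, substantially improving communication efficiency.

arXiv:2607.00173v1 Announce Type: new Abstract: Federated learning is bandwidth-bound on two orthogonal axes: model size, which limits how often param…

federated learning · knowledge distillation · communication efficiency · model compression · bandwidth optimization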

12

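🤖 AI · Large Models arXiv NLP 2026-06-30

Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection

Quantizing large models tends to degrade fairness and safety. This paper addresses the problem by protecting critical weights, so that lightweight models retain both.

arXiv:2601.12033v2 Announce Type: replace Abstract: Quantization is widely adopted to reduce the computational cost of large language models (LLMs); h…

quantization · large language models · fairness · safety · critical weights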

13

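📝 Deep Tech arXiv AI 2026-06-29

Measuring the Redundancy of Decoder Layers in SpeechLLMs

How much redundancy exists in the decoder layers of speech LLMs? This paper proposes a way to measure it, offering a new perspective on model compression.

arXiv:2603.05121v2 Announce Type: replace-cross Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that t…

speechllm · decoder redundancy · model compression · LLM optimization · layer redundancy analysis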

14

📝 Deep Tech arXiv AI 2026-06-26

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

Combines activation sparsity with FP4 quantization to substantially improve LLM inference efficiency.

arXiv:2606.26587v1 Announce Type: cross Abstract: Low-bit floating-point formats and semi-structured sparsity are increasingly supported by modern acc…

LLM inference · FP4 quantization · activation sparsity · model compression · inference acceleration

15

🤖 AI · Large Models arXiv NLP 2026-06-26

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

Recalibrates large vision-language models to produce smaller ones, improving efficiency and accuracy.

arXiv:2506.15681v4 Announce Type: replace Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) t…

vision-language models · knowledge distillation · model compression · recalibration · generation

16

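📝 Deep Tech arXiv Machine Learning 2026-06-24

AsyncOPD: How Stale Can On-Policy Distillation Be?

How much stale data can on-policy distillation tolerate? This paper challenges conventional assumptions and offers a framework for quantitative analysis.

arXiv:2606.24143v1 Announce Type: new Abstract: On-policy distillation (OPD) trains a student on its own rollouts guided by teacher feedback and is be…

knowledge distillation · reinforcement learning · asynchronous training · policy distillation · staleness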

17

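📝 Deep Tech arXiv AI 2026-06-23

Context-Aware Distillation and Ablation for Text2DSL

Proposes context-aware distillation and ablation methods to improve the quality and efficiency of Text2DSL generation, advancing the translation of natural language into domain-specific languages.

arXiv:2606.22578v1 Announce Type: cross Abstract: We extend our prior work on Text2DSL automatic generation of domain-specific language (DSL) code fro…

context-aware distillation · text2dsl · ablation study · knowledge distillation · domain-specific languages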

18

📝 Deep Tech arXiv AI 2026-06-23

An Empirical Study of OpenPangu Quantization on Ascend NPUs

An empirical study of OpenPangu quantization on Huawei Ascend NPUs, examining key optimizations for deploying AI models on domestic chips.

arXiv:2606.21257v1 Announce Type: cross Abstract: OpenPangu models are attractive targets for private and domestic large-language-model deployment, ye…

openpangu · quantization · ascend npu · empirical study · large model compression

19

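📝 Deep Tech arXiv AI 2026-06-23

UniRank: Unified Rank Allocation for Low-Rank LLM Compression

Proposes a unified rank allocation method to address bottlenecks in compressing LLMs with low-rank decomposition, balancing efficiency and performance.

arXiv:2606.21847v1 Announce Type: cross Abstract: Low-rank decomposition serves as a promising compression paradigm for large language models, however…

low-rank decomposition · LLM compression · rank allocation · unified framework · model compression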

20

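🤖 AI · Large Models arXiv AI 2026-06-19

Reinforcement-aware Knowledge Distillation for LLM Reasoning

This paper proposes reinforcement-aware knowledge distillation, in which the teacher model attends more to the reasoning process when teaching the student, moving beyond conventional distillation that transfers only answers.

arXiv:2602.22495v3 Announce Type: replace-cross Abstract: Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-t…

reinforcement learning · knowledge distillation · LLM reasoning · model compression · reasoning enhancement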
