Niu Ge's Picks · This Month

1

🤖 AI · Large Models arXiv Machine Learning 2026-05-25

Strong Teacher Not Needed? On Distillation in LLM Pretraining

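Counterintuitive? Weak teacher models can also distill LLMs effectively; teacher strength is not the deciding factor during pretraining.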

arXiv:2605.23857v1 Announce Type: new Abstract: Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield b…

large language models, knowledge distillation, pretraining, model compression, weak-to-weak distillation

2

📝 Deep Tech arXiv AI 2026-05-25

Parallel Context Compaction for Long-Horizon LLM Agent Serving

Targets context overflow in long-horizon LLM agents with a parallel compaction method that cuts inference stalls lasting tens of seconds.

arXiv:2605.23296v1 Announce Type: new Abstract: Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model's c…

LLM agents, context compaction, inference optimization, parallel computing, summarization

3

🔓 Open Source Hacker News LLM 2026-05-24

Show HN: Memory for LLM apps that cuts input tokens up to 80% (avg 68%)

An open-source GitHub project that gives LLM apps long-term memory while cutting input tokens by 68% on average, lowering API costs.

Article URL: https://github.com/Tem-Degu/streetai-memory Comments URL: https://news.ycombinator.com/item?id=48249509 Points: 1 # Comments: 0

LLM, token optimization, memory management, open source, cost savings

4

📝 Deep Tech arXiv AI 2026-05-23

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Compiles agent workflows into LLM weights, reaching near-frontier quality at very low cost and proposing a new route to model optimization.

arXiv:2605.22502v1 Announce Type: new Abstract: Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across L…

agentic wo, LLM weight compilation, cost optimization, model compression, agent orchestration

5

📝 Deep Tech arXiv AI 2026-05-23

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Compresses the KV cache with composable meta-tokens, preserving context efficiently and speeding up LLM inference.

arXiv:2605.22337v1 Announce Type: new Abstract: The KV cache used in large language models has linearly growing time complexity, so LLMs face memory b…

KV cache compression, meta-token, context preservation, LLM inference optimization, composable meta-tokens

6

📝 Deep Tech arXiv NLP 2026-05-22

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

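Projection-guided cross-tokenizer knowledge distillation that needs no auxiliary components and addresses vocabulary incompatibility.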

arXiv:2605.21699v1 Announce Type: cross Abstract: Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatib…

knowledge distillation, cross-tokenizer, projection-guided, model compression, student model

7

📝 Deep Tech arXiv Machine Learning 2026-05-21

A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

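Revisits whether retraining after pruning is necessary for LLMs, questions complex pruning criteria, and proposes a more efficient compression strategy.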

arXiv:2510.14444v3 Announce Type: replace Abstract: Post-training pruning can substantially reduce LLM inference costs, but it often degrades quality …

LLM compression, pruning, retraining, model optimization, inference cost

8

🤖 AI · Large Models arXiv Computer Vision 2026-05-20

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

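Proposes an efficient vision encoder that tackles the explosion of visual tokens in long videos for Video LLMs, removing the frame-scaling bottleneck.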

arXiv:2605.17260v1 Announce Type: new Abstract: The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies …

video LLMs, vision encoder, long-video understanding, token compression, frame scaling

9

📝 Deep Tech arXiv Machine Learning 2026-05-20

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

Proposes VeriCache, which turns a lossy KV cache into lossless LLM inference, improving both efficiency and accuracy.

arXiv:2605.17613v1 Announce Type: cross Abstract: The large size of the KV cache has become a major bottleneck for serving LLMs with increasing contex…

KV cache, LLM inference, lossless compression, technical paper

10

📝 Deep Tech arXiv Machine Learning 2026-05-20

LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

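Proposes LEAP, a learnable end-to-end adaptive pruning method that compresses large language models efficiently while preserving performance.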

arXiv:2605.17289v1 Announce Type: new Abstract: Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shiftin…

LEAP, large language models, adaptive pruning, end-to-end, learnable

11

🤖 AI · Large Models arXiv Machine Learning 2026-05-20

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

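A new method that blends full fine-tuning with low-rank adaptation, tuned for post-training and aiming for both efficiency and performance.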

arXiv:2605.18822v1 Announce Type: new Abstract: Post-training has become essential for adapting large language models (LLMs) to complex downstream beh…

hybrid-lor, low-rank adaptation, full fine-tuning, post-training, model compression

12

📝 Deep Tech arXiv NLP 2026-05-20

K-Quantization and its Impact on Output Performance

Examines how K-Quantization affects model output performance, with an in-depth look at the quantization technique.

arXiv:2605.19645v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP…

quantization, model compression, output performance, k-quantiza, deep learning

13

📝 Deep Tech arXiv Machine Learning 2026-05-20

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

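Quantization lets machine learning models run efficiently on medical imaging in low-resource settings, lowering the compute barrier and speeding AI adoption in primary care.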

arXiv:2605.19207v1 Announce Type: cross Abstract: Deep learning models have shown strong performance in medical image analysis, but deploying them in …

quantized models, medical imaging, low-resource healthcare, model compression, edge deployment

14

📝 Deep Tech arXiv Machine Learning 2026-05-20

Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery

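An in-depth evaluation of memory condensation strategies for coding agents in data-driven scientific discovery, offering new ideas for AI-assisted research.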

arXiv:2605.18854v1 Announce Type: new Abstract: Coding agents accumulate extensive context during long-running tasks, yet fixed context windows force …

memory condensation, coding agents, scientific discovery, AI agent, data-driven

15

📝 Deep Tech arXiv Machine Learning 2026-05-20

Theory-optimal Quantization Based on Flatness

A theory-optimal quantization method based on flatness, offering a new angle on deep learning model compression.

arXiv:2605.18800v1 Announce Type: new Abstract: Post-training quantization has emerged as a widely adopted technique for compressing and accelerating …

model quantization, flatness, theory-optimal compression, deep learning, arXiv paper

16

📝 Deep Tech arXiv Computer Vision 2026-05-20

Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion

Combines semantic disentanglement with LLM reasoning and diffusion-based generation for a new approach to universal image coding.

arXiv:2412.18158v2 Announce Type: replace Abstract: Learned image compression methods have shown impressive performance but are often highly specializ…

universal image coding, semantic disentanglement, LLM reasoning, diffusion models, image compression

17

📝 Deep Tech arXiv Computer Vision 2026-05-20

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

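Proposes a frequency-residual compression method that sharply reduces token counts in video MLLMs, gaining efficiency without losing performance.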

arXiv:2605.16366v1 Announce Type: new Abstract: Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-…

video compression, multimodal LLMs, frequency residual, token compression, efficiency optimization

18

📝 Deep Tech arXiv NLP 2026-05-20

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

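Uses LLM hidden representations for per-task quantization, improving efficiency substantially while maintaining performance; a technique worth watching.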

arXiv:2511.06516v3 Announce Type: replace Abstract: Many LLM applications require only narrow capabilities, yet standard post-training quantization (P…

LLM, quantization, hidden representations, model compression, per-task quantization

19

📝 Deep Tech arXiv NLP 2026-05-20

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

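Proposes an expert-guided post-merge quantization method that uses merged-weight anchoring to balance compression and performance in low-resource deployment.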

arXiv:2605.16882v1 Announce Type: new Abstract: Low-resource deployment constraints have made model quantization essential for deploying neural networ…

model quantization, neural network compression, post-merge quantization, weighted anchoring, low-resource deployment

20

📝 Deep Tech arXiv Machine Learning 2026-05-20

RAP: Runtime Adaptive Pruning for LLM Inference

Proposes a runtime adaptive pruning method that adjusts LLM inference memory dynamically, improving efficiency.

arXiv:2505.17138v5 Announce Type: replace Abstract: Large language models (LLMs) excel at language understanding and generation, but their enormous co…

LLM inference, adaptive pruning, runtime optimization, memory constraints, model compression
