Niuge's Picks · Past three months

1

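🔓 Open Source | IT之家 | 2026-07-15 | NEW

Xiaomi open-sources Xiaomi-Robotics-U0: the embodied-AI field's first unified generative model to handle all four task types

Xiaomi has open-sourced a unified generative model for embodied AI that handles all four task types, with a reported 83x inference speedup. Worth a look for open-source enthusiasts.

IT之家, July 15: Xiaomi today released Xiaomi-Robotics-U0, a 38-billion-parameter multimodal autoregressive generative foundation model for embodied AI. It is described as the field's first unified generative model to handle all four task types, connecting the generation and editing pipeline for robot image and video data. Embodied scene generation (Scene Generation): the model…

Xiaomi open source, embodied AI, four task types, unified generative model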

2

🤖 AI · Large Models | arXiv Machine Learning | 2026-07-10

A Practical Investigation of Training-free Relaxed Speculative Decoding

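A relaxed speculative decoding strategy that accelerates LLM inference without additional training, with a practical performance evaluation and key findings.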

arXiv:2607.08690v1 Announce Type: new Abstract: Speculative decoding accelerates sampling from an autoregressive LLM by using a faster auxiliary model…

speculative decoding, training-free, LLM inference acceleration, relaxation strategy, performance evaluation
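The idea named in the abstract above, a faster auxiliary model proposing tokens that the main model then checks, can be sketched with toy models. This is a minimal illustration of standard greedy-verification speculative decoding, not the relaxed variant the paper studies. `target_model`, `draft_model`, and the arithmetic inside them are hypothetical stand-ins for real LLMs, and the extra "bonus" token that real systems emit after a fully accepted draft is left out for brevity.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token


def target_model(prefix: List[int]) -> int:
    """Hypothetical slow, accurate model (toy arithmetic, not a real LLM)."""
    return (sum(prefix) * 3 + len(prefix)) % 7


def draft_model(prefix: List[int]) -> int:
    """Hypothetical fast model: agrees with the target except at some positions."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 7


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per new token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft proposes up to k tokens; target keeps the longest matching prefix."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes a short continuation on its own.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks the proposed positions. A real system scores
        #    them in one batched forward pass, so we count one pass here.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        seq.extend(accepted)
    return seq, target_passes


if __name__ == "__main__":
    prompt = [1, 2]
    baseline = greedy_decode(target_model, prompt, 12)
    fast, passes = speculative_decode(target_model, draft_model, prompt, 12)
    print(fast == baseline, passes)
```

Because every kept token is the target's own choice, the output matches plain greedy decoding from the target. Any speedup comes from needing fewer target passes than new tokens.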

3

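🤖 AI · Large Models | IT之家 | 2026-07-10

Google releases the LiteRT.js library, speeding up Web AI inference by up to 3x

Google's new LiteRT.js speeds up Web AI inference by up to 3x, using WebAssembly and hardware acceleration to ease browser bottlenecks.

IT之家, July 10: Google published a blog post yesterday (July 9) introducing LiteRT.js, a new library meant to speed up artificial intelligence (AI) and machine learning (ML) workloads in the browser. Google says LiteRT.js uses WebAssembly, combined with hardware acceleration such as WebGPU and WebNN, to replace T…

Google release, maximum inference speedup, litert.js, web ai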

4

🤖 AI · Large Models | TechCrunch | 2026-07-08

Hot French startup ZML releases free product to speed inference across lots of AI chips

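French startup ZML has released a free product to speed up AI inference across many kinds of chips; the 20-person team is taking on industry giants.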

ZML, a hot French AI startup endorsed by Turing Award winner Yann LeCun, has now released ZML/LLMD, software that could make running AI less costly.

zml, AI inference acceleration, multi-chip, free product, French startup

5

🤖 AI · Large Models | arXiv AI | 2026-07-07

SPORK: Self-Speculative Forking to Accelerate Agentic LLM Inference

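Self-speculative forking lets an LLM agent pre-generate its subsequent reasoning while waiting for a tool call to return, substantially reducing GPU idle time and improving inference efficiency.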

arXiv:2607.03333v1 Announce Type: cross Abstract: LLM agents are becoming a common interface for research, coding, and question answering, yet their T…

llm agents, inference acceleration, speculative execution, GPU utilization, tool calling

6

🤖 AI · Large Models | arXiv NLP | 2026-07-07

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

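A new method that accelerates multimodal large-model inference through speculative perception and planning, significantly reducing the overhead of sequential calls.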

arXiv:2603.23483v2 Announce Type: replace-cross Abstract: Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision)…

multimodal large models, agentic AI, speculative inference, inference acceleration, visual tool calling

7

🤖 AI · Large Models | arXiv Machine Learning | 2026-07-02

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache

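A KV cache quantization scheme that reaches sub-1-bit compression, substantially cutting inference memory overhead without a reported loss in accuracy.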

arXiv:2607.01065v1 Announce Type: new Abstract: The deployment of Large Language Models (LLMs) with extended context windows is increasingly constrain…

kv cache, quantization, gsrq, model compression, inference acceleration

8

🤖 AI · Large Models | arXiv AI | 2026-07-01

HippoSpark: An On-Demand Experience System for LLM Reasoning

HippoSpark is an on-demand experience system for LLM reasoning that dynamically improves accuracy and efficiency on complex reasoning tasks.

arXiv:2606.29929v1 Announce Type: new Abstract: Distilling historical trajectories into reusable experience to enhance future problem-solving has beco…

LLM reasoning, on-demand system, large-model optimization, inference acceleration, distributed experience

9

🤖 AI · Large Models | Hacker News LLM | 2026-06-30

EdgeSync-LLM – KV cache fragment engine for on-device LLM inference (Go/Android)

A KV cache fragment engine for on-device LLM inference that skips the most expensive stage of inference, optimized for ARM64 Android.

Article URL: https://github.com/bossandboss/EdgeSync-LLM Comments URL: https://news.ycombinator.com/item?id=48732973 Points: 2 # Comments: 0

edgesync-l, KV cache, on-device inference, LLM optimization, android

10

📝 Deep Tech | arXiv Machine Learning | 2026-06-29

EntMTP: Accelerating LLM Inference with Entropy Guided Multi Token Prediction

An entropy-guided multi-token prediction method that accelerates LLM inference and improves generation quality.

arXiv:2606.27550v1 Announce Type: cross Abstract: Multi-token prediction has been shown to increase data density during training, improve downstream t…

llm, inference acceleration, multi-token prediction, entropy-guided, generation quality

11

🤖 AI · Large Models | Hacker News LLM | 2026-06-27

DSpark: Speculative decoding accelerates LLM inference [pdf]

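A close look at DeepSeek's latest research on how speculative decoding substantially accelerates LLM inference; the link goes to the full paper.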

Article URL: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf Comments URL: https://news.ycombinator.com/item?id=48696585 Points: 52…

dspark, deepseek, speculative decoding, LLM inference acceleration, generation efficiency

12

📝 Deep Tech | arXiv AI | 2026-06-26

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

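SharQ combines activation sparsity with FP4 quantization to substantially improve LLM inference efficiency, an in-depth optimization scheme.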

arXiv:2606.26587v1 Announce Type: cross Abstract: Low-bit floating-point formats and semi-structured sparsity are increasingly supported by modern acc…

LLM inference, FP4 quantization, activation sparsity, model compression, inference acceleration

13

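🤖 AI · Large Models | IT之家 | 2026-06-26

Huawei and Hubei Mobile complete the first live-network test of an AI inference acceleration solution among China's carriers, raising long-sequence token throughput by 372%

IT之家, June 26: On June 24, during MWC Shanghai 2026, Huawei and China Mobile Communications Group Hubei Co., Ltd. (IT之家 note: hereafter "Hubei Mobile") jointly announced that they had successfully completed the first live-network test of an AI inference acceleration solution among China's carriers. According to the announcement, the test was based on Huawei OceanStor A800 storage and Ascend A3 …

Huawei and Hubei Mobile, first among China's carriers, inference acceleration solution, live-network test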

14

📝 Deep Tech | Dev.to | 2026-06-25

Why KV Cache Matters — How MQA, GQA, and MLA Make LLM Inference Faster

An in-depth breakdown of the KV cache and of how MQA, GQA, and MLA work, covering the key techniques behind faster large-model inference.

LLMs generate text one token at a time. That sounds simple. But without KV Cache, every new token would repeat a lot of old work. That is why inferenc…

kv cache, mqa, gqa, mla, inference acceleration
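The excerpt's point, that without a KV cache every new token repeats old work, can be shown with a toy single-head attention loop. This is a minimal sketch and not code from the article: the 2-dimensional embeddings, the weight matrices, and all function names are made up for illustration. It counts how many token positions have their keys and values projected with and without a cache.

```python
import math


def project(x, w):
    """Row vector x times matrix w (given as a list of rows)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all visible positions."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def decode_no_cache(tokens, wq, wk, wv):
    """At every step, recompute K and V for all tokens seen so far."""
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        keys = [project(x, wk) for x in tokens[: t + 1]]
        values = [project(x, wv) for x in tokens[: t + 1]]
        kv_projections += t + 1
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, kv_projections


def decode_with_cache(tokens, wq, wk, wv):
    """Project K and V once per token and keep them in a growing cache."""
    outputs, kv_projections = [], 0
    keys, values = [], []
    for x in tokens:
        keys.append(project(x, wk))
        values.append(project(x, wv))
        kv_projections += 1
        outputs.append(attend(project(x, wq), keys, values))
    return outputs, kv_projections


# Made-up embeddings and weights, purely for illustration.
TOKENS = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
WQ = [[1.0, 0.0], [0.0, 1.0]]
WK = [[0.5, 0.1], [0.2, 0.7]]
WV = [[1.0, 2.0], [3.0, 4.0]]

if __name__ == "__main__":
    slow, slow_count = decode_no_cache(TOKENS, WQ, WK, WV)
    fast, fast_count = decode_with_cache(TOKENS, WQ, WK, WV)
    print("K/V projections without cache:", slow_count)  # 1 + 2 + 3 + 4 = 10
    print("K/V projections with cache:", fast_count)     # one per token = 4
```

Both loops produce the same attention outputs. Only the amount of repeated projection work differs, growing quadratically with sequence length without the cache and linearly with it.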

15

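🚀 Product Watch | IT之家 | 2026-06-25

Qualcomm unveils the Dragonfly data center portfolio: HBC architecture, C1000 CPU, AI300 inference accelerator

Qualcomm has launched the Dragonfly data center portfolio: the HBC architecture, with 6x the bandwidth of HBM, plus the C1000 CPU and AI300 inference accelerator, both optimized for AI.

IT之家, June 25: Qualcomm announced its full Dragonfly data center solution at its Investor Day today, comprising the HBC architecture, the C1000 CPU, and the AI300 inference accelerator, along with chip design services and an interconnect product portfolio. HBC (High Bandwidth Compute) architecture: HBC is a disaggregated architecture that splits a complete chip into a main…

Qualcomm launch, data center portfolio, architecture, inference accelerator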

16

🤖 AI · Large Models | arXiv AI | 2026-06-23

Executing as You Generate: Hiding Execution Latency in LLM Code Interpreters

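Proposes executing code while the LLM is still generating it, hiding execution latency and substantially improving code interpreter efficiency.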

arXiv:2604.00491v2 Announce Type: replace-cross Abstract: Current LLM systems are increasingly equipped with a code interpreter that executes generate…

llm, code interpreter, latency hiding, execute-while-generating, performance optimization

17

📝 Deep Tech | arXiv Machine Learning | 2026-06-15

Efficient On-Device Diffusion LLM Inference with Mobile NPU

Uses mobile NPUs to accelerate diffusion LLM inference in parallel, offering a new approach to deploying on-device AI.

arXiv:2606.13740v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) accelerate generation by denoising multiple tokens in parallel…

diffusion LLMs, mobile NPU, inference acceleration, on-device AI, efficient computing

18

📝 Deep Tech | arXiv Computer Vision | 2026-06-12

Budget-Constrained Step-Level Diffusion Caching

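Can diffusion model inference get cheaper? A budget-constrained step-level caching strategy offers a new way to cut the cost of generative AI.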

arXiv:2606.13496v1 Announce Type: new Abstract: Step-level caching accelerates diffusion models by exploiting temporal redundancy across denoising ste…

diffusion models, cache optimization, budget constraint, inference acceleration, icml 2026

19

🤖 AI · Large Models | Hacker News LLM | 2026-06-11

Making Local LLM Fast

Tips for speeding up local LLMs: from prefill to decode, a two-minute overview of the core stages of inference acceleration.

Article URL: https://bogdan.nimblex.net/programming/2026/06/10/making-local-llm-fast.html Comments URL: https://news.ycombinator.com/item?id=48489344 …

local LLM, inference acceleration, prefill, decode, efficiency tips

20

🤖 AI · Large Models | arXiv NLP | 2026-06-10

UniSVQ: 2-bit Unified Scalar-Vector Quantization

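Targets the 2-bit quantization bottleneck by unifying scalar and vector quantization, enabling low-cost deployment and faster inference for large models.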

arXiv:2606.10520v1 Announce Type: new Abstract: Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration f…

2-bit quantization, scalar quantization, vector quantization, post-training quantization, large-model deployment
