RAP: Runtime Adaptive Pruning for LLM Inference
Proposes a runtime adaptive pruning method that lets LLM inference memory adjust dynamically, greatly improving efficiency.
arXiv:2505.17138v5 Announce Type: replace Abstract: Large language models (LLMs) excel at language understanding and generation, but their enormous co…
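Proposes a fluid-guided online scheduling method that optimizes LLM inference under memory constraints, significantly reducing latency and operating costs.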
arXiv:2504.11320v3 Announce Type: replace-cross Abstract: Large language models now serve millions of users daily, with providers incurring costs exce…