Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning
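Proposes a GPU memory ballooning technique that substantially reduces the cost of multi-LLM serving; validated in a production environment of more than 10,000 GPUs.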
arXiv:2505.04021v3 Announce Type: replace-cross Abstract: Inference providers must maintain availability for many LLMs, including low-volume but essen…