Scalable Joint Resource Allocation for SLO-Constrained LLM Inference in Heterogeneous GPU Clouds
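A scalable joint resource allocation scheme for heterogeneous GPU clouds that efficiently guarantees the SLO constraints of LLM inference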
arXiv:2604.07472v2 Announce Type: replace Abstract: Serving large language model (LLM) inference in cloud environments requires jointly optimizing mod…