Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
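Proposes a fluid-guided online scheduling method that optimizes LLM inference under memory constraints, significantly reducing latency and operating costs.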
arXiv:2504.11320v3. Abstract: Large language models now serve millions of users daily, with providers incurring costs exce…