ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference
Cross-model proxy pruning neatly balances low latency with high accuracy, tackling the KV cache memory wall in long-context LLM inference
arXiv:2605.16360v1 Announce Type: new Abstract: Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Va…