Runtime-Certified Bounded-Error Quantized Attention
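Addresses the approximation error introduced by KV cache quantization in long-context LLM inference by proposing a runtime-certified bounded-error method that keeps the accuracy of quantized attention computation under control.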
arXiv:2605.20868v1 Announce Type: new Abstract: KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximat…