MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
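The new method MTraining uses distributed dynamic sparse attention to substantially reduce the computational overhead of ultra-long-context training.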
arXiv:2510.18830v2. Abstract: The adoption of long context windows has become a standard feature in Large Language Models …