Wiki Dumps to Training Corpora: South Slavic Case
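Building high-quality South Slavic corpora from raw Wikipedia data, with a detailed walkthrough of the text extraction and cleaning process for seven languages.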
arXiv:2604.25384v2 Announce Type: replace Abstract: This paper presents a pipeline designed to transform raw Wikimedia dumps into quality textual corp…