Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Common Corpus Released: The Largest Ethical Dataset, Providing High-Quality, Compliant Data for LLM Pre-Training
arXiv:2506.01732v3 Announce Type: replace Abstract: Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and d…