PaddlePaddle · FeixLiu · Sep 8, 2025 · Sep 8, 2025
diff --git a/examples/pre-training/tools/preprocess/README.md b/examples/pre-training/tools/preprocess/README.md
diff --git a/examples/pre-training/tools/preprocess/create_pretraining_data.py b/examples/pre-training/tools/preprocess/create_pretraining_data.py
diff --git a/examples/pre-training/tools/preprocess/docs/CLUECorpus2020.md b/examples/pre-training/tools/preprocess/docs/CLUECorpus2020.md
@@ -0,0 +1,12 @@
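+## CLUECorpus2020 Corpus
+
+| Name | Text type | Plain-text size |
+|-|-|-|
+| CLUECorpus2020 | Chinese | 200GB |
+
+CLUECorpus2020 is obtained by cleaning the Chinese portion of Common Crawl. The open-source release provides roughly 200 GB of text; see the [official site](https://github.com/CLUEbenchmark/CLUECorpus2020#%E6%95%B0%E6%8D%AE%E4%B8%8B%E8%BD%BD) for details. The data can be requested by email as follows:
+
+> Data download
+> How to apply: email a description of your research purpose and intended use of the corpus, your plan, and an introduction to your institution and yourself, and commit not to share the data with third parties.
+>
+> Email: [email protected], subject: CLUECorpus2020 200G语料库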
diff --git a/examples/pre-training/tools/preprocess/docs/CLUECorpusSmall.md b/examples/pre-training/tools/preprocess/docs/CLUECorpusSmall.md
@@ -0,0 +1,76 @@
+# CLUECorpusSmall
+
+| Name | Text type | Plain-text size |
+|-|-|-|
+| CLUECorpusSmall | Chinese | 14GB |
+
+**Dataset overview**: suitable for language modeling, pre-training, and generative tasks. It contains more than 14 GB of data: nearly 4,000 well-formed txt files and 5 billion characters. Most of it comes from the nlp_chinese_corpus project.
+It consists of the following sub-corpora (14 GB in total): the news corpus [news2016zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/6bac09db4e6d4857b6d680d34447457490cb2dbdd8b8462ea1780a407f38e12b?responseContentDisposition=attachment%3B%20filename%3Dnews2016zh_corpus.zip), the community interaction corpus [webText2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/83da03f7b4974871a52348b41c16c7e3b34a26d5ca644f558df8435be4de51c3?responseContentDisposition=attachment%3B%20filename%3DwebText2019zh_corpus.zip), the Wikipedia corpus [wiki2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/d7a166408d8b4ffdaf4de9cfca09f6ee1e2340260f26440a92f78134d068b28f?responseContentDisposition=attachment%3B%20filename%3Dwiki2019zh_corpus.zip), and the review corpus [comment2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/b66ddd445735408383c42322850ac4bb82faf9cc611447c2affb925443de7a6d?responseContentDisposition=attachment%3B%20filename%3Dcomment2019zh_corpus.zip).
+
+## Getting the data
+
+The data can be downloaded from the official GitHub page, https://github.com/CLUEbenchmark/CLUECorpus2020 . For convenience, we also provide AI Studio download links: [part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598), [part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357). If you use the AI Studio copy, you can verify the md5 checksums after downloading:
+```shell
+> md5sum ./*
+ 8a8be341ebce39cfe9524fb0b46b08c5  ./comment2019zh_corpus.zip
+ 4bdc2c941a7adb4a061caf273fea42b8  ./news2016zh_corpus.zip
+ fc582409f078b10d717caf233cc58ddd  ./webText2019zh_corpus.zip
+ 157dacde91dcbd2e52a60af49f710fa5  ./wiki2019zh_corpus.zip
+```
+Extract the files:
+```shell
+unzip comment2019zh_corpus.zip -d  clue_corpus_small_14g/comment2019zh_corpus
+unzip news2016zh_corpus.zip    -d  clue_corpus_small_14g/news2016zh_corpus
+unzip webText2019zh_corpus.zip -d  clue_corpus_small_14g/webText2019zh_corpus
+unzip wiki2019zh_corpus.zip    -d  clue_corpus_small_14g/wiki2019zh_corpus
+```
+Convert the txt files to jsonl format:
+```
+python trans_to_json.py  --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl
+```
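+The dataset is now in jsonl format.
+
+## Building Chinese pre-training data
+
+The commands below prepare the dataset for specific training tasks.
+
+* Using llama as an example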
+```shell
+python -u  create_pretraining_data.py \
+    --model_name "idea-ccnl/ziya-llama-13b-v1" \
+    --input_path "clue_corpus_small_14g.jsonl" \
+    --output_prefix "clue_corpus_small_14g" \
+    --data_format "JSON" \
+    --json_key "text" \
+    --data_impl "mmap" \
+    --append_eos \
+    --log_interval 10000 \
+    --workers 48
+```
+
+* Using ernie as an example
+```shell
+python -u  create_pretraining_data.py \
+    --model_name "ernie-3.0-base-zh" \
+    --input_path "clue_corpus_small_14g.jsonl" \
+    --output_prefix "clue_corpus_small_14g"  \
+    --data_format "JSON" \
+    --json_key "text" \
+    --split_sentences \
+    --data_impl "mmap" \
+    --chinese \
+    --cn_whole_word_segment \
+    --cn_seg_func "lac" \
+    --log_interval 10000 \
+    --workers 48
+```
+
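+- model_name can be replaced with [other models](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm).
+- workers sets the number of parallel workers used for the conversion.
+
+The dataset contains about `15702702` documents. Word segmentation is slow, so the run takes roughly an hour. The files needed for training are written to the current directory: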
+```
+clue_corpus_small_14g.bin
+clue_corpus_small_14g.idx
+```
+You can use this data for pre-training tasks.
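Reviewer note on the md5 step in `CLUECorpusSmall.md`: the checksum comparison can also be scripted instead of read by eye. The following is a minimal sketch, not part of this PR. The checksums are copied from the listing in the doc; the helper names `md5_of_file` and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums copied from the md5sum listing in CLUECorpusSmall.md.
EXPECTED_MD5 = {
    "comment2019zh_corpus.zip": "8a8be341ebce39cfe9524fb0b46b08c5",
    "news2016zh_corpus.zip": "4bdc2c941a7adb4a061caf273fea42b8",
    "webText2019zh_corpus.zip": "fc582409f078b10d717caf233cc58ddd",
    "wiki2019zh_corpus.zip": "157dacde91dcbd2e52a60af49f710fa5",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Hypothetical helper: map each expected file name to True, False, or None.

    None means the file is missing from `directory`.
    """
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == want) if path.is_file() else None
    return results
```

Calling `verify_downloads(".")` in the download directory returns a dict that is easy to check in a script before unzipping; reading in chunks keeps memory use flat for the multi-gigabyte archives.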
diff --git a/examples/pre-training/tools/preprocess/docs/OpenWebText2.md b/examples/pre-training/tools/preprocess/docs/OpenWebText2.md
@@ -0,0 +1,42 @@
+# OpenWebText2
+
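+| Name | Text type | Plain-text size |
+|-|-|-|
+| OpenWebText2 | English | 70GB |
+
+## Getting the data
+
+[OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/) is an open-source dataset of English web text. The data is sourced from Reddit and, after deduplication, cleaning, and extraction, contains more than 8 million documents.
+This example uses the [OpenWebText2 data](https://openwebtext2.readthedocs.io/en/latest/index.html#download-plug-and-play-version) cleaned by EleutherAI.
+
+Download and extract the data with the following commands: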
+
+```shell
+wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt/openwebtext2.jsonl.zst.tar
+tar -xvf openwebtext2.jsonl.zst.tar -C  /path/to/openwebtext
+```
+
+## Building Llama training data
+
+Then build the dataset with the `create_pretraining_data.py` script:
+```
+python -u  create_pretraining_data.py \
+    --model_name meta-llama/Llama-2-7b \
+    --tokenizer_name LlamaTokenizer \
+    --data_format JSON \
+    --input_path /path/to/openwebtext/ \
+    --append_eos \
+    --output_prefix llama_openwebtext  \
+    --workers 40 \
+    --log_interval 10000 \
+    --data_impl "mmap"
+```
+Processing takes about an hour and produces the dataset files `llama_openwebtext.bin` and `llama_openwebtext.idx`.
+
+Put all the preprocessed files into one folder, ready for training:
+
+```
+mkdir data
+mv llama_openwebtext.bin ./data
+mv llama_openwebtext.idx ./data
+```
diff --git a/examples/pre-training/tools/preprocess/docs/WuDaoCorpusBase.md b/examples/pre-training/tools/preprocess/docs/WuDaoCorpusBase.md
@@ -0,0 +1,101 @@
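+# WuDaoCorpus2.0 Base Corpus
+
+
+| Name | Text type | Plain-text size |
+|-|-|-|
+| WuDaoCorpus2.0 Base | Chinese | 200GB |
+
+WuDaoCorpora is a large-scale Chinese corpus crawled by WuDao. The full corpus is 3 TB; the part currently open-sourced is the WuDaoCorpus2.0 Base dataset, which is 200 GB.
+
+## Getting the data
+
+**1. Download and extract**
+
+Download the data directly from [here](https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab). The compressed download is about 64 GB. Extract it with: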
+```
+unrar x WuDaoCorpus2.0_base_200G.rar
+```
+**2. Word segmentation**
+
+Because the WuDao dataset is large and word segmentation is slow, segment the corpus first:
+```shell
+python words_segmentation.py \
+    --input_path ./WuDaoCorpus2.0_base_200G \
+    --workers 40  \
+    --data_format wudao \
+    --cn_seg_func seg \
+    --output_path ./wudao_lac_cut \
+```
+
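+Note: pre-training needs the SOP (Sentence Order Prediction) task, so the text is also split into sentences with simple rules during segmentation. If a document contains only one sentence, consider dropping the SOP loss by setting `binary_head=False` for training.
+
+**3. Convert to jsonl format**
+
+Once segmentation finishes, use `../data_tools/trans_to_json.py` to convert the segmented text back to jsonl format.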
+```shell
+python ./trans_to_json.py  \
+    --input_path ./wudao_lac_cut \
+    --output_path wudao_corpus_200g.jsonl \
+    --workers 40
+```
+This writes `wudao_corpus_200g.jsonl` to the current directory, in the following format:
+```
+{"text": "主持人 : 作为 一个 曲线救国 的 路线 我们 没 办法 。\n金鑫 : 考试 和 分数 只是 一个 阶段性 的 评价 手段 , 不是 目的 , 就 像 人 活着 的 目的 不是 为了 吃饭 , 吃饭 是 为了 让 我们 活下去 , 我们 学习 的 目的 不是 为了 考试 , 不是 为了 那个 分数 , 而是 我 掌握 了 知识 , 成为 我 内在 的 能力 , 将来 我 去 创作 创造 工作 , 我能 把 它 做 得 更好 。\n主持人 : 特别感谢 金总 今天 接受 我 的 访谈 , 也 让 我 从 别的 层面 看到 了 一对一 到底 存在 的 道理 是 什么 , 并且 能 发展 那么 好 的 原因 在 哪里 。\n在 节目 后 您 谈谈 您 对 一对一 未来 的 希望 , 包括 您 对 它 未来 的 设想 是 什么 ？\n金鑫 : 一对一 个性化 教育 现在 还是 在 初级阶段 , 如果 是 四个 阶段 的话 , 现在 还是 在 第一阶段 到 第二阶段 迈进 的 , 学大 在 这方面 我们 希望 能 做 得 更 快 更 远 一些 。\n将来 个性化 教育 一定 是 能够 帮助 学生 在 成绩 上 的 提升 , 能够 更好 的 成长 , 进而 成为 对 社会 对 国家 更 有用 的 人才 , 就是 我们 的 成绩 、 成长 、 成才 。\n学大 1 对 1 教育 的 教师 团队 由 各科 优秀教师 、 考试 指导 专家 、 心理 辅导 专家 及 学习 方法 指导 专家 组成 , 同时 配备 专职 班主任 及 学习 监管 师 , 全方位 辅导   顺利 而 有序 的 运作 。\n其中 部分 教师 担任 多年 毕业班 教学 工作 , 多次 参与 中 考试 命题 研究 及 阅卷 工作 , 深谙 中 考试 精髓 , 能够 在 短 的 时间 内 引领 学生 掌握 中 考试 知识   重点 , 快速 提分 。\n■   对于 成绩 差 的 学生 : 注重 学生 基础知识 , 力求 让 学生 在 基础 中 找 自信 , 在 自信 中 提升 ；\n注重 主观题 的 解题 方法 及 思路 , 以此 来 加强 对 基础知识 的 运用 。\n■   对于 成绩 需要 拔高 的 学生 : 找出 学生 弱点 , 加强 基础 , 重点 提高 弱势 项目 。\n"}
+{"text": "武田信玄 是 天生 的 武将 , 一生 开拓 了 八十五万 石至 九十余万 石之多 的 领地 。\n武田信玄  他 21 岁 时 流放 自己 的 父亲 武田信虎  至骏河 , 避免 父亲 传位 给 弟弟 , 从而 登上 了 第 19 代家督 之位 。\n他 将 信 浓国 ( 现 长野县 ) 纳入 控制 范围 后 , 又 与 当时 的 豪强 今井氏 、 北条 氏 结成 三国 军事同盟 , 与 上 杉谦信 在 川 中岛 前后 展开 了 五次 大战 。\n武田信玄  勇于 进攻 。\n他 连续 攻打 邻国 , 扩大 自己 势力范围 , 可称 遇神 杀神 , 遇佛 杀佛 。\n他 不仅 流放 了 自己 的 父亲 , 连 自己 的 嫡子 武田义信 因 与 他 在 战略 方向 上 相左 , 也 被 他 幽禁 于 佛寺 , 随即 被迫 自杀 。\n武田信玄  虽然 是 战国 武将 中 的 最强者 , 但 他 的 弱点 是 年龄 。\n信玄比 织田信长 年长 13 岁 , 比上 杉谦信 年长 9 岁 。\n当信 玄年 届 五十 之 时 , 信长 和 谦信 犹 在 壮年 。\n上杉谦信 而且 , 武田信玄  虽 驰骋 天下 , 却 未率 军 进过 京都 , 而 织田信长 在 永禄 十一年 ( 1568 年 ) 就 以 拥立 第 15 代 将军 足利义 昭 为名 率兵 上洛 了 。\n所谓 \" 制 京都 者 得 天下 \" , 所以 , 想要 一统天下 , 武田信玄  的 时间 很 紧迫 。\n元龟 三年 ( 1572 年 ) , 武田信玄  与 室 町 幕府 第 15 代 将军 足利义 昭 、 本愿 寺 显如 , 以及 浅井 氏 、 朝仓氏 等 反 织田信长 实力 组成 联盟 , 编织 \" 反信长 包围圈 \" 。\n同年 10 月 3 日 , 武田信玄  率领 大军 , 开始 了 第一次 上洛之行 。\n是 年 , 信玄 52 岁 , 这 也许 是 他 统一天下 的 最后 一次 机会 。\n武田信玄 所 率领 的 是 当时 战国 最强 的 3 万甲州 精兵 。\n打着 \" 风林火山 \" 的 旗帜 , 武田军 第一站 就 到达 了 织田信长 的 同盟 德川家康  所在 的 三河 远江 。\n织田信长 德川家康  的 军队 在 甲州 精兵 之前 显得 不堪一击 , 到 了 10 月 13 日 , 只来 成 、 天 方城 、 一 宫城 、 饭田 城 、 各和城 、 向 笠 城 等 城池 纷纷 被 攻陷 。\n德川家康  见势不妙 , 决定 在 浜松 城中 闭门不出 。\n但是 武田信玄  毫不 松懈 , 又 将 家康 在 远江 地区 的 重要 据点 二俣城 攻破 。\n德川家康  集合 所有 军队 共 1 万 1 千人 , 出城 与 信玄 决一死战 , 但 大败 而 还 , 险些 失 了 性命 。\n这次 战争 被 称为 \" 三方 原战 \" , 德川家康  曾经 承认 这次 战争 是 他 生平 最大 的 失败 。\n"}
+```
+
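+## Building Chinese pre-training data
+
+The commands below prepare the dataset for specific training tasks.
+
+* Using llama as an example
+
+Note: llama models do not need word segmentation in advance. Preprocess the json files in WuDaoCorpus2.0_base_200G into a jsonl file of the following format: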
+```
+{"text": "飞桨是功能完备、开源开放的产业级深度学习平台。飞桨拥有..."}
+{"text": "PaddleNLP是自然语言..."}
+```
+
+Then convert the jsonl file into .bin and .idx files with the following script:
+```shell
+python -u  create_pretraining_data.py \
+    --model_name "idea-ccnl/ziya-llama-13b-v1" \
+    --input_path "wudao_corpus_200g.jsonl" \
+    --output_prefix "wudao_corpus_200g" \
+    --data_format "JSON" \
+    --json_key "text" \
+    --data_impl "mmap" \
+    --append_eos \
+    --log_interval 10000 \
+    --workers 48
+```
+
+* Using ernie as an example
+```shell
+python -u  create_pretraining_data.py \
+    --model_name "ernie-3.0-base-zh" \
+    --input_path "wudao_corpus_200g.jsonl" \
+    --output_prefix "wudao_corpus_200g"  \
+    --data_format "JSON" \
+    --json_key "text" \
+    --split_sentences \
+    --data_impl "mmap" \
+    --chinese \
+    --cn_whole_word_segment \
+    --cn_seg_func "jieba" \
+    --cn_splited \
+    --log_interval 10000 \
+    --workers 48
+```
+
+
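+- The corpus was segmented in advance, so `cn_splited` is set; omit this option otherwise.
+- model_name can be replaced with [other models](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm).
+- workers sets the number of parallel workers used for the conversion.
+
+The files needed for training are written to the current directory: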
+```
+wudao_corpus_200g.bin
+wudao_corpus_200g.idx
+```
+You can use this data for pre-training tasks.
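Reviewer note on the llama path in `WuDaoCorpusBase.md`: the doc asks readers to turn the raw json files into `{"text": ...}` jsonl but gives no script for it. The following is a minimal sketch, not part of this PR. It assumes each raw file holds a top-level list of objects and that the document body sits under a `content` key; both are assumptions about the raw schema and should be checked against the downloaded files. The function names are hypothetical.

```python
import json


def records_to_jsonl_lines(records, text_key="content"):
    """Yield one JSON line per record, keeping only the text under `text_key`.

    `text_key="content"` is an assumption about the raw WuDao schema.
    """
    for record in records:
        text = (record.get(text_key) or "").strip()
        if not text:
            continue  # skip records without usable text
        # ensure_ascii=False keeps Chinese characters readable in the output.
        yield json.dumps({"text": text}, ensure_ascii=False)


def convert_file(json_path, jsonl_path, text_key="content"):
    """Append the documents of one raw json file to a jsonl file."""
    with open(json_path, encoding="utf-8") as src:
        records = json.load(src)  # assumed: a top-level list of objects
    with open(jsonl_path, "a", encoding="utf-8") as dst:
        for line in records_to_jsonl_lines(records, text_key):
            dst.write(line + "\n")
```

Because `convert_file` appends, calling it once per raw file builds up a single jsonl file that can then be passed to `--input_path`; delete any stale output before rerunning.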
diff --git a/examples/pre-training/tools/preprocess/merge.py b/examples/pre-training/tools/preprocess/merge.py
@@ -0,0 +1,104 @@
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import os
+from datetime import datetime
+
+from paddleformers.data import indexed_dataset
+
+
+def print_datetime(string):
+    time_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+    print("[" + string + "] datetime: {} ".format(time_str))
+
+
+def main(args):
+
+    prefixes = set()
+    for basename in os.listdir(args.input):
+        prefix, ext = os.path.splitext(basename)
+
+        if prefix in prefixes:
+            continue
+
+        if not os.path.isfile(os.path.join(args.input, basename)):
+            continue
+
+        ext_pair = ".bin" if ext == ".idx" else ".idx"
+        assert os.path.isfile(
+            os.path.join(args.input, prefix) + ext_pair
+        ), f"ERROR: {ext_pair} file not provided for {os.path.join(args.input, prefix)}"
+
+        prefixes.add(prefix)
+
+    builder = None
+
+    for prefix in sorted(prefixes):
+        print_datetime(f"start processing file {prefix}")
+        if builder is None:
+            dataset = indexed_dataset.make_dataset(
+                os.path.join(args.input, prefix), args.data_impl
+            )
+
+            if isinstance(dataset, indexed_dataset.MMapIndexedDataset):
+                builder = indexed_dataset.MMapIndexedDatasetBuilder(
+                    args.output_prefix + ".bin", dtype=dataset._index.dtype
+                )
+            else:
+                builder = indexed_dataset.IndexedDatasetBuilder(
+                    args.output_prefix + ".bin", dtype=dataset.dtype
+                )
+
+            del dataset
+        print_datetime(f"start merge file {prefix}")
+        builder.merge_file_(os.path.join(args.input, prefix))
+        print_datetime(f"end merge file {prefix}")
+
+    print_datetime("start finalize")
+    builder.finalize(args.output_prefix + ".idx")
+    print_datetime("end finalize")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    group = parser.add_argument_group(title="input data")
+    group.add_argument(
+        "--input",
+        type=str,
+        required=True,
+        help="Path to directory containing all document files to merge",
+    )
+    group.add_argument("--data_impl", type=str, required=True, help="data_impl")
+
+    group = parser.add_argument_group(title="output data")
+    group.add_argument(
+        "--output-prefix",
+        type=str,
+        required=True,
+        help="Path to binary output file without suffix",
+    )
+
+    args = parser.parse_args()
+
+    assert os.path.isdir(
+        args.input
+    ), f"ERROR: {args.input} is not a directory or does not exist"
+
+    assert os.path.isdir(
+        os.path.dirname(os.path.abspath(args.output_prefix))
+    ), f"ERROR: output directory for {args.output_prefix} is not a directory or does not exist"
+
+    main(args)
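Reviewer note on `merge.py`: the prefix discovery loop in `main` asserts on any regular file that lacks a `.bin`/`.idx` partner, so a stray file such as a README in the input directory aborts the merge. The following is a sketch of a more tolerant variant, not the PR's code. The function name `find_dataset_prefixes` is hypothetical, and it only covers discovery, not the `indexed_dataset` builder calls.

```python
import os


def find_dataset_prefixes(input_dir):
    """Return sorted prefixes that have both a .bin and a .idx file.

    Hypothetical variant of the loop in main(): it ignores files with other
    extensions and raises FileNotFoundError instead of using assert.
    """
    prefixes = set()
    for basename in os.listdir(input_dir):
        if not os.path.isfile(os.path.join(input_dir, basename)):
            continue  # skip sub-directories, as merge.py does
        prefix, ext = os.path.splitext(basename)
        if ext not in (".bin", ".idx"):
            continue  # unrelated file; merge.py itself would fail its assert here
        pair = prefix + (".bin" if ext == ".idx" else ".idx")
        if not os.path.isfile(os.path.join(input_dir, pair)):
            raise FileNotFoundError(f"{pair} not provided for {basename}")
        prefixes.add(prefix)
    return sorted(prefixes)
```

Raising an exception rather than asserting keeps the check active when Python runs with `-O`, which strips `assert` statements.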