InstructGPT Efficient Practice: [DeepSpeed (Part 3) _Model

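The flow first checks whether a dataset cache exists (1). If it does, the cache is read and returned directly (14). If it does not, the cache is built (2-13): the full dataset is read (3-6); the code then checks whether a split index has been cached (this index is used to pull the corresponding samples out of the raw data to form a sub-dataset), reading and returning it if present and building it otherwise (this part is not essential, so it is not covered further) (7-9); the sub-dataset is selected from the full dataset by index, processed into the format required by the corresponding phase, encoded in advance, and wrapped in a dataset class to obtain the final dataset instance (10-12); finally, that instance is saved (13).
0.2.3 Key code explained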
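The cache-or-build flow described in the numbered steps can be sketched as follows. This is a minimal illustration, not the DeepSpeed-Chat implementation: `load_raw_samples`, `get_or_build_index`, `create_dataset`, the file names `dataset.pkl` and `index.pkl`, and the use of `pickle` for storage are all assumptions introduced here, and the "processing" step is reduced to simple string formatting.

```python
import os
import pickle
import random
import tempfile


def load_raw_samples():
    """Hypothetical stand-in for reading the full dataset (steps 3-6)."""
    return [{"prompt": f"question {i}", "chosen": f"answer {i}"} for i in range(10)]


def get_or_build_index(index_path, total, seed):
    """Reuse a cached split index if present, otherwise build and cache one (steps 7-9)."""
    if os.path.isfile(index_path):
        with open(index_path, "rb") as f:
            return pickle.load(f)
    index = list(range(total))
    random.Random(seed).shuffle(index)
    index = index[: total // 2]  # assumption: this split keeps half of the samples
    with open(index_path, "wb") as f:
        pickle.dump(index, f)
    return index


def create_dataset(cache_dir, seed=1234):
    dataset_path = os.path.join(cache_dir, "dataset.pkl")
    # Step 1: look for a dataset cache; step 14: return it directly if found.
    if os.path.isfile(dataset_path):
        with open(dataset_path, "rb") as f:
            return pickle.load(f)
    # Steps 2-13: no cache yet, so build it.
    raw = load_raw_samples()
    index = get_or_build_index(os.path.join(cache_dir, "index.pkl"), len(raw), seed)
    # Steps 10-12: select the subset by index and convert it to the phase format.
    subset = [raw[i] for i in index]
    processed = [
        " Human: " + s["prompt"] + " Assistant: " + s["chosen"] for s in subset
    ]
    # Step 13: store the result so the next call can skip the build.
    with open(dataset_path, "wb") as f:
        pickle.dump(processed, f)
    return processed
```

Calling `create_dataset` twice on the same directory builds the cache on the first call and reads it back on the second, so both calls return the same samples.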
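Several points in the process above deserve attention, namely the parts set in bold in the text description and highlighted in the UML sequence diagram.
The source code for the two parts is walked through in detail below.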
0.2.3.1 Custom PromptRawDataset class
This corresponds to steps 3-6 of the UML sequence diagram.
```python
# applications/DeepSpeed-Chat/training/utils/data/raw_datasets.py
from datasets import load_dataset


class PromptRawDataset(object):

    def __init__(self, output_path, seed, local_rank, dataset_name):
        """
        Initialize.
        :param output_path: output cache path.
        :param seed: random seed.
        :param local_rank: rank of the current process.
        :param dataset_name: dataset name; the dataset to read is later
            selected by this name.
        """
        self.dataset_name = dataset_name
        # Filesystem-safe form of the name. Assumption: derived here by
        # replacing "/" and "-" with "_"; subclasses normally override it.
        self.dataset_clean_name = dataset_name.replace("/", "_").replace("-", "_")
        self.output_path = output_path
        self.seed = seed
        self.local_rank = local_rank
        # load_dataset comes from the datasets library and can read data in
        # several file formats such as csv/json/text.
        self.raw_datasets = load_dataset(dataset_name)

    def get_train_data(self):
        """
        Get the training set.
        :return: data in dataset format
        """
        return

    def get_eval_data(self):
        """
        Get the validation set.
        :return: data in dataset format
        """
        return

    # The prompt should be in the format of: " Human: " + actual_prompt_sentence + " Assistant:"
    def get_prompt(self, sample):
        """
        Get the prompt from a single sample of the dataset.
        :param sample: an element of the dataset
        :return: the prompt. Its format must be
            "Human: {} Assistant:".format(actual_prompt_sentence)
        """
        return

    # The chosen response should be in the format of: " " + actual_response_sentence
    def get_chosen(self, sample):
        """
        Get the chosen response from a single sample of the dataset.
        "chosen" means the response that humans prefer, i.e. the
        high-scoring one.
        :param sample: an element of the dataset
        :return: chosen. Its format must be
            " {}".format(actual_response_sentence)
        """
        return

    # The rejected response should be in the format of: " " + actual_response_sentence
    # If the dataset does not have rejected response, return None
    def get_rejected(self, sample):
        """
        Get the rejected response from a single sample of the dataset.
        "rejected" means the response that humans dislike, i.e. the
        low-scoring one.
        :param sample: an element of the dataset
        :return: rejected. None if the dataset has no rejected response;
            otherwise its format must be
            " {}".format(actual_response_sentence)
        """
        return

    def get_prompt_and_chosen(self, sample):
        """
        Get the prompt and the chosen response from a single sample.
        :param sample: an element of the dataset
        :return: the prompt concatenated with chosen, following the formats
            above, i.e. "Human: {} Assistant: {}".format(
            actual_prompt_sentence, actual_response_sentence)
        """
        return

    def get_prompt_and_rejected(self, sample):
        """
        Get the prompt and the rejected response from a single sample.
        :param sample: an element of the dataset
        :return: the prompt concatenated with rejected, following the formats
            above, i.e. "Human: {} Assistant: {}".format(
            actual_prompt_sentence, actual_response_sentence)
        """
        return
```
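A custom dataset can inherit from the PromptRawDataset class above by declaring a subclass and overriding self.dataset_name and self.dataset_clean_name. Here dataset_name is the name to supply when specifying the dataset through the command-line arguments: once a subclass sets self.dataset_name to a given name, passing that same name as the dataset argument makes training read the data of that subclass. The instance methods, get_train_data() and the others, must also be overridden; their main job is to convert the raw data into the formats described in the docstrings.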
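As a rough illustration of such a subclass, the sketch below overrides the two name attributes and every accessor method. Several things here are assumptions made for the example: the class name `CustomDataset`, the dataset name `"custom"`, the field names `prompt`, `chosen`, and `rejected`, and the in-memory sample data. The base class is a trimmed stand-in for PromptRawDataset so that the example runs without the `datasets` library.

```python
class PromptRawDataset:
    """Trimmed stand-in for the base class (no datasets dependency)."""

    def __init__(self, output_path, seed, local_rank, dataset_name):
        self.output_path = output_path
        self.seed = seed
        self.local_rank = local_rank
        self.dataset_name = dataset_name
        self.dataset_clean_name = dataset_name.replace("/", "_")


class CustomDataset(PromptRawDataset):
    """Hypothetical custom dataset; names and fields are illustrative only."""

    def __init__(self, output_path, seed, local_rank, dataset_name="custom"):
        super().__init__(output_path, seed, local_rank, dataset_name)
        # The name a user would pass to select this dataset.
        self.dataset_name = "custom"
        self.dataset_clean_name = "custom"
        # In-memory data standing in for whatever the real loader returns.
        self.raw_datasets = {
            "train": [{"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}],
            "test": [{"prompt": "Name a prime.", "chosen": "7", "rejected": "8"}],
        }

    def get_train_data(self):
        return self.raw_datasets["train"]

    def get_eval_data(self):
        return self.raw_datasets["test"]

    def get_prompt(self, sample):
        # " Human: " + actual_prompt_sentence + " Assistant:"
        return " Human: " + sample["prompt"] + " Assistant:"

    def get_chosen(self, sample):
        # " " + actual_response_sentence
        return " " + sample["chosen"]

    def get_rejected(self, sample):
        # None when the sample carries no rejected response.
        if sample.get("rejected") is None:
            return None
        return " " + sample["rejected"]

    def get_prompt_and_chosen(self, sample):
        return self.get_prompt(sample) + self.get_chosen(sample)

    def get_prompt_and_rejected(self, sample):
        rejected = self.get_rejected(sample)
        if rejected is None:
            return None
        return self.get_prompt(sample) + rejected
```

Building the concatenated accessors from `get_prompt`, `get_chosen`, and `get_rejected` keeps the format rules in one place, so a change to the prompt template only has to be made once.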