沒有標記數據集，如何做大模型指令微調？介紹一款有潛力的標記數據集生成模型

發布于 2024-6-20 09:49

瀏覽

0收藏

在構建大模型應用時，通常有兩種方式來改進效果，一種是構建外部知識庫，利用RAG來完成。但RAG并不是萬能的，對于特定領域的LLM應用，以及無需示例，就能完成特定任務等場合就需要進行微調。然而，微調本身相較于RAG來講，需要更多的算力資源和時間周期，但更大的瓶頸在于微調需要標記過的樣本數據。這對于很多企業來講，很難有這樣高質量的數據積累，他們的數據通常是未經標記的，可能是一篇一篇的文章或者規章制度，并不是以問答對的方式而存在。

為了完成微調，傳統做法就是通過人工的方式進行問答對構造，在此基礎上斯坦福研究團隊也提出了Alpaca使用GPT-4這樣的強模型模仿種子樣本生成標記數據集。

沒有標記數據集，如何做大模型指令微調？介紹一款有潛力的標記數據集生成模型-AI.x社區

??https://arxiv.org/pdf/2402.18334??

筆者介紹一個新的樣本數據生成的項目Bonito（https://github.com/BatsResearch/bonito），一個用于條件任務生成的開源模型，它可以將未標注的文本轉換為特定任務的訓練數據集，用于指令微調。根據論文介紹，該模型本身是在 mistralai/Mistral-7B-v0.1 的基礎上，利用包含 165 萬個示例的數據集（https://huggingface.co/datasets/BatsResearch/ctga-v1）進行微調，支持多種任務類型，包括多選題回答、是非題回答、自然語言推理、主題分類等。

沒有標記數據集，如何做大模型指令微調？介紹一款有潛力的標記數據集生成模型-AI.x社區

Benito項目本身是一個數據生成的LLM應用，模型由vllm加速，使用方法比較簡單。基本過程為將文檔內容提取出來（datasets），比如PDF等，然后指定生成任務類型，并將其傳給bonito.generate_task即可。

Bonito定義：

class Bonito(LLM, AbstractBonito):
    def generate_tasks(
        self,
        text_dataset: Dataset,
        context_col: str,
        task_type: str,
        sampling_params: SamplingParams,
        **kwargs,
    ):
        """
        Generates tasks using the Bonito model.


        This method takes a text dataset, a context column name,
        a task type, and sampling parameters, and generates tasks
        using the Bonito model. It processes the input dataset,
        generates outputs, collects multiple generations into
        one dataset object, and filters out the examples that
        cannot be parsed.


        Args:
            text_dataset (Dataset): The dataset that provides the text
                for the tasks.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (SamplingParams): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.


        Returns:
            Dataset: The synthetic dataset with the generated tasks.
        """
        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col, **kwargs
        )
        outputs = self.generate(processed_dataset["input"], sampling_params)


        # collect multiple generations into one dataset object
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            for output in outputs[i].outputs:
                examples.append(
                    {"context": example[context_col], "prediction": output.text.strip()}
                )


        synthetic_dataset = Dataset.from_list(examples)


        # filter out the examples that cannot be parsed
        synthetic_dataset = self._postprocess_dataset(
            synthetic_dataset, context_col="context", **kwargs
        )


        return synthetic_dataset

基本使用：

from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset


# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")


# load dataset with unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))


# Generate synthetic instruction tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)

如果想要在顯存較小的GPU上運行，如T4，可對模型進行量化。

from typing import Optional, List, Dict
from datasets import Dataset
from awq import AutoAWQForCausalLM
from bonito import AbstractBonito
from transformers import AutoTokenizer




class QuantizedBonito(AbstractBonito):
    def __init__(self, model_name_or_path):
        self.model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True).cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)


    def generate_task(
        self,
        unannotated_paragraph: str,
        task_type: str,
        sampling_params: dict,
    ) -> Dict:
        """
        Generates synthetic instruction tuning pair using the Quantized Bonito model.
        This method takes a text unannotated text, a task type, and sampling parameters,
        and generates synthetic input-output pair.


        Args:
            unannotated_paragraph (str): The unannotated text or a paragraph
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (dict): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.


        Returns:
            Dict: The synthetic input-output pair for the task type.
        """


        text_dataset = Dataset.from_list([{"input": unannotated_paragraph}])


        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col="input"
        )


        outputs = self._generate_text(processed_dataset["input"], sampling_params)
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            output = outputs[i]
            example["prediction"] = output.strip()
            examples.append(example)


        synthetic_dataset = Dataset.from_list(examples)


        # filter out the examples that cannot be parsed
        synthetic_dataset_dict = self._postprocess_dataset(
            synthetic_dataset, context_col="input"
        ).to_list()[0]


        return synthetic_dataset_dict


    def _generate_text(
        self,
        dataset: Dataset,
        sampling_params: dict,
        ) -> List[str]:
        """
        Generate text using huggingface transformers generate function.


        This method takes a dataset of prompts, encodes them,
        generates text using the model, decodes the generated
        text, and appends it to a list.


        Args:
            dataset (Dataset): A dataset containing prompts for text generation.
            sampling_params (dict): Parameters for sampling during generation.


        Returns:
            List[str]: A list of generated texts corresponding to the prompts.
        """
        generated_texts = []


        for prompt in dataset:
            input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
            input_ids = input_ids.cuda()


            output = self.model.generate(
                input_ids,
                do_sample=True,
                **sampling_params
            )


            generated_text = self.tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
            generated_texts.append(generated_text)


        return generated_texts

以tasktype為ynqa，即yes-or-no問題為例，其生成的結果如下：

sampling_params = {'max_new_tokens':256, 'top_p':0.95, 'temperature':0.7, 'num_return_sequences':1}
synthetic_dataset = bonito.generate_task(
    unannotated_paragraph,
    task_type="ynqa",
    sampling_params=sampling_params
)
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset["input"]}')
pprint(f'Output: {synthetic_dataset["output"]}')


'----Generated Instructions----'
('Input: Based on the following passage, is a written communication '
 'confidential? 1. “Confidential Information”, whenever used in this '
 'Agreement, shall mean any data, document, specification and other '
 'information or material, that is delivered or disclosed by UNHCR to the '
 'Recipient in any form whatsoever, whether orally, visually in writing or '
 'otherwise (including computerized form), and that, at the time of disclosure '
 'to the Recipient, is designated as confidential.')
'Output: Yes'

其中，tasktype支持的任務類型如下：