revision for chatbot instruction data (intel#1046)

* revision
  Signed-off-by: XuhuiRen <[email protected]>
* add
  Signed-off-by: XuhuiRen <[email protected]>

---------

Signed-off-by: XuhuiRen <[email protected]>
mengfei25 · Jun 30, 2023 · f8a6ea6
1 parent 55aa6c3

Showing 2 changed files with 42 additions and 10,837 deletions.
diff --git a/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/data/instruction_data.md b/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/data/instruction_data.md
@@ -0,0 +1,42 @@
+Instruction Dataset
+======
+1. [Introduction](#introduction)
+2. [General Domain](#general-domain)
+3. [Chinese Domain](#chinese-domain)
+4. [Medical Domain](#medical-domain)
+
+## Introduction
+
+Instruction-following models, such as ChatGPT, Claude, and MPT-Chat, have garnered significant attention from both academia and industry due to their strong performance on a wide range of language tasks. Many developers are entering this field, seeking to use instruction tuning to fine-tune customized models for downstream application scenarios.
+
+In this repository, we offer a range of solutions to help users customize their own large language models. Currently, our target application scenarios cover the general, Chinese, and medical domains. We are also working to add more business-related domains in the near future.
+
+## General Domain
+[Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) uses OpenAI's text-davinci-003 as a teacher model to generate instruction data automatically, so developers do not have to annotate it by hand. Following the self-instruct method, text-davinci-003 is prompted with a small set of seed tasks and produces new instruction-following demonstrations. In total, 52,000 demonstrations were generated from 175 manually written seed tasks, covering a diverse range of topics in the general domain. You can download the [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca#data-release) to inspect it yourself.
+
+Each Alpaca-style instruction-following example consists of three fields: `instruction`, `input`, and `output`:
+- `instruction`: `str`, describes the task the model should perform. There are 52,000 distinct instructions in the dataset.
+- `input`: `str`, optional context or input for the task. For instance, if the instruction is "Summarize the following article," the input would be the article itself. Approximately 40% of the examples include an input.
+- `output`: `str`, the answer to the instruction, generated by `text-davinci-003`.
+
+We have evaluated Alpaca with our own fine-tuning recipe and reported the results in this [blog](https://medium.com/intel-analytics-software/create-your-own-chatbot-on-cpus-b8d186cfefb2).
+
+## Chinese Domain
+Considering the limited availability of large language models capable of answering Chinese queries, we have conducted further investigations to enable our model to respond to such queries. We have identified two relevant resources that can be downloaded based on individual requirements:
+
+[Chinese-Alpaca](https://github.com/A-baoYang/alpaca-7b-chinese/tree/main/data/general): This dataset is a direct translation of the English Alpaca dataset into Chinese. It serves to activate the model's ability to engage in conversations in Chinese.
+
+[Chinese-legal](https://raw.githubusercontent.com/AndrewZhe/lawyer-llama/main/data/judical_examination.json): This dataset comprises ChatGPT's answers to questions from the China National Judicial Examination.
+
+Both datasets follow the Alpaca-style format, making them compatible with our approach. You can download the datasets directly and utilize our provided recipe to fine-tune your model. By leveraging these resources, you can enhance your model's capability to understand and respond to Chinese queries effectively.
+
+## Medical Domain
+The medical domain is a popular and promising application area for large language models. These models can assist doctors in analyzing patient conditions and providing suitable advice. We have identified two available instruction datasets in the medical domain:
+
+1. [HealthCareMagic](https://drive.google.com/file/d/1lyfqIwlLSClhgrCutWuEe_IACNq6XNUt/view): This dataset converts conversations from an online medical advice website, [HealthCareMagic](https://www.healthcaremagic.com/), into a role-play format. Each instance includes an `instruction` that asks the model to act as a doctor and provide suggestions. The `input` is the patient's self-description of their condition, and the corresponding `output` is the doctor's advice for the patient.
+
+2. [iCliniq](https://drive.google.com/file/d/1ZKbqgYqWc7DJHs3N9TQYQVPdDQmZaClA/view): This dataset comprises patient self-descriptions, answers from online doctors on [iCliniq](https://www.icliniq.com/), answers from ChatGPT, and answers from [ChatDoctor](https://github.com/Kent0n-Li/ChatDoctor).
+
+Using the provided datasets, you can easily transform your local model into an online doctor by following our recipe. This allows your model to provide medical advice and engage in conversations with patients effectively.
+
+By leveraging these datasets, you can enhance your model's capability to understand patient descriptions and offer appropriate recommendations, enabling it to function as an online doctor.
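
The Alpaca-style record format described in the added file (`instruction`, `input`, `output`, all strings, with `input` optional) can be checked before fine-tuning. The following is a minimal sketch, not part of this commit: `validate_record`, `build_prompt`, the prompt template, and the sample records are all hypothetical and only illustrate the format; the repository's own recipe may render prompts differently.

```python
import json

# Field names taken from the Alpaca-style format described in instruction_data.md.
REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_record(record):
    """Return a list of problems found in one record (empty list if it is valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field {field} must be a string")
    # `input` may be an empty string, but `instruction` and `output` should carry text.
    for field in ("instruction", "output"):
        if isinstance(record.get(field), str) and not record[field].strip():
            problems.append(f"field {field} is empty")
    return problems


def build_prompt(record):
    """Render a record as a training prompt.

    The template is an assumption made for illustration only.
    """
    if record["input"].strip():
        return (
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{record['instruction']}\n\n### Response:\n"


# Made-up sample data: one general-domain record without an input and one
# role-play record in the style described for the medical datasets.
SAMPLE_JSON = """
[
  {"instruction": "Name three primary colors.",
   "input": "",
   "output": "Red, blue, and yellow."},
  {"instruction": "Act as a doctor and give advice based on the patient's description.",
   "input": "I have had a mild headache for two days.",
   "output": "Rest, drink water, and see a doctor if it persists."}
]
"""

if __name__ == "__main__":
    for index, record in enumerate(json.loads(SAMPLE_JSON)):
        issues = validate_record(record)
        if issues:
            print(f"record {index}: {issues}")
        else:
            print(build_prompt(record) + record["output"])
            print("-" * 20)
```

Running the validator over a downloaded dataset before training is a cheap way to catch records with a missing or misspelled field name.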