
SFTLLMs_for_ChemText_Mining

Download

git clone https://github.com/zw-SIMM/SFTLLMs_for_chemtext_mining
cd SFTLLMs_for_chemtext_mining

🖊 Datasets and Codes

Preprocessed data, fine-tuning code, and README workflows are placed in the corresponding folders:

  • Paragraph2Compound/

  • Paragraph2RXNRole/prod/ and Paragraph2RXNRole/role/

  • Paragraph2MOFInfo/

  • Paragraph2NMR/

  • Paragraph2Action/ (the dataset is derived from the Pistachio dataset, which is available upon request)

💿 Fine-tuning ChatGPT (GPT-3.5-Turbo) and Prompt-Engineering GPT-4

Environment (OS: Windows or Linux)

pip install openai
pip install pandas

Note: the fine-tuning code changed slightly when the openai package was updated to v1.0.0+. Here, we provide code for the latest version.
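For orientation, a minimal sketch of launching a fine-tuning job with the v1.0.0+ openai client; the training file name is a placeholder, not the repository's actual setting:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a chat-format JSONL training file (placeholder path)
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# Launch a fine-tuning job on GPT-3.5-Turbo
job = client.fine_tuning.jobs.create(training_file=train_file.id, model="gpt-3.5-turbo")
print(job.id, job.status)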

Implementation

Specific scripts for each task are in the corresponding folders.

All notebooks for fine-tuning and prompt-engineering GPTs (GPT-4, GPT-3.5), as well as for evaluating each task, have been released!

Demo of Fine-tuning ChatGPT on a Small Dataset

Here, we give an example notebook of fine-tuning ChatGPT on 25 Paragraph2NMR examples in demo/fine-tuning_chatgpt_on_25_paragraph2NMR_data.ipynb (a data-format sketch follows the list), including:

  • Preprocessing
  • Training
  • Inferencing
  • Evaluating
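For the preprocessing step, the OpenAI fine-tuning API expects chat-format JSONL. A minimal sketch, assuming hypothetical "paragraph" and "nmr" columns and a placeholder file name (the notebook's actual names may differ):

import json
import pandas as pd

df = pd.read_csv("paragraph2nmr_train.csv")  # placeholder file name

with open("train.jsonl", "w") as f:
    for _, row in df.iterrows():
        record = {"messages": [
            {"role": "system", "content": "Extract the NMR data from the paragraph."},
            {"role": "user", "content": row["paragraph"]},
            {"role": "assistant", "content": row["nmr"]},
        ]}
        f.write(json.dumps(record) + "\n")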

📀 Fine-tuning Open-source Language Models (Mistral, Llama3, Bart, T5)

Environment (Linux)

mamba create -n llm python=3.10
mamba activate llm 
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas numpy ipywidgets tqdm
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch==2.1.2  transformers==4.38.2 datasets tiktoken wandb==0.11 openpyxl
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple peft==0.8.0 accelerate bitsandbytes safetensors jsonlines
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple vllm==0.3.1
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple trl==0.7
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorboardX tensorboard
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple textdistance nltk matplotlib seaborn seqeval
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple modelscope
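Optionally, a quick sanity check that the pinned packages import correctly and a GPU is visible:

import torch, transformers, peft, trl
print(torch.__version__, transformers.__version__, peft.__version__, trl.__version__)
print("CUDA available:", torch.cuda.is_available())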

Pretrained Model Downloads

Open-source pretrained models (Llama3, Llama2, Mistral, Bart, T5) can be downloaded from Hugging Face or ModelScope.

Here is an example script for downloading pretrained models from ModelScope on a Linux server (a Hugging Face alternative follows):

from modelscope import snapshot_download
model_dir = snapshot_download("LLM-Research/Meta-Llama-3-8B-Instruct", revision='master', cache_dir='/home/pretrained_models')
model_dir = snapshot_download('AI-ModelScope/Mistral-7B-Instruct-v0.2', revision='master', cache_dir='/home/pretrained_models')
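Equivalently, a sketch of downloading from Hugging Face with huggingface_hub (installed as a dependency of transformers); note that gated repos such as Llama require accepting the license and logging in with a token:

from huggingface_hub import snapshot_download

# Cache path is a placeholder; gated repos need HF_TOKEN or `huggingface-cli login`
model_dir = snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct", cache_dir="/home/pretrained_models")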

Fine-tuning

The code and tutorials for fine-tuning language models (ChatGPT, Llama3, Llama2, Mistral, Bart, T5) on each task are in the corresponding folders.
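As rough orientation (not the repository's actual training script), a minimal LoRA fine-tuning sketch against the pinned peft==0.8.0 / trl==0.7 versions; the model path, dataset file, "text" field, and hyperparameters are all placeholders:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_path = "/home/pretrained_models/LLM-Research/Meta-Llama-3-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Placeholder dataset: one prompt+completion string per example in a "text" field
dataset = load_dataset("json", data_files="train.jsonl")["train"]

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=1024,
    peft_config=lora_config,
    args=TrainingArguments(output_dir="outputs", num_train_epochs=3, per_device_train_batch_size=2),
)
trainer.train()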
