
TimeChara

This is the official repository of our ACL 2024 Findings paper:
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models

[Figure: TimeChara example]

Please cite our work if you find the resources in this repository useful:

@inproceedings{ahn2024timechara,
    title={TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models},
    author={Jaewoo Ahn and Taehyun Lee and Junyoung Lim and Jin-Hwa Kim and Sangdoo Yun and Hwaran Lee and Gunhee Kim},
    booktitle={Findings of ACL},
    year=2024
}

For a brief summary of our paper, please see this webpage.

TimeChara

You can load TimeChara from the Hugging Face Hub as follows:

from datasets import load_dataset

dataset = load_dataset("ahnpersie/timechara")
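
As a quick sanity check, you can inspect the loaded splits and example fields; the split names below are an assumption based on the validation/test description that follows, so check dataset.keys() if they differ:

print(dataset)             # lists the available splits and their sizes
print(dataset["test"][0])  # fields of one example (split name assumed)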
Details on TimeChara

(1) Validation set (600 examples): randomly sampled from the test set.

(2) Test set (10,895 examples): the full benchmark, including the 600 validation examples.

(3) We provide create_dataset.py to automatically construct TimeChara; a Python driver sketch follows this list. Note that we currently support only the Harry Potter series, whose source (en_train_set.json) can be obtained from the HPD dataset.

python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fact_event_summary
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fact_freeform_question
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fake_event_summary
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fake_freeform_question
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode create_single_turn_dataset
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_gold_response

(3-1) To use the OpenAI API for GPT-4, you need to export your OPENAI_API_KEY:

export OPENAI_API_KEY='your-openai-api-key'
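
If you prefer to drive these steps from Python, here is a minimal sketch that checks the API key and runs the six modes in order, assuming only the CLI flags shown above:

import os
import subprocess

# The generate_* modes call the OpenAI API, so fail early if the key is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("export OPENAI_API_KEY before running create_dataset.py")

# The six creation modes, in the order listed above.
modes = [
    "generate_fact_event_summary",
    "generate_fact_freeform_question",
    "generate_fake_event_summary",
    "generate_fake_freeform_question",
    "create_single_turn_dataset",
    "generate_gold_response",
]
for mode in modes:
    subprocess.run(
        ["python", "create_dataset.py",
         "--series_name", "harry_potter",
         "--dataset_dir", "your/dataset/dir",
         "--create_mode", mode],
        check=True,  # stop the pipeline if any step fails
    )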

Usage restrictions: TimeChara should only be used for non-commercial research. For more details, refer to the Ethics Statement in the paper.

Running evaluation on TimeChara

We recommend using Anaconda. The following command creates a new conda environment named timechara with all the dependencies.

conda env create -f environment.yml

To activate the environment:

conda activate timechara

First, generate responses to TimeChara's questions by running the following command (the commented lines show the other supported methods):

python generate.py --model_name gpt-4o-2024-05-13 --method_name zero-shot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name zero-shot-cot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name few-shot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name self-refine
# python generate.py --model_name gpt-4o-2024-05-13 --method_name rag-cutoff
# python generate.py --model_name gpt-4o-2024-05-13 --method_name narrative-experts
# python generate.py --model_name gpt-4o-2024-05-13 --method_name narrative-experts-rag-cutoff
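
To sweep all seven methods for a single model, a small wrapper along these lines (a sketch reusing the flags above, not part of the repository) may help:

import subprocess

methods = ["zero-shot", "zero-shot-cot", "few-shot", "self-refine",
           "rag-cutoff", "narrative-experts", "narrative-experts-rag-cutoff"]
for method in methods:
    subprocess.run(
        ["python", "generate.py",
         "--model_name", "gpt-4o-2024-05-13",
         "--method_name", method],
        check=True,
    )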
Details on Generation

(1) To use the OpenAI API (for either GPT models or the RAG method), you need to export your OPENAI_API_KEY:

export OPENAI_API_KEY='your-openai-api-key'

(2) To use RAG, manually download the Chroma DB files from this link, then unzip and move them:

unzip chroma_db_files.zip
mv text-embedding-ada-002 methods/rag
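
To sanity-check the downloaded index, a minimal sketch using the chromadb client could look like the following; the collection name harry_potter is hypothetical (check client.list_collections() for the real one), and the repository's own RAG code under methods/rag is what actually consumes these files:

import os
import chromadb
from chromadb.utils import embedding_functions

# Open the unzipped index at the path used by the mv command above.
client = chromadb.PersistentClient(path="methods/rag/text-embedding-ada-002")

# Query embeddings must match the index, which was built with
# text-embedding-ada-002 (hence the directory name).
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002",
)

# "harry_potter" is a hypothetical collection name.
collection = client.get_collection(name="harry_potter", embedding_function=ef)
results = collection.query(query_texts=["Who teaches Potions at Hogwarts?"], n_results=3)
print(results["documents"])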

Finally, evaluate the generated responses by running the following command:

python evaluate.py --eval_model_name gpt-4-1106-preview --model_name gpt-4o-2024-05-13 --method_name zero-shot
Details on Evaluation

(1) We do not support AlignScore directly; to evaluate generated responses with AlignScore instead of GPT-4 judges, use the independent AlignScore GitHub repository:

from alignscore import AlignScore

# gold_responses and generated_responses are parallel lists of strings,
# e.g., the gold and model responses saved under outputs.
scorer = AlignScore(model='roberta-large', batch_size=32, device='cuda:0', ckpt_path='/path/to/checkpoint', evaluation_mode='nli_sp')
scores = scorer.score(contexts=gold_responses, claims=generated_responses)
scores = [x * 100 for x in scores]  # rescale scores from [0, 1] to [0, 100]
print(f"avg. AlignScore (# {len(scores)}) = {sum(scores)/len(scores)}")

All generation and evaluation results are saved under the outputs directory.

Have any questions?

Please contact Jaewoo Ahn (jaewoo.ahn at vision.snu.ac.kr).

License

This repository is MIT licensed. See the LICENSE file for details.
