This repository contains the code and data for our paper, Arabic Automatic Story Generation with Large Language Models, published at the 2nd edition of the ArabicNLP conference, co-located with ACL 2024 in Bangkok, Thailand.
AraStories is a comprehensive set of models and datasets designed to facilitate research on story generation in Modern Standard Arabic (MSA) and Arabic dialects (Egyptian and Moroccan in this work). The dataset includes a wide variety of stories and their corresponding prompts, which challenge models to demonstrate a solid command of Arabic narrative structure and of common knowledge expressed in Arabic.
- data/: Contains the AraStories dataset in CSV format.
- src/: Source code for preprocessing, training, and evaluation.
- models/: Pre-trained models and checkpoints.
- notebooks/: Jupyter notebooks for data exploration and analysis.
- results/: Results and evaluation metrics.
The AraStories dataset consists of three CSV files, one for each of the three Arabic varieties covered in our work: Modern Standard Arabic (MSA), Egyptian, and Moroccan. Each file contains two columns:
- Story: A diverse collection of Arabic stories from various genres and sources.
- Prompt: Prompts used to generate those stories.
You can download the dataset from the data folder in this GitHub repo.
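Once a file is downloaded, it can be read with Python's standard `csv` module. The sketch below is only an illustration: it assumes the header row uses the column names `Story` and `Prompt` exactly as listed above and that the files are UTF-8 encoded. The helper name `load_stories` and the file name in the usage note are hypothetical, so substitute the actual names found in the data folder.

```python
import csv
from pathlib import Path


def load_stories(csv_path):
    """Read one AraStories CSV file into a list of {"prompt", "story"} dicts.

    Assumes a header row with "Story" and "Prompt" columns (an assumption
    based on the column list above).
    """
    records = []
    # "utf-8-sig" reads UTF-8 with or without a byte-order mark;
    # newline="" lets the csv module handle line breaks inside quoted stories.
    with Path(csv_path).open(encoding="utf-8-sig", newline="") as handle:
        for row in csv.DictReader(handle):
            records.append({"prompt": row["Prompt"], "story": row["Story"]})
    return records
```

For example, `load_stories("data/msa.csv")` would return one dictionary per story, provided a file with that (hypothetical) name exists.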
- Python 3.10+
- Required Python libraries are listed in requirements.txt.
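To confirm that the active interpreter satisfies the Python 3.10+ requirement before installing anything, a check along these lines may help. It is a minimal sketch, not part of the repository; the helper name `meets_minimum` is hypothetical.

```python
import sys


def meets_minimum(version_info, minimum=(3, 10)):
    """Return True if a (major, minor, ...) version tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report whether the interpreter running this script is recent enough.
print("Python 3.10+ available:", meets_minimum(sys.version_info))
```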
- Clone the repository:
git clone https://github.com/UBC-NLP/arastories.git
cd arastories
- Install the required packages:
pip install -r requirements.txt
To preprocess the data, run the following command:
python src/preprocess.py --input data/raw --output data/processed
To train a model on the AraStories dataset, use:
python src/train.py --config configs/train_config.json
To evaluate a trained model, run:
python src/evaluate.py --model models/model_checkpoint.pth --data data/processed
Explore the dataset and results using the provided Jupyter notebooks in the notebooks/ directory.
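As a starting point for exploration, story lengths can be summarized per variety. The following is a minimal sketch and does not reproduce the contents of the provided notebooks; the function name `story_length_stats` is hypothetical, and whitespace splitting is only a rough proxy for Arabic word counts.

```python
import statistics


def story_length_stats(stories):
    """Summarize story lengths measured in whitespace-separated tokens."""
    lengths = [len(story.split()) for story in stories]
    if not lengths:
        # Avoid statistics errors on an empty collection.
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "median": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }
```

Passing the list of story strings from one CSV file gives a quick view of how long the stories are in that variety, which can then be compared across MSA, Egyptian, and Moroccan.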
We provide benchmark results for various models trained on the AraStories dataset. Detailed results and evaluation metrics are available in the results/ directory.
If you use AraStories in your research, please cite our paper:
@misc{elshangiti2024arabicautomaticstorygeneration,
title={Arabic Automatic Story Generation with Large Language Models},
author={Ahmed Oumar El-Shangiti and Fakhraddin Alwajih and Muhammad Abdul-Mageed},
year={2024},
eprint={2407.07551},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.07551},
}
This project is licensed under the MIT License - see the LICENSE file for details.
Like other generative models, our models can reflect biases present in their training data. Any use of the models should take this into account.
We acknowledge support from Canada Research Chairs (CRC), the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 435-2018-0576; 895-2020-1004; 895-2021-1008), the Canada Foundation for Innovation (CFI; 37771), the Digital Research Alliance of Canada, and UBC ARC-Sockeye.