This repository provides the dataset and code for the EMNLP 2023 Findings paper *On the Zero-Shot Generalization of Machine-Generated Text Detectors*.
Authors: Xiao Pu, Jingyu Zhang, Xiaochuang Han, Yulia Tsvetkov, and Tianxing He
Code coming soon...
The dataset introduced in the paper is located in this directory.
Each subset contains 5,000 real-world human-written samples (labelled as 1) and 5,000 machine-generated samples (labelled as 0), with a train/dev/test split ratio of 8:1:1. The subsets cover three domains (a loading sketch follows the list):
- news: RealNews
- reviews: IMDBreview
- knowledge: Wikipedia
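As a minimal loading sketch, assuming each subset directory holds JSONL files named `train.jsonl`, `dev.jsonl`, and `test.jsonl` with `text` and `label` fields (the exact file layout and field names in this repository may differ):

```python
import json
from pathlib import Path

def load_split(subset_dir: str, split: str):
    """Load one split (train/dev/test) of a subset as a list of dicts.

    Assumes JSONL files with `text` and `label` fields; adjust to the
    actual file layout of this repository if it differs.
    """
    samples = []
    path = Path(subset_dir) / f"{split}.jsonl"
    with path.open() as f:
        for line in f:
            record = json.loads(line)
            samples.append({"text": record["text"], "label": record["label"]})
    return samples

train = load_split("news", "train")  # label 1 = human-written, 0 = machine-generated
```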
The machine-generated samples are produced by the following models:

| Model | Parameters |
| --- | --- |
| GPT-1* | 117M |
| GPT-2 small* | 124M |
| GPT-2 medium* | 355M |
| GPT-2 large* | 774M |
| GPT-2 XL* | 1.5B |
| text-davinci-003 | 175B |
| GPT-4 | 1.7T |
| GPT-Neo small | 125M |
| GPT-Neo medium | 1.3B |
| GPT-Neo large | 2.7B |
| GPT-J | 6B |
| LLaMA 7B | 7B |
| LLaMA 13B | 13B |
\* We fine-tune GPT-1 and the GPT-2 models on 5,000 human-written samples to make their generations more human-like, because we find that text generated by these models in a zero-shot way is flooded with disfluent and repetitive sentences.
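As a minimal sketch of this fine-tuning step with Hugging Face `transformers` (the base model, hyperparameters, and data handling below are illustrative assumptions, not the exact setup from the paper):

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative only: fine-tune GPT-2 on the human-written training samples.
# `train` comes from the loading sketch above.
human_texts = [s["text"] for s in train if s["label"] == 1]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = Dataset.from_dict({"text": human_texts})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-human-ft",
        num_train_epochs=3,          # assumed value
        per_device_train_batch_size=8,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```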
Decoding parameters used for generation (a sampling sketch follows this list):

- top_p: 0.96
- top_k: 50
- temperature: 2.0[^1]
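A minimal sketch of sampling with these parameters, including the regeneration-by-length loop described in the footnote (the model choice and prompt handling are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative; any model from the table applies
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_sample(prompt: str, min_tokens: int = 100, max_tokens: int = 120) -> str:
    """Sample a continuation, regenerating until its length falls in
    [min_tokens, max_tokens] tokens (see footnote 1)."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    while True:
        with torch.no_grad():
            output = model.generate(
                input_ids,
                do_sample=True,
                top_p=0.96,
                top_k=50,
                temperature=2.0,
                max_new_tokens=max_tokens,
                pad_token_id=tokenizer.eos_token_id,
            )
        new_tokens = output[0, input_ids.shape[1]:]
        if min_tokens <= new_tokens.shape[0] <= max_tokens:
            return tokenizer.decode(new_tokens, skip_special_tokens=True)
```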
Please cite our paper if you use our dataset or code in your work:
@inproceedings{pu-etal-2023-zero,
    title = "On the Zero-Shot Generalization of Machine-Generated Text Detectors",
    author = "Pu, Xiao and
      Zhang, Jingyu and
      Han, Xiaochuang and
      Tsvetkov, Yulia and
      He, Tianxing",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.318",
    doi = "10.18653/v1/2023.findings-emnlp.318",
    pages = "4799--4808"
}
[^1]: To ensure all samples are of similar length, we regenerate a sample if its token count falls outside the range of 100–120. The temperature is therefore set to 2.0 to avoid repeatedly generating the same text.