
SophiaPx/detectors-generalization


On the Zero-Shot Generalization of Machine-Generated Text Detectors

This repository provides the dataset and code for the Findings of EMNLP 2023 paper On the Zero-Shot Generalization of Machine-Generated Text Detectors.

Authors: Xiao Pu, Jingyu Zhang, Xiaochuang Han, Yulia Tsvetkov and Tianxing He

Quick Links

Code coming soon...

A Dataset of Machine-Generated Text

The dataset we introduce is available under this directory.

Each subset contains 5,000 real-world human-written samples (labelled as 1) and 5,000 machine-generated samples (labelled as 0), with a train/dev/test split ratio of 8:1:1.
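As a rough illustration of the subset layout described above, the sketch below builds a placeholder subset of 5,000 human-written (label 1) and 5,000 machine-generated (label 0) samples and splits it 8:1:1. The function name `split_subset`, the seed, and the shuffle-then-slice strategy are assumptions made for illustration; the released files may have been split differently.

```python
import random

def split_subset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle samples and split them into train/dev/test by the given ratios."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: an assumption, for reproducibility
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_dev = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to the test split
    return train, dev, test

# Placeholder data: (text, label) pairs, with 1 = human-written and 0 = machine-generated.
samples = [("human text %d" % i, 1) for i in range(5000)]
samples += [("machine text %d" % i, 0) for i in range(5000)]

train, dev, test = split_subset(samples)
print(len(train), len(dev), len(test))  # 8000 1000 1000
```

With 10,000 samples per subset, an 8:1:1 ratio yields 8,000 training, 1,000 development, and 1,000 test samples.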

The subsets cover 3 domains and 13 LMs:

Model             Parameters
GPT-1             117M
GPT-2 small       124M
GPT-2 medium      355M
GPT-2 large       774M
GPT-2 xl          1.5B
text-davinci-003  175B
GPT-4             1.7T
GPT-neo small     125M
GPT-neo medium    1.3B
GPT-neo large     2.7B
GPT-J             6B
LLaMA7B           7B
LLaMA13B          13B

* We fine-tune GPT-1 and the GPT-2 models on 5,000 human-written samples to make their generations more human-like, because text generated by these models in a zero-shot manner is full of disfluent and repetitive sentences.

Hyperparameters:

  • top_p: 0.96
  • top_k: 50
  • temperature: 2.0 (see footnote 1)
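
To show how the three hyperparameters above interact, here is a minimal pure-Python sketch of one sampling step. It assumes one common ordering (temperature scaling, then top-k, then top-p); the repository's generation code is not yet released, so the function names `sampling_distribution` and `sample_token` are hypothetical and the authors' actual pipeline may differ.

```python
import math
import random

def sampling_distribution(logits, top_k=50, top_p=0.96, temperature=2.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # 1. Temperature: dividing logits by a value above 1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]
    # Softmax over the kept tokens (subtract the max for numerical stability).
    peak = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - peak) for i in kept]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for index, p in zip(kept, probs):
        nucleus.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in nucleus)
    return {index: p / norm for index, p in nucleus}

def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = sampling_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Toy logits over a 5-token vocabulary (illustrative values only).
toy_logits = [4.0, 2.0, 1.0, 0.5, -3.0]
print(sampling_distribution(toy_logits))
print(sample_token(toy_logits, rng=random.Random(0)))
```

A temperature of 2.0 spreads probability mass over more tokens than the default of 1.0, which makes repeated phrases less likely at the cost of less predictable text.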

Citation

Please cite our paper if you use our dataset or code in your work:

@inproceedings{pu-etal-2023-zero,
    title = "On the Zero-Shot Generalization of Machine-Generated Text Detectors",
    author = "Pu, Xiao  and
      Zhang, Jingyu  and
      Han, Xiaochuang  and
      Tsvetkov, Yulia  and
      He, Tianxing",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.318",
    doi = "10.18653/v1/2023.findings-emnlp.318",
    pages = "4799--4808"
}

Footnotes

  1. To keep all samples at a similar length, we regenerate any sample whose token count falls outside the range of 100 to 120. The temperature is therefore set to 2 to avoid repetitive generations.
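
The regeneration rule in the footnote can be sketched as a simple rejection loop. Everything named here is hypothetical: `generate_fn` stands in for a call to a language model, `count_tokens` for the model's tokenizer, and the retry cap `max_attempts` is an added safeguard. Treating the 100 to 120 range as inclusive is also an assumption.

```python
def generate_within_length(generate_fn, count_tokens,
                           min_tokens=100, max_tokens=120, max_attempts=50):
    """Call generate_fn until its output has between min_tokens and max_tokens tokens."""
    for _ in range(max_attempts):
        text = generate_fn()
        if min_tokens <= count_tokens(text) <= max_tokens:
            return text  # length is in range: accept this sample
    raise RuntimeError("no sample within the length range after %d attempts" % max_attempts)

# Stub generator for demonstration: yields texts of 30, 150, then 110 words.
lengths = iter([30, 150, 110])

def fake_generate():
    return " ".join(["word"] * next(lengths))

def whitespace_count(text):
    # Whitespace splitting stands in for a real tokenizer here.
    return len(text.split())

sample = generate_within_length(fake_generate, whitespace_count)
print(whitespace_count(sample))  # 110
```

The first two stub outputs are rejected for being too short and too long, and the third is accepted.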
