This repository provides the dataset and code for the EMNLP 2023 Findings paper *On the Zero-Shot Generalization of Machine-Generated Text Detectors*.
Authors: Xiao Pu, Jingyu Zhang, Xiaochuang Han, Yulia Tsvetkov, and Tianxing He
Code coming soon...
The dataset introduced in the paper is located in this directory.
Each subset contains 5,000 real-world human-written samples (labelled as 1) and 5,000 machine-generated samples (labelled as 0), with a train/dev/test split ratio of 8:1:1. The subsets cover three domains (a loading sketch follows the list):
- news: RealNews
- reviews: IMDBreview
- knowledge: Wikipedia
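As a minimal loading sketch, assuming each subset directory holds JSONL files named `train.jsonl`, `dev.jsonl`, and `test.jsonl` with `text` and `label` fields (the exact file layout and field names in this repository may differ):

```python
import json
from pathlib import Path

def load_split(subset_dir: str, split: str):
    """Load one split (train/dev/test) of a subset as a list of dicts.

    Assumes JSONL files with `text` and `label` fields; adjust to the
    actual file layout of this repository if it differs.
    """
    samples = []
    path = Path(subset_dir) / f"{split}.jsonl"
    with path.open() as f:
        for line in f:
            record = json.loads(line)
            samples.append({"text": record["text"], "label": record["label"]})
    return samples

train = load_split("news", "train")  # label 1 = human-written, 0 = machine-generated
```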
The machine-generated samples are produced by the following models:

| Model | Parameters |
| --- | --- |
| GPT-1* | 117M |
| GPT-2 small* | 124M |
| GPT-2 medium* | 355M |
| GPT-2 large* | 774M |
| GPT-2 XL* | 1.5B |
| text-davinci-003 | 175B |
| GPT-4 | 1.7T |
| GPT-Neo small | 125M |
| GPT-Neo medium | 1.3B |
| GPT-Neo large | 2.7B |
| GPT-J | 6B |
| LLaMA 7B | 7B |
| LLaMA 13B | 13B |
\* We fine-tune GPT-1 and the GPT-2 models on 5,000 human-written samples to make their generations more human-like, because we find that text generated by these models in a zero-shot way is flooded with disfluent and repetitive sentences.
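As a minimal sketch of this fine-tuning step with Hugging Face `transformers` (the base model, hyperparameters, and data handling below are illustrative assumptions, not the exact setup from the paper):

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative only: fine-tune GPT-2 on the human-written training samples.
# `train` comes from the loading sketch above.
human_texts = [s["text"] for s in train if s["label"] == 1]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = Dataset.from_dict({"text": human_texts})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-human-ft",
        num_train_epochs=3,          # assumed value
        per_device_train_batch_size=8,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```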
Decoding parameters used for generation (a sampling sketch follows this list):

- top_p: 0.96
- top_k: 50
- temperature: 2.0[^1]
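A minimal sketch of sampling with these parameters, including the regeneration-by-length loop described in the footnote (the model choice and prompt handling are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative; any model from the table applies
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_sample(prompt: str, min_tokens: int = 100, max_tokens: int = 120) -> str:
    """Sample a continuation, regenerating until its length falls in
    [min_tokens, max_tokens] tokens (see footnote 1)."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    while True:
        with torch.no_grad():
            output = model.generate(
                input_ids,
                do_sample=True,
                top_p=0.96,
                top_k=50,
                temperature=2.0,
                max_new_tokens=max_tokens,
                pad_token_id=tokenizer.eos_token_id,
            )
        new_tokens = output[0, input_ids.shape[1]:]
        if min_tokens <= new_tokens.shape[0] <= max_tokens:
            return tokenizer.decode(new_tokens, skip_special_tokens=True)
```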
Please cite our paper if you use our dataset or code in your work:
@inproceedings{pu-etal-2023-zero,
    title = "On the Zero-Shot Generalization of Machine-Generated Text Detectors",
    author = "Pu, Xiao and
      Zhang, Jingyu and
      Han, Xiaochuang and
      Tsvetkov, Yulia and
      He, Tianxing",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.318",
    doi = "10.18653/v1/2023.findings-emnlp.318",
    pages = "4799--4808"
}
[^1]: To ensure all samples are of similar length, we regenerate a sample if its token count falls outside the range of 100–120. The temperature is therefore set to 2.0 to avoid repeatedly generating the same text.