This repo contains code and data for the paper *Pun Generation with Surprise* (He, Peng, and Liang, NAACL 2019).

## Dependencies
- Python 3.6
- PyTorch 0.4:
  ```
  conda install pytorch=0.4.0 torchvision -c pytorch
  ```
- Fairseq(-py):
  ```
  git clone -b pungen https://github.com/hhexiy/fairseq.git
  cd fairseq
  pip install -r requirements.txt
  python setup.py build develop
  ```
- Pretrained WikiText-103 model from Fairseq:
  ```
  curl --create-dirs --output models/wikitext/model https://dl.fbaipublicfiles.com/fairseq/models/wiki103_fconv_lm.tar.bz2
  tar xjf models/wikitext/model -C models/wikitext
  rm models/wikitext/model
  ```
## Skip-gram model

We approximate the relatedness between a pair of words with a long-distance skip-gram model trained on BookCorpus sentences.
The original BookCorpus data is parsed by `scripts/preprocess_raw_text.py`;
a sample of the expected format is provided in `sample_data/bookcorpus/raw/train.txt`.
Preprocess the BookCorpus data:
```
python -m pungen.wordvec.preprocess --data-dir data/bookcorpus/skipgram \
    --corpus data/bookcorpus/raw/train.txt \
    --min-dist 5 --max-dist 10 --threshold 80 \
    --vocab data/bookcorpus/skipgram/dict.txt
```
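For intuition, `--min-dist`/`--max-dist` select context words 5 to 10 tokens away from the target, so the model captures topical relatedness rather than local syntactic co-occurrence. A simplified sketch of this pair extraction (not the repo's actual preprocessing code):
```python
# Keep (target, context) pairs whose token distance falls in
# [min_dist, max_dist]; nearby words are deliberately skipped.
def long_distance_pairs(tokens, min_dist=5, max_dist=10):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(i + min_dist, min(i + max_dist + 1, len(tokens))):
            pairs.append((target, tokens[j]))
    return pairs

sent = "the old fisherman waited patiently by the quiet lake at dawn".split()
print(long_distance_pairs(sent))
```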
Train the skip-gram model:
```
python -m pungen.wordvec.train --weights --cuda --data data/bookcorpus/skipgram/train.bin \
    --save_dir models/bookcorpus/skipgram \
    --mb 3500 --epoch 15 \
    --vocab data/bookcorpus/skipgram/dict.txt
```
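Once trained, the relatedness of two words can be scored as the cosine similarity of their skip-gram embeddings. A minimal sketch, where the checkpoint key and the dictionary format are assumptions to adapt to the actual files:
```python
import torch
import torch.nn.functional as F

# Hypothetical loading code: the parameter name "embedding.weight" is an
# assumption about how sgns-e15.pt stores the input-embedding matrix.
state = torch.load("models/bookcorpus/skipgram/sgns-e15.pt", map_location="cpu")
emb = state["embedding.weight"]

# Assumes one "word [count]" entry per line, as in fairseq-style dictionaries.
vocab = {line.split()[0]: i
         for i, line in enumerate(open("data/bookcorpus/skipgram/dict.txt"))}

def relatedness(w1, w2):
    """Cosine similarity between the skip-gram embeddings of w1 and w2."""
    return F.cosine_similarity(emb[vocab[w1]], emb[vocab[w2]], dim=0).item()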
## Edit model

The edit model takes a word and a template (a masked sentence) and combines the two into a coherent sentence.
Preprocess data:
```
for split in train valid; do \
    PYTHONPATH=. python scripts/make_src_tgt_files.py -i data/bookcorpus/raw/$split.txt \
        -o data/bookcorpus/edit/$split --delete-frac 0.5 --window-size 2 --random-window-size; \
done
```
```
python -m pungen.preprocess --source-lang src --target-lang tgt \
    --destdir data/bookcorpus/edit/bin/data --thresholdtgt 80 --thresholdsrc 80 \
    --validpref data/bookcorpus/edit/valid \
    --trainpref data/bookcorpus/edit/train \
    --workers 8
```
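For intuition, `scripts/make_src_tgt_files.py` turns each sentence into a (deleted word, template) training pair by removing a word and masking a window around it. A simplified sketch, where the placeholder token and exact masking rules are assumptions for illustration only:
```python
import random

def make_example(tokens, window_size=2):
    # Pick a word to delete; it becomes the "insert" input of the edit model.
    i = random.randrange(len(tokens))
    deleted = tokens[i]
    # Mask the deleted word plus a window of neighbors around it.
    lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
    template = tokens[:lo] + ["<placeholder>"] + tokens[hi:]
    return deleted, template

print(make_example("the quick brown fox jumps over the lazy dog".split()))
```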
Training:
```
python -m pungen.train data/bookcorpus/edit/bin/data -a lstm \
    --source-lang src --target-lang tgt \
    --task edit --insert deleted --combine token \
    --criterion cross_entropy \
    --encoder lstm --decoder-attention True \
    --optimizer adagrad --lr 0.01 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
    --clip-norm 5 --max-epoch 50 --max-tokens 7000 --no-epoch-checkpoints \
    --save-dir models/bookcorpus/edit/deleted --no-progress-bar --log-interval 5000
```
## Retriever

Build a sentence retriever based on BookCorpus. The input file should contain one tokenized sentence per line.
```
python -m pungen.retriever --doc-file data/bookcorpus/raw/sent.tokenized.txt \
    --path models/bookcorpus/retriever.pkl --overwrite
```
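Conceptually, the retriever maps a query word to sentences containing it. A toy sketch of such an index (the actual `pungen.retriever` implementation and API may differ):
```python
from collections import defaultdict

class KeywordRetriever:
    """Toy inverted index: word -> ids of sentences containing it."""

    def __init__(self, doc_file):
        self.sents = [line.split() for line in open(doc_file)]
        self.index = defaultdict(list)
        for sid, sent in enumerate(self.sents):
            for w in set(sent):
                self.index[w].append(sid)

    def retrieve(self, word, n=5):
        return [self.sents[sid] for sid in self.index[word][:n]]
```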
## Funniness analysis

Compute the correlation between local-global surprise scores and human funniness ratings.
We provide our annotated dataset in `data/funniness_annotation`:
- `analysis_pun_scores.txt`: sentences annotated with funniness scores from 1 to 5.
- `analysis_zscored_pun_scores.txt`: the same data with scores standardized for each annotator.
```
python eval_scoring_func.py --human-eval data/funniness_annotation/analysis_zscored_pun_scores.txt \
    --lm-path models/wikitext/wiki103.pt --word-counts-path models/wikitext/dict.txt \
    --skipgram-model data/bookcorpus/skipgram/dict.txt models/bookcorpus/skipgram/sgns-e15.pt \
    --outdir results/pun-analysis/analysis_zscored \
    --features grammar ratio --analysis --ignore-cache
```
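The evaluation boils down to correlating the two score lists. A sketch with made-up numbers (the exact statistic reported by `eval_scoring_func.py` may differ):
```python
from scipy.stats import pearsonr, spearmanr

model_scores = [0.8, 1.5, 0.2, 2.1, 0.9]   # e.g., local-global surprise scores
human_scores = [0.3, 1.1, -0.7, 1.8, 0.1]  # z-scored funniness ratings

print("Pearson r: %.3f" % pearsonr(model_scores, human_scores)[0])
print("Spearman rho: %.3f" % spearmanr(model_scores, human_scores)[0])
```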
## Pun generation

We generate puns given a pair of a pun word and an alternative word.
The following generation methods are supported, selected by the `--system` argument:
- `rule`: the SURGEN method described in the paper
- `rule+neural`: in the last step of SURGEN, use a neural combiner to edit the topic words
- `retrieve`: retrieve a sentence containing the pun word
- `retrieve+swap`: retrieve a sentence containing the alternative word and replace it with the pun word

For arguments controlling the neural generator (e.g., `--beam`, `--nbest`), see `fairseq.options`. All results and logs are saved in the directory given by `--outdir`.
```
python generate_pun.py data/bookcorpus/edit/bin/data \
    --path models/bookcorpus/edit/deleted/checkpoint_best.pt \
    --beam 20 --nbest 1 --unkpen 100 \
    --system rule --task edit \
    --retriever-model models/bookcorpus/retriever.pkl --doc-file data/bookcorpus/raw/sent.tokenized.txt \
    --lm-path models/wikitext/wiki103.pt --word-counts-path models/wikitext/dict.txt \
    --skipgram-model data/bookcorpus/skipgram/dict.txt models/bookcorpus/skipgram/sgns-e15.pt \
    --num-candidates 500 --num-templates 100 \
    --num-topic-word 100 --type-consistency-threshold 0.3 \
    --pun-words data/semeval/hetero/dev.json \
    --outdir results/semeval/hetero/dev/rule \
    --scorer random \
    --max-num-examples 100
```
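For intuition, the `retrieve+swap` baseline can be sketched as follows (the actual system retrieves from BookCorpus and ranks candidates; this is illustration only):
```python
def retrieve_swap(sentences, pun_word, alter_word):
    """Yield retrieved sentences with the alternative word swapped for the pun word."""
    for sent in sentences:  # sentences: iterable of token lists
        if alter_word in sent:
            yield [pun_word if w == alter_word else w for w in sent]

sents = [s.split() for s in ["she bought a flower at the market",
                             "the market closed early"]]
print(next(retrieve_swap(sents, "flour", "flower")))
```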
## Citation

If you use the annotated SemEval pun dataset, please cite our paper:
```
@inproceedings{he2019pun,
  title={Pun Generation with Surprise},
  author={He He and Nanyun Peng and Percy Liang},
  booktitle={North American Association for Computational Linguistics (NAACL)},
  year={2019}
}
```