- (Optional) We highly recommend creating a virtual environment to run the following code:
python3 -m venv {path/venv_name}
source {path/venv_name}/bin/activate
python -m pip install --upgrade pip setuptools
- With Python 3.6+:
  - Install all Python packages:
    pip install -r requirements.txt
  - Install the spaCy English model:
    python -m spacy download en_core_web_sm
- With Python 3.8+:
  - Install all Python packages:
    pip install -r requirements_python38.txt
  - Install the spaCy English model:
    python -m spacy download en_core_web_sm
  - Install PyTorch 1.7.1 with CUDA 11.0:
    pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
To leverage the pre-trained models from Summary Loop, download the following files and place them under the models directory. These models are needed to run train_summarizer.py:
- bert_coverage.bin: a bert-base-uncased model fine-tuned on the Coverage task for the news domain
- fluency_news_bs32.bin: a GPT2 (base) model fine-tuned on a large corpus of news articles, used as the Fluency model
- gpt2_copier23.bin: a GPT2 (base) model that can be used as an initial point for the Summarizer model
The pipeline also uses the qg_model from FEQA and the qa_model from deepset/minilm:
- Download the checkpoints folder and place it under the bart_qg directory.
- There is no need to install the qa_model; it will be downloaded automatically.
You can run the following two scripts to check that all components are set up correctly:
python3 model_faith.py
python3 model_coverage.py
Follow the instructions here to download the CNN/DM dataset under the data directory; we recommend following Option 1. (See the discussion here about why we do not provide it ourselves.) Then create a dataset that is compatible with the Summary Loop training script:
cd data
git clone https://github.com/abisee/cnn-dailymail.git
- download CNN_STORIES_TOKENIZED and DM_STORIES_TOKENIZED from here and unzip them
python3 make_datafiles.py
test_dataset.db will be created.
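If you want to sanity-check the result, a quick peek is possible; this is a minimal sketch assuming test_dataset.db is a SQLite file (the .db extension suggests so), and the table names are not guaranteed, so list the schema first:

```python
import sqlite3

# Assumption: test_dataset.db produced by make_datafiles.py is a SQLite database.
conn = sqlite3.connect("data/test_dataset.db")
cursor = conn.cursor()

# List the tables to see what the script actually produced.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
conn.close()
```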
Otherwise, you can modify the scripts' data loading (Dataloader) and collate function (collate_fn) to bring in your own data.
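As a rough illustration, a drop-in data source could look like the sketch below. This is a minimal example assuming the training script consumes batches of raw document strings; the class name, file format, and collate signature are illustrative, so check the script's actual Dataloader and collate_fn before swapping them out.

```python
from torch.utils.data import Dataset, DataLoader

class MyArticleDataset(Dataset):
    """Illustrative dataset: one raw article per line of a plain-text file."""
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.documents = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, idx):
        return self.documents[idx]

def my_collate_fn(batch):
    # Return the batch in whatever structure the training script expects;
    # here it is simply a list of document strings.
    return list(batch)

# Hypothetical file name; replace with your own data source.
loader = DataLoader(MyArticleDataset("my_articles.txt"), batch_size=8,
                    shuffle=True, collate_fn=my_collate_fn)
```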
Once all the pre-trained models and data are ready, you can train a Summarizer using train_summarizer.py:
python3 train_summarizer.py --dataset_file {path/to/test_dataset.db} --root_folder {path/to/mywork_backup} --experiment {experiment_name}
from model_generator import GeneTransformer
generator = GeneTransformer(device="cuda") # Initialize the generator
generator.reload("/path/to/summarizer.bin")
document = "This is a long document I want to summarize"
# The document goes in a list because the decode function operates on batches for efficiency.
# You can optionally use beam search (beam_size) and sampling (sample); without sampling it does argmax/top_k.
summary = generator.decode([document], max_output_length=25, beam_size=1, return_scores=False, sample=False)
print(summary)
To evaluate the summarizer, you can run:
python3 eval.py
The Factual Consistency, Coverage, Fluency, and Brevity models can be used separately for analysis, evaluation, etc. They are implemented in model_faith.py, model_coverage.py, model_generator.py, and model_guardrails.py respectively; each model is implemented as a class with a score(document, summary) function.
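For example, one of the scorers could be used on its own roughly as follows. This is a minimal sketch: the class name KeywordCoverage and its constructor arguments are assumptions for illustration, so check model_coverage.py for the actual class name, initialization, and whether score expects single strings or batched lists.

```python
# Minimal usage sketch; class name and constructor arguments are assumptions,
# see model_coverage.py for the actual interface.
from model_coverage import KeywordCoverage

scorer = KeywordCoverage(device="cuda")  # assumed constructor signature
document = "This is a long document I want to summarize."
summary = "A long document."

# Each model exposes a score(document, summary) function (see above);
# batching behavior may differ per model.
print(scorer.score(document, summary))
```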
- Build your own Summarizer & Fluency Scorer
  You can use utils/train_generator.py to build your own Summarizer & Fluency model:
  python3 train_generator.py --dataset_file {path/to/test_dataset.db} --task {cgen/copy/lm} --max_output_length {23} --experiment {experiment_name}
  The cgen and copy tasks are used to create the Summarizer; the lm task is used to create the Fluency Scorer.
- Build your own Coverage Scorer
  You can use utils/pretrain_bert.py to fine-tune a BERT model on your target domain (in our example, the news domain):
  python3 pretrain_bert.py --dataset_file {path/to/test_dataset.db}
  Then use utils/pretrain_coverage.py to build the Coverage Scorer:
  python3 pretrain_coverage.py --dataset_file {path/to/test_dataset.db} --experiment {experiment_name}