- (Optional) We highly recommend creating a virtual environment to run the following code:
python3 -m venv {path/venv_name}
source {path/venv_name}/bin/activate
python -m pip install --upgrade pip setuptools
- With Python 3.6+:
  - Install all Python packages:
    pip install -r requirements.txt
  - Install the spaCy English model:
    python -m spacy download en_core_web_sm
- With Python 3.8+:
  - Install all Python packages:
    pip install -r requirements_python38.txt
  - Install the spaCy English model:
    python -m spacy download en_core_web_sm
  - Install PyTorch 1.7.1 with CUDA 11.0:
    pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
To leverage the pre-trained models from Summary Loop, download the following files and place them under the models directory. These models are needed to run train_summarizer.py:
- bert_coverage.bin: a bert-base-uncased model fine-tuned on the Coverage task for the news domain
- fluency_news_bs32.bin: a GPT2 (base) model fine-tuned on a large corpus of news articles, used as the Fluency model
- gpt2_copier23.bin: a GPT2 (base) model that can be used as an initial point for the Summarizer model
The pipeline also uses the qg_model from FEQA and the qa_model from deepset/minilm:
- Download the checkpoints folder and place it under the bart_qg directory.
- There is no need to install the qa_model; it will be downloaded automatically.
You can run the following two scripts to check that all components are set up correctly:
python3 model_faith.py
python3 model_coverage.py
Follow the instructions here to download the CNN/DM dataset under the data directory; we recommend following Option 1. (See the discussion here about why we do not provide it ourselves.) Then create a dataset that is compatible with the Summary Loop training script:
cd data
git clone https://github.com/abisee/cnn-dailymail.git
- download CNN_STORIES_TOKENIZED and DM_STORIES_TOKENIZED from here and unzip them
python3 make_datafiles.py
test_dataset.db will be created.
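If you want to sanity-check the result, a quick peek is possible; this is a minimal sketch assuming test_dataset.db is a SQLite file (the .db extension suggests so), and the table names are not guaranteed, so list the schema first:

```python
import sqlite3

# Assumption: test_dataset.db produced by make_datafiles.py is a SQLite database.
conn = sqlite3.connect("data/test_dataset.db")
cursor = conn.cursor()

# List the tables to see what the script actually produced.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
conn.close()
```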
Otherwise, you can modify the scripts' data loading (Dataloader) and collate function (collate_fn) to bring in your own data.
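As a rough illustration, a drop-in data source could look like the sketch below. This is a minimal example assuming the training script consumes batches of raw document strings; the class name, file format, and collate signature are illustrative, so check the script's actual Dataloader and collate_fn before swapping them out.

```python
from torch.utils.data import Dataset, DataLoader

class MyArticleDataset(Dataset):
    """Illustrative dataset: one raw article per line of a plain-text file."""
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.documents = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, idx):
        return self.documents[idx]

def my_collate_fn(batch):
    # Return the batch in whatever structure the training script expects;
    # here it is simply a list of document strings.
    return list(batch)

# Hypothetical file name; replace with your own data source.
loader = DataLoader(MyArticleDataset("my_articles.txt"), batch_size=8,
                    shuffle=True, collate_fn=my_collate_fn)
```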
Once all the pre-trained models and data are ready, you can train a Summarizer using train_summarizer.py:
python3 train_summarizer.py --dataset_file {path/to/test_dataset.db} --root_folder {path/to/mywork_backup} --experiment {experiment_name}
from model_generator import GeneTransformer
generator = GeneTransformer(device="cuda") # Initialize the generator
generator.reload("/path/to/summarizer.bin")
document = "This is a long document I want to summarize"
# The document goes in a list because the decode function operates on batches for efficiency.
# You can optionally use beam search (beam_size) and sampling (sample); without sampling it does argmax/top_k.
summary = generator.decode([document], max_output_length=25, beam_size=1, return_scores=False, sample=False)
print(summary)
To evaluate the summarizer, you can run:
python3 eval.py
The Factual Consistency, Coverage, Fluency, and Brevity models can be used separately for analysis, evaluation, etc. They are implemented in model_faith.py, model_coverage.py, model_generator.py, and model_guardrails.py respectively; each model is implemented as a class with a score(document, summary) function.
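For example, one of the scorers could be used on its own roughly as follows. This is a minimal sketch: the class name KeywordCoverage and its constructor arguments are assumptions for illustration, so check model_coverage.py for the actual class name, initialization, and whether score expects single strings or batched lists.

```python
# Minimal usage sketch; class name and constructor arguments are assumptions,
# see model_coverage.py for the actual interface.
from model_coverage import KeywordCoverage

scorer = KeywordCoverage(device="cuda")  # assumed constructor signature
document = "This is a long document I want to summarize."
summary = "A long document."

# Each model exposes a score(document, summary) function (see above);
# batching behavior may differ per model.
print(scorer.score(document, summary))
```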
- Build your own Summarizer & Fluency Scorer
  You can use utils/train_generator.py to build your own Summarizer & Fluency model:
  python3 train_generator.py --dataset_file {path/to/test_dataset.db} --task {cgen/copy/lm} --max_output_length {23} --experiment {experiment_name}
  The cgen and copy tasks are used to create the Summarizer; the lm task is used to create the Fluency Scorer.
- Build your own Coverage Scorer
  You can use utils/pretrain_bert.py to fine-tune a BERT model on your target domain (in our example, the news domain):
  python3 pretrain_bert.py --dataset_file {path/to/test_dataset.db}
  Then use utils/pretrain_coverage.py to build the Coverage Scorer:
  python3 pretrain_coverage.py --dataset_file {path/to/test_dataset.db} --experiment {experiment_name}