# SummaRuNNer (extractive summarization)

This repository presents an in-depth ablation and replacement study of SummaRuNNer. It reports SummaRuNNer's results on CNN-DailyMail, NYT50 and a subset of the French Wikipedia, and examines how the summary/document length ratio influences performance. It also measures the influence of the named entity recognition task when combined with the summarization task, and vice versa.

Paper: [SummaRuNNer](https://arxiv.org/abs/1611.04230)

## Clone the project

```bash
git clone https://github.com/Baragouine/SummaRuNNer.git
```

## Enter the directory

```bash
cd SummaRuNNer
```

## Create the environment

```bash
conda create --name SummaRuNNer python=3.9
```

## Activate the environment

```bash
conda activate SummaRuNNer
```

## Install dependencies

```bash
pip install -r requirements.txt
```

## Install NLTK data

To install the NLTK data:

- Open a Python console.
- Type `import nltk; nltk.download()`.
- Download all the data.
- Close the Python console.
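
If you prefer to script this step, the download can also run non-interactively. A minimal sketch, assuming the full `all` collection is what the notebooks need (the interactive downloader above lets you pick subsets):

```python
# Download all NLTK data packages without opening the interactive downloader.
import nltk

nltk.download("all")  # equivalent to selecting "all" in nltk.download()'s UI
```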

## Convert the initial dataset to valid pandas JSON

- Download the initial dataset (CNN-DailyMail from Kaggle).
- Copy `train.json`, `val.json` and `test.json` to `./data/cnn_dailymail/raw/`.
- Run the notebook `00-0-convert_raw_cnndailymail_to_json.ipynb` (the core of the conversion is sketched after this list).
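
The notebook handles this conversion, but the core idea is normalizing the raw records into a JSON layout that pandas can read back directly. A minimal sketch, where the `text` and `summary` field names are assumptions about the raw Kaggle files (check the notebook for the actual keys and JSON orientation):

```python
# Normalize raw CNN-DailyMail JSON into a pandas-friendly records file.
import pandas as pd

df = pd.read_json("./data/cnn_dailymail/raw/train.json")
# Keep only the columns the pipeline needs (assumed names).
df = df[["text", "summary"]]
df.to_json("./data/cnn_dailymail/train.json", orient="records")
```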

To download and preprocess NYT (convert NYT to NYT50 and preprocess it), see: https://github.com/Baragouine/HeterSUMGraph.

## Compute labels

```bash
python3 ./00-1-compute_label_cnndailymail.py
```

You can adapt this script to other datasets containing texts and summaries; a sketch of a typical greedy labeling approach is shown below.
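
The script computes binary sentence labels for extractive training. A common recipe (the SummaRuNNer paper generates its extractive labels greedily against the abstractive reference) is to repeatedly add the sentence that most improves overlap with the reference summary. The sketch below uses plain unigram F1 instead of a full ROUGE implementation so it stays self-contained, so the exact metric is an assumption about the script:

```python
# Greedy extractive labeling: pick sentences that maximize unigram-F1 overlap
# with the abstractive reference summary.
def unigram_f1(candidate: list[str], reference: list[str]) -> float:
    cand, ref = set(candidate), set(reference)
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

def greedy_labels(sentences: list[list[str]], summary: list[str]) -> list[int]:
    labels = [0] * len(sentences)
    selected: list[str] = []
    best = 0.0
    while True:
        gains = [(unigram_f1(selected + sent, summary), i)
                 for i, sent in enumerate(sentences) if not labels[i]]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:  # stop when no remaining sentence improves the overlap
            break
        best, labels[i] = score, 1
        selected += sentences[i]
    return labels

# Example: label a 3-sentence document against a short reference summary.
doc = [["the", "cat", "sat"], ["dogs", "bark"], ["the", "cat", "is", "grey"]]
print(greedy_labels(doc, ["the", "grey", "cat", "sat"]))  # -> [1, 0, 1]
```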

## Embeddings

Training requires the 100-dimensional GloVe embeddings; the file must be at the following path: `data/glove.6B/glove.6B.100d.txt`.
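
Each line of `glove.6B.100d.txt` is a token followed by its 100 float components, so loading it is straightforward. A minimal sketch (the variable names are illustrative, not the repo's):

```python
# Load GloVe vectors into a {token: vector} dictionary.
import numpy as np

embeddings = {}
with open("data/glove.6B/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        token, *values = line.rstrip().split(" ")
        embeddings[token] = np.asarray(values, dtype=np.float32)

print(embeddings["the"].shape)  # (100,)
```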

## Training

- Run `train_RNN_RNN.ipynb` to train the paper's model.
- Run `train_SIMPLE_CNN_RNN.ipynb` to train the model whose first RNN is replaced by a single-layer CNN.
- Run `train_COMPLEX_CNN_RNN.ipynb` to train the model whose first RNN is replaced by a complex CNN (3 layers).
- Run `train_COMPLEX_CNN_RNN_max_pool.ipynb` to train the same 3-layer CNN variant, but with max pooling instead of average pooling.
- Run `train_RES_CNN_RNN.ipynb` to train the model whose first RNN is replaced by a CNN with residual connections (3 layers).
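
All of these variants keep SummaRuNNer's sentence-scoring head, whose terms (content, salience, novelty, absolute and relative position) are exactly what the ablation notebooks below switch off. A minimal PyTorch sketch of that head, with hypothetical dimensions, as an illustration of the paper's scoring equation rather than the repo's actual module:

```python
# SummaRuNNer-style sentence scoring head (content + salience + novelty + positions).
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    def __init__(self, hidden: int = 200, pos_buckets: int = 50):
        super().__init__()
        self.content = nn.Linear(hidden, 1)              # W_c h_j
        self.salience = nn.Bilinear(hidden, hidden, 1)   # h_j^T W_s d
        self.novelty = nn.Bilinear(hidden, hidden, 1)    # h_j^T W_r tanh(s_j)
        self.abs_pos = nn.Embedding(pos_buckets, 1)      # absolute position bias
        self.rel_pos = nn.Embedding(pos_buckets, 1)      # relative position bias
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, h, doc, abs_p, rel_p):
        # h: (n_sents, hidden) sentence states; doc: (hidden,) document state;
        # abs_p, rel_p: (n_sents,) long tensors of position bucket indices.
        probs, summary = [], torch.zeros_like(doc)
        for j, h_j in enumerate(h):
            score = (self.content(h_j)
                     + self.salience(h_j, doc)
                     - self.novelty(h_j, torch.tanh(summary))  # penalize redundancy
                     + self.abs_pos(abs_p[j]) + self.rel_pos(rel_p[j])
                     + self.bias)
            p_j = torch.sigmoid(score)
            summary = summary + p_j * h_j  # running summary representation
            probs.append(p_j)
        return torch.cat(probs)  # (n_sents,) selection probabilities
```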

The other notebooks train SIMPLE_CNN_RNN variants that have been ablated to measure the importance of each component of the model:

- Run `train_SIMPLE_CNN_RNN_abs_pos_only.ipynb` to train a SIMPLE_CNN_RNN that uses only the absolute position to predict.
- Run `train_SIMPLE_CNN_RNN_rel_pos_only.ipynb` to train a SIMPLE_CNN_RNN that uses only the relative position to predict.
- ...

The purpose of each remaining notebook is indicated by its file name.
The `.pt` checkpoint files are saved under `./checkpoints`; each training run is stored in a separate subdirectory.
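
The tables below report ROUGE scores. To score a single prediction yourself, here is a minimal sketch using the `rouge_score` package; this choice of library is an assumption (`requirements.txt` may pin a different ROUGE implementation, see the footnote under the NYT50 table):

```python
# Score a generated summary against the reference with ROUGE-1/2/L F1.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat sat on the mat .",      # reference summary
    prediction="the cat was on the mat .",  # model output
)
print({k: round(v.fmeasure, 3) for k, v in scores.items()})
```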

## Results

### CNN-DailyMail (full-length F1 ROUGE)

| model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| SummaRuNNer (Nallapati) | 39.6 ± 0.2 | 16.2 ± 0.2 | 35.3 ± 0.2 |
| RNN_RNN | 39.7 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| SIMPLE_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| COMPLEX_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| COMPLEX_CNN_RNN_max_pool | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| RES_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| SIMPLE_CNN_RNN_without_text_content (positions only) | 39.4 ± 0.0 | 16.0 ± 0.0 | 24.2 ± 0.0 |
| SIMPLE_CNN_RNN_without_positions (text content only) | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| SIMPLE_CNN_RNN_absolute_position_only | 39.4 ± 0.0 | 16.0 ± 0.0 | 24.3 ± 0.0 |
| SIMPLE_CNN_RNN_relative_position_only | 39.0 ± 0.0 | 15.8 ± 0.0 | 24.1 ± 0.0 |
| SIMPLE_CNN_RNN_without_positions_and_content | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| SIMPLE_CNN_RNN_without_positions_and_salience | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| SIMPLE_CNN_RNN_without_position_and_novelty | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| SIMPLE_CNN_RNN_novelty_only | 40.0 ± 0.0 | 16.7 ± 0.0 | 25.3 ± 0.0 |
| SIMPLE_CNN_RNN_salience_only | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
| SIMPLE_CNN_RNN_content_only | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |

### RNN_RNN on NYT50 (limited-length ROUGE recall)

| model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| HeterSUMGraph (Wang) | 46.89 | 26.26 | 42.58 |
| RNN_RNN | 47.3 ± 0.0 | 26.7 ± 0.0 | 35.7* ± 0.0 |

\*: the discrepancy may come from a different ROUGE-L implementation in the ROUGE library used here.

### RNN_RNN on general geography, architecture, town planning and geology French Wikipedia articles (limited-length ROUGE recall)

| dataset | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Wikipedia-0.5 | 31.5 ± 0.0 | 10.0 ± 0.0 | 20.0 ± 0.0 |
| Wikipedia-high-25 | 24.0 ± 0.0 | 6.8 ± 0.0 | 15.0 ± 0.0 |
| Wikipedia-low-25 | 33.3 ± 0.0 | 13.3 ± 0.0 | 23.0 ± 0.0 |

### RNN_RNN with NER on general geography, architecture, town planning and geology French Wikipedia articles (limited-length ROUGE recall)

| model | ROUGE-1 | ROUGE-2 | ROUGE-L | ACCURACY |
|---|---|---|---|---|
| RNN_RNN_summary_and_ner | 31.6 ± 0.1 | 10.0 ± 0.0 | 20.0 | 0.875 ± 0.0 |
| RNN_RNN_OnlyNER | N/A | N/A | N/A | 0.879 ± 0.0 |

- Wikipedia-0.5: general geography, architecture, town planning and geology French Wikipedia articles with len(summary)/len(content) <= 0.5.
- Wikipedia-high-25: the first 25% of those articles, sorted by len(summary)/len(content) in descending order.
- Wikipedia-low-25: the first 25% of those articles, sorted by len(summary)/len(content) in ascending order.

See the HSG_ExSUM_NER repository for Wikipedia scraping and preprocessing (that repository contains the scraping and preprocessing scripts).