Replicating OpenI results #1

jantrienes · 2022-05-20T07:58:05Z

Hi @jinpeng01, I'd like to replicate the WGSum results on my hardware and have a question about the ROUGE score I get.

What I did so far:

Install dependencies (see conda environment below). Because of a newer GPU, I had to upgrade PyTorch to version 1.7.1.
Download model_trans/openi.pt from here: https://drive.google.com/drive/folders/1okrhVfsfTqZ4mnsmABiZG0sWBqrHPOg2?usp=sharing
Adapt paths in src/WGSum_test.sh and run inference on the data in bert_openI/radiology/ (see full log below)

I get the following ROUGE scores:

>> ROUGE-F(1/2/l): 59.61/49.60/59.28

Questions:

Which run in the paper should I compare this to? Is it Table 3, result OpenI WGSUM (TRANS+GAT) with R1/R2/RL of 61.63/50.98/61.73?
Comparing to that run, do you know why the ROUGE scores could be lower?

Thank you!
Jan

Conda environment:

name: wgsum
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.8.13
  - pip
  - pytorch=1.7.1
  - cudatoolkit=11.0
  - pip:
    - multiprocess==0.70.9
    - pyrouge==0.1.3
    - pytorch-transformers==1.2.0
    - tensorboardX==2.4
    - tensorboard==2.6.0
    - git+https://github.com/tagucci/pythonrouge.git
    - torch-scatter==2.0.7 -f https://data.pyg.org/whl/torch-1.7.1+cu110.html
    - torch-sparse==0.6.9 -f https://data.pyg.org/whl/torch-1.7.1+cu110.html
    - "torch-geometric<2"

Full execution log

/local/work/WGSum/src$ sh WGSUM_test.sh
../models/openi
[2022-05-20 07:48:29,438 INFO] Loading checkpoint from ../models/openi/openi.pt
Namespace(accum_count=1, alpha=0.95, batch_size=3000, beam_size=5, bert_data_path='../bert_openI/radiology/radiology', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=512, dec_layers=6, enc_dropout=0.2, enc_ff_size=2048, enc_hidden_size=512, enc_layers=6, encoder='baseline', ext_dropout=0.2, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='../models/openi/testlog', lr=1, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=50, max_pos=512, max_tgt_len=140, min_length=7, mode='test', model_path='../models/openi', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='../models/openi/result', save_checkpoint_steps=5, seed=666, sep_optim=True, share_emb=False, task='abs', temp_dir='../temp', test_all=False, test_batch_size=500, test_from='../models/openi/openi.pt', test_start_from=-1, train_from='', train_steps=1000, use_bert_emb=False, use_interval=True, visible_gpus='0', warmup_steps=8000, warmup_steps_bert=8000, warmup_steps_dec=8000, world_size=1)
[2022-05-20 07:48:30,268 INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at ../temp/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[2022-05-20 07:48:30,269 INFO] Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

[2022-05-20 07:48:30,740 INFO] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin from cache at ../temp/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
[2022-05-20 07:48:36,152 INFO] Loading test dataset from ../bert_openI/radiology/radiology.test.0.bert.pt, number of examples: 576
[2022-05-20 07:48:36,612 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ../temp/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
/local/work/WGSum/src/models/predictor.py:359: UserWarning: This overload of nonzero is deprecated:
	nonzero()
Consider using one of the following signatures instead:
	nonzero(*, bool as_tuple) (Triggered internally at  /opt/conda/conda-bld/pytorch_1607370172916/work/torch/csrc/utils/python_arg_parser.cpp:882.)
  finished_hyp = is_finished[i].nonzero().view(-1)
[2022-05-20 07:50:03,498 INFO] Calculating Rouge
Calculating ROUGE...
ROUGE done.
test set results:

Metric	Score	95% CI
ROUGE-1	59.61	(-2.74,2.93)
ROUGE-2	49.60	(-3.21,3.29)
ROUGE-L	59.28	(-2.79,2.92)
[2022-05-20 07:50:05,417 INFO] Rouges at step 0
>> ROUGE-F(1/2/l): 59.61/49.60/59.28
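
To put the gap in context, the scores in the log can be set against the ones quoted from the paper. The sketch below is only arithmetic on numbers already in this thread. It assumes the "95% CI" column holds offsets relative to the score (lower, upper), which is an interpretation of the log format and not something confirmed by the repository. The names `observed`, `reported`, and `compare` are made up for illustration.

```python
# Scores from the log above: (score, CI lower offset, CI upper offset).
# Assumption: the CI column lists offsets around the score, not absolute bounds.
observed = {
    "ROUGE-1": (59.61, -2.74, 2.93),
    "ROUGE-2": (49.60, -3.21, 3.29),
    "ROUGE-L": (59.28, -2.79, 2.92),
}

# Scores quoted in the question for OpenI WGSUM (TRANS+GAT), Table 3.
reported = {"ROUGE-1": 61.63, "ROUGE-2": 50.98, "ROUGE-L": 61.73}


def compare(observed, reported):
    """Return {metric: (gap to reported score, reported score inside interval?)}."""
    rows = {}
    for metric, (score, lo, hi) in observed.items():
        gap = round(reported[metric] - score, 2)
        inside = score + lo <= reported[metric] <= score + hi
        rows[metric] = (gap, inside)
    return rows


for metric, (gap, inside) in compare(observed, reported).items():
    print(f"{metric}: gap={gap:+.2f}, reported score inside interval: {inside}")
```

Under that reading of the CI column, the gaps are about 1.4 to 2.5 points and all three reported scores still fall inside the intervals from this run, so the difference is noticeable but not far outside the uncertainty the evaluation script itself reports.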

jantrienes · 2022-05-20T08:21:13Z

I noticed that I had accidentally changed min_length from 6 to 7. Reverting that change gives the following results, which are very close to the ones reported in the paper.

>> ROUGE-F(1/2/l): 62.23/50.13/61.87
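
For readers wondering why a one-token change in min_length matters: beam-search decoders commonly enforce a minimum length by making the end-of-sequence token unselectable until enough steps have passed. The toy sketch below illustrates that general technique. It is an assumption about how min_length is applied, not a copy of the WGSum decoder, and `block_early_eos`, `NEG_INF`, and the three-token vocabulary are hypothetical.

```python
NEG_INF = -1e20  # large negative log-probability, effectively "never pick"


def block_early_eos(log_probs, step, min_length, eos_id):
    """Return a copy of log_probs with EOS disabled while step < min_length."""
    masked = list(log_probs)
    if step < min_length:
        masked[eos_id] = NEG_INF
    return masked


# Toy vocabulary of 3 tokens; token 2 plays the role of EOS and is the most likely.
log_probs = [-2.0, -1.5, -0.1]
for min_length in (6, 7):
    masked = block_early_eos(log_probs, step=6, min_length=min_length, eos_id=2)
    best = max(range(len(masked)), key=masked.__getitem__)
    print(f"min_length={min_length}: best token at step 6 is {best}")
```

In this toy case the decoder may stop at step 6 when min_length is 6, but is forced to emit one more token when min_length is 7. If many reference summaries are very short, that extra forced token could plausibly lower ROUGE, which would be consistent with the drop seen above.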

I noticed that the ROUGE scores change when running inference again (still staying relatively high). Do you know how to keep the results stable?
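
One thing worth trying, sketched here under assumptions: the log already shows seed=666, so if the scores still vary, the remaining variation may come from non-deterministic GPU kernels and not from the random number generators. The helper below seeds the usual generators and asks cuDNN for deterministic kernels. `seed_everything` is a hypothetical helper, not a function from the WGSum repository, and it may not remove all variation (for example, GPU operations built on atomic adds can still reorder floating-point sums).

```python
import random


def seed_everything(seed):
    """Seed every random number generator that is available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; only the generators above are seeded


seed_everything(666)
first = random.random()
seed_everything(666)
second = random.random()
print(first == second)
```

Calling the helper once at the start of the test script, before the model and data loaders are built, would be the natural place for it. Comparing the generated summaries of two runs line by line would then show whether any difference remains.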

njan-creative · 2022-12-20T11:12:54Z

Can you help me with preprocessing the original OpenI data?

@jantrienes @jinpeng01
