The official implementation of the paper "A Phrase-level Attention Enhanced CRF for Keyphrase Extraction".

About our PAE-CRF

This is the framework of our PAE-CRF.

[figure: framework of PAE-CRF]

These are the results comparing PAE-CRF with other baselines.

[figures: main result, ablated result, example]

Code

This code is mainly adapted from bert4torch; thanks to its author for that work.

Quick Start

The whole process includes the following steps: Preprocess, Training, and Evaluation.

The original datasets are provided by kenchan. The original and labeled datasets and our PAE-CRF's results can be downloaded from here. Please read the readme.md under /datasets for more details.

The pre-trained models, consisting of bert-base-uncased for training word embeddings and all-mpnet-base-v1 for ranking keyphrases, can be downloaded from here. Please download and place them under /pretrain_models.

The trained parameters of our PAE-CRF can be downloaded from here. Please download and place them under /checkpoints.

Preprocess

We label the datasets with the BIOES scheme.

When two keyphrases overlap, the latter keyphrase is not labeled, to ensure that each word has a unique word-level label. For example, given the input text Abductive network committees for improved classification of medical data and the corresponding keyphrases abductive networks and network committee, the previous labeling method labels the first three words as $\{B_{CW}, B_{CW}, E_{CW}\}$: the word network is matched and labeled twice, the latter label overwrites the former, and the result contains the pair $\{B_{CW}, B_{CW}\}$, which is illegal in the BIOES scheme. To solve this problem, we instead label the first three words as $\{B_{CW}, E_{CW}, O\}$. If you have a better way to address this issue, please raise it in the issues.
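The overlap-skipping rule above can be sketched as follows. This is a minimal illustration, not the repository's labeling.py; it matches keyphrases by exact lowercased tokens, whereas the actual preprocessing may also apply stemming (e.g. to match networks against network).

```python
def bioes_label(words, keyphrases):
    """Assign BIOES word-level labels, skipping keyphrases that overlap
    an already-labeled span so each word keeps a unique label."""
    labels = ["O"] * len(words)
    lowered = [w.lower() for w in words]
    for kp in keyphrases:
        kp_words = kp.lower().split()
        n = len(kp_words)
        for i in range(len(words) - n + 1):
            if lowered[i:i + n] != kp_words:
                continue
            # Overlap: a word in this span is already labeled, so the
            # latter keyphrase is not labeled at all.
            if any(label != "O" for label in labels[i:i + n]):
                continue
            if n == 1:
                labels[i] = "S-SW"  # single-word keyphrase
            else:
                labels[i] = "B-CW"
                for j in range(i + 1, i + n - 1):
                    labels[j] = "I-CW"
                labels[i + n - 1] = "E-CW"
    return labels
```

With the example above (exact token forms), the first phrase labels the first two words and the overlapping second phrase is skipped, yielding the legal sequence B-CW, E-CW, O.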

The annotations used in our labeled data differ from those in the paper; their correspondence is listed below.

These are word-level labels:

B-CW = $B_{kp}$

I-CW = $I_{kp}$

E-CW = $E_{kp}$

S-SW = $SW$

O = $O$

These are phrase-level labels:

CW = $MW$

SW = $SW$

O = $O$
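The correspondence above can be written as two small lookup tables (the paper-side names are plain-text renderings of the math notation):

```python
# Word-level labels in the released data -> notation in the paper.
WORD_LABEL_MAP = {
    "B-CW": "B_kp",
    "I-CW": "I_kp",
    "E-CW": "E_kp",
    "S-SW": "SW",
    "O": "O",
}

# Phrase-level labels in the released data -> notation in the paper.
PHRASE_LABEL_MAP = {
    "CW": "MW",
    "SW": "SW",
    "O": "O",
}
```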

If you downloaded our preprocessed data, you can skip the preprocess step.

You can also process other datasets with the following command, but make sure the format is the same as the files under /datasets/copyrnn_datasets/kp20k_sorted.

python ./code/labeling.py -src_file_path [path of source file] -trg_file_path [path of target file] -sl_src_file_path [path of source file for sequence label]

Training

To train the model, run:

python ./code/train.py

Test and Evaluation

We merge testing, post-processing, and evaluation in eval.sh.

If you want to only test the model, execute the following command:

python ./code/test.py -dataset_directorys ["kp20k", "nus", "inspec", "semeval"] -model_names ["paecrf"]

You can comment out lines 34 to 37 in eval.sh, and then execute the following command to perform post-processing and evaluate the results.

./code/eval.sh "kp20k" "inspec" "nus" "semeval" "--" "paecrf" 

Popular repositories Loading

  1. PAE-CRF PAE-CRF Public

    This is code for PAE-CRF model.

    Python