Local to global graphical reasoning framework for extracting structured information from biomedical literature
python==3.8
torch==1.13.1
transformers==3.0.0
numpy==1.19.5
scikit-learn==0.23.1
stanfordcorenlp
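As a quick sanity check on the environment, the pinned versions above can be compared with what is installed. This is a minimal sketch and not part of the repository: the `check_versions` helper and the `PINNED` table are hypothetical names, and `stanfordcorenlp` is left out because no version is pinned for it.

```python
from importlib import metadata

# Pins copied from the requirements list above.
PINNED = {
    "torch": "1.13.1",
    "transformers": "3.0.0",
    "numpy": "1.19.5",
    "scikit-learn": "0.23.1",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a torch build that reports a local suffix (for example a CUDA tag after the version number) would be listed as a mismatch even though the base version is the pinned one.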
Download the DV dataset (reference: Cross-Sentence N-ary Relation Extraction with Graph LSTMs).
Download the CDR dataset from https://biocreative.bioinformatics.udel.edu/media/store/files/2016/CDR_Data.zip
Download the GDA dataset from https://bitbucket.org/alexwuhkucs/gda-extraction/get/fd4a7409365e.zip
Download the DocRED dataset from https://github.com/thunlp/DocRED
Extract each downloaded dataset archive and place its files in the ./data folder, as shown below.
For the CDR dataset:
--data
  --CDR_data
    --CDR_DevelopmentSet.PubTator.txt
    --CDR_TrainingSet.PubTator.txt
    --CDR_TestSet.PubTator.txt
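Before preprocessing, it may help to confirm that the CDR files are where the scripts are expected to look. The sketch below assumes the three files sit in a `CDR_data` folder inside `./data`, which is how the listing above reads. The `missing_cdr_files` helper is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the CDR layout listed above.
CDR_FILES = [
    "CDR_DevelopmentSet.PubTator.txt",
    "CDR_TrainingSet.PubTator.txt",
    "CDR_TestSet.PubTator.txt",
]


def missing_cdr_files(data_root="./data"):
    """Return the expected CDR files that are absent from data_root/CDR_data."""
    cdr_dir = Path(data_root) / "CDR_data"
    return [name for name in CDR_FILES if not (cdr_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_cdr_files()
    print("All CDR files found." if not missing else f"Missing: {missing}")
```

An empty list means all three files are in place. If the folder itself is missing, all three names are returned.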
Download the SciBERT or BERT model from https://huggingface.co/allenai/scibert_scivocab_uncased or https://huggingface.co/bert-base-uncased, respectively.
To preprocess a dataset, run the script that matches it.
For the DV dataset:
python dv_preprocessing.py
For the CDR dataset:
python cdr_preprocessing.py
For the GDA dataset:
python gda_preprocessing.py
For the DocRED dataset:
python doc_preprocessing.py
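The dataset-to-script mapping above can also be expressed as a small launcher, which may be convenient when scripting several runs. This is only a sketch: the script names come from the commands above, while `SCRIPTS`, `preprocessing_command`, and `run_preprocessing` are hypothetical names that do not exist in the repository. It assumes the scripts take no arguments and are run from the repository root.

```python
import subprocess
import sys

# Script names as given in the commands above.
SCRIPTS = {
    "DV": "dv_preprocessing.py",
    "CDR": "cdr_preprocessing.py",
    "GDA": "gda_preprocessing.py",
    "DocRED": "doc_preprocessing.py",
}


def preprocessing_command(dataset):
    """Build the command line that runs the preprocessing script for one dataset."""
    if dataset not in SCRIPTS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {sorted(SCRIPTS)}")
    # sys.executable keeps the run inside the current Python environment.
    return [sys.executable, SCRIPTS[dataset]]


def run_preprocessing(dataset):
    """Run the script and raise CalledProcessError if it exits with an error."""
    subprocess.run(preprocessing_command(dataset), check=True)
```

Calling `run_preprocessing("CDR")` would then be equivalent to `python cdr_preprocessing.py` in the active environment.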
When preprocessing finishes, the output files are written to the ./prepro_data folder. For the CDR dataset, these are:
--CDR_DevelopmentSet.PubTator.json
--CDR_TrainingSet.PubTator.json
--CDR_TestSet.PubTator.json
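To confirm that preprocessing produced usable output, the files in ./prepro_data can be loaded and summarized. The schema of these files is not described here, so the sketch below makes no assumption about their fields. It only assumes that each file holds a single JSON document, and it reports the top-level type and entry count. `summarize_json` and `summarize_folder` are hypothetical helpers.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return (top-level type name, number of top-level entries) for one JSON file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


def summarize_folder(folder="./prepro_data"):
    """Summarize every .json file found directly inside the folder."""
    return {p.name: summarize_json(p) for p in sorted(Path(folder).glob("*.json"))}


if __name__ == "__main__":
    for name, (kind, size) in summarize_folder().items():
        print(f"{name}: {kind} with {size} top-level entries")
```

An empty result means no .json files were found, which usually points to a preprocessing step that did not run or wrote somewhere else.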
Run main.py to train and test the model:
python main.py
...
BEST: Epoch: 36 | NT F1: 0.7071072883657764 | F1: 0.7071072883657764 | Intra F1: 0.7372654155495979 | Inter F1: 0.6424581005586593 | Precision: 0.696078431372549 | Recall: 0.718491260349586 | AUC: 0.6109029466769528 | THETA: 0.9996838569641113
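The BEST line packs all metrics into one `|`-separated record, so it is easy to turn into a dictionary for comparison across runs. The sketch below parses the line shown above and recomputes F1 as the harmonic mean of precision and recall. `parse_best_line` and `harmonic_f1` are hypothetical helpers, and the parser assumes every field has the form `name: number`, as in this example.

```python
# The example log line from above, split across string literals for readability.
BEST_LINE = (
    "BEST: Epoch: 36 | NT F1: 0.7071072883657764 | F1: 0.7071072883657764 | "
    "Intra F1: 0.7372654155495979 | Inter F1: 0.6424581005586593 | "
    "Precision: 0.696078431372549 | Recall: 0.718491260349586 | "
    "AUC: 0.6109029466769528 | THETA: 0.9996838569641113"
)


def parse_best_line(line):
    """Parse 'BEST: name: value | name: value | ...' into {name: float}."""
    body = line.strip()
    if body.startswith("BEST:"):
        body = body[len("BEST:"):]
    metrics = {}
    for field in body.split("|"):
        name, _, value = field.partition(":")
        metrics[name.strip()] = float(value)
    return metrics


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    m = parse_best_line(BEST_LINE)
    print("reported F1:  ", m["F1"])
    print("recomputed F1:", harmonic_f1(m["Precision"], m["Recall"]))
```

For this line the recomputed value agrees with the reported F1 to about six decimal places, which is a cheap consistency check when copying results into a table. Note that the epoch is also returned as a float.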