- An implementation of the papers *DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing* and *Multilingual Neural RST Discourse Parsing*.
- Users can apply it to parse input text from scratch and obtain the EDU segmentation and the parsed tree structure.
- The model supports both sentence-level and document-level RST discourse parsing.
- This repo and the pre-trained model are for research use only. Please cite the papers if they are helpful.
The model training and inference scripts were tested with the following libraries and versions:
- pytorch==1.7.1
- transformers==4.8.2
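As a quick sanity check (a minimal sketch, assuming a standard pip/conda environment), you can confirm the tested versions are installed:

```python
# Minimal sanity check: print the versions of the libraries the scripts
# were tested with.
import torch
import transformers

print(torch.__version__)         # tested with 1.7.1
print(transformers.__version__)  # tested with 4.8.2
```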
- Follow the steps in the two sub-folders under `Preprocess_RST_Data`.
- Note that the `XLM-RoBERTa-base` tokenizer is used in both the treebank pre-processing and the model training scripts; for other tokenizers, you should change both accordingly (a minimal loading sketch follows this list).
- After all treebank pre-processing steps, the samples are stored in pickle files (the output path is set by the user).
- Since some treebanks require an LDC license, here we only provide one public dataset as an example.
- The example pre-processed treebank GUM (Zeldes, A., 2017) (English-only) is located in the folder `./depth_mode/pkl_data_for_train/en-gum/`.
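For reference, a minimal sketch of loading the tokenizer that the pre-processing and training scripts assume (the sample sentence is arbitrary):

```python
# Minimal sketch: the XLM-RoBERTa-base tokenizer assumed by the pre-processing
# and training scripts; a different backbone requires matching changes in both.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokens = tokenizer.tokenize("Although the report didn't trigger the drop, it played a role.")
print(tokens)
```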
- Run the script `MUL_main_Train.py` to train a model.
- Before you start training, we recommend reading the parameter settings.
- The pre-processed data in the folder `./depth_mode/pkl_data_for_train/en-gum/` (English-only) is used for training by default (see the sketch after this list for a quick inspection of that folder).
- Note that the `XLM-RoBERTa-base` tokenizer is used in both the treebank pre-processing and the model training scripts; for other tokenizers, you should change both accordingly.
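To see what the training script will consume, a minimal sketch that lists the pre-processed pickle files (no file names are assumed; they depend on the output path you set during pre-processing):

```python
# Minimal sketch: list the pre-processed GUM pickle files used for training
# by default.
import os

data_dir = "./depth_mode/pkl_data_for_train/en-gum/"
for name in sorted(os.listdir(data_dir)):
    print(name)
```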
Instead of re-training the model, you can use the trained parser for inference (the model checkpoint is located at `./depth_mode/Savings/`).
We trained and evaluated the model on a multilingual collection of RST discourse treebanks, and it natively supports six languages: English, Portuguese, Spanish, German, Dutch, and Basque. Interested users can also try it on other languages.
- [Input] `InputSentence`: the input document/sentence; the raw text is tokenized and encoded by the `xlm-roberta-base` language backbone.
  - Raw Sequence Example: Although the report, which was released before the stock market opened, didn't trigger the 190.58 point drop in the Dow Jones Industrial Average, analysts said it did play a role in the market's decline.
- [Output] `EDU_Breaks`: the indices of the EDU boundary tokens, including the last word of the sentence (a reconstruction sketch follows this item).
  - Output Example: [5, 10, 17, 33, 37, 49]
  - Segmented Sequence Example ('||' denotes the EDU boundary positions, for readability): Although the report, || which was released || before the stock market opened, || didn't trigger the 190.58 point drop in the Dow Jones Industrial Average, || analysts said || it did play a role in the market's decline. ||
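Since `EDU_Breaks` lists the index of the last token of each EDU, the segments can be recovered with a simple split. A minimal sketch with a toy token list (not the model's actual tokenization):

```python
# Minimal sketch: recover EDU spans from EDU_Breaks, where each entry is the
# (inclusive) index of the last token of an EDU.
def split_edus(tokens, edu_breaks):
    spans, start = [], 0
    for end in edu_breaks:
        spans.append(tokens[start:end + 1])
        start = end + 1
    return spans

toy_tokens = "Although the report was released , markets did not fall .".split()
print(split_edus(toy_tokens, [5, 10]))
# [['Although', 'the', 'report', 'was', 'released', ','],
#  ['markets', 'did', 'not', 'fall', '.']]
```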
- [Output] `tree_parsing_output`: the discourse parsing tree is output in a top-down constituency parsing format, where each parenthesized node gives the left and right child spans with their nuclearity and relation labels.
  - Output Example: (1:Satellite=Contrast:4,5:Nucleus=span:6) (1:Nucleus=Same-Unit:3,4:Nucleus=Same-Unit:4) (5:Satellite=Attribution:5,6:Nucleus=span:6) (1:Satellite=span:1,2:Nucleus=Elaboration:3) (2:Nucleus=span:2,3:Satellite=Temporal:3)
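Each node has the form (left_start:Nuclearity=Relation:left_end,right_start:Nuclearity=Relation:right_end). A minimal sketch that parses this string (the dictionary keys are our own illustrative labels, not part of the repo's API):

```python
# Minimal sketch: parse the top-down constituency output into per-node child
# spans. Field names here are illustrative, not part of the repo's API.
import re

NODE = re.compile(r"\((\d+):(\w+)=([\w-]+):(\d+),(\d+):(\w+)=([\w-]+):(\d+)\)")

def parse_tree_output(tree_str):
    nodes = []
    for l1, nuc1, rel1, r1, l2, nuc2, rel2, r2 in NODE.findall(tree_str):
        nodes.append({
            "left":  {"span": (int(l1), int(r1)), "nuclearity": nuc1, "relation": rel1},
            "right": {"span": (int(l2), int(r2)), "nuclearity": nuc2, "relation": rel2},
        })
    return nodes

print(parse_tree_output("(1:Satellite=Contrast:4,5:Nucleus=span:6)"))
```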
- Put the text paragraph into the file `./data/text_for_inference.txt`.
- The pre-trained model checkpoint is located at `./depth_mode/Savings/`.
- Run the script `MUL_main_Infer.py` to obtain the RST parsing result. See the script for the detailed model output.
- We recommend running the parser in a GPU-equipped environment (see the sketch after this list).
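A minimal sketch of preparing the input file and checking for a GPU before launching `MUL_main_Infer.py` (the sample paragraph is arbitrary):

```python
# Minimal sketch: write a paragraph to the expected input file and confirm a
# GPU is visible before running MUL_main_Infer.py.
import torch

text = "Although the report was released early, analysts said it played a role in the decline."
with open("./data/text_for_inference.txt", "w", encoding="utf-8") as f:
    f.write(text)

print("CUDA available:", torch.cuda.is_available())
```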
@inproceedings{liu-etal-2021-dmrst,
title = "{DMRST}: A Joint Framework for Document-Level Multilingual {RST} Discourse Segmentation and Parsing",
author = "Liu, Zhengyuan and Shi, Ke and Chen, Nancy",
booktitle = "Proceedings of the 2nd Workshop on Computational Approaches to Discourse",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic and Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.codi-main.15",
pages = "154--164",
}
@inproceedings{liu2020multilingual,
title={Multilingual Neural RST Discourse Parsing},
author={Liu, Zhengyuan and Shi, Ke and Chen, Nancy},
booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
pages={6730--6738},
year={2020}
}