This repo contains code for converting the tar.gz file (downloaded from the n2c2 data portal) to labels_SPLIT.txt and text_SPLIT.txt, where SPLIT is one of train, dev, or test. This data format is compatible with the NeMo TokenClassification model.
The conversion steps are as follows:
- Convert .xml file to brat format
- Convert brat to bio/iob2 format
- Convert bio to NeMo-compatible format
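
To illustrate the last step, here is a minimal sketch of turning bio lines into the two parallel files. It is not the code in this repo. It assumes the bio input has one token per line with the tag in the last whitespace-separated column and a blank line between sentences; the function name `bio_to_nemo` and the tag names in the sample are hypothetical.

```python
from typing import Iterable, List, Tuple


def bio_to_nemo(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Turn token-per-line bio input into parallel text and label lines.

    Assumed input layout: "TOKEN ... TAG" per line, blank line between
    sentences. Each output sentence becomes one space-separated line.
    """
    texts: List[str] = []
    labels: List[str] = []
    tokens: List[str] = []
    tags: List[str] = []

    def flush() -> None:
        # Emit the sentence collected so far, if any.
        if tokens:
            texts.append(" ".join(tokens))
            labels.append(" ".join(tags))
            tokens.clear()
            tags.clear()

    for raw in lines:
        parts = raw.split()
        if not parts:
            flush()  # blank line marks a sentence boundary
            continue
        tokens.append(parts[0])   # first column: the token
        tags.append(parts[-1])    # last column: the bio tag

    flush()  # handle input that does not end with a blank line
    return texts, labels


# Illustrative sample; the tag names are made up for this example.
sample = [
    "The O",
    "patient O",
    "was O",
    "admitted B-EVENT",
    "yesterday B-TIMEX3",
    "",
    "She O",
    "improved B-EVENT",
]
texts, labels = bio_to_nemo(sample)
```

Writing `texts` to text_SPLIT.txt and `labels` to labels_SPLIT.txt, one entry per line, gives files in which line N of the labels file tags line N of the text file word by word.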
Usage
python i2b2_2012_preprocessing.py
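
After the script finishes, it can be worth checking that each text file and its labels file line up, since every word needs exactly one label. The helper below is a small sketch for that check and is not part of this repo; the name `find_misaligned` is hypothetical.

```python
from typing import List, Sequence


def find_misaligned(text_lines: Sequence[str], label_lines: Sequence[str]) -> List[int]:
    """Return 1-based line numbers where word and label counts differ."""
    if len(text_lines) != len(label_lines):
        raise ValueError(
            f"line count mismatch: {len(text_lines)} text vs {len(label_lines)} labels"
        )
    return [
        i
        for i, (text, label) in enumerate(zip(text_lines, label_lines), start=1)
        if len(text.split()) != len(label.split())
    ]


# In-memory example; the second line has two words but only one label.
bad_lines = find_misaligned(["a b c", "d e"], ["O O O", "O"])
```

To run it on real output, read text_train.txt and labels_train.txt with `open(...).read().splitlines()` and pass the two lists in; an empty result means the files are aligned.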