I2B2 2012 Preprocessing

This repo contains code to convert the tar.gz file (downloaded from the n2c2 data portal) into labels_SPLIT.txt and text_SPLIT.txt, where SPLIT is one of train, dev, or test. This data format is compatible with the NeMo TokenClassification model.

The conversion steps are as follows:

  1. Convert .xml file to brat format
  2. Convert brat to bio/iob2 format
  3. Convert bio to NeMo-compatible format
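
Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code in i2b2_2012_preprocessing.py: the helper names (`spans_to_bio`, `to_nemo_lines`), the whitespace tokenization, and the `PROBLEM` label in the example are assumptions made for the sketch. It also assumes the NeMo token classification layout of one sentence per line in the text file and one space-separated tag per token on the matching line of the labels file.

```python
import re
from typing import List, Tuple

# Hypothetical span type: (start_offset, end_offset, label),
# mirroring the character offsets of a brat text-bound annotation.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Whitespace-tokenize `text` and assign one IOB2 tag per token."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token is tagged if its character range overlaps the span.
            if match.start() < end and match.end() > start:
                # The token covering the span start gets B-, later ones I-.
                prefix = "B-" if match.start() <= start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


def to_nemo_lines(tokens: List[str], tags: List[str]) -> Tuple[str, str]:
    """Return one line for text_SPLIT.txt and one for labels_SPLIT.txt."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return " ".join(tokens), " ".join(tags)


# Illustrative sentence and span, not taken from the dataset.
sentence = "Patient was admitted for chest pain ."
tokens, tags = spans_to_bio(sentence, [(25, 35, "PROBLEM")])
text_line, labels_line = to_nemo_lines(tokens, tags)
```

Each `text_line` and `labels_line` pair would then be appended to text_SPLIT.txt and labels_SPLIT.txt respectively, so that line N of one file aligns with line N of the other.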

Usage

python i2b2_2012_preprocessing.py