HAREM collection

All of HAREM's resources can be downloaded here, including the dataset and the golden collection, all of the participants' results, and the extra programs made available at the HAREM conference.

The collection comprises 129 annotated documents, with texts in both European Portuguese (pt-PT, ~60%) and Brazilian Portuguese (pt-BR, ~40%).

HAREM categories, types and subtypes

ABSTRACCAO: DISCIPLINA, ESTADO, IDEIA, NOME, OUTRO
ACONTECIMENTO: EFEMERIDE, EVENTO, ORGANIZADO, OUTRO
COISA: CLASSE, MEMBROCLASSE, OBJECTO, SUBSTANCIA, OUTRO
LOCAL: FISICO (ILHA, AGUACURSO, PLANETA, REGIAO, RELEVO, AGUAMASSA, OUTRO), HUMANO (RUA, PAIS, DIVISAO, REGIAO, CONSTRUCAO, OUTRO), VIRTUAL (COMSOCIAL, SITIO, OBRA, OUTRO), OUTRO
OBRA: ARTE, PLANO, REPRODUZIDA, OUTRO
ORGANIZACAO: ADMINISTRACAO, EMPRESA, INSTITUICAO, OUTRO
PESSOA: CARGO, GRUPOCARGO, GRUPOIND, GRUPOMEMBRO, INDIVIDUAL, MEMBRO, POVO, OUTRO
TEMPO: DURACAO, FREQUENCIA, GENERICO, TEMPO_CALEND (HORA, INTERVALO, DATA, OUTRO), OUTRO
VALOR: CLASSIFICACAO, MOEDA, QUANTIDADE, OUTRO
OUTRO

Table form here.

Examples for each one here.

HAREM collection filter method

lxml is used for all XML-related operations.

Strip tags from unnecessary categories, types and subtypes (for filtered level)
1. Removed categories: ['OBRA','COISA','ABSTRACCAO','OUTRO']
2. Removed types: ['CARGO','GRUPOCARGO','GRUPOMEMBRO','MEMBRO','GRUPOIND','POVO', 'EFEMERIDE','VIRTUAL']
3. Removed subtypes: ['REGIAO','OUTRO','AGUAMASSA','AGUACURSO','RELEVO','PLANETA','ADMINISTRACAO']
For the remaining elements, remove unnecessary attributes
1. Removed: ['TIPO','SUBTIPO','COREL','TIPOREL','ID','COMENT']
Strip the OMITIDO tag and everything inside it
Deal with multiple category, type or subtype assignments
1. Select the first option in each alternative
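The filtering steps so far (tag stripping, attribute removal, OMITIDO removal, and first-option selection) could be sketched with lxml roughly as follows. This is a minimal sketch, not the project's actual script: it assumes entities are marked with EM elements carrying a CATEG attribute, and that multiple assignments are separated by "|". The function names and the temporary DROP tag are hypothetical.

```python
from lxml import etree

REMOVED_CATEGORIES = {"OBRA", "COISA", "ABSTRACCAO", "OUTRO"}
REMOVED_TYPES = {"CARGO", "GRUPOCARGO", "GRUPOMEMBRO", "MEMBRO",
                 "GRUPOIND", "POVO", "EFEMERIDE", "VIRTUAL"}
REMOVED_SUBTYPES = {"REGIAO", "OUTRO", "AGUAMASSA", "AGUACURSO",
                    "RELEVO", "PLANETA", "ADMINISTRACAO"}
REMOVED_ATTRIBUTES = ["TIPO", "SUBTIPO", "COREL", "TIPOREL", "ID", "COMENT"]


def first_option(value):
    """Keep only the first of several '|'-separated assignments (assumed format)."""
    return value.split("|")[0].strip() if value else ""


def filter_entities(root):
    """Filter entity annotations in place and return the same root element."""
    # Assumption: entities are <EM> elements with CATEG/TIPO/SUBTIPO attributes.
    for em in list(root.iter("EM")):
        category = first_option(em.get("CATEG"))
        type_ = first_option(em.get("TIPO"))
        subtype = first_option(em.get("SUBTIPO"))
        if (category in REMOVED_CATEGORIES
                or type_ in REMOVED_TYPES
                or subtype in REMOVED_SUBTYPES):
            # Mark for unwrapping; the text inside is kept, only the tag goes.
            em.tag = "DROP"
            continue
        if category:
            em.set("CATEG", category)
        for name in REMOVED_ATTRIBUTES:
            em.attrib.pop(name, None)
    # strip_tags removes the tags but keeps text and children.
    etree.strip_tags(root, "DROP")
    # strip_elements removes the element and its content; the tail text stays.
    etree.strip_elements(root, "OMITIDO", with_tail=False)
    return root
```

A typical call would parse the collection with `etree.parse(...)`, pass `tree.getroot()` to `filter_entities`, and serialize the result with `etree.tostring(root, encoding="unicode")`.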
Deal with the ALT tag (script)
1. Select all ALT tags
2. For the ALT tags which don't have entities inside, select the first alternative
3. For the rest, calculate the number of entities inside each alternative and select the alternative which has the highest number of entities
4. Strip all ALT tags
Output to file
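The ALT resolution described in the steps above could look roughly like this. It is a sketch under stated assumptions rather than the actual script: it works on the serialized text instead of the lxml tree, and it assumes that ALT elements are not nested, that alternatives inside an ALT are separated by "|", and that entities are EM elements. The names `resolve_alt`, `ALT_PATTERN`, `SEPARATOR`, and `ENTITY_OPEN` are hypothetical.

```python
import re

# Assumption: <ALT> elements are not nested and carry no attributes.
ALT_PATTERN = re.compile(r"<ALT>(.*?)</ALT>", re.DOTALL)
# Split on '|' only when it is outside a tag, so a '|' left inside an
# attribute value does not break an alternative apart.
SEPARATOR = re.compile(r"\|(?![^<>]*>)")
# Opening tag of an entity (assumed to be <EM ...>).
ENTITY_OPEN = re.compile(r"<EM[\s>]")


def resolve_alt(text):
    """Replace every <ALT>...</ALT> with the alternative holding the most entities."""
    def pick(match):
        alternatives = SEPARATOR.split(match.group(1))
        # max() returns the first maximal item, so an ALT without entities
        # (all counts zero) falls back to its first alternative.
        return max(alternatives, key=lambda alt: len(ENTITY_OPEN.findall(alt)))

    return ALT_PATTERN.sub(pick, text)
```

Because `max` keeps the first alternative on ties, the "no entities" rule and the "most entities" rule are covered by the same call, and the ALT tags themselves disappear as part of the substitution.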

Other processes:

Remove unwanted spaces (script)
Split dataset between train and test sets (script)
To output dataset with only categories, only types or only subtypes, set the category to the desired level
Replace &amp; with &
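The train/test split listed above might be done at document level along these lines. This is only an illustrative sketch: the source does not state the split ratio or whether the split is randomized, so `test_fraction`, `seed`, and the function name `split_documents` are all assumptions.

```python
import random


def split_documents(documents, test_fraction=0.2, seed=0):
    """Shuffle documents reproducibly and return (train, test) lists.

    test_fraction and seed are hypothetical defaults, not values from the source.
    """
    docs = list(documents)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(docs)
    n_test = round(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]
```

With the 129 documents of the collection and `test_fraction=0.2`, this puts 26 documents in the test set and 103 in the training set.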

The scripts are in the filtration folder. Use these commands to run them.