
HAREM collection

André Pires edited this page Jun 28, 2017 · 12 revisions

All of HAREM's resources can be downloaded here, including the dataset, the golden collection, all of the participants' results, and the extra programs made available for the HAREM conference.

The collection comprises 129 annotated documents, with texts in both European Portuguese (pt-PT, ~60%) and Brazilian Portuguese (pt-BR, ~40%).

HAREM categories, types and subtypes

  • ABSTRACCAO: DISCIPLINA, ESTADO, IDEIA, NOME, OUTRO
  • ACONTECIMENTO: EFEMERIDE, EVENTO, ORGANIZADO, OUTRO
  • COISA: CLASSE, MEMBROCLASSE, OBJECTO, SUBSTANCIA, OUTRO
  • LOCAL: FISICO (ILHA, AGUACURSO, PLANETA, REGIAO, RELEVO, AGUAMASSA, OUTRO), HUMANO (RUA, PAIS, DIVISAO, REGIAO, CONSTRUCAO, OUTRO), VIRTUAL (COMSOCIAL, SITIO, OBRA, OUTRO), OUTRO
  • OBRA: ARTE, PLANO, REPRODUZIDA, OUTRO
  • ORGANIZACAO: ADMINISTRACAO, EMPRESA, INSTITUICAO, OUTRO
  • PESSOA: CARGO, GRUPOCARGO, GRUPOIND, GRUPOMEMBRO, INDIVIDUAL, MEMBRO, POVO, OUTRO
  • TEMPO: DURACAO, FREQUENCIA, GENERICO, TEMPO_CALEND (HORA, INTERVALO, DATA, OUTRO), OUTRO
  • VALOR: CLASSIFICACAO, MOEDA, QUANTIDADE, OUTRO
  • OUTRO
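
The hierarchy above maps naturally onto a nested dictionary (category, then type, then subtypes). The following is a minimal sketch covering only an excerpt of the list; the names `TAXONOMY` and `is_valid_label` are hypothetical and not part of the HAREM tooling.

```python
# Excerpt of the hierarchy listed above: category -> type -> subtypes.
# An empty list means the type has no subtypes.
TAXONOMY = {
    "LOCAL": {
        "FISICO": ["ILHA", "AGUACURSO", "PLANETA", "REGIAO", "RELEVO",
                   "AGUAMASSA", "OUTRO"],
        "HUMANO": ["RUA", "PAIS", "DIVISAO", "REGIAO", "CONSTRUCAO", "OUTRO"],
        "VIRTUAL": ["COMSOCIAL", "SITIO", "OBRA", "OUTRO"],
        "OUTRO": [],
    },
    "TEMPO": {
        "DURACAO": [],
        "FREQUENCIA": [],
        "GENERICO": [],
        "TEMPO_CALEND": ["HORA", "INTERVALO", "DATA", "OUTRO"],
        "OUTRO": [],
    },
    "VALOR": {
        "CLASSIFICACAO": [],
        "MOEDA": [],
        "QUANTIDADE": [],
        "OUTRO": [],
    },
}


def is_valid_label(category, type_=None, subtype=None):
    """Check a (category, type, subtype) combination against the excerpt."""
    types = TAXONOMY.get(category)
    if types is None:
        return False
    if type_ is None:
        # A subtype without a type makes no sense in this hierarchy.
        return subtype is None
    if type_ not in types:
        return False
    return subtype is None or subtype in types[type_]
```

A lookup such as `is_valid_label("LOCAL", "HUMANO", "PAIS")` can be used to sanity-check annotations before filtering.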

Table form here.

Examples for each one here.

HAREM collection filter method

The filter scripts use lxml for all XML-related operations.

  1. Strip tags from unnecessary categories, types and subtypes (for filtered level)
    1. Removed categories: ['OBRA','COISA','ABSTRACCAO','OUTRO']
    2. Removed types: ['CARGO','GRUPOCARGO','GRUPOMEMBRO','MEMBRO','GRUPOIND','POVO', 'EFEMERIDE','VIRTUAL']
    3. Removed subtypes: ['REGIAO','OUTRO','AGUAMASSA','AGUACURSO','RELEVO','PLANETA','ADMINISTRACAO']
  2. For the remaining elements, remove unnecessary attributes
    1. Removed: ['TIPO','SUBTIPO','COREL','TIPOREL','ID','COMENT']
  3. Strip the OMITIDO tag and everything inside it
  4. Deal with multiple category, type or subtype assignments
    1. Select the first option in each alternative
  5. Deal with the ALT tag (script)
    1. Select all ALT tags
    2. For the ALT tags which don't have entities inside, select the first alternative
    3. For the rest, calculate the number of entities inside each alternative and select the alternative which has the highest number of entities
    4. Strip all ALT tags
  6. Output to file
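
Steps 1 to 5 can be sketched with lxml roughly as follows. This is not the project's actual script, only an illustration under several assumptions that this page does not state: entities are `EM` elements, the category is stored in a `CATEG` attribute, multiple assignments and ALT alternatives are separated by `|`, and ALT tags carry no attributes. The function names are hypothetical. The sketch judges an entity by the first option of each attribute, and it resolves ALT tags on the serialized string with a regular expression rather than on the tree, which is a simplification.

```python
import re
from lxml import etree

# Lists copied from the steps above.
REMOVED_CATEGORIES = {"OBRA", "COISA", "ABSTRACCAO", "OUTRO"}
REMOVED_TYPES = {"CARGO", "GRUPOCARGO", "GRUPOMEMBRO", "MEMBRO",
                 "GRUPOIND", "POVO", "EFEMERIDE", "VIRTUAL"}
REMOVED_SUBTYPES = {"REGIAO", "OUTRO", "AGUAMASSA", "AGUACURSO",
                    "RELEVO", "PLANETA", "ADMINISTRACAO"}
REMOVED_ATTRIBUTES = ["TIPO", "SUBTIPO", "COREL", "TIPOREL", "ID", "COMENT"]

# Assumptions about the markup, not stated on this page.
ENTITY_TAG = "EM"
CATEGORY_ATTR = "CATEG"
ALT_PATTERN = re.compile(r"<ALT>(.*?)</ALT>", re.DOTALL)


def first_option(value):
    """Return the first alternative of an 'A|B' value ('' if missing)."""
    return (value or "").split("|")[0].strip()


def pick_alternative(match):
    """Keep the ALT alternative that contains the most entity tags."""
    alternatives = match.group(1).split("|")
    # max() returns the first maximal item, so ties (including the case
    # where no alternative has entities) fall back to the first one.
    return max(alternatives, key=lambda alt: alt.count("<" + ENTITY_TAG))


def filter_collection(xml_text):
    root = etree.fromstring(xml_text)

    # Step 1: unwrap unwanted entities, keeping their text in place.
    for entity in list(root.iter(ENTITY_TAG)):
        if (first_option(entity.get(CATEGORY_ATTR)) in REMOVED_CATEGORIES
                or first_option(entity.get("TIPO")) in REMOVED_TYPES
                or first_option(entity.get("SUBTIPO")) in REMOVED_SUBTYPES):
            entity.tag = "UNWANTED"  # temporary marker name
    etree.strip_tags(root, "UNWANTED")

    # Step 2: remove the unnecessary attributes everywhere.
    etree.strip_attributes(root, *REMOVED_ATTRIBUTES)

    # Step 3: drop OMITIDO and its contents, but keep the text after it.
    etree.strip_elements(root, "OMITIDO", with_tail=False)

    # Step 4: keep only the first option of a multiple assignment.
    for entity in root.iter(ENTITY_TAG):
        if entity.get(CATEGORY_ATTR) is not None:
            entity.set(CATEGORY_ATTR, first_option(entity.get(CATEGORY_ATTR)))

    # Step 5: choose one alternative per ALT block and strip the ALT tags.
    serialized = etree.tostring(root, encoding="unicode")
    return ALT_PATTERN.sub(pick_alternative, serialized)
```

Splitting ALT content on `|` is only safe here because steps 2 and 4 have already removed every attribute value that could contain that character. The returned string can then be written to a file for step 6.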

Other processes:

  • Remove unwanted spaces (script)
  • Split dataset between train and test sets (script)
  • To output dataset with only categories, only types or only subtypes, set the category to the desired level
  • Replace & with &amp;
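
The auxiliary steps above could look roughly like this. It is a sketch using only the standard library, not the project's scripts. It assumes that the ampersand step escapes bare `&` characters as `&amp;` so the files parse as XML, that "unwanted spaces" means repeated or trailing spaces and tabs, and that the split is made per document. The function names, the 30% test fraction and the fixed seed are illustrative choices.

```python
import random
import re

# A bare ampersand is one that does not already start an entity
# such as &amp; or &#38; or &#x26;.
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z]+|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_ampersands(text):
    """Replace each bare & with &amp;, leaving existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


def collapse_spaces(text):
    """Collapse runs of spaces and tabs and trim both ends of each line."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)


def split_documents(documents, test_fraction=0.3, seed=0):
    """Shuffle the documents reproducibly and return (train, test) lists."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]
```

Using a seeded `random.Random` instance keeps the train and test sets identical across runs, which makes results comparable.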

See the scripts in the filtration folder. Use these commands to run them.
