
This wiki documents the development process for my master's thesis, titled "Named entity extraction from Portuguese web text".

First, the HAREM dataset was used to perform NER with the available tools, namely Stanford CoreNLP, NLTK, OpenNLP and spaCy. Repeated 10-fold cross-validation was used to evaluate all tools; all results are presented in this wiki. More information on the HAREM collection is available on its page.
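The evaluation loop is the same for every tool: split the HAREM documents into folds, train on some folds, score on the held-out fold, and aggregate across repetitions. Below is a minimal sketch of that repeated 10-fold procedure in Python; `train_tool` and `score_tool` are hypothetical per-tool wrappers, not code from this repository.

```python
# Minimal sketch of the repeated 10-fold evaluation loop.
# train_tool() and score_tool() are hypothetical per-tool wrappers.
import numpy as np
from sklearn.model_selection import RepeatedKFold

def evaluate(documents, train_tool, score_tool, repeats=3):
    rkf = RepeatedKFold(n_splits=10, n_repeats=repeats, random_state=42)
    f1_scores = []
    for train_idx, test_idx in rkf.split(documents):
        train = [documents[i] for i in train_idx]
        test = [documents[i] for i in test_idx]
        model = train_tool(train)                  # tool-specific training
        f1_scores.append(score_tool(model, test))  # e.g. entity-level F1
    return np.mean(f1_scores), np.std(f1_scores)
```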

After evaluating all tools with the baseline configuration, I performed a hyperparameter study for each tool, this time using repeated holdout validation.
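For reference, repeated holdout simply re-draws a random train/test partition several times and averages the score per hyperparameter setting. A minimal sketch, assuming a `train_and_score` placeholder and a pre-defined list of candidate settings (neither is from the thesis code):

```python
# Minimal sketch of repeated holdout validation for a hyperparameter study.
# train_and_score() is a hypothetical placeholder for each tool's pipeline.
import random
import statistics

def repeated_holdout(documents, settings, train_and_score,
                     repeats=5, train_ratio=0.8):
    """Return the setting with the best mean score over repeated splits."""
    best_setting, best_score = None, float("-inf")
    for setting in settings:
        scores = []
        for seed in range(repeats):
            docs = documents[:]
            random.Random(seed).shuffle(docs)     # new split per repetition
            cut = int(len(docs) * train_ratio)
            scores.append(train_and_score(setting, docs[:cut], docs[cut:]))
        mean = statistics.mean(scores)
        if mean > best_score:
            best_setting, best_score = setting, mean
    return best_setting, best_score
```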

I then manually annotated a subset of SIGARRA news, producing a Portuguese corpus of 905 annotated news articles. Finally, I trained models with each tool on this dataset. More information on the SIGARRA News Corpus is available on its page.
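As an illustration of the training step, here is a minimal sketch for one of the tools, spaCy, using its current v3 API (the thesis used an earlier version, so this is not the original code); the example sentence, entity offsets, and labels are made up.

```python
# Minimal sketch of training a spaCy NER model from character-offset
# annotations. The training pair below is a made-up example.
import spacy
from spacy.training import Example

TRAIN_DATA = [
    ("O reitor visitou a FEUP no Porto.",
     {"entities": [(19, 23, "ORGANIZACAO"), (27, 32, "LOCAL")]}),
]

nlp = spacy.blank("pt")
nlp.add_pipe("ner")

examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)  # labels inferred from examples
for epoch in range(20):
    for example in examples:
        nlp.update([example], sgd=optimizer)
```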

Main repository folders

All tools were intended to be run on HAREM at four different entity levels (a small projection sketch follows the list):

  • Categories: use only categories
  • Types: use only types
  • Subtypes: use only subtypes
  • Filtered: use filtered categories (subset of categories)
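A minimal sketch of projecting a full HAREM annotation onto one of these levels; the filtered set shown here and the three-field category/type/subtype representation are illustrative assumptions, not the exact repository encoding.

```python
# Illustrative projection of a HAREM annotation onto one entity level.
# FILTERED is an assumed subset of categories, not the thesis's exact set.
FILTERED = {"PESSOA", "LOCAL", "ORGANIZACAO", "TEMPO", "VALOR"}

def project(categ, tipo, subtipo, level):
    if level == "categories":
        return categ
    if level == "types":
        return tipo
    if level == "subtypes":
        return subtipo
    if level == "filtered":
        return categ if categ in FILTERED else None
    raise ValueError(f"unknown level: {level}")
```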