
FREME-NER datasets for training different classifier implementations #179

Open
reckart opened this issue Feb 21, 2017 · 5 comments


reckart commented Feb 21, 2017

Are the FREME-NER datasets available to train alternative classifier implementations, e.g. Apache OpenNLP NER?


m1ci commented Feb 21, 2017

Hi, yes, we trained on the DBpedia abstracts dataset, see: http://wiki.dbpedia.org/nif-abstract-datasets
The data is in the NIF format, so you'll need to write a small script which reads NIF and creates the training input. That is how we did it: we wrote a script that converts NIF into Stanford NER input.
It would be great to have a generic NIF-to-anything training-input converter.
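
Not the actual FREME script, but a minimal Jena-based sketch of what such a NIF-reading step could look like: it extracts every nif:anchorOf mention together with its begin offset and linked resource. The file name is a placeholder, and producing Stanford NER training data would additionally require tokenizing the nif:isString of the reference context and projecting the mention offsets onto tokens.

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.riot.RDFDataMgr;

public class NifMentionExtractor {
    static final String NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#";
    static final String ITSRDF = "http://www.w3.org/2005/11/its/rdf#";

    public static void main(String[] args) {
        // Placeholder file name; any NIF Turtle file from the abstracts dump.
        Model model = RDFDataMgr.loadModel("abstracts.ttl");

        Property anchorOf = model.createProperty(NIF, "anchorOf");
        Property beginIndex = model.createProperty(NIF, "beginIndex");
        Property taIdentRef = model.createProperty(ITSRDF, "taIdentRef");

        // Every resource with nif:anchorOf is a mention; print surface form, offset, link target.
        StmtIterator it = model.listStatements(null, anchorOf, (RDFNode) null);
        while (it.hasNext()) {
            Statement s = it.next();
            Resource mention = s.getSubject();
            Statement begin = mention.getProperty(beginIndex);
            Statement link = mention.getProperty(taIdentRef);
            System.out.println(s.getString()
                    + "\t" + (begin != null ? begin.getLiteral().getLexicalForm() : "?")
                    + "\t" + (link != null ? link.getResource().getURI() : "-"));
        }
    }
}
```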


reckart commented Feb 21, 2017

DKPro Core might help you out here :)

  • We have a NIF reader - I have tested it on some NIF samples I found on the net, but not on the DBpedia datasets. Also, it is presently only available in SNAPSHOT builds.
  • We have writers for all kinds of formats.
  • It is pretty straightforward to create a script to convert from one format to another (see the sketch after this list).
  • We have even started adding some training components, e.g. for Stanford NER and OpenNLP NER.
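
A rough sketch of what such a conversion could look like with DKPro Core and uimaFIT: read NIF, add tokens and sentences with a segmenter, and write CoNLL 2002 (a common NER training format). Class and parameter names follow the usual DKPro Core conventions but should be treated as assumptions, especially since the NIF reader was only available in SNAPSHOT builds at the time; paths are placeholders.

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.conll.Conll2002Writer;
import de.tudarmstadt.ukp.dkpro.core.io.nif.NifReader;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class Nif2Conll {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
                // Read NIF Turtle files from a placeholder directory.
                createReaderDescription(NifReader.class,
                        NifReader.PARAM_SOURCE_LOCATION, "abstracts/",
                        NifReader.PARAM_PATTERNS, "*.ttl",
                        NifReader.PARAM_LANGUAGE, "en"),
                // NIF files may not carry token/sentence annotations, so segment first.
                createEngineDescription(OpenNlpSegmenter.class),
                // Write token-per-line CoNLL 2002 with BIO named entity labels.
                createEngineDescription(Conll2002Writer.class,
                        Conll2002Writer.PARAM_TARGET_LOCATION, "conll-out/"));
    }
}
```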


m1ci commented Feb 21, 2017

Glad to hear that! Will definitely look into it.


m1ci commented Feb 22, 2017

@reckart we have just released the latest version of the DBpedia abstracts for several languages, see http://downloads.dbpedia.org/2016-10/core-i18n/. They are a nice source for training NER.

Let us know if you have any questions.

Best,
Milan


reckart commented Feb 24, 2017

@m1ci puh, these files are huge! I was kind of hoping for a ZIP with one .ttl file per article. How do you work with such large files? Would you recommend some RDF store?
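
(For reference, one way to work with such large Turtle dumps without loading them into memory or an RDF store is Jena's streaming parser; a minimal sketch, where the file name and the filtered property are placeholders.)

```java
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.system.StreamRDFBase;

public class StreamNifDump {
    public static void main(String[] args) {
        // Parse the dump triple by triple instead of building an in-memory model.
        RDFDataMgr.parse(new StreamRDFBase() {
            @Override
            public void triple(Triple t) {
                // Handle one triple at a time, e.g. pick out nif:anchorOf statements
                // (whose objects are string literals).
                if (t.getPredicate().getURI().endsWith("#anchorOf")) {
                    System.out.println(t.getSubject() + "\t" + t.getObject().getLiteralLexicalForm());
                }
            }
        }, "dbpedia-abstracts_en.ttl");
    }
}
```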
