NLP project notes: 2019 10 24 to 2019 10 31

Mythili tried to get DeepDive to run, but encountered memory issues. Presumably this would be an issue for deploying this system, so DeepDive may not be a good fit. Mythili and Ian will take another look to see if there is an obvious way to make it work, but we are not very hopeful at this point.

Mythili experimented with another NER system, NeuroNER (which is also open source, https://github.com/Franck-Dernoncourt/NeuroNER ). NeuroNER can use a library called bratreader to work directly with BRAT data. All her work is located here: https://github.com/security-force-monitor/mythili-nlp-2019

Both Ian and Mythili retrained their models using part of the annotated NLP test data set and evaluated them on held-out data from that data set. Mythili's experiments are still in progress.

Ian trained his model to also predict Role, Title and Rank annotations. There is some example output in https://github.com/security-force-monitor/ian-nlp-2019/blob/master/Oct_24_examples.txt . The overall numerical results of his model are still low because of tokenization issues, which we will address by next week using the spaCy tokenizer.
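To illustrate the kind of tokenization issue that can depress scores, here is a minimal sketch of how character-offset annotations (as BRAT stores them) might be mapped onto token-level BIO tags. This is not Ian's code. The function names, the sample sentence, and the regex tokenizer are assumptions made for illustration; the regex is only a stand-in for the spaCy tokenizer, whose tokens also carry a character offset (`token.idx`).

```python
import re
from typing import List, Tuple

# (start_char, end_char, label), with end_char exclusive, as in BRAT standoff files.
Span = Tuple[int, int, str]


def tokenize(text: str) -> List[Tuple[int, int, str]]:
    """Stand-in tokenizer (assumption): words and single punctuation marks, with offsets."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag every token that overlaps an annotated span with B-/I- plus the span label."""
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (tok_start, tok_end, _) in enumerate(tokens):
            if tok_start < end and tok_end > start:  # token overlaps the span
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return [(tok, tag) for (_, _, tok), tag in zip(tokens, tags)]


if __name__ == "__main__":
    sentence = "Major General leads Operation Python Dance II."  # made-up sentence
    org = "Operation Python Dance II"
    org_start = sentence.index(org)
    annotations = [(0, 13, "Rank"), (org_start, org_start + len(org), "Organization")]
    for token, tag in spans_to_bio(sentence, annotations):
        print(token, tag)
```

If the tokenizer splits text differently from the annotation boundaries (for example, keeping "II." as one token), the gold tags and the predictions stop lining up, which would lower the measured scores even when the model's output looks right.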

The example output looks great. We were positively surprised by how well the model learned Rank and Title, given the small training data set. Presumably this is because there is not much variation in Rank and Title.

One issue is that in some cases parts of the named entities are missed. For example, instead of the organization "Operation Python Dance II", the system only recognized "Operation Python Dance" as an entity. Our plan is to address this issue by adding your list of organizations as additional training data. The DoS data is a lot noisier, so we will not be using that data at this time.
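One simple way to turn a list of known organizations into extra training data is to treat each name as its own short training example with the whole name tagged as a single entity. The sketch below assumes BIO tagging and whitespace tokenization; the function name and the label string are hypothetical, and the actual retraining setup may differ.

```python
from typing import Iterable, List, Tuple


def gazetteer_to_bio(names: Iterable[str], label: str = "Organization") -> List[List[Tuple[str, str]]]:
    """Turn each known name into one training example tagged B-label, I-label, ..."""
    examples = []
    for name in names:
        tokens = name.split()  # whitespace tokenization (assumption)
        if not tokens:
            continue  # skip empty or blank entries
        tags = ["B-" + label] + ["I-" + label] * (len(tokens) - 1)
        examples.append(list(zip(tokens, tags)))
    return examples


if __name__ == "__main__":
    for example in gazetteer_to_bio(["Operation Python Dance II"]):
        print(example)
```

Because every token of the name, including the trailing "II", is tagged as part of the entity, examples like this may help the model learn to include the full name.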

Concrete next steps:

Mythili and Ian will revisit DeepDive one last time.
Ian will use a tokenizer and then retrain his model using the list of known organizations.
Mythili will finish training the NeuroNER model to also predict Roles, Ranks, and Titles using the same data split as Ian. We should then be able to directly compare results between the two models.
Mythili and Ian will add each other to their private GitHub repos.
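For the comparison between the two models, one option is entity-level precision, recall and F1 with exact span matching, computed on the same held-out documents. The sketch below is an assumption about how that could be done, not an agreed evaluation protocol; the tuple layout (document id, start, end, label) and the function name are hypothetical.

```python
from typing import Iterable, Tuple

# (doc_id, start, end, label); a prediction counts only if all four fields match.
Entity = Tuple[int, int, int, str]


def entity_prf(gold: Iterable[Entity], predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall and F1."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up token indices: the gold organization covers tokens 0-4,
    # the prediction stops one token short (the "Operation Python Dance" case).
    gold = [(0, 0, 4, "Organization"), (0, 6, 8, "Rank")]
    predicted = [(0, 0, 3, "Organization"), (0, 6, 8, "Rank")]
    print(entity_prf(gold, predicted))
```

Under exact matching, a partially recognized entity counts as both a missed gold entity and a spurious prediction, so both models should be scored with the same matching rule.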
