CRF-LSTM-NER

A CRF-BiLSTM model for quickly and conveniently benchmarking the performance of different word embeddings on your own corpus.

The objectives of this model are:

  • Build a CRF-BiLSTM network in TensorFlow with methods for easily switching among different word embeddings (Word2vec, GloVe, fastText, ELMo, Flair, and any combination of them) while keeping the CRF-LSTM network itself unchanged.

  • Provide methods for easily grid-searching suitable hyper-parameters.

Requirements

Python 3, TensorFlow 1.0+, Gensim, and Flair (optional):
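For example, a typical installation (the version pins are an assumption; any TensorFlow 1.x release should work):

    pip install "tensorflow>=1.0,<2.0" gensim flair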

How To Use

  1. Modify the paths to the corpus and configure the hyper-parameters accordingly in config.py:
    # embedding sizes
    dim_word = 300
    dim_char = 50

    # LSTM hidden sizes
    hidden_size_char = 64   # char-level LSTM
    hidden_size_lstm = 128  # word-level LSTM on word embeddings

    # dataset
    path_data_root = 'data/CoNLL2003/'
    path_train = path_data_root + 'eng.train'
    path_eval = path_data_root + 'eng.testa'
    path_test = path_data_root + 'eng.testb'
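    The paths above point at the standard CoNLL2003 splits (eng.train for training, eng.testa as the dev set, eng.testb as the test set). A corpus of your own presumably needs the same column format (token, POS tag, chunk tag, NER tag, one token per line, blank lines between sentences), for example:

        U.N. NNP I-NP I-ORG
        official NN I-NP O
        Ekeus NNP I-NP I-PER
        heads VBZ I-VP O
        for IN I-PP O
        Baghdad NNP I-NP I-LOC
        . . O O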
  2. Designate the embedding you want. Since different embeddings come in different file formats, this part may vary slightly according to the embedding you choose; there is an example for each in "How To Use.ipynb":
    # glove
    config = Config('glove')
    glove_file_path = 'data/glove/glove.6B.100d.txt'
    config.init_glove(glove_file_path)

    # fasttext
    config = Config('fasttext')
    command = '../fastText/fasttext'
    bin_file = '../fastText/data/cc.en.300.bin'
    config.init_fasttext(command, bin_file)
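    The other embeddings follow the same pattern: construct a Config and call the matching init method. A minimal sketch for Word2vec, assuming by analogy a method named init_word2vec and a Gensim-loadable vector file (check "How To Use.ipynb" for the exact call):

        # word2vec (hypothetical: method name assumed by analogy with init_glove/init_fasttext)
        config = Config('word2vec')
        w2v_file_path = 'data/word2vec/GoogleNews-vectors-negative300.bin'
        config.init_word2vec(w2v_file_path)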
  3. Parse the corpus and generate the indices and inputs. The following code uses the vocabularies of the embedding and the corpus to build the token/character/label indices and to map each sentence into a sequence of indices. It also handles corpus-specific model configuration, such as the number of distinct labels and the number of unique characters in the corpus.
    # parse the corpus and generate the input data
    token2idx, char2idx, label2idx, lookup_table = get_idx(config)
    train_x, train_y = get_inputs('train', token2idx, char2idx, label2idx, config)
    eval_x, eval_y = get_inputs('eval', token2idx, char2idx, label2idx, config)
    test_x, test_y = get_inputs('test', token2idx, char2idx, label2idx, config)
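    A quick sanity check for this step is to inspect the returned mappings (illustrative only; the exact label set depends on your corpus):

        # token2idx/char2idx/label2idx map each symbol to an integer index
        print(len(token2idx))           # vocabulary size
        print(label2idx)                # e.g. {'O': 0, 'B-PER': 1, ...} for CoNLL2003
        print(train_x[0], train_y[0])   # first sentence mapped to index sequences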
  4. Initialize the model's graph and run train/eval/test:
    # initialize the NER model
    ner_model = Model(config)
    ner_model.build_graph()
    ner_model.initialize_session()
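    Training and evaluation run on the same object. A minimal sketch, assuming the model exposes train and evaluate methods (the method names are assumptions; "How To Use.ipynb" shows the exact API):

        # train on the training split while monitoring the eval split (method names assumed)
        ner_model.train(train_x, train_y, eval_x, eval_y)
        ner_model.evaluate(test_x, test_y)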
  5. Result: the per-label F1 score will be printed, and details of the training process can be found in "./output/log.log".

You can find more details in "How To Use.ipynb".

Reference

This model is based on the following papers:
