Keyword extraction is a critical technique in natural language processing. For this essential task we present a simple yet efficient architecture involving character-level convolutional neural tensor networks. The proposed architecture learns the relations between a document and each word within the document and treats keyword extraction as a supervised binary classification problem. In contrast to traditional supervised approaches, our model learns the distributional vector representations for both documents and words, which directly embeds semantic information and background knowledge without the need for handcrafted features. Most importantly, we model semantics down to the character level to capture morphological information about words, which although ignored in related literature effectively mitigates the unknown word problem in supervised learning approaches for keyword extraction. In the experiments, we compare the proposed model with several state-of-the-art supervised and unsupervised approaches for keyword extraction. Experiments conducted on two datasets attest the effectiveness of the proposed deep learning framework in significantly outperforming several baseline methods.
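As a rough illustration of how a neural tensor layer could relate a document vector to a word vector, the sketch below scores one (document, word) pair with a bilinear tensor product plus a linear term, then squashes the result into a keyword probability. The formulation is the commonly used neural tensor layer and is an assumption here, not code taken from this repository; `tensor_layer`, `keyword_probability`, the toy dimensions, and the random weights (standing in for learned ones) are all hypothetical.

```python
import math
import random


def tensor_layer(d, w, M, V, b):
    """Return tanh(d^T M[k] w + V[k] . [d; w] + b[k]) for each slice k."""
    dw = d + w  # list concatenation, i.e. the stacked vector [d; w]
    out = []
    for M_k, V_k, b_k in zip(M, V, b):
        # Bilinear term: d^T M_k w
        bilinear = sum(
            d[i] * M_k[i][j] * w[j]
            for i in range(len(d))
            for j in range(len(w))
        )
        # Linear term over the concatenated vector
        linear = sum(v * x for v, x in zip(V_k, dw))
        out.append(math.tanh(bilinear + linear + b_k))
    return out


def keyword_probability(d, w, M, V, b, u):
    """Project the tensor-layer output to a scalar and apply a sigmoid."""
    h = tensor_layer(d, w, M, V, b)
    score = sum(u_k * h_k for u_k, h_k in zip(u, h))
    return 1.0 / (1.0 + math.exp(-score))


# Toy setup (hypothetical sizes): 4-dim vectors, 3 tensor slices.
rng = random.Random(0)
dim, slices = 4, 3


def rand():
    return rng.uniform(-0.5, 0.5)


doc_vec = [rand() for _ in range(dim)]
word_vec = [rand() for _ in range(dim)]
M = [[[rand() for _ in range(dim)] for _ in range(dim)] for _ in range(slices)]
V = [[rand() for _ in range(2 * dim)] for _ in range(slices)]
b = [rand() for _ in range(slices)]
u = [rand() for _ in range(slices)]

p = keyword_probability(doc_vec, word_vec, M, V, b, u)
print("keyword probability: %.3f" % p)
```

In a binary classification setup like the one described above, a word would be labeled a keyword when this probability exceeds some threshold; in the actual model the document and word vectors would come from the character-level convolutional layers rather than random numbers.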
- python2.7
- Chainer
- nltk
If using a GPU:
- cupy
We provide two datasets, SemEval2010 and Inspec; the original versions are from https://github.com/snkim/AutomaticKeyphraseExtraction
|--data/
   |--semeval_wo_stem
   |--inspec_wo_stem
   |--test.txt
The datasets we use are from Hulth2003 (Inspec) and SemEval2010, with stopwords removed and stemming applied. The data directory also contains a toy test document, test.txt.
$ git clone https://github.com/cnclabs/CNTN
$ cd ./CNTN
$ pip install -r requirements.txt
The provided model was trained on semeval_wo_stem and is intended for documents that have been stemmed and had stopwords removed.
Data preprocessing follows https://github.com/Tixierae/EMNLP_2016
$ python train.py
$ python predict.py
$ python train.py -h
usage: Training [-h] [--dlen DLEN] [--wlen WLEN] [--wdim WDIM] [--hdim HDIM]
                [--label LABEL] [--flen FLEN] [--channel CHANNEL]
                [--batch BATCH] [--epoch EPOCH] [--gpu GPU]
                [--model MODEL] [--train TRAIN] [--predict PREDICT]
Arguments
optional arguments:
  -h, --help            show this help message and exit
  --dlen, -dl           doc length                           default=300
  --wlen, -wl           word length                          default=30
  --wdim, -wd           word dimension                       default=27
  --hdim, -hd           dimension of hidden layer            default=50
  --flen, -f            filter length                        default=3
  --channel, -c         channel size                         default=50
  --batch, -b           batch size                           default=64
  --epoch, -e           epoch size                           default=50
  --gpu, -g             gpu device(-1=cpu, 0,1,...=gpu)      default=0
  --model, -model       path of model                        default=./model
  --train, -train       path of training data                default=./data/semeval_wo_stem/train.txt
  --predict, -predict   path of predicted data               default=./data/test.txt
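To make the `--wlen` and `--wdim` defaults concrete, here is one way a word might be turned into a fixed-size character matrix: the word is truncated or padded to 30 positions, and each position becomes a 27-dimensional one-hot vector. The alphabet of 26 lowercase letters plus one catch-all slot is a guess that happens to match `wdim=27`; the encoding actually used by `train.py` may differ, and `encode_word` is a hypothetical helper.

```python
import string

WLEN = 30  # matches the --wlen default
WDIM = 27  # matches the --wdim default

# Assumption: indices 0-25 are 'a'-'z'; index 26 catches every other character.
CHAR_INDEX = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
OTHER = WDIM - 1


def encode_word(word, wlen=WLEN, wdim=WDIM):
    """Return a wlen x wdim one-hot matrix; padding rows stay all zeros."""
    matrix = [[0] * wdim for _ in range(wlen)]
    for pos, ch in enumerate(word.lower()[:wlen]):
        matrix[pos][CHAR_INDEX.get(ch, OTHER)] = 1
    return matrix


m = encode_word("keyword")
print("shape: %d x %d" % (len(m), len(m[0])))
print("character indices: %s" % [row.index(1) for row in m[:7]])
```

Because the encoding works on characters rather than a word vocabulary, a word never seen during training still gets a meaningful input matrix, which is the property the abstract relies on to mitigate the unknown word problem.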
@inbook{inbook,
author = {Lin, Zhe-Li and Wang, Chuan-Ju},
year = {2019},
month = {03},
pages = {400-413},
title = {Keyword Extraction with Character-Level Convolutional Neural Tensor Networks},
isbn = {978-3-030-16141-5},
doi = {10.1007/978-3-030-16148-4_31}
}