Any idea to get the performance to 70% #20
I get the sense that it has something to do with fine-tuning the hyperparameters, or maybe they used better pre-trained embeddings... The best result I've gotten so far was around 55% using a generative RNN model plus an embedding layer, though I was hoping for better. I'd be really interested to see if someone can duplicate their results.
I was looking through some of it yesterday and realized my
hi, @codekansas Epoch 49 :: 2016-08-04 00:35:19 :: Train on 16686 samples, validate on 1854 samples
hi, @codekansas @eshijia |
I trained the attention model and printed out some predicted and expected answers, then dumped them in this gist. You guys can decide for yourselves. I'm more or less ready to change datasets; the top-1 precision was still much worse than with the basic embedding model.
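For reference, top-1 precision here just means the fraction of questions for which the highest-scoring candidate answer is a ground-truth answer. A minimal sketch of that metric (the function name is mine, not from the repo):

```python
def top1_precision(scores, labels):
    """scores: per-question lists of candidate scores;
    labels: per-question 0/1 ground-truth lists (same shape)."""
    hits = 0
    for s, l in zip(scores, labels):
        best = max(range(len(s)), key=lambda i: s[i])  # index of top-ranked candidate
        hits += l[best]  # 1 if the top candidate is a correct answer
    return hits / len(scores)

# two questions: top candidate correct in the first, wrong in the second
print(top1_precision([[0.9, 0.2], [0.1, 0.8]], [[1, 0], [0, 1]]))
```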
There is a theano version for this task (and the paper). Its results match the paper's. I haven't read the theano code carefully, but I believe the implementation is different from ours. When I have enough time, I will try to hack on the code to find out whether I can improve it.
Just to report back that I had no luck with my attempt using cosine similarity.
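For anyone trying the same, the cosine variant just swaps in a different similarity function between the pooled question and answer vectors. A hedged numpy sketch (not the repo's exact code):

```python
import numpy as np

def cosine_similarity(q, a, eps=1e-8):
    # cosine of the angle between question and answer vectors;
    # eps guards against division by zero for all-zero vectors
    return np.dot(q, a) / (np.linalg.norm(q) * np.linalg.norm(a) + eps)

q = np.array([1.0, 0.0])
a = np.array([1.0, 1.0])
print(round(cosine_similarity(q, a), 4))  # 45-degree angle -> ~0.7071
```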
Hi, I suggest we try using the V2 data. There is a choice of pool size; I think they may have gotten the 70% by using the smallest pool.
I noticed the two scripts run for 2,000,000 (CNN) and 20,000,000 (LSTM+CNN) batches, so it must have taken a really long time to train. The results I included were after training for only about 30,000 batches.
20,000,000! That does not look realistic for departments without access to a supercomputer. It takes me a day to run 100 epochs with a batch size of 20; I would need 10,000 days to get to 70%...
I have asked the author of the theano version. He told me that it took about 1 day to run 20,000,000 batches on his Tesla GPU, but I don't think it really needs 2,000,000 batches. In addition, he used character-level embeddings.
Wow, I did not realize the Teslas are so fast... I'll just run it for a while on my 980ti, I suppose. Character-level embeddings, though? It looks like regular word embeddings here. I would really like to replicate their result, haha.
Hi, will the code run on the tensorflow backend in its current state? I'm asking because I think I need to run it on multiple GPUs to improve training speed. This thread says that Keras supports multiple GPUs when running with the tensorflow backend but not the theano backend. If it cannot run on the tensorflow backend at the moment, how can I change (hopefully just a couple of lines) to get it to run on tensorflow?
I think the performance really depends on how long you run it. I ran a CNN-LSTM model for ~700 epochs and got a precision of 0.52; I'm going to run it longer to see if it improves.

```python
conf = {
    'question_len': 150,
    'answer_len': 150,
    'n_words': 22353,  # len(vocabulary) + 1
    'margin': 0.05,

    'training_params': {
        'print_answers': False,
        'save_every': 1,
        'batch_size': 100,
        'nb_epoch': 3000,
        'validation_split': 0.1,
        'optimizer': SGD(lr=0.05),  # Adam(clipnorm=1e-2),
    },

    'model_params': {
        'n_embed_dims': 100,
        'n_hidden': 200,

        # convolution
        'nb_filters': 500,  # * 4
        'conv_activation': 'tanh',

        # recurrent
        'n_lstm_dims': 141,  # * 2

        'initial_embed_weights': np.load('models/word2vec_100_dim.h5'),
        'similarity_dropout': 0.25,
    },

    'similarity_params': {
        'mode': 'gesd',
        'gamma': 1,
        'c': 1,
        'd': 2,
    }
}

evaluator = Evaluator(conf)

##### Define model ######
model = CNNLSTM(conf)
optimizer = conf.get('training_params', dict()).get('optimizer', 'rmsprop')
model.compile(optimizer=optimizer)

# train the model
best_loss = evaluator.train(model)
evaluator.load_epoch(model, best_loss['epoch'])
evaluator.get_score(model, evaluate_all=True)
```

```python
class CNNLSTM(LanguageModel):
    def build(self):
        question = self.question
        answer = self.get_answer()

        # add embedding layers
        weights = self.model_params.get('initial_embed_weights', None)
        weights = weights if weights is None else [weights]
        embedding = Embedding(input_dim=self.config['n_words'],
                              output_dim=self.model_params.get('n_embed_dims', 100),
                              weights=weights,
                              # mask_zero=True)
                              mask_zero=False)
        question_embedding = embedding(question)
        answer_embedding = embedding(answer)

        f_rnn = LSTM(self.model_params.get('n_lstm_dims', 141), return_sequences=True, consume_less='mem')
        b_rnn = LSTM(self.model_params.get('n_lstm_dims', 141), return_sequences=True, consume_less='mem')

        qf_rnn = f_rnn(question_embedding)
        qb_rnn = b_rnn(question_embedding)
        question_pool = merge([qf_rnn, qb_rnn], mode='concat', concat_axis=-1)

        af_rnn = f_rnn(answer_embedding)
        ab_rnn = b_rnn(answer_embedding)
        answer_pool = merge([af_rnn, ab_rnn], mode='concat', concat_axis=-1)

        # cnn
        cnns = [Convolution1D(filter_length=filter_length,
                              nb_filter=self.model_params.get('nb_filters', 500),
                              activation=self.model_params.get('conv_activation', 'tanh'),
                              # W_regularizer=regularizers.l1(1e-4),
                              # activity_regularizer=regularizers.activity_l1(1e-4),
                              border_mode='same') for filter_length in [1, 2, 3, 5]]
        question_cnn = merge([cnn(question_pool) for cnn in cnns], mode='concat')
        answer_cnn = merge([cnn(answer_pool) for cnn in cnns], mode='concat')

        maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False),
                         output_shape=lambda x: (x[0], x[2]))
        question_pool = maxpool(question_cnn)
        answer_pool = maxpool(answer_cnn)

        return question_pool, answer_pool
```
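For what it's worth, the 'gesd' mode in similarity_params refers to the GESD metric from the insuranceQA paper: the product of a Euclidean term and a sigmoid of the dot product. A numpy sketch of my understanding of it, using the gamma and c values above (how the repo wires in these parameters is my assumption):

```python
import numpy as np

def gesd(q, a, gamma=1.0, c=1.0):
    # Euclidean part: close vectors -> value near 1
    euclidean = 1.0 / (1.0 + np.linalg.norm(q - a))
    # sigmoid part: large dot product -> value near 1
    sigmoid = 1.0 / (1.0 + np.exp(-gamma * (np.dot(q, a) + c)))
    return euclidean * sigmoid

# identical vectors: Euclidean term is exactly 1, so only the sigmoid remains
q = np.zeros(3)
print(gesd(q, q))
```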
Ended up with
after training for about 4-5 days on my 980ti. I can see how after enough iterations you could get up to ~60-70%, but my GPU would take way too long... |
Sounds great! I would like to follow your training progress. The duration of one epoch with the CNNLSTM model is 490s for me, so it will take about 17 days to complete 3000 epochs. My GPU is a Tesla K20c. By the way, I think another important thing is to make the code fit the latest Keras version :)
@eshijia |
17 days seems slow for that GPU? I wonder if it is slow for some reason (maybe it's running on the CPU instead of the GPU?). But 3000 epochs * 16686 samples per epoch is 50,058,000 samples, whereas the other script ran 20,000,000 batches * 128, or 2,560,000,000 samples. On my GPU (980ti) it will take ~6.4 days to train 3000 epochs; it would take nearly a year to train on as many samples as their model used. Also, I found a big difference in training while using different optimizers. I think the Adadelta optimizer works well; RMSprop was overfitting a lot.
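The sample-count arithmetic above can be checked directly:

```python
# my run: 3000 epochs over the 16686-sample training set
epochs, samples_per_epoch = 3000, 16686
my_total = epochs * samples_per_epoch
print(my_total)  # 50,058,000 samples

# their run: 20,000,000 batches of 128
their_batches, batch_size = 20_000_000, 128
their_total = their_batches * batch_size
print(their_total)  # 2,560,000,000 samples

print(their_total // my_total)  # they trained on roughly 51x more samples
```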
It is really running on the GPU. There are four GPU devices (K20c) in my server, and each of them always runs different tasks at the same time. I can see that the GPU utilization of the device running this task is 96% with the command
I think my Tesla GPU is really old; its configuration is not up to a 980ti.
@wailoktam Could you share how you changed the training part to make sure the bad answers are really bad answers?
My pleasure.
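Since the shared change didn't survive in this thread, the general idea (my reconstruction, not @wailoktam's actual code) is to re-draw each randomly sampled negative answer whenever it collides with one of the question's ground-truth answers:

```python
import random

def sample_bad_answer(all_answer_ids, good_ids, rng=random):
    """Pick a random answer id that is NOT a known good answer for this question."""
    bad = rng.choice(all_answer_ids)
    while bad in good_ids:  # re-draw on collision with a ground-truth answer
        bad = rng.choice(all_answer_ids)
    return bad

random.seed(0)
bad = sample_bad_answer(list(range(10)), {0, 1, 2})
print(bad)  # guaranteed not to be 0, 1, or 2
```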
I think I can also share the version-2 insurance data and the Japanese wiki data, which I have structured to be used with this great work of codekansas. However, I am running them without pretrained word2vec weights, because the program complains about the mismatched vocabulary sizes. As you guys can tell, without the pretrained weights it will take even longer to get the 70% claimed.
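The vocabulary-size complaint happens because the Embedding layer's weight matrix must be exactly (n_words, n_embed_dims). One common workaround (a sketch under my own assumptions, not code from this repo) is to rebuild the weight matrix from the pretrained vectors, initializing words missing from the pretrained model randomly:

```python
import numpy as np

def build_embedding_weights(vocab, pretrained, dim=100, seed=0):
    """vocab: word -> row index (1-based); pretrained: word -> vector (may be missing words)."""
    rng = np.random.RandomState(seed)
    # small random init for every row, including row 0 reserved for padding
    weights = rng.uniform(-0.05, 0.05, size=(len(vocab) + 1, dim))
    for word, idx in vocab.items():
        if word in pretrained:
            weights[idx] = pretrained[word]  # copy the pretrained vector where available
    return weights

vocab = {'insurance': 1, 'claim': 2}
pretrained = {'insurance': np.ones(100)}  # 'claim' stays randomly initialized
w = build_embedding_weights(vocab, pretrained)
print(w.shape)  # (3, 100)
```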
I have tried training the CNNLSTM model for about 3000 epochs, and the loss is stable at about 0.0013. The test results are just the same as @codekansas mentioned above.
Hi, I mean without doing anything not in the paper (dos Santos 2016).
I am mentioning 70% because it is what the author of this paper reported for the LSTM + attention model on the insuranceQA data. I get 40-something, like codekansas. Can I be confident in blaming dos Santos for faking the result?