This repository has been archived by the owner on Jan 30, 2021. It is now read-only.

OOM error on colab TPU when pretraining XLNet #6

Open
cxa-unique opened this issue Jul 27, 2019 · 1 comment

Comments

@cxa-unique

Sorry to bother you here with a question about XLNet pretraining.

I saw your comment in the XLNet issues; you hit the same error: Error recorded from outfeed: Bad hardware status: 0x1, on a Colab TPU. I am now trying to pretrain XLNet on a Colab TPU and am running into the same problem. I have also tried a minimal batch_size=16, but I still get the error. Have you solved the problem, and are you able to pretrain XLNet on a Colab TPU now?
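For context, the TPU is reached in the standard Colab way before the pretraining run is launched; a minimal sketch of that setup (illustrative only, not copied from my actual notebook, and nothing here is XLNet-specific):

```python
import os
import tensorflow as tf  # TF 1.x, as provided on Colab at the time

# Colab exposes the TPU worker address via this environment variable.
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']

# Quick sanity check that the TPU is reachable and its cores are visible;
# the OOM error itself shows up later, during the actual pretraining loop.
with tf.Session(tpu_address) as sess:
    devices = sess.list_devices()
print([d.name for d in devices if 'TPU' in d.name])
```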

Thanks!

@rusiaaman
Owner

No, sorry, I haven't been able to solve the problem. What I did find is that it is not a TensorFlow version issue, because the same OOM errors occur with tensorflow "1.14.1.dev20190518" on a TPUv2 on Google Cloud. I have tried reducing the sequence length to an absurd minimum and the batch size to 8, but with no luck.

I think the problem is caused by some code in the custom TPU estimator, but I'm not sure what exactly. The OOM errors suggest that the padding allocation has blown up, but I can't say why that is the case.
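For anyone who wants to dig further, the general shape of a TPUEstimator training setup is sketched below. This is simplified and illustrative only, not the actual custom estimator code in this repo; `model_fn`, `input_fn`, and the GCS path are placeholders:

```python
import os
import tensorflow as tf  # 1.14.x, tf.contrib still available

def model_fn(features, labels, mode, params):
    # Placeholder: the real XLNet forward pass + loss lives in the
    # custom TPU estimator code this issue is about.
    raise NotImplementedError

def input_fn(params):
    # TPUEstimator passes the per-core batch size as params['batch_size'].
    # Every tensor needs a fully static shape; XLA pads dynamic or
    # oddly-sized dimensions, which is typically where a padding
    # blow-up would show up in the memory report.
    raise NotImplementedError

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])

run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my-bucket/xlnet-pretrain',   # hypothetical GCS path
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=100,
        num_shards=8))                           # 8 cores on a v2-8

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=8)   # global batch size; 1 example per core here

estimator.train(input_fn=input_fn, max_steps=1000)
```

The batch size of 8 in the sketch matches the smallest configuration I tried; it still OOMs, which is why I suspect the estimator-side code rather than the model size.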
