This repository has been archived by the owner on Jan 30, 2021. It is now read-only.

OOM error on colab TPU when pretraining XLNet #6

Open
cxa-unique opened this issue Jul 27, 2019 · 1 comment

Comments

@cxa-unique

Sorry to bother you here with a question about XLNet pretraining.

I saw your comment in the XLNet issues; you hit the same error: Error recorded from outfeed: Bad hardware status: 0x1, on a Colab TPU. I am now trying to pretrain XLNet on a Colab TPU and am running into the same problem. I have also tried a minimal batch_size=16, but I still get the error. Have you solved the problem, and are you able to pretrain XLNet on a Colab TPU now?
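For context, the TPU is reached in the standard Colab way before the pretraining run is launched; a minimal sketch of that setup (illustrative only, not copied from my actual notebook, and nothing here is XLNet-specific):

```python
import os
import tensorflow as tf  # TF 1.x, as provided on Colab at the time

# Colab exposes the TPU worker address via this environment variable.
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']

# Quick sanity check that the TPU is reachable and its cores are visible;
# the OOM error itself shows up later, during the actual pretraining loop.
with tf.Session(tpu_address) as sess:
    devices = sess.list_devices()
print([d.name for d in devices if 'TPU' in d.name])
```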

Thanks!

@rusiaaman
Owner

No, sorry, I haven't been able to solve the problem. What I did find is that it is not a TensorFlow version issue, because the same OOM errors occur with tensorflow "1.14.1.dev20190518" on a TPUv2 on Google Cloud. I have tried reducing the sequence length to an absurd minimum and the batch size to 8, but with no luck.

I think the problem is caused by some code in the custom TPU estimator, but I'm not sure what exactly. The OOM errors suggest that the padding allocation has blown up, but I can't say why that is the case.
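For anyone who wants to dig further, the general shape of a TPUEstimator training setup is sketched below. This is simplified and illustrative only, not the actual custom estimator code in this repo; `model_fn`, `input_fn`, and the GCS path are placeholders:

```python
import os
import tensorflow as tf  # 1.14.x, tf.contrib still available

def model_fn(features, labels, mode, params):
    # Placeholder: the real XLNet forward pass + loss lives in the
    # custom TPU estimator code this issue is about.
    raise NotImplementedError

def input_fn(params):
    # TPUEstimator passes the per-core batch size as params['batch_size'].
    # Every tensor needs a fully static shape; XLA pads dynamic or
    # oddly-sized dimensions, which is typically where a padding
    # blow-up would show up in the memory report.
    raise NotImplementedError

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])

run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my-bucket/xlnet-pretrain',   # hypothetical GCS path
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=100,
        num_shards=8))                           # 8 cores on a v2-8

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=8)   # global batch size; 1 example per core here

estimator.train(input_fn=input_fn, max_steps=1000)
```

The batch size of 8 in the sketch matches the smallest configuration I tried; it still OOMs, which is why I suspect the estimator-side code rather than the model size.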
