This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

In Colab using 'x10_training_loop' leads to "error: Couldn't lookup symbols:" #1016

Open
mikowals opened this issue Jun 17, 2020 · 6 comments

@mikowals
Contributor

Opening a blank Colab notebook, which I expect is running S4TF v0.9, and entering the following:

import TensorFlow
import x10_training_loop 
Device.trainingDevices // Same error for runOnThreads(), HostStatistics(), ... etc.

Gives the error:

error: Couldn't lookup symbols:
      static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>
      static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>

Code completion works correctly after import x10_training_loop has been run once, so I think the import line for 'x10_training_loop' is correct. The error appears to occur for everything included in 'x10_training_loop', but I didn't try every symbol.

@mikowals
Contributor Author

I should have added that importing 'x10_training_loop' works as expected with the macOS v0.9 toolchain and with the June 12 toolchain for macOS. So I think the problem is specific to either Colab or the Linux toolchain.

@8bitmp3
Contributor

8bitmp3 commented Jun 18, 2020

@mikowals I noticed that examples in the swift-models repo, such as BERT-CoLA and others, run their calculations on an accelerator via XLA on the X10 backend with:

...
let device = Device.defaultXLA
...

Eager mode is selected with let device = Device.defaultTFEager.
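
As a rough illustration of the two device choices, here is a minimal sketch. It assumes a Swift for TensorFlow toolchain with X10 support; the tensor values and variable names are purely illustrative, and it deliberately avoids anything from 'x10_training_loop', so it does not exercise the failing symbols from this issue.

```swift
import TensorFlow

// Choose a backend explicitly: X10 (XLA, lazily traced) or eager execution.
let xlaDevice = Device.defaultXLA
let eagerDevice = Device.defaultTFEager

// Illustrative values only: place a small tensor on the XLA device.
let x = Tensor<Float>([1, 2, 3], on: xlaDevice)
let y = x * 2

// X10 records operations lazily; the barrier cuts the trace so the
// pending work is compiled and dispatched to the device.
LazyTensorBarrier()

// Copy the result to the eager device and inspect where each tensor lives.
let yEager = Tensor(copying: y, to: eagerDevice)
print(y.device)
print(yEager.device)
```

Swapping xlaDevice for eagerDevice in the tensor initializer would run the same arithmetic eagerly, with no barrier needed.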

So, if you run Device.defaultXLA instead, it should complete without error and recognize that you have a Colab GPU/TPU with an XLA backend:

import TensorFlow
import x10_training_loop 

Device.defaultXLA

Output:

...
▿ Device(kind: .GPU, ordinal: 0, backend: .XLA)
  - kind : TensorFlow.Device.Kind.GPU
  - ordinal : 0
  - backend : TensorFlow.Device.Backend.XLA

However, as you already mentioned, Device.trainingDevices will return:

error: Couldn't lookup symbols:
  static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>
  static (extension in x10_training_loop):TensorFlow.Device.trainingDevices.getter : Swift.Array<TensorFlow.Device>

(I'm sure @BradLarson @saeta and others can explain this better.)

@mikowals
Contributor Author

Hi @8bitmp3. Yes, my reduced example with Device.trainingDevices can be worked around, or I can train in X10 using other code. I demonstrated the problem with .trainingDevices only because it was the first line in my code that referenced the 'x10_training_loop' module and showed an error with that module.

I created the issue because I think it shows that something is going wrong in the building or use of the toolchain in Colab, which leads to a problem using 'x10_training_loop'.

@texasmichelle
Member

Thank you for reporting! I can reproduce this in the latest Colab build; it does not occur in the Linux nightlies. I suspect there is something in CMake that's not happening in Colab, so I'll hunt down this code and identify a fix.

@texasmichelle
Member

We narrowed this down to a problem with the static linking of the CX10 library. This appears to work fine when used with the swift binary, but Colab uses a different dynamic execution approach. @compnerd has a few ideas about how to resolve this for all platforms.

@philipturner

For the sake of documenting bugs:

#1177 (comment) - similar problem, but not the same.
