[GPU] CUDA error: when some clients use CPU and others use GPU, CUDA tensors fail to deserialize on the CPU-only clients #779

Open
@chaoyanghe

Description

https://open.fedml.ai/octopus/collaboratorGroup/runDetail?projectId=1591&groupId=2751&runId=5249

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR]

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR] device = validate_cuda_device(location)

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR]

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR] File "C:\Users\chaoy\anaconda3\envs\fedml\lib\site-packages\torch\serialization.py", line 166, in validate_cuda_device

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR]

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR] raise RuntimeError('Attempting to deserialize object on a CUDA '

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR]

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR] RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [ERROR]

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [INFO] [__init__.py:312:log_training_failed_status] log training inner status FAILED

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [INFO]

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [INFO] log training inner status FAILED

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [INFO]

[FedML-Client @device-id-11226] [Wed, 15 Feb 2023 16:07:19] [INFO] [mlops_metrics.py:160:common_broadcast_client_training_status] report_client_training_status. message_json = {"edge_id": 11226, "run_id": "5249", "status": "FAILED"}
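
The traceback above suggests the payload was serialized with CUDA tensors and then deserialized on a client where torch.cuda.is_available() is False. Below is a minimal sketch of the workaround the error message itself recommends (torch.load with map_location), plus a sender-side variant that moves tensors to CPU before saving. The helper names pick_map_location, load_portable and to_cpu_state_dict are hypothetical, and where FedML actually serializes and deserializes model parameters is not shown in this log, so treat this as an illustration rather than a patch.

```python
import io

import torch


def pick_map_location():
    # Hypothetical helper: use CUDA only if this machine actually has it.
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


def load_portable(source):
    # Receiver side: remap storages to a device that exists locally,
    # as the RuntimeError message suggests.
    return torch.load(source, map_location=pick_map_location())


def to_cpu_state_dict(state_dict):
    # Sender side: strip the CUDA device tag before serializing, so that
    # any receiver (CPU-only or GPU) can deserialize the payload.
    return {
        name: value.detach().cpu() if torch.is_tensor(value) else value
        for name, value in state_dict.items()
    }


# Usage: round-trip a small model through an in-memory buffer.
model = torch.nn.Linear(2, 1)
buffer = io.BytesIO()
torch.save(to_cpu_state_dict(model.state_dict()), buffer)
buffer.seek(0)
restored = load_portable(buffer)
```

Either side alone should avoid this error in a mixed CPU/GPU group; doing both keeps the payload device-neutral and lets each client place the tensors on whatever device it trains on.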

Labels

bug: Something isn't working