RuntimeError: No backend type associated with device type npu #49998

Open
CurtainRight opened this issue Jan 22, 2025 · 0 comments
Labels: bug, triage

What happened + What you expected to happen

Error message:
2025-01-22 03:15:30,756 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_10b4a_00000
Traceback (most recent call last):
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/_private/worker.py", line 906, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=241781, ip=172.17.0.6, actor_id=f53b89fce763c01fc470097d01000000, repr=TorchTrainer)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=242146, ip=172.17.0.6, actor_id=0f87aeb59bca498728b6a1e001000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0xfffecc1a6eb0>)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
train_func(*args, **kwargs)
File "/data/holly/hl-ascend-model/ray/train1.py", line 97, in train_func
trainer.train()
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 2171, in train
return inner_training_loop(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 2330, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/accelerator.py", line 1339, in prepare
result = tuple(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/accelerator.py", line 1340, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/accelerator.py", line 1469, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 795, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: No backend type associated with device type npu

Training errored after 0 iterations at 2025-01-22 03:15:30. Total running time: 30s
Error file: /tmp/ray/session_2025-01-22_03-14-54_357147_230295/artifacts/2025-01-22_03-15-00/TorchTrainer_2025-01-22_03-15-00/driver_artifacts/TorchTrainer_10b4a_00000_0_2025-01-22_03-15-00/error.txt
2025-01-22 03:15:30,771 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/root/ray_results/TorchTrainer_2025-01-22_03-15-00' in 0.0044s.

2025-01-22 03:15:30,773 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_10b4a_00000]
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=241781, ip=172.17.0.6, actor_id=f53b89fce763c01fc470097d01000000, repr=TorchTrainer)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=242146, ip=172.17.0.6, actor_id=0f87aeb59bca498728b6a1e001000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0xfffecc1a6eb0>)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
train_func(*args, **kwargs)
File "/data/holly/hl-ascend-model/ray/train1.py", line 97, in train_func
trainer.train()
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 2171, in train
return inner_training_loop(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 2330, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/accelerator.py", line 1339, in prepare
result = tuple(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/accelerator.py", line 1340, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/accelerator.py", line 1469, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 795, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: No backend type associated with device type npu

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/holly/hl-ascend-model/ray/train1.py", line 126, in
result: ray.train.Result = ray_trainer.fit()
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/ray/train/base_trainer.py", line 638, in fit
raise TrainingFailedError(
ray.train.base_trainer.TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: trainer = TorchTrainer.restore("/root/ray_results/TorchTrainer_2025-01-22_03-15-00").
To start a new run that will retry on training failures, set train.RunConfig(failure_config=train.FailureConfig(max_failures)) in the Trainer's run_config with max_failures > 0, or max_failures = -1 for unlimited retries.
(RayTrainWorker pid=242146) Process ForkServerProcess-2:
(RayTrainWorker pid=242146) Process ForkServerProcess-8:
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 65, in wrapper
(RayTrainWorker pid=242146) raise exp
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 62, in wrapper
(RayTrainWorker pid=242146) func(*args, **kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 258, in task_distribute
(RayTrainWorker pid=242146) resource_proxy[SUB_PROCESS_STATE].append(True)
(RayTrainWorker pid=242146) File "", line 2, in append
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmetho
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 65, in wrapper
(RayTrainWorker pid=242146) raise exp
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 62, in wrapper
(RayTrainWorker pid=242146) func(*args, **kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 258, in task_distribute
(RayTrainWorker pid=242146) resource_proxy[SUB_PROCESS_STATE].append(True)
(RayTrainWorker pid=242146) File "", line 2, in append
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
(RayTrainWorker pid=242146) Process ForkServerProcess-10:
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/common.py", line 103, in daemon_process
(RayTrainWorker pid=242146) running.set()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 1081, in set
(RayTrainWorker pid=242146) return self._callmethod('set')
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
(RayTrainWorker pid=242146) Process ForkServerProcess-9:
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 65, in wrapper
(RayTrainWorker pid=242146) raise exp
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 62, in wrapper
(RayTrainWorker pid=242146) func(*args, **kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 258, in task_distribute
(RayTrainWorker pid=242146) resource_proxy[SUB_PROCESS_STATE].append(True)
(RayTrainWorker pid=242146) File "", line 2, in append
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
(RayTrainWorker pid=242146) Process ForkServerProcess-7:
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 65, in wrapper
(RayTrainWorker pid=242146) raise exp
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 62, in wrapper
(RayTrainWorker pid=242146) func(*args, **kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 258, in task_distribute
(RayTrainWorker pid=242146) resource_proxy[SUB_PROCESS_STATE].append(True)
(RayTrainWorker pid=242146) File "", line 2, in append
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
(RayTrainWorker pid=242146) Process ForkServerProcess-6:
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 65, in wrapper
(RayTrainWorker pid=242146) raise exp
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 62, in wrapper
(RayTrainWorker pid=242146) func(*args, **kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 258, in task_distribute
(RayTrainWorker pid=242146) resource_proxy[SUB_PROCESS_STATE].append(True)
(RayTrainWorker pid=242146) File "", line 2, in append
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
(RayTrainWorker pid=242146) Process ForkServerProcess-5:
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 65, in wrapper
(RayTrainWorker pid=242146) raise exp
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 62, in wrapper
(RayTrainWorker pid=242146) func(*args, **kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 258, in task_distribute
(RayTrainWorker pid=242146) resource_proxy[SUB_PROCESS_STATE].append(True)
(RayTrainWorker pid=242146) File "", line 2, in append
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
(RayTrainWorker pid=242146) Process ForkServerProcess-3:
(RayTrainWorker pid=242146) Process ForkServerProcess-4:
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 65, in wrapper
(RayTrainWorker pid=242146) raise exp
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 62, in wrapper
(RayTrainWorker pid=242146) func(*args, **kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 258, in task_distribute
(RayTrainWorker pid=242146) resource_proxy[SUB_PROCESS_STATE].append(True)
(RayTrainWorker pid=242146) File "", line 2, in append
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
(RayTrainWorker pid=242146) conn = self._tls.connection
(RayTrainWorker pid=242146) AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) During handling of the above exception, another exception occurred:
(RayTrainWorker pid=242146)
(RayTrainWorker pid=242146) Traceback (most recent call last):
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
(RayTrainWorker pid=242146) self.run()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/process.py", line 108, in run
(RayTrainWorker pid=242146) self._target(*self._args, **self._kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 65, in wrapper
(RayTrainWorker pid=242146) raise exp
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 62, in wrapper
(RayTrainWorker pid=242146) func(*args, **kwargs)
(RayTrainWorker pid=242146) File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 258, in task_distribute
(RayTrainWorker pid=242146) resource_proxy[SUB_PROCESS_STATE].append(True)
(RayTrainWorker pid=242146) File "", line 2, in append
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
(RayTrainWorker pid=242146) self._connect()
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
(RayTrainWorker pid=242146) conn = self._Client(self._token.address, authkey=self._authkey)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 502, in Client
(RayTrainWorker pid=242146) c = SocketClient(address)
(RayTrainWorker pid=242146) File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
(RayTrainWorker pid=242146) s.connect(address)
(RayTrainWorker pid=242146) ConnectionRefusedError: [Errno 111] Connection refused
[ERROR] 2025-01-22-03:15:33 (PID:230295, Device:-1, RankID:-1) ERR99999 UNKNOWN application exception

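The failure happens when accelerate wraps the model in DistributedDataParallel: the process group that Ray Train set up for the worker apparently has no backend registered for the `npu` device type. A small diagnostic that could be dropped into `train_func` to confirm which backend the worker's default process group was initialized with (just a sketch, assuming `torch_npu` is importable inside the Ray Train worker):

```python
import torch
import torch.distributed as dist
import torch_npu  # registers the npu device (and the HCCL backend) with torch


def report_dist_backend():
    """Print which distributed backend this Ray Train worker is using."""
    print("npu available:", torch.npu.is_available())
    if dist.is_initialized():
        # TorchTrainer initializes the process group before train_func runs,
        # so this shows the backend DDP will use (expected: "hccl" on NPU).
        print("world size:", dist.get_world_size())
        print("default backend:", dist.get_backend())
    else:
        print("torch.distributed is not initialized in this process")
```
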
Versions / Dependencies

- CANN version: 8.0.RC1
- PyTorch version: 2.1.0
- torch_npu: 2.1.0.post6
- Python version: 3.9.21
- Training card (NPU): 910B2
- transformers: 4.48.1
- ray: 2.40.0

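A quick way to double-check that these are the versions actually loaded in the environment (a small verification sketch; `torch_npu.__version__` is assumed to be exposed by this torch_npu build):

```python
import ray
import torch
import torch_npu  # Ascend NPU plugin for PyTorch
import transformers

print("torch:", torch.__version__)                 # expected: 2.1.0
print("torch_npu:", torch_npu.__version__)         # expected: 2.1.0.post6 (assumed attribute)
print("transformers:", transformers.__version__)   # expected: 4.48.1
print("ray:", ray.__version__)                     # expected: 2.40.0
print("NPU available:", torch.npu.is_available())
```
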
Reproduction script

```python
import os
os.environ['ASCEND_RT_VISIBLE_DEVICES'] = '3'

import numpy as np
import evaluate
from datasets import load_dataset, Dataset
from transformers import (
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    AutoModelForSequenceClassification,
)

import ray.train.huggingface.transformers
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
ray.init(resources={"NPU": 1})

import torch_npu

import pandas as pd


# [1] Encapsulate data preprocessing, training, and evaluation
# logic in a training function
# ============================================================
def train_func():
    # Datasets
    model_path = '/data/holly/hl-ascend-model/model/bert-base-chinese'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    dataset = 'ray-dataset/test.csv'
    dataset_name = dataset.split('/')[-1]
    dataset_path = os.path.join('/data/holly/hl-ascend-model/datasets', dataset_name)
    # Label processing
    df = pd.read_csv(dataset_path)
    df, num_labels, label_to_id, id_to_label = convert_label(df)
    # df.to_csv(dataset_path, index=False)
    # Only CSV is supported for now
    raw_datasets = Dataset.from_pandas(df)

    def tokenize_function(examples):
        result = tokenizer(examples["text"], padding="max_length", truncation=True)
        result["label"] = [int(l) for l in examples["label"]]
        return result

    # small_train_dataset = (
    #     raw_datasets["train"].select(range(600)).map(tokenize_function, batched=True)
    # )
    # small_eval_dataset = (
    #     raw_datasets["test"].select(range(200)).map(tokenize_function, batched=True)
    # )

    small_train_dataset = raw_datasets.map(tokenize_function, batched=True)
    # print(type(dataset_train['label'][0]))
    small_eval_dataset = raw_datasets.map(tokenize_function, batched=True)
    small_train_dataset = small_train_dataset.remove_columns(['text'])

    # Model
    model = AutoModelForSequenceClassification.from_pretrained(
        model_path, num_labels=num_labels, ignore_mismatched_sizes=True
    )

    # Evaluation Metrics
    # metric = evaluate.load("accuracy")
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return {"f1": 1}

    # Hugging Face Trainer
    training_args = TrainingArguments(
        output_dir="test_trainer",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )

    # [2] Report Metrics and Checkpoints to Ray Train
    # ===============================================
    callback = ray.train.huggingface.transformers.RayTrainReportCallback()
    trainer.add_callback(callback)

    # [3] Prepare Transformers Trainer
    # ================================
    trainer = ray.train.huggingface.transformers.prepare_trainer(trainer)

    # Start Training
    trainer.train()


def convert_label(df):
    # Create label-to-ID and ID-to-label mappings
    label_to_id = {}
    id_to_label = {}
    id_counter = 0
    # Iterate over all labels and build the mappings
    for label in df['label'].unique():
        if label not in label_to_id:
            label_to_id[label] = str(id_counter)
            id_to_label[str(id_counter)] = label
            id_counter += 1

    # Replace the labels in the DataFrame with their IDs
    df['label'] = df['label'].map(label_to_id)
    return df, id_counter, label_to_id, id_to_label


# [4] Define a Ray TorchTrainer to launch train_func on all workers
# ==================================================================
ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=1, resources_per_worker={"NPU": 1}),
    # [4a] If running in a multi-node cluster, this is where you
    # should configure the run's persistent storage that is accessible
    # across all worker nodes.
    # run_config=ray.train.RunConfig(storage_path="s3://..."),
)
result: ray.train.Result = ray_trainer.fit()
```

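One workaround that may be worth trying (an untested sketch, not a confirmed fix): Ray Train exposes the process-group backend through `ray.train.torch.TorchConfig`. Passing `backend="hccl"` would make `TorchTrainer` initialize the worker's process group with the backend that `torch_npu` registers for NPU devices, instead of the default NCCL/Gloo selection:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

# Same trainer as in the reproduction script above, but with an explicit
# HCCL backend so the process group matches the npu device type.
ray_trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="hccl"),
    scaling_config=ScalingConfig(num_workers=1, resources_per_worker={"NPU": 1}),
)
result = ray_trainer.fit()
```
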
Issue Severity

None
