Hi, I'm running the HTTP server according to the instructions. However, creation always fails at the "Scheduler found, paralleling scheduler..." step, regardless of the number of GPUs I use. Could you help me fix it?
WARNING 01-22 09:31:56 [envs.py:82] Flash Attention library "flash_attn" not found, using pytorch attention implementation
WARNING 01-22 09:31:57 [args.py:358] Distributed environment is not initialized. Initializing...
DEBUG 01-22 09:31:57 [parallel_state.py:200] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
[W122 09:31:57.403391764 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W122 09:31:57.403416352 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
INFO 01-22 09:31:57 [config.py:164] Pipeline patch number not set, using default value 1
[Rank 0] 2025-01-22 09:31:57 - INFO - Initializing model on GPU: 0
Loading pipeline components...: 14%|███████████████████████████████▏ | 1/7 [00:00<00:01, 5.76it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.49it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:03<00:00, 2.15it/s]
WARNING 01-22 09:32:01 [runtime_state.py:64] Model parallel is not initialized, initializing...
INFO 01-22 09:32:01 [base_pipeline.py:367] Transformer backbone found, but model parallelism is not enabled, use naive model
INFO 01-22 09:32:02 [base_pipeline.py:423] Scheduler found, paralleling scheduler...
0%| | 0/3 [00:00<?, ?it/s]
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
E0122 09:33:05.030219 823834 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 823906) of binary: /mnt/share/ai_studio/image_generation/.venv-xdit/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 923, in <module>
main()
File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-22_09:33:05
host : computeinstance-e00ey7wmsf25d9crvz
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 823906)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 823906
============================================================
['/mnt/share/ai_studio/image_generation/.venv-xdit/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node=1', '/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py', '--model=black-forest-labs/FLUX.1-schnell', '--pipefusion_parallel_degree=1', '--ulysses_degree=1', '--ring_degree=1', '--height=1024', '--width=1024', '--max_queue_size=4', '--use_torch_compile']
Traceback (most recent call last):
File "/mnt/share/ai_studio/image_generation/xDiT/./http-service/launch_host.py", line 54, in <module>
main()
File "/mnt/share/ai_studio/image_generation/xDiT/./http-service/launch_host.py", line 50, in main
subprocess.run(cmd, check=True)
File "/usr/lib/python3.11/subprocess.py", line 569, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/mnt/share/ai_studio/image_generation/.venv-xdit/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node=1', '/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py', '--model=black-forest-labs/FLUX.1-schnell', '--pipefusion_parallel_degree=1', '--ulysses_degree=1', '--ring_degree=1', '--height=1024', '--width=1024', '--max_queue_size=4', '--use_torch_compile']' returned non-zero exit status 1.
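Exit code -11 means the worker received SIGSEGV, i.e. it crashed in native code, so the Python tracebacks above only show the launcher, not the crash site. One way to get more signal is to relaunch with `faulthandler` and NCCL debugging enabled, and without `--use_torch_compile`, to rule out the compile path. Below is a minimal sketch of such a relaunch, assuming the same arguments as the command echoed in the log (paths shortened for illustration; `PYTHONFAULTHANDLER` and `NCCL_DEBUG` are standard CPython/NCCL knobs, and the `TORCH_NCCL_*` names come from the deprecation warnings earlier in the log):

```python
import os
import subprocess
import sys

env = dict(os.environ)
env["PYTHONFAULTHANDLER"] = "1"  # dump the Python stack if the worker hits SIGSEGV
env["NCCL_DEBUG"] = "INFO"       # verbose NCCL logging to localize a native crash
# Use the non-deprecated names flagged in the warnings above.
env["TORCH_NCCL_BLOCKING_WAIT"] = "1"
env["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"

cmd = [
    sys.executable, "-m", "torch.distributed.run",
    "--nproc_per_node=1",
    "xDiT/http-service/host.py",
    "--model=black-forest-labs/FLUX.1-schnell",
    "--pipefusion_parallel_degree=1",
    "--ulysses_degree=1",
    "--ring_degree=1",
    "--height=1024", "--width=1024",
    "--max_queue_size=4",
    # --use_torch_compile deliberately omitted to test the compile path
]
subprocess.run(cmd, check=True, env=env)
```

If the segfault disappears without `--use_torch_compile`, the crash is likely in the torch.compile path rather than in the scheduler parallelization itself.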
The log shows that the distributed environment is not initialized and is being initialized automatically (`world_size=-1 rank=-1` in the DEBUG line). Make sure the variables that torchrun is supposed to export (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) actually reach the worker process; a minimal check is sketched below.
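A minimal sanity-check sketch, assuming torchrun-style `env://` initialization; `check_distributed_env` is a hypothetical helper, not part of xDiT:

```python
import os
import torch.distributed as dist

def check_distributed_env() -> None:
    # torchrun exports these variables; if any is unset, xDiT's fallback
    # ("Distributed environment is not initialized. Initializing...") kicks in.
    for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
        print(f"{var}={os.environ.get(var, '<unset>')}")

    if not dist.is_initialized():
        # env:// reads the variables printed above
        dist.init_process_group(backend="nccl", init_method="env://")
    print(f"rank {dist.get_rank()} / world_size {dist.get_world_size()} initialized")

if __name__ == "__main__":
    check_distributed_env()
```

Run it with `torchrun --nproc_per_node=1` on the same machine: if this minimal script also segfaults, the problem is in the NCCL/driver stack rather than in xDiT.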