HTTP server crashes on "paralleling scheduler." #441

Open
Aktsvigun opened this issue Jan 22, 2025 · 1 comment
@Aktsvigun

Hi, I'm running an HTTP server according to the instructions. However, the launch always fails at the "Scheduler found, paralleling scheduler..." step, regardless of the number of GPUs I use. Could you help me fix it?

WARNING 01-22 09:31:56 [envs.py:82] Flash Attention library "flash_attn" not found, using pytorch attention implementation
WARNING 01-22 09:31:57 [args.py:358] Distributed environment is not initialized. Initializing...
DEBUG 01-22 09:31:57 [parallel_state.py:200] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
[W122 09:31:57.403391764 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W122 09:31:57.403416352 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
INFO 01-22 09:31:57 [config.py:164] Pipeline patch number not set, using default value 1
[Rank 0] 2025-01-22 09:31:57 - INFO - Initializing model on GPU: 0
Loading pipeline components...:  14%|█▍        | 1/7 [00:00<00:01,  5.76it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.49it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:03<00:00,  2.15it/s]
WARNING 01-22 09:32:01 [runtime_state.py:64] Model parallel is not initialized, initializing...
INFO 01-22 09:32:01 [base_pipeline.py:367] Transformer backbone found, but model parallelism is not enabled, use naive model
INFO 01-22 09:32:02 [base_pipeline.py:423] Scheduler found, paralleling scheduler...
  0%|          | 0/3 [00:00<?, ?it/s]
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0122 09:33:05.030219 823834 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 823906) of binary: /mnt/share/ai_studio/image_generation/.venv-xdit/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-22_09:33:05
  host      : computeinstance-e00ey7wmsf25d9crvz
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 823906)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 823906
============================================================
['/mnt/share/ai_studio/image_generation/.venv-xdit/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node=1', '/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py', '--model=black-forest-labs/FLUX.1-schnell', '--pipefusion_parallel_degree=1', '--ulysses_degree=1', '--ring_degree=1', '--height=1024', '--width=1024', '--max_queue_size=4', '--use_torch_compile']
Traceback (most recent call last):
  File "/mnt/share/ai_studio/image_generation/xDiT/./http-service/launch_host.py", line 54, in <module>
    main()
  File "/mnt/share/ai_studio/image_generation/xDiT/./http-service/launch_host.py", line 50, in main
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.11/subprocess.py", line 569, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/mnt/share/ai_studio/image_generation/.venv-xdit/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node=1', '/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py', '--model=black-forest-labs/FLUX.1-schnell', '--pipefusion_parallel_degree=1', '--ulysses_degree=1', '--ring_degree=1', '--height=1024', '--width=1024', '--max_queue_size=4', '--use_torch_compile']' returned non-zero exit status 1.
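
For reference, the failing launch can be reproduced with a small Python sketch along the lines of what the traceback shows: the command list is copied verbatim from the log above, while the surrounding script structure is an assumption about how launch_host.py builds the call, not the actual implementation.

```python
import subprocess
import sys

# Command list copied from the log; only sys.executable is substituted
# for the hard-coded interpreter path.
cmd = [
    sys.executable, "-m", "torch.distributed.run",
    "--nproc_per_node=1",
    "/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py",
    "--model=black-forest-labs/FLUX.1-schnell",
    "--pipefusion_parallel_degree=1",
    "--ulysses_degree=1",
    "--ring_degree=1",
    "--height=1024",
    "--width=1024",
    "--max_queue_size=4",
    "--use_torch_compile",
]

# A non-zero exit from the child (here the SIGSEGV, exitcode -11) surfaces
# as subprocess.CalledProcessError, matching the second traceback above.
subprocess.run(cmd, check=True)
```
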
@feifeibear (Collaborator)

The log shows that the distributed environment is not initialized and is being initialized automatically. Ensure that your environment is properly set up for distributed training.
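
As a rough illustration of what "properly set up" can mean for a single-process run using the env:// init method shown in the log, here is a minimal sketch; the helper name and the fallback values are illustrative assumptions, not code from xDiT or the original report.

```python
import os
import torch.distributed as dist

def ensure_distributed_initialized():
    """Initialize torch.distributed if no launcher has done so already.

    torchrun normally exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and
    LOCAL_RANK; the setdefault calls below are illustrative single-process
    fallbacks (assumed values, not taken from the report).
    """
    if dist.is_available() and not dist.is_initialized():
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        os.environ.setdefault("RANK", "0")
        os.environ.setdefault("WORLD_SIZE", "1")
        os.environ.setdefault("LOCAL_RANK", "0")
        # Matches the env:// init method and nccl backend shown in the log.
        dist.init_process_group(backend="nccl", init_method="env://")

ensure_distributed_initialized()
```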
