HTTP server crashes on "paralleling scheduler." #441

Open
Aktsvigun opened this issue Jan 22, 2025 · 1 comment
@Aktsvigun

Hi, I'm running an HTTP server according to the instructions. However, the launch always fails at the "Scheduler found, paralleling scheduler..." step, regardless of the number of GPUs I use. Could you help me fix it?

WARNING 01-22 09:31:56 [envs.py:82] Flash Attention library "flash_attn" not found, using pytorch attention implementation
WARNING 01-22 09:31:57 [args.py:358] Distributed environment is not initialized. Initializing...
DEBUG 01-22 09:31:57 [parallel_state.py:200] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
[W122 09:31:57.403391764 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W122 09:31:57.403416352 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
INFO 01-22 09:31:57 [config.py:164] Pipeline patch number not set, using default value 1
[Rank 0] 2025-01-22 09:31:57 - INFO - Initializing model on GPU: 0
Loading pipeline components...:  14%|█▍        | 1/7 [00:00<00:01,  5.76it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.49it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:03<00:00,  2.15it/s]
WARNING 01-22 09:32:01 [runtime_state.py:64] Model parallel is not initialized, initializing...
INFO 01-22 09:32:01 [base_pipeline.py:367] Transformer backbone found, but model parallelism is not enabled, use naive model
INFO 01-22 09:32:02 [base_pipeline.py:423] Scheduler found, paralleling scheduler...
  0%|          | 0/3 [00:00<?, ?it/s]
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0122 09:33:05.030219 823834 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 823906) of binary: /mnt/share/ai_studio/image_generation/.venv-xdit/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/share/ai_studio/image_generation/.venv-xdit/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-22_09:33:05
  host      : computeinstance-e00ey7wmsf25d9crvz
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 823906)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 823906
============================================================
['/mnt/share/ai_studio/image_generation/.venv-xdit/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node=1', '/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py', '--model=black-forest-labs/FLUX.1-schnell', '--pipefusion_parallel_degree=1', '--ulysses_degree=1', '--ring_degree=1', '--height=1024', '--width=1024', '--max_queue_size=4', '--use_torch_compile']
Traceback (most recent call last):
  File "/mnt/share/ai_studio/image_generation/xDiT/./http-service/launch_host.py", line 54, in <module>
    main()
  File "/mnt/share/ai_studio/image_generation/xDiT/./http-service/launch_host.py", line 50, in main
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.11/subprocess.py", line 569, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/mnt/share/ai_studio/image_generation/.venv-xdit/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node=1', '/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py', '--model=black-forest-labs/FLUX.1-schnell', '--pipefusion_parallel_degree=1', '--ulysses_degree=1', '--ring_degree=1', '--height=1024', '--width=1024', '--max_queue_size=4', '--use_torch_compile']' returned non-zero exit status 1.
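
For reference, the failing launch can be reproduced with a small Python sketch along the lines of what the traceback shows: the command list is copied verbatim from the log above, while the surrounding script structure is an assumption about how launch_host.py builds the call, not the actual implementation.

```python
import subprocess
import sys

# Command list copied from the log; only sys.executable is substituted
# for the hard-coded interpreter path.
cmd = [
    sys.executable, "-m", "torch.distributed.run",
    "--nproc_per_node=1",
    "/mnt/share/ai_studio/image_generation/xDiT/http-service/host.py",
    "--model=black-forest-labs/FLUX.1-schnell",
    "--pipefusion_parallel_degree=1",
    "--ulysses_degree=1",
    "--ring_degree=1",
    "--height=1024",
    "--width=1024",
    "--max_queue_size=4",
    "--use_torch_compile",
]

# A non-zero exit from the child (here the SIGSEGV, exitcode -11) surfaces
# as subprocess.CalledProcessError, matching the second traceback above.
subprocess.run(cmd, check=True)
```
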
@feifeibear (Collaborator)

The log shows that the distributed environment is not initialized and is being initialized automatically. Ensure that your environment is properly set up for distributed training.
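
As a rough illustration of what "properly set up" can mean for a single-process run using the env:// init method shown in the log, here is a minimal sketch; the helper name and the fallback values are illustrative assumptions, not code from xDiT or the original report.

```python
import os
import torch.distributed as dist

def ensure_distributed_initialized():
    """Initialize torch.distributed if no launcher has done so already.

    torchrun normally exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and
    LOCAL_RANK; the setdefault calls below are illustrative single-process
    fallbacks (assumed values, not taken from the report).
    """
    if dist.is_available() and not dist.is_initialized():
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        os.environ.setdefault("RANK", "0")
        os.environ.setdefault("WORLD_SIZE", "1")
        os.environ.setdefault("LOCAL_RANK", "0")
        # Matches the env:// init method and nccl backend shown in the log.
        dist.init_process_group(backend="nccl", init_method="env://")

ensure_distributed_initialized()
```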
