
Update pyproject.toml: add dependency "ninja" #2522

Closed · wants to merge 1 commit

Conversation

adarshxs (Contributor)

Motivation

Fixes #2514 and the following error:

[2024-12-19 14:44:19 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=41.70 GB
[2024-12-19 14:44:20 TP0] Memory pool end. avail mem=5.26 GB
[2024-12-19 14:44:20 TP0] Capture cuda graph begin. This can take up to several minutes.
  0%|                                                                                                                                              | 0/6 [00:00<?, ?it/s]2024-12-19 14:44:21,421 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_64_posenc_0_use_swa_False_use_logits_cap_False
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
  0%|                                                                                                                                              | 0/6 [00:01<?, ?it/s]
[2024-12-19 14:44:21 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 209, in __init__
    self.capture()
  File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 275, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 304, in capture_one_batch_size
    self.model_runner.attn_backend.init_forward_metadata_capture_cuda_graph(
  File "/workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 194, in init_forward_metadata_capture_cuda_graph
    self.indices_updater_decode.update(
  File "/workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 378, in update_single_wrapper
    self.call_begin_forward(
  File "/workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 478, in call_begin_forward
    wrapper.begin_forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/decode.py", line 788, in plan
    self._cached_module = get_batch_decode_module(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/decode.py", line 148, in get_batch_decode_module
    mod = gen_batch_decode_module(*args)
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/jit/attention.py", line 173, in gen_batch_decode_module
    return load_cuda_ops(uri, source_paths)
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/jit/core.py", line 112, in load_cuda_ops
    module = torch_cpp_ext.load(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1312, in load
    return _jit_compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1804, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1853, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1531, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 192, in __init__
    self.tp_worker = TpWorkerClass(
  File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 184, in __init__
    self.init_cuda_graphs()
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 639, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 211, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Ninja is required to load C++ extensions
Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 
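The proposed fix is a one-line dependency addition. A minimal sketch of the intended pyproject.toml change, with section layout and surrounding entries assumed rather than copied from the actual file:

```toml
[project]
# ...existing metadata...
dependencies = [
    # ...existing runtime dependencies...
    "ninja",  # needed by flashinfer's JIT path via torch.utils.cpp_extension
]
```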
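The root cause is torch's `verify_ninja_availability()`, which raises when the `ninja` build tool cannot be found while flashinfer JIT-compiles its CUDA kernels. As a rough sketch (not the actual torch implementation), the check amounts to looking for a working `ninja` executable on PATH:

```python
import shutil
import subprocess


def ninja_available() -> bool:
    """Return True if the `ninja` build tool can be found and run.

    A rough approximation of the check torch.utils.cpp_extension
    performs before JIT-compiling C++/CUDA extensions; the real
    implementation raises RuntimeError instead of returning False.
    """
    ninja = shutil.which("ninja")
    if ninja is None:
        return False
    try:
        # Running `ninja --version` confirms the binary actually works.
        subprocess.run([ninja, "--version"], check=True, capture_output=True)
    except (OSError, subprocess.CalledProcessError):
        return False
    return True
```

Installing ninja (e.g. `pip install ninja`) puts the executable on PATH, which is why declaring it as a package dependency resolves the crash.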

zhyncs (Member) commented on Dec 19, 2024

Duplicate of #2501.

zhyncs added the duplicate label (This issue or pull request already exists) on Dec 19, 2024
zhyncs (Member) commented on Dec 19, 2024

Thank you for your contribution, even though a similar feature has already been implemented in another PR!

zhyncs closed this on Dec 19, 2024
adarshxs (Contributor, Author)

My bad, I hadn't synced my fork. Thanks!

Development

Successfully merging this pull request may close these issues.

[Bug] RuntimeError: Ninja is required to load C++ extensions