vLLM terminated unexpectedly at the multi-gpu environment #512
Hello @gnekt, first of all, thanks for the report. About the termination of the vLLM module: it could be an OOM error. Did you check your VRAM status?
---

CPU: AMD EPYC 7402P (48) @ 2.800GHz

This is our config:

```yaml
# This config YAML file does not contain any optimization.
node_lines:
  - node_line_name: retrieve_node_line  # Arbitrary node line name
    nodes:
      - node_type: retrieval
        strategy:
          metrics: [retrieval_f1, retrieval_recall, retrieval_precision]
          top_k: 3
        modules:
          - module_type: vectordb
            embedding_model: huggingface_baai_bge_small
  - node_line_name: post_retrieve_node_line  # Arbitrary node line name
    nodes:
      - node_type: prompt_maker
        strategy:
          metrics: [bleu, meteor, rouge]
        modules:
          - module_type: fstring
            prompt: "Read the passages and answer the given question. \n Question: {query} \n Passage: {retrieved_contents} \n Answer : "
      - node_type: generator
        strategy:
          metrics: [bleu, meteor, rouge]
        modules:
          - module_type: vllm
            llm: mistralai/Mistral-7B-Instruct-v0.2
```

We checked the VRAM status, and it uses only about 1 GB for the embedding computation, so an OOM error seems unlikely.
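To rule OOM in or out more rigorously, one option is to poll `nvidia-smi` during the run and check how close each GPU is to its VRAM limit. Below is a minimal sketch; the helper names and the 90% threshold are illustrative, not part of AutoRAG or vLLM:

```python
# Sketch: parse the CSV output of
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
# and report which GPUs are close to their VRAM limit.

def parse_gpu_memory(csv_text):
    """Return a list of (gpu_index, used_mib, total_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        idx, used, total = (field.strip() for field in line.split(","))
        gpus.append((int(idx), int(used), int(total)))
    return gpus

def nearly_full(gpus, threshold=0.9):
    """Return indices of GPUs whose used/total ratio exceeds the threshold."""
    return [idx for idx, used, total in gpus if used / total > threshold]

if __name__ == "__main__":
    # Synthetic sample output (two 24 GiB GPUs, the second nearly full).
    sample = "0, 1024, 24576\n1, 23500, 24576"
    gpus = parse_gpu_memory(sample)
    print(nearly_full(gpus))  # -> [1]
```

In practice you would feed the function the real output of `subprocess.run(["nvidia-smi", ...])` in a loop while AutoRAG runs; a sudden process death right as a GPU approaches 100% usage would point strongly at OOM.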
---

This is the last logger output when executing AutoRAG:

```
UserWarning: This pandas object has duplicate indices, and swifter may not be able to improve performance. Consider resetting the indices with `df.reset_index(drop=True)`.
  warnings.warn(
[06/20/24 08:39:54] INFO  [evaluator.py:97] >> Running node line post_retrieve_node_line...  evaluator.py:97
                    INFO  [node.py:55] >> Running node prompt_maker...                       node.py:55
                    INFO  [base.py:20] >> Running prompt maker node - fstring module...      base.py:20
                    INFO  [node.py:55] >> Running node generator...                          node.py:55
                    INFO  [base.py:34] >> Running generator node - vllm module...            base.py:34
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
```
---

Hi @gnekt, I checked the Eli5 and TriviaQA datasets, and they had no missing doc_id in the corpus dataset, so unfortunately I can't reproduce the error you mentioned. @gnekt @CristianCosci Also, about the vLLM termination: I can't reproduce that either. I suspect it is the vLLM install environment (it worked fine on our system). Maybe re-installing vLLM to the latest version could help. My vllm version is 0.4.3.
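For anyone wanting to repeat the missing-doc_id check on their own data: the idea is to collect every doc_id referenced by the QA dataset's retrieval ground truth and verify it exists in the corpus. The sketch below uses synthetic data; with real AutoRAG datasets you would load `qa.parquet` / `corpus.parquet` (e.g. with pandas) and pass the `retrieval_gt` and `doc_id` columns instead:

```python
# Sketch: find doc_ids referenced in retrieval_gt that are absent from the corpus.
# retrieval_gt follows AutoRAG's 2-D ground-truth format: per query, a list of
# groups, each group a list of doc_ids.

def missing_doc_ids(retrieval_gt, corpus_doc_ids):
    """Return the set of referenced doc_ids not present in the corpus."""
    corpus = set(corpus_doc_ids)
    missing = set()
    for gt in retrieval_gt:          # one entry per query
        for group in gt:             # one group of doc_ids
            for doc_id in group:
                if doc_id not in corpus:
                    missing.add(doc_id)
    return missing

if __name__ == "__main__":
    retrieval_gt = [[["d1", "d2"]], [["d3"]]]
    corpus_doc_ids = ["d1", "d2"]
    print(missing_doc_ids(retrieval_gt, corpus_doc_ids))  # -> {'d3'}
```

An empty result means every ground-truth document is present; a non-empty result identifies exactly which documents need to be cleaned out or added back to the corpus.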
---

It looks like there is an error in the multi-GPU environment...
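If the crash only shows up with multiple GPUs, one thing worth trying while debugging is constraining vLLM explicitly. vLLM's `LLM` constructor accepts `tensor_parallel_size` and `gpu_memory_utilization`; the fragment below assumes AutoRAG's vllm module forwards these extra keys to that constructor, so treat it as a sketch rather than a confirmed fix:

```yaml
# Sketch: pin vLLM to a single GPU and cap memory usage in the generator module.
# Assumes extra keys are passed through to vLLM's LLM(...) constructor.
- node_type: generator
  strategy:
    metrics: [bleu, meteor, rouge]
  modules:
    - module_type: vllm
      llm: mistralai/Mistral-7B-Instruct-v0.2
      tensor_parallel_size: 1        # run on one GPU to isolate the multi-GPU issue
      gpu_memory_utilization: 0.8    # leave headroom to rule out OOM
```

If the run succeeds with `tensor_parallel_size: 1` but dies with a larger value, that narrows the problem to vLLM's multi-GPU (NCCL/distributed) path rather than the model or the data.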
---

I finally got a multi-GPU environment, so I can try to reproduce the bug and resolve it.

---

Because of a CUDA-related issue, we lost our multi-GPU environment again 😭

---

No problem 😃
---

Hello,

While using the ELI5 and TriviaQA datasets from the Hugging Face library, I encountered errors about missing documents that are not present in the corpus. I had a similar issue with the HotpotQA dataset but managed to resolve it by cleaning out the mismatched documents.

However, when I switched to the HotpotQA dataset, I observed some strange behavior with the vLLM module. Specifically, the process terminates without any error messages.

Thank you in advance.