
unexpected response when using llama2-7b-chat #3

Open
kaishxu opened this issue Apr 17, 2024 · 4 comments

kaishxu commented Apr 17, 2024

Hello!

I'm trying to use your pre-trained model with this command:
CUDA_VISIBLE_DEVICES=4,5,6,7 python inference.py -i -m llama-2-7b-chat --eval_name concat_recur

However, generation stops unexpectedly when I input the query:
help me list popular songs written by Taylor Swift.

The result is shown as follows:
[Screenshot 2024-04-17 at 21:26:19]

It stops generating more content and outputs </s> instead.

Are there any other settings I missed?

@Janghyun1230 (Collaborator)

Hello!
I just tried the query with the given command and the current GitHub commit.

At the beginning of the chat, the model produces the list:
[Screenshot 2024-04-17 at 11:50:11 AM]

However, after compression, the model seems to produce an EOS token before the list:
[Screenshot 2024-04-17 at 11:48:32 AM]

Comparing the results above, it seems that the generation code is not the problem. My suspicion is that our training data (for the compression adapter) is mainly composed of sentences without \n tokens, which leads to the behavior above. To solve the problem, I think we need to design new training data.
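As a quick sanity check of this hypothesis, one could measure how many training samples actually contain newline tokens. A minimal sketch, assuming a JSONL file with a "text" field (the path and field name are hypothetical, not the repository's actual data format):

import json

# Hypothetical path/field; adjust to the real compression-adapter training data.
path = "data/compression_train.jsonl"

total = with_newline = 0
with open(path) as f:
    for line in f:
        text = json.loads(line)["text"]
        total += 1
        if "\n" in text:
            with_newline += 1

print(f"{with_newline}/{total} samples contain a newline "
      f"({100 * with_newline / total:.1f}%)")

A very low percentage here would support the suspicion that the adapter rarely sees \n during training.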


kaishxu commented Apr 18, 2024

Thanks so much for your quick reply!

I have another question about the LinearMask() class in most modeling files under the "arch" directory. As shown in the figure below, the forward() input of LinearMask() includes comp_mask, but the computation never actually uses this variable.

[Screenshot 2024-04-18 at 10:42:20]

If this variable is not used, the linear mapping is the same as the original implementation in modeling_llama.py.


kaishxu commented Apr 18, 2024


It is an interesting phenomenon that the compression tokens affect the model's generation capability.

@Janghyun1230 (Collaborator)

Regarding the question about LinearMask: comp_mask works together with LoRA. I modified the Hugging Face LoRA code in src/peft_custom/lora.py, so its forward pass now accepts the mask:

def forward(self, x: torch.Tensor, comp_mask=None):

Without LoRA, our model behaves the same as the original linear layer; the LoRA path is activated only for the compression tokens.
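For intuition, here is a minimal sketch of how a comp_mask could gate a LoRA update so that the low-rank delta is applied only at compression-token positions. The class and tensor names are illustrative assumptions, not the actual code in src/peft_custom/lora.py:

import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """Linear layer whose LoRA delta is applied only where comp_mask is 1.

    Illustrative sketch; the real implementation lives in src/peft_custom/lora.py.
    """

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor, comp_mask=None):
        # Plain linear mapping, identical to the original layer.
        out = self.base(x)
        if comp_mask is None:
            return out
        # comp_mask: (batch, seq_len) with 1 at compression-token positions.
        delta = self.lora_B(self.lora_A(x)) * self.scaling
        return out + delta * comp_mask.unsqueeze(-1).to(delta.dtype)

With comp_mask set to None or all zeros, the output reduces to the base linear layer, matching the behavior described above for non-compression tokens.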
