
unexpected response when using llama2-7b-chat #3

Open
kaishxu opened this issue Apr 17, 2024 · 4 comments

kaishxu commented Apr 17, 2024

Hello!

I'm trying to use your pre-trained model with this command:
CUDA_VISIBLE_DEVICES=4,5,6,7 python inference.py -i -m llama-2-7b-chat --eval_name concat_recur

However, generation stops unexpectedly when I input the query:
help me list popular songs written by Taylor Swift.

The result is shown as follows:
[Screenshot 2024-04-17 at 21:26:19]

It stops generating more content and outputs </s> instead.

Are there any other settings I missed?

@Janghyun1230 (Collaborator)

Hello!
I just tried the query with the given command and the current GitHub commit.

At the beginning of the chat, the model produces the list:
[Screenshot 2024-04-17 at 11:50:11 AM]

However, after compression, the model seems to produce an EOS token before the list:
[Screenshot 2024-04-17 at 11:48:32 AM]

Comparing the results above, it seems that the generation code is not the problem. My suspicion is that our training data (for the compression adapter) is mainly composed of sentences without \n tokens, which leads to the behavior above. To solve the problem, I think we need to design new training data.
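As a quick sanity check of this hypothesis, one could measure how many training samples actually contain newline tokens. A minimal sketch, assuming a JSONL file with a "text" field (the path and field name are hypothetical, not the repository's actual data format):

import json

# Hypothetical path/field; adjust to the real compression-adapter training data.
path = "data/compression_train.jsonl"

total = with_newline = 0
with open(path) as f:
    for line in f:
        text = json.loads(line)["text"]
        total += 1
        if "\n" in text:
            with_newline += 1

print(f"{with_newline}/{total} samples contain a newline "
      f"({100 * with_newline / total:.1f}%)")

A very low percentage here would support the suspicion that the adapter rarely sees \n during training.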


kaishxu commented Apr 18, 2024

Thanks so much for your quick reply!

I have another question about the LinearMask() class in most modeling files under the "arch" directory. As shown in the figure below, the forward() input of LinearMask() includes comp_mask, but the computation never actually uses this variable.

[Screenshot 2024-04-18 at 10:42:20]

If this variable is not used, the linear mapping is the same as the original implementation in modeling_llama.py.


kaishxu commented Apr 18, 2024


It is an interesting phenomenon that the compression tokens affect the model's generation capability.

@Janghyun1230 (Collaborator)

Regarding the question about LinearMask: comp_mask works together with LoRA. I modified the Hugging Face LoRA code in src/peft_custom/lora.py, so its forward pass now accepts the mask:

def forward(self, x: torch.Tensor, comp_mask=None):

Without LoRA, our model behaves the same as the original linear layer; the LoRA path is activated only for the compression tokens.
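For intuition, here is a minimal sketch of how a comp_mask could gate a LoRA update so that the low-rank delta is applied only at compression-token positions. The class and tensor names are illustrative assumptions, not the actual code in src/peft_custom/lora.py:

import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """Linear layer whose LoRA delta is applied only where comp_mask is 1.

    Illustrative sketch; the real implementation lives in src/peft_custom/lora.py.
    """

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor, comp_mask=None):
        # Plain linear mapping, identical to the original layer.
        out = self.base(x)
        if comp_mask is None:
            return out
        # comp_mask: (batch, seq_len) with 1 at compression-token positions.
        delta = self.lora_B(self.lora_A(x)) * self.scaling
        return out + delta * comp_mask.unsqueeze(-1).to(delta.dtype)

With comp_mask set to None or all zeros, the output reduces to the base linear layer, matching the behavior described above for non-compression tokens.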
