Hi, great work! I am having some issues with training Llama-2-13b-chat on the Anthropic HH dataset.

I followed the README to train SFT on HH and then DPO.

The only things I changed are setting `policy_dtype: bfloat16` (so I can use FlashAttention-2) and the tokenization, so that it matches the Llama-2 instruction-following format; examples of the token formatting are sketched below.

However, I found that the reward accuracies are no better than 50% (see below), and the comparison performance is worse than before (evaluated by GPT-4).
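For concreteness, the Llama-2 chat format wraps each user turn in `[INST] ... [/INST]`, with an optional `<<SYS>> ... <</SYS>>` block in the first turn. A minimal sketch of the formatting I am describing (the helper name and exact whitespace handling here are illustrative, not my exact code):

```python
# Illustrative sketch of Meta's published Llama-2 chat template.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_llama2_prompt(turns, system_prompt=None):
    """Render alternating (user, assistant) turns in Llama-2 chat format.

    `turns` is a list of (user_msg, assistant_msg) pairs; the final
    assistant_msg may be None when the model should generate it.
    Note: in practice <s>/</s> are usually added as special tokens by
    the tokenizer rather than as literal text, as done here.
    """
    out = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system_prompt is not None:
            user = f"{B_SYS}{system_prompt}{E_SYS}{user}"
        if assistant is None:
            # Open-ended prompt: the model generates after [/INST].
            out.append(f"<s>{B_INST} {user.strip()} {E_INST}")
        else:
            out.append(f"<s>{B_INST} {user.strip()} {E_INST} {assistant.strip()} </s>")
    return "".join(out)

# Example: a single-turn HH-style prompt awaiting the model's response.
print(format_llama2_prompt([("How do I bake bread?", None)]))
```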
Our system cannot access public WandB, so I don't have a WandB link or better metric readouts to help diagnose this.
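For reference, by "reward accuracy" I mean the fraction of preference pairs whose DPO implicit reward ranks the chosen response above the rejected one, so 50% is chance level. A minimal sketch of computing it locally, assuming per-sequence summed log-probs are already available (the function name and `beta` default are illustrative, not from this repo):

```python
import torch

def dpo_implicit_reward_accuracy(policy_chosen_logps: torch.Tensor,
                                 policy_rejected_logps: torch.Tensor,
                                 ref_chosen_logps: torch.Tensor,
                                 ref_rejected_logps: torch.Tensor,
                                 beta: float = 0.1) -> torch.Tensor:
    """Fraction of pairs where the DPO implicit reward prefers chosen over rejected.

    Each argument is a 1-D tensor of summed per-sequence log-probs.
    Implicit reward: r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return (chosen_rewards > rejected_rewards).float().mean()
```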