Different hug_trans versions cause different BertScore #164
Comments
Hi @dzf0023, I ran the following code with both versions:

```python
from bert_score import score

pred = "aaaaaa"
ref = "hello there"
print(score([pred], [ref], lang='en'))
print(score([pred], [ref], lang='en', rescale_with_baseline=True))
```

However, I got identical outputs with the two different versions. Can you double-check that you are calling the function with the same inputs?
Hi @felixgwu, thank you so much for your quick response. I basically tried the metric from the Hugging Face website: https://huggingface.co/spaces/evaluate-metric/bertscore

It still gives me different results for different "hug_trans" versions.

I see that you previously answered that BERTScore can change with different huggingface transformers versions and that you would look into it (https://github.com/Tiiiger/bert_score/issues/143#issuecomment-1327988420). Is it because of this reason? Please find my screenshots below for the two hashcodes.
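For anyone comparing the two environments, a minimal sketch of how the hashcode can be printed, assuming the Python `bert_score` package (the actual hash strings from the screenshots are not reproduced here):

```python
from bert_score import score

pred = ["aaaaaa"]
ref = ["hello there"]

# return_hash=True makes score() also return the configuration hash
# (model, layer, idf setting, bert_score and transformers versions),
# which is what differs between the two environments being compared.
(P, R, F1), hashcode = score(pred, ref, lang="en", return_hash=True)
print(hashcode)
print(f"F1: {F1.item():.4f}")
```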
Thank you so much for your time. May I know which library versions you are using? I can definitely go and check the source code of how they implement the transformer.

Also, this is very interesting to me. Intuitively, when we compare "aaaaaa" vs. "hello there", which is a very different pair, BERTScore still gives a high score of 0.808 (which I feel it should not). From your original paper, I understand most of the experiments are MT, so I referred to another benchmark paper, "SummEval: Re-evaluating Summarization Evaluation" (https://arxiv.org/abs/2007.12626), since I am more focused on the summarization task. In Table 4 of that paper, the average BERTScore for most models, such as T5, is only 0.4450. I am curious: if even large models such as T5 only achieve an average BERTScore of 0.445, how can a very different pair get 0.808? Please check another case I used below: this ref/generation pair is even more different, yet with 4.24.0 the BERTScore_F1 is still 0.77. Thus, my feeling is that the version that gives you the very high score tends to give relatively high scores in general; that is, its lower bound is very high.

Another follow-up question: if the absolute value of the BERTScore is not that important, should we care more about human correlation instead of focusing on the score itself? I assume different transformers versions produce different embeddings, which causes the variable results. Please find the screenshot of the SummEval paper table below.
You can find all the libraries in https://github.com/Tiiiger/bert_score/blob/master/requirements.txt. 0.80 is a reasonable BERTScore when using roberta-large_L17; as you can see in the baseline file, the BERTScore between two random sentences is about 0.83. This is why we recommend using `rescale_with_baseline=True`.
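A minimal sketch of the baseline rescaling mentioned above, assuming the linear rescaling described in the BERTScore baseline note (the 0.83 value is illustrative, taken from the comment above):

```python
def rescale_with_baseline(raw_f1: float, baseline: float) -> float:
    # Linear rescaling: the average score of random sentence pairs
    # (the baseline) maps to 0, and a perfect match (1.0) maps to 1.
    return (raw_f1 - baseline) / (1.0 - baseline)

# With a baseline around 0.83 for roberta-large_L17, a raw F1 of 0.80
# for "aaaaaa" vs. "hello there" becomes slightly negative after rescaling,
# which matches the intuition that the pair is a poor match.
print(rescale_with_baseline(0.80, 0.83))  # ≈ -0.18
```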
Thank you so much for your suggestions! I will try it asap.
Dear Author,

I would like to ask why, using different hug_trans versions, the BertScore differs even for the same model and the same reference/prediction pair.

For instance, given prediction = "aaaaaa" and reference = "hello there", using the default model (roberta-large_L17_no-idf_version=0.3.12): with hug_trans=4.30.2 the BertScore_F1 is 0.80, while with hug_trans=4.24.0 it is 0.238, for the same input.

Meanwhile, when the prediction and reference are identical, BertScore_F1 with hug_trans=4.24.0 does not give 1 as the result.

Although the original paper mentions that a random BertScore is calculated for baseline rescaling, this random score should be very small, so it is a little tricky to understand why such a significant gap appears.

Thank you so much for your contribution and your time in answering our questions.
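A minimal repro sketch for the comparison described above, assuming the Python `bert_score` package; the exact scores depend on the installed transformers version, which is the point of this issue:

```python
from bert_score import score

pred = ["aaaaaa"]
ref = ["hello there"]

# Default English model is roberta-large, layer 17
# (hashcode roberta-large_L17_no-idf_...).
P, R, F1 = score(pred, ref, lang="en")
print(f"raw F1: {F1.item():.3f}")

# Identical prediction and reference; the raw F1 is expected to be ~1.0.
P, R, F1 = score(["hello there"], ["hello there"], lang="en")
print(f"identical-pair F1: {F1.item():.3f}")
```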