Different hug_trans versions cause different BertScore #164

Open
dzf0023 opened this issue Jun 22, 2023 · 6 comments
@dzf0023

dzf0023 commented Jun 22, 2023

Dear Author,

I would like to ask why, with different hug_trans versions, the BertScore differs even for the same model and the same reference/prediction pair.

For instance, given prediction = "aaaaaa" and reference = "hello there", using the default model (roberta-large_L17_no-idf_version=0.3.12), hug_trans=4.30.2 gives a BertScore_F1 of 0.80, while hug_trans=4.24.0 gives 0.238 for the same input.

Meanwhile, with identical prediction and reference as input, BertScore_F1 with hug_trans=4.24.0 does not give 1 as the result.
Although the original paper mentions computing a BertScore between random sentences for baseline rescaling, that random score should be very small, so it's a little hard to understand why such a significant gap appears.

Thank you so much for your contribution and your time to answer our questions.

@felixgwu
Collaborator

Hi @dzf0023,

I ran the following code with both transformers==4.30.2 and transformers==4.24.0, as you described.

from bert_score import score

pred = "aaaaaa"
ref = "hello there"

# score() returns a tuple of (precision, recall, F1) tensors
print(score([pred], [ref], lang='en'))
print(score([pred], [ref], lang='en', rescale_with_baseline=True))

However, I got identical outputs with both versions, as follows.

no rescaling: (tensor([0.7637]), tensor([0.8599]), tensor([0.8089]))
with rescaling: (tensor([-0.4024]), tensor([0.1683]), tensor([-0.1321]))

Can you double check if you call the function with the same inputs?

@dzf0023
Author

dzf0023 commented Jun 22, 2023

Hi @felixgwu,

Thank you so much for your quick response. I basically used the metric from the Hugging Face website: https://huggingface.co/spaces/evaluate-metric/bertscore
Below is the code:

from evaluate import load

bertscore = load("bertscore")

pred = "aaaaaa"
ref = "hello there"

# predictions and references should be lists of strings
results = bertscore.compute(predictions=[pred], references=[ref], lang='en')
print(results)

It still gives me different results for different "hug_trans" versions.

hug_trans=4.30.2: result is (tensor([0.7637]), tensor([0.8599]), tensor([0.8089]))
hug_trans=4.24.0: result is (tensor([0.1998]), tensor([0.2959]), tensor([0.2385]))

I see that you previously answered that BertScore can change with different Hugging Face transformers versions and that you would look into it: https://github.com/Tiiiger/bert_score/issues/143#issuecomment-1327988420. Is it because of this?

Please find my screenshots of the two hashcodes below, one from Colab and one from my server:

[screenshot: Colab]

[screenshot: server]
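
For reference, a minimal sketch of printing the hashcode directly instead of reading it from a screenshot, assuming bert_score.score supports return_hash=True and the evaluate metric includes a "hashcode" entry in its result dict (both the case in recent versions):

from bert_score import score
from evaluate import load

pred = "aaaaaa"
ref = "hello there"

# bert_score API: ask for the hashcode alongside the (P, R, F1) tensors
(P, R, F1), hashcode = score([pred], [ref], lang='en', return_hash=True)
print(hashcode)  # e.g. roberta-large_L17_no-idf_version=0.3.12(hug_trans=...)

# evaluate API: the hashcode is part of the result dict
bertscore = load("bertscore")
results = bertscore.compute(predictions=[pred], references=[ref], lang='en')
print(results["hashcode"])

Running this in both environments makes the hug_trans part of the hashcode (and any other difference) easy to compare.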

@felixgwu
Collaborator

Unfortunately, I'm still not able to get 0.24 as you showed. Here is what I got with version 4.24.0; there might be a bug in some other library that I'm not aware of.

[screenshot of the output]

@dzf0023
Author

dzf0023 commented Jun 23, 2023

Thank you so much for your time. May I know which libraries you are using? I can go and check their source code to see how they use the transformer.

Also, this is very interesting to me. Intuitively, when we compare "aaaaaa" vs. "hello there", which is a very different pair, BertScore still gives a high score of 0.808 (which I feel it should not). From your original paper, I understand most of the experiments are on MT, so I referred to another benchmark paper, "SummEval: Re-evaluating Summarization Evaluation" (https://arxiv.org/abs/2007.12626), since I am more focused on summarization. In Table 4 of that paper, the average BertScore for most models, such as T5, is only 0.4450. I am curious: if even large, capable models such as T5 only achieve an average BertScore of 0.445, how can a very different pair get 0.808? Please check another case I used below:

[screenshot: another reference/generation pair]

This reference/generation pair is even more different, yet with 4.24.0 the BertScore_F1 is still 0.77.

Thus, my feeling is that the version giving the very high score tends to give relatively high scores overall; that is, its lower bound is very high.

Another follow-up question: if the absolute value of the BertScore is not that important, should we care more about correlation with human judgments rather than the score itself? I assume different transformers versions produce different embeddings, which causes the variable results.

Please find the screenshot of the SummEval paper table below:

[screenshot: SummEval paper, Table 4]

@felixgwu
Collaborator

felixgwu commented Jun 23, 2023

You can find all the libraries in https://github.com/Tiiiger/bert_score/blob/master/requirements.txt.
However, I suggest you first take the environment in which you have transformers==4.30.2, run pip install transformers==4.24.0 to downgrade it, and check whether you also get 0.80 with the older version; then you can compare the other libraries in the two environments.
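
A minimal sketch of what to print in each environment to compare them (the library list below is just illustrative; requirements.txt has the full set):

# Print the versions of the relevant libraries so the two environments can be diffed.
import bert_score
import torch
import transformers

print("bert_score  ", bert_score.__version__)
print("transformers", transformers.__version__)
print("torch       ", torch.__version__)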

0.80 is a reasonable BERTScore when using roberta-large_L17. As you can see in this file, the BERTScore between two random sentences is about 0.83. This is why we recommend using rescale_with_baseline=True, which gives you -0.1321. For a more detailed explanation, see our post.
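
For reference, the rescaling is a simple linear map using a language- and model-specific baseline b (roughly the score of random sentence pairs); a minimal sketch, with the baseline value taken approximately from the comment above:

# Baseline rescaling: the random-pair baseline b maps to 0, a perfect score of 1 stays at 1.
def rescale(raw_score, baseline):
    return (raw_score - baseline) / (1.0 - baseline)

# With a baseline around 0.83, a raw F1 of 0.8089 becomes a small negative value,
# in the same ballpark as the -0.1321 reported above.
print(rescale(0.8089, 0.83))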

@dzf0023
Author

dzf0023 commented Jun 23, 2023

Thank you so much for your suggestions! I will try it asap.
