Is factCC reliable for factual correctness evaluation? #6
I noticed the metric is based on uncased BERT, so I did use lower-cased inputs.
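For what it's worth, the uncased tokenizer should lower-case the text anyway (assuming the checkpoint sits on top of bert-base-uncased, which is my assumption), so explicit lower-casing ought to be redundant, though it doesn't hurt. A quick sanity check:

```python
from transformers import BertTokenizer

# Assumption: the FactCC checkpoint is built on bert-base-uncased.
# The uncased tokenizer lower-cases its input on its own, so
# pre-lower-casing the claims should not change the word pieces.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("FactCC swaps Entities"))  # all lower-case word pieces
```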
I've got the same result... I used the summary sentences as claims.
In fact, I also encountered this problem. Like the method mentioned above, I used the gold summaries for evaluation and got the following result on the author-annotated dataset:
Some of my observations below:
In summary, I think FactCC can identify local errors like swapping entities or numbers. However, don't count on it to solve the hard NLI problem. Overall, it's still one of the better metrics. You can also check out the following paper:
Goyal, Tanya, and Greg Durrett. "Evaluating factuality in generation with dependency-level entailment." arXiv preprint arXiv:2010.05478 (2020).
I greatly appreciate the discussion above.
I notice that some papers use FactCC as a metric.
@Ricardokevins, you can take a look at the following two comprehensive surveys on factuality metrics. What's disturbing is that they reach very different conclusions. If you're writing a paper, the best you can do is to pick 1-2 metrics from each category (e.g., entailment, QA, optionally IE) and report the results of all of them. You also need to do a small-scale human evaluation on 50-100 summaries.
Gabriel, Saadia, et al. "Go figure! A meta evaluation of factuality in summarization." arXiv preprint arXiv:2010.12834 (2020).
Pagnoni, Artidoro, Vidhisha Balachandran, and Yulia Tsvetkov. "Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics." arXiv preprint arXiv:2104.13346 (2021).
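Here is a minimal sketch of what I mean by the human-evaluation step, with made-up placeholder numbers just to show the shape of the analysis (the metric names are only examples, not real results):

```python
# Correlate each automatic metric with human factuality ratings on a
# small sample of 50-100 summaries; all values below are placeholders.
from scipy.stats import pearsonr, spearmanr

human_ratings = [1, 0, 1, 1, 0]                 # e.g., 1 = factually consistent
metric_scores = {
    "factcc":    [0.9, 0.2, 0.7, 0.8, 0.4],     # entailment-based
    "qa_metric": [0.6, 0.1, 0.8, 0.9, 0.3],     # QA-based (e.g., QAGS/FEQA)
}

for name, scores in metric_scores.items():
    r, _ = pearsonr(scores, human_ratings)
    rho, _ = spearmanr(scores, human_ratings)
    print(f"{name}: Pearson={r:.2f}, Spearman={rho:.2f}")
```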
Thanks a lot <3
My annotated dataset result is the same as yours. This result, however, is not consistent with the Table 3 F1 score for FactCC. Does anyone have an intuition for why?
I really appreciate the excellent paper.
I tested FactCC on the CNN/DM dataset using gold reference sentences as claims (each reference was split into single sentences).
I strictly followed the README and used the official pre-trained FactCC checkpoint.
I labeled all the claims as 'CORRECT' (because they are gold references).
The accuracy output by FactCC is around 42%, which means the model thinks only 42% of the reference sentences are factually correct.
Is this reasonable, or did I use the metric incorrectly?
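For anyone trying to reproduce this, here is roughly how I built the claim file. It is only a sketch: the `id`/`text`/`claim`/`label` field names and the `data-dev.jsonl` file name reflect my reading of the repo's data format, so please check them against the official preprocessing scripts.

```python
import json
import re

def split_sentences(text):
    """Naive sentence splitter; a proper tool (NLTK, spaCy) would be better."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_claims(article, reference_summary, example_id):
    """One FactCC-style record per gold reference sentence, all labeled CORRECT."""
    return [
        {
            "id": f"{example_id}-{i}",
            "text": article,      # source document
            "claim": sentence,    # a single gold reference sentence
            "label": "CORRECT",   # gold references are assumed factually correct
        }
        for i, sentence in enumerate(split_sentences(reference_summary))
    ]

# Toy example; in practice this loops over the whole CNN/DM test split.
article = "The quick brown fox jumped over the lazy dog in central London on Monday."
reference = "A fox jumped over a dog. It happened in London on Monday."

with open("data-dev.jsonl", "w") as f:
    for record in build_claims(article, reference, "cnndm-0"):
        f.write(json.dumps(record) + "\n")
```

Since every label is CORRECT, the accuracy reported by the evaluation script is simply the fraction of claims the model predicts as CORRECT; that fraction is the ~42% I am asking about.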