Non standard Relation Extraction metric ? #5

Open
btaille opened this issue Nov 15, 2019 · 6 comments

btaille commented Nov 15, 2019

Hello,

In your paper, you only specify the criterion for considering an entity correct, not a relation.
From a quick look at your code in model1/relation_metrics.py, I understand that you consider a relation correct if its relation type is correct along with the spans of its two arguments.
That is, without considering the predicted entity types of the arguments.

If so, you are using what (Bekoulis 2018) refers to as the "Boundaries" evaluation setting.
You cannot directly compare with previous works that take the entity type into account, i.e. the "Strict" evaluation setting as defined by (Bekoulis 2018).
As far as I know, (Li and Ji 2014) is the only related work using this "Boundaries" evaluation setting.
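
To make the distinction concrete, here is a minimal sketch of the two matching criteria (my own toy representation, not taken from your code; I assume a relation is reduced to its type plus the two argument spans, with entity types kept in a separate mapping):

```python
# Hypothetical representations, for illustration only (not from this repo):
#   relation  = (rel_type, head_span, tail_span), with span = (start, end)
#   ent_types = {span: entity_type} for the predicted / gold entities

def boundaries_match(pred_rel, gold_rel):
    # "Boundaries": the relation type and both argument spans must be correct.
    return pred_rel == gold_rel

def strict_match(pred_rel, gold_rel, pred_ent_types, gold_ent_types):
    # "Strict": in addition, the predicted type of each argument must be correct.
    _, head, tail = pred_rel
    return (pred_rel == gold_rel
            and head in pred_ent_types and tail in pred_ent_types
            and pred_ent_types[head] == gold_ent_types[head]
            and pred_ent_types[tail] == gold_ent_types[tail])
```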

FYI, (Sanh 2019) also uses a different metric, and its scores are already not comparable to previous work, as pointed out in this issue.

(Bekoulis 2018) = "Joint entity recognition and relation extraction as a multi-head selection problem"

Best regards,

luanyi commented Nov 16, 2019

Thank you for your comments. The (Bekoulis 2018) we compared with is "Adversarial training for multi-context joint entity and relation extraction", not "Joint entity recognition and relation extraction as a multi-head selection problem".
In "Adversarial training for multi-context joint entity and relation extraction", they compare with Miwa & Bansal (2016), which follows exactly the same evaluation as (Li and Ji 2014), so I don't think there is any problem with our evaluation.

btaille commented Nov 18, 2019

Thank you for your answer, but I beg to differ. In "Adversarial training for multi-context joint entity and relation extraction", Bekoulis et al. also introduce the three evaluation settings ("Strict", "Boundaries" and "Relaxed") in Section 4, and on ACE04, for example, they report the Strict evaluation to compare with (Miwa and Bansal 2016). (Miwa and Bansal 2016) do compare with (Li and Ji 2014), but in my opinion this is a mistake, since they state in Section 4.1 that they consider the type of an entity.

I am not familiar with the SciERC and WLPC literature, but for the ACE datasets I am confident that most related works use the Strict evaluation setting.

luanyi commented Nov 18, 2019

Both SciERC and WLPC use span evaluation for relations.
For ACE04, in Section 4.1 of (Miwa and Bansal 2016) they say "We use the same data splits, preprocessing, and task settings as Li and Ji (2014)", so I think they are following the same setup as Li and Ji, which uses the same evaluation method as ours.
We will update our paper to make the evaluation metric clearer, and we hope people can make a fair comparison with our results. But since there is so much confusion in the previous literature, we do not have the bandwidth to verify each individual work. If you can verify with the authors of (Bekoulis 2018) that they indeed use a different evaluation metric, we will remove that comparison from our results table.

btaille commented Nov 18, 2019

I am still not convinced that (Miwa and Bansal 2016) used the same setting as you, and I am trying to get first-hand information on that.

For ACE05, they say: "We use the same data splits, preprocessing, and task settings as Li and Ji (2014) [...] We treat an entity as correct when its type and the region of its head are correct. We treat a relation as correct when its type and argument entities are correct".

And for ACE04: "We follow the cross-validation setting of Chan and Roth (2011) and Li and Ji (2014), and the preprocessing and evaluation metrics of ACE05."

I am positive that (Bekoulis 2018) used the Strict setting for their results, as they state and as one can see in their code. And I am very confident that (Li and Ji 2014) is the only work using your setting on the ACE datasets.

I agree that all of this is very confusing and, if anything, I thank you for releasing your code.
I will try to compute a "Strict" score with the model you released for ACE05.
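
Concretely, I would compute it along these lines (a rough sketch assuming gold and predicted relations can be collected as sets of (rel_type, head_span, tail_span, head_type, tail_type) tuples; these names are mine, not from your code):

```python
def precision_recall_f1(pred, gold):
    # pred / gold: sets of relation tuples under the chosen criterion, e.g.
    # (rel_type, head_span, tail_span, head_type, tail_type) for "Strict",
    # or the same tuples with the two *_type fields dropped for "Boundaries".
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```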

luanyi commented Nov 18, 2019

Since so many previous works on ACE are based on and compared with (Li and Ji 2014), I'm skeptical of the statement that "(Li and Ji 2014) is the only work using your setting on ACE datasets".
But I would appreciate it if you could let us know the performance of our model under the "Strict" score.

luanyi closed this as completed Nov 18, 2019
luanyi reopened this Nov 18, 2019

btaille commented Nov 20, 2019

I am trying to run your model, but it seems that the "glove.840B.300d.txt.filtered" file is missing for datasets other than genia and wlp. Could you kindly provide it, or are we supposed to compute it from glove.840B.300d.txt?
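
If computing it ourselves is the intended path, I assume the filtered file simply keeps the GloVe lines whose token appears in the dataset vocabulary, along these lines (my guess, not verified against your preprocessing):

```python
def filter_glove(glove_path, vocab, out_path):
    # Keep only the embedding lines whose first token occurs in the dataset vocabulary.
    with open(glove_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            token = line.split(" ", 1)[0]
            if token in vocab:
                f_out.write(line)

# Hypothetical usage:
# vocab = {token for doc in dataset_tokens for token in doc}
# filter_glove("glove.840B.300d.txt", vocab, "glove.840B.300d.txt.filtered")
```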
