Non standard Relation Extraction metric ? #5

Open
btaille opened this issue Nov 15, 2019 · 6 comments

btaille commented Nov 15, 2019

Hello,

In your paper, you only specify the criterion for considering an entity correct, not a relation.
From a quick look at your code in model1/relation_metrics.py, I understand that you consider a relation correct if its relation type is correct along with the spans of its two arguments.
That is, without considering the predicted entity types of the arguments.

If so, you are using what (Bekoulis 2018) refers to as the "Boundaries" evaluation setting.
You cannot directly compare with previous works that take the entity type into account, i.e. the "Strict" evaluation setting as defined by (Bekoulis 2018).
As far as I know, (Li and Ji 2014) is the only related work using this "Boundaries" evaluation setting.
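
To make the distinction concrete, here is a minimal sketch of the two matching criteria (my own toy representation, not taken from your code; I assume a relation is reduced to its type plus the two argument spans, with entity types kept in a separate mapping):

```python
# Hypothetical representations, for illustration only (not from this repo):
#   relation  = (rel_type, head_span, tail_span), with span = (start, end)
#   ent_types = {span: entity_type} for the predicted / gold entities

def boundaries_match(pred_rel, gold_rel):
    # "Boundaries": the relation type and both argument spans must be correct.
    return pred_rel == gold_rel

def strict_match(pred_rel, gold_rel, pred_ent_types, gold_ent_types):
    # "Strict": in addition, the predicted type of each argument must be correct.
    _, head, tail = pred_rel
    return (pred_rel == gold_rel
            and head in pred_ent_types and tail in pred_ent_types
            and pred_ent_types[head] == gold_ent_types[head]
            and pred_ent_types[tail] == gold_ent_types[tail])
```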

FYI, (Sanh 2019) also uses a different metric, and its scores are already not comparable to previous work, as pointed out in this issue.

(Bekoulis 2018) = "Joint entity recognition and relation extraction as a multi-head selection problem"

Best regards,

luanyi commented Nov 16, 2019

Thank you for your comments. The (Bekoulis 2018) we compared with is "Adversarial training for multi-context joint entity and relation extraction", not "Joint entity recognition and relation extraction as a multi-head selection problem".
In "Adversarial training for multi-context joint entity and relation extraction", they compare with Miwa & Bansal (2016), which follows exactly the same evaluation as (Li and Ji 2014), so I don't think there is any problem with our evaluation.

btaille commented Nov 18, 2019

Thank you for your answer, but I beg to differ. In "Adversarial training for multi-context joint entity and relation extraction", Bekoulis et al. also introduce the three evaluation settings ("Strict", "Boundaries" and "Relaxed") in Section 4, and on ACE04, for example, they report the Strict evaluation to compare with (Miwa and Bansal 2016). (Miwa and Bansal 2016) do compare with (Li and Ji 2014), but in my opinion this is a mistake, since they state in Section 4.1 that they consider the type of an entity.

I am not familiar with the SciERC and WLPC literature, but for the ACE datasets I am confident that most related works use the Strict evaluation setting.

luanyi commented Nov 18, 2019

Both SciERC and WLPC use span evaluation for relations.
For ACE04, in Section 4.1 of (Miwa and Bansal 2016) they say "We use the same data splits, preprocessing, and task settings as Li and Ji (2014)", so I think they are following the same setup as Li and Ji, which uses the same evaluation method as ours.
We will update our paper to make the evaluation metric clearer, and we hope people can make a fair comparison with our results. But since there is so much confusion in the previous literature, we do not have the bandwidth to verify each individual work. If you can verify with the authors of (Bekoulis 2018) that they indeed use a different evaluation metric, we will remove that comparison from our results table.

btaille commented Nov 18, 2019

I am still not convinced that (Miwa and Bansal 2016) used the same setting as you, and I am trying to get first-hand information on that.

For ACE05, they say: "We use the same data splits, preprocessing, and task settings as Li and Ji (2014) [...] We treat an entity as correct when its type and the region of its head are correct. We treat a relation as correct when its type and argument entities are correct".

And for ACE04: "We follow the cross-validation setting of Chan and Roth (2011) and Li and Ji (2014), and the preprocessing and evaluation metrics of ACE05."

I am positive that (Bekoulis 2018) used the Strict setting for their results, as they state and as one can see in their code. And I am very confident that (Li and Ji 2014) is the only work using your setting on the ACE datasets.

I agree that all of this is very confusing and, if anything, I thank you for releasing your code.
I will try to compute a "Strict" score with the model you released for ACE05.
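
Concretely, I would compute it along these lines (a rough sketch assuming gold and predicted relations can be collected as sets of (rel_type, head_span, tail_span, head_type, tail_type) tuples; these names are mine, not from your code):

```python
def precision_recall_f1(pred, gold):
    # pred / gold: sets of relation tuples under the chosen criterion, e.g.
    # (rel_type, head_span, tail_span, head_type, tail_type) for "Strict",
    # or the same tuples with the two *_type fields dropped for "Boundaries".
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```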

luanyi commented Nov 18, 2019

Since so many previous works on ACE are based on and compared with (Li and Ji 2014), I'm skeptical of the statement that "(Li and Ji 2014) is the only work using your setting on ACE datasets".
But I would appreciate it if you could let us know the performance of our model under the "Strict" score.

luanyi closed this as completed Nov 18, 2019
luanyi reopened this Nov 18, 2019

btaille commented Nov 20, 2019

I am trying to run your model, but it seems that the "glove.840B.300d.txt.filtered" file is missing for datasets other than genia and wlp. Could you kindly provide it, or are we supposed to compute it from glove.840B.300d.txt?
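
If computing it ourselves is the intended path, I assume the filtered file simply keeps the GloVe lines whose token appears in the dataset vocabulary, along these lines (my guess, not verified against your preprocessing):

```python
def filter_glove(glove_path, vocab, out_path):
    # Keep only the embedding lines whose first token occurs in the dataset vocabulary.
    with open(glove_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            token = line.split(" ", 1)[0]
            if token in vocab:
                f_out.write(line)

# Hypothetical usage:
# vocab = {token for doc in dataset_tokens for token in doc}
# filter_glove("glove.840B.300d.txt", vocab, "glove.840B.300d.txt.filtered")
```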
