Optimize RelationClassifier by adding the option to filter long sentences and truncate context #3593

alanakbik · 2025-01-02T21:25:05Z

The RelationClassifier considers each pair of entities in a Sentence separately for relation classification. If a sentence is long and contains many entities, the number of candidate pairs grows roughly quadratically with the number of entities, which leads to a very large number of forward passes.

This PR introduces two new parameters to RelationClassifier to optimize this behavior:

max_allowed_tokens_between_entities allows users to specify the maximum allowed distance (in tokens) between two entities for them to be considered a relation candidate. The idea is that entities that lie far apart in a text are unlikely to be related. All entity pairs with too many tokens in between are filtered out.
max_surrounding_context_length allows users to specify how much additional context (text around the two entities) is used for the relation classification. Setting this to a low value makes classification more computationally efficient, since shorter encoded sentences are used in the forward pass.
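To make the two parameters concrete, here is a minimal sketch of the idea, not the code from this PR. It assumes entities are represented as half-open token ranges `(start, end)`; the helper names `tokens_between`, `is_candidate` and `truncate_context` are hypothetical, and whether the real check is inclusive (`<=`) is an assumption.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end); an assumption of this sketch


def tokens_between(head: Span, tail: Span) -> int:
    """Count the tokens that lie strictly between two non-overlapping spans."""
    first, second = sorted([head, tail])
    return max(0, second[0] - first[1])


def is_candidate(head: Span, tail: Span, max_allowed_tokens_between_entities: int) -> bool:
    """Keep a pair only if the gap between the entities is small enough."""
    return tokens_between(head, tail) <= max_allowed_tokens_between_entities


def truncate_context(
    tokens: List[str], head: Span, tail: Span, max_surrounding_context_length: int
) -> List[str]:
    """Keep both entities plus at most N tokens of context on each side."""
    start = max(0, min(head[0], tail[0]) - max_surrounding_context_length)
    end = min(len(tokens), max(head[1], tail[1]) + max_surrounding_context_length)
    return tokens[start:end]


tokens = "Larry Page and Sergey Brin founded Google in a garage in Menlo Park".split()
page, google, park = (0, 2), (6, 7), (11, 13)
```

With these spans, the pair (page, park) has 9 tokens in between and would be dropped at a limit of 5, while (page, google) has 4 and would be kept.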

Further, the PR introduces a new sanity check to ensure that one entity in a pair does not contain the other.
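A containment check of this kind could look roughly as follows. This is an illustrative sketch only: the half-open `(start, end)` span representation, the helper names `contains` and `valid_pairs`, and the use of ordered pairs are assumptions, not taken from the PR.

```python
from itertools import permutations
from typing import Iterator, List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end); an assumption of this sketch


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies completely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def valid_pairs(entities: List[Span]) -> Iterator[Tuple[Span, Span]]:
    """Yield ordered (head, tail) pairs, skipping pairs where one entity contains the other."""
    for head, tail in permutations(entities, 2):
        if contains(head, tail) or contains(tail, head):
            continue
        yield head, tail
```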

To enable this, the _encode_sentence function now returns None if one of the criteria is not met. The _encode_sentence_for_inference and _encode_sentence_for_training methods are changed accordingly to check for None values before the yield.
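The control flow described above can be sketched like this. The functions `encode_pair` and `encode_for_inference` are hypothetical stand-ins for `_encode_sentence` and `_encode_sentence_for_inference`; the single-token entities and the `[H-…]`/`[T-…]` markers are simplifications made up for the example.

```python
from typing import Iterator, List, Optional, Tuple


def encode_pair(tokens: List[str], head: int, tail: int, max_distance: int) -> Optional[str]:
    """Return the marked-up sentence, or None if the pair fails the distance criterion."""
    if abs(head - tail) - 1 > max_distance:
        return None
    marked = list(tokens)
    marked[head] = f"[H-{tokens[head]}]"
    marked[tail] = f"[T-{tokens[tail]}]"
    return " ".join(marked)


def encode_for_inference(
    tokens: List[str], pairs: List[Tuple[int, int]], max_distance: int
) -> Iterator[Tuple[str, Tuple[int, int]]]:
    """Yield only the pairs that were actually encoded; rejected pairs are skipped."""
    for head, tail in pairs:
        encoded = encode_pair(tokens, head, tail, max_distance)
        if encoded is not None:  # check for None before the yield
            yield encoded, (head, tail)
```

Returning None from the encoder keeps the filtering logic in one place, and the callers only need a single `is not None` check.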

dobbersc · 2025-01-07T17:57:59Z

flair/models/relation_classifier_model.py

+        max_allowed_tokens_between_entities: int = 20,
+        max_surrounding_context_length: int = 10,


Maybe it would make sense to also allow None values to disable the measures. This could then also be used as the default value for _init_model_with_state_dict to not change the behaviour of existing models.
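As a rough illustration of this suggestion (the helper name `within_limit` is hypothetical and not part of the PR), an `Optional[int]` limit can treat None as "no limit":

```python
from typing import Optional


def within_limit(value: int, limit: Optional[int]) -> bool:
    """Check a value against a limit; a limit of None disables the check."""
    return limit is None or value <= limit
```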

dobbersc · 2025-01-07T18:04:47Z

flair/models/relation_classifier_model.py

+            max_allowed_tokens_between_entities=state.get("max_allowed_tokens_between_entities", 25),
+            max_surrounding_context_length=state.get("max_surrounding_context_length", 50),


The default parameters for backwards compatibility are different from the ones in the `__init__` method. Are these a better fit?

Backwards compatibility is tricky, since older models will have no limitations on max allowed tokens or surrounding context. It's probably best to set really high numbers here (e.g., even higher).
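One way to sidestep picking "really high numbers" is to combine this with the None suggestion from the earlier review comment: `dict.get` returns None for a missing key, so a model saved before this PR would load with both filters disabled. The sketch below assumes a plain dictionary as the saved state; the function name `read_filter_settings` is hypothetical.

```python
from typing import Any, Dict, Optional


def read_filter_settings(state: Dict[str, Any]) -> Dict[str, Optional[int]]:
    """Read the two filter settings; keys missing in older states default to None."""
    return {
        "max_allowed_tokens_between_entities": state.get("max_allowed_tokens_between_entities"),
        "max_surrounding_context_length": state.get("max_surrounding_context_length"),
    }


# Hypothetical saved states: one from before the new parameters existed, one after.
old_state = {"label_type": "relation"}
new_state = {
    "label_type": "relation",
    "max_allowed_tokens_between_entities": 20,
    "max_surrounding_context_length": 10,
}
```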

dobbersc · 2025-01-07T18:09:47Z

flair/models/relation_classifier_model.py

+        head_idx = -10000
+        tail_idx = 10000


For safety and as a sanity check, we could also initialize these values as None and have an assertion after the for token in original_sentence loop that these variables should not be None.

I had that before, but that gave me mypy errors. But I agree that this is better.
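For reference, a small sketch of the pattern under discussion, with a hypothetical function `find_head_and_tail` and single-token entities as simplifying assumptions. Annotating the variables as `Optional[int]` and asserting after the loop serves as the sanity check, and the assertions also let mypy narrow the type to `int`, which usually resolves this kind of error.

```python
from typing import List, Optional, Tuple


def find_head_and_tail(tokens: List[str], head: str, tail: str) -> Tuple[int, int]:
    """Locate the head and tail entity tokens and fail loudly if either is missing."""
    head_idx: Optional[int] = None
    tail_idx: Optional[int] = None
    for i, token in enumerate(tokens):
        if token == head:
            head_idx = i
        elif token == tail:
            tail_idx = i
    # Sanity check; the asserts also narrow Optional[int] to int for the type checker.
    assert head_idx is not None, "head entity not found in sentence"
    assert tail_idx is not None, "tail entity not found in sentence"
    return head_idx, tail_idx
```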

dobbersc · 2025-01-22T00:49:32Z

I've incorporated the suggestions and also added a test case for the new parameters.

alanakbik added 9 commits January 2, 2025 05:59

Optimize RelationClassifier by filtering long sentences

fc786b3

Optimize RelationClassifier by filtering long sentences

594d858

Fix serialization

8fc8a58

Change context window calculation

1fd1851

Change context window calculation

7f89bb0

Add sanity check to ensure entities are not contained in one another

70148da

Fix slicing such that left and right context are of equal length

f50c3b3

Make mypy happy

142703b

Remove unnecessary if statement

3ad499b

alanakbik requested a review from dobbersc January 2, 2025 21:43

dobbersc reviewed Jan 9, 2025

alanakbik and others added 9 commits January 11, 2025 16:28

Merge branch 'master' into filter_relations

f798a3c

Ensure presence of head and tail entity in the original sentence

0cca6a8

Refactor _slice_encoded_sentence_to_max_allowed_length to use min/max operations instead of if-statements

4ed2e49

Allow to disable the `max_allowed_tokens_between_entities` and `max_surrounding_context_length` filter for backwards compatibility

306412f

Rearrange parameters and make sentence filters public

4fc4878

Add test cases for `max_allowed_tokens_between_entities` and `max_surrounding_context_length` parameters

de8b7f4

Merge branch 'master' into filter_relations

6789a6a

Fix tests due to additional training data point in train.conllup

a06fa30

Merge remote-tracking branch 'origin/filter_relations' into filter_relations

3ce22f1
