Sub-pattern labeling for pattern matcher #3275

cyclecycle · 2019-02-14T10:01:32Z

cyclecycle
Feb 14, 2019

My problem is that I would like to match patterns based on linguistic annotations, and then label certain tokens within the match, depending on which token_specs those matched tokens correspond to. I therefore need to know which tokens in the match span correspond to which token_specs in the provided patterns. For patterns which don't use operators, this is easy, as the match will be the same length as the pattern, so a corresponding one-to-one list of labels would suffice. For token_specs with operators, it is more complicated, as the number of corresponding tokens in the match is variable.

N.B. Which tokens get which labels is dependent upon their position in a given pattern, so the labeling is tightly coupled to the pattern matching in this context.

I see two solutions:

- Alter the code in matcher.pyx so that, for each token added to each open match, the current token_spec is also stored.
- Reproduce the pattern matching logic in a callback in order to map tokens in the match to token_specs in the pattern.

Am I missing an easier solution? I prefer the former option, so I could fork and attempt to make a satisfactory change. I have no idea whether such a feature would merit its cost for the broader community.

Thanks,

Nick

ines · 2019-02-14T11:20:49Z

ines
Feb 14, 2019
Maintainer

Hi! Do you have an example of a pattern and token spec / subpattern? Just so I can make sure I understand the question and use case correctly.

It does sound like something that should be possible with the current API and without hacking around too much. Maybe you could use two matchers for this? For instance, if pattern A is matched, you could call .as_doc() on the matched span, and then match it again with your other token spec matcher?
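A rough sketch of the two-matcher idea follows. The sentence, the pattern names `RELATION` and `PRED`, and both patterns are made up for illustration, and the code assumes the spaCy v3 `Matcher.add(key, patterns)` signature, which differs from the API available when this was posted.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Tokens are given pre-split so the example does not depend on tokenizer rules.
doc = Doc(nlp.vocab, words=["The", "action", "of", "x", "on", "y", "."])

# First matcher: finds the whole relation span (hypothetical pattern).
outer = Matcher(nlp.vocab)
outer.add("RELATION", [[{"LOWER": "action"}, {"LOWER": "of"}, {}, {"LOWER": "on"}, {}]])

# Second matcher: finds a sub-pattern inside an already matched span.
inner = Matcher(nlp.vocab)
inner.add("PRED", [[{"LOWER": "action"}, {"LOWER": "of"}]])

results = []
for _, start, end in outer(doc):
    sub_doc = doc[start:end].as_doc()  # copy the matched span into its own Doc
    for match_id, s, e in inner(sub_doc):
        results.append((nlp.vocab.strings[match_id], sub_doc[s:e].text))

print(results)
```

Note that offsets returned by the second matcher are relative to the copied span, not to the original doc.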

0 replies

cyclecycle · 2019-02-14T14:06:27Z

cyclecycle
Feb 14, 2019
Author

Hi ines, thanks a lot for your response.

For example, let's say I have the text:

The action of x on y.

And I want to extract a relationship in the form of:

predicate: 'action of'
argument1: 'x'
argument2: 'y'

One approach is to use the pattern:

pattern = [{'LOWER': 'action'}, {'LOWER': 'of'}, {}, {'LOWER': 'on'}, {}]

to match the text. Now I want to assign the appropriate labels (pred, arg1, arg2, none) to the tokens in the returned match span.

In this example, I could simply use a list of labels that corresponds one-to-one with this pattern, i.e., ['pred', '-', 'arg1', 'pred', 'arg2'], and zip that with the match span to get the desired output.
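A minimal sketch of that one-to-one zip, assuming the spaCy v3 `Matcher.add(key, patterns)` signature. The match name `RELATION` is made up, and the tokens are supplied pre-split so the example does not depend on tokenizer rules.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["The", "action", "of", "x", "on", "y", "."])

pattern = [{"LOWER": "action"}, {"LOWER": "of"}, {}, {"LOWER": "on"}, {}]
labels = ["pred", "-", "arg1", "pred", "arg2"]  # one label per token spec

matcher = Matcher(nlp.vocab)
matcher.add("RELATION", [pattern])

labeled = []
for _, start, end in matcher(doc):
    span = doc[start:end]
    # Without operators the span has exactly one token per token spec,
    # so a plain zip lines tokens up with their labels.
    labeled = [(token.text, label) for token, label in zip(span, labels)]

print(labeled)
```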

However, in patterns with operators, you don't know what the shape of the match span is ahead of time, so this approach doesn't work. One approach would be to provide the labeling information in the token_spec dict like:

pattern = [{'LOWER': 'action', 'label': 'pred'}, {'LOWER': 'of'}, {'label': 'arg1'}, {'LOWER': 'on', 'label': 'pred'}, {'label': 'arg2'}]

and then when you get the token_spec dict back for each token in the match you would know what its label should be. But I don't currently see a way to get the token_spec back for each matched token.
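Since 'label' is not one of the documented token pattern keys, one option is to keep it out of the spec that is handed to the matcher and carry it in a parallel list instead. The helper `split_labels` and the default label '-' below are hypothetical names for illustration only.

```python
def split_labels(labeled_pattern, default="-"):
    """Separate the illustrative 'label' key from each token spec.

    Returns a pattern without 'label' keys and a parallel list of labels.
    """
    pattern, labels = [], []
    for spec in labeled_pattern:
        spec = dict(spec)  # copy so the caller's dicts are not mutated
        labels.append(spec.pop("label", default))
        pattern.append(spec)
    return pattern, labels


labeled_pattern = [
    {"LOWER": "action", "label": "pred"},
    {"LOWER": "of"},
    {"label": "arg1"},
    {"LOWER": "on", "label": "pred"},
    {"label": "arg2"},
]
pattern, labels = split_labels(labeled_pattern)
print(pattern)
print(labels)
```

This only separates the two pieces of information. Mapping each matched token back to its token spec still needs support from the matcher itself.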

I hope that is clearer. I'm not sure that the title is a good description of the solution I'm seeking.

0 replies

honnibal · 2019-02-27T14:04:01Z

honnibal
Feb 27, 2019
Maintainer

I think I get what you mean --- sort of like the ?: grouping in a regex, right?

I don't think we'll be able to support this in the current matcher, though. At least, I don't see a simple way to do it, and the code is already fairly complicated.

0 replies

cyclecycle · 2019-03-09T11:20:39Z

cyclecycle
Mar 9, 2019
Author

Yes, that's it. Okay thanks for the response. I've been able to achieve the results I was looking for with use of the DependencyMatcher. Cheers.
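For readers wondering what this might look like, here is a hedged sketch using the DependencyMatcher API as it exists in spaCy v3 (`RIGHT_ID`, `RIGHT_ATTRS`, `LEFT_ID`, `REL_OP`), which is not the API that was available at the time of this comment. The dependency parse is hand-written rather than predicted by a model, and the node names are made up.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Hand-annotated parse (an assumption, not model output); heads are absolute indices.
doc = Doc(
    nlp.vocab,
    words=["The", "action", "of", "x", "on", "y"],
    heads=[1, 1, 1, 2, 1, 4],
    deps=["det", "ROOT", "prep", "pobj", "prep", "pobj"],
)

# Each node gets a name via RIGHT_ID, which doubles as its label.
pattern = [
    {"RIGHT_ID": "pred", "RIGHT_ATTRS": {"LOWER": "action"}},
    {"LEFT_ID": "pred", "REL_OP": ">", "RIGHT_ID": "of", "RIGHT_ATTRS": {"LOWER": "of"}},
    {"LEFT_ID": "of", "REL_OP": ">", "RIGHT_ID": "arg1", "RIGHT_ATTRS": {"DEP": "pobj"}},
    {"LEFT_ID": "pred", "REL_OP": ">", "RIGHT_ID": "on", "RIGHT_ATTRS": {"LOWER": "on"}},
    {"LEFT_ID": "on", "REL_OP": ">", "RIGHT_ID": "arg2", "RIGHT_ATTRS": {"DEP": "pobj"}},
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("RELATION", [pattern])

match_id, token_ids = matcher(doc)[0]
# token_ids follows the order of the pattern nodes, so the two can be zipped.
roles = {node["RIGHT_ID"]: doc[i].text for node, i in zip(pattern, token_ids)}
print(roles)
```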

0 replies

ines · 2019-03-09T11:46:03Z

ines
Mar 9, 2019
Maintainer

@cyclecycle Ah, cool! We still need to document the DependencyMatcher, so if you have any cool real-world examples of patterns you can share, that'd be super helpful 🙂

0 replies

fabio-reale · 2019-05-16T13:16:12Z

fabio-reale
May 16, 2019

> Yes, that's it. Okay thanks for the response. I've been able to achieve the results I was looking for with use of the DependencyMatcher. Cheers.

I have a similar problem. As of this moment, I somewhat followed the advice @ines gave. I avoided .as_doc(), though, because that was causing problems. So, I'm retokenizing the matched text, which is also causing problems, but less so.

Anyway, I don't think a DependencyMatcher would solve my issues, but I can't find it in the documentation. I would appreciate a link to it and/or any recommendations for my own problem.

Thanks

0 replies

chozelinek · 2019-10-02T11:58:28Z

chozelinek
Oct 2, 2019

Would something like a capturing group in Python regex be difficult to implement? That would be a starting point. I would find it terribly useful to use patterns with wildcards and only keep the parts of the match that are interesting.
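For comparison, this is what named capturing groups look like in Python's standard `re` module on a plain string. The sentence and the group names are illustrative, and this operates on characters rather than on spaCy tokens.

```python
import re

# Named groups mark the interesting parts of the match; the literal " on "
# is required for the match but is not captured.
relation = re.compile(r"(?P<pred>action of) (?P<arg1>\w+) on (?P<arg2>\w+)")

m = relation.search("The action of x on y.")
groups = m.groupdict()
print(groups)
```

The request in this thread is essentially the token-level equivalent of `groupdict()`.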

0 replies

lingvisa · 2020-03-10T16:51:00Z

lingvisa
Mar 10, 2020

I have experience working with large-scale commercial rule systems, and I know this is a must-have feature that makes a rule engine far more useful. For both linear and dependency grammars, there is a strong need to access any token in a matched span. A simple and intuitive example could be:
[{"TEXT": "US"}, {"TEXT": "is"}, {"TEXT": "United States"}]
This rule is placed in a JSON or XML file. I want to extract 'same_as' relations from text:
US same_as United States
Then the caller has a way to get these two tokens/chunks back:
`US = ^head`
`United States = ^tail`

The syntax could vary, but the need here is clear. This may require a redesign of the rule matcher, building it on a finite state machine for efficient execution.

0 replies

genemishchenko · 2020-05-01T21:43:25Z

genemishchenko
May 1, 2020

Even something as basic as a "labeling opt out flag" for a token in a token-based pattern for EntityRuler would be extremely helpful.

This should not be as complicated as dynamic capture-group-based matching or specifying different target labels for different tokens in the pattern (like in Stanford TokensRegex, for instance).

Is there a good solution or a simple workaround for a simple use case like this?
(besides the obvious workaround of using the Matcher and then writing a separate "on_match" function for each pattern)
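For reference, the `on_match` workaround mentioned above might look like the sketch below, assuming the spaCy v3 `Matcher.add(key, patterns, on_match=...)` signature. The pattern, the `PERSON` label, and the callback name are made up, and the offset of one token is hard-coded, which is exactly why a separate callback is needed per pattern.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["I", "met", "Dr.", "Smith", "today"])


def label_without_title(matcher, doc, i, matches):
    """Add an entity for the match, skipping its first token (the title)."""
    _, start, end = matches[i]
    entity = Span(doc, start + 1, end, label="PERSON")
    doc.ents = list(doc.ents) + [entity]


matcher = Matcher(nlp.vocab)
matcher.add(
    "TITLED_PERSON",
    [[{"LOWER": "dr."}, {"IS_TITLE": True}]],
    on_match=label_without_title,
)
matcher(doc)  # the callback runs once per match

ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```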

0 replies

courtarro · 2021-03-23T15:57:45Z

courtarro
Mar 23, 2021

Just my two cents ... this is a huge element of rule matching that is missing from spaCy. Every other NLP toolkit I've played with that supports pattern matching also supports named capture as part of that matching, for example Semgrex in Stanford CoreNLP. In my use cases, this is the main purpose I have for those rules. Identifying the overall match for a rule has value, but identifying the components of that rule match is vital, in my opinion. The absence of this feature is the main reason I can't use spaCy for most of the NLP projects I have at work.

I don't mean to sound demanding. SpaCy is open-source, I know! And it's great for a lot of stuff. I only want the best for spaCy. So please think of my request not as a demand or a statement of entitlement, but rather as being an observation of how to maximize the value of spaCy for more users and in more applications.

0 replies

adrianeboyd · 2021-03-24T07:06:47Z

adrianeboyd
Mar 24, 2021

There is an upcoming user contribution that adds optional alignments between tokens in the match and the token dicts in the pattern, so you know which part of the pattern matched each token. It's not exactly labeled subgroups, but in practice it's very close: #7321

We plan to include this in spaCy v3.1.
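A hedged sketch of how such alignments can be used for labeling, assuming a spaCy version in which `Matcher.__call__` accepts `with_alignments=True` and returns, for each match, a list with one pattern index per matched token. The sentence, the match name, and the `labels` list are illustrative.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["The", "action", "of", "compound", "x", "on", "y", "."])

# The third token spec uses an operator, so the match length is not fixed.
pattern = [
    {"LOWER": "action"},
    {"LOWER": "of"},
    {"IS_ALPHA": True, "OP": "+"},
    {"LOWER": "on"},
    {},
]
labels = ["pred", "-", "arg1", "pred", "arg2"]  # one label per token spec

matcher = Matcher(nlp.vocab)
matcher.add("RELATION", [pattern])

labeled_matches = []
for match_id, start, end, alignments in matcher(doc, with_alignments=True):
    # alignments[k] is the index of the token spec that matched doc[start + k].
    labeled = [(tok.text, labels[idx]) for tok, idx in zip(doc[start:end], alignments)]
    labeled_matches.append(labeled)

print(labeled_matches[0])
```

With operators at the end of a pattern, the matcher can return several overlapping matches, so a real application would also need to choose among them.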

0 replies

vquilon · 2023-05-19T11:30:10Z

vquilon
May 19, 2023

It would be very useful if the rule matcher could add labels or properties to specific matched tokens inside the matched Span. For example, I want to detect persons while also knowing which parts are the name and the surname:
-> "I hate Mr. Sr. John Braveheart"
pattern = [{"LABEL": "TREATMENT", "ENT_TYPE": "PER", "OP": "+"}, {"LABEL": "GN", "ENT_TYPE": "PER"}, {"LABEL": "FN", "ENT_TYPE": "PER"}]
pseudo_result = [
    Span("Mr. Sr. John Braveheart",
        matches=[
            {"orth": "Mr.", "label": "TREATMENT"},
            {"orth": "Sr.", "label": "TREATMENT"},
            {"orth": "John", "label": "GN"},
            {"orth": "Braveheart", "label": "FN"}
        ]
    )
]

1 reply

adrianeboyd May 22, 2023

Thanks for the suggestion! I really like the idea, but on the technical side I can't immediately think of a way to store token-level labels in the returned spans, since the spans are just a view of the doc tokens plus some span-level attributes like label, kb_id, etc.

It's clearly a feature that many users are interested in, so we'll try to keep thinking about how we could implement something like this...
