Sub-pattern labeling for pattern matcher #3275
Replies: 12 comments 1 reply
-
Hi! Do you have an example of a pattern and token spec / subpattern? Just so I can make sure I understand the question and use case correctly. It does sound like something that should be possible with the current API and without hacking around too much. Maybe you could use two matchers for this? For instance, if pattern A is matched, you could call |
-
Hi ines, thanks a lot for your response. For example, let's say I have the text:
And I want to extract a relationship in the form of: predicate: 'action of'
One approach is to use the pattern:
to match the text. Now I want to assign the appropriate labels (pred, arg1, arg2, none) to the tokens in the returned match span. In this example, I could simply use a list of labels that corresponds one-to-one with this pattern, i.e.,
However, in patterns with operators, you don't know the shape of the match span ahead of time, so this approach doesn't work. One approach would be to provide the labeling information in the token_spec dict like:
and then, when you get the token_spec dict back for each token in the match, you would know what its label should be. But I don't currently see a way to get the token_spec back for each matched token. I hope that is clearer. I'm not sure that the title is a good description of the solution I'm seeking.
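To make the one-to-one idea concrete, here is a minimal sketch in plain Python with no spaCy calls. The helper `label_one_to_one`, the token lists, and the label list are all hypothetical and chosen only for illustration; they are not the example from my original text. The sketch shows why a fixed label list works for operator-free patterns and stops working once a match can vary in length.

```python
from typing import List, Tuple


def label_one_to_one(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Pair each matched token with the label at the same position.

    Only valid when the match has exactly one token per pattern entry.
    """
    if len(tokens) != len(labels):
        raise ValueError(
            f"{len(tokens)} matched tokens but {len(labels)} labels; "
            "the pattern probably contains an operator"
        )
    return list(zip(tokens, labels))


# Hypothetical labels for a four-entry pattern without operators.
labels = ["none", "pred", "pred", "arg1"]

# A match of the same length as the pattern: labeling is trivial.
fixed_match = ["the", "action", "of", "insulin"]
labeled = label_one_to_one(fixed_match, labels)

# If one entry used an operator such as "*", the match could be longer,
# and the positional label list no longer lines up.
variable_match = ["the", "main", "action", "of", "insulin"]
try:
    label_one_to_one(variable_match, labels)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The failure in the second case is the whole problem: without knowing which pattern entry consumed the extra token, there is no way to decide which label it should get.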
-
I think I get what you mean --- sort of like the I don't think we'll be able to support this in the current matcher, though. At least, I don't see a simple way to do it, and the code is already fairly complicated. |
-
Yes, that's it. Okay, thanks for the response. I've been able to achieve the results I was looking for with use of the DependencyMatcher. Cheers.
-
@cyclecycle Ah, cool! We still need to document the |
-
I have a similar problem. So far, I have more or less followed the advice @ines gave. I avoided .as_doc(), though, because that was causing problems. Instead, I'm retokenizing the matched text, which also causes problems, but fewer. Anyway, I don't think a DependencyMatcher would solve my issues, but I can't find it in the documentation. I would appreciate a link to it and/or any recommendations for my own problem. Thanks
-
Would something like a capturing group in Python regex be difficult to implement? That would be a starting point. I would find it terribly useful to use patterns with wildcards and only keep the parts of the match that are interesting.
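For reference, this is the behavior I have in mind, sketched with Python's standard `re` module on a plain string. It is only an analogy: the sentence, the group names `arg1` and `arg2`, and the regex itself are made up for illustration, and nothing here is a spaCy API.

```python
import re

# Named groups mark the parts worth keeping; the non-capturing group in the
# middle acts as a wildcard whose content is matched but then discarded.
pattern = re.compile(r"(?P<arg1>\w+) is the (?:\w+ )*?action of (?P<arg2>\w+)")

text = "Glycolysis is the main metabolic action of cells"
match = pattern.search(text)

# groupdict() returns only the named parts of the match.
captured = match.groupdict() if match else {}
print(captured)
```

The wildcard words ("main metabolic") take part in the match but do not show up in `captured`, which is exactly what I would like to be able to do with token patterns.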
-
I have experience working with large-scale commercial rule systems and know that this is a must-have feature for making a rule engine really useful. For both linear and dependency grammars, there is a strong need to access any token in a matched span. A simple and intuitive example could be: United States = ^tail` The syntax could vary, but the need here is clear. This may require a redesign of the rule matcher, building it on a finite state machine for efficient execution.
-
Even something as basic as a "labeling opt-out flag" for a token in a token-based pattern for EntityRuler would be extremely helpful. This should not be as complicated as dynamic capture-group-based matching or specifying different target labels for different tokens in the pattern (like in Stanford TokensRegex, for instance). Is there a good solution or a simple workaround for a simple use case like this?
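To illustrate what such a flag could do, here is a rough sketch in plain Python. It assumes that, for each matched token, you can somehow obtain the index of the pattern entry it matched; that information, the `keep` flags, and the helper `trim_to_labeled` are all hypothetical and not something EntityRuler provides today.

```python
from typing import List, Optional, Tuple


def trim_to_labeled(
    start: int,
    alignments: List[int],
    keep: List[bool],
) -> Optional[Tuple[int, int]]:
    """Return (start, end) token offsets covering only the tokens to label.

    start:      offset of the first matched token in the document
    alignments: for each matched token, the index of its pattern entry
    keep:       for each pattern entry, False means "opted out of labeling"
    """
    kept = [start + i for i, idx in enumerate(alignments) if keep[idx]]
    if not kept:
        return None
    # Assumes the kept tokens are contiguous; otherwise the returned range
    # also covers the opted-out tokens that sit between them.
    return kept[0], kept[-1] + 1


# Hypothetical two-entry pattern: entry 0 is context only, entry 1 (with an
# operator) is the part that should receive the entity label.
keep = [False, True]
trimmed = trim_to_labeled(start=3, alignments=[0, 1, 1], keep=keep)
```

Here the match covers tokens 3 to 5, but only tokens 4 and 5 would end up in the labeled span.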
-
Just my two cents ... this is a huge element of rule matching that is missing from spaCy. Every other NLP toolkit I've played with that supports pattern matching also supports named capture as part of that pattern matching. Semgrex in Stanford CoreNLP, for example. In my use cases, this is the main purpose I have for those rules. Identifying the overall match for a rule has value, but identifying the components of that rule match is vital, in my opinion. This missing feature is the main reason I can't use spaCy for most of the NLP projects I have at work. I don't mean to sound demanding. spaCy is open-source, I know! And it's great for a lot of stuff. I only want the best for spaCy. So please think of my request not as a demand or a statement of entitlement, but rather as an observation of how to maximize the value of spaCy for more users and in more applications.
-
There is an upcoming user contribution that adds optional alignments between tokens in the match and the token dicts in the pattern, so you know which part of the pattern matched each token. It's not exactly labeled subgroups, but in practice it's very close: #7321 We plan to include this in spaCy v3.1.
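As a rough sketch of how the alignments can stand in for labeled subgroups: if each matched token comes with the index of the pattern token dict it matched, a per-pattern-entry label list is enough to label every token, even when operators make the match length variable. The helper `labels_from_alignments` and the sample data below are hypothetical, written in plain Python so they run without spaCy. My understanding is that the alignments are requested with `with_alignments=True` when calling the matcher, but please treat that as an assumption and check the v3.1 docs once they are out.

```python
from typing import List, Tuple


def labels_from_alignments(
    tokens: List[str],
    alignments: List[int],
    pattern_labels: List[str],
) -> List[Tuple[str, str]]:
    """Label each matched token via the pattern entry it is aligned to."""
    if len(tokens) != len(alignments):
        raise ValueError("expected one alignment index per matched token")
    return [(tok, pattern_labels[idx]) for tok, idx in zip(tokens, alignments)]


# Hypothetical three-entry pattern; entry 0 is assumed to use the "+" operator,
# so it can consume more than one token.
pattern_labels = ["arg1", "pred", "arg2"]

# Hypothetical match: two tokens aligned to entry 0, then one each to 1 and 2.
tokens = ["New", "York", "borders", "Connecticut"]
alignments = [0, 0, 1, 2]

labeled = labels_from_alignments(tokens, alignments, pattern_labels)
```

Because the label lookup goes through the pattern index rather than the token position, the same label list works for matches of any length.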
-
It would be very useful if the RuleMatcher could add labels or properties to specific matched tokens inside the matched Span. For example, I want to detect persons, but with the name and surname identified as separate parts:
-
My problem is that I would like to match patterns based on linguistic annotations, and then label certain tokens within the match, depending on which token_specs those matched tokens correspond to. I therefore need to know which tokens in the match span correspond to which token_specs in the provided patterns. For patterns which don't use operators, this is easy, as the match will be the same length as the pattern, so a corresponding one-to-one list of labels would suffice. For token_specs with operators, it is more complicated, as the number of corresponding tokens in the match is variable.
N.B. Which tokens get which labels is dependent upon their position in a given pattern, so the labeling is tightly coupled to the pattern matching in this context.
I see two solutions:
Am I missing an easier solution? I prefer the former option, so I could fork and attempt to make a satisfactory change. I have no idea whether such a feature would merit its cost for the broader community.
Thanks,
Nick