Merging a noun_chunk slice for Hearst Pattern Detection #5450
-
How to reproduce the behaviour

I'm attempting to implement the code from this repository using the spaCy Matcher in place of regex, and I am having problems with the retokenizer when merging noun_chunks. The overall problem is to separate out modifier terms such as "other" and "some other": they are normally included within the span of a noun_chunk, but they need to be kept separate because such terms are predicates for particular Hearst patterns. The following code has been written to address this problem:
`
`
Do you know what the problem here is?
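For illustration, the modifier-stripping step described above might be sketched as follows in plain Python, with token texts standing in for spaCy tokens; the MODIFIERS list and strip_modifiers function are hypothetical names, not the code from the original post:

```python
# Hypothetical sketch: separate Hearst-pattern predicates such as
# "other" / "some other" from the front of a noun chunk.
MODIFIERS = [("some", "other"), ("other",)]  # longest patterns first

def strip_modifiers(chunk_tokens):
    """Return (modifier, remainder) for a chunk given as a list of token texts."""
    lowered = [t.lower() for t in chunk_tokens]
    for mod in MODIFIERS:
        if tuple(lowered[:len(mod)]) == mod:
            return list(chunk_tokens[:len(mod)]), list(chunk_tokens[len(mod):])
    return [], list(chunk_tokens)

print(strip_modifiers(["other", "European", "countries"]))
# → (['other'], ['European', 'countries'])
```

In real code the remainder would then be re-expressed as a doc-level Span before being handed to the retokenizer.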
Replies: 6 comments
-
Sorry to hear you're running into trouble. To help us investigate what may be going on, could you provide a minimal working code snippet that we can run and that exhibits the errors you're getting? You can make the code above self-contained by adding an example text. It would help to be able to execute this, as then we can also access the error stack trace.
-
I suspect the problem is related to trying to retokenize a 0-length span, in effect something like this:
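A hedged reconstruction of how such a span can arise, using plain (start, end) token indices in place of a spaCy Span (the numbers are made up):

```python
# Suppose a noun chunk covers doc tokens 3..4 and consists entirely of the
# two-token modifier "some other"; stripping the modifier leaves nothing.
chunk_start, chunk_end = 3, 5      # half-open interval, like a spaCy Span
n_modifier_tokens = 2

merged_start = chunk_start + n_modifier_tokens
merged_end = chunk_end
print(merged_end - merged_start)   # → 0: a zero-length span, e.g. doc[5:5]
```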
The retokenizer should reject this but doesn't, and the resulting doc is malformed.
-
Thank you both for the prompt feedback, I'll take a look at the 0-length span problem.
-
Thank you both, that has actually solved the problem: it was trying to retokenize a span of zero length. For the phrase 'one kind', the loop was triggered first by the predicate, which left a zero-length slice to merge. I may have uncovered a second bug with merge_noun_chunks, and will post that as a separate issue.
-
Sorry, while I thought this was fixed, there seems to be a problem when trying to merge a slice of an existing noun chunk. The code has been modified as follows:
`
`
The problem is happening at the merge call. While it is possible to create a custom attribute containing the filtered spans, I need the noun_chunk spans to be merged within the doc itself. Do you have any ideas as to where I'm going wrong here?
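As a hypothetical sketch of the slice arithmetic involved (index pairs standing in for spaCy Span objects; the numbers and the guard are illustrative, not the original code):

```python
# A noun chunk covering doc tokens 5..8, e.g. "some other European countries",
# from which the two leading modifier tokens should be excluded before merging.
chunk = (5, 9)            # half-open (start, end), like a spaCy Span
n_modifiers = 2

slice_span = (chunk[0] + n_modifiers, chunk[1])
print(slice_span)         # → (7, 9): the doc-level slice to merge
# Refuse zero-length results before ever calling retokenizer.merge():
assert slice_span[1] > slice_span[0]
```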
-
Solved: the problem indeed is merging a noun_chunk slice of zero length. Have developed the following to prevent zero-length chunks:
`
`
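The guard being described might look something like the following sketch; (start, end) index pairs stand in for spaCy Span objects, and drop_empty_spans is an illustrative name rather than the author's actual code:

```python
def drop_empty_spans(spans):
    """Keep only spans whose end index is strictly greater than their start,
    so retokenizer.merge() is never handed a zero-length span."""
    return [(start, end) for (start, end) in spans if end > start]

candidates = [(0, 2), (3, 3), (4, 7)]    # (3, 3) is a zero-length slice
print(drop_empty_spans(candidates))      # → [(0, 2), (4, 7)]
```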