Several questions when trying to get the start/end token indices of a span given its character offset #3304
-
Hi, I am trying to get the start/end token indices of a span given its character offset. I have searched for some posts and solutions (#1264), but they don't work in my case. I have a dataset with the following format:
A_offset is the character offset of the start of span A in Text. Now I want to get the start and end token indices of span A (or B, or Pronoun) in the doc after processing Text with spaCy.

First case:
First, suppose we want to get the inclusive token indices of span A. The returned span doesn't give me what I expect, because the tokenizer can't split the text at the span boundary.
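A minimal sketch of this kind of lookup, assuming the span is fetched with `Doc.char_span` (the text, offsets and model name here are made-up placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Bob told Alice that he would call her later."
a_offset, a_text = 9, "Alice"   # placeholders standing in for A_offset / A

doc = nlp(text)
# char_span maps character offsets to a token span; it returns None when the
# offsets don't line up with spaCy's token boundaries (in spaCy v3+ you can
# pass alignment_mode="expand" to snap to the surrounding tokens instead).
span = doc.char_span(a_offset, a_offset + len(a_text))
if span is not None:
    start, end = span.start, span.end - 1   # inclusive token indices
    print(start, end, span.text)
```

When the character offsets fall inside a token, `char_span` returns `None`, which is where the tokenization mismatch discussed below comes in.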
Second case: inconsistent split. Here I insert a mark before each span and then calculate the token indices of the span from the position of the mark (a rough sketch of the idea follows).
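A rough sketch of the marker idea, with made-up text, offset and marker string:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Bob told Alice that he would call her later."
a_offset = 9                    # made-up character offset of span A ("Alice")

# Insert a rare marker string in front of the span, tokenize the marked text,
# and use the marker token's position to locate the span.
marker = "SPANSTART"
marked_text = text[:a_offset] + marker + " " + text[a_offset:]
doc = nlp(marked_text)

marker_idx = next(i for i, tok in enumerate(doc) if tok.text == marker)
span_start = marker_idx + 1     # token index of the span in the *marked* doc
print(span_start, doc[span_start])

# Caveat: these indices refer to the marked doc, and inserting the marker can
# itself change how neighbouring text is split, which is the "inconsistent
# split" problem described here.
```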
For a sentence like the one in my data, when I didn't insert the mark, the sentence was split differently than when I did. Could you kindly look into this?

Environment
Replies: 3 comments
-
I hope I understand your question correctly – but I think the problem here is that the tokenization of your data doesn't match spaCy's tokenization. For example, it seems like the token boundaries your offsets assume aren't the ones spaCy's tokenizer produces by default.
Modifying the tokenizer is a step in the right direction – but in your example, you're creating a blank tokenizer from scratch with only one suffix rule. So your new tokenizer won't have any of the other tokenization data, like tokenizer exceptions, available, and it'll produce very different results. You probably want to pass in the existing defaults instead.
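As a hedged sketch of what passing the defaults in could look like (spaCy v2-style `Defaults` attributes; the model name is a placeholder):

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Keep all of the default tokenization data (exceptions, prefixes, infixes)
# and only extend the suffix rules, instead of starting from a blank tokenizer.
suffixes = nlp.Defaults.suffixes + (r"-+$",)

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
    suffix_search=compile_suffix_regex(suffixes).search,
    infix_finditer=compile_infix_regex(nlp.Defaults.infixes).finditer,
    token_match=nlp.Defaults.token_match,
)
```

The simpler variant of this, which only swaps out the suffix rules on the loaded tokenizer, is shown in the reply below.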
-
Thank you for answering my question! I realized that I created a blank tokenizer from scratch. I am eager to know how to tweak the existing tokenizer instead.
-
The `nlp.tokenizer.suffix_search` attribute is writable, so you should be able to do something like:

```python
suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
```

The `nlp.tokenizer.suffix_search` attribute should be a function which takes a unicode string and returns a regex match object or `None`. Usually we use the `.search` attribute of a compiled regex object, but you can use some other function that behaves the same way.

The other way you could customize this is to update the `English.Defaults.suffixes` tuple. However, this won't change a tokenizer that you load, because when you load a tokenizer, it'll read the suffix regex from the saved model data.

The default list of suffix expressions can be found here: https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py
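For illustration, a hedged sketch of that second approach (the exact behaviour depends on the spaCy version):

```python
from spacy.lang.en import English

# Extend the class-level suffix list *before* building the pipeline. This only
# affects tokenizers created from the defaults; a tokenizer loaded from a saved
# model keeps the suffix regex it was saved with.
English.Defaults.suffixes = English.Defaults.suffixes + (r"-+$",)

nlp = English()  # this tokenizer now splits trailing hyphens off as a suffix
```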