Several questions when trying to get the start/end token indices of a span given its character offset #3304
-
Hi, I am trying to get the start/end token indices of a span given its character offset. I have searched for some posts and solutions (#1264), but they don't work in my case. I have a dataset with the following format:
A_offset is the character offset of the start of span A in Text. Now I want to get the start and end token indices of span A (or B, or Pronoun) in the doc after processing Text with spaCy.

First case:
First, suppose we want to get the inclusive token indices of span A. The returned span doesn't give me what I expect, because the tokenizer can't split the text at the span boundary.
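A minimal sketch of this kind of lookup, assuming the span is fetched with `Doc.char_span` (the text, offsets and model name here are made-up placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Bob told Alice that he would call her later."
a_offset, a_text = 9, "Alice"   # placeholders standing in for A_offset / A

doc = nlp(text)
# char_span maps character offsets to a token span; it returns None when the
# offsets don't line up with spaCy's token boundaries (in spaCy v3+ you can
# pass alignment_mode="expand" to snap to the surrounding tokens instead).
span = doc.char_span(a_offset, a_offset + len(a_text))
if span is not None:
    start, end = span.start, span.end - 1   # inclusive token indices
    print(start, end, span.text)
```

When the character offsets fall inside a token, `char_span` returns `None`, which is where the tokenization mismatch discussed below comes in.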
Second case: inconsistent split. Here I insert a mark before each span and then calculate the token indices of the span from the position of the mark (a rough sketch of the idea follows).
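A rough sketch of the marker idea, with made-up text, offset and marker string:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Bob told Alice that he would call her later."
a_offset = 9                    # made-up character offset of span A ("Alice")

# Insert a rare marker string in front of the span, tokenize the marked text,
# and use the marker token's position to locate the span.
marker = "SPANSTART"
marked_text = text[:a_offset] + marker + " " + text[a_offset:]
doc = nlp(marked_text)

marker_idx = next(i for i, tok in enumerate(doc) if tok.text == marker)
span_start = marker_idx + 1     # token index of the span in the *marked* doc
print(span_start, doc[span_start])

# Caveat: these indices refer to the marked doc, and inserting the marker can
# itself change how neighbouring text is split, which is the "inconsistent
# split" problem described here.
```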
For a sentence like the one in my data, when I didn't insert the mark, the sentence was split differently than when I did. Could you kindly look into this?

Environment
Replies: 3 comments
-
I hope I understand your question correctly – but I think the problem here is that the tokenization of your data doesn't match spaCy's tokenization. For example, it seems like the token boundaries your offsets assume aren't the ones spaCy's tokenizer produces by default.
Modifying the tokenizer is a step in the right direction – but in your example, you're creating a blank tokenizer from scratch with only one suffix rule. So your new tokenizer won't have any of the other tokenization data, like tokenizer exceptions, available, and it'll produce very different results. You probably want to pass in the existing defaults instead.
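As a hedged sketch of what passing the defaults in could look like (spaCy v2-style `Defaults` attributes; the model name is a placeholder):

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Keep all of the default tokenization data (exceptions, prefixes, infixes)
# and only extend the suffix rules, instead of starting from a blank tokenizer.
suffixes = nlp.Defaults.suffixes + (r"-+$",)

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
    suffix_search=compile_suffix_regex(suffixes).search,
    infix_finditer=compile_infix_regex(nlp.Defaults.infixes).finditer,
    token_match=nlp.Defaults.token_match,
)
```

The simpler variant of this, which only swaps out the suffix rules on the loaded tokenizer, is shown in the reply below.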
-
Thank you for answering my question! I realized that I created a blank tokenizer from scratch. I am eager to know how to tweak the existing tokenizer instead.
-
The `nlp.tokenizer.suffix_search` attribute is writable, so you should be able to do something like:

```python
suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
```

The `nlp.tokenizer.suffix_search` attribute should be a function which takes a unicode string and returns a regex match object or `None`. Usually we use the `.search` attribute of a compiled regex object, but you can use some other function that behaves the same way.

The other way you could customize this is to update the `English.Defaults.suffixes` tuple. However, this won't change a tokenizer that you load, because when you load a tokenizer, it'll read the suffix regex from the saved model data.

The default list of suffix expressions can be found here: https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py
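For illustration, a hedged sketch of that second approach (the exact behaviour depends on the spaCy version):

```python
from spacy.lang.en import English

# Extend the class-level suffix list *before* building the pipeline. This only
# affects tokenizers created from the defaults; a tokenizer loaded from a saved
# model keeps the suffix regex it was saved with.
English.Defaults.suffixes = English.Defaults.suffixes + (r"-+$",)

nlp = English()  # this tokenizer now splits trailing hyphens off as a suffix
```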