
Several questions when trying to get the start/end token indices of a span given its character offsets #3304


The nlp.tokenizer.suffix_search attribute is writable, so you should be able to do something like:

import spacy

# nlp is an already-loaded Language pipeline
suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

The nlp.tokenizer.suffix_search attribute should be a function that takes a unicode string and returns a regex match object or None. Usually we use the .search attribute of a compiled regex object, but you can substitute any other function that behaves the same way.
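To make "a function that takes a string and returns a regex match object or None" concrete, here is a minimal standalone sketch using only the `re` module (the function and pattern names are illustrative, not part of spaCy's API):

```python
import re

# A regex matching one or more hyphens anchored at the end of the string,
# mirroring the r'''-+$''' pattern added to the suffixes above.
_hyphen_suffix = re.compile(r"-+$")

def custom_suffix_search(text):
    """Behaves like the .search attribute of a compiled regex:
    returns a match object for a trailing-hyphen suffix, or None."""
    return _hyphen_suffix.search(text)

print(custom_suffix_search("well-"))   # a match object covering the trailing "-"
print(custom_suffix_search("well"))    # None: no trailing hyphen
```

Any callable with this contract can be assigned to nlp.tokenizer.suffix_search; the compiled regex's .search method is just the most common choice.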

The other way you could customize this is to update the English.Defaults.suffixes tuple. However, this won't change a tokenizer that you load, because when you load a tokenizer, it'll read the suffix regex fr…

Replies: 3 comments

Answer selected by ines
Labels
feat / doc Feature: Doc, Span and Token objects
3 participants
This discussion was converted from issue #3304 on December 10, 2020 13:43.