Adding already tokenized document to spaCy pipeline #2606
Hello, I already have a tokenized document. Is there any way to add the tokenized document to the pipeline while excluding the tokenization process?
Yes, you can always initialise a `Doc` object directly with the shared vocab, and pass in a list of `words`:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')
doc = Doc(nlp.vocab, words=['Eight', 'Iraqi', 'Kurds', 'killed', 'yesterday'])
```

The `spaces` keyword argument lets you pass in a list of boolean values that indicate whether the token is followed by whitespace or not. Here's an example:

```python
doc = Doc(nlp.vocab, words=['hello', 'world', '!'], spaces=[True, True, True])
print(doc.text)
# hello world !

doc = Doc(nlp.vocab, words=['hello', 'world', '!'], spaces=[True, False, False])
print(doc.text)
# hello world!
```

Of course, this only works if your data actually contains the whitespace information as well. Unfortunately, many tokenizers destroy that information, and once it's gone, it's difficult to restore. (One approach is to use a statistical model to predict the whitespace, which has worked well in the past.)
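To then address the original question of running the rest of the pipeline on the pre-built `Doc`: calling the pipeline components directly skips tokenization entirely, since `nlp.pipeline` is a list of `(name, component)` pairs and each component takes a `Doc` and returns a `Doc`. A minimal sketch, reusing the model and words from the snippets above:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

# Build the Doc from pre-tokenized words, bypassing spaCy's tokenizer.
doc = Doc(nlp.vocab, words=['Eight', 'Iraqi', 'Kurds', 'killed', 'yesterday'])

# Apply each pipeline component (tagger, parser, NER, ...) to the Doc.
for name, component in nlp.pipeline:
    doc = component(doc)

# The Doc now carries the usual annotations.
print([(token.text, token.pos_, token.ent_type_) for token in doc])
```

In recent versions of spaCy (v3+), you can also pass a pre-constructed `Doc` straight to `nlp(doc)`, which skips tokenization automatically.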