Adding already tokenized document to spaCy pipeline #2606
Hello, I already have a tokenized document. Is there any way to add the tokenized document to the pipeline while excluding the tokenization process?
Yes, you can always initialise a `Doc` object directly with the shared vocab, and pass in a list of `words`:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')
doc = Doc(nlp.vocab, words=['Eight', 'Iraqi', 'Kurds', 'killed', 'yesterday'])
```

The `spaces` keyword argument lets you pass in a list of boolean values that indicate whether the token is followed by whitespace or not. Here's an example:

```python
doc = Doc(nlp.vocab, words=['hello', 'world', '!'], spaces=[True, True, True])
print(doc.text)
# hello world !

doc = Doc(nlp.vocab, words=['hello', 'world', '!'], spaces=[True, False, False])
print(doc.text)
# hello world!
```

Of course, this only works if your data actually contains the whitespace information as well. Unfortunately, many tokenizers destroy that information, and once it's gone, it's difficult to restore. (One approach is to use a statistical model to predict the whitespace, which has worked well in the past.)
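To then address the original question of running the rest of the pipeline on the pre-built `Doc`: calling the pipeline components directly skips tokenization entirely, since `nlp.pipeline` is a list of `(name, component)` pairs and each component takes a `Doc` and returns a `Doc`. A minimal sketch, reusing the model and words from the snippets above:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

# Build the Doc from pre-tokenized words, bypassing spaCy's tokenizer.
doc = Doc(nlp.vocab, words=['Eight', 'Iraqi', 'Kurds', 'killed', 'yesterday'])

# Apply each pipeline component (tagger, parser, NER, ...) to the Doc.
for name, component in nlp.pipeline:
    doc = component(doc)

# The Doc now carries the usual annotations.
print([(token.text, token.pos_, token.ent_type_) for token in doc])
```

In recent versions of spaCy (v3+), you can also pass a pre-constructed `Doc` straight to `nlp(doc)`, which skips tokenization automatically.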