[BUG] Sentence context greater than 512 characters #64
Comments
Thanks @xei for reporting this issue. I know BERT has a limit of 512 tokens, and the model currently being used for inference was trained with a maximum of 512 position embeddings (REF). Also, I am not sure how to reproduce the failure here, because in my testing a long text passes through the pipeline without any error. For example:

>>> import spacy
>>> import contextualSpellCheck
>>> spacy_nlp = spacy.load(
'en_core_web_sm',
# disable=['ner']
disable=['parser', 'ner'] # disable extra components for efficiency
)
>>> spacy_nlp.add_pipe('sentencizer')
<spacy.pipeline.sentencizer.Sentencizer object at 0x7fb10c509f40>
>>> contextualSpellCheck.add_to_pipe(spacy_nlp)
>>> corpus_raw="""The train from the west that bore Bert Bryant to New York was two
hours late, for all the way from Clinton, Ohio, where Bert lived, the
snow had been from four inches to a foot in depth. Consequently he had
missed the one o’clock train for Mt. Pleasant and had spent an hour
with his face glued to a waiting-room window watching the bustle and
confusion of New York. Now, at four o’clock, he was seated in a sleigh,
his suit-case between his feet, winding up the long, snowy road to Mt.
Pleasant Academy. In the front seat was the fur-clad driver and beside
him was Bert’s small trunk.
It was very cold and fast growing dark. It seemed to Bert that they
had been driving for miles and miles, and he wanted to ask the driver
how much farther they had to go. But the man in the old bearskin coat
was cross and taciturn, and so Bert buried his hands still deeper in
his pockets and wondered whether his nose and ears were getting white.
And just when he had decided that they were the sleigh left the main
road with a sudden lurch, that almost toppled the trunk off, and turned
through a gate and up a curving drive lined with snow-laden evergreens.
Then the academy came into view, a rambling, comfortable-looking
building with many cheerfully lighted windows looking out in welcome.
At one of the windows two faces appeared in response to the warning
of the sleigh bells and peered curiously down. The sleigh pulled up
in front of a broad stone step and Bert clambered out, bag in hand.
The driver lifted the trunk, opened the big oak door without ceremony,
deposited his burden just inside and growled: “Fifty cents.”"""
>>> doc = spacy_nlp(corpus_raw)
>>> doc._.suggestions_spellCheck
{Bert: 'Bert', Bryant: 'back', York: 'York', Clinton: 'Canton', Ohio: 'Ohio', Bert: 'he', bustle: 'noise', York: 'York', sleigh: 'seat', snowy: 'dusty', Bert: 'Ben', Bert: 'Bond', bearskin: 'black', taciturn: 'stern', Bert: 'he', sleigh: 'pair', lurch: 'turn', toppled: 'ripped', evergreens: 'trees', rambling: 'big', cheerfully: 'carefully', lighted: 'painted', sleigh: 'church', sleigh: 'coach', Bert: 'Ben', clambered: 'climbed'}

As you can see above, the entire text moved through the spaCy pipeline without any error. Another thing I wanted to point out: contextualSpellCheck requires sentence boundaries to be set, so the pipeline needs either the dependency parser or the sentencizer component.
Please let me know if you have any questions. I think your suggestion is great, and I will think about a solution that either splits a large sentence (> max_position_embeddings) or bypasses the spell check altogether. If you would like to contribute this feature, feel free to create a PR!
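For anyone who wants to experiment before a fix lands, here is a minimal sketch of the splitting idea. It is not part of contextualSpellCheck's API; the checkpoint name bert-base-cased and the fixed-window chunking are assumptions for illustration only.

import spacy  # not needed for the check itself, shown for context
from transformers import AutoConfig, AutoTokenizer

checkpoint = "bert-base-cased"  # assumption: substitute the model your pipeline loads
config = AutoConfig.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def split_for_model(text, margin=2):
    # Leave `margin` slots for special tokens such as [CLS] and [SEP].
    limit = config.max_position_embeddings - margin
    # Tokenize once without special tokens so each sub-word token maps
    # back to a character span in `text` via its offset.
    encoding = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)
    offsets = encoding["offset_mapping"]
    for start in range(0, len(offsets), limit):
        window = offsets[start:start + limit]
        yield text[window[0][0]:window[-1][1]]

Each yielded chunk re-tokenizes to at most 512 ids including the two special tokens, so it should fit the model's positional window; a smarter split would cut on the sentencizer's sentence boundaries rather than raw token windows.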
I tried to correct spelling mistakes in a large text.
At first, I faced this error:
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.

So, I added the sentencizer component to the pipeline. This time I faced this error:
RuntimeError: The expanded size of the tensor (837) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 837]. Tensor sizes: [1, 512]
I guess this is due to the limitations of BERT. However, I believe that there should be a way to catch this error and bypass the spell check.
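Until the library handles this internally, a caller-side workaround along those lines might look like the sketch below. Splitting the input on blank lines is my assumption about the corpus; any chunk the model rejects is simply skipped rather than spell checked.

import spacy
import contextualSpellCheck

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
nlp.add_pipe("sentencizer")
contextualSpellCheck.add_to_pipe(nlp)

def safe_suggestions(text):
    # Collect suggestions per paragraph, bypassing the spell check for any
    # chunk that still exceeds the model's 512-token window.
    suggestions = {}
    for paragraph in text.split("\n\n"):
        try:
            doc = nlp(paragraph)
            suggestions.update(doc._.suggestions_spellCheck)
        except RuntimeError:
            continue  # over the limit: skip spell check for this paragraph
    return suggestions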