
'x' is tagged PUNCT #2834

Closed
bittlingmayer opened this issue Oct 9, 2018 · 4 comments
Labels: feat/tagger, lang/en, models, perf/accuracy

Comments

@bittlingmayer (Contributor) commented Oct 9, 2018

displaCy tags "x" in "What are some Spanish words that start with x?" as PUNCT.

https://explosion.ai/demos/displacy?text=What%20are%20some%20Spanish%20words%20that%20start%20with%20x%3F&model=en_core_web_sm&cpu=0&cph=0

[Screenshot: displaCy parse showing "x" tagged as PUNCT]

@bittlingmayer (Contributor, Author) commented:

Interestingly, although the errors are often correlated across systems, this time Google gets it right, which leads me to think it is not just an artifact of the training data but an actual bug.

[Screenshot: Google's parser tagging "x" correctly]

@ines added the lang/en, models, feat/tagger, and perf/accuracy labels on Oct 9, 2018
@ines (Member) commented Dec 14, 2018

Well, I assume Google's model is trained on a different corpus, using a different process and a different model architecture. While the comparison is always interesting, I'm not sure it really indicates anything deeper about spaCy's pre-trained models.

I'm merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines closed this as completed on Dec 14, 2018
@bittlingmayer (Contributor, Author) commented:

> I assume Google's model is trained on a different corpus

From my observation (caveat: a rolling impression over the past few years, biased towards English), the relatively peculiar errors on specific sentences are often so suspiciously similar between spaCy, Google, and to some degree Stanford that my perhaps naïve conclusion is that there must be significant overlap in their training data.

@lock (bot) commented Jan 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock locked this as resolved and limited conversation to collaborators on Jan 19, 2019