This repository has been archived by the owner on Jun 7, 2023. It is now read-only.

Wrong case information for ngram model? #55

Open
hiroshinoji opened this issue Sep 7, 2020 · 0 comments

Hi,

First, thank you for releasing this. Great work!

I was trying to run a model with syntaxgym and found that the ngram model fails with the syntaxgym run command. This seems to be caused by incorrect spec information in the model (the cased field), probably defined here.

This should probably be false? The ngram tokenizer outputs uncased tokens, and this mismatch seems to break the alignment in the tokenize_regions method of the Sentence class. The error message looks like:

File "/.../lib/python3.7/site-packages/syntaxgym/agg_surprisals.py", line 58, in aggregate_surprisals
    raise utils.TokenMismatch(token, sent_tokens[t_idx], t_idx+2)
syntaxgym.utils.TokenMismatch:
tokens "painting" and "the" do not match (line 2 in surprisal file)
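
To illustrate the kind of mismatch I suspect, here is a minimal sketch. It is not syntaxgym's actual code: `align`, `try_align`, and this `TokenMismatch` class are hypothetical stand-ins, and I am assuming the aligner compares reference words to model tokens verbatim when the spec says the model is cased. Under that assumption, uncased tokenizer output only aligns when the flag is false.

```python
class TokenMismatch(Exception):
    """Hypothetical stand-in for the library's mismatch error."""


def align(region_words, model_tokens, cased):
    """Pair each reference word with the model token at the same position.

    Assumption: when the spec marks the model as uncased, reference words
    are lowercased before comparison; otherwise they are compared verbatim.
    """
    aligned = []
    for idx, (word, token) in enumerate(zip(region_words, model_tokens)):
        expected = word if cased else word.lower()
        if expected != token:
            raise TokenMismatch(
                f'tokens "{expected}" and "{token}" do not match (position {idx})'
            )
        aligned.append((word, token))
    return aligned


def try_align(words, tokens, cased):
    """Return True if alignment succeeds, False if it raises TokenMismatch."""
    try:
        align(words, tokens, cased)
        return True
    except TokenMismatch:
        return False


words = ["The", "painting", "is", "old"]
tokens = [w.lower() for w in words]  # what an uncased tokenizer would emit

print("spec says cased=True :", try_align(words, tokens, cased=True))
print("spec says cased=False:", try_align(words, tokens, cased=False))
```

In this sketch the first call fails on "The" vs "the", while the second succeeds, which is why I think the flag in the spec should be false.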

Thank you!
