Wrong case information for ngram model? #55

hiroshinoji · 2020-09-07T11:36:23Z

Hi,

First, thank you for releasing this. Great work!

I was trying to run some models with syntaxgym and found that the ngram model fails on the syntaxgym run command. The cause seems to be incorrect case information (the "cased" field) in the model's spec, probably defined here:

lm-zoo/models/ngram/spec.template.json

Line 31 in 5c72f5a

"cased": true

This should probably be false? The ngram tokenizer outputs uncased tokens, and this mismatch with the spec seems to break the alignment in the tokenize_regions method of the Sentence class. The error message looks like:

File "/.../lib/python3.7/site-packages/syntaxgym/agg_surprisals.py", line 58, in aggregate_surprisals
    raise utils.TokenMismatch(token, sent_tokens[t_idx], t_idx+2)
syntaxgym.utils.TokenMismatch:
tokens "painting" and "the" do not match (line 2 in surprisal file)
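
To illustrate the kind of mismatch I suspect, here is a minimal sketch. It is not syntaxgym's actual code: align_tokens is a hypothetical helper and the sentence is made up. It assumes only that the expected words are lowercased before comparison when the spec says the model is uncased. It does not reproduce the exact tokens in the error above.

```python
def align_tokens(region_words, model_tokens, cased):
    """Toy alignment check (hypothetical, not syntaxgym's implementation).

    Compares each expected word with the token the model emitted and
    returns a list of (index, expected, actual) for every mismatch.
    """
    mismatches = []
    for idx, (word, token) in enumerate(zip(region_words, model_tokens)):
        # If the spec declares the model uncased, lowercase the expected word.
        expected = word if cased else word.lower()
        if expected != token:
            mismatches.append((idx, expected, token))
    return mismatches


# Made-up example sentence and the tokens an uncased tokenizer would emit.
words = ["The", "painting", "is", "old"]
uncased_tokens = ["the", "painting", "is", "old"]

# Spec says "cased": true, but the tokens are lowercased: alignment fails.
print(align_tokens(words, uncased_tokens, cased=True))

# Spec says "cased": false: the comparison succeeds.
print(align_tokens(words, uncased_tokens, cased=False))
```

With cased=True the first word is reported as a mismatch, and with cased=False the list is empty, which is why I think the spec value is the culprit.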

Thank you!
