I've been reading your paper; interesting work.
I have a question about how you compute perplexities, especially over datasets that are already tokenized (e.g., WikiText-103). I understand that your encoding can assign probabilities to any string, but I'd expect the LM to do poorly when fed pre-tokenized input. For example, a tokenized WikiText-103 line looks like:

`M @-@ 82 begins at a junction with M @-@ 120 and B @-@ 96 west of Fremont .`
How do you report perplexity in this case?
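To make the question concrete, here is a rough sketch of what I would have guessed: detokenize the Moses-style markers first, then normalize the total negative log-likelihood by the original word count so the number stays comparable to word-level models. The `total_nll` hook and the detokenization rules below are just my assumptions, not anything taken from your code:

```python
import math
import re

def detokenize(line: str) -> str:
    """Undo the Moses-style tokenization in the raw WikiText-103 files.

    These rules are guesses based on the markers visible in the data
    (e.g. " @-@ " joins hyphenated tokens) and are not exhaustive.
    """
    line = line.replace(" @-@ ", "-")
    line = line.replace(" @,@ ", ",")
    line = line.replace(" @.@ ", ".")
    # Reattach punctuation to the neighboring word.
    line = re.sub(r" ([,.;:!?')\]])", r"\1", line)
    line = re.sub(r"([(\[]) ", r"\1", line)
    return line

def word_level_ppl(lines, total_nll):
    """exp(total NLL / reference word count).

    Normalizing by the original whitespace token count keeps the
    perplexity comparable across different tokenizations.
    `total_nll(text)` is a hypothetical hook that returns the model's
    negative log-likelihood of `text` in nats.
    """
    nll = sum(total_nll(detokenize(line)) for line in lines)
    n_words = sum(len(line.split()) for line in lines)
    return math.exp(nll / n_words)
```

Is that roughly what you do, or do you score the pre-tokenized text directly and normalize some other way?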