
Too long token sequence #2

McSinyx opened this issue Nov 12, 2019 · 4 comments


McSinyx (Owner) commented Nov 12, 2019

As raised by transformers' tokenization utils:

Token indices sequence length is longer than the specified maximum sequence
length for this model ([some number] > 512). Running this sequence through
the model will result in indexing errors.

This is because some paragraphs are too long. I am unsure how we should split those.

Huy-Ngo (Collaborator) commented Nov 13, 2019

Here is one solution of mine for long paragraphs classified as true:

  1. Split each of them into sentences.
  2. Run the model on each sentence and see the results.
  3. (a) If any sentence is classified as true, consider all the results good evaluations. (b) Otherwise, mark them all as bad evaluations.

This approach has a drawback: in case (3b), some sentences in that paragraph that are correctly marked as unrelated end up counted as bad evaluations. I'm not sure if we can implement that (see the sketch below).
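
A minimal sketch of this scheme, assuming a hypothetical `classify(question, text)` helper standing in for the model and NLTK's `sent_tokenize` as one possible sentence splitter:

```python
from nltk.tokenize import sent_tokenize  # requires nltk and its punkt data

def evaluate_long_paragraph(question, paragraph, classify):
    """Split an over-length paragraph into sentences and judge each one.

    classify(question, text) is a hypothetical stand-in for the model.
    Per the scheme above: if any sentence is judged true (case 3a),
    every per-sentence result counts as a good evaluation; otherwise
    (case 3b) they are all marked as bad evaluations.
    """
    sentences = sent_tokenize(paragraph)
    verdicts = [classify(question, sentence) for sentence in sentences]
    good = any(verdicts)
    return [(s, v, good) for s, v in zip(sentences, verdicts)]
```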

McSinyx added a commit that referenced this issue Nov 14, 2019
McSinyx (Owner, Author) commented Nov 14, 2019

Turns out 99.92% (18094/18108) of the data is shorter than 512 (BERT) tokens, hence I assume it is OK to ignore the long ones. Handling long input while using the model is another story.
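
For reference, a count like this can be obtained with the transformers tokenizer; the model name and data layout here are assumptions:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

def count_within_limit(paragraphs, limit=512):
    """Return how many paragraphs tokenize to at most `limit` tokens."""
    return sum(len(tokenizer.encode(p)) <= limit for p in paragraphs)

# e.g. count_within_limit(paragraphs) / len(paragraphs) -> 0.9992 here
```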

trahoa (Collaborator) commented Nov 14, 2019

This is very likely to happen in the test set (both public and private). Here is one suggestion, as I wrote in the email (sketched in code below):

  1. Split the long paragraph [ABCD] into multiple overlapping, 512-token-long subparagraphs: [AB] + [BC] + [CD].
  2. Evaluate each subparagraph against the same question Q and take the maximum output.
  3. Consider that maximum value the evaluation of Q against [ABCD].
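
A sketch of this sliding-window idea, assuming a hypothetical `score(question, chunk)` helper that returns the model's confidence for one chunk; the window and stride values are illustrative:

```python
def best_window_score(question, tokens, score, window=512, stride=256):
    """Evaluate an over-length token sequence by overlapping windows.

    score(question, chunk) is a hypothetical stand-in for the model's
    confidence on one chunk.  The windows overlap ([AB], [BC], [CD]) so
    no span is cut cleanly in half, and the maximum score is taken as
    the evaluation of the question against the whole paragraph.
    """
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return max(score(question, chunk) for chunk in chunks)
```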

McSinyx added a commit that referenced this issue Nov 15, 2019
McSinyx (Owner, Author) commented Nov 22, 2019

For quick experimenting, we're using sequences shorter than 128 tokens. The 512-token length might be reconsidered if the test cases require that limit.

Edit: for the pre-tests, 256 would be sufficient (only test_0236 p4 exceeds this limit slightly, so trimming it is all right).
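
Trimming can happen at tokenization time; a sketch, assuming a recent transformers version that accepts `truncation=True` alongside `max_length` (the model name is also an assumption):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

paragraph = "..."  # e.g. the slightly over-length test_0236 p4
# Encode and truncate to 256 tokens in one step.
ids = tokenizer.encode(paragraph, max_length=256, truncation=True)
assert len(ids) <= 256
```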
