
Too long token sequence #2

McSinyx opened this issue Nov 12, 2019 · 4 comments


McSinyx (Owner) commented Nov 12, 2019

As raised by transformers' tokenization utils:

Token indices sequence length is longer than the specified maximum sequence
length for this model ([some number] > 512). Running this sequence through
the model will result in indexing errors.

This is because some paragraphs are too long. I am unsure how we should split those.

Huy-Ngo (Collaborator) commented Nov 13, 2019

Here is one solution of mine for long paragraphs classified as true:

  1. Split each of them into sentences.
  2. Run the model on each sentence and see the results.
  3. (a) If any sentence is classified as true, consider all the results good evaluations. (b) Otherwise, mark them all as bad evaluations.

This approach has a drawback: in case (3b), some sentences in that paragraph that are correctly marked as unrelated end up counted as bad evaluations. I'm not sure if we can implement that (see the sketch below).
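
A minimal sketch of this scheme, assuming a hypothetical `classify(question, text)` helper standing in for the model and NLTK's `sent_tokenize` as one possible sentence splitter:

```python
from nltk.tokenize import sent_tokenize  # requires nltk and its punkt data

def evaluate_long_paragraph(question, paragraph, classify):
    """Split an over-length paragraph into sentences and judge each one.

    classify(question, text) is a hypothetical stand-in for the model.
    Per the scheme above: if any sentence is judged true (case 3a),
    every per-sentence result counts as a good evaluation; otherwise
    (case 3b) they are all marked as bad evaluations.
    """
    sentences = sent_tokenize(paragraph)
    verdicts = [classify(question, sentence) for sentence in sentences]
    good = any(verdicts)
    return [(s, v, good) for s, v in zip(sentences, verdicts)]
```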

McSinyx added a commit that referenced this issue Nov 14, 2019
McSinyx (Owner, Author) commented Nov 14, 2019

Turns out 99.92% (18094/18108) of the data is shorter than 512 (BERT) tokens, hence I assume it is OK to ignore the long ones. Handling long input while using the model is another story.
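
For reference, a count like this can be obtained with the transformers tokenizer; the model name and data layout here are assumptions:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

def count_within_limit(paragraphs, limit=512):
    """Return how many paragraphs tokenize to at most `limit` tokens."""
    return sum(len(tokenizer.encode(p)) <= limit for p in paragraphs)

# e.g. count_within_limit(paragraphs) / len(paragraphs) -> 0.9992 here
```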

trahoa (Collaborator) commented Nov 14, 2019

This is very likely to happen in the test set (both public and private). Here is one suggestion, as I wrote in the email (sketched in code below):

  1. Split the long paragraph [ABCD] into multiple overlapping, 512-token-long subparagraphs: [AB] + [BC] + [CD].
  2. Evaluate each subparagraph against the same question Q and take the maximum output.
  3. Consider that maximum value the evaluation of Q against [ABCD].
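
A sketch of this sliding-window idea, assuming a hypothetical `score(question, chunk)` helper that returns the model's confidence for one chunk; the window and stride values are illustrative:

```python
def best_window_score(question, tokens, score, window=512, stride=256):
    """Evaluate an over-length token sequence by overlapping windows.

    score(question, chunk) is a hypothetical stand-in for the model's
    confidence on one chunk.  The windows overlap ([AB], [BC], [CD]) so
    no span is cut cleanly in half, and the maximum score is taken as
    the evaluation of the question against the whole paragraph.
    """
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return max(score(question, chunk) for chunk in chunks)
```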

McSinyx added a commit that referenced this issue Nov 15, 2019
McSinyx (Owner, Author) commented Nov 22, 2019

For quick experimenting, we're using sequences shorter than 128 tokens. The 512-token length might be reconsidered if the test cases require that limit.

Edit: for the pre-tests, 256 would be sufficient (only test_0236 p4 exceeds this limit slightly, so trimming it is all right).
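
Trimming can happen at tokenization time; a sketch, assuming a recent transformers version that accepts `truncation=True` alongside `max_length` (the model name is also an assumption):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

paragraph = "..."  # e.g. the slightly over-length test_0236 p4
# Encode and truncate to 256 tokens in one step.
ids = tokenizer.encode(paragraph, max_length=256, truncation=True)
assert len(ids) <= 256
```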
