Too long token sequence #2
Here is one solution of mine for long paragraphs classified as
This approach has some drawbacks: for case (3b), some sentences in that paragraph that are correctly marked as unrelated end up being marked as bad evaluations. Not sure if we can implement that.
Turns out 99.92% (18094/18108) of the data is shorter than 512 (BERT) tokens, so I assume it is OK to ignore the rest. Handling long input when using the model is another story.
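A quick way to check that statistic is to count how many paragraphs fit a token budget. The sketch below uses whitespace splitting as a stand-in tokenizer (with `transformers` installed, you would swap in `AutoTokenizer.from_pretrained("bert-base-uncased")` and count its token IDs instead); the helper name and sample data are made up for illustration.

```python
def fraction_within_budget(paragraphs, max_tokens=512):
    """Return the share of paragraphs whose token count is under max_tokens.

    Whitespace split is a rough proxy; a real BERT tokenizer produces
    subword tokens, so its counts are usually somewhat higher.
    """
    ok = sum(1 for p in paragraphs if len(p.split()) < max_tokens)
    return ok / len(paragraphs)

# Two short paragraphs fit the budget; the 600-token one does not.
paras = ["short example", "another short one", " ".join(["tok"] * 600)]
print(fraction_within_budget(paras))  # 2 of 3 fit
```

On the real dataset this is the 18094/18108 ≈ 99.92% figure quoted above.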
This is very likely to happen in the test set (both public and private). Here is one suggestion as I wrote in the email:
For quick experimenting, we're using sequences of < 128 tokens. A length of 512 might be reconsidered if the test cases require that limit. Edit: for the pre-tests, 256 would be sufficient (only test_0236 p4 exceeds this limit slightly, so trimming it is alright).
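The trimming suggested above can be sketched as a simple head truncation. The function name and the two reserved slots (for special tokens such as `[CLS]` and `[SEP]`, which BERT-style tokenizers add on top of the content tokens) are assumptions for illustration; `transformers` tokenizers do this internally via `truncation=True, max_length=...`.

```python
def trim_to_limit(tokens, max_len=256, reserved=2):
    """Keep only the leading tokens so the sequence fits the model limit.

    `reserved` leaves room for special tokens ([CLS], [SEP]) that the
    tokenizer adds around the content.
    """
    budget = max_len - reserved
    return tokens if len(tokens) <= budget else tokens[:budget]

print(len(trim_to_limit(list(range(300)))))  # 254: 256 minus 2 reserved slots
```

For test_0236 p4 this would simply drop the few tokens past the 256 limit, which is what "trimming it" amounts to.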
As raised by transformers' tokenization utils:
This is because some paragraphs are too long. I am unsure how we should split those.
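One common option for splitting over-long inputs is a sliding window of overlapping chunks, so no sentence is cut off without context on at least one side. This is only a sketch of that idea, not the approach decided in this thread; the function name and default stride are made up. (`transformers` fast tokenizers offer something similar via `return_overflowing_tokens=True` with a `stride` argument.)

```python
def split_into_windows(tokens, max_len=512, stride=128):
    """Split a long token sequence into overlapping chunks of at most max_len.

    Consecutive chunks share `stride` tokens of overlap, so content near a
    chunk boundary still appears with context in the neighboring chunk.
    """
    if len(tokens) <= max_len:
        return [tokens]
    step = max_len - stride  # advance by this much between chunks
    return [tokens[i:i + max_len] for i in range(0, len(tokens) - stride, step)]

chunks = split_into_windows(list(range(1000)))
print([len(c) for c in chunks])  # three chunks, each within the 512 limit
```

Per-chunk predictions would then need to be merged back (e.g. by averaging scores in the overlapped regions), which is the part that still needs a decision here.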