PR #435 contains a script, cleanup_scripts/seperate_test_set.py, that is used to randomly extract articles from the training set for use as an evaluation set. A total of 10000 articles are extracted from the training set into the eval set. Unfortunately, there's a bug in seperate_test_set.py that causes 60 of the 10000 extracted articles to be empty.
The seperate_test_set.py script was later adopted in PR #470, as the file input_preprocessing/seperate_test_set.py, so it needs to be fixed there as well.
The problem is that on line 75 (here in PR 435 and here in PR 470) the boundaries between articles are found using Python's split('\n\n'). This leaves an empty entry at the end of the resulting list whenever the text ends with that separator. Then, when a random entry is selected on line 79, there's an approximately 1/2 of 1% chance of selecting the last (empty) entry.
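A minimal reproduction of the trailing empty entry, as a sketch only: the sample text below is a made-up stand-in for the real training data, and it assumes the input ends with a blank line.

```python
# Hypothetical stand-in for the training text: articles separated by
# blank lines, with the text itself ending in the same separator.
text = "article one\n\narticle two\n\narticle three\n\n"

articles = text.split('\n\n')

# The final separator has nothing after it, so split() emits '' as the
# last element.
print(articles)  # ['article one', 'article two', 'article three', '']
print(len(articles))  # 4
```

If the text did not end with the separator, no empty entry would be produced, so the behavior depends on how the input file is terminated.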
Two ways to fix the problem would be (a) call pop() on the list produced on line 75, or (b) change num_articles on line 79 to (num_articles-1), so that the last entry can't be selected.
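The two options could look roughly like this. This is a sketch, not the code from either PR: split_articles and pick_index_excluding_last are hypothetical helper names, and the use of randrange is an assumption about how the index is drawn.

```python
import random


def split_articles(text):
    """Option (a): split on blank lines and drop a trailing empty entry."""
    articles = text.split('\n\n')
    # Only pop when the last entry really is empty; an unconditional pop()
    # would discard a real article if the text had no trailing separator.
    if articles and articles[-1] == '':
        articles.pop()
    return articles


def pick_index_excluding_last(num_articles, rng):
    """Option (b): draw an index from 0..num_articles-2, never the last one."""
    return rng.randrange(num_articles - 1)


text = "article one\n\narticle two\n\narticle three\n\n"
print(split_articles(text))  # ['article one', 'article two', 'article three']
```

Option (a) fixes the list itself, so every later use of it is safe. Option (b) leaves the empty entry in place and relies on the selection never reaching it.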
The fix is straightforward, but recreating the eval set would require (1) updating Google Drive and (2) checking that RCPs are not affected. Since this benchmark is pretty old, I think we should keep it as is and add this as a known bug to the documentation.
Yes, unfortunately this was left unfixed in the churn before v1.0 submission 1.5 years ago, so it is what it is, and it would not be productive to change the benchmark at this point. I'll submit a PR adding a small note to the bottom of the README.md for the benchmark.