Error with get_state in download.py #22

Open
JohnGiorgi opened this issue Apr 14, 2020 · 1 comment

Comments

JohnGiorgi commented Apr 14, 2020

Hi, I downloaded the pre-filtered URL list from here and then tried to extract the text with download.py, as per the README:

python download.py url_dumps_deduped/RS_2018-07.xz.deduped.txt \
    --n_procs 40 \
    --scraper bs4 \
    --chunk_size 100000 \
    --compress \
    --timeout 30

For many of the .txt files, I get this error:

Traceback (most recent call last):
  File "download.py", line 235, in <module>
    completed_uids, state_fp, prev_cid = get_state(month, args.output_dir)
  File "download.py", line 210, in get_state
    latest_cid = max([int(a.split("-")[-1].split("_")[0]) for a in archives])
ValueError: max() arg is an empty sequence

Is this a known error? I am planning to dig through the code to debug it, but I first wanted to check whether anyone else has hit this issue and knows the cause or a fix.
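
For reference, the traceback shows `max()` being called on an empty list: if `archives` contains no entries, the list comprehension yields nothing and `max` raises `ValueError`. Below is a minimal sketch of that failure mode and one possible guard. The helper name `latest_chunk_id`, the example file names, and the fallback value of 0 are illustrative assumptions, not taken from download.py.

```python
def latest_chunk_id(archives, default=0):
    """Return the highest chunk id parsed from archive names, or `default` if there are none."""
    # Same parsing expression as in the traceback: take the text after the last "-",
    # then the part before the first "_", and read it as an integer.
    chunk_ids = [int(a.split("-")[-1].split("_")[0]) for a in archives]
    # max() raises ValueError on an empty sequence; passing `default` avoids that.
    return max(chunk_ids, default=default)


# Hypothetical archive names, for illustration only.
print(latest_chunk_id(["RS_2018-07-3_data.xz", "RS_2018-07-12_data.xz"]))
print(latest_chunk_id([]))
```

Whether falling back to a default is the right behaviour for `get_state` depends on what the script expects when a state file exists but no archives do, so treat this as a starting point for debugging rather than a fix.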

@vipulraheja

Removing the file(s) in the state folder fixes the issue.
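
Presumably a leftover state file from an earlier run, with no matching archives, is what triggers the crash. Below is a minimal sketch of clearing such files before re-running. The helper `clear_state_files` and its `state_dir` argument are hypothetical; check where download.py actually writes its state before deleting anything.

```python
from pathlib import Path


def clear_state_files(state_dir):
    """Delete regular files directly inside `state_dir` and return their names."""
    removed = []
    directory = Path(state_dir)
    if not directory.is_dir():
        return removed  # nothing to clean up
    for path in sorted(directory.iterdir()):
        # Only remove plain files; subdirectories are left untouched.
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Note that deleting the state discards any record of completed URLs, so a re-run for that month would presumably start from scratch.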
