When processing TSV file from ~260GB zip file, mlr stops processing abruptly while filtering stream #1251

Open
jhpoelen opened this issue Mar 29, 2023 · 1 comment

@jhpoelen

bio-guoda/preston#228

While processing the retrieved zip with content id hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97 using unzip, I was able to count 2.30 billion records:

unzip -p c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97 0015281-230224095556074.csv\
 | pv -l\
 > /dev/null

yielded

2.30G 1:43:31 [ 370k/s]

This alternate way of counting lines in a known resource (i.e., one whose content id is known) suggests that the method used in https://gist.github.com/jhpoelen/569a3a787f6da542c8202ecddbacf580, or the following command, stops before reaching the end of the file:

preston cat 'zip:hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97!/0015281-230224095556074.csv'\
 | pv -l\
 | mlr --tsvlite filter '$collectionCode == "CASTYPE"'

which initially yielded

2.07G 5:16:05 [ 109k/s]

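in the workflow described by https://gist.github.com/jhpoelen/569a3a787f6da542c8202ecddbacf580. That is about 0.23 billion fewer than the 2.30 billion counted with unzip, so the pipeline appears to stop before the end of the file.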
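The two counts above (2.30G versus 2.07G) do not say which stage of the pipeline stopped first. One way to narrow that down, sketched here under the assumption that the pipeline runs in bash, is to inspect the exit status of every stage through the `PIPESTATUS` array. The helper `report_stages` is hypothetical and is not part of mlr, preston, or pv. It treats status 141 as "killed by SIGPIPE", which holds on Linux, where SIGPIPE is signal 13.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of mlr, preston, or pv).
# Arguments: the exit statuses of a pipeline's stages, in order.
# Call it immediately after the pipeline with "${PIPESTATUS[@]}".
# Prints one line per stage and returns non-zero if any stage failed.
report_stages() {
  local i=0 failed=0 status
  for status in "$@"; do
    i=$((i + 1))
    if [ "$status" -eq 0 ]; then
      echo "stage $i: ok"
    elif [ "$status" -eq 141 ]; then
      # 141 = 128 + 13 (SIGPIPE on Linux): this stage was still writing
      # when the stage reading from it had already exited.
      echo "stage $i: killed by SIGPIPE (status 141); a later stage stopped reading"
      failed=1
    else
      echo "stage $i: failed with status $status"
      failed=1
    fi
  done
  return "$failed"
}

# Usage with the pipeline from this issue (requires preston, pv and mlr):
#
#   preston cat 'zip:hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97!/0015281-230224095556074.csv' \
#     | pv -l \
#     | mlr --tsvlite filter '$collectionCode == "CASTYPE"'
#   report_stages "${PIPESTATUS[@]}"
```

If the last stage reports a non-zero status while the earlier stages report 141, the filter exited first and the stages feeding it were cut off. If the first stage fails and the rest report "ok", the input stream ended early and the filter simply saw end of input.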

@jhpoelen

Possibly related to #1088?

@johnkerl changed the title from "when processing tsv file from ~260GB zip file, mlr stops processing abruptly while filtering stream" to "When processing TSV file from ~260GB zip file, mlr stops processing abruptly while filtering stream" on Mar 31, 2023
@johnkerl added the bug label on Jun 6, 2023