Support parallel download of many files #882

Open
senderista opened this issue Mar 2, 2017 · 0 comments
Right now `UriSource` implicitly supports downloading all files directly under the input URL (by appending the wildcard character `*` to the URL), but downloads are strictly sequential and go through the coordinator. We should support parallel downloads at least in the REST API (if not in MyriaL) by adding a new endpoint, `parallelIngestDatasets`. It would take either a URL wildcard expression (evaluated by `org.apache.hadoop.fs.FileSystem.globStatus()`, as in `UriSource`) or a list of URLs (possibly via a separate endpoint), and distribute the downloads over all available workers, using the file sizes reported by `org.apache.hadoop.fs.FileSystem.getFileStatus().getLen()` and a greedy bin-packing heuristic. We could then replace the parallel ingest API in myria-python with a call to this REST API. Eventually we could consider supporting parallel downloads directly in MyriaL.
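A minimal sketch of the distribution step, assuming the LPT (longest-processing-time) greedy heuristic: sort files by descending size, then repeatedly assign the next file to the least-loaded worker. The class and method names below are hypothetical, not part of Myria; in practice the sizes would come from `getFileStatus().getLen()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class GreedyBinPacking {
  /**
   * Assigns file indices to workers with the LPT greedy heuristic.
   * Files are sorted by descending size; each file goes to the worker
   * with the smallest total assigned bytes so far.
   */
  public static List<List<Integer>> assign(long[] fileSizes, int numWorkers) {
    Integer[] order = new Integer[fileSizes.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Long.compare(fileSizes[b], fileSizes[a]));

    // Min-heap of {totalBytes, workerId}, keyed on current load.
    PriorityQueue<long[]> heap =
        new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
    List<List<Integer>> assignment = new ArrayList<>();
    for (int w = 0; w < numWorkers; w++) {
      heap.add(new long[] {0L, w});
      assignment.add(new ArrayList<>());
    }
    for (int fileIdx : order) {
      long[] worker = heap.poll();          // least-loaded worker
      assignment.get((int) worker[1]).add(fileIdx);
      worker[0] += fileSizes[fileIdx];      // update its load
      heap.add(worker);
    }
    return assignment;
  }
}
```

For example, sizes `{100, 90, 40, 30, 20, 10}` split across two workers end up as loads of 150 and 140 bytes, a far better balance than a naive round-robin over an arbitrary file order.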
