Support parallel download of many files #882

Open
senderista opened this issue Mar 2, 2017 · 0 comments
Right now `UriSource` implicitly supports downloading all files directly under the input URL (by appending the wildcard character `*` to the URL), but downloads are strictly sequential and go through the coordinator. We should support parallel downloads at least in the REST API (if not in MyriaL) by adding a new endpoint, `parallelIngestDatasets`. It would take either a URL wildcard expression (evaluated by `org.apache.hadoop.fs.FileSystem.globStatus()`, as in `UriSource`) or a list of URLs (possibly via a separate endpoint), and distribute the downloads over all available workers, using the file sizes reported by `org.apache.hadoop.fs.FileSystem.getFileStatus().getLen()` and a greedy bin-packing heuristic. We could then replace the parallel ingest API in myria-python with a call to this REST API. Eventually we could consider supporting parallel downloads directly in MyriaL.
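A minimal sketch of the distribution step, assuming the LPT (longest-processing-time) greedy heuristic: sort files by descending size, then repeatedly assign the next file to the least-loaded worker. The class and method names below are hypothetical, not part of Myria; in practice the sizes would come from `getFileStatus().getLen()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class GreedyBinPacking {
  /**
   * Assigns file indices to workers with the LPT greedy heuristic.
   * Files are sorted by descending size; each file goes to the worker
   * with the smallest total assigned bytes so far.
   */
  public static List<List<Integer>> assign(long[] fileSizes, int numWorkers) {
    Integer[] order = new Integer[fileSizes.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Long.compare(fileSizes[b], fileSizes[a]));

    // Min-heap of {totalBytes, workerId}, keyed on current load.
    PriorityQueue<long[]> heap =
        new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
    List<List<Integer>> assignment = new ArrayList<>();
    for (int w = 0; w < numWorkers; w++) {
      heap.add(new long[] {0L, w});
      assignment.add(new ArrayList<>());
    }
    for (int fileIdx : order) {
      long[] worker = heap.poll();          // least-loaded worker
      assignment.get((int) worker[1]).add(fileIdx);
      worker[0] += fileSizes[fileIdx];      // update its load
      heap.add(worker);
    }
    return assignment;
  }
}
```

For example, sizes `{100, 90, 40, 30, 20, 10}` split across two workers end up as loads of 150 and 140 bytes, a far better balance than a naive round-robin over an arbitrary file order.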
