Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerate large files download using concurent chunks for lakectl download recursive #8613

Merged
merged 9 commits into from
Feb 12, 2025

Conversation

nopcoder
Copy link
Contributor

@nopcoder nopcoder commented Feb 5, 2025

The current lakectl fs download supports downloading large files by splitting the files and downloading the content in parallel.

This change add the same support to the recursive option with the support to run multiple downloads (of parallel chunks) in parallel.

Close #8383

…kectl fs download recursive

The curent `lakectl fs download` supports downloading large files by
splitting the files and downloading the content in parallel.

This change add the same support to the recursive option with the
support to run multiple downloads (of parallel chunks) in parallel.

TODO

New downloader do not uses sync infrastructure, currently support the same flags: use-presign and parallel downloads.
Need to look into progress bar with chunks support.
Support no-progress flag as sync does.
@nopcoder nopcoder self-assigned this Feb 5, 2025
@nopcoder nopcoder added the include-changelog PR description should be included in next release changelog label Feb 5, 2025
Copy link

github-actions bot commented Feb 5, 2025

E2E Test Results - DynamoDB Local - Local Block Adapter

14 passed

Copy link

github-actions bot commented Feb 5, 2025

E2E Test Results - Quickstart

11 passed

@nopcoder nopcoder requested a review from a team February 6, 2025 17:17
@nopcoder nopcoder removed the request for review from a team February 7, 2025 10:58
@nopcoder nopcoder requested a review from a team February 7, 2025 21:05
if !recursive {
src := uri.URI{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description says that we are modifying download with the recursive option. However there appear to be significant changes to the !recursive code branch. What is the nature of these changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src was a copy of the remote - I've removed it

@@ -644,10 +642,6 @@ func TestLakectlFsDownload(t *testing.T) {
require.Contains(t, sanitizedResult, "download ro/ro_1k.2")
require.Contains(t, sanitizedResult, "download ro/ro_1k.3")
require.Contains(t, sanitizedResult, "download ro/ro_1k.4")
require.Contains(t, sanitizedResult, "Download Summary:")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the user still want to see these summaries? Shouldn't they be supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The history - the fs download recursive was replaced with Sync functionality used by lakectl local. This was part of the output done by the common code used by sync. It was also reporting Uploaded and Removed - which is not relevant to download command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the local sync summary to the recursive download

@@ -690,7 +680,6 @@ func TestLakectlFsUpload(t *testing.T) {
t.Run("single_file_with_recursive", func(t *testing.T) {
vars["FILE_PATH"] = "data/ro/ro_1k.0"
RunCmdAndVerifySuccessWithFile(t, Lakectl()+" fs upload --recursive -s files/ro_1k lakefs://"+repoName+"/"+mainBranch+"/"+vars["FILE_PATH"]+" -s files/ro_1k", false, "lakectl_fs_upload", vars)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ultimately a performance improvement issue. Therefore some benchmarks should be taken to verify not just correctness but comparative performance between new/old implementation in various cases i.e. few small files, lots of small files, few large files, lots of large files...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The download functionality in chunks already implemented as part of single file download
  • The performance improvement is done for large files only, we don't trigger chunks for smaller files. The existing download object is used for non-large files.
  • The code parallel the downloads as sync used to parallel

Copy link
Contributor

@guy-har guy-har left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool

@nopcoder nopcoder merged commit 87c7e0a into master Feb 12, 2025
38 checks passed
@nopcoder nopcoder deleted the feat/download-mutipart branch February 12, 2025 07:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lakectl fs recursive download to use multipart
3 participants