[Question] Explain how sync between buckets works; slow overall speed #917
Hey, replication is a cool new feature of B2 that sounds like it might be perfect for this case. Your objects are very small and you are processing a ton of them, which might (I guess) result in the server throttling your operations, so threads wait until the limit is lifted. Perhaps you can try that? It should run much faster than sync. With 30k files in 2h that's about 4 files per second; assuming ~75 KB per file, that works out to roughly 312 KB/s, but you are reporting 70-100 kB/s. I'm not sure what's up with that. Which cluster are you using? If there is a performance issue with the CLI, I'd like to try to replicate it. Is it the same 30k files where you only change a few of them, or a different 30k files every time?
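For the record, a minimal sketch of configuring replication from the CLI rather than the web UI. This assumes a CLI version that ships the `replication-setup` command; the flag name for copying pre-existing files is from memory and may differ between versions:

```bash
# sketch: set up server-side replication between two buckets
# (assumes the replication-setup command of recent CLI versions;
#  the --include-existing-files flag name may differ by version)
b2 replication-setup --include-existing-files sourceBucket destinationBucket

# replication then runs server-side, with no client threads involved
```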
Hey @ppolewicz 👋
Well, I've tried replication before by setting it up from the UI, but I found it very unintuitive. It gives almost no indication of how long it actually takes before it's done replicating, and the tutorial says that for existing files "it can take from a few minutes to a few hours". As my experience was also that it takes hours to replicate, I switched to using the CLI instead. Replication also doesn't really fit my use case.
Every 1 to 2 months I run the sync command above, and in the last 6 months the production bucket accumulated 6k additional images (24k before), so you can say roughly 1k images are added to the production bucket each month.
If you refer to the endpoint/region, it is `eu-central`.

I performed a sync earlier today (with the same command I've used above), which again took roughly 1.5-2 hours. So now that the buckets' contents are basically identical, it shouldn't take much time to sync again, right? But as I'm currently running it again, it gives similar speeds in the range of 70-100 kB/s.

After diving a bit deeper into the documentation I eventually found the `--compareVersions` option. So what I did next was to try `--compareVersions none` and `--compareVersions size`. Now looking back, maybe I had unwittingly assumed that the sync would make a complete copy of the file and its properties (like modified time). But I guess it makes sense that the "modified time" for the new file in the destination bucket is newer than the one from the source.

TLDR, the variations I've tried:

```
b2 sync --threads <10|100|500> --delete --replaceNewer --compareVersions <none|size> b2://sourceBucket b2://destinationBucket
```

@ppolewicz final question 🏁 I've now completely wiped my development bucket clean and started a new sync. It currently only performs at similarly low speeds.
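One way to see what actually landed on the destination copy is to inspect the file metadata. A sketch using v3-era command names (`ls --long`, `get-file-info`), which may differ in other CLI versions:

```bash
# sketch: list destination files with their upload timestamps and fileIds
# (v3-era command names; output columns may vary by version)
b2 ls --long --recursive destinationBucket | head

# then look at a single file's info, e.g. the modification time recorded
# in file info (src_last_modified_millis, if the uploader recorded it),
# using a fileId taken from the listing above
b2 get-file-info <fileId>
```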
In order to determine if the server is throttling, you'll have to enable logs (passing `--verbose`) and look for retries and back-offs.

On any storage system based on erasure coding and HDDs, read performance for small files is not going to be great. If you synced a few bigger files, the speed would go way up. There is a performance bottleneck somewhere: either the server is throttling you, or Python threading is not doing a very good job with all those copy operations. 6/s is way below what I'd expect to see though, so my bet would be on throttling. I'm not familiar enough with the throttling subsystem that Backblaze B2 eu-central is currently running, but from the client's perspective you should be able to observe the retries and threads backing off. If you confirm it's not retries and throttling, then I'll take a look at reproducing and analyzing its performance - B2 and the associated tools are supposed to handle 10 TB objects and buckets with a billion objects, so not being able to deal with 30k files in a timely manner could be a bug.

What are you running this on? Windows, Linux? How did you install, from pip or binary?
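A sketch of what scanning for throttling might look like; the `--debugLogs` flag and the `b2_cli.log` file name are taken from recent CLI versions, and the grep patterns are guesses at how retries would surface in the log:

```bash
# sketch: run sync with debug logging, then count signs of throttling
# (--debugLogs / b2_cli.log per recent CLI versions; patterns are guesses)
b2 sync --threads 100 --debugLogs b2://sourceBucket b2://destinationBucket
grep -Eic 'retry|retrying|throttl|too many requests|service unavailable' b2_cli.log
```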
Hey @ppolewicz I've tried adding `--verbose` but I can't say I see any keywords related to "throttling", "retry" or "back off" limits. Here's a gist with a portion of the logging (I only ran it for a couple of seconds and truncated it to 2300 lines, plus redacted some information). Maybe you can spot things that are out of the ordinary. I've been running it on the following systems:
The log only shows scanning and 18 transfers - the server wouldn't throttle you so early. You'd have to run it longer and then show a tail of the log (2k lines would be ok). Since you are running Ubuntu, it would be easy to install the CLI from pip and re-run it.
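A sketch of that workflow; the assumption that the `--verbose` output goes to stderr is mine:

```bash
# install/upgrade the CLI from pip (PyPI package name: b2)
python3 -m pip install --upgrade b2

# run the sync for longer, capturing the log, then keep the last
# ~2000 lines to share (assumes --verbose output goes to stderr)
b2 sync --threads 10 --verbose b2://sourceBucket b2://destinationBucket 2> sync.log
tail -n 2000 sync.log > sync-tail.log
```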
@ppolewicz But if it already sticks to the same speed, would installing from pip make a difference? I will try your suggestions later today.
@ppolewicz Installed it with pip and ran the same initial command: roughly the same performance. Ran it for roughly 30 minutes and then pasted the tail (2135 lines) into the gist.
We'll be setting up an environment to test the performance of large files this week, and after that happens we'll circle back to this one to test the performance of small files too. Thank you for the detailed bug report.
Problem

I have a `sourceBucket` with 30,000 images (50-100 KB each), roughly 1.4 GB of storage in total, and I want to sync it to a `destinationBucket`. As this doesn't involve that many files, nor a large total size, I'm stumped that it takes 1.5 to 2 hours to complete. The output of the command above shows between 70-100 kB/s (deteriorating over time), which seems kind of low, even when I run this on a production server with 10 Gbps up/down.

Question

As it seems that neither the network speed nor the number of threads has much impact on the overall performance, could you provide answers to the following questions? One thing I've already tried is varying the `threads` argument, but it only changes the speed by roughly +/- 10 kB/s.
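For what it's worth, a sketch of how the thread-count comparison could be timed end to end (the bucket names and thread counts are the placeholders used in this thread; the timing wrapper is just one option):

```bash
# sketch: time a full sync at each thread count to compare throughput
# (bucket names are placeholders from this thread)
for t in 10 100 500; do
    echo "threads=$t"
    time b2 sync --threads "$t" --delete --replaceNewer \
        b2://sourceBucket b2://destinationBucket
done
```

Note that after the first pass the buckets are identical, so subsequent passes mostly measure scanning rather than transfer.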