-
Notifications
You must be signed in to change notification settings - Fork 29
api: optimize sync process with batch task execution #197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
prevent Github page timeout with 504 after sync on Zenodo Fixes inveniosoftware#176
thus, we give the asynchronous tasks some time to finish updating the github hooks
thanks to @yarikoptic for the suggestion! Co-authored-by: Yaroslav Halchenko <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @thawn, thank you so much for this PR! Indeed the sync is a big issue on Zenodo, so these changes are highly appreciated. And using parallel jobs is indeed quite a nice way of solving it. I just have a few queries to make sure we optimise the UX as much as possible.
Also, this repository is currently undergoing a large-scale rewrite (see #188) which will eventually change much of the frontend UI. It currently keeps the old syncing behaviour though, so this is something I'd be happy to include in my changes. It will take a few months for this rewrite to be fully production-ready and ready for release in Zenodo. So we could either merge this PR now and include it in the new rewrite, or just wait for the rewrite to be released if it's not too urgent. Do you have a preference?
Thanks again for taking the time to create this fix!
Thank you very much for your kind words and for taking this up so quickly!
see the answers to your questions above
Since you were not planning to change the syncing behavior, I would consider this PR independent of your rewrite. If it is all the same to you, I would prefer if it got merged before the rewrite, so that the affected users (including myself) can benefit from it as soon as possible. |
@yarikoptic @palkerecsenyi I was just wondering if I need to still do anything before this PR can be merged? |
no further comments from me -- I would be happy to try on my account to see if helped! ;) meanwhile I checked that walrus operator could be used also within `for` but not sure if it would help here besides making assignment/use of `batch_size` more explicit for such folks as I am, but there might be come "too much" in that line❯ cat /tmp/testpy.py
import sys
print(f"Python {sys.version}")
repos = list(range(20))
for i in range(0, len(repos), (batch_size := 5)):
print(repos[i : i + batch_size])
❯ uv run --python 3.8 /tmp/testpy.py
Python 3.8.20 (default, Oct 2 2024, 16:34:12)
[Clang 18.1.8 ]
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[10, 11, 12, 13, 14]
[15, 16, 17, 18, 19] |
@yarikoptic the walrus operator is another new one for me. Thanks again for introducing a new concept to me! We now get the batch size from a configuration variable, which means the code with the walrus operator would look like: for i in range(0, len(repos), (batch_size := current_app.config["GITHUB_WEBHOOK_SYNC_BATCH_SIZE"]):
sync_hooks_task.delay(self.user_id, repos[i : i + batch_size]) Which for me is less readable than the current code: batch_size = current_app.config["GITHUB_WEBHOOK_SYNC_BATCH_SIZE"]
for i in range(0, len(repos), batch_size):
sync_hooks_task.delay(self.user_id, repos[i : i + batch_size]) Please let me know which variant you prefer. |
keep as is please since as I mentioned it is indeed "too much" in this case. for i in range(0,
len(repos),
(batch_size := current_app.config["GITHUB_WEBHOOK_SYNC_BATCH_SIZE"])
):
sync_hooks_task.delay(self.user_id, repos[i : i + batch_size]) but it might also be "too much" to some. As you mentioned, it is only with py 3.12 and itertools it could indeed become ultimately concise: for batch in itertools.batched(repo, current_app.config["GITHUB_WEBHOOK_SYNC_BATCH_SIZE"]):
sync_hooks_task.delay(self.user_id, batch) |
@thawn I think that this is ready to be merged. |
I added myself to the list of authors - I hope that was the right thing to do. if not, please let me know. |
If you are affiliated to an organization which needs to be the copyright holder, you add it to the headers of the files that you have changes, see doc here. |
No additional changes to copyright are required from my side. |
should I remove myself from the authors list again? The changes to the main code are minor. Most of the code I contributed were unit tests. |
Description
This pull request optimizes the sync process when synchronization is triggered manually via the
Sync now
button.In particular, the synchronization of the github hooks is now always performed asynchronously in the background.
Moreover, github hooks are synchronized in parallel in batches of 20 repos at a time, to ensure that each process finishes within less than 30s, which is the timeout on Zenodo.
This should prevent Github page timeout with 504 error after pressing the
Sync now
button on Zenodo.Fixes #176
Checklist
Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:
Frontend
Reminder
By using GitHub, you have already agreed to the GitHub’s Terms of Service including that: