Skip to content

Conversation

thawn
Copy link
Contributor

@thawn thawn commented Oct 2, 2025

Description

This pull request optimizes the sync process when synchronization is triggered manually via the Sync now button.

In particular, the synchronization of the github hooks is now always performed asynchronously in the background.

Moreover, github hooks are synchronized in parallel in batches of 20 repos at a time, to ensure that each process finishes within less than 30s, which is the timeout on Zenodo.

This should prevent Github page timeout with 504 error after pressing the Sync now button on Zenodo.

Fixes #176

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

Frontend

Reminder

By using GitHub, you have already agreed to the GitHub’s Terms of Service including that:

  1. You license your contribution under the same terms as the current repository’s license.
  2. You agree that you have the right to license your contribution under the current repository’s license.

thawn added 3 commits October 2, 2025 17:19
prevent Github page timeout with 504 after sync on Zenodo
Fixes inveniosoftware#176
thus, we give the asynchronous tasks some time to finish updating the github hooks
thawn and others added 2 commits October 3, 2025 13:04
@palkerecsenyi palkerecsenyi moved this to In review 🔍 in Sprint Q4/2025 Oct 3, 2025
@palkerecsenyi palkerecsenyi self-assigned this Oct 3, 2025
Copy link
Member

@palkerecsenyi palkerecsenyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @thawn, thank you so much for this PR! Indeed the sync is a big issue on Zenodo, so these changes are highly appreciated. And using parallel jobs is indeed quite a nice way of solving it. I just have a few queries to make sure we optimise the UX as much as possible.

Also, this repository is currently undergoing a large-scale rewrite (see #188) which will eventually change much of the frontend UI. It currently keeps the old syncing behaviour though, so this is something I'd be happy to include in my changes. It will take a few months for this rewrite to be fully production-ready and ready for release in Zenodo. So we could either merge this PR now and include it in the new rewrite, or just wait for the rewrite to be released if it's not too urgent. Do you have a preference?

Thanks again for taking the time to create this fix!

@palkerecsenyi palkerecsenyi removed their assignment Oct 3, 2025
@palkerecsenyi palkerecsenyi moved this from In review 🔍 to In progress in Sprint Q4/2025 Oct 3, 2025
@thawn
Copy link
Contributor Author

thawn commented Oct 3, 2025

Hi @thawn, thank you so much for this PR! Indeed the sync is a big issue on Zenodo, so these changes are highly appreciated. And using parallel jobs is indeed quite a nice way of solving it.

Thank you very much for your kind words and for taking this up so quickly!

I just have a few queries to make sure we optimise the UX as much as possible.

see the answers to your questions above

Also, this repository is currently undergoing a large-scale rewrite (see #188) which will eventually change much of the frontend UI. It currently keeps the old syncing behaviour though, so this is something I'd be happy to include in my changes. It will take a few months for this rewrite to be fully production-ready and ready for release in Zenodo. So we could either merge this PR now and include it in the new rewrite, or just wait for the rewrite to be released if it's not too urgent. Do you have a preference?

Since you were not planning to change the syncing behavior, I would consider this PR independent of your rewrite. If it is all the same to you, I would prefer if it got merged before the rewrite, so that the affected users (including myself) can benefit from it as soon as possible.

@palkerecsenyi palkerecsenyi self-assigned this Oct 6, 2025
@palkerecsenyi palkerecsenyi moved this from In progress to In review 🔍 in Sprint Q4/2025 Oct 6, 2025
@thawn
Copy link
Contributor Author

thawn commented Oct 13, 2025

@yarikoptic @palkerecsenyi I was just wondering if I need to still do anything before this PR can be merged?

@yarikoptic
Copy link
Contributor

no further comments from me -- I would be happy to try on my account to see if helped! ;)

meanwhile I checked that walrus operator could be used also within `for` but not sure if it would help here besides making assignment/use of `batch_size` more explicit for such folks as I am, but there might be come "too much" in that line
❯ cat /tmp/testpy.py
import sys
print(f"Python {sys.version}")
repos = list(range(20))
for i in range(0, len(repos), (batch_size := 5)):
    print(repos[i : i + batch_size])

❯ uv run --python 3.8 /tmp/testpy.py
Python 3.8.20 (default, Oct  2 2024, 16:34:12) 
[Clang 18.1.8 ]
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[10, 11, 12, 13, 14]
[15, 16, 17, 18, 19]

@thawn
Copy link
Contributor Author

thawn commented Oct 14, 2025

@yarikoptic the walrus operator is another new one for me. Thanks again for introducing a new concept to me!

We now get the batch size from a configuration variable, which means the code with the walrus operator would look like:

for i in range(0, len(repos), (batch_size := current_app.config["GITHUB_WEBHOOK_SYNC_BATCH_SIZE"]):
        sync_hooks_task.delay(self.user_id, repos[i : i + batch_size])

Which for me is less readable than the current code:

batch_size = current_app.config["GITHUB_WEBHOOK_SYNC_BATCH_SIZE"]
for i in range(0, len(repos), batch_size):
         sync_hooks_task.delay(self.user_id, repos[i : i + batch_size])

Please let me know which variant you prefer.

@yarikoptic
Copy link
Contributor

keep as is please since as I mentioned it is indeed "too much" in this case.
The only way to help would be with formatting like

for i in range(0,
               len(repos),
               (batch_size := current_app.config["GITHUB_WEBHOOK_SYNC_BATCH_SIZE"])
              ):
        sync_hooks_task.delay(self.user_id, repos[i : i + batch_size])

but it might also be "too much" to some.

As you mentioned, it is only with py 3.12 and itertools it could indeed become ultimately concise:

for batch in itertools.batched(repo, current_app.config["GITHUB_WEBHOOK_SYNC_BATCH_SIZE"]):
    sync_hooks_task.delay(self.user_id, batch)

@ntarocco
Copy link

@thawn I think that this is ready to be merged.
One last question: do you want to update the copyright headers?

@thawn
Copy link
Contributor Author

thawn commented Oct 16, 2025

I added myself to the list of authors - I hope that was the right thing to do. if not, please let me know.

@ntarocco
Copy link

I added myself to the list of authors - I hope that was the right thing to do. if not, please let me know.

If you are affiliated to an organization which needs to be the copyright holder, you add it to the headers of the files that you have changes, see doc here.

@thawn
Copy link
Contributor Author

thawn commented Oct 17, 2025

No additional changes to copyright are required from my side.

@thawn
Copy link
Contributor Author

thawn commented Oct 17, 2025

should I remove myself from the authors list again? The changes to the main code are minor. Most of the code I contributed were unit tests.

@ntarocco ntarocco merged commit abedae5 into inveniosoftware:master Oct 18, 2025
2 checks passed
@github-project-automation github-project-automation bot moved this from In review 🔍 to To release 🤖 in Sprint Q4/2025 Oct 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: To release 🤖

Development

Successfully merging this pull request may close these issues.

Github page timeout with 504 after sync

4 participants