[epic] Speed of Github API Calls #17
This has been a concern of mine for some time. I think it will be slightly better when deployed to Azure, because the round-trip time of HTTP requests to GitHub will most likely be shorter than from your machine, but it is still a problem. We may be able to optimize a bit by caching some of the requests made to GitHub, at the ckanext-versioning or metastore-lib level. We may also be able to optimize by switching some of the calls, especially requests that make multiple API calls to fetch related data, to the newer GitHub GraphQL API (v4) (see #3 for example). But ultimately, calling the GitHub API to fetch data will always be slower than calling a local DB or the file system.
I have created this: #16 as a possible quick win.
@shevron where's the profiling you were planning to do?
I'm attaching some cProfile dumps. They can be viewed nicely with several tools; I use snakeviz. I recommend zooming in on the first call. This file includes two profiles (I had to tarball them, otherwise GitHub won't allow the upload): one for a create action and one for a fetch action. I will add some insights soon.
I also recommend using the "sunburst" view if using snakeviz, I find it much easier to work with. |
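For anyone reproducing this, a short sketch of how such dumps can be produced and inspected. The `create_dataset` function is a stand-in for the real action being profiled; the `.prof` file is what snakeviz opens (`snakeviz create.prof`).

```python
import cProfile
import io
import pstats

def create_dataset():
    # Stand-in for the real metastore "create" action being profiled.
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
create_dataset()
profiler.disable()

# Dump to a file that snakeviz (or pstats) can open later:
profiler.dump_stats("create.prof")

# Or inspect the top cumulative-time entries directly:
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Sorting by cumulative time is what surfaces the expensive HTTP calls, since each network round trip dominates its whole call subtree.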
Calling … What we can notice in the profile:
Here is a PNG of the call graph for …
There are also two places where an HTTP request seems to be triggered from PyGithub's _completeIfNeeded().
@shevron 👏 great analysis. We have 3 areas I guess:
My understanding is this refers to Create Dataset. Overall, I'm actually not that concerned if Create Dataset is slow, since it happens rarely, and we could always provide feedback to the user to improve UX (e.g. a spinner and updates like "creating dataset", "adding metadata", ...). If Update Dataset is slow, that may be a bigger issue.
I'm not sure what _get_owner does. Rather than caching (which adds complexity), could this be explicitly provided by the instantiator?
Correct, this was just a first dump of data (I am adding "fetch" now); I just had to leave, so I wanted to save it here before I do. Agree about UX, but a lot of what is being done for …
Sorry, you are right, it is a bit vague... Perhaps we don't really need to fetch the repo object from GitHub, but can somehow "generate" it. It is one of the biggest classes in PyGithub and has a lot of properties, and all except the URL (I think) could be lazy-loaded when needed. Maybe we can create this object and manually set some properties we need, and it will just work, but I need to investigate this further.
Analysis for fetch: it is actually quite fast on my machine compared to create.
I noticed that in the OP, @pdelboca reported that …
Analysis for update (non-partial, and not referencing a base commit, as performed by ckanext-versioning):
@rufuspollock I think there is enough here to create some concrete tasks and get fixing. Any thoughts on how to proceed or should I just create a few additional tickets and we can prioritize them? |
This is highly likely, as we know tag listing (aka "versions") will slow down linearly with the number of tags. |
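One way around the linear cost of REST tag pagination, in line with the GraphQL idea mentioned earlier in the thread, would be a single GraphQL request that returns only the newest N tags plus a total count. A hedged sketch, building the request body only (actually sending it requires an auth token and a POST to https://api.github.com/graphql; `build_tags_request` is an invented helper):

```python
import json

# GitHub GraphQL v4: fetch the newest N tags and the total tag count
# in a single request, instead of paging through every tag via REST.
QUERY = """
query($owner: String!, $name: String!, $first: Int!) {
  repository(owner: $owner, name: $name) {
    refs(refPrefix: "refs/tags/", first: $first,
         orderBy: {field: TAG_COMMIT_DATE, direction: DESC}) {
      totalCount
      nodes { name }
    }
  }
}
"""

def build_tags_request(owner, name, first=10):
    """Build the JSON body for a POST to the GraphQL endpoint.
    Authentication and the HTTP call itself are omitted here."""
    return json.dumps({
        "query": QUERY,
        "variables": {"owner": owner, "name": name, "first": first},
    })

body = build_tags_request("datopian", "ckanext-versioning", first=5)
```

This keeps page rendering O(1) in the number of tags: the version list shows the first page plus "N total", matching the separate-releases-page direction discussed below.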
@shevron one question here: I thought that revision tags are loaded in JavaScript (and hence async) and should not block page load, so being slow is not great but should not block the page (?)
@rufuspollock I thought so too, but I will verify. |
@rufuspollock actually no, the version list is rendered on the server. I think it was AJAX at some point, but that hasn't been the case for a long time.
I created this ticket to make it async: datopian/ckanext-versioning#38 |
@shevron ok great re async rendering. I think that will help solve a lot of the issue. I note that GitHub only shows the number of releases/revision tags and that the actual listing is rendered on a separate page (I think we may want to go in that direction, frankly).
Specific tasks created off this analysis (and others related created previously):
@shevron great - can you move this to the top of the issue description and give an assessment next to them (rough payoff and rough cost in days).
Subtasks created post analysis
Specific tasks created off this analysis (and others related created previously), in order of fix priority:
- … update … if not doing a partial update #21 (super easy, high impact on fetch)
- … where _completeIfNeeded() is triggered, to see if we can avoid them by providing defaults: [performance] Eliminate calls to PyGithub _completeIfNeeded #23 (low complexity, impact unsure, anywhere between non-existent and medium)
- … create … (… but not sure it can be done).

Original Bug Report
I'm using the extension with the GitHub backend, and the time it takes to do the GitHub API calls to fetch/create/update is really long, which is not good for the overall UX of the extension (I might say it is quite unusable).
The overall delay only happens when using the github backend, since the filesystem backend doesn't have this time overhead. For now I'm running it locally, but I don't see much reason for this to be better when deployed to hosted services.

For package_update: …

For package_show: …