
[epic] Speed of Github API Calls #17

Open
pdelboca opened this issue Jul 15, 2020 · 22 comments

@pdelboca
Contributor

pdelboca commented Jul 15, 2020

Subtasks created post analysis

Specific tasks created from this analysis (and other related ones created previously), in order of fix priority:

Original Bug Report

I'm using the extension with the GitHub backend, and the time it takes to do the GitHub API calls to fetch/create/update is really long. It is bad for the overall UX of the extension (I might say quite unusable).

The delay only happens when using GitHub as the backend; the filesystem backend doesn't have this overhead. For now I'm running it locally, but I don't see many reasons for this to be better when deployed on a hosted service.

For package_update: [screenshot of request timing]

For package_show: [screenshot of request timing]

@shevron
Contributor

shevron commented Jul 16, 2020

This has been a concern of mine for some time. It will probably be slightly better when deployed to Azure, because the round-trip time to GitHub for HTTP requests will most likely be shorter than from your machine, but it is still a problem.

We may be able to optimize a bit by caching some of the requests made to GitHub, at the ckanext-versioning level or the metastore-lib level.

We may also be able to optimize by switching some of the calls, especially for operations that make multiple API calls to fetch related data, to the newer GitHub v4 GraphQL API (see #3 for an example).

But ultimately, calling the GitHub API to fetch data will be slower than calling a local DB or the file system.

@shevron
Contributor

shevron commented Jul 16, 2020

I have created #16 as a possible quick win.

@shevron shevron transferred this issue from datopian/ckanext-versioning Jul 19, 2020
@shevron shevron self-assigned this Jul 19, 2020
@rufuspollock
Member

@shevron where's the profiling you were planning to do?

@shevron
Contributor

shevron commented Jul 21, 2020

I'm attaching some cProfile dumps. They can be viewed nicely with several tools; I use snakeviz, which is pip-installable and runs everywhere.

I recommend zooming in on the first call immediately after t.py in the call stack (which should be the first call made from storage.py) and starting any analysis from there. What we want to look at are the parts between this function (the top-level API action called in storage.py, e.g. create, fetch, etc.) and requestJsonAndCheck in Requester.py, which is where every HTTP request made by PyGithub is performed.

metastore-profiles.tar.gz

This file includes two profiles (I had to tarball them, otherwise GitHub won't allow the upload): one for a create action and one for a fetch action.

I will add some insight into this soon.
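
For anyone who wants to reproduce this, here is a minimal sketch of the workflow I used (the profiled function is a stand-in, not actual metastore-lib code):

```python
import cProfile

def fetch_dataset():
    """Stand-in for whatever metastore-lib call is being measured."""
    ...

# Dump a profile of a single call to a file that snakeviz can open
cProfile.run("fetch_dataset()", "fetch.prof")

# Then, from a shell:
#   pip install snakeviz
#   snakeviz fetch.prof   # opens the interactive call-graph viewer in a browser
```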

@shevron
Contributor

shevron commented Jul 21, 2020

I also recommend using the "sunburst" view in snakeviz; I find it much easier to work with.

@shevron
Contributor

shevron commented Jul 21, 2020

Calling create is quite slow from my machine (in this profile, almost 7 seconds). It may be less critical than fetch, which is much faster, but it is still quite slow.

What we can notice in the profile:

  1. The call to _get_owner takes 0.8 seconds. The owner rarely changes in some use cases (especially the ckanext-versioning case), so we can probably cache it as suggested in #16 ([optimization] GitHub: Reduce number of API calls by caching) to reduce this time to close to 0 (it will still happen on the first use, depending on what caching backend we use).

  2. Most of the time is spent on _create_commit, which is our own code calling 4 different GitHub endpoints (create commit, create tree, get ref for master, edit ref for master). IIRC we have to do all of them, but perhaps this can be optimized (maybe there is some kind of "shortcut API", or maybe GraphQL can do it in one mutation, or...). I need to study this further. A rough PyGithub sketch of this sequence follows the call graph below.

  3. Second place is create_repo, at about 25% of the time, which we simply have to call; there is no way around it.

  4. GitHub requires calling the high-level create_file API on the repo after creating it and before calling any low-level Git APIs such as create_commit. I'm not sure why this is the case, but I have not been able to work around it; they say the repo can't be empty if you want to access the Git API (?). I'm using this to create a stub README.md file, and then create the actual first commit of datapackage.json.

  5. We're also calling get_branch to get the sha of the tip of the default branch before creating the next commit. This is about 11% of the time and may be droppable if there is a way to work around the need; I'll need to dig some more in the GitHub API.

Here is a PNG of the call graph for create:

[image: github-create-callgraph]
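
For reference, a rough PyGithub sketch of the kind of call sequence involved in point 2 (not an exact reproduction of metastore-lib's _create_commit; token, repo name and file content are illustrative):

```python
from github import Github, InputGitTreeElement

g = Github("ghp_...")                           # access token is illustrative
repo = g.get_repo("my-org/my-dataset")

# get ref for master: the sha of the current tip of the default branch
ref = repo.get_git_ref("heads/master")
base_commit = repo.get_git_commit(ref.object.sha)

# create tree: a new tree containing the updated datapackage.json
tree = repo.create_git_tree(
    [InputGitTreeElement("datapackage.json", "100644", "blob",
                         content='{"name": "my-dataset"}')],
    base_tree=base_commit.tree,
)

# create commit: a commit pointing at the new tree
commit = repo.create_git_commit("Update datapackage.json", tree, [base_commit])

# edit ref for master: move the branch tip to the new commit
ref.edit(commit.sha)
```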

@shevron
Contributor

shevron commented Jul 21, 2020

There are also two places where an HTTP request seems to be triggered from PyGithub's _completeIfNeeded. I'm guessing this is only called when some attribute was not provided but has a sensible default, e.g. the default branch name. Maybe we can reduce this (~11% of the time) if we provide some values explicitly.
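
To illustrate the mechanism (a sketch, not taken from the profile): PyGithub returns some objects lazily and only issues a request when an attribute it never fetched is read.

```python
from github import Github

g = Github("ghp_...")   # token is illustrative

user = g.get_user()     # no HTTP request yet; returns a lazily-completed AuthenticatedUser
print(user.login)       # reading an unfetched attribute calls _completeIfNeeded,
                        # which silently issues a GET /user request
```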

@rufuspollock
Member

@shevron 👏 great analysis. We have 3 areas, I guess:

  • Create Dataset
  • Update Dataset
  • Get Dataset

My understanding is this refers to Create Dataset. Overall, I'm actually not that concerned if Create Dataset is slow, since it happens rarely and we could always provide feedback to the user to improve UX (e.g. a spinner and updates like "creating dataset", "adding metadata", ...). If Update Dataset is slow, that may be a bigger issue.

Call to _get_owner is 0.8 seconds. The owner rarely changes in some use cases (esp the ckanext-versioning case) so we can probably cache it as suggested in #16 to reduce this time to close to 0 (will still happen on the 1st use depending on what caching backend we use)

I'm not sure what _get_owner does. Rather than caching (which adds complexity), could this be explicitly provided by the instantiator?

@shevron
Contributor

shevron commented Jul 21, 2020

My understanding is this refers to Create Dataset. Overall, I'm actually not that concerned if Create Dataset is slow, since it happens rarely and we could always provide feedback to the user to improve UX (e.g. a spinner and updates like "creating dataset", "adding metadata", ...). If Update Dataset is slow, that may be a bigger issue.

Correct, this was just a first dump of data (I am adding "fetch" now); I had to leave, so I wanted to save it here before I did. Agreed about UX, but a lot of what is done for create also happens for update, so this analysis is still mostly relevant.

I'm not sure what _get_owner does. Rather than caching (which adds complexity) could this be explicitly be provided by the instantiator?

Sorry, you are right - it is a bit vague... _get_owner() is a private method that fetches either an Organization object or an AuthenticatedUser object from GitHub, depending on whether the owner is the current user (the owner of the access token) or an organization. This object, regardless of its type, is then used solely to call get_repo(name) to get a PyGithub Repository object describing the repository. This is needed for pretty much every action we do.

Perhaps we don't really need to fetch the repo object from GitHub, but can somehow "generate" it. Repository is one of the biggest classes in PyGithub and has a lot of properties, and all except the URL (I think) could be lazy-loaded when needed. Maybe we can create this object and manually set the few properties we need, and it will just work, but I need to investigate this further.
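
If I understand PyGithub correctly, something along these lines might work (a sketch under that assumption, with illustrative names):

```python
from github import Github

g = Github("ghp_...")  # token is illustrative

# Fully fetching the repo costs one round trip:
repo = g.get_repo("my-org/my-dataset")              # GET /repos/my-org/my-dataset

# With lazy=True, PyGithub builds the Repository object from the name alone and
# defers any request until an attribute it doesn't already have is accessed:
repo = g.get_repo("my-org/my-dataset", lazy=True)   # no request yet
contents = repo.get_contents("datapackage.json")    # only this request goes out
```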

@shevron
Contributor

shevron commented Jul 21, 2020

Analysis for fetch:

[image: github-fetch-callgraph]

Fetch is actually quite fast on my machine compared to create - under 1 second. A lot of the time is spent on _get_owner, which as discussed may be optimized out or cached. It is possible that get_repo may also be worked around (another ~15% of the time) using the same strategy discussed above. I don't think the other calls (get_git_ref or get_tag to get the revision sha, get_commit to get the git commit, and get_contents to get the contents of datapackage.json from that commit) can be avoided, but perhaps using the GraphQL API could allow us to send only one or two requests instead of 3.
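
As a rough illustration of the GraphQL idea (a sketch, not metastore-lib code; owner, repo and tag names are made up), resolving a tag and reading datapackage.json can be collapsed into a single request:

```python
import requests

QUERY = """
query($owner: String!, $name: String!, $expr: String!) {
  repository(owner: $owner, name: $name) {
    object(expression: $expr) {
      ... on Blob { text }
    }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={
        "query": QUERY,
        "variables": {"owner": "my-org", "name": "my-dataset",
                      "expr": "v1.0.0:datapackage.json"},  # "<tag>:<path>"
    },
    headers={"Authorization": "bearer ghp_..."},  # token is illustrative
)
print(resp.json()["data"]["repository"]["object"]["text"])
```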

@shevron
Contributor

shevron commented Jul 21, 2020

I noticed that in the OP, @pdelboca reported that package_show took ~7 seconds. I need to analyze this further because, from what I can tell, metastore-lib's fetch is not the culprit; maybe package_show is doing other things. Or we just got lucky/unlucky with a single test.

@shevron
Contributor

shevron commented Jul 21, 2020

Analysis for update (non-partial, and not referencing a base commit, as performed by ckanext-versioning):

[image: github-update-callgraph]

  1. This calls _get_owner / get_repo twice: once when fetching and then again when updating. Surely in this case we can fetch the repo once and pass it around / cache it at the instance level instead (see the sketch after this list). That may shave off about 9% right off the bat.

  2. We call fetch (37%) even though this is not a partial update and we do not need the data. We can simply not do this 🤦 (and if we don't call fetch, we don't really need the previous fix, although it would still be nice to do it anyway to speed up partial updates a bit as well).

  3. We call _create_commit (40%), which works the same way as in create, so the same analysis as above applies.

  4. We call get_branch to get the tip commit sha of the default branch. We could save this somewhere in the CKAN DB and pass it in, but I really don't think this is worth optimizing (less than 5% of the time) and it could create a lot of consistency headaches.
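
A hypothetical sketch of points 1 and 2 (class, method and attribute names are made up, not the actual metastore-lib code):

```python
import json
from github import Github

class GitHubStorageSketch:
    def __init__(self, token, owner):
        self.gh = Github(token)
        self.owner = owner
        self._repos = {}  # name -> Repository, cached per instance

    def _get_repo(self, name):
        # Point 1: look the repo up once per instance and reuse it
        if name not in self._repos:
            self._repos[name] = self.gh.get_repo(f"{self.owner}/{name}")
        return self._repos[name]

    def fetch(self, name):
        repo = self._get_repo(name)
        return json.loads(repo.get_contents("datapackage.json").decoded_content)

    def update(self, name, metadata, partial=False):
        repo = self._get_repo(name)
        # Point 2: only a partial update actually needs the existing metadata
        if partial:
            metadata = {**self.fetch(name), **metadata}
        self._create_commit(repo, metadata)

    def _create_commit(self, repo, metadata):
        ...  # the four-call sequence sketched in the create analysis above
```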

@shevron
Contributor

shevron commented Jul 21, 2020

@rufuspollock I think there is enough here to create some concrete tasks and start fixing. Any thoughts on how to proceed, or should I just create a few additional tickets so we can prioritize them?

@pdelboca
Contributor Author

@shevron when I reported this with the screenshots, I was testing with a dataset that had 4 or 5 versions created, so the ~7 secs in package_show may be related to #3 (although I'm not sure).

@shevron
Contributor

shevron commented Jul 22, 2020

@shevron when I reported this with the screenshots, I was testing with a dataset that had 4 or 5 versions created, so the ~7 secs in package_show may be related to #3 (although I'm not sure).

This is highly likely, as we know tag listing (aka "versions") will slow down linearly with the number of tags.

@rufuspollock
Member

@shevron one question here: I thought that revision tags are loaded in JavaScript (and hence async) and should not block page load (so being slow is not great, but should not block the page)?

@shevron
Contributor

shevron commented Jul 23, 2020

@rufuspollock I thought so too, but I will verify.

@shevron
Contributor

shevron commented Jul 23, 2020

@rufuspollock actually no - the version list is rendered on the server. I think it was AJAX at some point, but that has not been the case for a long time.

@shevron
Contributor

shevron commented Jul 23, 2020

@rufuspollock actually no - the version list is rendered on the server. I think it was AJAX at some point, but that has not been the case for a long time.

I created this ticket to make it async: datopian/ckanext-versioning#38

@rufuspollock
Member

@shevron ok great re async rendering. I think that will help solve a lot of the issue. I note that GitHub only shows the number of releases/revision tags, and the actual rendering is a separate page (I think we may want to go that direction, frankly).

@shevron
Contributor

shevron commented Jul 23, 2020

Specific tasks created from this analysis (and other related ones created previously):

@rufuspollock
Member

@shevron great - can you move this to the top of the issue description and give an assessment next to them (rough payoff and rough cost in days)?

@pdelboca pdelboca changed the title from "Speed of Github API Calls" to "[epic] Speed of Github API Calls" on Aug 3, 2020