Refresh Capability for Downloading Github Data #282
Related issue with endpoint docs: #253 |
@Ssunoo2 You are correct that the current endpoints are different. This is precisely the motivation for this refresher: you need to use a new endpoint. With our current endpoints, you run into issue #253. So you have more work to do here compared to JIRA in regards to the endpoint and (possibly) the parser, but less work on the logic since you can rely on gh. Could you elaborate, with a one-issue example, on how the two differ? Are the fields the exact same? Please remember to create a branch on Kaiaulu asap, so you don't forget and start one from the fork instead.
I don't understand the problem here. Can you elaborate? |
@carlosparadis Thanks for confirming I am on the right path! I made a branch here: https://github.com/sailuh/kaiaulu/tree/282-github-downloader-refresh so hopefully it is visible to you as well. The fields do appear to be the exact same; I'm just commenting that the files are formatted differently and I discovered this after writing the code to name the files. I would access the created_at field with something like data[i]$created_at which doesn't work with the raw downloaded data from the search endpoint because it would need to be accessed via data[i]$item$created_at or something similar. But this should be a non-issue once I find out how to convert the file to the original schema |
Perfect. When you make a commit to this branch, it should start offering to make a PR against master. At that point, we should be able to do code reviews as usual. The main difference is that everyone else will be able to commit against it. Bear in mind you will want to coordinate a bit more by sharing the same branch to avoid constant merge conflicts (e.g. you may want to have a local branch where you make most commits, and only push to this branch when ready for review). If you are not planning to collaborate, then you can just commit to it as usual. I see the differences! My suggestion is that you use a functional approach: you can use lapply(list_variable, function_passed_as_parameter). You can then code the function assuming its parameter is one element of the list, i.e. what you get after accessing it with data[[i]] (note the use of double [[; it returns something different than data[i]). Inside the function, you can transform the element data to match the existing GitHub parser. The only thing I'd caution you about is which fields are actually available. For example, some fields, instead of containing the information, will contain a URL pointing to where the information is. Make sure that the existing downloader and the downloader you implement not only have the right fields, but also the values we would otherwise obtain from them. To that effect, running the parser against downloaded issue and comment data to have a quick look at how the tables turn out could be helpful! |
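A minimal sketch of that functional approach, assuming gh_response is the list returned by the search endpoint and that flatten_search_item is a hypothetical helper (not an existing Kaiaulu function):
# Hypothetical helper: reshape one search-endpoint item so its fields sit at
# the level the existing GitHub parser expects, dropping search-only metadata.
flatten_search_item <- function(item) {
  item[setdiff(names(item), "score")]
}
# gh_response$items holds the issues returned by /search/issues
issues <- lapply(gh_response[["items"]], flatten_search_item)
# issues[[1]][["created_at"]] is now accessed the same way as data
# returned by the original issues endpoint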
To download issue comments, I found that the search endpoint does not allow us to retrieve just comments. However, the issues/comments endpoint does allow us to download just comments. The issues endpoint does not have a 'created' parameter that lets us filter by created_at values like the search endpoint does. It does, however, have a 'since' parameter. The since parameter downloads all comments created or updated after a certain date. gh::gh("GET /repos/{owner}/{repo}/issues/comments",
       owner=owner,
       repo=repo,
       since=created, # Pass the `since` parameter in the API request
       page=1,
       per_page=100,
       .token=token) |
As discussed on Discord: The refresher should save the data as is, without modifying it. The parser of the refresher should not attempt to modify the file to reuse the existing parser. Copy and paste the existing function if needed and make the necessary adjustments, even if small. In regards to whether the GitHub refresher can maintain issues in the same folder as the issue downloader: the answer is likely no, since they have different endpoints under the hood. Let's move with the assumption that to use the refresher, the user has to specify a different folder. There are a few things I would like you to check:
|
The current difference in the .json format is mentioned in this comment. An easy way to return just the issue data is by writing issues$items to file, but if the data is not to be altered, then this won't do. I had assumed that this comment urged me to alter the downloaded data to fit the parsers. The altered format does also currently work with the parsers in the download_github_comments.rmd notebook. If we were to download only the raw data, one consequence is that github_iterate_pages(), which writes the pages to file along the save path and names them using the unix times of the oldest/newest created dates in the file, would have to change: the 'created_at' value it checks is nested differently in the .json files returned by the search endpoint, so an if/else statement that checks which level of nesting applies would likely be needed. The difference in nesting also affects the parsers. |
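A rough sketch of the kind of nesting check described above (the helper name and exact structure are assumptions, not the final implementation):
# Return the created_at values of a downloaded page regardless of endpoint:
# the search endpoint nests issues under `items`, while the issues and
# issues/comments endpoints return them at the top level.
get_created_dates <- function(page) {
  items <- if (!is.null(page[["items"]])) page[["items"]] else page
  sapply(items, function(item) item[["created_at"]])
}
# The oldest/newest values can then feed the unixtime-based file naming:
# range(as.POSIXct(get_created_dates(page), format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"))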
Added a check to both the issue and comment refreshers that verifies whether the directory is empty before running. If the directory is empty, the regular downloader runs instead, because the refresher would otherwise break when trying to parse file names. The issue refresh now downloads raw data and does not alter it, but github_api_iterate_pages() cannot write it to file yet because it depends on an altered format.
- Does the issues endpoint download both issues and pull requests?
- Does the refresh use the same amount of API calls as the existing downloader?
|
The issue downloader uses the search endpoint and specifies the 'created' parameter to retrieve data for issues created after a certain date. Since there is not a good way to retrieve comments with the search endpoint, the issue comment downloader uses the issues/comments endpoint with the 'since' parameter, which downloads all data updated AND created after the specified date/time. Because the thin documentation of the 'since' parameter resembles the wording of the 'updated' parameter from the search endpoint, I will download issue data from the search endpoint twice, once using the created parameter and once using the updated parameter, in order to compare the results. |
Issue data downloader discrepancies using issue versus search endpoints
The GitHub issues refresher uses the search endpoint to download all issue data that was created after a certain date. The nesting of the issue data returned by the search endpoint is different from the data downloaded by the original issue downloader. Refer to this comment for details on the differences.
Initial solution
It is imperative that the original data be downloaded in its UNALTERED form. The initial solution for this difference in data structure is to create a new folder that houses the refresh data. This adds several complexities: the refresh function, as well as github_api_iterate_pages() which writes the data to file, would need a way to differentiate the different nesting in the file. In order to parse out any values, separate code must be written. This includes parsing the created_at dates to retrieve the minimum and maximum values for the naming convention schema, as well as creating the search queries for GitHub API GET requests. This also means that to construct the search query, both folders must be searched for the greatest date. Further, the parser would need to be rewritten so that it can parse data in both forms.
A potential better solution
If the issue data downloaded from the issues endpoint is identical to that downloaded from the search endpoint except for the nesting, then perhaps the better solution is to not use the issues endpoint in the first place. This means that many of the parsing functionalities written across the downloaders and parser may need to be edited, but it removes the complexity of having data in two different folders and in two different forms. |
Unaltered data downloading
I noticed when reviewing the github_api_iterate_pages() function, which writes the data to JSON files along a filepath, that this code is being used: # Write to file
write_json(gh_response,file_name,
           pretty=TRUE,auto_unbox=TRUE)
auto_unbox=TRUE does alter the data, so I will look into a better way to write it; perhaps something like writeLines(). |
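To illustrate what auto_unbox changes, a small self-contained jsonlite example (not the downloader code itself):
library(jsonlite)
x <- list(created_at = "2024-04-18T12:00:00Z", labels = c("bug", "refresh"))
# auto_unbox = TRUE writes length-1 vectors as JSON scalars...
toJSON(x, auto_unbox = TRUE)
# {"created_at":"2024-04-18T12:00:00Z","labels":["bug","refresh"]}
# ...while auto_unbox = FALSE keeps them as length-1 arrays, which is the
# sense in which write_json() alters the shape of the data.
toJSON(x, auto_unbox = FALSE)
# {"created_at":["2024-04-18T12:00:00Z"],"labels":["bug","refresh"]}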
Since query for the issues/comments endpoint
The since query downloads all data updated after a certain date. This includes all data created after that date, because the updated_at field is initialized to the created_at value when the issue or comment is made. To test whether the since parameter in the issues/comments endpoint will download comments created before the specified date/time but updated after it, I found a comment that was updated a day after it was created: gh::gh("GET /repos/{owner}/{repo}/issues/comments",
       owner=owner,
       repo=repo,
       since="2023-03-16T11:01:54Z", # Pass the `since` parameter in the API request
       page=1,
       per_page=100,
       .token=token)
The comment was downloaded again.
Difference in issue data downloaded from the issue versus search endpoint
Aside from the difference in nesting described in this comment, the data downloaded for each issue is identical, except that the search query adds a 'score' field: |
We decided that we will keep a separate folder for the GitHub issue refresher for now, since the format is different. In the future, this can be revisited to determine if the search endpoint can be used for the regular downloader as well. I will add a function that allows downloading issues based on upper and lower bound created_at ranges. I will focus on finishing the refreshers/downloaders and then request code review. I will then work on making a separate parser function that runs on the data from the search endpoint. I will triple check that there is no other good way to download comment data from other search endpoints. Then I will check if there is any kind of indicator in the issue comment downloader that identifies duplicates of the same comment, for example if a comment was downloaded, then updated, and subsequently downloaded again (the refresher uses the 'since' parameter). If there is some sort of identifier, then we may be able to filter the downloaded data before feeding it to the parsers. |
@Ssunoo2 Thanks for adding all the notes! In regards to this comment: #282 (comment) Are we unable to just download the data as is, as we did with the JIRA downloader, before it is turned into a list because of the use of gh? |
I noticed this recently, so I am going to test downloading the raw data by itself. If it works with the parsers, great! Otherwise, potentially the parsers can alter the data later. |
For future refreshers:
Pull requests
Pull requests can be retrieved using the search endpoint. Adding is:pr to the search/issues query should allow this: query <- sprintf("repo:%s/%s is:pr created:>%s", owner, repo, created)
# Use the Search API endpoint to search for pull requests
pull_requests <- gh::gh("/search/issues",
                        q = query,
                        .token = token,
                        .limit = 100)
Commits
Commits, like the issues endpoint, also use the 'since' parameter (see the sketch below).
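A hedged sketch of what the corresponding commits call could look like, mirroring the since pattern used for comments (the timestamp is a placeholder):
# The commits endpoint accepts `since` (ISO 8601), analogous to issues/comments.
commits <- gh::gh("GET /repos/{owner}/{repo}/commits",
                  owner = owner,
                  repo = repo,
                  since = "2024-04-18T12:00:00Z",
                  page = 1,
                  per_page = 100,
                  .token = token)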
|
Regarding this comment discussing the use of write_json() in the GitHub downloader: from my research, the GitHub API returns the data formatted as JSON. The gh package we are using appears to convert this data into an R object, likely a list. From the gh GitHub page: "The JSON result sent by the API is converted to an R object." So using the gh package may violate our requirement that the data from the API call remain unaltered. This is likely why write_json() is being used in the current GitHub downloader. If the GitHub API data were unaltered, we should be able to use writeLines() to transfer the raw data to a JSON file, but in my experimentation this does not work, and I believe the gh package to be the culprit. |
@Ssunoo2 thanks for confirming. If that is the case, just proceed with the data object and the auto_unbox parameter to make it as close as possible to the raw data. Replacing gh at this point is not viable. Can you please point me here to the comment on the issue tracker where we discuss the effect of using auto_unbox? I don't remember where it is. If you can't find it, or it is faster, you may just want to comment it here for future reference. Thanks! |
The comment that discusses the effect of auto_unbox begins here. |
@cp I found another issue with the 'since' parameter. It seems that since downloads anything updated at or after the given date, so it is inclusive of the given date. This means that running the comment refresher when it is already up to date will keep re-downloading the last comment into a single file. It also likely means there may be duplicate comment data whenever the refresher is called. I am brainstorming solutions to this problem; perhaps adding a second to the time passed to the since parameter may work? The only problem is that if another comment was made or updated at the exact same date down to the second, it may be skipped. |
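One way to sketch the "add a second" idea in R, purely as an illustration of the trade-off described above (not a committed solution):
# Bump the latest known timestamp by one second before passing it to `since`,
# so the already-downloaded comment is not fetched again. The trade-off: a
# comment created or updated within that same second would be skipped.
latest <- "2023-03-16T11:01:54Z"
latest_posix <- as.POSIXct(latest, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
since <- format(latest_posix + 1, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
since
# [1] "2023-03-16T11:01:55Z"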
For the comments part, I have a similar request. Also, we should ensure all these functions explicitly say in the parameter name whether it is created or updated that is being bounded. |
Further clarifications from Discord: github:
  project_key_1:
    # Obtained from the project's GitHub URL
    owner: sailuh
    repo: kaiaulu
    # Download using `download_github_comments.Rmd`
    issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
    issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
    pull_request: ../../kaiaulu/github/pull_request/sailuh_kaiaulu/
    commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
The issue_or_pr_comment should be storing data of this endpoint: https://docs.github.com/en/rest/issues/comments?apiVersion=2022-11-28#list-issue-comments-for-a-repository the one we have been calling
I'd actually like you to rename the field to
You are not touching the
You are also creating the
On the long run (not this project timeline), I may either remove the issues endpoint altogether, or modify it to also collect data using the since (as we are with comments). However, since the search endpoint also has an update parameter, it seems to me the /issues endpoint is redundant and less capable, so my inclination is just to get rid of it on the long run and call the |
Function documentation has also been updated. I am still working on the notebook documentation.
Yes, de-coupling the refresher into two functions is totally doable, and I agree that it will add more potential utility. Your understanding that the folders are essentially for different endpoints is also correct, so I will change the directory to 'issue_search'. This does make more sense I think, especially for others who will contribute to this at a later time.
Yes, we can specify time zones. UTC Time: YYYY-MM-DDTHH:MM:SSZ
Positive Offset (east of UTC): YYYY-MM-DDTHH:MM:SS+HH:MM
Negative Offset (west of UTC): YYYY-MM-DDTHH:MM:SS-HH:MM
For example, if you are specifying a time in Eastern Standard Time (UTC-5 hours),
it would be written as 2024-04-18T12:00:00-05:00. |
Refreshers and download-by-date functions for issues are now wrappers around this function. Modified github_api_project_issue_or_pr_comments() to optionally take a 'since' parameter. The refresher and download-by-date for comments are now wrappers around this function. Changed the names of date_lower_bound and similarly named functions to updated_lower_bound or created_lower_bound.
Reorganized sections and code blocks to facilitate showcasing: - How to download issues using the search endpoint - Seeing the results immediately after - How to download issue comments - Seeing the results immediately after - Moving the older endpoints that do not use refresh to a separate section. Signed-off-by: Carlos Paradis <[email protected]>
By testing the notebook and comparing outputs, I found the search endpoint only downloads issues. Also added a new notebook to display the communication network for GitHub. Minor changes to graph display in a few notebooks so the graph is enlarged in interactive form in the notebook. Signed-off-by: Carlos Paradis <[email protected]>
… whether issue or pull requests should be retrieved
This is the comment on how the endpoint uses time zone: #282 (comment) |
The commit 2d42d1a adds a new parameter to the search endpoint so we can also download PRs only. This means the parser for the search endpoint for issues can be used for the PRs. The notebook adds an example on how to request either, and an IF to decide on the folder path. Note the search endpoint only allows downloading issues XOR PRs. There is no way to download both at the same time. For that you have to rely on the /issues endpoint (which does not offer a way to use the created_at attribute, only the updated_at via the since parameter). Currently, as far as the GitHub refresher is concerned, the following endpoints now offer a refresher:
Other endpoints have not been added to the refresh capability (all that did are cited in this comment already, which is the search endpoint and the download_issues_or_pr_comments function). The ones still missing are the commits endpoint, the events endpoint (for which there is no concept of duplicates, because events are unique and cannot be edited), and, if needed, the /issues endpoint (which is a bit redundant to the search endpoint, other than being able to download both issues and PRs in one go, but will introduce duplicates). A final note: adding the commits endpoint may require changing the if/else in the iterate function that tests the different levels of nesting. It should check the search, issues, and commits endpoints (commits is already handled in iterate_pages, so it is backwards compatible, even if it is not a refresher yet).
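A minimal sketch of how the issues-XOR-PRs toggle on the search query could look (the issue_or_pr variable name is illustrative, not necessarily what the commit uses):
# The search endpoint accepts is:issue or is:pr, but not both in one request.
issue_or_pr <- "is:issue"  # or "is:pr"
query <- sprintf("repo:%s/%s %s created:>%s", owner, repo, issue_or_pr, created)
results <- gh::gh("/search/issues", q = query, .token = token, .limit = 100)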
- Three notebooks were refactored to expect the use of the getter functions from R/config.R (issue #230 contains the getter functions in R/config.R).
- The project configuration sections of each notebook were incorrectly using the project directory (kaiaulu/) rather than the directory they reside in (vignettes/) as their working directory.
Adds GitHub refresh and re-enable functionality for parsing replies, social smells, motifs and more. Signed-off-by: Carlos Paradis <[email protected]> --------- Signed-off-by: Carlos Paradis <[email protected]> Co-authored-by: Sean Sunoo <[email protected]> Co-authored-by: Carlos Paradis <[email protected]> Co-authored-by: Nicholas Beydler <[email protected]>
It's been 8 months!! Glad this is finally making it to master. Thank you @beydlern for your efforts in updating this to the new config specification so we could integrate everything without breaking. I also fixed parse_github_replies() if you're interested in seeing the diff. You already had the paths in the new specification; I just exposed the paths hardcoded inside as parameters. The function is a bit counterintuitive because it takes commit messages when it suggests it is there to combine issue and PR comments. However, commit data is needed to find who is who on the git log: the GitHub API only gives GitHub usernames, not the name and e-mail we find on the git log. The commit API gives username, name, and e-mail, so the bridge can be performed. Pointing this out because it is another hidden linkage between the tables that we should point out in the data schema #319 #226. Closing this issue!
1. Purpose
The purpose of this issue is to create a refresh capability for download_github_comments.rmd. Currently, the issue data is downloaded along a filepath, 100 issues per page, and there is no capability of refreshing. Calling the downloader again will redownload all files starting from the first issue, which adds overhead and also does not allow downloading any issues that are beyond the reach of the API limit.
2. Endpoints
On GitHub, every pull request is an issue, but not every issue is a pull request. Therefore, the 'issues' endpoint may return both issues and pull requests in the response. Pull requests from the issues endpoint will have an issue ID. To obtain a pull request ID, we need to use the "list pull requests" endpoint, which downloads up to 100 pull requests per page.
Further, the first message of each issue and each pull request is not considered a GitHub comment.
Therefore, the original notebook relies on three endpoints: the 'Issues endpoint' to get the first comment of every issue, the 'Pull Request endpoint' to get the first comment of every pull request, and the 'Issue and Pull Request Comment endpoint' to get the comments for issues and pull requests together.
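For reference, a hedged sketch of those three kinds of calls with the gh package (owner, repo, and token are placeholders; the parameters shown are the standard REST ones, not necessarily those used in the notebook):
# Issues endpoint: returns issues AND pull requests (PRs carry an issue ID).
issues <- gh::gh("GET /repos/{owner}/{repo}/issues",
                 owner = owner, repo = repo,
                 state = "all", per_page = 100, .token = token)
# "List pull requests" endpoint: needed to obtain pull request IDs.
pulls <- gh::gh("GET /repos/{owner}/{repo}/pulls",
                owner = owner, repo = repo,
                state = "all", per_page = 100, .token = token)
# Issue and PR comments endpoint: comments on both, excluding the first post.
comments <- gh::gh("GET /repos/{owner}/{repo}/issues/comments",
                   owner = owner, repo = repo,
                   per_page = 100, .token = token)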
Like Jira, Github has a search endpoint:
2.1 Search by date
The GitHub API has the ability to search by certain parameters. One of these is the ability to search by when an issue was created or last updated:
Unlike the JIRA API, this also allows us to search by hours, minutes, and seconds, so downloading duplicate data is less likely, though the chance is not zero.
Understanding the search syntax:
The search endpoint cannot specifically download comments and so it may not be used for the comment refresh. Instead, the issue/comments endpoint has a 'since' parameter that will download all comments updated after the date. Because the 'updated_at' parameter is initialized to the 'created_at' value when the comment is created, it also downloads all comments created after this date. More information can be found here:
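A small sketch contrasting the two mechanisms described in this subsection (timestamps and variables are placeholders):
# Search endpoint: filter issues by creation time, down to the second.
query <- sprintf("repo:%s/%s is:issue created:>%s", owner, repo, "2024-04-18T12:00:00Z")
issues <- gh::gh("/search/issues", q = query, .token = token)
# Issue comments endpoint: `since` returns comments created OR updated
# at/after the given timestamp (no created-only filter is available).
comments <- gh::gh("GET /repos/{owner}/{repo}/issues/comments",
                   owner = owner, repo = repo,
                   since = "2024-04-18T12:00:00Z", .token = token)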
2.2 Search by issue range
Unlike the Jira downloader in #275, the GitHub API does not currently have a way to download issues based on a range. Instead, it allows the specification of individual issue numbers to collect data from. This makes searching by issue range a more difficult task and harder to implement as the vehicle for our refresh functionality.
Initially, one might think that a way to implement the refresh function with issue number specification could be to (1) find the highest issue number already downloaded, (2) download that issue number +1, and (3) loop, incrementing the issue number by one each iteration until all issues are downloaded. The problem with this is that some issue numbers often do not exist or are not accessible, which would require error handling and add overhead. Another issue is that the function may not know when the last issue was successfully downloaded. Further, this would require an API call for each individual issue instead of the original method, in which up to 100 issues can be downloaded per page.
For this reason, we will use the search by date for the refresh capability.
3. Refresh
3.1 Naming Convention
We will need a reliable way to parse filenames in filepath to find the most recent 'created_at' field among files in order to pass this value as a search parameter to the endpoint discussed in 2.1.
Similar to #275, we will use a naming convention for the filenames that ends with [UNIXTIME of lowest 'created_at' value]_[UNIXTIME of greatest 'created_at' value] so that we can use
parse_jira_latest_date(filepath)
from #276 to return the filename that contains the most recent 'created_at' value. The issues are downloaded into these files such that the most recently created issue happens to be at the first index, so we will retrieve this value (accurate to the second) and append it to the GitHub API call discussed in 2.1 to retrieve only issue data after this date/time. We will then call github_api_iterate_pages() to save the files in .json format along the designated filepath. Dates from the issues and issues/comments endpoints are accessed via issues$created_at, dates from the search endpoint via issues$items$created_at, and dates from the commits endpoint via issue$commit$author$date.
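A rough sketch of this naming-convention logic (save_path_issue and the exact file name pattern are assumptions that follow the description above):
# File names end with <unixtime of oldest created_at>_<unixtime of newest created_at>.json.
# To refresh, take the largest trailing unixtime across existing files and
# convert it back to the ISO 8601 form the search endpoint expects.
files <- list.files(save_path_issue, pattern = "\\.json$")
newest_unixtime <- max(as.numeric(sub(".*_(\\d+)\\.json$", "\\1", files)))
created_lower_bound <- format(as.POSIXct(newest_unixtime, origin = "1970-01-01", tz = "UTC"),
                              "%Y-%m-%dT%H:%M:%SZ")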
3.2 File storage organization
The issue of syncing the file storage organization is discussed in #286. Previously, the file storage followed the form ../../rawdata/[project_management_tool]/[project_name]/, e.g.
A superior way would be to have the project name be a higher-level directory so that users can locate data based on project name: ../../rawdata/[project_name]/[project_management_tool]/issues
The logic behind this is that most people are concerned with data referring to a specific project, and so it would be easier to navigate. Refer to #286 for more information on the data storage organization.
3.3 Github refresh
Refresh Issues
To refresh issues, we will use the search endpoint. The search endpoint allows us to retrieve issue data and filter by 'created_at' values. As such, the logic is similar to that in #275: we parse the name of the file with the most recent date, retrieve that date, and then pass it as a lower bound to the search endpoint, effectively downloading only issues created after that point.
Issues retrieved from the search endpoint are nested differently than those retrieved from the issues endpoint, and thus many of the functionalities that rely on the original formatting break. For this reason, we will create a new folder, 'issue_refresh', and download the issue data from the search endpoint into it. This means that, to obtain the latest date already downloaded, the refresh function must iterate through both the issue and issue_refresh folders, find the greatest date in each, and then compare those to see which is greater.
Refresh Comments
The search endpoint does not allow us to filter specifically by comments, so we will use the issue/comments endpoint. Specifically, the 'since' parameter allows us to retrieve any comments updated or created after a certain date, accurate to the second. More details: . Because this still uses the issue/comments endpoint, the data from the refresher is downloaded into the original folder.
4. Task List
repo,
token,
date_lower_bound,
date_upper_bound,
verbose)