Refresh Capability for Downloading Github Data #282

Closed
9 tasks done
Ssunoo2 opened this issue Mar 2, 2024 · 31 comments
Assignees
Labels
module:download Anything associated to the download module for data acquisition.

Comments

@Ssunoo2
Collaborator

Ssunoo2 commented Mar 2, 2024

1. Purpose

The purpose of this issue is to add a refresh capability to download_github_comments.Rmd. Currently, issue data is downloaded to a filepath, 100 issues per page, with no way to refresh. Calling the downloader again re-downloads all files starting from the first issue, which adds overhead and also does not allow downloading issues that are beyond the reach of the API limit.

2. Endpoints

On GitHub, every pull request is an issue but not every issue is a pull request. The 'issues' endpoint may therefore return both issues and pull requests in its response. Pull requests returned from the issues endpoint carry an issue ID; to obtain a pull request ID, we need to use the "list pull requests" endpoint, which downloads up to 100 pull requests per page.

Further, the first message of each issue and each pull request is not considered a Github comment.

Therefore, the original notebook relies on three endpoints: the 'Issue' endpoint to get the first comment of every issue, the 'Pull Request' endpoint to get the first comment of every pull request, and the 'Issue and Pull Request Comment' endpoint to get the remaining comments for issues and pull requests together.

Like Jira, Github has a search endpoint:

2.1 Search by date

The GitHub API can search by certain parameters, one of which is when an issue was created or last updated.
Unlike the JIRA API, this also allows searching by hours, minutes and seconds, so downloading duplicate data is less likely, though still possible.

Understanding the search syntax:

  • This also specifies how to search for date ranges

The search endpoint cannot download comments specifically, so it cannot be used for the comment refresh. Instead, the issues/comments endpoint has a 'since' parameter that downloads all comments updated after the given date. Because 'updated_at' is initialized to the 'created_at' value when a comment is created, this also captures all comments created after that date. More information can be found in the issues/comments endpoint documentation.
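As a rough illustration, such a query could be issued through the gh package along these lines (a sketch only; owner, repo, token and created_after are assumed to be defined by the caller):

# Sketch: search for issues created after a given ISO 8601 timestamp.
query <- sprintf("repo:%s/%s is:issue created:>%s", owner, repo, created_after)

issues <- gh::gh("GET /search/issues",
                 q = query,
                 per_page = 100,
                 .token = token)

Comments cannot be filtered this way, which is why the issues/comments endpoint's 'since' parameter is used for the comment refresh instead.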

2.2 Search by issue range

Unlike the Jira downloader in #275, the GitHub API does not currently have a way to download issues by range. Instead, it allows specifying individual issue numbers to collect data from. This makes searching by issue range harder to implement as the vehicle for our refresh functionality.

Initially, one might think the refresh could be implemented with issue number specification by (1) finding the highest issue number already downloaded, (2) downloading that issue number + 1, and (3) looping, incrementing the issue number by one each time until all issues are downloaded. The problem is that some issue numbers do not exist or are not accessible, which would require error handling and add overhead. Another problem is that the function may not know when the last issue was successfully downloaded. Further, this would require an API call for each individual issue instead of the original method, in which up to 100 issues can be downloaded per page.

For these reasons, we will use search by date for the refresh capability.

3. Refresh

3.1 Naming Convention

We need a reliable way to parse the filenames in the filepath to find the most recent 'created_at' value among the files, in order to pass this value as a search parameter to the endpoint discussed in 2.1.

Similar to #275, we will use a naming convention in which filenames end with [UNIXTIME of lowest 'created_at' value]_[UNIXTIME of greatest 'created_at' value], so that parse_jira_latest_date(filepath) from #276 can return the filename containing the most recent 'created_at' value. Issues are downloaded into these files such that the most recently created issue happens to be at the first index, so we retrieve this value (accurate to the second) and append it to the GitHub API call discussed in 2.1 to retrieve only issue data after that date/time. We then call github_api_iterate_pages() to save the files in .json format along the designated filepath.

  • Dates from the issue and issue/comments endpoints are accessed via issues$created_at.
  • Dates from the search endpoint are accessed via issues$items$created_at.
  • Dates from the commits endpoint are accessed via issue$commit$author$date.
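As a rough sketch of the filename lookup (the function name and regular expression are hypothetical, not the final implementation):

# Hypothetical helper: extract the greatest 'created_at' UNIX time from
# filenames that end in [unix_start]_[unix_end].json.
latest_created_at <- function(save_path) {
  files <- list.files(save_path, pattern = "\\.json$")
  if (length(files) == 0) return(NA)
  # The last underscore-separated token before ".json" is the greatest created_at.
  end_times <- as.numeric(sub(".*_(\\d+)\\.json$", "\\1", files))
  max(end_times, na.rm = TRUE)
}

# The UNIX time can then be formatted as the ISO 8601 timestamp GitHub expects:
# format(as.POSIXct(latest, origin = "1970-01-01", tz = "UTC"), "%Y-%m-%dT%H:%M:%SZ")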

3.2 File storage organization

The issue of syncing the file storage organization is discussed in #286. Previously, the file storage would follow the form of ../../rawdata/[project_management_tool]/[project_name]/. e.g.

../../rawdata/github/kaiaulu

A better approach would be to make the project name a higher-level directory so that users can locate data by project name: ../../rawdata/[project_name]/[project_management_tool]/issues. For example:

../../rawdata/kaiaulu/github/sailuh_kaiaulu

The logic behind this is that most people are concerned with data referring to a specific project and so it would be easier to navigate. Refer to #286 for more information on the data storage organization.

3.3 Github refresh

Refresh Issues

To refresh issues, we will use the search endpoint. The search endpoint allows us to retrieve issue data and filter by 'created_at' values. As such, the logic is similar to that in #275, in which we parse the name of the file with the most recent date, retrieve that date and then pass it as a lower bound to the search endpoint, effectively only downloading issues created after that point.
Issues retrieved from the search endpoint are nested differently than those retrieved from the issue endpoint, so many functions that rely on the original formatting break. For this reason, we will create a new folder, 'issue_refresh', and download the issue data from the search endpoint into it. This means that to obtain the latest date already downloaded, the refresh function must iterate through both the issue and issue_refresh folders, find the greatest date in each, and compare the two (see the sketch below).
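In other words, the lower bound passed to the search endpoint would be the maximum of the two folders' latest dates, along the lines of (reusing the hypothetical latest_created_at() helper sketched in 3.1):

latest_issue   <- latest_created_at(save_path_issue)
latest_refresh <- latest_created_at(save_path_issue_refresh)
created_after  <- max(latest_issue, latest_refresh, na.rm = TRUE)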

Refresh Comments

The search endpoint does not allow filtering specifically for comments, so we will use the issues/comments endpoint. Its 'since' parameter retrieves any comments updated or created after a certain date, accurate to the second. Because this is the same issues/comments endpoint, the data from the refresher is downloaded into the original folder.

4. Task List

  • Fix naming convention to include UNIXSTARTTIME_UNIXENDTIME
  • Use parse_jira_latest_date(json_path) to find the filename with most recent date
  • Parse the file and find the latest created_at
  • format_created_at_from_file <- function(filename)
  • This function takes a filename along a path and extracts the greatest 'created_at' value
  • github_api_project_issue_refresh <- function(owner,repo,token,save_path_issue,save_path_issue_refresh,verbose)
  • This function accesses search/issues endpoint using 'is:issue' and 'created>date' to download data only after this date
  • github_api_project_issue_or_pr_comment_refresh <- function(owner,repo,token,file_save_path,verbose)
  • Since the search endpoint cannot pull comments, we will use the issues/comments endpoint and pass the date as 'since' value
  • iterate over that retrieved data and save it to file
  • github_parse_search_issues_refresh(api_responses)
  • This function parses the data in the issue_refresh folder
  • github_api_project_issue_by_date(owner, repo, token, date_lower_bound, date_upper_bound, verbose)
  • This function allows the specification of dates to retrieve issue data from. The results are stored in the issue_refresh folder (see the sketch after this list).
  • github_api_iterate_pages(token, gh_response, save_path_issue, prefix=issue)
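A rough sketch of how these pieces could fit together (parameter handling, error checking and the helper latest_created_at() are placeholders, not the final Kaiaulu implementation):

github_api_project_issue_refresh <- function(owner, repo, token,
                                             save_path_issue,
                                             save_path_issue_refresh,
                                             verbose = TRUE) {
  # 1. Find the greatest 'created_at' already on disk (both folders).
  latest <- max(latest_created_at(save_path_issue),
                latest_created_at(save_path_issue_refresh), na.rm = TRUE)
  created <- format(as.POSIXct(latest, origin = "1970-01-01", tz = "UTC"),
                    "%Y-%m-%dT%H:%M:%SZ")

  # 2. Query the search endpoint for issues created after that point.
  gh_response <- gh::gh("GET /search/issues",
                        q = sprintf("repo:%s/%s is:issue created:>%s",
                                    owner, repo, created),
                        per_page = 100,
                        .token = token)

  # 3. Save the pages under the refresh folder using the shared iterator.
  github_api_iterate_pages(token, gh_response,
                           save_path_issue_refresh, prefix = "issue")
}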
@Ssunoo2 Ssunoo2 changed the title Refresh Capability for Downloading Github Data Refresh Capability for Downloading Github Data (Milestone 2) Mar 2, 2024
@carlosparadis
Member

Related issue with endpoint docs: #253

@Ssunoo2
Collaborator Author

Ssunoo2 commented Mar 20, 2024

I'm running into an issue in which the original downloaders use different endpoints. For instance, the issue data downloader uses the issue endpoint, the pull request data downloader uses the pull request endpoint, the commit data downloader uses the commit endpoint, etc.

I am opting to use the search endpoint so that I can retrieve issues by their 'created_at' value. I noticed, however, that the .json file organization of the search endpoint is slightly different than the others, mainly that the issues are embedded similar to the results from the JIRA search endpoint.
[Screenshot: nested .json structure returned by the search endpoint]

Whereas this is the original unembedded .json file organization from the original downloaders:
[Screenshot: unnested .json structure from the original downloaders]

I'm trying to find out if there is a way to un-nest them for now so that their format mimics the original and they work with other functions, like the parsers, that rely on this format. This also caused problems with the new naming convention I am using (replacing the 'page number' with [unixtime start]_[unixtime end]).

@carlosparadis
Member

@Ssunoo2 You are correct in that the current endpoints are different. This is precisely the motivation for this refresher. You need to use a new endpoint. With our current endpoints, you run into issue #253. So you have more work to do here compared to JIRA in regards to the endpoint and (possibly) the parser, but less work on the logic since you can rely on gh.

Could you elaborate with a 1 issue example how the two differ? Are the fields the exact same? Please remember to create a branch asap on Kaiaulu, so you don't forget and start one from the fork instead.

This also caused problems with the new naming convention that I am using (replacing the 'page number' with [unixtime start]_[unixtime end])

I don't understand the problem here. Can you elaborate?

@Ssunoo2
Collaborator Author

Ssunoo2 commented Mar 20, 2024

@carlosparadis Thanks for confirming I am on the right path!

I made a branch here: https://github.com/sailuh/kaiaulu/tree/282-github-downloader-refresh so hopefully it is visible to you as well.

The fields do appear to be the exact same; I'm just commenting that the files are formatted differently and I discovered this after writing the code to name the files. I would access the created_at field with something like

data[i]$created_at

which doesn't work with the raw downloaded data from the search endpoint because it would need to be accessed via

data[i]$item$created_at

or something similar. But this should be a non-issue once I find out how to convert the file to the original schema

@carlosparadis
Member

Perfect. When you make a commit to this branch, it should start offering to make a PR against master. At that point, we should be able to do code reviews as usual. The main difference is that everyone else will be able to commit against it. Bear in mind you will want to coordinate a bit more by sharing the same branch to avoid merge conflicts all the time (e.g. you may want to have a local branch where you make most commits, and only push to this branch when ready for review, etc). If not planning to collaborate, then you can just commit as usual to it.

I see the differences! My suggestion is that you use a functional approach:

You can use lapply(list_variable, function_passed_as_parameter).

You can then write the function assuming each of its inputs is a single element of the list, i.e. what you get from data[[i]] (note the use of double [[; it returns something different than data[i]).

For the function, you can transform the element data to match the existing GitHub parser.
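A minimal sketch of that functional approach (names are illustrative; the reshaping body is a placeholder for whatever field adjustments turn out to be needed):

# Each element of gh_response$items is one issue, i.e. what data[[i]] returns.
reshape_issue <- function(issue) {
  # Adjust or drop search-only fields here so the element matches the
  # format the existing GitHub parser expects.
  issue
}

issues <- lapply(gh_response$items, reshape_issue)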

The only thing I'd caution you about is which fields are actually available. For example, some fields, instead of containing the information, will contain a URL pointing to where the information is. Make sure that the existing downloader and the downloader you implement not only have the right fields, but also the values we would otherwise obtain from them. To that end, running the parser against downloaded issue and comment data to take a quick look at how the table turns out could be helpful!

Ssunoo2 added a commit that referenced this issue Mar 20, 2024
Ssunoo2 added a commit that referenced this issue Mar 22, 2024
@Ssunoo2
Collaborator Author

Ssunoo2 commented Mar 29, 2024

To download issue comments, I found that the search endpoint does not allow us to retrieve just comments. However, the issues/comments endpoint does. The issues endpoint does not have a 'created' parameter that lets us search by created_at values the way the search endpoint does. It does, however, have a 'since' parameter, which downloads all comments created or updated after a certain date.

gh::gh("GET /repos/{owner}/{repo}/issues/comments",
         owner=owner,
         repo=repo,
         since=created,  # Pass the `since` parameter in the API request
         page=1,
         per_page=100,
         .token=token)

@carlosparadis
Member

As discussed on discord: The refresher should save the data as is, without modifying it.

The parser of the refresher should not attempt to modify the file to reuse the existing parser. Copy and paste the existing function if needed and make the adjustments needed even if small.

In regards to whether the github refresher can keep issues in the same folder as the issue downloader: the answer is likely no, since they use different endpoints under the hood. Let's move forward with the assumption that, to use the refresher, the user has to specify a different folder.

There are a few things I would like you to check:

  • GitHub has a confusing definition for issues, where both Pull Requests and actual Issues are considered "Issues". So much so that the issue ID increment is shared among them. Could you double check that the search endpoint downloads both issues and Pull Requests?
  • How many API calls do we need when using the refresh, compared to the existing issue / PR downloader?

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 9, 2024

The current difference in the .json format is mentioned in this comment.
The difference is that the search endpoint data has the issue data nested in 'items'.

An easy way to return just the issue data is to write issues$items to file, but if the data is not to be altered, this won't do. I had assumed this comment urged me to alter the downloaded data to fit the parsers. The altered format does currently work with the parsers in the download_github_comments.Rmd notebook.

If we were to download only the raw data, one issue that stems from this is that github_iterate_pages(), the function that writes the pages to file along the save path, would need to change: it checks the 'created_at' value to build filenames from the unix times of the oldest and newest created dates in the file. Since this value is nested differently in the .json files returned by the search endpoint, an if/else statement checking the level of nesting would likely be needed. The difference in nesting also affects the parsers.

Ssunoo2 added a commit that referenced this issue Apr 10, 2024
Added a check to both the issue and comment refreshers that tests whether the directory is empty before running. If the directory is empty, the regular downloader runs instead, because parsing the file names would otherwise break. The issue refresh now downloads raw data and does not alter it, but github_iterate_pages cannot write it to file yet because it depends on an altered format.
@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 10, 2024

Does the search endpoint download both issues and pull requests?

  • Yes, since every pull request is an issue but not every issue is a pull request, the search endpoint for issues does retrieve all issues and pull requests.
    To specify or exclude pull request information, is:pr or -is:pr can be appended to the search query.

Does the refresh use the same number of API calls as the existing downloader?

  • Yes, the initial page is retrieved through the function similarly to the regular downloader. Subsequent API calls are made through the gh package using gh::gh_next() in the github_iterate_page() function.

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 12, 2024

The issue downloader uses the search endpoint and specifies the 'created' parameter to retrieve issue data for issues created after a certain date. Since there is not a good way to retrieve comments with the search endpoint, the issue comment downloader uses the issues/comments endpoint with the 'since' parameter, which downloads all data updated AND created after the specified date/time.

Because the thin documentation of the 'since' parameter resembles the wording of the 'updated' parameter from the search endpoint, I will download issue data from the search endpoint twice: once using the created parameter and once using the updated parameter, in order to compare the results.

Documentation on the search (created/updated) endpoint:
[Screenshot: GitHub documentation for the search endpoint's created/updated qualifiers]

Documentation on the issue (since) endpoint:
[Screenshot: GitHub documentation for the issues endpoint's since parameter]

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 12, 2024

The issues downloader DOES download associated tags for each issue. Here is an image of the nesting to locate these tags:
[Screenshot: JSON nesting showing the tags/labels associated with an issue]

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 12, 2024

Issue data downloader discrepancies using the issue versus search endpoints

The github issues refresher uses the search endpoint to download all issue data that was created after a certain date. The nesting of the results of the issue data from the search endpoint is different from the data downloaded from the original issue downloader. Refer to this comment for details on the differences.

Initial solution

It is imperative that the original data be downloaded in its UNALTERED form. The initial solution for this difference in data structure is to create a new folder that houses the refresh data. This adds several complexities: the refresh function, as well as github_api_iterate_pages() which writes the data to file, would need a way to differentiate the nesting in the files. Separate code must be written to parse out any values, including parsing the created_at dates to retrieve the minimum and maximum values for the naming convention as well as building the search queries for GitHub API GET requests. It also means that to construct the search query, both folders must be searched for the greatest date.

Further, the parser would need to be rewritten so that it can parse data in both forms.

A Potential better solution

If the issue data downloaded from the issue endpoint is identical to that downloaded from the search endpoint except for nesting, then perhaps the better solution is to not use the issue endpoint in the first place. This means that much of the parsing functionality written across the downloaders and parser may need to be edited, but it removes the complexity of having data in two different folders and in two different forms.

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 12, 2024

Unaltered data downloading

I noticed when reviewing the github_api_iterate_page() function that writes the data to JSON files along a filepath that this code is being used:

# Write to file
write_json(gh_response, file_name,
           pretty = TRUE, auto_unbox = TRUE)

auto_unbox=TRUE does alter the data, so I will look into a better way to write it; perhaps something like writeLines().
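For context, a small example of what auto_unbox changes in jsonlite output (the values are made up):

library(jsonlite)

x <- list(id = 282L, title = "Refresh capability")

toJSON(x, auto_unbox = FALSE)
# {"id":[282],"title":["Refresh capability"]}

toJSON(x, auto_unbox = TRUE)
# {"id":282,"title":"Refresh capability"}

With auto_unbox = TRUE, length-one vectors are written as JSON scalars rather than single-element arrays, which is one way the written file can differ from the raw API response.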

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 13, 2024

Since query for issues/comments endpoint

The since query downloads all data updated after a certain date. This includes all data created after a certain date because the updated_at parameter is initialized to the created_at parameter when the issue or comment is made. To test if the since parameter in the issues/comments endpoint will download comments created before the date/time specified but updated after this timestamp, I found a comment that was updated a day after it was created:
[Screenshot: a comment whose updated_at is one day after its created_at]
and queried between the created_at and updated_at values:

 gh::gh("GET /repos/{owner}/{repo}/issues/comments",
         owner=owner,
         repo=repo,
         since="2023-03-16T11:01:54Z",  # Pass the `since` parameter in the API request
         page=1,
         per_page=100,
         .token=token)

The comment was downloaded again.

Difference in issue data downloaded from issue versus search endpoint

Aside from the difference in nesting described in this comment, the data downloaded for each issue is identical, except that the search endpoint response has an additional 'score' field:
[Screenshot: the extra 'score' field in the search endpoint response]
I'm not sure at this point if this will disrupt the usage of kaiaulu, but because all other data is the same, I am leaning towards using the search endpoint for the original downloader, which seeks to download all data, as well.

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 13, 2024

We decided that we will keep a separate folder for the github issue refresher for now as the format is different. In the future, this can be revisited to determine if the search endpoint can be used for the regular downloader as well.

I will add a function that allows downloading issues based on upper and lower bound created_at ranges.

I will focus on finishing the refreshers/downloaders and then request code review. I will then work on making a separate parser function that runs on the data from the search endpoint.

I will triple check that there is no other good way to download comment data from other search endpoints.

Then I will check if there is any type of indicator in the issue comment downloader that identifies duplicates of the same comment. For example if a comment was downloaded and then updated and subsequently downloaded again (the refresher uses the 'since' parameter). If there is some sort of identifier, then we may be able to filter the downloaded data to feed to the parsers.

@carlosparadis
Member

@Ssunoo2 Thanks for adding all the notes!

In regards to this comment: #282 (comment)

Are we unable to just download the data as is, as we did with JIRA downloader, before turning into a list because of the use of gh here?

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 16, 2024

Are we unable to just download the data as is, as we did with JIRA downloader, before turning into a list because of the use of gh here?

I noticed this recently so I am going to test downloading the raw data by itself. If it works with the parsers, great! Otherwise, potentially the parsers can alter the data later.

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 16, 2024

For future refreshers:

Pull requests

can be retrieved using the search endpoint. Adding is:pr to the search/issues should allow this:

query <- sprintf("repo:%s/%s is:pr created:>%s", owner, repo, created)

  # Use the Search API endpoint to search for pull requests
  pull_requests <- gh::gh("/search/issues",
                          q = query,
                          .token = token,
                          .limit = 100)

Commits

Commits, like the issues endpoint, also use the 'since' parameter:

since string
Only show results that were last updated after the given time. This is a timestamp in ISO 8601 format: YYYY-MM-DDTHH:MM:SSZ.
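A hedged sketch of what a future commit refresher call could look like, mirroring the comment refresher (owner, repo and token are assumed to be defined; the date is an example):

commits <- gh::gh("GET /repos/{owner}/{repo}/commits",
                  owner = owner,
                  repo = repo,
                  since = "2024-04-01T00:00:00Z",
                  page = 1,
                  per_page = 100,
                  .token = token)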

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 17, 2024

Regarding this comment discussing the use of write_json() in the GitHub downloader: from my research, the GitHub API returns the data formatted as JSON. The gh package we are using converts this data into an R object, likely a list.

From the gh github page:

The JSON result sent by the API is converted to an R object.

So using the gh package may violate our requirement that the data from the API call remain unaltered. This is likely why the write_json() function is being used in the current GitHub downloader. If the GitHub API data were unaltered, we should be able to use writeLines() to transfer the raw data to a JSON file, but in my experimentation this does not work, and I believe the gh package is the reason.

@carlosparadis
Member

@Ssunoo2 thanks for confirming. If that is the case, just proceed with the data object and the auto_unbox parameter to make it as close as possible to the raw data. Replacing gh at this point is not viable.

Can you please point me to the comment on the issue tracker where we discuss the effect of using auto_unbox? I don't remember where it is. If you can't find it, or if it is faster, you may just want to comment it here for future reference. Thanks!

@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 17, 2024

The comment that discussed the effect of auto_unbox begins here


@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 17, 2024

@cp I found another issue with the 'since' parameter. It seems that 'since' downloads anything updated at or after the given date, i.e. it is inclusive of the given date. This means that running the comment refresher when it is already up to date will still download the last comment into a single file. It also likely means there may be duplicate comment data whenever the refresher is called. I am brainstorming solutions to this problem; perhaps adding a second to the time passed to the since parameter may work? The only problem is that if another comment was made or updated at that exact second, it may be skipped.
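A sketch of the "add one second" idea under consideration (not a committed fix): bump the parsed timestamp by one second before passing it as 'since'.

latest <- as.POSIXct("2024-04-17T10:15:30Z",
                     format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
since  <- format(latest + 1, "%Y-%m-%dT%H:%M:%SZ")
# "2024-04-17T10:15:31Z"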

@carlosparadis
Member

For the comments part, I have a similar request:

The github_api_project_issue_or_pr_comment_refresh() should call github_api_project_issue_or_pr_comments(), and github_api_project_issue_or_pr_comments() should have a new parameter, date_updated_lower_bound, defaulting to NULL (so it doesn't interfere with existing code that calls it without the new parameter).

Also, we should ensure all these functions explicitly say in the parameter name whether it is date_updated_lower_bound OR date_created_lower_bound. The distinction is very subtle, and I'd rather have it obvious on the parameter and also emphasized on the description.

The github_api_project_issue_or_pr_comments() would be the one in charge of the file naming formatting, similar to my request above.
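A sketch of the requested structure (signatures paraphrased from this comment, not the merged code; latest_updated_at() is a hypothetical helper):

github_api_project_issue_or_pr_comments <- function(owner, repo, token,
                                                    date_updated_lower_bound = NULL) {
  if (is.null(date_updated_lower_bound)) {
    # Existing behavior: no 'since', so current callers are unaffected.
    gh::gh("GET /repos/{owner}/{repo}/issues/comments",
           owner = owner, repo = repo,
           page = 1, per_page = 100, .token = token)
  } else {
    gh::gh("GET /repos/{owner}/{repo}/issues/comments",
           owner = owner, repo = repo,
           since = date_updated_lower_bound,
           page = 1, per_page = 100, .token = token)
  }
}

github_api_project_issue_or_pr_comment_refresh <- function(owner, repo, token, save_path_comment) {
  # Greatest 'updated_at' already on disk, formatted as ISO 8601.
  since <- latest_updated_at(save_path_comment)
  github_api_project_issue_or_pr_comments(owner, repo, token,
                                          date_updated_lower_bound = since)
}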

@carlosparadis
Member

Further clarifications from Discord:

  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      pull_request: ../../kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/

The issue field should be storing data for: https://docs.github.com/en/rest/issues/issues?apiVersion=2022-11-28#list-repository-issues

The issue_or_pr_comment should be storing data of this endpoint: https://docs.github.com/en/rest/issues/comments?apiVersion=2022-11-28#list-issue-comments-for-a-repository

The one we have been calling issue_refresh, given all that we talked about since, should be requesting data from the search endpoint: https://docs.github.com/en/search-github/searching-on-github/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated

I'd actually like you to rename the field to issue_search instead of issue_refresh.

You are not touching the issue endpoint for your milestone, so the files stored there will not respect the convention you made. You are modifying the issue_or_pr_comment endpoint so the data in that folder now respects the refresh convention and takes 'since' as a parameter.

You are also creating the issue_search endpoint so it respects the refresh convention.

In the long run (not this project timeline), I may either remove the issues endpoint altogether, or modify it to also collect data using 'since' (as we are doing with comments). However, since the search endpoint also has an updated parameter, it seems to me the /issues endpoint is redundant and less capable, so my inclination is to eventually get rid of it and call issue_search simply issue; but for now it is easier to go with what I said above.

Ssunoo2 added a commit that referenced this issue Apr 18, 2024
Function documentation has also been updated. Notebook documentation I am still working on.
@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 19, 2024

Yes, de-coupling the refresher into two functions is totally doable and I agree that it will add more potential utility.

Your understanding that the folders are essentially for different endpoints is also correct, so I will change the directory to be 'issue_search'. This does make more sense I think, especially for others who will contribute to this at a later time.

Lastly, are we able to specify timezones here? What tz is GitHub using? On JIRA this was not an issue since the data is stored based on the tz it downloads, and we only request by Issue Key, but here this is important.

Yes, we can specify time zones.
GitHub uses UTC by default in ISO 8601 format, so you can specify time zones using an offset from UTC.

UTC Time: YYYY-MM-DDTHH:MM:SSZ
Positive Offset (east of UTC): YYYY-MM-DDTHH:MM:SS+HH:MM
Negative Offset (west of UTC): YYYY-MM-DDTHH:MM:SS-HH:MM
For example, if you are specifying a time in Eastern Standard Time (UTC-5 hours), 
it would be written as 2024-04-18T12:00:00-05:00.
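A small R example of the conversion (the time zone chosen is just an illustration; note that America/New_York is UTC-4 in April because of daylight saving time):

local_time <- as.POSIXct("2024-04-18 12:00:00", tz = "America/New_York")
format(local_time, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
# "2024-04-18T16:00:00Z"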

Ssunoo2 added a commit that referenced this issue Apr 20, 2024
Refreshers and download-by-date functions for issues are now wrappers around this function. Modified github_api_project_issue_or_pr_comments() to optionally take a 'since' parameter; the refresher and download-by-date for comments are now wrappers around this function. Changed date_lower_bound and similarly named parameters to updated_lower_bound or created_lower_bound.
carlosparadis added a commit that referenced this issue Apr 23, 2024
Reorganized sections and code blocks to facilitate showcasing:

- How to download issues using search endpoint
- Seeing the results immediately after

- How to download issue comments
- Seeing the results immediately after

- Moving the older endpoints that do not use refresh
to a separate section.

Signed-off-by: Carlos Paradis <[email protected]>
@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 25, 2024

Issue Comment 'since'

The issue comment's 'since' parameter downloads all comments updated after the given date. If there were multiple iterations of a comment (e.g. it was updated several times), calling this endpoint with the since parameter only downloads the most recent version. This means that on occasion there will be multiple records with the same id.

For example:
comment created on day 1
comment retrieved on day 2 since day 1
comment updated on day 3
comment retrieved on day 4 since day 3

This results in two data points that share the same ID but contain different body text. Because the issues/comments endpoint does not have a way to retrieve based on the 'created' date only, remedying this may involve significant overhead. A possible solution is parsing through all comment data and grouping records that share the same id; all records except the one with the most recent 'updated' value could then be discarded.

For Example:
id: 121, body: x, updated 02/01/24
id: 121, body: xx, updated 02/02/24
id: 121, body: xxx, updated 02/03/24

As the three comment records share the same id, the first and second could be discarded because their 'updated' value is less than the third's.

Reiteration from our meeting:

Because the endpoint this function relies on filters by the updated timestamp, running the refresher downloads the most recent version of a changed comment, not all of its revisions. However, if a comment already downloaded was modified before the next refresh call, running the refresher again results in two comments with the same comment id being present in the table. This can be addressed by performing a group by over comment_id in the parsed table and returning the max(updated_at) comment, yielding a table with only the most recent version of each comment as of the latest refresher run.
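A sketch of that post-processing step in R (the column names comment_id and updated_at are assumed, not confirmed against the actual parsed table):

library(data.table)

comments <- data.table(
  comment_id = c(121, 121, 121),
  body       = c("x", "xx", "xxx"),
  updated_at = as.POSIXct(c("2024-02-01", "2024-02-02", "2024-02-03"), tz = "UTC")
)

# Keep only the most recent version of each comment.
latest_comments <- comments[, .SD[which.max(updated_at)], by = comment_id]
# Only the 2024-02-03 row (body "xxx") remains for comment_id 121.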

Further Demonstration:
Here is an example retrieving this issue (created 4/25/24) using since="2024-04-25", in which I updated the comment by appending "TESTEST" to the end on 4/27/24 and retrieved it again using since="2024-04-27". In this case, the original comment body was downloaded by the first query and ONLY the updated comment was retrieved by the second query. Both comments have the same created_at value, though the updated_at values differ. Here is what they look like parsed into a table (notice the ids are the same):

[Screenshot: parsed table showing two rows with the same id but different updated_at values]

carlosparadis added a commit that referenced this issue Apr 29, 2024
By testing the notebook and comparing outputs, found
the search endpoint only downloads issues.

Also added a new notebook to display the communication
network for github.

Minor changes to graph display in a few notebooks so
graph is enlarged in interactive form in the notebook.

Signed-off-by: Carlos Paradis <[email protected]>
Ssunoo2 added a commit that referenced this issue Apr 30, 2024
… whether issue or pull requests should be retrieved
Ssunoo2 added a commit that referenced this issue Apr 30, 2024
Ssunoo2 added a commit that referenced this issue Apr 30, 2024
@Ssunoo2
Collaborator Author

Ssunoo2 commented Apr 30, 2024

This is the comment on how the endpoint uses time zone: #282 (comment)

@carlosparadis
Member

carlosparadis commented Apr 30, 2024

The commit 2d42d1a adds a new parameter to the search endpoint function so we can also download only PRs. This means the parser for the search endpoint for issues can be used for PRs as well. The notebook adds an example of how to request either, and an if statement to decide on the folder path.

Note the search endpoint only allows downloading issues XOR PRs. There is no way to download both at the same time. For that you have to rely on the /issues endpoint (which does not offer a way to filter by the created_at attribute, only by updated_at via the since parameter).

Currently, as far as the GitHub refresher is concerned, the following endpoints now offer refresh:

  • Search endpoint
    • Download only issues
    • Download only PRs
  • The issues/comments endpoint will download both issue and PR comments using the since attribute (updated_at). This means the downloaded data could have duplicate comments: every time the refresh runs and an already downloaded comment was modified, the newer version (not every intermediate change) is downloaded again. This needs to be post-processed via a group by on comment id, or the same comment will be counted twice or more!

Other endpoints have not been added to the refresh capability (all that have are cited in this comment already: the search endpoint and the download_issues_or_pr_comments function). The ones still missing are the commits endpoint, the events endpoint (which has no concept of duplicates, because events are unique and cannot be edited), and if needed the /issues endpoint (which is somewhat redundant with the search endpoint, other than being able to download both issues and PRs in one go, but it would introduce duplicates).

A final note: adding the commits endpoint may also require changing the iterate function's if/else that tests the different levels of nesting across the search, issues and commits endpoints (commits is already handled in iterate_pages, so it is backwards compatible, even if it is not a refresher yet).

beydlern added a commit that referenced this issue Oct 6, 2024
- Three notebooks were refactored to expect the use of the getter functions from R/config.R (#230 contains the getter functions in R/config.R).
beydlern added a commit that referenced this issue Oct 18, 2024
- The project configuration sections of each notebook were incorrectly using the project directory (kaiaulu/) as their working directory rather than the directory that they reside in (/vignettes/) as their working directory.
carlosparadis added a commit that referenced this issue Nov 12, 2024
Adds GitHub refresh and re-enable functionality for
parsing replies, social smells, motifs and more.

Signed-off-by: Carlos Paradis <[email protected]>
carlosparadis added a commit that referenced this issue Nov 12, 2024
Adds GitHub refresh and re-enable functionality for
parsing replies, social smells, motifs and more.

Signed-off-by: Carlos Paradis <[email protected]>

---------

Signed-off-by: Carlos Paradis <[email protected]>
Co-authored-by: Sean Sunoo <[email protected]>
Co-authored-by: Carlos Paradis <[email protected]>
Co-authored-by: Nicholas Beydler <[email protected]>
@carlosparadis carlosparadis changed the title Refresh Capability for Downloading Github Data (Milestone 2) Refresh Capability for Downloading Github Data Nov 12, 2024
@carlosparadis carlosparadis added this to the ics496-fall24-m1 milestone Nov 12, 2024
@carlosparadis carlosparadis added the module:download Anything associated to the download module for data acquisition. label Nov 12, 2024
@carlosparadis
Member

It's been 8 months!! Glad this is finally making it to master. Thank you @beydlern for your efforts in updating this to the new config specification so we could integrate everything without breaking.

I also fixed parse_github_replies() if you're interested in seeing the diff. You already had the paths in the new specification; I just exposed the paths hardcoded inside as parameters. The function is a bit counterintuitive because it takes commit messages even though it suggests it only combines issue and PR comments. However, commit data is needed to figure out who is who in the git log: the GitHub API only gives GitHub usernames, not the name and e-mail we find in the git log. The Commit API gives username, name and e-mail, so the bridge can be performed.

Pointing this out because it is another hidden linkage between the tables that we should document in the data schema #319 #226. Closing this issue!
