Pulling Request Comment via GitHub REST API #323

Open
4 tasks done
RavenMarQ opened this issue Oct 21, 2024 · 17 comments · May be fixed by #330
@RavenMarQ
Collaborator

RavenMarQ commented Oct 21, 2024

Purpose


As of now, users can use two different endpoints to extract data from pull requests (pull_request and issue_or_pr_request). However, neither endpoint extracts the in-line comments from pull requests, which can be a point of interest for the user to study.

While the GET /repos/{owner}/{repo}/issues/comments/ endpoint does retrieve the comments for pull requests, it does not retrieve the in-line code.

Process


To extract this data, we can use an endpoint from GitHub's REST API.
The particular endpoint being used is: GET /repos/{owner}/{repo}/pulls/comments
The documentation for this endpoint can be found on the GitHub page.
The diff_hunk field is unique to this endpoint and contains the in-line code that the pull request comment references. It refers to the hunk of code in the pull request that contains the +/- changes made to a particular file. If the comment references a line of code within a hunk, diff_hunk contains all lines of code from the start of the hunk up to and including the referenced line.
The diff_hunk field will return null if the review comment does not reference any lines of code.
The pull_request_review_id field returns an integer value that ties the comment to the Review endpoint.
A parser function will also be needed to turn the resulting JSON file into a more readable table.
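
As a rough illustration of the parser step, here is a minimal Python sketch (the project itself is written in R, so this is not the Kaiaulu implementation). The function name `parse_pr_comments`, the column selection, and the sample payload are all hypothetical; `diff_hunk` and `pull_request_review_id` are the fields discussed above, while `id`, `path`, and `body` are other standard fields of a review comment object.

```python
import json

# Columns kept in the flattened table; this selection is an assumption.
COLUMNS = ["id", "pull_request_review_id", "path", "body", "diff_hunk"]


def parse_pr_comments(raw_json):
    """Turn the JSON array returned by the endpoint into a list of flat rows."""
    comments = json.loads(raw_json)
    # dict.get() yields None when a field is missing or null in the response.
    return [{col: comment.get(col) for col in COLUMNS} for comment in comments]


# Made-up sample payload, trimmed to the fields of interest.
sample = json.dumps([
    {
        "id": 1,
        "pull_request_review_id": 10,
        "path": "R/example.R",
        "body": "Rename this?",
        "diff_hunk": "@@ -0,0 +1,2 @@\n+x <- 1\n+y <- 2",
    },
])

rows = parse_pr_comments(sample)
for row in rows:
    print(row["id"], row["pull_request_review_id"], row["path"])
```

Each row is a plain dictionary, so it can be handed to any table library afterwards.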

Task List


  • Retrieve the pull request comments from the API endpoint
  • Create a downloader for the JSON files
  • Create a refresher function for the endpoint
  • Create a parser for the JSON that retrieves the in-line code from the diff_hunk field
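
One way the downloader and refresher tasks could fit together is sketched below in Python. It assumes the refresher works by passing the endpoint's `since` query parameter, which this issue does not confirm. The helper names `latest_timestamp` and `pr_comments_url` and the sample records are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def latest_timestamp(comments):
    """Return the newest updated_at value among downloaded comments, or None."""
    stamps = [c["updated_at"] for c in comments if c.get("updated_at")]
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    return max(stamps) if stamps else None


def pr_comments_url(owner, repo, since=None, page=1, per_page=100):
    """Build the request URL for GET /repos/{owner}/{repo}/pulls/comments."""
    params = {"per_page": per_page, "page": page}
    if since is not None:
        # Restrict the listing to comments updated after this timestamp.
        params["since"] = since
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls/comments?{urlencode(params)}"


# Made-up records standing in for previously downloaded JSON.
downloaded = [
    {"id": 1, "updated_at": "2024-12-01T10:00:00Z"},
    {"id": 2, "updated_at": "2024-12-03T08:30:00Z"},
]

url = pr_comments_url("sailuh", "kaiaulu", since=latest_timestamp(downloaded))
print(url)
```

A first download would call `pr_comments_url` without `since` and step through `page` until an empty array comes back.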

Limitations


This issue implements the pull request Comments endpoint URL.
We noticed that when a reviewer requests changes or approves a review, the comment attached to the accept/reject decision is not captured as a review comment by this endpoint.
That accept/reject comment belongs to the Review endpoint, which groups the comments obtained from the Review Comment endpoint.

Image

@RavenMarQ
Collaborator Author

Quick questions on the specification of this issue:

Is this similar to parsing using SciTools' Understand to create data tables?
Looking through the API, there are four data formats:
- Raw Markdown
- text representation of raw markdown
- HTML render of raw markdown
- Full JSON = all three data sets above (larger)
Are there any specific ones I need to focus on, or should I just use the JSON?
I can make the request for the data via cURL in command line (similar to what I have done for Understand), JavaScript and Octokit, or GitHub CLI. Which of these should I focus on? Or should I do research on all of them?
With the API, I can GET a review comment and also GET a list. Should I be first retrieving a list then getting the specified review comments from the list?

@RavenMarQ RavenMarQ self-assigned this Oct 21, 2024
@carlosparadis
Member

No, this is an API request to download data. It is closer to @daomcgill if anything, but she is not using an API. You should look at the pull request for GitHub. It has many examples on how to write R code to download data. Please follow the format already used in github.R (I'd recommend trying the notebook from a user perspective too and downloading some data). Again, see the PR (the GitHub code has not been merged yet). @crepesAlot pinging you here so you are aware too.

@RavenMarQ
Collaborator Author

RavenMarQ commented Oct 21, 2024

Taking a glance at the github.R PR, it's not well-documented but I'll report back what I can scrounge up of what I understand. If this is not the file you were meaning for me to look at, let me know.

I have also been in communication with @daomcgill about making this refresher function, so I'm currently forwarding questions to her via Discord.

@carlosparadis
Member

not exec/github.R, look at R/github.R and the associated notebook. @beydlern should have revised it.

@carlosparadis
Member

Also, please collaborate closely with @crepesAlot. Once either of you understands how it has been done, explaining it to the other will save you both a lot of time.

@beydlern
Collaborator

@RavenMarQ
The notebooks that I refactored are located in PR #289.

  • vignettes/download_github_comments.Rmd
  • vignettes/gitlog_entity_showcase.Rmd
  • vignettes/gitlog_showcase.Rmd

@carlosparadis
Member

Look at vignettes/download_github_comments.Rmd. The effort in this task should be very minimal and non-code related, since you are not writing the GitHub API interface from scratch like @beydlern is doing with OpenHub. There is already a lot of code written to interact with the GitHub API. Just see what the API endpoint looks like in R/github.R for one endpoint, then literally copy and paste the code, change the URL, and adapt the documentation.

Then you need to write the parser for the downloaded json, which is something you're familiar with from having to do the XML. In hindsight, @beydlern is the closest task equivalent to yours, since he is also using an API. I'd hope the R/github.R code suffices to give a template on how to do that.

@RavenMarQ RavenMarQ assigned crepesAlot and unassigned RavenMarQ Oct 26, 2024
@crepesAlot crepesAlot assigned crepesAlot and unassigned crepesAlot Oct 28, 2024
@carlosparadis
Member

@crepesAlot when you can, please ask Raven to update this issue specification. For now I'd recommend at least keeping track in a comment here, and I can hyperlink the first comment until the final version is done.

@carlosparadis carlosparadis added this to the ics496-fall24-m2 milestone Nov 11, 2024
@crepesAlot
Collaborator

crepesAlot commented Dec 3, 2024

@carlosparadis Here are the main points that are retrieved from this endpoint.
Using the following Pull Request: https://github.com/sailuh/kaiaulu/pull/327/files/144d076fe0d1c9096d4a8091c45d9ec7adfa2c52
Image
The three columns are retrieved as such:
Image
The data seems to be retrieved from the Conversation tab of the pull request rather than the Files Changed tab.

@crepesAlot
Collaborator

@carlosparadis I have tried testing what you asked about how the information for the diff_hunk field changes when clicking a single line and selecting multiple lines.
Somehow testing this on my own repo has only further confused me.
In this screenshot, I selected lines 22 to 30 when making the comment.
Image
Yet the diff_hunk field in the response returns every line in the entire file up to line 30. It responded like this regardless of whether I selected multiple lines or clicked once.
Image
I'm not sure why this occurs, as it doesn't happen for any of the responses from Kaiaulu.

@carlosparadis
Member

Ah yes, another GitHub mystery!

#323 (comment)

Why does my comment appear twice in the table here?

And #323 (comment)

What is the docs definition for this diff hunk?

@crepesAlot
Collaborator

crepesAlot commented Dec 4, 2024

@carlosparadis Those are 2 different comments; the comment in the screenshot of the pull request is the last row in the table.
The GitHub docs description for diff_hunk is: "The diff of the line that the comment refers to."
It looks like the diff_hunk field isn't just the selected lines of code referred to by the comment, but grabs all lines of code within the hunk (block of code changes) that the comment refers to.
So if it's a new file where every line is new, then it grabs every line of code up to the line referred to in the comment.

So I looked into the response JSON some more and I found 4 fields that may be of interest: start_line, line, original_start_line, and original_line

  • start_line: returns the first line number if multiple lines are selected when making the comment, or null if only 1 line is selected.
  • line: returns the last line number if multiple lines are selected, or the line number if only 1 line is selected.

The start_line and line fields will return null if any subsequent commits delete the line the comment was referring to.
The original_start_line and original_line fields will then return the old line numbers before any changes.
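
The fallback between these four fields can be written down as a small Python sketch. It only encodes the behavior described above; the function name `comment_line_range` and the sample dictionaries are hypothetical, and this is not the R parser from the pull request.

```python
def comment_line_range(comment):
    """Return (first_line, last_line, is_current) for a review comment.

    is_current is False when the line numbers had to be taken from the
    original_* fields because line is null.
    """
    last = comment.get("line")
    is_current = last is not None
    if is_current:
        first = comment.get("start_line")
    else:
        last = comment.get("original_line")
        first = comment.get("original_start_line")
    # A null start means only one line was selected.
    if first is None:
        first = last
    return first, last, is_current


# Made-up examples covering the three cases described above.
multi_line = {"start_line": 42, "line": 45,
              "original_start_line": 42, "original_line": 45}
single_line = {"start_line": None, "line": 7,
               "original_start_line": None, "original_line": 7}
outdated = {"start_line": None, "line": None,
            "original_start_line": 42, "original_line": 45}

for example in (multi_line, single_line, outdated):
    print(comment_line_range(example))
```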

Here's an example screenshot.
Image
Here you selected lines 42-45 when making the comment, so the table looks like:
Image
So the original_start_line is 42 and the original_line is 45. And since download_mail.Rmd was a new file, the diff_hunk field grabbed lines 1-45.
Image
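
Since diff_hunk runs from the start of the hunk to the commented line, the selected lines sit at the end of it. The Python sketch below trims a hunk down to those lines. It assumes the comment was made on the right-hand (new) side of the diff, so removed lines are not counted toward the line numbers; the function name `selected_hunk_lines` and the sample hunk are hypothetical.

```python
def selected_hunk_lines(diff_hunk, first_line, last_line):
    """Return the trailing hunk lines covering first_line..last_line."""
    wanted = last_line - first_line + 1
    picked = []
    counted = 0
    # Walk backwards from the commented line toward the hunk header.
    for text in reversed(diff_hunk.split("\n")):
        if text.startswith("@@"):
            break  # reached the hunk header
        if text.startswith("\\"):
            continue  # skip "\ No newline at end of file" markers
        picked.append(text)
        # Removed lines ("-") have no line number on the new side.
        if not text.startswith("-"):
            counted += 1
            if counted == wanted:
                break
    picked.reverse()
    return picked


# Made-up hunk for a new five-line file, with a comment on lines 3 to 5.
hunk = "@@ -0,0 +1,5 @@\n+a\n+b\n+c\n+d\n+e"
print(selected_hunk_lines(hunk, 3, 5))
```

For a comment on a single line, passing the same number for both arguments returns just the last line of the hunk.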

crepesAlot added a commit that referenced this issue Dec 5, 2024
- Updated `vignettes/download_github_comments.Rmd`
- Added pull request comments to parse_github_replies function to `R/github.R`
- Added new getter function to `R/config.R`
- Updated configuration files with new file paths to save data from new endpoint
@carlosparadis
Copy link
Member

@crepesAlot remind me again, was this ready for review or was there anything pending?

@crepesAlot
Collaborator

@carlosparadis I just need to update the function documentation with a more in-depth explanation of the different fields.

@carlosparadis
Member

Sounds good, send me the request for code review when ready!

crepesAlot added a commit that referenced this issue Dec 9, 2024
- Updated notebook to include section on where to place created tokens to use endpoints.
- Updated pr_comments function documentation
crepesAlot added a commit that referenced this issue Dec 9, 2024
@crepesAlot
Collaborator

@carlosparadis I updated the function documentation and NEWS.md, so it should be ready for code review!

crepesAlot added a commit that referenced this issue Dec 10, 2024
- Made refresh function to use function parameter for the save_folder_path instead of a pre-assigned variable.
- Updated various verbose print statements
crepesAlot added a commit that referenced this issue Dec 11, 2024
- Changed the way the function obtains filepaths from config file.
@crepesAlot
Collaborator

crepesAlot commented Dec 11, 2024

@carlosparadis I was able to find what connects the Pull Request Comment endpoint (this issue's endpoint) with the Pull Request Review endpoint.
The Pull Request: #295 (review)
Implemented Comment Endpoint: https://api.github.com/repos/sailuh/kaiaulu/pulls/comments (github_api_project_pr_comment).
The Review Endpoint: https://api.github.com/repos/sailuh/kaiaulu/pulls/295/reviews
Image
Image
I've updated the Pull Request Comments parser to add the pull_request_review_id field as review_id in the table.
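
To make the link concrete, here is a minimal Python sketch that attaches review comments to their review by matching `pull_request_review_id` against the `id` of each object from the Review endpoint. The function name `group_comments_by_review` and all sample values are hypothetical, and this is not the R code added in the commit.

```python
from collections import defaultdict


def group_comments_by_review(reviews, comments):
    """Attach the ids of review comments to the review they belong to."""
    by_review = defaultdict(list)
    for comment in comments:
        by_review[comment.get("pull_request_review_id")].append(comment["id"])
    return [
        {
            "review_id": review["id"],
            "state": review.get("state"),
            "review_body": review.get("body"),
            # .get() avoids creating empty entries for reviews with no comments.
            "comment_ids": by_review.get(review["id"], []),
        }
        for review in reviews
    ]


# Made-up data shaped like the two endpoints' responses.
reviews = [
    {"id": 10, "state": "CHANGES_REQUESTED", "body": "Please see my notes."},
    {"id": 11, "state": "APPROVED", "body": "Looks good."},
]
comments = [
    {"id": 1, "pull_request_review_id": 10},
    {"id": 2, "pull_request_review_id": 10},
]

for row in group_comments_by_review(reviews, comments):
    print(row["review_id"], row["state"], row["comment_ids"])
```

The review's own body is the accept/reject comment that the Comments endpoint does not return, which is why both endpoints are needed to reconstruct a full review.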

crepesAlot added a commit that referenced this issue Dec 11, 2024
- Retrieves `pull_request_review_id` from JSON file to refer to the review comment made when creating the review.