Github Discussions with GraphQL API #324

Open
5 tasks done
crepesAlot opened this issue Oct 23, 2024 · 18 comments
@crepesAlot
Collaborator

crepesAlot commented Oct 23, 2024

Purpose


GitHub Discussions is a public forum that allows for collaborative communication without needing to be tied to a specific project or related to code. It provides a more centralized space to hold discussions.
The data that can be mined from the discussions and discussion comments can be of great interest to anyone interested in the relationship between users and a project's community.
As such, we now need a way to retrieve any comments from this new endpoint.

About GitHub Discussions: https://docs.github.com/en/discussions/quickstart

Process


To do this, we're using the GraphQL API.
There is only a single endpoint: https://api.github.com/graphql
Instead of the GET requests used with the REST API, GraphQL uses queries.
A query returns only the fields it specifies.
gh is the client Kaiaulu relies on to access GitHub's REST and GraphQL APIs; it is what will be used to access GraphQL's single endpoint.
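
A minimal sketch of what such a query could look like from R. The field names follow GitHub's GraphQL guide for Discussions; the helper names `build_discussions_query` and `fetch_discussions_page` are hypothetical (they are not Kaiaulu's functions), and the request assumes the gh package is installed and a valid token is supplied.

```r
# Build a GraphQL query string for one page of discussions.
# `after` is the pagination cursor; NULL requests the first page.
# (Hypothetical helper for illustration only.)
build_discussions_query <- function(owner, repo, first = 10, after = NULL) {
  after_arg <- if (is.null(after)) "" else sprintf(', after: "%s"', after)
  sprintf('query {
  repository(owner: "%s", name: "%s") {
    discussions(first: %d%s, orderBy: {field: CREATED_AT, direction: DESC}) {
      pageInfo { hasNextPage endCursor }
      edges {
        node {
          title
          bodyText
          createdAt
          author { login }
          category { name }
          answer { id }
        }
      }
    }
  }
}', owner, repo, as.integer(first), after_arg)
}

# Send the query to the single GraphQL endpoint through gh.
# Nothing is requested until this function is called with a real token.
fetch_discussions_page <- function(token, owner, repo, first = 10, after = NULL) {
  query <- build_discussions_query(owner, repo, first, after)
  gh::gh("POST /graphql", query = query, .token = token)
}
```

Only the fields named inside `node { ... }` come back in the response, which is the main difference from a REST call.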

Limitations


As of now, there are several points of interest that cannot be retrieved with the GraphQL API endpoint.

  • Data from polls, namely the questions and answer choices. (The title and comments of a poll discussion can still be retrieved.)
  • The number of upvotes for discussions. Retrieved comments also cannot be sorted by number of upvotes.
  • Refresh functionality:
    While the queries allow discussions to be fetched in order of date, there is no way to filter or download discussions by their created-at dates.
    They can only be filtered by a cursor variable (end_cursor), which is what the query uses to paginate discussions.
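
As a rough illustration of cursor-based pagination, the loop below keeps requesting pages until the response reports no further page. `collect_all_pages` and its `fetch_page` argument are hypothetical names; `fetch_page` is assumed to be any function that takes a cursor (NULL for the first page) and returns the `discussions` part of a response, with `edges` and `pageInfo`.

```r
# Collect every page by following the cursor until hasNextPage is FALSE.
# `max_pages` is a safety cap so a misbehaving fetcher cannot loop forever.
collect_all_pages <- function(fetch_page, max_pages = 1000) {
  all_edges <- list()
  cursor <- NULL
  for (i in seq_len(max_pages)) {
    page <- fetch_page(cursor)
    all_edges <- c(all_edges, page[["edges"]])
    if (!isTRUE(page[["pageInfo"]][["hasNextPage"]])) break
    # The cursor of the last item on this page is the start of the next one.
    cursor <- page[["pageInfo"]][["endCursor"]]
  }
  all_edges
}
```

Passing the fetcher in as an argument keeps the loop independent of how the request is made, so it can be tried with canned pages before using a real token.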

Task List


  • Get an API Key
  • Understand how the queries are structured
  • Learn what can be retrieved for discussions and discussion comments
  • How to use gh to access the endpoint
  • Test to ensure downloader works properly
@crepesAlot crepesAlot self-assigned this Oct 23, 2024
@crepesAlot
Collaborator Author

crepesAlot commented Oct 23, 2024

@carlosparadis I have some questions about creating a parser for the downloaded file.
How should I determine what should and should not be saved?
Using the GitHub Issue Events endpoint as an example, the parser looks like:

kaiaulu/R/github.R

Lines 150 to 187 in 810c183

github_parse_project_issue_events <- function(api_responses){
  parse_response <- function(api_response){
    parsed_response <- list()
    parsed_response[["id"]] <- api_response[["id"]]
    parsed_response[["created_at"]] <- api_response[["created_at"]]
    parsed_response[["commit_id"]] <- ifelse(length(api_response[["commit_id"]]) == 0,
                                             NA,
                                             api_response[["commit_id"]])
    parsed_response[["event"]] <- api_response[["event"]]
    parsed_response[["actor_login"]] <- api_response[["actor"]][["login"]]
    parsed_response[["actor_id"]] <- api_response[["actor"]][["id"]]
    parsed_response[["actor_type"]] <- api_response[["actor"]][["type"]]
    parsed_response[["issue_number"]] <- api_response[["issue"]][["number"]]
    parsed_response[["issue_title"]] <- api_response[["issue"]][["title"]]
    parsed_response[["issue_user_login"]] <- api_response[["issue"]][["user"]][["login"]]
    parsed_response[["issue_user_id"]] <- api_response[["issue"]][["user"]][["id"]]
    parsed_response[["issue_user_site_admin"]] <- api_response[["issue"]][["user"]][["site_admin"]]
    parsed_response[["issue_state"]] <- api_response[["issue"]][["state"]]
    parsed_response[["issue_author_association"]] <- api_response[["issue"]][["author_association"]]
    parsed_response[["issue_body"]] <- api_response[["issue"]][["body"]]
    parsed_response[["issue_assignee_login"]] <- api_response[["issue"]][["assignee"]][["login"]]
    parsed_response[["issue_assignee_id"]] <- api_response[["issue"]][["assignee"]][["id"]]
    parsed_response[["issue_body"]] <- api_response[["issue"]][["body"]]
    assignees_list <- api_response[["issue"]][["assignees"]]
    assignees_list <- lapply(assignees_list,function(x){
      data.table(data.frame(issue_assignees_login=x[["login"]],issue_assignees_id=x[["id"]]))
    })
    assignees_list <- rbindlist(assignees_list,fill=TRUE)
    parsed_response[["issue_assignees_login"]] <- stringi::stri_c(assignees_list$issue_assignees_login,collapse = ";")
    parsed_response[["issue_assignees_id"]] <- stringi::stri_c(assignees_list$issue_assignees_id,collapse = ";")
    parsed_response <- as.data.table(parsed_response)
    return(parsed_response)
  }
  rbindlist(lapply(api_responses,parse_response),fill=TRUE)
}

Looking at the example response on the REST API endpoint for issue events, there is quite a lot left out.
https://docs.github.com/en/rest/issues/events?apiVersion=2022-11-28#get-an-issue-event
Is it up to my own discretion what is and isn't important?

@carlosparadis
Member

There are refresher endpoints in R/github.R if you want to learn how that is done (I believe it is the search endpoint). There may be another notebook too, other than comments. However, to implement refresh you need an API endpoint that lets you select at least a starting date for the comments. Does this endpoint give that?

Second, have you looked through GitHub to see if this is the only way to download Discussion comments? GitHub sometimes offers multiple API endpoints, so you want to be careful here that you don't end up at the wrong endpoint.

Third, you may want to just try requesting the JSON in the browser (you can construct the request as a URL). Please don't paste the URL here with your API key, but do place the URL here as an example with a PLACEHOLDER, as @beydlern did.

What gets parsed depends on what we decide here is relevant for the various analyses Kaiaulu does, so the easiest way would be for you to make a suggestion for us to agree on (I do need you to make sure you are considering all possible endpoints).

Also, the motivation in your issue specification (the Purpose section) sounds a bit strange to me. It sounds more like GitHub's motivation than our own. Our own motivation ties closer to @daomcgill's work. Dao downloads mailing list data, and mailing lists can be about developer communication, user communication, or more. Back in the day, a lot of people used mailing lists for both. This goes back to before issue trackers even existed, let alone GitHub.

Nowadays, issues exist, so the developer mailing list in a lot of projects has moved on to issue trackers. The equivalent of the user mailing list is Discussions on GitHub (but other projects may use something else). Therefore, the purpose of making this capability available in Kaiaulu is so we can mine user interaction with projects on GitHub. Some research may be interested in understanding how projects interact with users, for analysis and community health for example (there are hundreds of studies that analyze StackOverflow questions!).

@carlosparadis
Member

I also suggest you take a look at the user side of Discussions so you understand the data you are getting (or not):

(Please don't create random questions, it will pollute kaiaulu repo, but you can always create a sandbox repo on your own account to play with it and delete your sandbox repo later):

https://github.com/sailuh/kaiaulu/discussions/new/choose

Notice how there are 5 types of categories. It is easier for us to discuss what data makes sense if you explain here from what I can see already.

@carlosparadis
Member

One last note: Before you spend too much time on code and API, you should make sure the endpoint is the correct one: I just noticed the API asks for a Team Slug. I have no idea what that is. If you go to the "Discussions" tab on Kaiaulu, you will notice there is no notion of Teams. It is just plain and simple discussions. So the URL I gave may be for another type of Discussions.

https://docs.github.com/en/search?query=discussions

You should check GitHub Docs and google to see if it is even possible to obtain the data in the first place!

@crepesAlot
Collaborator Author

crepesAlot commented Oct 23, 2024

I'll likely have to rename this issue and rewrite a lot of the process. After looking into it, the REST API endpoint isn't actually for Discussions but for GitHub Teams Discussions, a completely different thing, which was my mistake.
I'm currently looking into the GraphQL API:
https://docs.github.com/en/graphql/guides/using-the-graphql-api-for-discussions

@carlosparadis
Member

Sounds good!

@crepesAlot
Collaborator Author

@RavenMarQ @carlosparadis Just mentioning this as I look into the GraphQL API.
I'm still looking into it, but it looks like it might be better to use the GraphQL API rather than the REST API for the pull requests too.
https://docs.github.com/en/graphql/guides/migrating-from-rest-to-graphql
This guide talks about how you can replace multiple REST calls with fewer GraphQL queries, and it is applicable when retrieving pull requests, commits, non-review comments, and reviews.

@crepesAlot crepesAlot changed the title from "Github Discussions with REST API" to "Github Discussions with GraphQL API" Oct 23, 2024
@crepesAlot
Collaborator Author

@carlosparadis I've become more familiar with how the query works for GraphQL, and have started creating the functions. While I work on the functions, I wanted to run past you the information I am retrieving with the query.
The information I am getting from the discussions:

  • Title of discussion
  • Body text of discussion
  • Author of discussion
  • Date discussion post was created (ex: 2024-10-01T22:01:58Z)
  • The category name of the discussion post
  • Id of the comment chosen as an answer
  • Comments under the discussion post
  • Author of the comments
  • Id of the comment
  • Date the comments were created

Is there any other information that is either missing or not needed from this query?
My current plan is to create one function to retrieve all the information, then one parser for the discussion post and another parser for the comments. However, I may change this as I work on it, as the query only gets the first/last x number of discussions/comments, so I may need to work around that so the function can get all of the discussions and comments.


Here is an example response I got for the first discussion post listed in Kaiaulu's discussions:

{
  "data": {
    "repository": {
      "discussions": {
        "edges": [
          {
            "node": {
              "title": "Extracting features from a git repo with Kaiaulu",
              "bodyText": "Which of the following data points can be extracted from a git repo with Kaiaulu?  And for those that can be extracted, could you provide the instructions and/or a link to them?\n• total # of outstanding bugs\n• total # of outstanding non-bugs (typically feature requests)\nThe following are for a given period of time:\n• # of new bug issues\n• # of new non-bug issues\n• average bug-resolution time\n• average non-bug resolution time\n• # of active contributors\n• # of new contributors\n• # of bug-fixing commits\n• # of non-bug-fixing commits\n• # of LOC committed for bug resolution\n• # of LOC committed for non-bug resolution\n• # of emails on the project mailing list",
              "author": {
                "login": "BenjyNStrauss"
              },
              "createdAt": "2024-10-01T22:01:58Z",
              "category": {
                "name": "Q&A"
              },
              "answer": {
                "id": "DC_kwDOD0xXC84ApQNr"
              },
              "comments": {
                "edges": [
                  {
                    "node": {
                      "bodyText": "Hi Beni, @rnkazman\nThe way to go about Kaiaulu is asking yourself first \"where is the data coming from?\"\nFor bug data, that means you need to collect issue tracker data first. Then the question is, from which issue tracker? Kaiaulu can get you data from JIRA, GItHub, and Bugzilla.\nOnce you decide on that, you can go to the respective menu on the \"Reference\" page for any of these:\nhttp://itm0.shidler.hawaii.edu/kaiaulu/reference/index.html#-jira-\nand subsequently, see the Notebook to obtain the data. From the table obtained, you can then calculate any metrics you wish.\nSince you are interested in calculating a Metric out of the data above, then you can click on the \"Metric\" menu on the right:\nhttp://itm0.shidler.hawaii.edu/kaiaulu/reference/index.html#-metrics-\nYou will find there is a Bug Count Notebook there.\n\nThe same process can be used for your other metrics. For example, if you want to calculate contributor metrics, you can again ask yourself \"What is the data that I need to obtain this metric?\"\nThat would be \"Git\". Again, you can go to the docs page above and click \"Git\" and see the associated Git Log table. Same with Mail, etc. Same with \"Mail\".\n\nAs for how to link them: Depends on what you want to link them on, and what granularity. If you can give me something more specific, I can give you pointers.\nThe bottomline is: Kaiaulu will give you tables, and for most of them you will be making inner joins out of them to link. If you plan to connect people, see the \"Identity\" section and the associated Notebook.\nIf you are looking for a one button press solution to create the metrics above, we do not have that. But it should be relatively simple (group by, subset, inner joins) to get to them from the tables Kaiaulu gives you.",
                      "author": {
                        "login": "carlosparadis"
                      },
                      "id": "DC_kwDOD0xXC84ApQNr",
                      "createdAt": "2024-10-01T22:56:50Z"
                    }
                  }
                ]
              }
            }
          }
        ]
      }
    }
  }
}
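
One way the two planned parsers could flatten a response shaped like the one above is sketched below, using only base R. The function name `parse_discussions_sketch`, the helper `or_na`, and the output column names are hypothetical, not Kaiaulu's actual parsers. The sketch assumes the JSON was read into nested lists, and it links comments to their discussion by title only because the example response carries no discussion id; requesting the discussion's own id in the query would give a safer key.

```r
# Return x, or NA when the field is missing or null in the response.
or_na <- function(x) if (is.null(x)) NA_character_ else x

# Flatten a discussions response into two tables:
# one row per discussion and one row per comment.
parse_discussions_sketch <- function(api_response) {
  edges <- api_response[["data"]][["repository"]][["discussions"]][["edges"]]
  discussion_rows <- list()
  comment_rows <- list()
  for (edge in edges) {
    node <- edge[["node"]]
    discussion_rows[[length(discussion_rows) + 1]] <- data.frame(
      title = or_na(node[["title"]]),
      body = or_na(node[["bodyText"]]),
      author_login = or_na(node[["author"]][["login"]]),
      created_at = or_na(node[["createdAt"]]),
      category = or_na(node[["category"]][["name"]]),
      # `answer` is null for discussions without a chosen answer.
      answer_id = or_na(node[["answer"]][["id"]]),
      stringsAsFactors = FALSE
    )
    for (comment_edge in node[["comments"]][["edges"]]) {
      comment <- comment_edge[["node"]]
      comment_rows[[length(comment_rows) + 1]] <- data.frame(
        discussion_title = or_na(node[["title"]]),
        comment_id = or_na(comment[["id"]]),
        body = or_na(comment[["bodyText"]]),
        author_login = or_na(comment[["author"]][["login"]]),
        created_at = or_na(comment[["createdAt"]]),
        stringsAsFactors = FALSE
      )
    }
  }
  # do.call(rbind, list()) yields NULL when there are no rows at all.
  list(
    discussions = do.call(rbind, discussion_rows),
    comments = do.call(rbind, comment_rows)
  )
}
```

A comment whose `comment_id` equals the discussion's `answer_id` is the one marked as the answer, so the two tables can be joined on that pair if needed.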

@carlosparadis
Member

@crepesAlot thank you for the update! I guess one question here is, when you attempt to create a discussion it can be Q&A but also others. How does this affect the data format?

Could you create a fork (not on Kaiaulu) and experiment on your fork with the discussion tabs, as an example to see what you get out of the API? For instance, the Poll, Q&A and the others look like their JSON would be different.

[Image: Screen Shot 2024-11-05 at 9 53 17 PM]

@crepesAlot
Collaborator Author

@carlosparadis I actually found that the format doesn't change at all. It still retrieves the title, body and any comments under the discussion without any issues, regardless of its category. For discussions without a chosen answer, the answer field simply returns null.
The only exception is polls: the query doesn't get the poll's question and answer choices. It looks as though polls are just too new and none of the APIs support them yet.

I'm also hopeful that the refresher function will be relatively easy to create, as not only can I get the time a discussion was created but I can also filter discussions more easily.

The signature of the discussions field from the documentation:

discussions(
  after: String,
  before: String,
  first: Int,
  last: Int,
  categoryId: ID = null,
  answered: Boolean = null,
  orderBy: DiscussionOrder = {field: UPDATED_AT, direction: DESC}
) : Discussion

They also list all the information you can pull from Github Discussions here:
https://docs.github.com/en/graphql/guides/using-the-graphql-api-for-discussions#discussion

@carlosparadis
Member

You can go ahead and proceed with the code for this! I guess one open question, if the output is the same for all responses, is: what about the upvoted answers? Can't we obtain the number of upvotes?

@crepesAlot
Collaborator Author

Unfortunately, there doesn't seem to be a way to get the number of upvotes. The closest thing would be getting the reactions to comments, such as a thumbs up, but that is separate from upvotes.

@crepesAlot
Collaborator Author

@carlosparadis Having some difficulties with the github_api_showcase.Rmd
I'm trying to use the code in the notebook to download commits so that I could see what the end result of the parse functions should look like for my own parser function, but I've been getting some errors.

> github_api_iterate_pages(token,gh_response,save_path_commit,prefix="commit")
Warning: cannot open file '../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/sailuh_kaiaulu_commit_p_1.json': No such file or directory
Error in file(con, "w") : cannot open the connection

I haven't been able to figure out what the problem is. I ran the following lines to try to find it.

> file.exists("../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/sailuh_kaiaulu_commit_p_1.json")
[1] TRUE
> writeLines("test", con= "../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/sailuh_kaiaulu_commit_p_1.json")
> readLines("../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/sailuh_kaiaulu_commit_p_1.json")
[1] "test"

Shouldn't this mean that the file exists and that it has write permissions? Or have I misunderstood something major about how the functions work?

@carlosparadis
Member

Did you try opening the function definition and running one line of it at a time? I think the filepath constructed inside the function, or the location it is relative to when you run it, may just be incorrect.
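
A small sketch for checking this kind of path problem. Relative paths such as "../../rawdata/..." are resolved against the current working directory, which can differ between the console and a knitted notebook, so one possible cause is that the folder exists from one location but not the other. `ensure_parent_dir` is a hypothetical helper, not part of Kaiaulu.

```r
# Make sure the folder a file will be written to exists, then return the
# absolute path R resolved, so it can be compared with the intended one.
ensure_parent_dir <- function(file_path) {
  parent <- dirname(file_path)
  if (!dir.exists(parent)) {
    dir.create(parent, recursive = TRUE)
  }
  normalizePath(parent, mustWork = TRUE)
}

# Example with a temporary location: the nested folders are created first,
# so opening the file for writing no longer fails.
example_file <- file.path(tempdir(), "github", "commit", "page_1.json")
resolved_dir <- ensure_parent_dir(example_file)
writeLines("test", example_file)
```

Printing `getwd()` next to the resolved folder, both in the console and inside the notebook chunk, shows whether the two are looking at the same place.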

crepesAlot added a commit that referenced this issue Nov 8, 2024
For github.R:
- github_api_discussions
- github_parse_discussions
- github_parse_discussion_comments
For config.R:
- get_github_discussions_path
For conf/kaiaulu.yml:
- Added new discussion field in github issue_tracker for a save filepath for discussions JSON file
For github_api_showcase.Rmd:
- Demonstrating usage for the newly added functions.
@carlosparadis carlosparadis added this to the ics496-fall24-m2 milestone Nov 11, 2024
@crepesAlot
Collaborator Author

@carlosparadis Thankfully, it looks like we solved the problems with the notebooks and the gh tool.
After re-creating the file directory and including require(gh) in the notebook, the functions were able to run and download data perfectly fine. The problem didn't lie in the version of the gh tool after all.

crepesAlot added a commit that referenced this issue Nov 13, 2024
Changed the function to paginate as needed to download all available entries.
crepesAlot added a commit that referenced this issue Nov 21, 2024
- Added refresher function for discussions
- Non-fixed warning, refresher function could cause error if JSON file is improperly named.
crepesAlot added a commit that referenced this issue Dec 5, 2024
- Moved the Download Discussions function and explanations from `vignettes/github_api_showcase.Rmd` to the `vignettes/download_github_comments.Rmd` notebook
crepesAlot added a commit that referenced this issue Dec 5, 2024
- Updated filepaths to unify formatting
- Added discussion save filepath to github issue tracker
crepesAlot added a commit that referenced this issue Dec 8, 2024
- Updated `vignettes/download_github_comments.Rmd` notebook
- Added documentation for refresh function
- Set create_file_directory function verbose param to FALSE as default
crepesAlot added a commit that referenced this issue Dec 11, 2024
- Reconfigured how `create_file_directory` function obtains paths from config file
@crepesAlot
Collaborator Author

@carlosparadis I was unable to create a refresh function for the discussions, so I removed the function from R/github.R.
The problem is that discussions can only be ordered by date, not filtered by date. We can sort the discussions by the most recently created dates, but the query will still list all discussions, which would lead to duplicate discussions being downloaded.
I added this to the listed limitations in the initial issue description.

@carlosparadis
Member

Can you give me the endpoint where you found the limitation? Or was it through trial and error? If so, what is the closest documentation you can find about this?

@crepesAlot
Collaborator Author

I found this limitation through trial and error.
The documentation doesn't explicitly state the limitations, but discusses pagination and how to order discussions on those two links.
I actually just got one more idea on how to form a refresher which might be worth testing, so I'm going to see if I can do that right now. If I can have the refresher find the most recent discussion, retrieve its start_cursor, then filter for discussions before the retrieved start_cursor, it might be a way to see if there are more recent discussions to download.
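
For comparison, here is a sketch of a different workaround than the cursor idea above: filter on the client side. It assumes the discussions are requested newest first and that the createdAt of the newest discussion already downloaded is known; paging can then stop as soon as a page contains an already-known discussion. The function names are hypothetical and this has not been tested against the API.

```r
# Convert GitHub's ISO 8601 timestamps (e.g. "2024-10-01T22:01:58Z") to POSIXct.
parse_github_time <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

# Keep only nodes created after `latest_downloaded`, assuming `nodes` is
# sorted newest first. Returns the new nodes and whether paging can stop.
keep_new_discussions <- function(nodes, latest_downloaded) {
  cutoff <- parse_github_time(latest_downloaded)
  created <- parse_github_time(
    vapply(nodes, function(n) n[["createdAt"]], character(1))
  )
  is_new <- created > cutoff
  list(new_nodes = nodes[is_new], reached_known = !all(is_new))
}
```

This still downloads at least one page on every refresh, but it avoids saving duplicates because anything at or before the cutoff is dropped.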

crepesAlot added a commit that referenced this issue Dec 11, 2024