Downloaders data storage organization #286
I'll start by just posting what the current storage organization is:
Jira downloader:
Changed from
I just make new directories for the project name and for issues or issue_comments, respectively.
Github Downloader:
Unchanged:
Bugzilla Downloader
Currently, the bugzilla_showcase notebook uses 3 different methods to download data: Traditional Perceval, Perceval's REST API, and Bugzilla's REST API. The one that I will be using for the refresher is Bugzilla's REST API.
Bugzilla's REST API Downloader Storage Organization
I will be using the current storage organization used for Bugzilla issues, as it is the same format as the GitHub version above.
Specification change
@Ssunoo2 there is something wrong with your filepath. I remember we agreed we should include in the path
In addition to that, and the primary reason why I wanted to create this issue to compare side by side, is that the project organization is counter-intuitive as it is on Kaiaulu (and I believe there was even some confusion in your group early on about why the files were organized in this manner). We should organize the information at project level, i.e.:
Bugzilla
Instead of:
We would have:
Jira
And instead of:
We would have:
Motivation
The reason for that is that someone running a multi-project analysis generally thinks of the data "per project" rather than "per data source". In addition, if we are discussing a particular project, and I would like to reproduce your analysis, I may need to ask you to send me "the data of the project". In the current organization, you would need to check every folder to fish for the data the project has, whereas in the new organization you simply zip the folder with said project name and send it over. Lastly, in the project folder organization, you can very quickly assess what data you have by opening the project folder. As it is, you also have to go check each folder.
Anomaly Case 1
There are some strange cases out there that I want to make sure you give proper consideration to as you write the refresher of these downloaders. The first one is the HADOOP project. @Ssunoo2 this affects you the most since this is a JIRA project. If you look at Hadoop on GitHub (https://github.com/apache/hadoop), particularly the commits, you will see it can have multiple JIRA IDs. You can imagine the mess it turned out to be trying to manage that in the current folder organization. Let's assume the proposed new organization with the downloader logic you currently have for JIRA. I will focus on the issue folder, since what works for issues would work for the comments folder. You would then have:
All in one folder. Would your refresher function work in this case? Or would it break, assuming all files in there are from a single issue id? If it will break, then we need to add some logic to discern based on your issue key. That being the case, notice how much saner it is that this data is contained inside the hadoop folder. We could technically make a sub-folder for every issue key, something like:
However, I worry this may complicate the folder hierarchy too much due to its depth.
Anomaly Case 2
The other anomaly case is the Spring Framework. You can read about it here: https://spring.io/blog/2019/01/15/spring-framework-s-migration-from-jira-to-github-issues Here's Spring GitHub: https://github.com/spring-projects/spring-framework/commits/main/ Basically, Spring used to have JIRA, and moved on to managing issues on GitHub (e.g.
Please give some thought to the above in one of your internal meetings. This is why I created a separate issue, as it affects all of you. I'd also recommend you (@Ssunoo2) edit your post with how GitHub saves, and that @ian-lastname make a post on how the mailing list downloader saves here. You want to have them all side by side to make sure the organization is consistent.
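For Anomaly Case 1, the "logic to discern based on your issue key" could be sketched as below. This is only a minimal illustration in Python (not Kaiaulu's R code), and the function names are hypothetical. It relies on the fact that a JIRA issue key has the form `<PROJECT KEY>-<number>`, and it deliberately works on issue keys rather than file names, since the file naming convention is not settled in this thread.

```python
from collections import defaultdict


def project_key_of(issue_key: str) -> str:
    """Return the JIRA project key part of an issue key, e.g. 'HDFS-123' -> 'HDFS'."""
    # Split on the last dash so the numeric suffix is always what gets dropped.
    return issue_key.rsplit("-", 1)[0]


def group_by_project_key(issue_keys):
    """Group issue keys by their project key, keeping the input order in each group."""
    groups = defaultdict(list)
    for issue_key in issue_keys:
        groups[project_key_of(issue_key)].append(issue_key)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical keys, as they might appear mixed in a single hadoop issues folder.
    mixed = ["HDFS-1", "YARN-2", "HDFS-3", "MAPREDUCE-4"]
    print(group_by_project_key(mixed))
```

A refresher that first groups by project key this way would not need to assume that every file in a folder belongs to the same key.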
Mbox uses the helix.yml config. Going by how the Jira save file path is now done, I'll make the storage organization for mbox as follows:
@ian-lastname your mbox architecture will likely be a bit more complex than that. I want you to take a look at OpenSSL as a reference point: https://www.openssl.org/community/mailinglists.html As you can see, OpenSSL (and in general any open source project) generally has multiple mailing lists: one for users, another for developers, and so on. In addition to that, a single mailing list may have multiple archives. See for example:
Has 3 archives. Now you may wonder why someone would download data from 3 archives for the same mailing list. This is because sometimes the archives cover different periods of a mailing list's existence. E.g. Google Groups could be from 2009-2013, MARC from 2008-2016, and Google groups some overlap of both. Your folder organization has to accommodate this. I'd argue your situation is a bit similar to the case of HADOOP, having multiple JIRA issues in a single project. So please discuss this with your group too, and afterwards edit your proposal to show how OpenSSL would look as a folder organization.
As we discussed on call, for projects that have multiple project keys (Anomaly Case 1), we will be using this format to organize the folder structure:
We are using this structure because we don't have to make edits to our existing functions that look for files. For Anomaly Case 2, we decided that we do not need to worry about it because it should not affect the current structure. For the Mbox folder structure, we will use this structure:
This structure separates each list and further separates the archive in each list. |
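As a rough sketch of how such per-key and per-list folders could be composed, the Python snippet below assumes the layout that the config paths later in this thread use (`rawdata/<project>/jira/<endpoint>/<project key>/` and `rawdata/<project>/<archive type>/<mailing list>/`). The function names and the `RAWDATA_ROOT` constant are hypothetical and are not part of Kaiaulu.

```python
from pathlib import PurePosixPath

# Hypothetical root; it mirrors the relative prefix used by the configs in this thread.
RAWDATA_ROOT = PurePosixPath("../../rawdata")


def jira_folder(project: str, endpoint: str, project_key: str) -> str:
    """Build <root>/<project>/jira/<endpoint>/<project_key>/ for one JIRA project key."""
    return f"{RAWDATA_ROOT / project / 'jira' / endpoint / project_key.lower()}/"


def mbox_folder(project: str, archive_type: str, mailing_list: str) -> str:
    """Build <root>/<project>/<archive_type>/<mailing_list>/ for one mailing list archive."""
    return f"{RAWDATA_ROOT / project / archive_type / mailing_list}/"


if __name__ == "__main__":
    # One folder per project key, all contained inside the hadoop project folder.
    for key in ("HDFS", "MAPREDUCE", "YARN"):
        print(jira_folder("hadoop", "issues", key))
    print(mbox_folder("geronimo", "mod_mbox", "geronimo-dev"))
```

Because every path starts with the project name, zipping the project folder is enough to share all data of that project.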
Here is my suggested change in the config file format.
Multiple project keys
Mbox changes
Concerning the mbox, there is more than just the paths that needs to be changed. This is the full extent of the mailing list information: Lines 47 to 54 in 2bc8d14
Contrast to openssl: Lines 47 to 55 in 2bc8d14
Minimally, you may have an mbox file that you acquired from another project. But alternatively you may need to use one of Kaiaulu's downloaders to get the data. Check what Kaiaulu functions need to execute (that's @ian-lastname's current task, to modify it to a refresher), and try to update the specification above before proceeding.
Issue Tracker
In the off-chance the project migrated the domain of their JIRA issue tracker, your config file proposal will break, since it assumes one domain for all the issue keys. Another concern I have is that if you mimic the enumeration you have done on project_key, issues, and issue_comments, there is this implicit assumption of order across them. Could you propose a different template here, under
@Ssunoo2 You will face the same consideration for your GitHub config file: Lines 65 to 70 in 2bc8d14
The anomaly case you are most likely to experience on GitHub would be project issues scattered across different GitHub projects. I have not encountered that yet, but I would not be surprised if they existed. Regardless, the solution would mimic what is decided for the JIRA config file.
Here is the updated version of the jira data storage:
issue_tracker:
  # each field in Jira will be a project key
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/hdfs
      project_key: HDFS
      # Download using download_jira_data.Rmd
      issues: ../../rawdata/hadoop/jira/issues/hdfs
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/hdfs
    project_key_2:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/mapreduce
      project_key: MAPREDUCE
      # Download using download_jira_data.Rmd
      issues: ../../rawdata/hadoop/jira/issues/mapreduce
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/mapreduce
    project_key_3:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/yarn
      project_key: YARN
      # local folder path
      issues: ../../rawdata/hadoop/jira/issues/yarn
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/yarn
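To illustrate why one self-contained block per project key avoids the implicit ordering problem raised earlier, here is a small Python sketch. It assumes the config has already been parsed into a nested dictionary (the parsing step is omitted), uses values adapted from the comment above, and the function name `jira_blocks_by_key` is hypothetical.

```python
# Config as it might look after parsing; values adapted from the comment above.
CONFIG = {
    "issue_tracker": {
        "jira": {
            "project_key_1": {
                "domain": "https://issues.apache.org/jira/hdfs",
                "project_key": "HDFS",
                "issues": "../../rawdata/hadoop/jira/issues/hdfs",
                "issue_comments": "../../rawdata/hadoop/jira/issues_comments/hdfs",
            },
            "project_key_2": {
                "domain": "https://issues.apache.org/jira/mapreduce",
                "project_key": "MAPREDUCE",
                "issues": "../../rawdata/hadoop/jira/issues/mapreduce",
                "issue_comments": "../../rawdata/hadoop/jira/issues_comments/mapreduce",
            },
        }
    }
}


def jira_blocks_by_key(config: dict) -> dict:
    """Index every JIRA block by its own project_key field.

    Each block carries its own domain and paths, so nothing depends on
    the position of an entry in a separate list.
    """
    blocks = config["issue_tracker"]["jira"]
    return {block["project_key"]: block for block in blocks.values()}


if __name__ == "__main__":
    for key, block in jira_blocks_by_key(CONFIG).items():
        print(key, block["domain"], block["issues"])
```

With this shape, a project key that lives on a different domain only changes the `domain` value inside its own block.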
Carlos Edit: I modified the config below.
mailing_list:
  mod_mbox:
    mail_key_1:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/
I modified the config above so it tries to stay consistent with the folder depth of the other downloaders and account for the information needed for the functions. I also changed from
@ian-lastname try to work with this and post here if, for some reason, it doesn't work with the functions you are using to refresh.
mailing_list:
  mod_mbox:
    domain: http://mail-archives.apache.org/mod_mbox/geronimo-user
    mail_key_1:
      key: geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      key: geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/
So, I modified the mod_mbox config. The reason why I changed it to this is that the downloader function was already made to put together the full url for the download using a base domain (domain) and a mailing list (key). Plus, with the way I changed it, I can easily obtain the name of the mailing list so that I can put it into the file name of the downloaded mbox file. Also, I don't think there is a notebook on the pipermail download function. Please correct me if I'm wrong.
@ian-lastname "because the function already does it" is not a good rationale: I modified the config so both pipermail and mod_mbox are consistent in the way the user uses the information. It is also clearer for someone to see a URL that they can paste in the browser than figuring out what a
The other point of concern is domain. I am not sure if there will be a case where a project's mailing list can end up in two domains for mod_mbox. So it is better to keep it flexible per project_key so we do not have to modify it in the future. Unless you made any other change, stick to #286 (comment). You can modify to be a url in this line: Line 148 in 2bc8d14
Just replace the
Seems not. Please add it to: https://github.com/sailuh/kaiaulu/blob/master/vignettes/download_mod_mbox.Rmd When you are done with the changes! |
As far as the key is concerned: Before you worry about that in https://mail.python.org/pipermail/mailman-users/ I believe Python can be used as an example. In fact, that's where the pipermail code originated in 2021: https://mail.python.org/pipermail/mailman-users/2012-October/074208.html Let me know how running this goes. Note you will need to modify the pipermail function to also allow to control the See #92 for context. |
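On the point about keeping a full `archive_url` in the config rather than a separate domain and key: the mailing list name can still be recovered from the URL when it is needed, for example for a file name. The Python sketch below is a minimal illustration and makes an assumption that may not hold for every archive, namely that the list name is the last path segment of the URL. The function name is hypothetical.

```python
from urllib.parse import urlsplit


def mailing_list_name(archive_url: str) -> str:
    """Return the last path segment of an archive URL.

    Assumption: the mailing list name is the final path segment, as in the
    mod_mbox and pipermail URLs quoted in this thread.
    """
    path = urlsplit(archive_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


if __name__ == "__main__":
    print(mailing_list_name("http://mail-archives.apache.org/mod_mbox/geronimo-dev"))
    print(mailing_list_name("https://mail.python.org/pipermail/mailman-users/"))
```

This keeps the config readable (a URL that can be opened in a browser) while the downloader derives whatever shorter identifier it needs.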
Here is the format for the jira and github config files:
issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://github.com/sailuh/kaiaulu
      project_key: KAIAULU
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/geronimo/jira/issues/
      issue_comments: ../../rawdata/geronimo/jira/issue_comments/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/
      issue: ../../rawdata/kaiaulu/github/issue/
      pull_request: ../../kaiaulu/github/pull_request/
      commit: ../../rawdata/kaiaulu/github/commit/
Please feel free to comment on anything that is formatted incorrectly.
Just post a new comment below with the corrected version instead of editing your existing one so it is not confusing to follow-up later: The domain information for Kaiaulu's JIRA is wrong:
This should be it instead. Try your downloader against it to see if it works. Note that Kaiaulu's domain is different from the other JIRAs, which use Apache.
Also, did you modify the existing endpoints in GitHub (commit, pr, etc.) so they are folders and can refresh? I don't remember. Could you add another project to github for Kaiaulu, including your fork information, to see what it looks like?
Also, I think the endpoints in your config do not agree with what Anthony put here: #286 (comment) There should be another folder at the end of the endpoints. For JIRA, that is named after the JIRA project key. For GitHub, the equivalent is the owner_repo combination. So in the Kaiaulu config you would have:
for the main repo, but if I was also downloading and tracking a fork, then that would be:
You can include your fork as an example of project_key_2 here so we can discuss, but don't include it in your actual commit since we do not need to download anything from there.
So that we have a realistic example, please create a codeface.conf And edit it so it includes on project_key_1: https://github.com/siemens/codeface And on project_key_2 Nicole's fork: https://github.com/lfd/codeface/tree/nicole-updates
Note on the Codeface config file, under the branch region: Lines 43 to 44 in 2bc8d14
You will include an additional line below master called |
Is this looking right?
issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh
    # project_key_2:
    #   # Obtained from the project's JIRA URL
    #   domain: https://sailuh.atlassian.net
    #   project_key: ssunoo2
    #   # Download using `download_jira_data.Rmd`
    #   issues: ../../rawdata/kaiaulu/jira/issues/ssunoo2
    #   issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/ssunoo2
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      pull_request: ../../kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
    # project_key_2:
    #   # Obtained from the project's GitHub URL
    #   owner: sailuh
    #   repo: kaiaulu
    #   # Download using `download_github_comments.Rmd`
    #   issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
    #   issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
    #   pull_request: ../../kaiaulu/github/pull_request/sailuh_kaiaulu/
    #   commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
For JIRA, I appended project_key to the end of the file path. For Github, I appended owner_repo to the end of the file path. I'll work on testing and make the codeface config file. Regarding the refresh for the pull requests and commits, I had originally thought I was supposed to, but you corrected me and specified issues and comments only during week 11.
No. There is no For project_key_2 on GitHub is also wrong... the fork is not owned by sailuh and kaiaulu, but rather the owner is ssunoo2 and the repo is kaiaulu. I am a bit worried the config file may not be making any sense to you at this point. Should we go over this briefly on call if it helps? |
Here is the updated config format for the issue_trackers:
issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh/
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      refresh_issues: ../../rawdata/kaiaulu/github/refresh_issues/sailuh_kaiaulu/
      pull_request: ../../kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
    # project_key_2:
    #   # Obtained from the project's GitHub URL
    #   owner: ssunoo2
    #   repo: kaiaulu
    #   # Download using `download_github_comments.Rmd`
    #   issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/ssunoo2_kaiaulu/
    #   issue: ../../rawdata/kaiaulu/github/issue/ssunoo2_kaiaulu/
    #   refresh_issues: ../../rawdata/kaiaulu/github/refresh_issues/ssunoo2_kaiaulu/
    #   pull_request: ../../kaiaulu/github/pull_request/ssunoo2_kaiaulu/
    #   commit: ../../rawdata/kaiaulu/github/commit/ssunoo2_kaiaulu/
Note that a new folder 'refresh_issues' is created as a result of #282.
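The rule discussed in this thread, that every GitHub endpoint path ends in a folder named after the owner_repo combination, is easy to get wrong when copying a block for a fork. The Python sketch below shows one way such a block could be checked. It assumes the block is already parsed into a dictionary whose only non-path fields are `owner` and `repo`. The function name is hypothetical and this is not an existing Kaiaulu check.

```python
def owner_repo_problems(block: dict) -> list:
    """List the endpoint paths whose last folder is not '<owner>_<repo>'."""
    expected = f"{block['owner']}_{block['repo']}"
    problems = []
    for field, path in block.items():
        if field in ("owner", "repo"):
            continue  # these fields are identifiers, not folder paths
        last_folder = path.rstrip("/").rsplit("/", 1)[-1]
        if last_folder != expected:
            problems.append(f"{field}: ends with '{last_folder}', expected '{expected}'")
    return problems


if __name__ == "__main__":
    # Hypothetical fork block where the owner was updated but the paths were not.
    fork_block = {
        "owner": "ssunoo2",
        "repo": "kaiaulu",
        "issue": "../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/",
        "commit": "../../rawdata/kaiaulu/github/commit/ssunoo2_kaiaulu/",
    }
    for problem in owner_repo_problems(fork_block):
        print(problem)
```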
For discussion regarding the specification of the configuration file, let's use this issue. For discussion of conf.R, let's use #230. It suffices, however, for your task specification to just live on #230.
The specification of the config file is indeed on this issue. It is extremely long at this point, but I'd like you both to skim through it and find the comments that say "why do we have to make the specification this way?" e.g. of relevant comments: see also "anomaly case 1" and "anomaly case 2" sections in: #286 (comment)
Mbox
We went over an example of why the mailing list specification has to convey multiple mail archives. @daomcgill I believe this is the information you need to know for your function input for mod_mbox and pipermail: Contrast how much more comprehensive and realistic, compared to what we spoke about today, this is versus the existing one: Lines 47 to 55 in c781106
Other Downloaders
The most current specification for the other downloaders, which are jira, bugzilla and github, is on this comment: #286 (comment)
With that being said, there is more than just downloaders with file paths being specified. For example, you also need to tell Kaiaulu where your .git files are. Your exercise is therefore as follows:
I suggest this workflow be pasted on your task specification on #230 as checkboxes, with the small difference that you should indicate who is working on what. Remember: The specification I need to see here is not just a combination of the two comments above, but one that includes all information available across all existing confs, with the parts the two comments above refer to being updated. Not every conf contains the full specification, so we need to derive the master specification after looking at all of them (this will in turn be future documentation on the project wiki too). I hope that by the time this milestone is over, you will have a better understanding of all types of data Kaiaulu can interface with, and a better appreciation of using these configuration files to document information about a project, so that all one needs to do to re-analyze a project is share a config file to understand assumptions and to re-run an analysis. One last time: For specification questions, follow up here; for questions on conf.R, do so on #230 : )
The issues #275 #282 #284 #285 are affected by this issue.
@Ssunoo2 @ian-lastname @anthonyjlau to centralize discussion, please use this issue to reach consensus on how you plan to make the storage organization, file name, etc. of your own refreshers + the JIRA refresher. Once we are clear on this here, you can move the final discussion to the first comment of your respective issues.