Now that newer data sources and tools are available in Kaiaulu, the project configuration file sections are becoming more confusing. There is also little inline guidance on which fields are required and which are optional, or on which fields are consumed by Kaiaulu and which are not.
The overall format should be refactored to make it easier to understand each field's purpose, where to obtain its value, and what consumes it.
Fundamentally, these files take inspiration from Codeface, but go a bit further. The goal is to "distill" the infrastructure information available on a project's website, and therefore its data, into a concise, reusable file that can be manually checked and reused across analyses. For instance, here is a list of questions that otherwise cost manual effort for every analyzed project:
Version Control:
1. What version control system is used? Git, SVN, something else?
2. Where is the data for analysis X?

Mailing List:
1. What mailing list archives does the project use? (e.g. Pipermail, mod_mbox, MARC, Google Groups, etc.)
2. Do they contain the same data? (OpenSSL is a classic example where each archive contains only a portion of the full mailing list!)
3. Which mailing lists should be used for the analysis? Developer and user, or just developer? Any other mailing list?
4. What are the URLs of the mailing lists we should use?
5. Where is the data for analysis X?

Issue Tracker:
1. What issue tracker is used? (Bugzilla, JIRA, GitHub, Trac, etc.)
2. Where is the data for analysis X?

Files:
1. What languages does the project use? (Needed as a parameter for static analysis tools, and to avoid counting non-source-code files in churn metrics!)
2. What naming conventions do developers use for test files? (Also needed to avoid bloating churn and other content metrics.)
3. Do commit messages follow conventions for referencing issues or software vulnerabilities? (Needed for bug count metrics and software vulnerability analysis.)
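To make this concrete, below is a hypothetical YAML sketch of what such a project configuration file could distill. All field names, URLs, paths, and regular expressions are illustrative placeholders, not Kaiaulu's actual schema:

```yaml
# Hypothetical project configuration sketch (placeholders for illustration,
# not Kaiaulu's actual schema).
project:
  website: https://project.example.org

version_control:
  system: git                                   # git, svn, ...
  remote: https://example.org/project.git       # used by a crawler/downloader
  local_path: ../rawdata/git_repo/project       # used by the local pipeline

mailing_list:
  archive_type: mod_mbox                        # pipermail, mod_mbox, MARC, ...
  lists:
    - name: dev
      url: https://lists.example.org/dev/
    - name: users
      url: https://lists.example.org/users/
  local_path: ../rawdata/mbox/project

issue_tracker:
  type: jira                                    # bugzilla, jira, github, trac, ...
  url: https://issues.example.org/projects/PRJ
  local_path: ../rawdata/issue_tracker/project

files:
  languages: [c, cpp]                           # for static analysis and churn filters
  test_file_regex: "^test_"                     # project convention for test files
  commit_message:
    issue_id_regex: "PRJ-[0-9]+"                # issue references in commit messages
    cve_regex: "CVE-[0-9]{4}-[0-9]+"            # vulnerability references
```

Note how each data source pairs a remote URL with a local path: the former is what a downloader would need, the latter is what a locally executed pipeline would need, which is the distinction discussed below.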
Answering these questions requires manual time spent searching the project website (which is not always even easy to locate), manually inspecting commit messages to curate a regular expression, and so on. A project configuration file should make it easier for a person to quickly assess this information and, more importantly, stop wasting time searching for it. This information is of course not immutable: an analysis of OpenSSL over its entire 20-year lifespan showed it started with mailing lists, then overlapped a bug tracker with the mailing lists, and eventually moved fully to GitHub. So some minimal validation that the links in the file are current, or sufficient for the analysis of interest, is still important.
Of course, the information in this file can be leveraged by Kaiaulu to do something: knowing the online source of a VCS, mailing list, or issue tracker lets a crawler obtain the data, whereas the local path can be used to execute the pipeline in the R Notebooks, and eventually via a command-line interface.
Analysis
From here, the second concern is reproducibility of MSR analyses, whose parameters are often buried in online folders or e-mails. For example, was only a portion of the project X mailing list used? What was the time window? What commit range was used in the git log? What types of file dependencies were considered? These parameters are per analysis, but missing even one of them dooms any future attempt to compare results. They nonetheless have to be decided for an analysis to occur, and should be exposed to the user instead of hardcoded and buried inside functions, where they are unlikely to be found during normal use of the tool. Further thought is still needed on whether to embed them in the project configuration file or directly in the R Notebook.
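If these parameters were embedded in the project configuration file rather than the notebook, a hypothetical analysis section might look like the following sketch (again, purely illustrative field names and placeholder values, not a committed design):

```yaml
# Hypothetical per-analysis parameters (illustrative only).
analysis:
  mailing_list: dev                  # which archive(s) this analysis used
  start_date: 2015-01-01             # time window
  end_date: 2017-12-31
  git:
    start_commit: <first commit in range>   # commit range used in the git log
    end_commit: <last commit in range>
  dependency_types: [call, include]  # file dependency types considered
```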
Project configuration files now properly
account for the new downloaders added in
previous commits.
The overall format was also improved for
readability and to be more self-contained.
Signed-off-by: Carlos Paradis <[email protected]>
The parsing of the respective CLI interfaces now
conforms to the new project configuration file format.
Signed-off-by: Carlos Paradis <[email protected]>