Improve Project Configuration Files Format #111

carlosparadis · 2021-07-17T02:22:52Z

Now that newer data sources and tools are available in Kaiaulu, the project configuration file sections are becoming more confusing. There are also few comments explaining which fields are required and which are optional, or which are part of Kaiaulu and which are not.

The overall format should be refactored to make it easier to understand each field's purpose, where to obtain its value, and what uses it.

Fundamentally, these files take inspiration from Codeface, but go a bit further. The goal is to "distill" the infrastructure information available on a project's website, and therefore the location of its data, into a concise, reusable file that can be manually checked and reused for other analyses. For instance, here is a list of questions that often waste manual time because they must be answered again for every analyzed project:

Version Control:
1. What version control system is used? Git, SVN, something else?
2. Where is the data for analysis X?
Mailing List:
1. What mailing list archives does the project use? (e.g. pipermail, mod_mbox, MARC, Google Groups, etc.)
  1. Do they contain the same data? (OpenSSL is a classic example where each archive contains a portion of the full mailing list!)
2. What mailing list should be used for analysis? Developer and User, or just developer? Any other mailing list?
3. What is the URL for the mailing lists we should use?
4. Where is the data for analysis X?
Issue Tracker:
1. What issue tracker type is used? (Bugzilla, JIRA, GitHub, Trac, etc.)
2. Where is the data for analysis X?
Files:
1. What language does the project use? (Needed as a parameter for static analysis tools, and to avoid including non-source-code files in your churn metrics!)
2. What naming convention do developers use for test files? (Also needed to avoid bloating churn and other content metrics.)
3. Do commit messages contain conventions for issues or software vulnerabilities? (Needed for bug count metric, or software vulnerability analysis).
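
To make the checklist above concrete, here is a minimal sketch of how the answers could be recorded and checked for gaps. The section names, field names, paths, and URL are all hypothetical and chosen only for illustration; they are not Kaiaulu's actual schema.

```python
# Hypothetical layout: section and field names are illustrative only,
# they do not reflect Kaiaulu's actual configuration schema.
REQUIRED_FIELDS = {
    "version_control": ["type", "log_path"],
    "mailing_list": ["archive_type", "lists", "url", "data_path"],
    "issue_tracker": ["type", "data_path"],
    "files": ["language", "test_file_pattern", "issue_id_regex"],
}


def missing_fields(config):
    """Return 'section.field' names that the config leaves unanswered."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        answers = config.get(section, {})
        for field in fields:
            # Treat absent, empty-string, and empty-list values as unanswered.
            if answers.get(field) in (None, "", []):
                missing.append(f"{section}.{field}")
    return missing


# Made-up project used only to exercise the check.
example_config = {
    "version_control": {"type": "git", "log_path": "rawdata/git_repo/example/.git"},
    "mailing_list": {
        "archive_type": "mod_mbox",
        "lists": ["dev"],
        "url": "https://lists.example.org/mod_mbox/example-dev",
        "data_path": "",
    },
    "issue_tracker": {"type": "jira"},
    "files": {
        "language": "java",
        "test_file_pattern": "Test",
        "issue_id_regex": r"[A-Z]+-\d+",
    },
}

print(missing_fields(example_config))
```

A check like this only reports which questions are still open; it says nothing about whether the recorded answers are correct.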

Answering these questions requires manual time: searching for the project website (which is not always easy to locate), manually inspecting commit messages to curate a regular expression, and so on. A project configuration file should make it easier for a person to quickly assess this information and, more importantly, to stop wasting time searching for it. This information is of course not immutable: an analysis of OpenSSL over its entire 20-year lifespan showed that it started with mailing lists, then used a bug tracker alongside the mailing lists, and eventually moved fully to GitHub. So some minimal validation that the links in the files are current, or sufficient for the analysis of interest, is still important.
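
As an illustration of the kind of regular expressions such a file would store, the sketch below extracts issue ids from commit messages and flags test files. Both patterns are assumptions (a generic JIRA-style id and a Java-like test layout), not conventions taken from any specific project.

```python
import re

# Assumed conventions for illustration only; a real project needs patterns
# curated from its own commit history and directory layout.
ISSUE_ID_REGEX = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")       # e.g. CAMEL-123
TEST_FILE_REGEX = re.compile(r"(^|/)tests?/|Test\.java$")    # e.g. src/test/...


def issue_ids(commit_message):
    """Return all issue ids mentioned in a commit message, in order."""
    return ISSUE_ID_REGEX.findall(commit_message)


def is_test_file(path):
    """True if the path looks like a test file under the assumed convention."""
    return TEST_FILE_REGEX.search(path) is not None


print(issue_ids("CAMEL-123: fix NPE (see also CAMEL-77)"))
print(is_test_file("src/test/java/FooTest.java"))
print(is_test_file("src/main/java/Foo.java"))
```

A generic pattern like this one also matches tokens such as `UTF-8`, which is one reason the expression has to be curated per project rather than guessed.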

Of course, Kaiaulu can also capitalize on the information in this file: the online source of a VCS, mailing list, or issue tracker can be used by a crawler to obtain the data, whereas the local path can be used to execute the pipeline in the R Notebooks, and eventually via a command-line interface.
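
The split between online source and local path could be sketched as follows. The `url` and `local_path` keys, the consumer names, and the example values are hypothetical; the point is only that each consumer reads a different field of the same section.

```python
# Hypothetical mapping from consumer to the config field it reads.
FIELD_FOR_CONSUMER = {"crawler": "url", "pipeline": "local_path"}


def source_for(section, consumer):
    """Return the location a given consumer should read from a config section."""
    if consumer not in FIELD_FOR_CONSUMER:
        raise ValueError(f"unknown consumer: {consumer}")
    field = FIELD_FOR_CONSUMER[consumer]
    value = section.get(field)
    if not value:
        raise ValueError(f"config section has no '{field}' for the {consumer}")
    return value


# Made-up mailing list section used only to exercise the function.
mailing_list = {
    "url": "https://lists.example.org/mod_mbox/example-dev",
    "local_path": "rawdata/mbox/example-dev.mbox",
}

print(source_for(mailing_list, "crawler"))
print(source_for(mailing_list, "pipeline"))
```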

Analysis

From here, the second concern is reproducibility in MSR: the details of an analysis are often buried in online folders or e-mails. For example, was only a portion of the project X mailing list used? What was the time window? What commit range was used in the git log? What types of file dependencies were used? These parameters are specific to each analysis, but missing one of them dooms any future attempt to effectively compare results. They nonetheless have to be decided before an analysis can occur, and they should be deferred to the user instead of hardcoded and buried inside functions, where they are unlikely to be found during normal usage of the tool. Further thought is still needed on whether to embed them in the project configuration file or directly in the R Notebook.
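
One way to picture per-analysis parameters is as an explicit, serializable record that can be stored next to the results and compared across runs. This is only a sketch: the class, its field names, and the example values are hypothetical, and it does not settle whether such a record belongs in the configuration file or in the notebook.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AnalysisParameters:
    """Hypothetical record of the choices made for one analysis run."""
    mailing_lists: tuple      # which lists were included
    window_start: str         # ISO date, start of the time window
    window_end: str           # ISO date, end of the time window
    commit_range: tuple       # (first, last) commit identifiers
    dependency_types: tuple   # kinds of file dependencies considered


def to_json(params):
    """Serialize the parameters so they can be archived with the results."""
    return json.dumps(asdict(params), sort_keys=True, indent=2)


def differing_parameters(a, b):
    """Return the names of the parameters on which two analyses disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


# Placeholder values, not taken from any real analysis.
baseline = AnalysisParameters(
    mailing_lists=("dev",),
    window_start="2019-01-01",
    window_end="2020-01-01",
    commit_range=("first_commit", "last_commit"),
    dependency_types=("import", "call"),
)
rerun = replace(baseline, window_end="2021-01-01")

print(to_json(baseline))
print(differing_parameters(baseline, rerun))
```

Comparing two such records makes it visible when results differ because the inputs differ, rather than because the analysis changed.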

Example format for new project configuration file. Signed-off-by: Carlos Paradis <[email protected]>

Project configuration files now properly account to the new downloaders added in previous commits. The overall format was also improved for readability and to be more self-contained. Signed-off-by: Carlos Paradis <[email protected]>

The parsing of the respective CLI interfaces now conforms to the new project configuration file format. Signed-off-by: Carlos Paradis <[email protected]>

carlosparadis added the type:documentation Improvements or additions to documentation label Jul 17, 2021

carlosparadis added this to the 0.0.0.9500 Standardized API data columns milestone Jul 17, 2021

carlosparadis self-assigned this Jul 17, 2021

carlosparadis pinned this issue Jul 17, 2021

carlosparadis added a commit that referenced this issue Jul 17, 2021

i #111 Add new config file format

03c95e8


carlosparadis closed this as completed Sep 27, 2021

carlosparadis added a commit that referenced this issue Oct 1, 2021

i #111 git and ml CLI now conforms to new conf

a86bd31

