Improve Project Configuration Files Format #111

Closed
carlosparadis opened this issue Jul 17, 2021 · 0 comments

Now that newer data sources and tools are available in Kaiaulu, the sections of the project configuration files are becoming more confusing. There are also few comments explaining which fields are required and which are optional, or which are part of Kaiaulu and which are not.

The overall format should be refactored to make it easier to understand each field's purpose, where to obtain its value, and what uses it.

Fundamentally, these files take inspiration from Codeface, but go a bit further. The goal is to "distill" the infrastructure information available on a project's website, and therefore its data, into a concise, reusable file that can be manually checked and reused for other analyses. For instance, here is a list of questions that often waste manual time because they must be answered for every analyzed project:

  1. Version Control:
    1. What version control system is used? Git, SVN, something else?
    2. Where is the data for analysis X?
  2. Mailing List:
    1. Which mailing list archives does the project use? (e.g. pipermail, mod_mbox, MARC, Google Groups, etc.)
      1. Do they contain the same data? (OpenSSL is a classic example where each archive contains only a portion of the full mailing list!)
    2. Which mailing lists should be used for analysis? Developer and user, or just developer? Any other mailing list?
    3. What is the URL of each mailing list we should use?
    4. Where is the data for analysis X?
  3. Issue Tracker:
    1. What type of issue tracker is used? (Bugzilla, JIRA, GitHub, Trac, etc.)
    2. Where is the data for analysis X?
  4. Files:
    1. What language does the project use? (Needed as a parameter for static analysis tools, and to avoid including non-source files in your churn metrics!)
    2. What naming convention do developers use for test files? (Also needed to avoid bloating churn and other content metrics.)
    3. Do commit messages follow conventions for referencing issues or software vulnerabilities? (Needed for the bug count metric, or for software vulnerability analysis.)

Answering these questions requires manual time: googling the project website (which is not always easy to locate), manually inspecting commit messages to curate a regular expression, and so on. A project configuration file should make it easier for a person to quickly assess this information and, more importantly, to stop wasting time searching for it. This information is of course not immutable: an analysis of OpenSSL over its entire 20-year lifespan showed that it started with mailing lists, then used a bug tracker alongside the mailing lists, and eventually moved fully to GitHub. So some minimal validation that the links in the files are current, or sufficient for the analysis of interest, is still important.
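To make the "Files" questions concrete, here is a minimal Python sketch of the kind of curated conventions such a file could record. The regular expressions, the extension list, and the function names are all hypothetical and chosen for an imaginary project; they are not Kaiaulu's actual fields, and real values would have to be curated from the project's own commit log and file tree.

```python
from __future__ import annotations

import re

# Hypothetical conventions for an imaginary project (assumptions, not Kaiaulu fields).
ISSUE_ID_REGEX = re.compile(r"[A-Z][A-Z0-9]+-\d+")            # JIRA-style key, e.g. "PROJ-123"
TEST_FILE_REGEX = re.compile(r"(^|/)tests?/|_test\.[a-z]+$")  # test folders or *_test.* files
SOURCE_EXTENSIONS = (".c", ".h")                              # languages counted as source code


def issue_ids(commit_message: str) -> list[str]:
    """Return every issue key mentioned in a commit message."""
    return ISSUE_ID_REGEX.findall(commit_message)


def counts_toward_churn(path: str) -> bool:
    """Keep only source files that are not test files."""
    is_source = path.endswith(SOURCE_EXTENSIONS)
    is_test = TEST_FILE_REGEX.search(path) is not None
    return is_source and not is_test
```

With these assumed conventions, `issue_ids("PROJ-123: fix overflow")` picks up the issue key, and `counts_toward_churn` drops both documentation and test files from churn metrics.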

Of course, Kaiaulu can also put the information in this file to use: the online source of a VCS, mailing list, or issue tracker can be used by a crawler to obtain the data, whereas the local path can be used to execute the pipeline in the R Notebooks, and eventually via a command-line interface.
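As a rough illustration of that split between online sources and local paths, the Python sketch below models each data source with both fields. The `DataSource` class, the section names, and the example URLs and paths are assumptions made up for this sketch; they do not describe the actual configuration format.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSource:
    kind: str                   # e.g. "git", "pipermail", "jira"
    remote_url: Optional[str]   # where a crawler would download the data from
    local_path: Optional[str]   # where the pipeline would read the data from


def crawl_targets(config: dict[str, DataSource]) -> dict[str, str]:
    """Sources a crawler could download, keyed by section name."""
    return {name: s.remote_url for name, s in config.items() if s.remote_url}


def pipeline_inputs(config: dict[str, DataSource]) -> dict[str, str]:
    """Sources already on disk that a notebook or CLI could analyze."""
    return {name: s.local_path for name, s in config.items() if s.local_path}


# Hypothetical project; every URL and path here is a placeholder.
project = {
    "version_control": DataSource("git", "https://example.org/project.git", "rawdata/git/project/.git"),
    "mailing_list": DataSource("pipermail", "https://example.org/pipermail/dev/", None),
    "issue_tracker": DataSource("jira", None, "rawdata/issue_tracker/project.json"),
}
```

In this sketch the mailing list still needs to be downloaded before it can be analyzed, while the issue tracker data is on disk but has no recorded online source.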

Analysis

The second concern is the reproducibility of MSR analyses, whose parameters are often buried in online folders or e-mails. For example, was only a portion of project X's mailing list used? What was the time window? What commit range was used in the git log? What types of file dependencies were used? These parameters are specific to each analysis, but missing any one of them dooms future attempts to compare results effectively. They must nonetheless be decided before an analysis can run, and the decision should be left to the user rather than hardcoded and buried inside functions, where it is unlikely to be found during normal use of the tool. Further thought is still needed on whether to embed these parameters in the project configuration file or directly in the R Notebook.
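One way to picture the per-analysis parameters is as an explicit record that is checked before anything runs. The Python sketch below assumes a hypothetical set of parameter names based on the questions above; where such a record should live (configuration file or notebook) remains the open question.

```python
from __future__ import annotations

# Hypothetical parameter names, derived from the questions above (assumptions).
REQUIRED_PARAMETERS = (
    "mailing_list",       # which list, or which portion of it, was used
    "window_start",       # start of the time window
    "window_end",         # end of the time window
    "commit_range",       # commits passed to git log
    "dependency_types",   # types of file dependencies considered
)


def missing_parameters(analysis: dict) -> list[str]:
    """List required parameters that are absent or left empty."""
    return [key for key in REQUIRED_PARAMETERS if not analysis.get(key)]


# Placeholder values for an imaginary analysis.
analysis = {
    "mailing_list": "dev",
    "window_start": "2015-01-01",
    "window_end": "2015-12-31",
    "commit_range": "",
    "dependency_types": ["file"],
}
```

Here `missing_parameters(analysis)` would flag the empty commit range, so the gap is visible to the user instead of being silently filled by a hardcoded default.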

carlosparadis added the type:documentation Improvements or additions to documentation label Jul 17, 2021
carlosparadis self-assigned this Jul 17, 2021
carlosparadis pinned this issue Jul 17, 2021
carlosparadis added a commit that referenced this issue Jul 17, 2021
Example format for new project configuration file.

Signed-off-by: Carlos Paradis <[email protected]>
carlosparadis added a commit that referenced this issue Sep 27, 2021
Project configuration files now properly
account to the new downloaders added in
previous commits.

The overall format was also improved for
readability and to be more self-contained.

Signed-off-by: Carlos Paradis <[email protected]>
carlosparadis added a commit that referenced this issue Oct 1, 2021
The parsing of the respective CLI interfaces now
conforms to the new project configuration file format.

Signed-off-by: Carlos Paradis <[email protected]>