In the following, we describe (i) how to use the dataset of real-world ROS-based robotic systems and (ii) the steps needed to replicate the whole study (i.e., rebuilding the dataset, rerunning the analysis, etc.).
At the core of our study lies a reusable dataset of 598 GitHub repositories containing ROS-based robotic systems. The whole dataset is available as a single CSV file called `repos_dataset_all.csv` and has the following fields:
- ID: the unique ID of the repository
- Source: the platform the repository comes from: `bitbucket`, `github`, or `gitlab`
- Default branch: the name of the default branch of the repository (e.g., `master`)
- XML launch files: the number of ROS launch files in XML
- Py launch files: the number of ROS launch files in Python (useful for ROS2 projects)
- Language: the programming language mainly used in the repository as provided by the hosting platform (e.g., GitHub)
- Issues (total): the total number of issues
- Open issues: the number of open issues
- Closed issues: the number of closed issues
- PR (total): the total number of pull requests
- Open PRs: the number of open pull requests
- Closed PRs: the number of closed pull requests
- Commits: the number of commits in the default branch
- Branches: the number of branches
- Releases: the number of releases
- Contributors: the number of contributors who made at least one commit in the repository
- Description: the description of the repository as provided by the hosting platform
- URL: the public URL of the repository
- Categorized by: the name of the researcher who first classified the repository (two other researchers collaboratively double-checked the initial classification)
- Batch: the batch in which the repository has been classified (repositories were classified in two batches)
- Included: `YES` if the repository is included in the final set of 335 real-world projects, `NO` otherwise
- Violated criterion: if not included, this field contains the first selection criterion violated by the repository (criteria)
- Scope: `FULL_SYSTEM` if the repository contains the implementation of a whole system, `SUBSYSTEM` otherwise
- System type 1: the type of robots supported by the software in the repository (see here)
- System type 2: as `System type 1`, in case the repository supports more than one system type
- System type 3: as `System type 1`, in case the repository supports more than one system type
- Capability 1: the robotic capabilities supported by the software in the repository (see here)
- Capability 2: as `Capability 1`, in case the repository supports more than one capability
- Capability 3: as `Capability 1`, in case the repository supports more than one capability
- SA documented: `YES` if the software architecture is fully documented (e.g., all nodes, topics, and their connections are explicit), `PARTIALLY` if the software architecture is partially documented (e.g., only the exposed topics are documented), `NO` otherwise (see here)
- SA documentation: the direct link to the documentation of the software architecture of the system (if `SA documented` is either `YES` or `PARTIALLY`)
- Notes: additional notes taken during the data extraction process
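
As an example, a few lines of Python are enough to load and query the dataset. The sketch below uses pandas, which is not a requirement of the replication package but only an assumption for illustration; the field names are those listed above.

```python
# Minimal sketch of querying the dataset; pandas is an assumption,
# any CSV reader works just as well.
import pandas as pd

repos = pd.read_csv("repos_dataset_all.csv")

# Distribution of the main programming languages across all mined repositories.
print(repos["Language"].value_counts())

# Repositories included in the final set of 335 projects, grouped by scope.
included = repos[repos["Included"] == "YES"]
print(included["Scope"].value_counts())
```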
The replication package also contains two other comma-separated files, both proper subsets of the previous one:
- `repos_dataset_selected.csv`: contains all 335 repositories passing the last filtering step, i.e., the manual filtering of irrelevant repositories (filtering step 10 in the paper)
- `repos_dataset_selected_sadoc.csv`: contains the 115 repositories with either a fully or partially documented software architecture (i.e., those having `YES` or `PARTIALLY` in the `SA documented` field)
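
As a sanity check, both subsets can be rederived from `repos_dataset_all.csv` by filtering on the `Included` and `SA documented` fields. The sketch below again assumes pandas and the field values described above.

```python
import pandas as pd

repos = pd.read_csv("repos_dataset_all.csv")

# The 335 repositories passing the manual filtering step (step 10).
selected = repos[repos["Included"] == "YES"]

# The 115 repositories with a fully or partially documented architecture.
sadoc = selected[selected["SA documented"].isin(["YES", "PARTIALLY"])]

assert len(selected) == 335
assert len(sadoc) == 115
```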
Moreover, additional CSV and PDF files related to the dataset and the extracted guidelines are available in the `dataset` and `data_analysis` folders. They are not meant for direct use by third-party researchers and are provided for transparency from a methodological perspective.
The steps for collecting the data on which the study is based are reported below.
The goal of the steps below is to build the dataset we provide in `repos_dataset_all.csv`. All the steps can be executed on any UNIX-based machine and have been tested on both macOS and Ubuntu. As a reference, in `dataset/repos_mining_data/Archive.zip` we provide a Zip archive containing all the intermediate artifacts generated along the steps below, so that the reader can double-check what to expect at each step.
- Install Python 3.7 (see here)
- [optional] Set up a Python virtual environment, so that all the required modules stay available without running into conflicts with other Python projects (see here)
- Install the following Python modules:
- git
- bs4
- ast
- urllib3
- certifi
- pickle
- Configure and run rosmap (instructions) and collect its results into the following files:
  - `dataset/repos_mining_data/intermediateResults/0_rosmap_github.json`
  - `dataset/repos_mining_data/intermediateResults/0_all_bitbucket.json`
  - `dataset/repos_mining_data/intermediateResults/0_all_gitlab.json`
- Configure GHTorrent (instructions) as a MySQL database instance, run all the queries in `ghtorrent_queries.sql`, and save the final result in `dataset/repos_mining_data/intermediateResults/2_ghtorrent_github.json`
- Run `merge_counter.py`
- Run `explorer.py`
- Run `cloner.py`
- Run `detector.py`
- Run `metrics_manager.py`
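
For convenience, the five scripts above can be chained in the order listed. The following is a hypothetical sketch, not part of the replication package; each script's actual location and arguments should be checked in the package itself.

```python
# Hedged sketch: run the mining scripts in sequence, stopping at the
# first failure. Script paths and arguments are assumptions.
import subprocess

PIPELINE = [
    "merge_counter.py",
    "explorer.py",
    "cloner.py",
    "detector.py",
    "metrics_manager.py",
]

for script in PIPELINE:
    print(f"--- running {script} ---")
    subprocess.run(["python3", script], check=True)  # check=True stops on failure
```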
The execution of the steps above corresponds to the first 9 steps reported in Figure 4 in the paper. Then, in order to obtain the final list of repositories (i.e., the one equivalent to our 335 repositories), the final manual filtering step (step 10 in Figure 4 in the paper) must be performed.
- Install Python 3.7 (see here)
- [optional] Set up a Python virtual environment (see here)
- Install the following Python modules:
- pdb
- sendgrid
- Move the ID column of the CSV file containing all the repositories to consider into `online_questionnaire/online_questionnaire_scripts/repos_to_clone.csv`. In other words, the `repos_to_clone.csv` file should contain only one column named `ID` and one row per GitHub repository to clone, containing just its ID (see the sketch after this list). It is important to note that at this phase the dataset contained only GitHub repositories, so this step assumes that the provided IDs refer to GitHub repositories only.
- Run `email_detector.py`; this will produce the list of all contributors to be targeted by the online questionnaire
- Move the contents of the produced `/online_questionnaire/online_questionnaire_scripts/people_12_months.csv` file into [./online_questionnaire/online_questionnaire_scripts/Mail Sender/emails.csv](https://github.com/S2-group/icse-seip-2020-replication-package/online_questionnaire/online_questionnaire_scripts/Mail Sender/emails.csv); this step is done to avoid accidentally sending thousands of emails to third-party developers
- Configure the `./online_questionnaire/online_questionnaire_scripts/mailSender.py` script according to its README (in the same folder)
- Prepare the questionnaire as a form in Google Drive and update the email template directly in the code of `./online_questionnaire/online_questionnaire_scripts/mailSender.py`
- Run `./online_questionnaire/online_questionnaire_scripts/mailSender.py`
- Wait for the first results of the questionnaire!
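
Regarding the `repos_to_clone.csv` file mentioned above, the following hypothetical sketch derives it from `repos_dataset_all.csv`; the `github` value of the `Source` field follows the platform values listed in the dataset description, and pandas is again only an assumption for illustration.

```python
# Hypothetical sketch: build repos_to_clone.csv (a single ID column,
# one row per GitHub repository) from the main dataset.
import pandas as pd

repos = pd.read_csv("repos_dataset_all.csv")
github_ids = repos.loc[repos["Source"] == "github", ["ID"]]
github_ids.to_csv(
    "online_questionnaire/online_questionnaire_scripts/repos_to_clone.csv",
    index=False,
)
```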
It is important to note that in this study the data extraction and analysis are predominantly manual, so we refer the reader to the Study design section of the paper for the methods we applied in those two phases.