Introduction

In the following, we describe (i) how to use the dataset containing real-world ROS-based robotic systems and (ii) the steps needed to replicate the whole study (i.e., rebuilding the dataset, rerunning the analysis, etc.).

Using the dataset

At the core of our study lies a reusable dataset including 598 GitHub repositories containing ROS-based robotic systems. The whole dataset is available as a single CSV file called repos_dataset_all.csv and it has the following fields:

  • ID: the unique ID of the repository
  • Source: the platform hosting the repository, one of: <bitbucket, github, gitlab>
  • Default branch: name of the default branch of the repository (e.g., master)
  • XML launch files: the number of ROS launch files written in XML
  • Py launch files: the number of ROS launch files written in Python (useful for ROS2 projects)
  • Language: the programming language mainly used in the repository as provided by the hosting platform (e.g., GitHub)
  • Issues (total): the total number of issues
  • Open issues: the number of open issues
  • Closed issues: the number of closed issues
  • PR (total): the total number of pull requests
  • Open PRs: the number of open pull requests
  • Closed PRs: the number of closed pull requests
  • Commits: the number of commits in the default branch
  • Branches: the number of branches
  • Releases: the number of releases
  • Contributors: the number of contributors who made at least one commit in the repository
  • Description: the description of the repository as provided by the hosting platform
  • URL: the public URL of the repository
  • Categorized by: the name of the researcher who first classified the repository (two other researchers collaboratively double-checked the initial classification)
  • Batch: the batch in which the repository was classified (repositories were classified in two batches)
  • Included: YES if the repository is included in the final set of 335 real-world projects, NO otherwise
  • Violated criterion: if not included, then this value contains the first selection criterion violated by the repository (criteria)
  • Scope: FULL_SYSTEM if the repository contains the implementation of a whole system, SUBSYSTEM otherwise
  • System type 1: the type of robots supported by the software in the repository (see here)
  • System type 2: as System type 1, in case the repository supports more than one system type
  • System type 3: as System type 1, in case the repository supports more than one system type
  • Capability 1: the robotic capabilities supported by the software in the repository (see here)
  • Capability 2: as Capability 1, in case the repository supports more than one capability
  • Capability 3: as Capability 1, in case the repository supports more than one capability
  • SA documented: YES if the software architecture is fully documented (e.g., all nodes, topics, and their connections are explicit), PARTIALLY if the software architecture is partially documented (e.g., only the exposed topics are documented), NO otherwise (see here)
  • SA documentation: the direct link to the documentation of the software architecture of the system (if SA documented is either YES or PARTIALLY)
  • Notes: additional notes taken during the data extraction process
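As an illustration, the launch-file counts can be read straight from the CSV with Python's standard csv module. The field names below follow the list above, but the rows are synthetic stand-ins for the real repos_dataset_all.csv, not actual data from the study.

```python
import csv
import io

# Synthetic rows standing in for dataset/repos_dataset_all.csv;
# only a few of the fields listed above are shown.
SAMPLE = io.StringIO(
    "ID,Language,XML launch files,Py launch files,Included\n"
    "1,C++,12,0,YES\n"
    "2,Python,3,5,YES\n"
    "3,C++,0,0,NO\n"
)

def launch_file_totals(csv_file):
    """Return {repo ID: total number of ROS launch files (XML + Python)}."""
    reader = csv.DictReader(csv_file)
    return {
        row["ID"]: int(row["XML launch files"]) + int(row["Py launch files"])
        for row in reader
    }

totals = launch_file_totals(SAMPLE)
print(totals)  # → {'1': 12, '2': 8, '3': 0}
```

To process the real dataset, replace the StringIO stand-in with `open("repos_dataset_all.csv")`.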

The replication package also contains two other comma-separated files, which are proper subsets of the previous one:

  • repos_dataset_selected.csv: contains the 335 repositories passing the last filtering step, i.e., the manual filtering of irrelevant repositories (filtering step 10 in the paper)
  • repos_dataset_selected_sadoc.csv: contains the 115 repositories with either a fully or partially documented software architecture (i.e., those having a YES or PARTIALLY in the SA documented field)
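The two subsets can be reproduced from the full dataset by filtering on the Included and SA documented fields. A minimal sketch, using synthetic dictionary rows in place of the real data:

```python
# Synthetic rows standing in for repos_dataset_all.csv; field names
# follow the list above, values are illustrative only.
ROWS = [
    {"ID": "1", "Included": "YES", "SA documented": "YES"},
    {"ID": "2", "Included": "YES", "SA documented": "NO"},
    {"ID": "3", "Included": "NO", "SA documented": "PARTIALLY"},
]

def selected(rows):
    """Rows kept in repos_dataset_selected.csv (Included == YES)."""
    return [r for r in rows if r["Included"] == "YES"]

def selected_sadoc(rows):
    """Rows kept in repos_dataset_selected_sadoc.csv
    (SA documented is YES or PARTIALLY)."""
    return [r for r in rows if r["SA documented"] in ("YES", "PARTIALLY")]
```

On the real data, `selected` would yield the 335 repositories and `selected_sadoc` the 115 with a (partially) documented architecture.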

Moreover, additional CSV and PDF files related to the dataset and the extracted guidelines are available in the dataset and data_analysis folders; they are not meant for direct use by third-party researchers and are provided for transparency from a methodological perspective.


Full replication of the study

The steps for collecting the data on which the study is based are reported below.

Rebuilding the dataset of real-world open-source ROS systems

The goal of the steps below is to build the dataset we provide in repos_dataset_all.csv. All the steps can be executed on any UNIX-based machine and have been tested on both macOS and Ubuntu. As a reference, in dataset/repos_mining_data/Archive.zip we provide a ZIP archive containing all the intermediate artifacts generated along the steps below, so that the reader can double-check what to expect at each step.

  • Install Python 3.7 (see here)
  • [optional] Set up a Python virtual environment to keep all the modules available without running into conflicts with other Python projects (see here)
  • Install the following Python modules (note that ast and pickle ship with the Python standard library):
    • git
    • bs4
    • ast
    • urllib3
    • certifi
    • pickle
  • Configure and run rosmap (instructions) and collect its results into the following files:
    • dataset/repos_mining_data/intermediateResults0_rosmap_github.json
    • dataset/repos_mining_data/intermediateResults0_all_bitbucket.json
    • dataset/repos_mining_data/intermediateResults0_all_gitlab.json
  • Configure GHTorrent (instructions) as a MySQL database instance, run all the queries in ghtorrent_queries.sql, and save the final result in dataset/repos_mining_data/intermediateResults/2_ghtorrent_github.json
  • Run merge_counter.py
  • Run explorer.py
  • Run cloner.py
  • Run detector.py
  • Run metrics_manager.py
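The five mining scripts above could be chained in a small driver; the sketch below is illustrative and not part of the replication package. The `runner` parameter is a hypothetical injection point so the sequence can be dry-run without the actual scripts present.

```python
import subprocess

# Mining scripts from the steps above, in execution order.
PIPELINE = [
    "merge_counter.py",
    "explorer.py",
    "cloner.py",
    "detector.py",
    "metrics_manager.py",
]

def run_pipeline(scripts=PIPELINE, runner=None):
    """Run each script in order, stopping at the first failure.

    By default each script is executed with the local Python
    interpreter; `runner` can be swapped out for a dry run.
    """
    if runner is None:
        runner = lambda s: subprocess.run(["python3", s], check=True)
    completed = []
    for script in scripts:
        runner(script)
        completed.append(script)
    return completed
```

Because `subprocess.run(..., check=True)` raises on a non-zero exit code, a failing script stops the pipeline instead of silently corrupting the intermediate results.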

The execution of the steps above corresponds to the first 9 steps reported in Figure 4 in the paper. Then, to obtain the final list of repositories (i.e., the one equivalent to our 335 repositories), the final manual filtering step (step 10 in Figure 4 in the paper) must be performed.

Online questionnaire administration

It is important to note that in this study the data extraction and analysis are predominantly manual; we therefore refer the reader to the Study design section of the paper for the methods we applied in those two phases.