Replication package for the MSR'21 paper "Escaping the Time Pit: Pitfalls and Guidelines for Using Time-Based Git Data"

unl-pal/msr21-timestudy

Usage

Tested on:

  • OSX Catalina 10.15.7 with zsh 5.7.1 (x86_64-apple-darwin19.0)
  • NixOS 21.05pre27280.f6b5bfdb470 with bash 4.4.23(1)-release

Required software

  • make
  • python3
  • perl5
  • curl
  • jq (JSON query tool)

Ubuntu

apt-get install curl jq

OSX

brew install curl
brew install jq

Config

  1. Create a GitHub API token: https://docs.github.com/en/free-pro-team@latest/github/authenticating-to-github/creating-a-personal-access-token
  2. Make sure the token has public_repo access.
  3. Create the file github.token and place the GitHub API token in it.
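
Once github.token exists, a script can read it and attach it to API requests. The following is a minimal Python sketch, not the package's own code: the helper names `load_token` and `build_commit_request` are hypothetical, and it assumes the token is sent in an `Authorization` header to GitHub's REST endpoint for a single commit.

```python
import urllib.request


def load_token(path="github.token"):
    """Read the API token, dropping the trailing newline most editors add."""
    with open(path, encoding="utf-8") as fh:
        return fh.read().strip()


def build_commit_request(owner, repo, sha, token):
    """Build (but do not send) an authenticated request for one commit's JSON."""
    url = f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}"
    return urllib.request.Request(
        url,
        headers={
            # Assumption: classic personal access token sent as "token <value>".
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. Building the request separately keeps the token handling easy to inspect without touching the network.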

Boa Queries

All Boa queries were run on the 2019 October/GitHub dataset.

Suspiciously 'old' commits
Source: old-source.boa
Output: old-output.txt

Suspiciously 'future' commits
Source: future-source.boa
Output: future-output.txt

Out-of-order commits
Source: order-source.boa
Output: order-output.txt

Count revisions of all projects
Source: all-projects-revisions.boa
Output: all-projects-revisions-output.txt
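
The three timestamp checks named above (old, future, out-of-order) can be illustrated outside Boa. This Python sketch is only an approximation of the idea. Both cut-off dates are assumptions chosen for illustration, the function name `classify` is hypothetical, and "out-of-order" is assumed to mean a commit dated earlier than one of its parents. The .boa source files hold the actual definitions.

```python
from datetime import datetime, timezone

# Assumed cut-offs for illustration only; the real queries may use other values.
OLDEST_PLAUSIBLE = datetime(1990, 1, 1, tzinfo=timezone.utc)
DATASET_CUTOFF = datetime(2019, 11, 1, tzinfo=timezone.utc)


def classify(commit_time, parent_times=()):
    """Return the set of suspicious labels that apply to one commit."""
    labels = set()
    if commit_time < OLDEST_PLAUSIBLE:
        labels.add("old")
    if commit_time > DATASET_CUTOFF:
        labels.add("future")
    # Out of order: the commit claims to be earlier than one of its parents.
    if any(commit_time < parent for parent in parent_times):
        labels.add("out-of-order")
    return labels
```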

Scripts

Note that all data generated by these scripts is already included, so you should not need to re-run any scripts. But they are here if you wish to inspect, modify, and/or re-run them yourself.

If you want to just re-run everything, you can run:

make all

Note that this could take a substantial amount of time!

To generate the list of all GitHub projects found:

Used in section IV onward.

make project-lists

Cache JSON files

Used in section IV onward.

Most of the remaining scripts rely on having JSON metadata for commits. These JSON files are already cached in the dataset, but this is how you download them:

make cache-json
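
The caching idea (download once, then reuse the file on disk) can be sketched in Python as follows. This is an illustration under assumptions, not the package's implementation: `cached_json` is a hypothetical helper, and the download step is passed in as a callable so the cache logic stays independent of how the JSON is fetched.

```python
import json
import os


def cached_json(path, fetch):
    """Return parsed JSON from `path`, calling `fetch()` only on a cache miss."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    data = fetch()
    # Create the cache directory on first use, then store the result.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    return data
```

A second call with the same path reads the file and never invokes `fetch`, which is what keeps repeated runs from hitting the API again.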

To generate commit.dates:

Used in section IV onward.

make gen-dates

This file is used to inspect the actual date of each commit, as reported by GitHub/SHA.

This also generates the file bad-commits-by-year.txt. This file is used to generate the filtering by year table for the bad commits.
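
As an illustration of how per-year counts can be derived from cached commit JSON, here is a small Python sketch. It assumes each document follows the GitHub REST API commit layout, where the date sits at `commit.committer.date` in ISO 8601 form. Whether the package uses the committer or the author date is not stated here, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def commit_year(commit_json):
    """Year of the committer date in a GitHub commit JSON document."""
    stamp = commit_json["commit"]["committer"]["date"]  # e.g. "2011-04-14T16:00:49Z"
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").year


def commits_by_year(commit_jsons):
    """Count commits per year."""
    return Counter(commit_year(c) for c in commit_jsons)
```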

To calculate the number of commits found:

Used in section IV onward.

make gen-commit-loc

To generate git-svn.ids:

Used in section IV.A.

make gen-gitsvn
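
For context, git-svn marks each imported commit by appending a `git-svn-id:` line to the commit message. A hedged Python sketch of detecting that marker follows. The function name `svn_revision` is hypothetical, and the package's own script may extract the IDs differently.

```python
import re

# git-svn appends a line of the form
#   git-svn-id: <repository URL>@<revision> <repository UUID>
GIT_SVN_ID = re.compile(r"^git-svn-id:\s+(\S+)@(\d+)\s+(\S+)\s*$", re.MULTILINE)


def svn_revision(message):
    """Return the SVN revision recorded in a commit message, or None."""
    match = GIT_SVN_ID.search(message)
    return int(match.group(2)) if match else None
```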

To generate logs-*.txt:

Used in sections IV.A and IV.B and in figures 2 and 3.

make gen-logs

This grabs the other commit logs that are not git-svn related and puts them into logs-old.txt.

This also generates the commit logs for 'bad' (out of order) commits in logs-order.txt.

To generate order-verified.txt:

Used in section IV.B.

make gen-verified

To calculate the number of commits by each user:

Used in section IV.B for item 2, Common Users.

make gen-commit-users

To calculate the number of commits per project:

Used in section IV.B for item 3, Common Projects, and Table V.

make gen-commit-proj
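
Both the per-user and per-project tallies come down to counting records by one field. A minimal Python sketch, assuming each commit is available as a dict with hypothetical `user` and `project` keys (the package's real input format is not shown here):

```python
from collections import Counter


def top_counts(records, key, n=10):
    """Return the n most common values of `key` with their commit counts."""
    return Counter(record[key] for record in records).most_common(n)
```

For example, `top_counts(records, "user")` ranks users and `top_counts(records, "project")` ranks projects over the same records.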

To generate list of good and bad (404) projects:

Used in section V.

make gen-goodbad

This generates good-[sha-]projects.txt and bad-[sha-]projects.txt.

The 'bad' list is used to avoid asking the GitHub API for info on deleted projects. The '-sha-' versions look for the missing GitHub projects in the Software Heritage archive.
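
The good/bad split can be pictured as a partition on HTTP status. The Python sketch below is an assumption-laden illustration rather than the package's script: `github_status` and `split_good_bad` are hypothetical names, projects are assumed to be `owner/repo` strings, and only a 404 response counts as 'bad'.

```python
import urllib.error
import urllib.request


def github_status(project, token=None):
    """HTTP status of the GitHub API repository endpoint for 'owner/repo'."""
    req = urllib.request.Request(f"https://api.github.com/repos/{project}")
    if token:
        req.add_header("Authorization", f"token {token}")
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def split_good_bad(projects, status_of=github_status):
    """Partition projects into (good, bad); 'bad' means the lookup returned 404."""
    good, bad = [], []
    for project in projects:
        # Other error codes (such as rate limiting) are not treated as deleted.
        (bad if status_of(project) == 404 else good).append(project)
    return good, bad
```

Because the status lookup is a parameter, the same partition could be reused with a different lookup, for instance one that queries an archive instead of GitHub.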

To calculate the number of duplicated commits:

Used in section V.

make calc-dupes
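
One plausible reading of "duplicated commits" is the same commit SHA appearing in more than one project, for example through forks. The Python sketch below follows that assumption; the helper name and the `(project, sha)` input shape are hypothetical.

```python
from collections import defaultdict


def duplicated_shas(pairs):
    """Map each SHA seen in more than one project to the projects containing it."""
    seen = defaultdict(set)
    for project, sha in pairs:
        seen[sha].add(project)
    return {sha: projects for sha, projects in seen.items() if len(projects) > 1}
```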

To calculate the number of bad commits as percentage:

Used in section V.

make calc-percent-bad
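
The percentage itself is a simple ratio. A tiny Python sketch, assuming the bad and total commit counts have already been computed (the function name is hypothetical):

```python
def percent_bad(bad, total):
    """Share of bad commits as a percentage; 0.0 when there are no commits."""
    return 100.0 * bad / total if total else 0.0
```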

To generate the filter by year table:

Used in section VI, table VI.

make gen-filteryear-table
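
A filter-by-year table can be thought of as asking, for each cut-off year, how many bad commits would remain if everything dated before that year were dropped. The Python sketch below shows that calculation under assumptions: the input is a hypothetical mapping from year to bad-commit count, and the row layout is illustrative rather than the layout of table VI.

```python
def filter_by_year(bad_by_year, cutoffs):
    """For each cut-off year, return (cutoff, bad commits kept, percent kept)."""
    total = sum(bad_by_year.values())
    rows = []
    for cutoff in cutoffs:
        # Keep only bad commits dated in the cut-off year or later.
        kept = sum(n for year, n in bad_by_year.items() if year >= cutoff)
        rows.append((cutoff, kept, 100.0 * kept / total if total else 0.0))
    return rows
```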
