
Git Theory

Dan Ibanez edited this page Aug 23, 2016 · 4 revisions

There is a lot of good documentation about Git, including its official documentation and Atlassian's documentation. This page attempts to describe the theoretical foundations of Git for the benefit of PUMI developers.

After reviewing the theory, please browse our list of useful Git Commands.

The Two Main Aspects

When compared with a system like Subversion, there are two main design decisions of Git that define everything else about it:

  1. Git uses communication between repositories instead of one central repository
  2. Git uses a graph history instead of a linear history

Historical reasons

The reason for these aspects is that Git was originally developed to accept changes from thousands of developers to the Linux project. The central maintainer of the Linux project needed to delegate the majority of acceptance work to his "lieutenants", so that he could just accept large batches of changes from his lieutenants instead of coordinating with thousands of developers at once.

This means the lieutenants must have their own repositories where they can locally merge changes from developers, and those repositories must be able to communicate with the central maintainer's repository. The hierarchy also means that history would look like a tree, with thousands of developers at the branches, combining branches into thicker ones at the lieutenants, and finally combining lieutenant branches into the central maintainer's "trunk". However, each developer branch itself started from an earlier point in the trunk, so history both splits and rejoins: it is a Directed Acyclic Graph rather than a tree.

A Repository

A Git repository is really just the contents of the .git directory. It tracks the content of another directory (the "source directory"), usually the directory above it. A repository contains the full history of the source, because there is not necessarily a separate server to store history. It also contains configuration settings, branches, remotes, etc.

There are two main kinds of repository:

  1. A "normal" repository which is a .git subdirectory of the source directory. Users run Git commands inside the source directory that have effects on the repository.
  2. A "bare" repository, which is just a .git directory with no source directory attached. Bare repositories are what exist on servers like GitHub. They accept pushed changes from normal repositories, and their contents can be fetched by normal repositories.

Repository History

The repository history is a collection of commits that forms a graph, with commits being graph nodes. Each commit represents the contents of the source at some point in time, along with a descriptive message and pointers to parent commits. A commit can have different numbers of parents:

  1. Zero parents: this is the "root commit", usually just the first commit ever made in the repository, describing the first version of the source.
  2. One parent: most commits are just changes from an old version (the parent commit) to a new version (this commit).
  3. Two parents: this is a merge commit, indicating that two versions of the source were combined together. Order is important: the first parent is the "trunk" commit, and the second parent is the "incoming" commit being merged into the trunk commit.
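
The three cases above can be sketched with a toy data structure. This is only an illustration of the graph shape, not how Git stores commits internally; the Commit class, its kind method, and the commit messages are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy commit: a message plus pointers to parent commits."""
    message: str
    parents: tuple = ()  # zero, one, or two parents; first parent is the trunk side

    def kind(self) -> str:
        if not self.parents:
            return "root"
        return "merge" if len(self.parents) > 1 else "ordinary"


def first_parent_history(commit):
    """Walk back along first parents only, i.e. follow the trunk."""
    messages = []
    while True:
        messages.append(commit.message)
        if not commit.parents:
            return messages
        commit = commit.parents[0]


root = Commit("first version")
fix = Commit("fix density", (root,))
feature = Commit("add feature", (root,))          # branches off the same root
merged = Commit("merge feature", (fix, feature))  # trunk first, incoming second
```

Because the order of parents is preserved, walking first parents from merged visits "fix density" and skips "add feature", which is what makes the trunk recoverable from a merge commit.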

Commits

In Subversion, all commits are consecutively numbered, and identified by this number. In Git, there is no obvious way to assign such numbers to commits, because two repositories may combine their commits, and at that point the commits would need to be renumbered and would lose their identity. This is one reason why the identity of a commit is not a nice integer but rather a big random-looking string like 3fe9028a74b9ec12e3e8a78af2417d25887be6a8. Actually, this string is not random: it is the result of running the SHA-1 hash algorithm with the following inputs:

  1. The current contents of the source
  2. The commit message, author name, and date of the change
  3. The SHA-1 of parent commits

This is why the identifier is simply called "the SHA-1" or "the hash" of the commit. SHA-1 produces a 160-bit value, so the probability that two different inputs accidentally produce the same hash is negligible (about 2^-160 for any given pair). This means commits from two repositories can be combined with negligible risk of two different commits having the same hash.
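
As a rough sketch of where the hash comes from: Git hashes each stored object as a short header (object type and body length) followed by the body, and a commit's body is text that names a tree (the source contents), the parents, the author, the date, and the message. The author, timestamp, and message below are made up for illustration, and git_object_id is a hypothetical helper name.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <length>' + NUL byte + body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries stands in for "the current contents of the source".
tree_id = git_object_id("tree", b"")

# Hypothetical root commit: no "parent" lines, made-up author and date.
commit_body = (
    f"tree {tree_id}\n"
    "author A Developer <dev@example.com> 1471910400 -0400\n"
    "committer A Developer <dev@example.com> 1471910400 -0400\n"
    "\n"
    "Initial commit\n"
).encode()

commit_id = git_object_id("commit", commit_body)
print(commit_id)  # 40 hexadecimal characters
```

Changing any input, even one character of the message, gives a different identifier, and a child commit's identifier depends on its parent's because the parent hash appears in the child's body.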

Because it is unnatural for humans to remember such long pseudo-random strings, Git offers many other ways to identify commits. There is the short SHA-1, which is just the first few characters of the full SHA-1 (typically 7, as long as that prefix is unique within the repository): 3fe9028. There are also graph traversal symbols, for example 3fe9028~ is the first parent of 3fe9028.
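
To illustrate how a short SHA-1 and the ~ symbol might be resolved, here is a minimal sketch over a made-up history. Only the first hash comes from the text above; the other two hashes, the HISTORY table, and the resolve function are hypothetical, and only the forms "prefix", "prefix~", and "prefix~N" are handled.

```python
# Hypothetical history: commit hash -> parent hashes, first parent first.
HISTORY = {
    "3fe9028a74b9ec12e3e8a78af2417d25887be6a8": ["b7d4c1e09a3f5d2c8e6b1a4f7c0d9e2b5a8f3c61"],
    "b7d4c1e09a3f5d2c8e6b1a4f7c0d9e2b5a8f3c61": ["0a1b2c3d4e5f60718293a4b5c6d7e8f90a1b2c3d"],
    "0a1b2c3d4e5f60718293a4b5c6d7e8f90a1b2c3d": [],
}


def resolve(rev, history=HISTORY):
    """Turn 'prefix', 'prefix~' or 'prefix~N' into a full hash."""
    prefix, tilde, count = rev.partition("~")
    matches = [h for h in history if h.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"{prefix!r} is unknown or ambiguous")
    commit = matches[0]
    steps = int(count or 1) if tilde else 0  # a bare '~' means one step
    for _ in range(steps):
        if not history[commit]:
            raise ValueError(f"{commit[:7]} has no parent")
        commit = history[commit][0]  # always follow the first parent
    return commit


print(resolve("3fe9028~"))
```

A prefix works only while exactly one commit starts with it, which is why longer prefixes become necessary as a repository grows.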

The Staging Area

Developers often make lots of changes in a hurry, so it can be nice to have a way to commit only part of the changes you have made. Conversely, you may be making extensive changes to many files and want some way to save your progress without making partial commits. Git has a staging area called the "index". This is temporary storage for the changes that will go into the next commit. The git add command "accepts" changes from the working directory into the index, which can then be committed as a whole when ready.
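
The flow from working directory to index to commit can be modeled in a few lines. This is a deliberately simplified sketch: the file names and contents are invented, the index starts empty because there is no earlier commit, and add and commit here are toy functions, not Git's implementation.

```python
# Two edited files in the working directory; only one is ready to commit.
working = {
    "density.cc": "return 6;",
    "mesh.cc": "// half-finished refactor",
}
index = {}    # the staging area
commits = []  # each commit records a snapshot of the index


def add(path):
    """Accept the current working-directory version of a file into the index."""
    index[path] = working[path]


def commit(message):
    """Record whatever is staged; unstaged edits are left out."""
    commits.append((message, dict(index)))


add("density.cc")
commit("Fix density")
```

After these two calls the commit contains density.cc only, while the unfinished mesh.cc edit stays in the working directory until it is added.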

Branches

Branches are just pointers to commits. Branches have nice human-readable names, for example the default branch is called master. There is also another pointer, called HEAD, which usually points to the "current branch" the user is working on. When a user runs git commit, a new commit is created, and the current branch changes its pointer to point to the new commit.

HEAD really indicates what point in history the source directory currently reflects. For example, one can run git checkout 3fe9028, which will change the source directory contents to the state they were in at commit 3fe9028 and set HEAD to point directly to 3fe9028.

When indicating commits to Git commands, the name of a branch can be used instead of the hash of the commit it points to.
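
The pointer behavior described in this section can be sketched as follows. Repo is a hypothetical toy class, the commit identifiers "c1" and "c2" stand in for real hashes, and HEAD is modeled only in its usual role of naming the current branch.

```python
class Repo:
    """Toy repository: branches are names pointing at commit ids."""

    def __init__(self):
        self.parent_of = {}               # commit id -> parent commit id
        self.branches = {"master": None}  # branch name -> commit id
        self.head = "master"              # HEAD names the current branch

    def commit(self, commit_id):
        # The new commit's parent is wherever the current branch points,
        # then the current branch moves to the new commit.
        self.parent_of[commit_id] = self.branches[self.head]
        self.branches[self.head] = commit_id

    def branch(self, name):
        # A new branch is just another pointer to the current commit.
        self.branches[name] = self.branches[self.head]

    def checkout(self, name):
        self.head = name


repo = Repo()
repo.commit("c1")
repo.branch("feature")
repo.checkout("feature")
repo.commit("c2")  # only "feature" moves; "master" stays at c1
```

Creating a branch copies no history at all, which is why branches are cheap: only the pointer table changes.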

Merging

Git has two important modes of merging, which are invoked by the git merge command. The inputs are the current branch, and an incoming commit to merge into the current branch.

  1. Normal merges with a commit, i.e. a new commit is created with the current branch commit and incoming commit as parents. The current branch then points to the newly created merge commit.
  2. If there are no changes in the current branch that are not known in the incoming commit, then the current branch simply switches to point to the incoming commit. This is called a "fast-forward" merge, and represents just "catching up" to the state of a branch that is ahead.
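
The choice between the two modes comes down to an ancestry test: if the current commit is already an ancestor of the incoming commit, the current branch has nothing the incoming side lacks. A minimal sketch, assuming history is given as a table of parent lists; the single-letter commit names, the ancestors and merge functions, and the merge id format are hypothetical, and conflict handling is left out.

```python
def ancestors(commit, parents):
    """Every commit reachable from 'commit', including itself."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def merge(current, incoming, parents):
    """Return the commit the current branch should point to afterwards."""
    if current in ancestors(incoming, parents):
        return incoming  # fast-forward: just catch up
    if incoming in ancestors(current, parents):
        return current   # incoming is already part of our history
    merge_id = f"merge({current},{incoming})"
    parents[merge_id] = [current, incoming]  # trunk first, incoming second
    return merge_id


# A is the root; B and C were both made on top of A.
parents = {"A": [], "B": ["A"], "C": ["A"]}
fast_forward = merge("A", "B", parents)
true_merge = merge("B", "C", parents)
```

Here merging B into a branch at A simply moves the branch to B, while merging C into a branch at B creates a new commit with both as parents.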

Creating a merge commit is not always automatic. There can be merge conflicts, usually when two people change the same part of the same file and then try to merge their changes. There are many cases in which a developer must resolve conflicts by hand. Git leaves annotations in your source files that can be resolved with a text editor, like this:

double get_density() {
<<<<<<< HEAD
  return 6;
=======
  return 0;
>>>>>>> pauls-branch
}

Remotes

All the above holds true for a single repository: commits are stored in the local .git repository, completely independently of any server like GitHub. Communication between repositories is done using remotes, which are just pointers from one repository to another.

There are three protocols most commonly used to communicate with a remote repository:

  1. SSH protocol: use the same syntax as for an ssh command, which is [email protected]:path/repo.git. This protocol can use your SSH key or your password for authentication. Using your SSH key is good because you will not have to type anything in, and it is more secure.
  2. HTTP protocol: web-page syntax, for example https://github.com/SCOREC/core.git. This will use your GitHub password for authentication.
  3. local protocol: you can communicate with another repository that is in the same filesystem just by using its path: /home/another_user/their_repo.

The git clone command accepts a remote in one of these formats above and creates a new repository with the same contents. The new repository will have a remote pointer called origin, which points to the repository that was cloned.

There are three fundamental operations that can be done to a remote repository:

  1. fetch: Copy commits from the remote repository to your repository, if they are not in your repository yet. Also record the location of all branches in the remote repository.
  2. pull: This is just a fetch followed by a merge from one of the remote branches into your branch. If you have not made any changes, this will be a fast-forward merge, because you are just catching up to changes made in the remote. If you do have changes that are not in the remote, then a normal merge commit needs to be created, and possibly conflicts resolved.
  3. push: This is most easily defined as a reverse pull, i.e. telling the remote to pull from you. Commits are copied from your repository to the remote if they don't exist there yet, and then the remote branch is made to catch up to your branch. In other words, the remote branch carries out a fast-forward merge from your branch into itself. The reason it must be fast-forward is that the remote repository may live on a GitHub server and have no way to resolve complex merge conflicts.
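
The three operations can be sketched with repositories modeled as plain dictionaries. Everything here is a simplification under stated assumptions: make_repo, clone, fetch, push, and the single-letter commit ids are hypothetical, the remote is always called origin, and push copies all local commits rather than only those reachable from the pushed branch.

```python
def make_repo():
    return {"commits": {}, "branches": {}, "remote_branches": {}}


def is_ancestor(old, new, commits):
    """True if 'old' is reachable from 'new' by following parents."""
    stack, seen = [new], set()
    while stack:
        c = stack.pop()
        if c == old:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(commits[c])
    return False


def fetch(local, remote):
    for cid, parents in remote["commits"].items():
        local["commits"].setdefault(cid, parents)  # copy what is missing
    for name, tip in remote["branches"].items():
        local["remote_branches"]["origin/" + name] = tip


def push(local, remote, branch):
    tip, old = local["branches"][branch], remote["branches"].get(branch)
    if old is not None and not is_ancestor(old, tip, local["commits"]):
        return False  # not a fast-forward for the remote: rejected
    for cid, parents in local["commits"].items():
        remote["commits"].setdefault(cid, parents)
    remote["branches"][branch] = tip
    return True


def clone(remote):
    local = make_repo()
    fetch(local, remote)
    local["branches"]["master"] = local["remote_branches"]["origin/master"]
    return local


server = make_repo()
server["commits"]["A"] = []
server["branches"]["master"] = "A"
alice, bob = clone(server), clone(server)
alice["commits"]["B"] = ["A"]; alice["branches"]["master"] = "B"
bob["commits"]["C"] = ["A"]; bob["branches"]["master"] = "C"

alice_ok = push(alice, server, "master")  # accepted: A is an ancestor of B
bob_ok = push(bob, server, "master")      # rejected: bob does not have B
fetch(bob, server)                        # a pull is this fetch plus a merge
bob["commits"]["M"] = ["C", bob["remote_branches"]["origin/master"]]
bob["branches"]["master"] = "M"
bob_retry = push(bob, server, "master")   # accepted: B is an ancestor of M
```

The rejected push shows why the merge work always happens in a normal repository: Bob has to fetch and merge locally before the remote can fast-forward to his result.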

Your branches can be set up to "track" remote branches, i.e. a branch can know which remote branch it will push to or pull from by default. For example, when you clone a repository, your master branch already "tracks" the origin/master remote branch, meaning push and pull will operate between these branches by default.
