Git Theory
There is a lot of good documentation about Git, including its official documentation and Atlassian's documentation. This page attempts to describe the theoretical foundations of Git for the benefit of PUMI developers.
After reviewing the theory, please browse our list of useful Git Commands.
When compared with a system like Subversion, there are two main design decisions of Git that define everything else about it:
- Git uses communication between repositories instead of one central repository
- Git uses a graph history instead of a linear history
The reason for these aspects is that Git was originally developed to accept changes from thousands of developers to the Linux project. The central maintainer of the Linux project needed to delegate the majority of acceptance work to his "lieutenants", so that he could just accept large batches of changes from his lieutenants instead of coordinating with thousands of developers at once.
This means the lieutenants must have their own repositories where they can locally merge changes from developers, and those repositories must be able to communicate with the central maintainer's repository. The hierarchy also means that history would look like a tree, with thousands of developers at the branches, combining branches into thicker ones at the lieutenants, and finally combining lieutenant branches into the central maintainer's "trunk". However, developer branches start from an earlier point in the trunk and later merge back into it, so the history is really a Directed Acyclic Graph rather than a tree.
A Git repository is really just the contents of the `.git` directory.
It tracks the content of another directory (the "source directory"),
usually the directory above it.
A repository contains the full history of the source,
because there is not necessarily a separate server to store history.
It also contains configuration settings, branches, remotes, etc.
There are two main kinds of repository:
- A "normal" repository, which is a `.git` subdirectory of the source directory. Users run Git commands inside the source directory that have effects on the repository.
- A "bare" repository, which is just a `.git` directory. Bare repositories are what exist on servers like GitHub. They accept pushed changes from normal repositories and their contents can be fetched from normal repositories.
The repository history is a collection of commits that forms a graph, with commits being graph nodes. Each commit represents the contents of the source at some point in time, along with a descriptive message and pointers to parent commits. A commit can have different numbers of parents:
- Zero parents: this is the "root commit", usually just the first commit ever made in the repository, describing the first version of the source.
- One parent: most commits are just changes from an old version (the parent commit) to a new version (this commit).
- Two parents: this is a merge commit, indicating that two versions of the source were combined together. Order is important: the first parent is the "trunk" commit, and the second parent is the "incoming" commit being merged into the trunk commit.
In Subversion, all commits are consecutively numbered, and identified
by this number.
In Git, there is no obvious way to assign such numbers to commits, because
two repositories may combine their commits and at that point they would
need to be renumbered and lose their identity.
This is one reason why the identity of a commit is not a nice integer
but rather a big random-looking string like `3fe9028a74b9ec12e3e8a78af2417d25887be6a8`.
Actually, this string is not random: it is the result of running the
SHA-1 hash algorithm on the following inputs:
- The current contents of the source
- The commit message, author name, and date of the change
- The SHA-1 of parent commits
This is why the identifier is simply called "the SHA-1",
or "the hash" of the commit.
The SHA-1 algorithm has the nice property that any two different
inputs have a negligible probability of producing the same hash
(about 2^-160 for a given pair of inputs).
This means commits from two repositories can be combined
with negligible risk of two commits having the same hash.
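The dependence of the hash on its inputs can be sketched in a few lines of Python. This is a minimal illustration, assuming Git's convention of hashing a `<type> <size>\0` header followed by the object's content; the author name, e-mail, timestamp, and messages are made-up values, and an empty tree stands in for "the current contents of the source".

```python
import hashlib


def git_object_hash(kind, payload):
    """SHA-1 over a '<kind> <size>\\0' header followed by the payload bytes."""
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()


def commit_payload(tree, parents, author, date, message):
    """Build the text of a commit object from the inputs listed above."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author} {date}")
    lines.append(f"committer {author} {date}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


# Hypothetical inputs, for illustration only.
AUTHOR = "A Developer <dev@example.com>"
DATE = "1700000000 +0000"

empty_tree = git_object_hash("tree", b"")  # hash of a source tree with no files
root = git_object_hash(
    "commit", commit_payload(empty_tree, [], AUTHOR, DATE, "First version"))
child = git_object_hash(
    "commit", commit_payload(empty_tree, [root], AUTHOR, DATE, "Second version"))

print(root)
print(child)
```

Because the parent hash is part of the hashed text, changing any ancestor changes the hash of every descendant commit.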
Because it is unnatural for humans to remember such long pseudo-random strings,
Git offers many other ways to identify commits.
There is the short SHA-1, which is just a prefix of the full SHA-1,
typically the first 7 characters: `3fe9028`.
There are also graph traversal symbols,
for example `3fe9028~` is the first parent of `3fe9028`.
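As a rough sketch of how such names map back to commits, the toy resolver below handles a hash prefix optionally followed by `~` or `~N`. It is not Git's actual lookup code; the parent and grandparent hashes are made up, and only this one suffix form is supported.

```python
TIP = "3fe9028a74b9ec12e3e8a78af2417d25887be6a8"
PARENT = "b" * 40        # made-up hash
GRANDPARENT = "c" * 40   # made-up hash

# Toy history: each commit hash maps to its list of parent hashes.
history = {TIP: [PARENT], PARENT: [GRANDPARENT], GRANDPARENT: []}


def resolve(spec, history):
    """Resolve 'prefix', 'prefix~' or 'prefix~N' to a full commit hash."""
    prefix, tilde, count = spec.partition("~")
    steps = int(count) if count else (1 if tilde else 0)
    matches = [h for h in history if h.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"{prefix!r} is unknown or ambiguous")
    commit = matches[0]
    for _ in range(steps):
        commit = history[commit][0]  # always follow the first parent
    return commit


print(resolve("3fe9028", history))
print(resolve("3fe9028~", history))
print(resolve("3fe9028~2", history))
```

A short hash only works while it is unambiguous, which is why the sketch rejects a prefix that matches more than one commit.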
Developers often make lots of changes in a hurry
so it can be nice to have a way to commit parts of the
changes you have made.
Conversely, you may be making extensive changes to many files,
and want some way to save your
progress without making partial commits.
Git has a staging area called the "index".
This is temporary storage for the changes that will
go into a commit.
The `git add` command will "accept" changes from the
working directory into the index, which can then
be committed as a whole when ready.
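The relationship between the working directory, the index, and a commit can be modeled with plain dictionaries. This is only a conceptual sketch: the file names and contents are hypothetical, and the real index stores considerably more than file contents.

```python
# Working directory: two files have been edited, only one is ready to commit.
working_dir = {"mesh.cc": "new mesh code", "field.cc": "half-finished field code"}
index = {}    # staging area: what the next commit will contain
commits = []  # committed snapshots, oldest first


def add(path):
    """Like `git add`: accept the file's current content into the index."""
    index[path] = working_dir[path]


def commit(message):
    """Like `git commit`: record the index, not the working directory."""
    commits.append({"message": message, "files": dict(index)})


add("mesh.cc")
commit("Rework mesh code")
print(commits[-1]["files"])
```

The unfinished `field.cc` stays out of the commit because it was never added to the index.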
Branches are just pointers to commits.
Branches have nice human-readable names,
for example the default branch is called `master`.
There is also another pointer, called `HEAD`, which usually
points to the "current branch" the user is working on.
When a user runs `git commit`, a new commit is created,
and the current branch moves its pointer to the new commit.
`HEAD` really indicates what point in history the source
directory currently reflects, and one can for example
run `git checkout 3fe9028`, which will change the source
directory contents to the state they were in at commit `3fe9028`
and set `HEAD` to point to `3fe9028`.
When indicating commits to Git commands, the name of a branch can be used instead of the hash of the commit it points to.
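The pointer behavior described above can be sketched with a small class. This is a toy model under simplifying assumptions: commit ids such as `c1` are supplied by the caller instead of being hashes, and `Repo` is a hypothetical name, not part of any Git library.

```python
class Repo:
    def __init__(self):
        self.parents = {}                 # commit id -> list of parent ids
        self.branches = {"master": None}  # branch name -> commit id
        self.head = "master"              # a branch name, or a commit id

    def resolve(self, name):
        """A branch name can stand in for the commit it points to."""
        return self.branches[name] if name in self.branches else name

    def commit(self, commit_id):
        parent = self.resolve(self.head)
        self.parents[commit_id] = [parent] if parent else []
        if self.head in self.branches:
            self.branches[self.head] = commit_id  # current branch moves forward
        else:
            self.head = commit_id                 # HEAD points at a commit directly

    def checkout(self, name):
        """Point HEAD at a branch or directly at a commit."""
        self.head = name


repo = Repo()
repo.commit("c1")
repo.commit("c2")
repo.checkout("c1")  # like `git checkout 3fe9028`: HEAD now names a commit
print(repo.resolve(repo.head), repo.branches["master"])
```

Checking out a commit moves only `HEAD`; the `master` pointer stays where the last commit left it.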
Git has two important modes of merging, both invoked
by the `git merge` command.
The inputs are the current branch, and an incoming commit to merge
into the current branch.
- A normal merge creates a merge commit, i.e. a new commit with the current branch commit and the incoming commit as parents. The current branch then points to the newly created merge commit.
- If there are no changes in the current branch that are not known in the incoming commit, then the current branch simply switches to point to the incoming commit. This is called a "fast-forward" merge, and represents just "catching up" to the state of a branch that is ahead.
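The choice between the two modes comes down to an ancestry check on the commit graph. The sketch below is a simplified model, not Git's implementation: commit ids are short made-up strings, the merge commit id is synthesized from its parents, and conflicts are ignored entirely.

```python
def ancestors(commit, parents):
    """Return the commit plus everything reachable through parent pointers."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def merge(branches, current, incoming, parents):
    ours = branches[current]
    if incoming in ancestors(ours, parents):
        return "already up to date"
    if ours in ancestors(incoming, parents):
        branches[current] = incoming     # nothing new on our side: catch up
        return "fast-forward"
    merged = f"merge({ours},{incoming})"
    parents[merged] = [ours, incoming]   # first parent is the current branch
    branches[current] = merged
    return "merge commit"


# Toy history: b and c both start from a.
parents = {"a": [], "b": ["a"], "c": ["a"]}
branches = {"master": "a"}
print(merge(branches, "master", "b", parents))
print(merge(branches, "master", "c", parents))
```

The first call can fast-forward because `master` has no commits that `b` lacks; the second cannot, because `b` and `c` have diverged.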
Creating a merge commit is not always automatic. There can be merge conflicts, usually when two people change the same part of the same file and then try to merge their changes. There are many cases in which a developer must resolve conflicts by hand. Git leaves annotations in your source files that can be resolved with a text editor, like this:
double get_density() {
<<<<<<< HEAD
  return 6;
=======
  return 0;
>>>>>>> pauls-branch
}
All the above holds true for a single repository:
commits are stored in the local `.git` repository,
completely independently of any server like GitHub.
Communication between repositories is done using
remotes, which are just pointers from one repository
to another.
There are three protocols most commonly used to communicate with a remote repository:
- SSH protocol: use the same syntax as for an `scp` command, which is [email protected]:path/repo.git. This protocol can use your SSH key or your password for authentication. Using your SSH key is good because you will not have to type anything in, and it is more secure.
- HTTP protocol: web-page syntax, for example `https://github.com/SCOREC/core.git`. This will use your GitHub credentials for authentication (GitHub itself now requires a personal access token in place of the account password).
- Local protocol: you can communicate with another repository that is in the same filesystem just by using its path: `/home/another_user/their_repo`.
The `git clone` command accepts a remote in one
of the formats above and creates a new repository
with the same contents.
The new repository will have a remote pointer called
`origin`, which points to the repository that was cloned.
There are three fundamental operations that can be done to a remote repository:
- `fetch`: Copy commits from the remote repository to your repository, if they are not in your repository yet. Also record the location of all branches in the remote repository.
- `pull`: This is just a `fetch` followed by a `merge` from one of the remote branches into your branch. If you have not made any changes, it is good for this to be a fast-forward merge, because you are just catching up to changes made in the remote. If you do have changes that are not in the remote, then a normal merge commit needs to be created, and possibly conflicts resolved.
- `push`: This is most easily defined as a reverse `pull`, i.e. telling the remote to pull from you. Commits are copied from your repository to the remote if they don't exist there yet, and then a remote branch is forced to catch up to your branch. In other words, the remote branch will carry out a fast-forward merge from your branch into itself. The reason it must be fast-forward is that the remote repository may live on a GitHub server and have no way to resolve complex merge conflicts.
Your branches can be set up to "track" remote branches,
i.e. a branch can know which remote branch it will push
to or pull from by default.
For example, when you clone a repository, your
`master` branch already "tracks" the `origin/master`
remote branch, meaning `push` and `pull` will operate
between these branches by default.