
The BFG Repo Cleaner

Randy McDermott edited this page Aug 30, 2016 · 26 revisions

There are generally two reasons why you might need to "clean" your repo:

  • The repo has grown too large

  • Someone committed sensitive information

The Git functionality to handle these problems is the utility called git-filter-branch. However, unless you are a real expert with Git, git-filter-branch is pretty difficult to use. Thankfully, there is an amazing alternative called The BFG Repo-Cleaner by Roberto Tyley.

To use BFG you need command-line Java installed. Download and install the JDK for your platform. To test your installation, open a terminal and type

$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

Once Java is installed, move on to installing (really just downloading) The BFG Repo-Cleaner. In the link provided, the download button is at the upper right. Move the bfg-1.12.13.jar file (or whatever the current version is) to wherever you keep your program files (e.g., /Applications on OSX). Next, it is convenient to create an alias for the run command in your .bash_profile or .bashrc. For example, add the following line on OSX:

alias bfg="java -jar /Applications/bfg-1.12.13.jar"

Of course, if you do not want to create an alias you can just substitute "java -jar <bfg.jar>" everywhere I have "bfg" below, where <bfg.jar> is the full path to the version of BFG you downloaded.
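
An alias defined in .bashrc is generally not available inside shell scripts, so a small function can stand in for it there. This is only a minimal sketch: the `BFG_JAR` variable is a name I made up for illustration, and its default is the path from the alias example above.

```shell
# Hypothetical variable: full path to the BFG jar you downloaded.
BFG_JAR="${BFG_JAR:-/Applications/bfg-1.12.13.jar}"

# Wrapper that forwards every argument to the jar, just like the alias.
bfg() {
    java -jar "$BFG_JAR" "$@"
}
```

With this defined, `bfg --delete-files password.txt` runs the same command the alias would.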

## Usage

The BFG is more of a hatchet than a scalpel. It is not possible, for example, to go in and clean a specific commit from the history of a repo. Below we show how to: (1) remove files, (2) remove folders, (3) remove blobs (files) larger than a certain size.

### IMPORTANT

First, back up your repo! Mistakes cannot be undone.

Second, if this is a public repo, realize that running BFG will rewrite the history. This means that all your collaborators will basically need to start from scratch with the new repo. Give them advance warning so they can organize their changes and be ready to migrate to the new repo.
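
One simple way to follow the backup advice is to copy the whole repo directory, .git included, before touching anything. The function below is a sketch under that assumption; the name `backup_repo` and the timestamped naming scheme are hypothetical, not part of BFG or Git.

```shell
# backup_repo SRC: copy the repo directory SRC (including .git) to a
# timestamped sibling directory and print the path of the backup.
backup_repo() {
    src="${1%/}"                                  # drop any trailing slash
    dest="${src}.backup-$(date +%Y%m%d-%H%M%S)"   # hypothetical naming scheme
    cp -Rp "$src" "$dest" && printf '%s\n' "$dest"
}
```

For example, `backup_repo reponame` run from the parent directory leaves the original untouched and prints where the copy went.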

### Removing Specific Files

Suppose someone commits a password file called password.txt. To get rid of this file you need to do two things. First, either revert the commit or git rm the file and commit this change to the repo. This step is needed because, by default, BFG leaves the current commit intact and only cleans the file out of the earlier history. There is a way around this behavior, but if you do not want the file in the working tree, it is recommended that you remove it explicitly before cleaning the repo. Second, run BFG to remove the file from the history.

Step 1:

$ git rm password.txt
$ git commit -m "remove password file"

Step 2:

Now we run BFG on the repo to remove the file from the history. At the top level of the repo, do

$ bfg --delete-files password.txt
...

BFG will do a bunch of stuff and show output in your terminal. Finally, when it is finished, it will tell you to run the following:

$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

Your repo is now clean of the password file.
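
The steps above can be collected into one function if you expect to do this more than once. This is a sketch, not a tested tool: `purge_file` and `BFG_JAR` are hypothetical names, and it assumes the file sits at the top level of the repo, so the same name works for both git rm and BFG.

```shell
# Hypothetical variable: full path to the BFG jar you downloaded.
BFG_JAR="${BFG_JAR:-/Applications/bfg-1.12.13.jar}"

# purge_file NAME: run from the top level of the repo. Removes NAME from
# the working tree, then from the history, then prunes the old objects.
# Each step runs only if the previous one succeeded.
purge_file() {
    name="$1"
    git rm "$name" &&
    git commit -m "remove $name" &&
    java -jar "$BFG_JAR" --delete-files "$name" &&
    git reflog expire --expire=now --all &&
    git gc --prune=now --aggressive
}
```

Call it as `purge_file password.txt` from the top level of a repo you have already backed up.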

Note that you could remove different file types as well, e.g., *.png, *.pdf, etc. But you cannot give paths to the files. For example, /dir1/dir2/*.png will not tease out only the png files in the subdirectory. This is why I said BFG is not a scalpel. You can, however, get rid of certain directory names, which we will do next.
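
To make the "file name only, no paths" rule concrete, here is a small shell illustration of what matching on the name alone means. It is only my sketch of the rule described above, not BFG's actual implementation, and `name_matches` is a hypothetical helper.

```shell
# name_matches PATH GLOB: succeed if the last component of PATH matches GLOB.
# The directory part of PATH is deliberately ignored.
name_matches() {
    case "$(basename "$1")" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# Try the same glob against files in different directories.
for path in dir1/dir2/a.png other/b.png dir1/dir2/notes.pdf; do
    if name_matches "$path" '*.png'; then
        echo "would match: $path"
    fi
done
```

Both png files are selected no matter which directory they live in, which is why a pattern like *.png cannot be limited to one subdirectory.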

### Removing Folders

Suppose a repo has two subdirectories, dir1 and dir2, and you want to split this repo into two new smaller repos. First, copy the repo so that you have two identical repos. Of course, now you have used up twice the disk space. Next, you can go into each repo separately and clean out the subdirectory you no longer want. Just do the following:

$ cd repo1
$ bfg --delete-folders dir2
...
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

Note that you can remove multiple directories at a time. Suppose you want to remove dir_a and dir_b in one step.

$ cd repo1
$ bfg --delete-folders "{dir_a,dir_b}"
...
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
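
The quotes around the brace list matter. In bash, an unquoted {dir_a,dir_b} is brace-expanded by the shell into two separate words before the command ever sees it, which is presumably why the example quotes it. The hypothetical helper below just shows that the quoted form arrives as a single argument.

```shell
# count_args: print how many arguments the shell actually passed in.
count_args() {
    echo "$#"
}

# Quoted, the brace list reaches the command as one literal argument.
count_args "{dir_a,dir_b}"
```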

### Removing Large Blobs

Another common mistake made in a repo is that someone accidentally commits a large binary or image file (be careful with git add . and git push). If you want to remove all files larger than a certain size, say 10 Megabytes, just do

$ cd reponame
$ bfg -b 10M
...
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

Now go check the size of the repo.

$ cd ..
$ du -sk reponame

You should notice a substantial decrease in disk usage.
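
If you want a number rather than an impression, record the size before and after cleaning and subtract. The two helpers below are a minimal sketch built on the same du -sk command; the function names are hypothetical.

```shell
# repo_size_kb DIR: print the disk usage of DIR in kilobytes.
repo_size_kb() {
    du -sk "$1" | awk '{print $1}'
}

# saved_kb BEFORE AFTER: print the difference between two sizes in kilobytes.
saved_kb() {
    echo $(( $1 - $2 ))
}

# Example use (run from the directory that contains reponame):
#   before=$(repo_size_kb reponame)
#   ... clean the repo with bfg and git gc ...
#   after=$(repo_size_kb reponame)
#   echo "saved $(saved_kb "$before" "$after") KB"
```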
