How to handle big repositories with git
git is a fantastic choice for tracking the evolution of your code base and for collaborating efficiently with your peers. But what happens when the repository you want to track is really huge?
In this post I'll try to give you some ideas and techniques to deal properly with the different categories of huge.
Two categories of Big repositories
If you think about it there are broadly two major reasons for repositories growing massive:
- They accumulate a very long history (the project grows over a very long period of time and the baggage accumulates)
- They include huge binary assets that need to be tracked and paired together with code.
- Both of the above.
So a repository can grow in two orthogonal directions: The size of the working directory - i.e. the latest commit - and the size of the entire accumulated history.
Sometimes the second category of problem is compounded by the fact that old deprecated binary artifacts are still stored in the repository, but that has a moderately easy - if annoying - fix, see below.
For the above two scenarios the techniques and workarounds are different - though sometimes complementary - and so let me cover them separately.
Handling Repositories With Very Long History
Even though the bounds that identify a repository as massive are pretty high (for example, the latest Linux kernel clocks in at 15+ million lines of code, yet people seem happy to peruse it in full), very old projects that have to be kept intact for regulatory or legal reasons can become a pain to clone. (To be transparent, the Linux kernel is split into a historical repository and a more recent one, and requires a simple grafting setup to get access to the full unified history.)
Simple solution is a shallow clone
The first solution to a fast clone and to saving developers and systems time and disk space is to perform a shallow clone using git. A shallow clone allows you to clone a repository keeping only the latest n commits of history.
How do you do it? Just use the --depth option, for example:
git clone --depth <depth> <remote-url>
Imagine you accumulated ten or more years of project history in your repository (for example, for Jira we migrated an 11-year-old code base to git): the time savings can add up and be very noticeable.
The full clone of Jira is 677MB, with the working directory being another 320+MB, made up of more than 47,000 commits. From a quick check on the Jira checkout, a shallow clone took 29.5 seconds compared to the 4 minutes 24 seconds of a full clone with all the history. The disparity also grows in proportion to how many binary assets your project has swallowed over time. In any case, build systems can greatly profit from this technique too.
Recent git has improved support for shallow clones
Shallow clones used to be somewhat impaired citizens of the git world as some operations were barely supported. But recent versions (1.9+) have improved the situation greatly and you can now properly push to repositories even from a shallow clone.
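A shallow clone can also be deepened later if the older history turns out to be needed. The following is a minimal sketch under stated assumptions, not a recipe for a real project: it builds a throwaway repository with five commits in a temporary directory (the names origin and shallow are made up for the demo), clones it with --depth 2, and then fetches the remaining history with git fetch --unshallow. A file:// URL is used because git ignores --depth when cloning from a plain local path.

```shell
#!/bin/sh
set -e

# Throwaway "remote" with five commits, created only for this demo.
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
for i in 1 2 3 4 5; do
  echo "change $i" > file.txt
  git add file.txt
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "commit $i"
done

# Shallow clone: keep only the two most recent commits.
cd "$work"
git clone -q --depth 2 "file://$work/origin" shallow
shallow_count=$(git -C shallow rev-list --count HEAD)

# Later on, fetch the rest of the history into the same clone.
git -C shallow fetch -q --unshallow
full_count=$(git -C shallow rev-list --count HEAD)

echo "shallow clone: $shallow_count commits, after unshallow: $full_count commits"
```

On a build server you would normally stop after the clone step, since the older history is rarely needed there.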
Partial solution is filter-branch
For huge repositories that have big binary cruft committed by mistake, or old assets that are not needed anymore, a great solution is to use filter-branch. The command lets you walk through the entire history of the project, filtering out, massaging, modifying, or skipping files according to predefined patterns. It is a very powerful tool in your git arsenal. There are already helper scripts available to identify big objects in your git repository, so that should be easy enough.
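As an illustration of what such a helper script might look like, here is a small sketch built only on git rev-list and git cat-file. The function name largest_blobs and the throwaway repository are assumptions made for the demo; the sizes reported are uncompressed object sizes, not the space the objects take inside a pack file.

```shell
#!/bin/sh
set -e

# Print the N largest blobs reachable from any ref as "<size-in-bytes> <path>".
# $1: path to the repository, $2: number of entries to show.
largest_blobs() {
  git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -rn |
    head -n "$2"
}

# Throwaway repository with one small and one larger file, only for the demo.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo "hello" > small.txt
dd if=/dev/zero of=big.bin bs=1000 count=200 2>/dev/null
git add small.txt big.bin
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add files"

top=$(largest_blobs "$repo" 1)
echo "$top"
```

In a real repository you would call the function with the path to your clone and a larger count, then decide which of the listed paths are safe to filter out.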
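Sample usage of filter-branch (the path is relative to the repository root, because the filter runs inside each checked-out commit):
git filter-branch --tree-filter 'rm -rf path/to/spurious/asset/folder' HEAD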
filter-branch has a minor drawback: once you use filter-branch you effectively rewrite the entire history of your project, so all commit ids change. This requires every developer to re-clone the updated repository.
So if you're planning to carry out a cleanup using filter-branch, you should alert your team, plan a short freeze while the operation is carried out, and then notify everyone that they should clone the repository again.
Alternative to shallow-clone: Clone only one branch
Since git 1.7.10 (April 2012) you can also limit the amount of history you clone by cloning a single branch, like the following:
git clone URL --branch branch_name --single-branch [folder]
This specific hack is useful for long-running and divergent branches, or if you have many branches. If you only have a handful of branches with very few differences, you probably won't see a huge difference using this.
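To see what a single-branch clone actually changes, the sketch below creates a throwaway repository with a default branch and a branch called feature (both names and paths are assumptions for the demo), clones only feature, and prints the fetch refspec that git recorded. The refspec shows that later fetches will keep following only that one branch.

```shell
#!/bin/sh
set -e

# Throwaway "remote" with two branches, created only for this demo.
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
echo "base" > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "base"
git checkout -q -b feature
echo "feature work" >> file.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -a -m "feature work"

# Clone only the feature branch.
cd "$work"
git clone -q --branch feature --single-branch "file://$work/origin" one-branch

# The clone is configured to fetch just that branch from now on.
refspec=$(git -C one-branch config --get remote.origin.fetch)
echo "$refspec"
```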
Handling Repositories With Huge Binary Assets
The second category of big repositories is made up of code bases that have to track huge binary assets. Gaming teams have to juggle huge 3D models, Web development teams might need to track raw image assets, CAD teams might need to manipulate and track the status of binary deliverables. So there are different categories of software teams that run into this issue with git.
Git is not especially bad at handling binary assets, but it's not especially good either. By default git will compress and store all subsequent full versions of the binary assets, which is obviously not optimal if you have many.
There are some basic tweaks that improve the situation, like running the garbage collection git gc, or tweaking the usage of delta commits for some binary types in .gitattributes.
But it's important to reflect on the nature of your project's binary assets, as the winning approach may vary. For example, here are a few points to check (thanks to Stefan Saasen for the remarks):
- For binary files that change significantly - and not just some metadata headers - the delta compression is probably going to be useless, so the suggestion is to turn delta off for those files to avoid the unnecessary delta compression work as part of the repack.
- In the scenario above it's likely that those files don't zlib compress very well either, so you could turn compression off with core.loosecompression 0. That's a global setting that would negatively affect all the non-binary files that actually compress well, so the suggestion makes sense if you split the binary assets into a separate repository.
- It's important to remember that git gc turns the "duplicated" loose objects into a single pack file, but again, unless the files compress in some way, that probably won't make any significant difference to the resulting pack file.
- Explore the tuning of core.bigFileThreshold. Anything larger than 512 MiB won't be delta compressed anyway - without having to set .gitattributes - so maybe that's something worth tweaking.
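The tweaks in the list above could be applied roughly as follows. This is only a sketch in a throwaway repository: the *.psd and *.mp4 patterns and the 100m threshold are illustrative assumptions, not recommendations, and the right values depend on your own assets. The last lines use git check-attr to confirm that delta compression is switched off for a matching path.

```shell
#!/bin/sh
set -e

# Throwaway repository, standing in for one that holds the binary assets.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Turn delta compression off for file types that rarely delta well.
# The patterns below are examples only.
printf '%s\n' '*.psd -delta' '*.mp4 -delta' > .gitattributes

# Disable zlib compression of loose objects. As noted above, this affects
# every file in the repository, so it suits an assets-only repository.
git config core.looseCompression 0

# Lower the size above which git stops attempting deltas (default 512 MiB).
# 100m is an arbitrary illustrative value.
git config core.bigFileThreshold 100m

attr=$(git check-attr delta -- assets/model.psd)
threshold=$(git config --get core.bigFileThreshold)
echo "$attr"
echo "core.bigFileThreshold: $threshold"
```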
Technique 1: sparse checkout
A mild help to the binary assets problem is sparse checkout (available since Git 1.7.0). This technique lets you keep the working directory clean by explicitly detailing which folders you want to populate. Unfortunately it does not affect the size of the overall local repository, but it can be helpful if you have a huge tree of folders.
What are the involved commands? Here's an example (credit):
- Clone the full repository once:
git clone <repository-address>
- Activate the feature:
git config core.sparsecheckout true
- Add folders that are needed explicitly, ignoring assets folders:
echo src/ > .git/info/sparse-checkout
- Read the tree as specified:
git read-tree -m -u HEAD
After the above you can go back to using your normal git commands, but your work directory will only contain the folders you specified above.
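Newer versions of git wrap these steps in a dedicated git sparse-checkout command. The sketch below assumes Git 2.25 or later and uses a throwaway repository with made-up src/ and assets/ folders; it restricts the working directory to src/ and then checks which folders are still present on disk.

```shell
#!/bin/sh
set -e

# Throwaway project with a code folder and an assets folder, only for the demo.
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
mkdir src assets
echo 'int main(void) { return 0; }' > src/main.c
echo "pretend this is a large binary" > assets/model.bin
git add src assets
git -c user.name=demo -c user.email=demo@example.com commit -q -m "code and assets"

# Restrict the working directory to src/ (assumes Git 2.25 or later).
git sparse-checkout set src

if [ -d src ]; then has_src=yes; else has_src=no; fi
if [ -d assets ]; then has_assets=yes; else has_assets=no; fi
echo "src present: $has_src, assets present: $has_assets"
```

Running git sparse-checkout disable brings the full working directory back. As with the manual steps, the local repository still contains every object, so this saves working directory space rather than clone time.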
Technique 2: Use of submodules
Another way to handle huge binary asset folders is to split those into a separate repository and pull the assets into your main project using submodules. This gives you a way to control when you update the assets. See more on submodules in these posts: core concept and tips and alternatives.
If you go the submodules way you might want to check out the complexities of handling project dependencies, since some of the possible approaches to the huge binaries problem might be helped by the approaches I mention there.
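A rough sketch of the submodule setup follows, using two throwaway repositories (assets-repo and main are made-up names). It shows the point about control: the main project records one specific commit of the assets repository, and that pin only moves when you update it deliberately. The protocol.file.allow=always setting is there only because the demo adds a submodule from a local file:// URL; with a normal hosted remote it should not be needed.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
# Small helper so the demo does not depend on a configured git identity.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

# A separate repository that holds only the binary assets.
git init -q "$work/assets-repo"
cd "$work/assets-repo"
echo "pretend this is a large binary" > model.bin
git add model.bin
commit -m "add model"
assets_head=$(git rev-parse HEAD)

# The main project pulls the assets in as a submodule under assets/.
git init -q "$work/main"
cd "$work/main"
echo "code" > main.txt
git add main.txt
commit -m "code"
git -c protocol.file.allow=always submodule --quiet add "file://$work/assets-repo" assets
commit -m "track assets as a submodule"

# The main project stores the exact assets commit it was built against.
pinned=$(git rev-parse HEAD:assets)
echo "assets pinned at $pinned"
```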
Technique 3: Use git annex or git-bigfiles
A third option for handling binary assets with git is to rely on an apt third party extension.
The first one I mention is git-annex, which allows managing binary files with git without checking the file contents into the repository.
git-annex saves the files in a special key-value store and only symbolic links are then checked into git and versioned like regular files. It is straightforward to use and the examples are self explanatory.
The second one is git-bigfiles, a git fork that hopes to make life bearable for people using Git on projects hosting very large files.
[UPDATE] …or you can skip all that and use Git LFS
If you work with large files on a regular basis, the best solution might be to take advantage of the large file support (LFS) Atlassian co-developed with GitHub in 2015.
Git LFS is an extension that stores pointers (naturally!) to large files in your repository, instead of storing the files themselves in there. The actual files are stored on a remote server. As you can imagine, this dramatically reduces the time it takes to clone your repo.
Bitbucket supports Git LFS, as does GitHub. So chances are, you already have access to this technology. It’s especially helpful for teams that include designers, videographers, musicians, or CAD users.
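Getting started typically comes down to telling Git LFS which file patterns it should manage. The sketch below assumes the git-lfs extension is installed (it checks first and does nothing otherwise) and uses *.psd purely as an example pattern in a throwaway repository.

```shell
#!/bin/sh
set -e

# Throwaway repository, only for the demo.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

if git lfs version >/dev/null 2>&1; then
  # Set up the LFS filters and hooks for this repository only.
  git lfs install --local >/dev/null
  # Ask LFS to manage Photoshop files; this writes a rule to .gitattributes.
  git lfs track "*.psd" >/dev/null
  tracked=$(cat .gitattributes)
else
  tracked="git-lfs not installed"
fi
echo "$tracked"
```

After that, .gitattributes is committed like any other file, and matching files that you add are stored as pointers while their contents go to the LFS server.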
Don't give up the fantastic capabilities of git just because you have a huge repository history or huge assets. There are workable solutions to both problems.
Follow me @durdn for more DVCS rocking.