Reference (cache) repositories to speed up clones: git clone --reference

Update: Please read this update on my experience before using the technique in this article.
Update: See the drush implementation of this approach in the comment below.

DamZ taught me a great new piece of git trivia today. You can use a local repository as a kind of cache for a git clone.

Let's create a reference repository for Drupal. It will be bare (--mirror implies --bare), because we don't need any files checked out:

git clone --mirror git://git.drupal.org/project/drupal.git ~/gitcaches/drupal.reference

That makes a complete clone of Drupal's full history in ~/gitcaches/drupal.reference.

Now when I need to clone the Drupal project's entire history (as I might often do in testing), I can run:

git clone --reference ~/gitcaches/drupal.reference git://git.drupal.org/project/drupal.git

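The clone then takes about 2 seconds instead of several minutes, because git borrows the objects it already finds in the reference repository and only downloads what is missing. And yes, it picks up new changes that may have happened in the real remote repository.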
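To see what --reference actually leaves behind, here is a small sketch that runs entirely offline. It assumes a throwaway local repository standing in for git.drupal.org, and every path under the temporary directory is made up for the example. The point it illustrates is that the new clone records the cache's object directory in .git/objects/info/alternates and borrows objects from there.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Stand-in "upstream" with one commit (hypothetical; replaces the real remote).
git init -q "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Build the reference repository, as in the article.
git clone -q --mirror "file://$work/upstream" "$work/cache.reference"

# Clone the "remote" while borrowing objects from the cache.
git clone -q --reference "$work/cache.reference" "file://$work/upstream" "$work/clone"

# The clone points at the cache's object store instead of copying it.
alternates=$(cat "$work/clone/.git/objects/info/alternates")
echo "alternates: $alternates"
```

One consequence worth keeping in mind: a clone made this way depends on the cache staying intact, so deleting the cache can break the clone. Newer versions of git offer git clone --dissociate, which copies the borrowed objects and drops that dependency.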

To go beyond this (again from DamZ), we can build a single reference repository that holds many projects, each added as its own remote:

mkdir -p ~/gitcaches/reference
cd ~/gitcaches/reference
git init --bare
for repo in drupal views cck examples panels  # whatever you want here
do
  git remote add "$repo" "git://git.drupal.org/project/$repo.git"
done

git fetch --all

Now I have just one big bare repo that I can use as a cache. I might update it from time to time with git fetch --all. But I don't have to. And I can use it like this:

cd /tmp
git clone --reference ~/gitcaches/reference git://git.drupal.org/project/drupal.git
git clone --reference ~/gitcaches/reference git://git.drupal.org/project/examples.git
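The same idea can be tried without network access. The sketch below is an illustration under assumptions, not the setup used on the testbots: two throwaway local repositories with the hypothetical names alpha and beta stand in for the Drupal projects, and all paths are invented. It shows one bare cache serving clones of two unrelated projects.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/upstream" "$work/clones"

# Two stand-in upstream projects (hypothetical names alpha and beta).
for repo in alpha beta
do
  git init -q "$work/upstream/$repo"
  git -C "$work/upstream/$repo" -c user.name=demo -c user.email=demo@example.com \
      commit -q --allow-empty -m "first commit of $repo"
done

# One bare cache with a remote per project, as in the article.
git init -q --bare "$work/reference"
for repo in alpha beta
do
  git -C "$work/reference" remote add "$repo" "file://$work/upstream/$repo"
done
git -C "$work/reference" fetch -q --all

# Each project is cloned from its own URL but borrows from the shared cache.
for repo in alpha beta
do
  git clone -q --reference "$work/reference" \
      "file://$work/upstream/$repo" "$work/clones/$repo"
done

# The cache keeps each project's branches under its remote name.
cache_refs=$(git -C "$work/reference" for-each-ref --format='%(refname)' refs/remotes)
echo "$cache_refs"
```

Because every project lives under its own remote namespace in the cache, the projects do not collide even if they use the same branch names.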

We'll try to use this technique for the testbots, which do several clean checkouts per patch tested, as it should speed them up by at least a minute per test.

Edit: Here is the version that I used with the testbots, as it appears as a gist:

5 comments

by Damien Tournoud on Mon, 2011-02-21 04:08

Adding a new repository to the cache can be slow, because on the first fetch Git will try to determine if it has any revision in common with the remote repository, and the only way to do that is to send out the list of *all* the commits in the local repository.

The trick to work around this is to clone into an empty repository first, and then fetch from there into the cache repository. Here is a simple bash script that automates this process. Execute it with import.sh [remote name] [remote url] from inside the cache repository:

#!/bin/bash
# Usage: import.sh [remote name] [remote url]
# Run from inside the cache repository.

set -ex

# Create a temporary directory and remove it on exit.
TEMPDIR=$(mktemp -d)
trap 'rm -Rf "$TEMPDIR"' EXIT

# First clone the remote repository separately.
git clone "$2" "$TEMPDIR"

# Then fetch from the temporary clone into our main repo.
git remote add "$1" "$TEMPDIR/"
git fetch "$1" --tags

# Then change the remote URL and fetch normally.
git remote set-url "$1" "$2"
git fetch "$1" --tags

by Matt Farina on Sat, 2011-02-26 15:34

Based on the work of Randy and Damien I created a script to init, add, and update a Drupal Git Cache. The script is at http://drupal.org/sandbox/mfer/1074256.

Nice work gentlemen.

by moshe weitzman on Thu, 2011-03-03 22:32

We are working on a drush command at http://drupal.org/node/1076302. We just have to solve an incompatibility with the git_deploy module.

by pfrenssen on Fri, 2011-05-06 06:51

This is the full drush command to use for those who are interested:
drush dl --package-handler=git_drupalorg --cache drupal

You can also add it as a default setting to your ~/.drush/drushrc.php file:

<?php
$command_specific['dl'] = array(
  'package-handler' => 'git_drupalorg',
  'cache' => TRUE,
);
?>

by dvessel on Fri, 2011-09-09 12:04

This post got me thinking so I put up a more generalized bash script that caches on demand and it’s not specific to Drupal.

https://gist.github.com/8839519ec5b823e047bf

Replacing ‘git’ with ‘git-cached’ will get it working.
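The script itself is not reproduced in this comment, so the following is not dvessel's code. It is only a minimal sketch of what "caching on demand" could look like. The function name clone_cached, the GIT_CACHE_DIR variable, and the way the cache name is derived from the URL are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: clone a URL through a per-URL mirror cache,
# creating the cache on first use and refreshing it on later uses.
GIT_CACHE_DIR=${GIT_CACHE_DIR:-"$HOME/gitcaches"}

clone_cached() {
  url=$1
  dest=$2

  # Derive a filesystem-safe cache name from the URL (assumed naming scheme).
  name=$(printf '%s' "$url" | tr -c 'A-Za-z0-9._-' '_')
  cache="$GIT_CACHE_DIR/$name.reference"

  if [ -d "$cache" ]; then
    # Cache already exists: bring it up to date.
    git -C "$cache" fetch -q --all
  else
    # First use: create the mirror.
    mkdir -p "$GIT_CACHE_DIR"
    git clone -q --mirror "$url" "$cache"
  fi

  # Clone from the real URL, borrowing objects from the cache.
  git clone -q --reference "$cache" "$url" "$dest"
}
```

With the function loaded into a shell, a call such as clone_cached git://git.drupal.org/project/drupal.git drupal would build the cache on the first run and reuse it on later runs. Both arguments are required in this sketch.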
