Git and Large Repositories

by Peter Jones / February 19, 2008

Abstract

Import a large repository into Git and perform some testing to see how fast it really is.

Importing from a Large CVS Repository

The FreeBSD CVS repository was chosen because it provides an appropriately sized code base to work from, both in the number of files and in the amount of history available.

As of this writing, the FreeBSD src CVS repository has 67,311 files, weighing in at 1.7 GB of data. The first commit was made in 1993, providing us with 15 years' worth of historical information.

Getting a Local Copy of the FreeBSD CVS Repository

For performance reasons, it's probably a good idea to get a local copy of the CVS repository you wish to import into Git. Using the list of FreeBSD rsync sites, locate a mirror that contains the entire FreeBSD FTP site.

You can then use rsync to download the entire repository:

$ mkdir freebsd
$ cd freebsd
$ rsync -vaz --delete rsync://a-freebsd-mirror/FreeBSD/development/FreeBSD-CVS/src .

Preparing the CVS Repository

The FreeBSD mirror that you pulled the src directory from probably doesn't have the CVSROOT directory. We'll create that now.

Create an empty directory, run cvs init on it, and move the resulting CVSROOT directory into the repository copy.

$ cd ..
$ mkdir empty
$ cvs -d $PWD/empty init
$ mv empty/CVSROOT freebsd/
$ rmdir empty
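
Before moving on, it can be worth confirming that the repository copy has the layout produced by the commands above. The following is a minimal sketch; the helper name check_cvs_layout is hypothetical, and it assumes the directory names freebsd/CVSROOT and freebsd/src used so far.

```shell
# Hypothetical helper: succeeds only if the given directory contains
# both a CVSROOT directory and the src module.
check_cvs_layout() {
    repo=$1
    [ -d "$repo/CVSROOT" ] && [ -d "$repo/src" ]
}
```

For example, check_cvs_layout freebsd && echo "layout looks fine" prints the message only when both directories are present.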

Preparing the CVS Conversion Tools

I wanted to use the git-cvsimport tool, but alas, it didn't work [1]. Instead, I used the cvs2svn tool, which supports Git as of version 2.1.

Tools Needed:

  • Python Version >= 2.2
  • Python GDBM Bindings (should be part of Python) [2]
  • cvs2svn Version >= 2.1

Prepare cvs2svn for the conversion:

$ fetch http://cvs2svn.tigris.org/files/documents/1462/41596/cvs2svn-2.1.0.tar.gz
$ tar xzf cvs2svn-2.1.0.tar.gz
$ mkdir freebsd-to-git
$ cd freebsd-to-git
$ cp ../cvs2svn-2.1.0/cvs2svn-example.options .
$ cp ../cvs2svn-2.1.0/test-data/main-cvsrepos/cvs2svn-git.options .

You then need to edit the cvs2svn-git.options file. Change the path to the CVS repository (look near run_options.add_project).

Make sure you remove the comment character before the fallback_encoding lines. Otherwise the log message conversion will fail because it can't convert all commit messages to ASCII. I also removed the comment character before the utf8 string, just above fallback_encoding, because some of the FreeBSD commit messages were in UTF-8 [3].
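
The settings mentioned above can be located with grep before opening the file in an editor. This is a minimal sketch: the helper name show_option_lines is hypothetical, and the exact text of the lines in your copy of the options file may differ from what the search strings assume.

```shell
# Hypothetical helper: print, with line numbers, every line of the
# options file that mentions one of the settings discussed above.
# -F treats the patterns as fixed strings rather than regular expressions.
show_option_lines() {
    grep -n -F -e 'run_options.add_project' -e 'fallback_encoding' -e 'utf8' "$1"
}
```

Running show_option_lines cvs2svn-git.options lists the candidate lines; the edits themselves are still made by hand.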

Importing the CVS Repository

Everything should be ready for the actual conversion. The following steps took me approximately 48 hours. Your mileage may vary.

$ env PYTHONPATH="$PWD/../cvs2svn-2.1.0/contrib" python ../cvs2svn-2.1.0/cvs2svn --options=cvs2svn-git.options
$ git init
$ cat cvs2svn-tmp/git-blob.dat cvs2svn-tmp/git-dump.dat | git fast-import

I noticed (and ignored) several warnings along the lines of:

branch '1.1.1' already has name 'ISC',
cannot also have name 'VIXIE', ignoring the latter
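
Once the fast-import step finishes, a few standard Git commands can give some confidence that the import is sane. This is a minimal sketch rather than part of the original procedure: the helper name summarize_import is hypothetical, and it assumes you run it from the top of the newly created repository.

```shell
# Hypothetical helper: run from the top of the freshly imported repository.
summarize_import() {
    # Verify the connectivity and validity of the imported objects.
    git fsck --full || return 1
    # Count the commits reachable from all branches and tags.
    echo "commits: $(git rev-list --all | wc -l | tr -d ' ')"
    # Report how many objects exist and how much disk space they use.
    git count-objects -v
}
```

The commit count and the size-pack figure give a rough cross-check against the numbers you expect from the CVS history.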

Examining Git Performance and Scalability

The following tests were performed using a local repository so that network access times would not be a factor. With a remote repository, the most significant cost should be the initial clone of a large Git repository.

Size of Repository

$ du -hs cvs
$ du -hs svn
$ du -hs git
CVS 1.7 GB
SVN 3.9 GB
Git 511 MB

Size of Working Directory

$ du -hs .    # run once from the top of each working copy
CVS 534 MB
SVN 1.1 GB
Git 995 MB

Time Required to Perform a Checkout

$ cvs -Q -d `pwd`/cvs co src
$ svn co -q file://`pwd`/svn/trunk
$ git clone git freebsd.git
CVS 11.56s user 79.36s system 24% cpu 6:09.16 total
SVN 38.64s user 203.96s system 17% cpu 22:57.05 total
Git 3.53s user 42.25s system 19% cpu 3:53.25 total

Time Required to Export HEAD

Time Required to Create and Checkout a Branch

$ # no CVS command was recorded
$ svn copy -m "make branch" TRUNK_URL BRANCH_URL; svn switch BRANCH_URL
$ git checkout -b pjjexp
CVS (no timing was recorded)
SVN 1.35s user 12.22s system 3% cpu 5:49.05 total
Git 0.37s user 4.09s system 7% cpu 56.524 total

Time Required to Retrieve Change Status

$ cvs status > /dev/null 2>&1
$ svn status
$ git status
CVS 1.42s user 29.60s system 18% cpu 2:45.72 total
SVN 0.74s user 6.34s system 5% cpu 2:16.22 total
Git 0.42s user 3.64s system 27% cpu 14.810 total

Time Required to Tag a Branch

Time Required to Retrieve History on a Single File

$ cvs log Makefile > /dev/null
$ svn log src/Makefile > /dev/null
$ git log src/Makefile > /dev/null
CVS 0.00s user 0.03s system 89% cpu 0.029 total
SVN 0.06s user 2.05s system 19% cpu 10.751 total
Git 2.46s user 4.13s system 37% cpu 17.627 total

Time Required to Retrieve Entire Project History

$ cvs log > /dev/null 2>&1
$ svn log > /dev/null 2>&1
$ git log > /dev/null 2>&1
CVS 6.99s user 71.31s system 31% cpu 4:09.50 total
SVN 11.38s user 130.37s system 15% cpu 14:53.16 total
Git 0.66s user 5.20s system 36% cpu 15.853 total
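
To put the wall-clock figures above side by side, the "total" column can be converted to seconds and one value divided by the other. This is a minimal sketch; the function names to_seconds and speedup are hypothetical helpers, not part of any tool used in these tests.

```shell
# Convert an elapsed time such as "6:09.16" or "56.524" to seconds.
# Fields are separated by ":" and each earlier field is worth 60 of the next.
to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.3f\n", s }'
}

# Print how many times longer the first elapsed time is than the second.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" 'BEGIN { printf "%.1f\n", a / b }'
}
```

For the checkout test, speedup 22:57.05 3:53.25 prints 5.9, meaning the SVN checkout took roughly six times as long as the Git clone.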

1. The git-cvsimport tool would get trapped in an infinite loop reporting warnings about bad patch sets. I tried to get it to work several times, and eventually gave up.

2. On my FreeBSD server, I needed to install the Python GDBM bindings separately.

3. I actually had to restart the conversion with the -p option, which allows you to start the process from a point other than the beginning. For example, -p 2: starts at pass 2 and continues through all remaining passes.

