Fast listing of SVN repository with svn-crawler

I already described an approach to list an SVN repository quickly. Now I’ve published the code as a separate tool:

https://sourceforge.net/projects/fastsvncrawler/

To compile it you need the libsvn-dev, make, and cmake packages:

$ sudo aptitude install libsvn-dev make cmake

Then perform the standard steps to build a CMake-based project:

$ mkdir build
$ cd build
$ cmake ..
$ make

To list all files under a URL, run:

$ ./svn-crawler <URL>

As expected, this tool is much faster than the official native SVN client. Here are some benchmarks:

$ svn --version
svn, version 1.6.17 (r1128011)
   compiled Nov 20 2011, 01:10:33
...

For the Subversion repository itself:

$ time svn ls --depth infinity http://svn.apache.org/repos/asf/subversion/trunk
...

real    1m17.487s
user    0m0.476s
sys     0m0.112s

$ time ./svn-crawler http://svn.apache.org/repos/asf/subversion/trunk
...
real    0m3.836s
user    0m0.184s
sys     0m0.060s

Figure: Time spent to list the Subversion repository

For the SVNKit repository:

$ time svn ls --depth infinity http://svn.svnkit.com/repos/svnkit
...
real    2m12.392s
user    0m0.448s
sys     0m0.064s

$ time ./svn-crawler http://svn.svnkit.com/repos/svnkit/trunk
...
real    0m18.712s
user    0m0.184s
sys     0m0.028s

Figure: Time spent to list the SVNKit repository


For the SQLJet repository:

$ time svn ls --depth infinity http://svn.sqljet.com/repos/sqljet/branches/1.0.x
...
real    1m23.513s
user    0m0.276s
sys     0m0.052s

$ time ./svn-crawler http://svn.sqljet.com/repos/sqljet/branches/1.0.x
...
real    0m5.916s
user    0m0.100s
sys     0m0.028s

Figure: Time spent to list the SQLJet repository
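
For a quick feel of the ratios, the "real" times above can be turned into speedup factors. The following is a minimal Java sketch; the class and method names are made up for this illustration and are not part of svn-crawler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Turns the "real" times printed by the shell's time command into speedup ratios. */
final class ListingSpeedup {
    // Matches values such as "1m17.487s" (hypothetical helper, not part of svn-crawler).
    private static final Pattern REAL_TIME = Pattern.compile("(\\d+)m(\\d+(?:\\.\\d+)?)s");

    /** Parses a value such as "1m17.487s" into seconds. */
    static double parseSeconds(String realTime) {
        Matcher m = REAL_TIME.matcher(realTime.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected time format: " + realTime);
        }
        return Integer.parseInt(m.group(1)) * 60 + Double.parseDouble(m.group(2));
    }

    /** How many times faster the second measurement is than the first. */
    static double speedup(String svnLs, String svnCrawler) {
        return parseSeconds(svnLs) / parseSeconds(svnCrawler);
    }

    public static void main(String[] args) {
        // The measurements are copied from the benchmark transcripts above.
        System.out.println("Subversion: " + speedup("1m17.487s", "0m3.836s"));
        System.out.println("SVNKit:     " + speedup("2m12.392s", "0m18.712s"));
        System.out.println("SQLJet:     " + speedup("1m23.513s", "0m5.916s"));
    }
}
```

With the numbers above this gives roughly 20x for Subversion, 7x for SVNKit, and 14x for SQLJet. Keep in mind that the SVNKit runs use different URLs (the repository root for svn ls, /trunk for svn-crawler), so that ratio is not a like-for-like comparison.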

Which repository is more compact: Git or SVN?

Git is reputed to have more compact repositories than SVN. Is it really true? Let’s check for ourselves.

To make the comparison fair we will use the SubGit utility, which, unlike git-svn, preserves more SVN and Git concepts in both the SVN-to-Git and Git-to-SVN directions (in the Git-to-SVN direction, git-svn simply loses commits on anonymous branches).

Original Git vs converted SVN.

For example, let’s convert the Git.git repository to SVN format.
First, create an empty SVN repository:

$ svnadmin create git.svn

Second, create a full clone of the Git.git repository at git.svn/.git:

$ git clone --mirror git://github.com/gitster/git.git git.svn/.git
Cloning into bare repository 'git.svn/.git'...
remote: Counting objects: 197301, done.
remote: Compressing objects: 100% (68281/68281), done.
remote: Total 197301 (delta 135873), reused 184664 (delta 125681)
Receiving objects: 100% (197301/197301), 38.57 MiB | 927 KiB/s, done.
Resolving deltas: 100% (135873/135873), done.

Third, check the Git repository size before the translation starts:

$ du git.svn/.git -s -h
44M     git.svn/.git

Finally, run SubGit on the empty SVN repository:

$ subgit install git.svn

This takes a rather long time because of the repository size.

When the translation is complete, let’s check the result:

$ cd git.svn

$ svn info file://`pwd`
Path: git.svn
URL: file:///.../git.svn
Repository Root: file:///.../git.svn
Repository UUID: afa562f0-e173-47e2-83e6-2452fde0775f
Revision: 32338
Node Kind: directory
Last Changed Author: Junio C Hamano
Last Changed Rev: 32338
Last Changed Date: 2012-10-22 00:58:48 +0400 (Mon, 22 Oct 2012)

$ git rev-list --branches --tags | wc -l
31573

As one can see, the translation is mostly lossless. The difference between the number of SVN revisions and the number of Git commits is explained by the fact that adding or removing a branch or tag doesn’t create a new commit in Git, whereas in SVN it does. To make sure that Git commits are mapped to SVN revisions, one can run

$ mkdir .git/refs/notes
$ cp .git/refs/svn/map .git/refs/notes/commits
$ git log

and see an SVN revision for every Git commit.

Now let’s check how large (or small?) the resulting SVN repository is. Currently it contains not only basic SVN data but also some SubGit files (including logs).

First, remove SubGit-specific data from the repository:

$ subgit uninstall --purge .

Second, move the Git repository outside of the SVN repository:

$ mv .git ../git.git

Now we have the most honest SVN analog of the Git.git repository. Check its size:

$ du -s -h
1,8G    .

Let’s show this fact with a picture:

Figure: SVN repository size vs Git repository size

Original SVN vs converted Git.

One may say, “This comparison is not honest: the Git repository was natural but the SVN repository was artificial.” OK, let’s convert some SVN repository to Git.
Unfortunately I can’t convert the Subversion repository at apache.org because the Apache guys tend to ban people who generate too many requests. Instead I’ll try the repository of SVNKit, another Subversion implementation (SVNKit already has an official Git repository, but I need the SVNKit repository locally anyway to estimate its size in SVN format).

Of course I could run svnrdump on it to get a dump, but fortunately I already have a (not so fresh) dump of the SVNKit repository locally.

First, create an empty SVN repository:

$ svnadmin create svnkit.svn

Second, load the dump into it:

$ svnadmin load svnkit.svn < svnkit.dump 2> /dev/null > /dev/null

Third, note the repository size and the dump size (the repository is not as large as Git.git, though):

$ cd svnkit.svn
$ svn info file://`pwd`
Path: svnkit.svn
URL: file:///.../svnkit.svn
Repository Root: file:///.../svnkit.svn
Repository UUID: 0a862816-5deb-0310-9199-c792c6ae6c6e
Revision: 7920
Node Kind: directory
Last Changed Author: semen
Last Changed Rev: 7920
Last Changed Date: 2011-09-15 19:25:27 +0400 (Thu, 15 Sep 2011)

$ du -s -h .
202M    .

$ du -h svnkit.dump
859M    svnkit.dump

Now let’s install SubGit into the repository:

$ subgit install .

Translated Git repository size (even with SubGit-related metadata) is:

$ du -s -h .git
74M     .git

We can run “git gc”, which is fair because Git will run it sooner or later anyway:

$ git gc --prune
Counting objects: 143789, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (40515/40515), done.
Writing objects: 100% (143789/143789), done.
Total 143789 (delta 76663), reused 141252 (delta 75412)
Checking connectivity: 143789, done.

$ du -s -h .git
62M     .git

Figure: SVN repository size vs Git repository size

The difference is not as large now but is still significant. Why does this happen? AFAIK this is because SVN keeps deltas between file contents in sequential revisions, while Git keeps deltas between the most similar contents. So it’s natural to expect that the larger the repositories are, the more compact the Git repository is compared to the Subversion one, which is consistent with our tests.

They say that Subversion 1.8 will have more compact repositories. Let’s wait and test!

SVNKit SvnRemoteXXX operations: one more common mistake

I’d like to describe one more mistake that one can run into while using SVNKit. Suppose you want to copy a file from a working copy directly to the repository. You write code like:

final File file = ...;
final SVNURL targetUrl = ...;

final SvnCopy copy = svnOperationFactory.createCopy();
copy.addCopySource(SvnCopySource.create(SvnTarget.fromFile(file), SVNRevision.BASE));
copy.setSingleTarget(SvnTarget.fromURL(targetUrl));
copy.run();

You run this code and get

org.tmatesoft.svn.core.SVNException: svn: E200007: Runner for 'org.tmatesoft.svn.core.wc2.SvnCopy' command have not been found; probably not yet implement in this API.
    at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:64)
    at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:51)
    at org.tmatesoft.svn.core.wc2.SvnOperationFactory.getImplementation(SvnOperationFactory.java:1340)
    at org.tmatesoft.svn.core.wc2.SvnOperationFactory.run(SvnOperationFactory.java:1227)
    at org.tmatesoft.svn.core.wc2.SvnOperation.run(SvnOperation.java:291)

I agree that the stack trace is a bit cryptic. Actually it tells us that the SvnCopy class can only copy from a remote repository to a working copy, or from a working copy to a working copy. To copy to a remote repository one should use the SvnRemoteCopy class instead:

final SvnRemoteCopy remoteCopy = svnOperationFactory.createRemoteCopy();
remoteCopy.addCopySource(SvnCopySource.create(SvnTarget.fromFile(file), SVNRevision.BASE));
remoteCopy.setSingleTarget(SvnTarget.fromURL(targetUrl));
remoteCopy.run();

The same is true about other SvnRemoteXXX operations.

What SVNKit resources should be disposed?

As in a previous post about the reusability of SVNKit objects, I’d like to write about the most common mistakes, this time related to closing resources.

SVNRepository instances should be closed

The SVNRepository class of SVNKit represents a connection to a remote (or possibly local) repository. Its SVNRepository#closeSession method closes that connection. Unfortunately people often forget to call it.

If an SVNRepository instance was obtained not from SVNRepositoryFactory but from an ISVNRepositoryPool, it should be closed if and only if it was created by passing mayReuse=false to ISVNRepositoryPool#createRepository. Otherwise the repository is controlled by the pool and is disposed together with the pool (by calling ISVNRepositoryPool#dispose).

For example, the following ways of closing an SVNRepository are correct:

final SVNRepository svnRepository = SVNRepositoryFactory.create(url);
svnRepository.closeSession();

A non-reusable repository connection should be closed explicitly:

final ISVNRepositoryPool repositoryPool = new DefaultSVNRepositoryPool(null, null);
try {
    final SVNRepository svnRepository = repositoryPool.createRepository(url, false);
    svnRepository.closeSession();
} finally {
    repositoryPool.dispose();
}

A reusable repository connection is closed together with the connection pool:

final ISVNRepositoryPool repositoryPool = new DefaultSVNRepositoryPool(null, null);
try {
    final SVNRepository svnRepository = repositoryPool.createRepository(url, true);
} finally {
    repositoryPool.dispose();
}

And of course every ISVNRepositoryPool instance should always be disposed in a finally block.

SVNClientManager instances should also be disposed with the dispose() method

I would say that this is the most common mistake. As I mentioned in a previous post, SVNClientManager aggregates an ISVNRepositoryPool instance and implements this interface. And of course it should be disposed; otherwise the connections it creates won’t be closed.

All other classes that have dispose() or close() methods should be disposed

Actually this goes without saying; I’m just reminding you of this evident fact.

Are SVNKit methods reenterable?

People often ask me which SVNKit objects can and which can’t be reused from different threads or while another operation is running on those objects.

SVNRepository methods are not reenterable

This means that the same SVNRepository instance can’t be used from several threads at the same time. It also means that the same SVNRepository object can’t be used within a single thread from a callback passed to another of its methods.

For example, this (rather useless) code, which checks that the paths returned by the SVNRepository#log method really exist:

final SVNRepository svnRepository = SVNRepositoryFactory.create(url);
try {
    svnRepository.log(new String[]{""}, 1, 2, true, true, new ISVNLogEntryHandler() {
        @Override
        public void handleLogEntry(SVNLogEntry logEntry) throws SVNException {
            final long revision = logEntry.getRevision();
            final Map<String,SVNLogEntryPath> changedPaths = logEntry.getChangedPaths();
            for (Map.Entry<String, SVNLogEntryPath> entry : changedPaths.entrySet()) {
                final String path = entry.getKey();

                //WRONG!!! svnRepository object can't be reused!
                final SVNNodeKind kind = svnRepository.checkPath(path, revision);
                System.out.println(kind);
            }
        }
    });
} finally {
    svnRepository.closeSession();
}

fails with

java.lang.Error: SVNRepository methods are not reenterable
	at org.tmatesoft.svn.core.io.SVNRepository.lock(SVNRepository.java:2820)
	at org.tmatesoft.svn.core.io.SVNRepository.lock(SVNRepository.java:2811)
	at org.tmatesoft.svn.core.internal.io.fs.FSRepository.openRepositoryRoot(FSRepository.java:767)
	at org.tmatesoft.svn.core.internal.io.fs.FSRepository.openRepository(FSRepository.java:758)
	at org.tmatesoft.svn.core.internal.io.fs.FSRepository.checkPath(FSRepository.java:205)
	at org.tmatesoft.svn.test.InfoTest$1.handleLogEntry(InfoTest.java:150)
	at org.tmatesoft.svn.core.internal.io.fs.FSLog.sendLog(FSLog.java:332)
	at org.tmatesoft.svn.core.internal.io.fs.FSLog.runLog(FSLog.java:162)
	at org.tmatesoft.svn.core.internal.io.fs.FSRepository.logImpl(FSRepository.java:381)
	at org.tmatesoft.svn.core.io.SVNRepository.log(SVNRepository.java:1035)
	at org.tmatesoft.svn.core.io.SVNRepository.log(SVNRepository.java:940)
	at org.tmatesoft.svn.core.io.SVNRepository.log(SVNRepository.java:864)

The same is true for reusing an SVNRepository object while committing to the repository.
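
To see why a nested call is rejected, it may help to look at a simplified model of such a connection. The class below is only an illustration written for this post, not SVNKit code: the names are hypothetical, and the real SVNRepository locking is more elaborate. It also shows one way around the problem: collect the data in the callback and query the connection only after the outer call has returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

/** Simplified model of a connection that refuses nested calls (NOT SVNKit code). */
final class NonReentrantConnection {
    private boolean busy; // true while an operation is in progress

    private synchronized void enter() {
        if (busy) {
            throw new IllegalStateException("methods are not reenterable");
        }
        busy = true;
    }

    private synchronized void leave() {
        busy = false;
    }

    /** Reports revisions from..to to the handler; the connection stays busy meanwhile. */
    void log(long from, long to, LongConsumer handler) {
        enter();
        try {
            for (long revision = from; revision <= to; revision++) {
                handler.accept(revision);
            }
        } finally {
            leave();
        }
    }

    /** Pretends to ask the server about a path; needs exclusive use of the connection. */
    String checkPath(String path, long revision) {
        enter();
        try {
            return "dir";
        } finally {
            leave();
        }
    }

    /** The mistake described above: calling back into the same connection. */
    static boolean nestedCallFails() {
        NonReentrantConnection connection = new NonReentrantConnection();
        try {
            connection.log(1, 2, revision -> connection.checkPath("", revision));
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    /** One way out: collect inside the callback, query after log() has returned. */
    static List<String> collectThenCheck() {
        NonReentrantConnection connection = new NonReentrantConnection();
        List<Long> revisions = new ArrayList<>();
        connection.log(1, 2, revision -> revisions.add(revision));
        List<String> kinds = new ArrayList<>();
        for (long revision : revisions) {
            kinds.add(connection.checkPath("", revision));
        }
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println("nested call fails: " + nestedCallFails());
        System.out.println("kinds: " + collectThenCheck());
    }
}
```

The other option, using a second connection inside the callback, works as well; it just costs one more open connection.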

SVNRepository#getCommitEditor starts a transaction. This transaction can be terminated in three ways:

  • By ISVNEditor#closeEdit call on the editor. In this case the transaction is committed (or rejected).
  • By ISVNEditor#abortEdit that terminates the transaction.
  • By any exception thrown by ISVNEditor methods.

In all other cases the transaction remains unfinished. While the transaction is not finished, a corresponding SVNRepository object can’t be reused. An example:

final SVNRepository svnRepository = SVNRepositoryFactory.create(url);
try {
    final ISVNEditor commitEditor = svnRepository.getCommitEditor("Commit message", null);
    commitEditor.openRoot(-1);

    //WRONG!!! svnRepository can't be reused until commitEditor.closeEdit(); is called
    svnRepository.checkPath("", -1);

    commitEditor.closeDir();
    commitEditor.closeEdit();
} finally {
    svnRepository.closeSession();
}

This code also fails with a similar stack trace. One of the most common mistakes is failing to abort the commit transaction when custom code throws an exception:

final SVNRepository svnRepository = SVNRepositoryFactory.create(url);
try {
    try {
        final ISVNEditor commitEditor = svnRepository.getCommitEditor("Commit message", null);
        commitEditor.openRoot(-1);

        //some code that can throw an exception
        if (2 + 2 == 4) {
            throw new SomeException();
        }

        commitEditor.closeDir();
        commitEditor.closeEdit();
    } catch (SomeException e) {
        e.printStackTrace();

        //the commit transaction should be closed here by commitEditor.abortEdit() call
    }
    //this call will fail because of unclosed transaction
    svnRepository.checkPath("", -1);
} finally {
    svnRepository.closeSession();
}

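This code is incorrect because the catch block should call commitEditor.abortEdit() to abort the commit transaction (which also means declaring commitEditor outside the inner try block so that it is visible in the catch block).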
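
The shape of the fix can be sketched without SVNKit on the classpath. In the sketch below CommitEditorModel is a hypothetical stand-in that borrows only the names closeEdit and abortEdit from ISVNEditor; the point is the control flow, in which the editor is obtained before the try block and every failure of the custom code ends the transaction.

```java
/** Sketch of "commit on success, abort on failure" (stand-in types, NOT SVNKit code). */
final class AbortOnFailure {
    /** Runs customCode; commits if it succeeds, aborts and rethrows if it throws. */
    static void commit(CommitEditorModel editor, Runnable customCode) {
        try {
            customCode.run();
        } catch (RuntimeException e) {
            // End the transaction so that the connection becomes usable again.
            editor.abortEdit();
            throw e;
        }
        editor.closeEdit();
    }

    /** Helper for the demo: reports how the transaction ended. */
    static String stateAfter(Runnable customCode) {
        RecordingEditor editor = new RecordingEditor();
        try {
            commit(editor, customCode);
        } catch (RuntimeException ignored) {
            // a real caller would log or rethrow here
        }
        return editor.state;
    }

    public static void main(String[] args) {
        System.out.println(stateAfter(() -> { }));
        System.out.println(stateAfter(() -> { throw new IllegalStateException("boom"); }));
    }
}

/** Hypothetical stand-in for the two editor methods that end a commit transaction. */
interface CommitEditorModel {
    void closeEdit(); // commit the transaction
    void abortEdit(); // cancel the transaction
}

/** Test double that records how the transaction ended. */
final class RecordingEditor implements CommitEditorModel {
    String state = "open";

    @Override
    public void closeEdit() {
        state = "committed";
    }

    @Override
    public void abortEdit() {
        state = "aborted";
    }
}
```

In real code the same structure applies to the snippet above: obtain the editor from getCommitEditor first, wrap the editing calls and the custom code in try, and call abortEdit() in the catch block before doing anything else with the SVNRepository.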

DefaultSVNRepositoryPool connections can’t be reused simultaneously

SVNKit uses the ISVNRepositoryPool interface to keep and reuse connections between Subversion requests. This approach significantly improves SVNKit performance, but the connection pool should be used carefully.

DefaultSVNRepositoryPool is the implementation of ISVNRepositoryPool provided by SVNKit. It keeps a “repository root” -> SVNRepository instance map and, on each ISVNRepositoryPool#createRepository invocation, either returns an existing connection or creates a new one.

Note that ISVNRepositoryPool does not know whether any of the connections it keeps has an operation in progress; it returns a connection whenever the requested URL matches the repository root of a saved connection. And from the previous section you know that an SVNRepository instance can’t be used while another operation is running on it.

For example:

final ISVNRepositoryPool repositoryPool = new DefaultSVNRepositoryPool(null, null);
try {
    final SVNRepository svnRepository1 = repositoryPool.createRepository(url, true);
    final SVNRepository svnRepository2 = repositoryPool.createRepository(url, true);
    final ISVNEditor commitEditor = svnRepository1.getCommitEditor("Commit message", null);
    commitEditor.openRoot(-1);

    //WRONG!!! svnRepository2 is the same object as svnRepository1!
    svnRepository2.checkPath("", -1);

    commitEditor.closeDir();
    commitEditor.closeEdit();
} finally {
    repositoryPool.dispose();
}

This code fails because repositoryPool.createRepository(url, true) returns the same instance for the second and all subsequent calls. Instead one should create the second connection with mayReuse=false and, of course, close it by hand afterwards, because it won’t be closed by ISVNRepositoryPool#dispose:

final ISVNRepositoryPool repositoryPool = new DefaultSVNRepositoryPool(null, null);
SVNRepository svnRepository2 = null;
try {
    final SVNRepository svnRepository1 = repositoryPool.createRepository(url, true);
    svnRepository2 = repositoryPool.createRepository(url, false);
    final ISVNEditor commitEditor = svnRepository1.getCommitEditor("Commit message", null);
    commitEditor.openRoot(-1);

    //Correct, svnRepository2 is another connection
    svnRepository2.checkPath("", -1);

    commitEditor.closeDir();
    commitEditor.closeEdit();
} finally {
    repositoryPool.dispose();
    if (svnRepository2 != null) {
        //it should be closed by hand because it was created with mayReuse=false
        svnRepository2.closeSession();
    }
}

This code is correct, though not symmetric. You can often find this pattern inside SVNKit itself, in operations where two connections are used at the same time.
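
The ownership rules described in this section can be summarised with a small model. It is a sketch under the assumptions stated in the text (a map from repository root to connection, plus a mayReuse flag); the class names are hypothetical and DefaultSVNRepositoryPool itself is more involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of a connection pool with a mayReuse flag (NOT SVNKit code). */
final class ConnectionPoolModel {
    /** Hypothetical connection that only remembers whether it was closed. */
    static final class Connection {
        final String root;
        boolean closed;

        Connection(String root) {
            this.root = root;
        }

        void close() {
            closed = true;
        }
    }

    private final Map<String, Connection> byRoot = new HashMap<>();

    /** Returns the pooled connection for the root, or a private one if mayReuse is false. */
    Connection create(String root, boolean mayReuse) {
        if (!mayReuse) {
            return new Connection(root); // owned by the caller, who must close it
        }
        return byRoot.computeIfAbsent(root, Connection::new);
    }

    /** Closes only the connections the pool owns. */
    void dispose() {
        for (Connection connection : byRoot.values()) {
            connection.close();
        }
        byRoot.clear();
    }

    public static void main(String[] args) {
        ConnectionPoolModel pool = new ConnectionPoolModel();
        Connection first = pool.create("root", true);
        Connection second = pool.create("root", true);
        Connection own = pool.create("root", false);
        System.out.println("pooled calls share one object: " + (first == second));
        pool.dispose();
        System.out.println("pooled closed: " + first.closed + ", private closed: " + own.closed);
    }
}
```

The model makes the asymmetry visible: dispose() never touches the connection created with mayReuse=false, so forgetting to close it by hand leaks it.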

SVNClientManager and SVNXXXClient can’t be reused

This is true for several reasons. First, SVNClientManager implements and aggregates ISVNRepositoryPool, which, as you now know, can’t be used for simultaneous operations. Second, because of the way SVNKit works, it can’t be shared between operations on working copies of the 1.7 format (otherwise you may get the error “svn: E200030: There are unfinished transactions detected in …”).

The reason is that SVNBasicClient encapsulates SvnOperationFactory, which encapsulates SVNWCContext, which encapsulates SVNWCDb, which contains

private Map<String, SVNWCDbDir> dirData;

This is a cache path->working_copy_root_data, where “working_copy_root_data” is a structure that contains a working copy root path and a database object (SVNSqlJetDb). This database object contains “openCount”, a transactions-in-progress counter that is increased when a transaction starts and decreased when it ends (in a thread-unsafe manner). If the operation has finished but openCount > 0 (for example, because the database is being used from another thread), you see the exception

svn: E200030: There are unfinished transactions detected in ...

So SVNSqlJetDb objects can’t be shared among threads. The same is true for callbacks.

Instead of reusing an SVNClientManager or SVNXXXClient instance, one should create a separate instance per thread. For the callback case you need at least two: one for the main operation and another for operations inside the callback. But note: these operations cannot modify the same working copy, because

Several working copy modification operations cannot run simultaneously

This is more a restriction of Subversion than of SVNKit. Before the WC 1.7 format, any Subversion working copy directory could be processed independently, which allowed modification operations to run in parallel as long as they ran on different directories.

Now every working copy modification operation locks the whole working copy until completion, and no other write operation can run at the same time.

Read-only operations, however, do not lock anything and can run at any time. A Subversion 1.7 working copy is based on transactions that move the working copy from one valid state to another. So a read-only operation that runs during a write operation will find the working copy in some intermediate but valid state.

EOLs in Git and SVN

This post explains how Subversion handles line endings, how Git copes with the same problem, and how not to lose those settings during Git-to-SVN or SVN-to-Git translation.

When team members use different OSes with different default EOLs, it’s important to let them work on the same files without causing an EOL mess or other problems. Everybody knows that Windows Notepad doesn’t like LFs. But not everybody knows about CRLF problems in shell scripts:

$ echo '#!/bin/sh' > test.sh
$ echo >> test.sh
$ echo 'echo Hello world!' >> test.sh

$ bash test.sh
Hello world!
$ dash test.sh
Hello world!

$ unix2dos test.sh
unix2dos: converting file test.sh to DOS format ...

$ bash test.sh
test.sh: line 2: $'\r': command not found
Hello world!

$ dash test.sh
: not found test.sh:
Hello world!

Rather strange behaviour for the XXI century. To avoid these problems, let’s take care of line endings.

EOLs in Subversion

Subversion controls line endings using the svn:eol-style property. Its valid values are:

  • native: when the file is checked out, its EOLs are converted to the current system default EOL (CRLF on Windows, LF on Linux)
  • LF: when the file is checked out, its EOLs are converted to LFs
  • CRLF: the same, but to CRLFs
  • CR: the same, but to CRs
  • the property is not set: the file is treated as binary, with no EOL management

Note that in the SVN repository a file can have arbitrary EOLs (and pristine files (see my post about pristine files) contain the same EOLs as the repository does); the conversion is performed when the working copy file is created. Usually Subversion clients make sure that the repository contents correspond to svn:eol-style, i.e. a file with svn:eol-style=LF has LF endings in the repository and a file with svn:eol-style=CRLF has CRLFs. In the special case of svn:eol-style=native the file is stored with LFs. But the Subversion remote API doesn’t check for EOL inconsistencies. If the repository is rather old or a buggy client was used to work with it, there’s a chance of meeting other combinations.
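
The checkout-time conversion described above can be sketched in a few lines of Java. This is only an illustration of the rules in the list, with hypothetical names; it works on strings rather than bytes and makes no claim about how Subversion implements the translation.

```java
import java.util.regex.Matcher;

/** Sketch of svn:eol-style translation on checkout (illustration only). */
final class EolStyle {
    /** Returns the line separator for an svn:eol-style value, or null if the property is not set. */
    static String separatorFor(String eolStyle) {
        if (eolStyle == null) {
            return null; // no property: no EOL management
        }
        switch (eolStyle) {
            case "native":
                return System.lineSeparator();
            case "LF":
                return "\n";
            case "CRLF":
                return "\r\n";
            case "CR":
                return "\r";
            default:
                throw new IllegalArgumentException("Unknown svn:eol-style: " + eolStyle);
        }
    }

    /** Rewrites every CRLF, lone CR, or LF to the separator implied by eolStyle. */
    static String toWorkingCopy(String repositoryText, String eolStyle) {
        String separator = separatorFor(eolStyle);
        if (separator == null) {
            return repositoryText;
        }
        // CRLF must come first in the alternation so that it is not split into CR + LF.
        return repositoryText.replaceAll("\r\n|\r|\n", Matcher.quoteReplacement(separator));
    }

    public static void main(String[] args) {
        String mixed = "line 1\r\nline 2\nline 3\r";
        System.out.println(toWorkingCopy(mixed, "LF").equals("line 1\nline 2\nline 3\n"));
        System.out.println(toWorkingCopy(mixed, null).equals(mixed));
    }
}
```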

There’s one more Subversion property that relates to line endings: svn:mime-type. When you set svn:eol-style on a file, you tell Subversion that the file is a text file, so if you also set svn:mime-type on the same file, its value should start with “text/”. Otherwise Subversion will report an inconsistency between svn:eol-style and svn:mime-type and will refuse to work. One of the most common mistakes is to set svn:eol-style to some value and svn:mime-type to application/xml; use text/xml instead, because application/xml means “not human-readable XML”.

It is usually recommended to configure Subversion autoproperties to set svn:eol-style to native, and to set it to LF only for shell scripts.

EOLs in Git

Line endings in Git for individual files are controlled by the Git attributes “text” and “eol”. The first one tells Git that the file is not binary. Possible “text” attribute values are:

  • auto: check for binary (Git thinks that the file is binary if the first 8kb contains at least one zero byte)
  • the attribute is set: the file is treated as text
  • the attribute is unset: the file is treated as binary
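
The “auto” check described in the list amounts to scanning the beginning of the file for a zero byte. Here is a minimal Java sketch of that heuristic as stated above; the window size of 8000 bytes is an assumption standing in for “8kb”, the names are hypothetical, and Git's own detection may apply further rules.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of the "zero byte near the start means binary" heuristic (illustration only). */
final class BinaryGuess {
    // Assumed size of the inspected prefix; the text above says "the first 8kb".
    static final int WINDOW = 8000;

    /** Returns true if a zero byte occurs within the first WINDOW bytes. */
    static boolean looksBinary(byte[] content) {
        int limit = Math.min(content.length, WINDOW);
        for (int i = 0; i < limit; i++) {
            if (content[i] == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] text = "line 1\nline 2\n".getBytes(StandardCharsets.US_ASCII);
        byte[] zeros = new byte[512]; // similar to the output of dd if=/dev/zero
        System.out.println("text looks binary:  " + looksBinary(text));
        System.out.println("zeros look binary:  " + looksBinary(zeros));
    }
}
```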

The “eol” attribute is an analog of svn:eol-style. Its values are:

  • lf: working tree file line endings are converted to LFs; the blob is assumed to contain LFs
  • crlf: the same, but to CRLFs; the blob is still assumed to contain LFs
  • the attribute is unspecified (!eol) but the “text” attribute is set: the file’s line endings are determined by the core.eol config option (whose possible values are lf, crlf, and native, the default)
  • the attribute is unspecified and the “text” attribute is unset: the file is treated as binary (note that if the “eol” attribute is set, the “text” value is ignored and assumed to be set)

So for most files “/file text !eol” is the best option (its meaning corresponds to svn:eol-style=native). For individual LF and CRLF files the settings are “/lf_file eol=lf” and “/crlf_file eol=crlf”.

One may even use a *-rule to create an analog of Subversion autoproperties (this rule is applied to every newly created file): “* text=auto !eol”. “text=auto” takes care of binary files.

But be careful: if the “eol” attribute is set, the blob should already contain LFs. Otherwise you’ll have a problem: to find out whether a file with the “eol” attribute has changed, Git converts it to LFs, calculates the SHA-1 for it (as if it were a blob) and compares it with the corresponding hash in the object database. If the blob in the object database contains CRLFs, these hashes won’t be equal:

$ git init repo
$ cd repo
$ echo "line 1" >> file
$ echo "line 2" >> file
$ unix2dos file
$ git add file
$ git commit -m "Added a file with CRLFs"

#now the database contains the file with CRLFs, let's change its eol attribute
$ echo "/file eol=crlf" > .gitattributes

$ git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       .gitattributes

#let's say to Git it should reread the file contents
$ touch file

$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   file
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       .gitattributes

#ok, we have some unexpected changes, let's try to discard them:
$ git reset --hard HEAD
HEAD is now at 14fd0ce Added a file with CRLFs

$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   file
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       .gitattributes
no changes added to commit (use "git add" and/or "git commit -a")

#they can't be discarded!!
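
The hash mismatch behind this behaviour can be reproduced outside Git. A blob id is the SHA-1 of the header “blob <size>\0” followed by the content, so the stored CRLF content and its LF-normalised form get different ids. The Java sketch below (hypothetical class name, JDK only) computes both.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Computes Git blob ids to show that CRLF and LF contents hash differently. */
final class GitBlobId {
    /** SHA-1 over "blob <size>\0" + content, as a lowercase hex string. */
    static String blobId(byte[] content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(("blob " + content.length + "\0").getBytes(StandardCharsets.US_ASCII));
            sha1.update(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /** Normalises CRLF to LF, the form the blob is assumed to have when "eol" is set. */
    static byte[] crlfToLf(byte[] content) {
        String text = new String(content, StandardCharsets.ISO_8859_1);
        return text.replace("\r\n", "\n").getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] stored = "line 1\r\nline 2\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("stored blob (CRLF):     " + blobId(stored));
        System.out.println("normalised (LF) hashes: " + blobId(crlfToLf(stored)));
    }
}
```

Since the two ids differ, Git keeps reporting the file as modified, and resetting cannot help because the committed blob itself is the one with CRLFs.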

Looks like Git is a bit stupid here; it could be more tolerant of EOL settings changes. To cope with this, one can unset the attributes for the file so that it is treated as binary again, change its EOLs to LF (dos2unix), commit to put it into the database, and then set “text” and/or “eol” again. But this is the only problem that appears with the attributes-based approach. Hope it will be fixed soon.

Why not to use git-svn if you care about EOLs?

Git-svn is a Perl script that tries to convert Git contents to Subversion and vice versa. But it doesn’t perform EOL conversion at all: it completely ignores both the svn:eol-style property and Git attributes, converting Subversion repository contents to Git blobs without touching the corresponding properties and attributes.

As we know, Subversion stores all text files with LFs if svn:eol-style=native. So while cloning a Subversion repository, git-svn converts those text files to blobs with LFs, and on Windows Git won’t convert their EOLs to native on checkout (which is inconsistent with the semantics of the Subversion property).

Should one set core.autocrlf?

People usually recommend the core.autocrlf=true config setting (roughly equivalent to setting the “* eol=crlf” attribute), but it is a weapon of mass destruction: it converts to CRLFs not only files with svn:eol-style=native but also files with svn:eol-style=LF. Also, if core.autocrlf=true, then when any file is added to the Git database Git converts the blob’s EOLs to LFs, and the file is sent to Subversion in this form; for files with svn:eol-style=CRLF this results in inconsistencies between the Subversion file contents and the svn:eol-style property.

My answer is no. To convert svn:eol-style to Git attributes correctly one should set attributes carefully for each file. Git attributes have higher priority than core.autocrlf, so if the corresponding attributes are set core.autocrlf value is ignored.

How can I do that automatically?

Fortunately there’s a Git-SVN bridge called SubGit. In short: you install it into the repository and it converts Subversion revisions to Git commits and vice versa. Among other features it performs svn:eol-style+svn:mime-type conversion to “text” and “eol” attributes:

$ svnadmin create repo
$ subgit install repo
$ git clone repo repo-git
$ cd repo-git
$ echo "line 1" >> native_file
$ echo "line 2" >> native_file
$ cp native_file lf_file
$ cp native_file crlf_file
$ cp native_file binary_file
$ unix2dos crlf_file
unix2dos: converting file crlf_file to DOS format ...
$ cp native_file auto_native_file
$ dd if=/dev/zero of=auto_binary_file count=1
$ nano .gitattributes #edit .gitattributes this way:

$ cat .gitattributes
/native_file text !eol
/crlf_file eol=crlf
/lf_file eol=lf
/binary_file -text
/auto_native_file text=auto !eol
/auto_binary_file text=auto !eol

$ git add .gitattributes
$ git add *_file
$ git commit -m "Added different files"
$ git push origin master
Counting objects: 4, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (4/4), 353 bytes, done.
Total 4 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (4/4), done.
To /tmp/repo
 * [new branch]      master -> master

$ cd ../repo
$ svn proplist -v --depth infinity file://`pwd`
Properties on 'file:///tmp/repo/trunk/lf_file':
  svn:eol-style
    LF
Properties on 'file:///tmp/repo/trunk/crlf_file':
  svn:eol-style
    CRLF
Properties on 'file:///tmp/repo/trunk/native_file':
  svn:eol-style
    native
Properties on 'file:///tmp/repo/trunk/auto_binary_file':
  svn:mime-type
    application/octet-stream
Properties on 'file:///tmp/repo/trunk/auto_native_file':
  svn:eol-style
    native

Subversion remote API: committing without working copy

Can you do something similar with Git? I’m sure: no. In my previous post I described Subversion API basics. Now I’d like to give one more example of editor-based remote API usage: commit creation on-the-fly.

Subversion has API bindings for the most popular programming languages. This time let’s use Java.

There are two ways to use Subversion from Java. The first is the JavaHL API of Subversion. There are two implementations of this API: native Subversion (compiled Subversion libraries + JNI) and SVNKit (pure Java implementation). The advantages of the native implementation are performance and stability. The problem is that if something goes wrong in the native implementation the JVM crashes, whereas if something goes wrong in SVNKit an exception is thrown.

But since in this post I want to show the power of the remote API, the JavaHL interface doesn’t suit, because it provides only the client API (see the first picture of my previous post; the client API requires a working copy to commit). In contrast, SVNKit provides all Subversion APIs for Java (just as native Subversion provides all APIs for C).

The central class of the SVNKit remote API is SVNRepository (it corresponds to svn_ra_session_t in the C interface). It represents a connection to a particular URL over a particular protocol. After working with an SVNRepository, the connection should be closed with SVNRepository#closeSession (unless you use ISVNRepositoryPool).

Let’s consider an example:

.......

public class CommitWithoutWorkingCopy {

    public static void main(String[] args) {
        FSRepositoryFactory.setup();
        DAVRepositoryFactory.setup();
        SVNRepositoryFactoryImpl.setup();

        SVNRepository svnRepository = null;
        try {
            svnRepository = SVNRepositoryFactory.create(
                    SVNURL.parseURIEncoded("file:///tmp/test"));

            SVNDeltaGenerator deltaGenerator = new SVNDeltaGenerator();

            ISVNEditor commitEditor;
            String checksum;
            long latestRevision;
            SVNCommitInfo commitInfo;

            commitEditor = svnRepository.getCommitEditor(
                    "My first commit message", null);
            commitEditor.targetRevision(-1);
            commitEditor.openRoot(-1);
            commitEditor.addDir("trunk", null, -1);
            // a directory property is set on the currently open directory (trunk)
            commitEditor.changeDirProperty("directoryPropertyName",
                    SVNPropertyValue.create("directoryPropertyValue"));
            commitEditor.addFile("trunk/file", null, -1);
            commitEditor.changeFileProperty("trunk/file",
                    "filePropertyName",
                    SVNPropertyValue.create("filePropertyValue"));
            commitEditor.applyTextDelta("trunk/file", null);

            final ByteArrayInputStream fileContentsStream =
                    new ByteArrayInputStream("File contents".getBytes());
            try {
                checksum = deltaGenerator.sendDelta("trunk/file",
                        fileContentsStream, commitEditor, true);
            } finally {
                try {
                    fileContentsStream.close();
                } catch (IOException e) {
                    //ignore
                }
            }
            commitEditor.closeFile("trunk/file", checksum);
            commitEditor.closeDir(); // trunk
            commitEditor.addDir("branches", null, -1);
            commitEditor.closeDir(); // branches
            commitEditor.addDir("tags", null, -1);
            commitEditor.closeDir(); // tags
            commitEditor.closeDir(); // root, opened by openRoot
            commitInfo = commitEditor.closeEdit();

            latestRevision = commitInfo.getNewRevision();
            System.out.println("Committed revision " + latestRevision);

            commitEditor = svnRepository.getCommitEditor(
                    "My second commit message", null);
            commitEditor.targetRevision(-1);
            commitEditor.openRoot(1);
            commitEditor.openDir("branches", 1);
            commitEditor.addDir("branches/branch", "/trunk", 1);
            commitEditor.closeDir();
            commitEditor.closeDir();
            commitEditor.deleteEntry("tags", 1);
            commitEditor.closeDir();
            commitInfo = commitEditor.closeEdit();

            latestRevision = commitInfo.getNewRevision();
            System.out.println("Committed revision " + latestRevision);

        } catch (SVNException e) {
            e.printStackTrace();
        } finally {
            if (svnRepository != null) {
                svnRepository.closeSession();
            }
        }
    }
}

Do you understand what happens here? It is just the opposite of the update-like calls. Here you get an editor and call its methods yourself, while for update/status/switch/diff you call a method such as SVNRepository#update or SVNRepository#status and provide your own editor for the server to drive. This is the beauty of the Subversion API.

You just crawl the tree under the URL for which the SVNRepository object was created and describe your changes. The new revision is created only at ISVNEditor#closeEdit; at that moment the transaction is either committed or rejected. Until you call ISVNEditor#closeEdit you never know whether someone else has committed to the same repository in the meantime. Since you cannot know the latest repository state, you send the delta against a specific revision rather than against the latest one. That is what revision 1 means in the following code:

commitEditor.openDir("branches", 1);
//the delta is sent against r1
commitEditor.closeDir();

If -1 is used instead of a specific revision, the changes are applied to the latest repository state.
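
To make the role of the base revision more tangible, here is a toy model in plain Java. It is not SVNKit code, and ToyRepository and tryCommit are names made up for this sketch. It assumes that the server rejects a change when the path was modified after the revision the delta was made against, and that it skips this check for -1.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: real Subversion does far more than this.
final class ToyRepository {
    private long latestRevision = 0;
    // path -> revision in which the path was last changed
    private final Map<String, Long> lastChanged = new HashMap<>();

    long latestRevision() {
        return latestRevision;
    }

    // Tries to commit a change of one path. baseRevision is the revision the
    // delta was made against; -1 means "apply to whatever is latest".
    // Returns false if the commit is rejected as out of date.
    boolean tryCommit(String path, long baseRevision) {
        long changedIn = lastChanged.getOrDefault(path, 0L);
        if (baseRevision != -1 && changedIn > baseRevision) {
            return false; // the path changed after our base revision
        }
        latestRevision++;
        lastChanged.put(path, latestRevision);
        return true;
    }

    public static void main(String[] args) {
        ToyRepository repository = new ToyRepository();
        System.out.println(repository.tryCommit("trunk/file", -1)); // creates r1
        System.out.println(repository.tryCommit("trunk/file", 1));  // based on r1, creates r2
        System.out.println(repository.tryCommit("trunk/file", 1));  // r1 is stale by now
    }
}
```

In this model the third call fails because the file was already changed in r2, while a call with -1 would always go through.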

As one can see, Subversion verifies a checksum for every file:

commitEditor.closeFile("trunk/file", checksum);

If the file was changed rather than added, the checksum of its base contents should also be passed to the ISVNEditor#applyTextDelta call. So the checksum of every file is verified twice: before and after applying the delta. If either checksum is wrong, the commit is rejected.
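
If you want to compute such a checksum yourself, the sketch below may help. It assumes the checksum is the hexadecimal MD5 digest of the full file contents (the C listing example further down prints it as “md5sum”). Md5Checksum and md5Hex are hypothetical names and use only the JDK, not SVNKit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Checksum {

    // Returns the MD5 digest of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] contents) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(contents);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] contents = "File contents".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(contents));
    }
}
```

For an empty file this gives d41d8cd98f00b204e9800998ecf8427e, the well-known MD5 of empty input.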

One more detail that is not very evident: all paths that do not start with “/” are relative to the URL of the SVNRepository object (“file:///tmp/test” in my example), but paths that start with “/” are relative to the repository root, which may differ from the URL the connection was created for. In my example “file:///tmp/test” is also the repository root, which one can check by calling SVNRepository#getRepositoryRoot.
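
The rule can be written down as a small helper. This is only a sketch of the rule as described here, not SVNKit code: EditorPaths and resolve are made-up names, and the session URL “file:///tmp/test/branches” is a hypothetical one chosen so that it differs from the repository root.

```java
final class EditorPaths {

    // A leading "/" means "relative to the repository root"; any other path
    // is relative to the URL the session was opened for.
    static String resolve(String sessionUrl, String repositoryRoot, String path) {
        if (path.startsWith("/")) {
            return stripTrailingSlash(repositoryRoot) + path;
        }
        if (path.isEmpty()) {
            return stripTrailingSlash(sessionUrl);
        }
        return stripTrailingSlash(sessionUrl) + "/" + path;
    }

    private static String stripTrailingSlash(String url) {
        return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
    }

    public static void main(String[] args) {
        String root = "file:///tmp/test";
        String session = "file:///tmp/test/branches"; // hypothetical session URL
        System.out.println(resolve(session, root, "branch"));
        System.out.println(resolve(session, root, "/trunk"));
    }
}
```

With these values “branch” resolves below the session URL, while “/trunk” resolves below the repository root.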

The example code produces the following history when run on an empty repository:

------------------------------------------------------------------------
r2 | (no author) | 2012-07-20 03:14:30 +0200 (Fri, 20 Jul 2012) | 1 line
Changed paths:
   A /branches/branch (from /trunk:1)
   D /tags

My second commit message
------------------------------------------------------------------------
r1 | (no author) | 2012-07-20 03:14:30 +0200 (Fri, 20 Jul 2012) | 1 line
Changed paths:
   A /branches
   A /tags
   A /trunk
   A /trunk/file

My first commit message
------------------------------------------------------------------------

Subversion remote API: listing repository with “status” request

One of the strongest sides of Subversion is its nice API. It includes the working copy API, the remote API, the client API, and the repository API.

SVN API

  • Client API is the replacement of the CLI for programs. It consists of analogs of all command line calls like “checkout”, “update”, “propset” and so on. Usually every function of the client API, depending on whether its arguments are URLs or paths, calls the working copy API and/or the remote API.
  • Working copy API consists of low-level working copy operations. Such a working copy abstraction allows Subversion to change the working copy format without touching other functionality.
  • Remote API consists of functions for working with a remote repository. This API doesn’t require a working copy to exist and allows one to work with a remote SVN repository without any working copy at all.
  • Repository API is used on the server side and works with the different Subversion repository formats.

Maybe I’ve missed some other APIs, but I consider them less important. In my opinion the remote API is the most interesting one, because it allows working with an SVN repository directly. It consists of different requests that are transferred over the network using the protocols SVN supports: DAV, SVN and the file protocol. The requests are executed on the server and the answer is returned in the form of callbacks.

I would divide nearly all remote API requests into 3 groups:

  • editor-based: update, diff, status, …;
  • log-like: log, “get eligible mergeinfo”;
  • “cheap requests”: “get dir”, “get latest revision”, “info”.

Cheap requests usually get information about only one node (directory or file). Log-like requests usually return a sequence of “log entry” structures (usually containing revision, author, date, and changed paths), one per revision. Editor-based requests crawl the directories within one revision.

All editor-based calls have the following structure:

Editor-based requests

The reporter is a replacement for the working copy. Its three main functions are set_path, delete_path, and link_path. They describe the working copy state you have locally (to describe a working copy state one doesn’t actually need to have a working copy).

In return the server describes the actions you should apply to that working copy in order to reach the state of some revision. The actions are given in the form of editor calls.

For example, suppose the history looks like this:

------------------------------------------------------------------------
r2 | root | 2012-07-15 13:03:58 +0000 (Sun, 15 Jul 2012) | 1 line
Changed paths:
   A /trunk/file

Added a file.
------------------------------------------------------------------------
r1 | root | 2012-07-15 13:03:30 +0000 (Sun, 15 Jul 2012) | 1 line
Changed paths:
   A /branches
   A /tags
   A /trunk

Initial.
------------------------------------------------------------------------

If we describe our working copy with (pseudo-code, uninteresting parameters omitted):

set_path("", 0)
set_path("trunk", 2)
delete_path("trunk/file")

we tell the server: “My whole working copy is at the state of revision 0, except trunk, which is at the state of revision 2, but trunk/file is deleted and I don’t have it locally”.

If we call “update” to revision 2, the server will send the commands (pseudo-code):

target_revision(2)
open_root(0)
add_directory("branches")
close_directory() //for branches
add_directory("tags")
close_directory() //for tags
open_directory("trunk", 2)
add_file("trunk/file")
//send file contents --- some calls, let's omit them
close_file() //for trunk/file
close_directory() //for trunk
close_directory() //for root
close_edit()

If we call “update” to revision 1, the server will send the commands (pseudo-code):

target_revision(1)
open_root(0)
add_directory("branches")
close_directory() //for branches
add_directory("tags")
close_directory() //for tags
//open_directory for trunk will only be called if trunk has properties
//changed in r2; otherwise the trunk state described is already the desired state
close_directory() //for root
close_edit()

If we call “update” to revision 0, the server will send the commands (pseudo-code):

target_revision(0)
open_root(0)
delete_entry("trunk", 2)
close_directory() //for root
close_edit()

Usually the reporter crawls the working copy to generate the correct set_path/link_path/delete_path sequence, and the editor calls are usually applied to the working copy, used to generate a patch, or used to show the status. But both crawling the working copy and applying the changes are optional.
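
As a minimal illustration that an editor does not have to touch a working copy, here is a plain Java sketch. PathCollectingEditor is a made-up class, not part of any Subversion binding: it only remembers the paths it is told to add. The replayUpdateToR2 method replays by hand the calls of the “update to revision 2” example above, so no server is involved.

```java
import java.util.ArrayList;
import java.util.List;

final class PathCollectingEditor {
    private final List<String> added = new ArrayList<>();
    private long targetRevision = -1;
    private int openDirectories = 0;

    void targetRevision(long revision) { targetRevision = revision; }
    void openRoot(long baseRevision) { openDirectories++; }
    void openDirectory(String path, long baseRevision) { openDirectories++; }
    void addDirectory(String path) { openDirectories++; added.add(path + "/"); }
    void addFile(String path) { added.add(path); }
    void closeFile() { }
    void closeDirectory() { openDirectories--; }

    long targetRevision() { return targetRevision; }
    List<String> addedPaths() { return added; }

    // True once every opened or added directory has been closed again.
    boolean isBalanced() { return openDirectories == 0; }

    // Replays the editor calls of the "update to revision 2" example.
    static PathCollectingEditor replayUpdateToR2() {
        PathCollectingEditor editor = new PathCollectingEditor();
        editor.targetRevision(2);
        editor.openRoot(0);
        editor.addDirectory("branches");
        editor.closeDirectory(); // branches
        editor.addDirectory("tags");
        editor.closeDirectory(); // tags
        editor.openDirectory("trunk", 2);
        editor.addFile("trunk/file");
        editor.closeFile();
        editor.closeDirectory(); // trunk
        editor.closeDirectory(); // root
        return editor;
    }

    public static void main(String[] args) {
        System.out.println(replayUpdateToR2().addedPaths());
    }
}
```

The collected list contains exactly what is missing locally: branches, tags and trunk/file.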

A set_path call actually has several parameters (not only path and revision). One of them is start_empty. If it is true, the path is considered locally empty and the server should send all its contents; the revision parameter is then ignored. For example, “svn checkout” and “svn export” use a set_path(“”, ignored_revision, start_empty=TRUE) call to describe the working copy.

Another parameter is depth. It is used for sparse working copy operations. For the example above, the following report tells the server not to send the “trunk” contents in the case of “update” to r2:

set_path("", 0)
set_path("trunk", 1, depth=empty)

In contrast, in this case all the “trunk” contents will be sent (“update” to r2):

set_path("", 0)
set_path("trunk", 2, start_empty=TRUE)

The only difference between the “status” and “update” requests is that “status” doesn’t request the file contents. So it can be used just to list the repository paths and properties. Here’s an example:

#include <stdio.h>

#include <svn_client.h>
#include <svn_auth.h>
#include <svn_delta.h>
#include <svn_ra.h>

static svn_error_t *
set_target_revision(void *edit_baton,
                    svn_revnum_t target_revision,
                    apr_pool_t *pool) {
    // svn_revnum_t is a long, so print it with %ld
    fprintf(stderr, "listing revision\t\t\t%ld\n", (long) target_revision);
    return SVN_NO_ERROR;
}

static svn_error_t *
open_root(void *edit_baton,
          svn_revnum_t base_revision,
          apr_pool_t *pool,
          void **dir_baton) {
    fprintf(stderr, "entered root directory\n");
    return SVN_NO_ERROR;
}

static svn_error_t *
delete_entry(const char *path,
             svn_revnum_t revision,
             void *parent_baton,
             apr_pool_t *pool) {
    return SVN_NO_ERROR;
}

static svn_error_t *
add_directory(const char *path,
              void *parent_baton,
              const char *copyfrom_path,
              svn_revnum_t copyfrom_revision,
              apr_pool_t *pool,
              void **child_baton) {
    fprintf(stderr, "entered directory\t\t\t%s\n", path);
    return SVN_NO_ERROR;
}

static svn_error_t *
open_directory(const char *path,
               void *parent_baton,
               svn_revnum_t base_revision,
               apr_pool_t *pool,
               void **child_baton) {
    return SVN_NO_ERROR;
}

static svn_error_t *
change_dir_prop(void *dir_baton,
                const char *name,
                const svn_string_t *value,
                apr_pool_t *pool) {
    return SVN_NO_ERROR;
}

static svn_error_t *
close_directory(void *dir_baton,
                apr_pool_t *pool) {
    fprintf(stderr, "left directory\n");
    return SVN_NO_ERROR;
}

static svn_error_t *
add_file(const char *path,
         void *parent_baton,
         const char *copyfrom_path,
         svn_revnum_t copyfrom_revision,
         apr_pool_t *pool,
         void **file_baton) {
    fprintf(stderr, "entered file     \t\t\t%s\n", path);
    return SVN_NO_ERROR;
}

static svn_error_t *
open_file(const char *path,
          void *parent_baton,
          svn_revnum_t base_revision,
          apr_pool_t *pool,
          void **file_baton) {
  return SVN_NO_ERROR;
}

static svn_error_t *
apply_textdelta(void *file_baton,
                const char *base_checksum,
                apr_pool_t *pool,
                svn_txdelta_window_handler_t *handler,
                void **handler_baton) {
  return SVN_NO_ERROR;
}

static svn_error_t *
change_file_prop(void *file_baton,
                 const char *name,
                 const svn_string_t *value,
                 apr_pool_t *pool) {
  return SVN_NO_ERROR;
}

static svn_error_t *
close_file(void *file_baton,
           const char *text_checksum,
           apr_pool_t *pool) {
  // text_checksum may be NULL, so don't pass it to %s unchecked
  fprintf(stderr, "left file, md5sum = %s\n", text_checksum ? text_checksum : "(none)");
  return SVN_NO_ERROR;
}

static svn_error_t *
close_edit(void *edit_baton,
           apr_pool_t *pool) {
  fprintf(stderr, "listing finished\n");
  return SVN_NO_ERROR;
}

static svn_error_t *
auth_callback(svn_auth_cred_username_t **cred, void *baton, const char *realm, svn_boolean_t may_save, apr_pool_t *pool) {
    if (cred) {
        svn_auth_cred_username_t *ret = apr_pcalloc (pool, sizeof (*ret));
        ret->username = apr_pstrdup(pool, "username");
        *cred = ret;
    }
    return SVN_NO_ERROR;
}

int main(int argc, char **argv) {
    apr_pool_t* pool;
    const char* url = "file:///path/to/svn/repository";

    apr_pool_initialize();
    apr_pool_create_ex(&pool, NULL, NULL, NULL);

    // initialize remote access API
    svn_ra_initialize(pool);

    svn_ra_callbacks2_t* callbacks;
    svn_ra_create_callbacks(&callbacks, pool);

    svn_ra_session_t* session;
    svn_error_t* error = svn_ra_open4(&session, NULL, url, NULL, callbacks, NULL, NULL, pool);

    if (!error) {
        const svn_ra_reporter3_t* status_reporter;
        void* reporter_baton;

        // revision to list (SVN_INVALID_REVNUM means HEAD revision)
        svn_revnum_t revision = SVN_INVALID_REVNUM;

        // setup our editor
        svn_delta_editor_t *editor = svn_delta_default_editor(pool);
        editor->set_target_revision = set_target_revision;
        editor->open_root = open_root;
        editor->add_directory = add_directory;
        editor->close_directory = close_directory;
        editor->add_file = add_file;
        editor->close_file = close_file;

        // run status call
        error = svn_ra_do_status2(session, &status_reporter, &reporter_baton, "", revision, svn_depth_infinity, editor, NULL, pool);

        // report our virtual working copy as empty (start_empty=TRUE)
        if (!error)
            error = status_reporter->set_path(reporter_baton, "", 0, svn_depth_infinity, TRUE, NULL, pool);
        // finishing the report makes the server drive our editor
        if (!error)
            error = status_reporter->finish_report(reporter_baton, pool);
        if (error)
            fprintf(stderr, "Unable to list %s: %s\n", url, error->message);
    } else {
        fprintf(stderr, "Unable to open connection to %s: %s\n", url, error->message);
    }

    apr_pool_destroy(pool);
    apr_pool_terminate();
    return 0;
}

To compile the code on Debian we need the latest Subversion built from trunk, plus the APR and SQLite libraries from apt:

$ sudo aptitude install libapr1-dev libaprutil1-dev libsqlite3-dev
$ gcc crawl_repository.c -I/usr/include/subversion-1 -I/usr/include/apr-1.0 -lsvn_ra-1 -lsvn_client-1 -o crawl_repository

The code reports the working copy root with start_empty=TRUE. As a result the server sends add_directory and add_file editor calls that we can use to list the repository contents.

$ ./crawl_repository
listing revision                        2
entered root directory
entered directory                       trunk
entered file                            trunk/file
left file, md5sum = d41d8cd98f00b204e9800998ecf8427e
left directory
entered directory                       branches
left directory
entered directory                       tags
left directory
left directory

Finally, I’ll just note that this approach should be faster than the one used by “svn list --depth infinity”, because “svn list --depth infinity” issues a series of recursive “get dir” calls (which result in many network requests). The approach based on the “status” request with start_empty=TRUE needs only one request.
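
To get a feeling for the difference, here is a toy model in plain Java. It assumes one “get dir” request per directory, as described above. ListingCost and its methods are invented for this sketch; nothing goes over a real network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: counts simulated "get dir" round trips.
final class ListingCost {
    // directory path -> child names; a name ending with "/" is a subdirectory
    private final Map<String, List<String>> tree = new HashMap<>();
    private int requests = 0;

    void addDir(String path, String... children) {
        tree.put(path, Arrays.asList(children));
    }

    int requestCount() {
        return requests;
    }

    // One simulated "get dir" request for a single directory.
    private List<String> getDir(String path) {
        requests++;
        return tree.getOrDefault(path, Collections.<String>emptyList());
    }

    // Recursive listing built from repeated "get dir" calls.
    List<String> listRecursively(String dir) {
        List<String> paths = new ArrayList<>();
        for (String child : getDir(dir)) {
            String path = dir + child;
            paths.add(path);
            if (child.endsWith("/")) {
                paths.addAll(listRecursively(path));
            }
        }
        return paths;
    }

    // The small repository from the example output above.
    static ListingCost sample() {
        ListingCost cost = new ListingCost();
        cost.addDir("", "trunk/", "branches/", "tags/");
        cost.addDir("trunk/", "file");
        cost.addDir("branches/");
        cost.addDir("tags/");
        return cost;
    }

    public static void main(String[] args) {
        ListingCost cost = sample();
        System.out.println(cost.listRecursively(""));
        System.out.println("get dir requests: " + cost.requestCount());
    }
}
```

Even this tiny repository needs 4 simulated requests (root, trunk, branches, tags), and the number grows with every directory, while the “status” approach stays at one request.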

How to build Subversion on Debian GNU/Linux

Looks like a trivial question? Unfortunately, because of this bug the standard “configure + make + sudo make install” approach doesn’t work.

The problem occurs because the Subversion sources contain the libsvn_XXX libraries. If the libsvn1 package is installed, it provides these libraries too, and the build mixes them up. If the libsvn1 package is not installed, compilation succeeds, but the set of supported features still depends on the libraries currently installed.

Fortunately, Subversion sources contain a script that downloads all necessary dependencies in Maven-like fashion and doesn’t touch local libraries (except openssl). The algorithm is:

1. Install packages necessary to run the script.

$ sudo aptitude install subversion gperf autoconf libssl-dev make gcc binutils libtool libxml2-dev

2. Download and run the script.

$ mkdir build
$ cd build
$ svn co https://svn.apache.org/repos/asf/subversion/trunk/tools/dev/unix-build
$ ln -s unix-build/Makefile.svn Makefile
$ make

If the script fails (e.g. because some utility or library is missing), one can fix the problem and continue with just

$ make

The build result will be put into the svn-trunk directory.

$ svn-trunk/subversion/svn/svn --version
svn, version 1.8.0-dev (under development)
   compiled Jun 16 2012, 21:01:03

Copyright (C) 2012 The Apache Software Foundation.
This software consists of contributions made by many people; see the NOTICE
file for more information.
Subversion is open source software, see http://subversion.apache.org/

The following repository access (RA) modules are available:

* ra_svn : Module for accessing a repository using the svn network protocol.
  - with Cyrus SASL authentication
  - handles 'svn' scheme
* ra_local : Module for accessing a repository on local disk.
  - handles 'file' scheme
* ra_serf : Module for accessing a repository via WebDAV protocol using serf.
  - handles 'http' scheme
  - handles 'https' scheme

E200030: BUSY error explained

As I wrote in my previous post, Subversion 1.7 keeps all working copy metadata in the .svn/wc.db SQLite database. In earlier working copy formats (versions <= 1.6) every subdirectory of a working copy could be considered independently. In particular, one could run 2 processes/threads that modified different working copy subdirectories at the same time. With SVN 1.7 one cannot.

Every SVN 1.7 write operation (add, copy, mv, rm, …) locks .svn/wc.db database. When another process tries to obtain a write lock on the same .svn/wc.db at the same time, it fails with an error.

Let’s reproduce the problem to make things clear. First, install SQLite library and development packages.

$ sudo aptitude install libsqlite3-0 libsqlite3-dev

Let’s write a small program that would lock our database:

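#include <sqlite3.h>
#include <stdio.h>
#include <unistd.h> // for sleep()

int main(int argc, char** argv) {
    sqlite3 *db;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <sqlite_database.db>\n", argv[0]);
        return 1;
    }

    if (sqlite3_open_v2(argv[1], &db, SQLITE_OPEN_READWRITE, NULL) != SQLITE_OK) {
        fprintf(stderr, "Unable to open %s: %s\n", argv[1], sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }

    // take an exclusive lock and keep it: the transaction is never committed
    sqlite3_exec(db, "PRAGMA locking_mode = EXCLUSIVE;",
        NULL, NULL, NULL);
    sqlite3_exec(db, "BEGIN TRANSACTION;", NULL, NULL, NULL);
    sqlite3_exec(db, "DELETE FROM NODES;", NULL, NULL, NULL);

    // hold the lock until the process is killed
    while (1) {
        sleep(1000);
    }

    return 0;
}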

Don’t worry: without a “COMMIT” statement the deletion is never committed, and SQLite rolls it back once the program is terminated. But anyway I wouldn’t recommend running it on a .svn/wc.db that is important to you: every piece of software contains bugs.

Let’s compile and try it:

$ gcc lock_sqlite.c -lsqlite3 -o lock_sqlite
$ ./lock_sqlite .svn/wc.db

While the program is running, our SQLite database is locked. Let’s try to modify .svn/wc.db with SVN. SVN says

$ svn add file
svn: E155004: Working copy '/tmp/test-co' locked
svn: E200033: sqlite: database is locked
svn: E200033: sqlite: database is locked
svn: run 'svn cleanup' to remove locks (type 'svn help cleanup' for details)

SVNKit is a pure Java SVN implementation (actually, there are only 2 SVN implementations: native SVN and SVNKit). Nearly all Java programs that work with Subversion use SVNKit. Let’s try to modify .svn/wc.db with SVNKit:

$ jsvn add file
svn: E200030: BUSY

So if you use a Java-based SVN client and see this message, there are 2 possible explanations:

  1. You have a background process that tries to access your working copy at the same time.
  2. Your SVN client tries to access the working copy from several threads. Then it’s a bug, report it to the client developers.
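
If you are the client developer in the second case, one possible fix (my assumption, not something SVNKit prescribes) is to serialize all operations on one working copy with a lock. The sketch below uses only java.util.concurrent; WorkingCopyGuard is a hypothetical name and the “operation” is a placeholder for real working copy calls. Note that such a lock only helps between threads of one JVM, not against other processes.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

final class WorkingCopyGuard {
    private final ReentrantLock lock = new ReentrantLock();

    // Runs one working copy operation while holding the lock, so that two
    // threads never touch the same working copy at the same time.
    void run(Runnable operation) {
        lock.lock();
        try {
            operation.run();
        } finally {
            lock.unlock();
        }
    }

    // Demo: starts several threads and returns the highest number of
    // operations that were ever running simultaneously.
    static int maxConcurrentOperations(int threadCount) {
        WorkingCopyGuard guard = new WorkingCopyGuard();
        AtomicInteger running = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> guard.run(() -> {
                int now = running.incrementAndGet(); // placeholder for real work
                max.accumulateAndGet(now, Math::max);
                running.decrementAndGet();
            }));
            threads[i].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return max.get();
    }

    public static void main(String[] args) {
        System.out.println(maxConcurrentOperations(8));
    }
}
```

With the guard in place at most one operation runs at any moment, so the threads never compete for the working copy database.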