Thursday, April 14, 2011

Bazaar repository bloat - rebase, merge, push, pull

UPDATE : Repository bloat (at least in the merge-then-merge-back scenario) can be solved very easily by running "bzr pack --clean-obsolete-packs". The rest of this article is still interesting for the things it examines, but it is somewhat outdated by this update. Read at your own risk...
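If you want to script that clean-up, something along these lines should do. It's only a minimal Python sketch, not part of the original investigation: the helper names pack_command and pack_repository are made up for illustration, and it assumes bzr is installed and on your PATH.

```python
import subprocess


def pack_command(clean_obsolete_packs=True):
    """Build the argument list for the 'bzr pack' call mentioned in the update."""
    command = ["bzr", "pack"]
    if clean_obsolete_packs:
        # Ask bzr to delete the obsolete pack files left over after repacking.
        command.append("--clean-obsolete-packs")
    return command


def pack_repository(location):
    """Run the pack command inside 'location' (a branch or repository folder).

    Assumes bzr is installed; raises CalledProcessError if bzr reports failure.
    """
    subprocess.run(pack_command(), cwd=location, check=True)
```

Calling pack_repository("trunk") would then be the scripted equivalent of typing the command in the trunk folder.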

Bazaar is awesome. I say that almost every time I talk about Bazaar. I love it.

However, there are some use cases where you can end up with repository bloat, completely unnecessarily.

Repository bloat occurs when Bazaar decides - for whatever reason - to make a new version of an existing revision, and thereby duplicate data that was in the revision.

So for example, you have a 5MB file and you add it to a branch, and then merge the branch with the trunk. Trunk's repository size should increase by roughly 5MB, you say? Well, should, you're right, but depending on how you do it, you can actually end up with a 10MB increase instead.

Repository bloat.

So how do you avoid it?

Well, I've noticed repository bloat in two main situations (although there are likely more). Both involve the trunk and branch diverging. So if your workflow guarantees that trunk and branch are always sync'ed before divergence happens (i.e. at any given time there are changes on one side or the other, but never both), then repository bloat won't be a problem for you. (Rare will be the workflows where you can guarantee that, though!)

Here are the two repository bloat scenarios I've noticed :

1) rebase : your revisions get rewritten to your local repository.

e.g.

md trunk
cd trunk
bzr init
echo Hi>readme.txt
bzr add
bzr commit -m "trunk commit"
cd ..
bzr branch --stacked trunk branch
cd branch
(put 5MB file called BigFile.dat in branch folder)
bzr add
bzr commit -m "Added BigFile.dat in branch"

So far so good. And if you rebase at this point, you're fine ('coz nothing will happen).

But if we continue :

cd ../trunk
echo A change in trunk>>readme.txt
bzr commit -m "Another trunk commit"
cd ../branch
bzr rebase

... well, the rebase runs just fine, but if you check the size of the .bzr folder in the branch, it is around 10MB, not 5!

Repository bloat!
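If you'd rather measure than eyeball it, here's a small Python sketch for checking the size of the .bzr folder before and after an operation. The function names are my own invention for illustration, and it simply adds up file sizes on disk, so expect the number to differ a little from what your file manager reports.

```python
import os


def folder_size_bytes(path):
    """Add up the sizes of all regular files under 'path'."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            # Skip symbolic links so nothing outside the folder is counted.
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


def bzr_folder_size_mb(branch_dir):
    """Size of the .bzr folder inside 'branch_dir', in megabytes."""
    return folder_size_bytes(os.path.join(branch_dir, ".bzr")) / (1024 * 1024)
```

Call bzr_folder_size_mb("branch") once before the rebase and once after, and compare the two numbers.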

How to avoid repository bloat when rebasing?

Well, the conclusion I've come to is : let the repository bloat, then merge or push to trunk, and then follow my instructions on purging stacked branches to remove the bloat. (The bloat in rebase cases is only in the branch, not the trunk. And fortunately, it seems that pushing the bloated repository to the trunk only pushes the new versions of the affected revisions, not both old and new versions - i.e. the bloat is not propagated back to the trunk in this case.)

(Not using stacked branches? Sorry, not my use case, so I haven't investigated further and thus can't tell you for sure what will work - although if you get really really desperate you can make a new branch with "bzr branch --no-tree", then delete the .bzr folder in your existing branch and replace it with the .bzr folder from the new branch. Again - only do that at a point where trunk and branch are in sync.)

2) merge to branch then merge to trunk

This is a pretty standard operation if you've been working on your branch for a while and the trunk has changed in the meantime.

You can't pull the trunk changes into the branch. Once the two are out-of-sync, you're forced to use merge or rebase. The rebase scenario is covered above, and results in duplication of data in branch revisions from the point of divergence onwards.

The merge scenario is what we're covering here. Its repository bloat characteristics are more interesting. Whereas rebase results in duplication of data in BRANCH revisions from the point of divergence onwards, merge can result in duplication of data in TRUNK revisions from the point of divergence onwards, assuming that you proceed to merge branch back into trunk. (If you PUSH branch back into trunk, I suspect (but haven't tested) that you'll get away without repository bloat - but then you lose the trunk's unique perspective on the change history - i.e. your log and qlog are thereafter from the branch's perspective instead of from the trunk's perspective.)

e.g.
md trunk
cd trunk
bzr init
echo Hi>test.txt
(add 5MB file called BigFile.dat into trunk folder)
bzr add
bzr commit -m "Initial commit in trunk"
cd ..
bzr branch --stacked trunk branch
cd branch
echo bla>test2.txt
bzr add
bzr commit -m "First commit in branch"
cd ..
cd trunk
(replace 5MB file in trunk folder with a different 5MB file of same name)
bzr commit -m "Modified BigFile.dat"
cd ..
cd branch

OK - so far so good - but trunk and branch have diverged and now we're at the point we want to make them converge. Normally we might do :

bzr merge ../trunk
bzr commit -m "Merged trunk changes into branch"
cd ..
cd trunk
bzr merge ../branch
bzr commit -m "Merged branch into trunk"

... but if you do that, you'll get our lovely friend Repository Bloat(TM)!

Why?

Well, it seems that merging the 5MB file's modification revision in from trunk to branch, which requires a commit, results in that 5MB file's data ending up in a second revision, and when we merge back into trunk, that second revision ends up in the trunk's repository. (Interestingly, this does not happen if the file was newly created in the trunk - only if it was already known to the branch and was then updated in the trunk.)

10MB repository growth for a 5MB file. Baaaaad.

(To emphasize : the final trunk repository size is 15MB : 5MB after initial commit of the 5MB file, then a further 5MB totalling 10MB after second commit to trunk, and finally a third 5MB totalling 15MB after merging in from branch and committing again.)
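To spell the arithmetic out, here's a tiny Python sketch. It's a rough model only: it assumes each distinct version of the big file costs its full size in the repository (no compression or delta savings) and ignores the small text files. The function name bloat_mb is made up for illustration.

```python
def bloat_mb(observed_mb, file_mb, distinct_versions):
    """Return how much bigger the repository is than the file data explains."""
    expected_mb = file_mb * distinct_versions
    return observed_mb - expected_mb


# Merge scenario: two distinct versions of the 5MB file, trunk ends up around 15MB.
merge_bloat = bloat_mb(observed_mb=15, file_mb=5, distinct_versions=2)

# Rebase scenario: one version of the 5MB file, branch .bzr ends up around 10MB.
rebase_bloat = bloat_mb(observed_mb=10, file_mb=5, distinct_versions=1)

print(merge_bloat, rebase_bloat)  # prints "5 5"
```

In both scenarios the excess is one extra copy of the 5MB file.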

We saw how to get around it with the rebase bloat problem. How to get around it with the merge bloat problem?

One way is to avoid the merge-then-merge-back entirely. If trunk has changed and you can't pull the changes into the branch because trunk and branch have diverged, then rebase instead. You might/will end up with branch repository bloat, but I cover how to deal with that in the preceding section on repository bloat caused by the rebase operation.

All a bit tedious? Perhaps. But easily scriptable.

Of course, if your workflow relies on the merge process, you might just have to accept the bloat. Not ideal. You might be able to avoid the bloat by using the merge -c option when merging back into trunk, to "cherry-pick" only the branch revisions that are not themselves merge-from-trunk commits. And there are yet more desperate approaches one could take if needed - e.g. export branch changes to a patch set, delete branch, recreate it from trunk and apply patches!!! Well y'know, it would probably work.......
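For what it's worth, the cherry-pick idea could be scripted roughly like this. It's an untested Python sketch that only builds the command lines: the function name and commit message are invented for illustration, the revision numbers are whatever your own log shows, and I haven't verified that cherry-picking actually avoids the bloat. It commits after each cherry-pick on the assumption that bzr won't merge into a tree with uncommitted changes unless forced.

```python
def cherry_pick_commands(branch_location, revisions):
    """Build 'bzr merge -c' and 'bzr commit' commands, one pair per revision.

    'revisions' should list only the branch revisions that are NOT
    merge-from-trunk commits. Run the commands from the trunk folder.
    """
    commands = []
    for revision in revisions:
        # Merge just the changes introduced by this one branch revision.
        commands.append(["bzr", "merge", "-c", str(revision), branch_location])
        # Commit before the next cherry-pick so the working tree is clean.
        message = "Cherry-picked revision %s from branch" % revision
        commands.append(["bzr", "commit", "-m", message])
    return commands
```

In the example above, "First commit in branch" is presumably revision 2 of the branch, so cherry_pick_commands("../branch", [2]) would give you the commands to run in trunk.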

And maybe I need my head checked, but even with a few little problems like this, I still absolutely love Bazaar. (Yes - relatively little. In practice, does it matter if your repository is twice the size it needs to be? Sometimes yes, usually no. For me, it's a little more critical than for others due to certain peculiar circumstances, and hence my investigations in how to avoid/resolve repository bloat.) Thanks for stopping by! :o)
