UPDATE : Repository bloat (at least in the merge-then-merge-back scenario) can be solved very easily : "bzr pack --clean-obsolete-packs". The content of this article is interesting in the things it examines, but somewhat outdated by this update. Read at own risk...
Bazaar is awesome. I say that almost every time I talk about Bazaar. I love it.
However, there are some use cases where you can end up with repository bloat, completely unnecessarily.
Repository bloat occurs when Bazaar decides - for whatever reason - to make a new version of an existing revision, and thereby duplicate data that was in the revision.
So for example, you have a 5MB file and you add it to a branch, and then merge the branch with the trunk. Trunk's repository size should increase by roughly 5MB, you say? Well, should, you're right, but depending on how you do it, you can actually end up with a 10MB increase instead.
Repository bloat.
So how do you avoid it?
Well, I've noticed repository bloat in two main situations (although there are likely more). Both situations involve the trunk and branch diverging - so if your workflow is such that the trunk and branch are always sync'ed before divergence happens (i.e. when there are only changes on one side or the other but not both), then repository bloat won't be a problem for you (but rare will be workflows where you can guarantee that!)
Here are the two repository bloat scenarios I've noticed :
1) rebase : your revisions get rewritten to your local repository.
e.g.
md trunk
cd trunk
bzr init
echo Hi>readme.txt
bzr add
bzr commit -m "trunk commit"
cd ..
bzr branch --stacked trunk branch
cd branch
(put 5MB file called BigFile.dat in branch folder)
bzr add
bzr commit -m "Added BigFile.dat in branch"
So far so good. And if you rebase at this point, you're fine ('coz nothing will happen).
But if we continue :
cd ../trunk
echo A change in trunk>>readme.txt
bzr commit -m "Another trunk commit"
cd ../branch
bzr rebase
... well, the rebase runs just fine, but if you check the size of the .bzr folder in the branch, it is around 10MB, not 5!
Repository bloat!
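If you'd rather not eyeball the folder size in a file manager, here is a minimal Python sketch for measuring it. The function names dir_size_bytes and bzr_size_mb are made up for this illustration; nothing here is part of Bazaar, it just adds up file sizes on disk.

```python
import os


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link is not counted as the file it points to.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def bzr_size_mb(branch_dir):
    """Size of the .bzr folder inside branch_dir, in MB (1 MB = 1024 * 1024 bytes)."""
    return dir_size_bytes(os.path.join(branch_dir, ".bzr")) / (1024 * 1024)
```

Calling bzr_size_mb("branch") once before the rebase and once after it should show the jump from roughly 5 to roughly 10 described above.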
How to avoid repository bloat when rebasing?
Well, the conclusion I've come to is : let the repository bloat, and merge or push to trunk, and then follow my instructions on purging stacked branches to remove the bloat. (The bloat in rebase cases is only in the branch, not the trunk. And fortunately, it seems that pushing the bloated repository to the trunk only pushes the new versions of the affected revisions instead of pushing both old and new versions - i.e. the bloat is not propagated back to the trunk in this case.)
(Not using stacked branches? Sorry, not my use case, so I haven't investigated further and thus can't tell you for sure what will work - although if you get really really desperate you can make a new branch --no-tree and then delete the .bzr folder in your existing branch and replace it with the .bzr folder in the new branch. Again - only do that at a point where trunk and branch are in-sync.)
2) merge to branch then merge to trunk
This is a pretty standard operation if you've been working on your branch for a while and the trunk has changed in the meantime.
You can't pull the trunk changes into the branch. Once the two are out-of-sync, you're forced to use merge or rebase. The rebase scenario is covered above, and results in duplication of data in branch revisions from the point of divergence onwards.
The merge scenario is what we're covering here. Its repository bloat characteristics are more interesting. Whereas rebase results in duplication of data in BRANCH revisions from the point of divergence onwards, merge can result in duplication of data in TRUNK revisions from the point of divergence onwards, assuming that you proceed to merge branch back into trunk. (If you PUSH branch back into trunk, I suspect (but haven't tested) that you'll get away without repository bloat - but then you lose the trunk's unique perspective on the change history - i.e. your log and qlog are thereafter from the branch's perspective instead of from the trunk's perspective.)
e.g.
md trunk
cd trunk
bzr init
echo Hi>test.txt
(add 5MB file called BigFile.dat into trunk folder)
bzr add
bzr commit -m "Initial commit in trunk"
cd ..
bzr branch --stacked trunk branch
cd branch
echo bla>test2.txt
bzr add
bzr commit -m "First commit in branch"
cd ..
cd trunk
(replace 5MB file in trunk folder with a different 5MB file of same name)
bzr commit -m "Modified BigFile.dat"
cd ..
cd branch
OK - so far so good - but trunk and branch have diverged and now we're at the point we want to make them converge. Normally we might do :
bzr merge ../trunk
bzr commit -m "Merged trunk changes into branch"
cd ..
cd trunk
bzr merge ../branch
bzr commit -m "Merged branch into trunk"
... but if you do that, you'll get our lovely friend Repository Bloat(TM)!
Why?
Well, it seems that merging the 5MB file's modification revision in from trunk to branch, which requires a commit, results in that 5MB file's data ending up in a second revision, and when we merge back into trunk, that second revision ends up in the trunk's repository. (Interestingly, this does not happen if the file was newly created in the trunk - only if it was already known to the branch and was then updated in the trunk.)
10MB repository growth for a 5MB file. Baaaaad.
(To emphasize : the final trunk repository size is 15MB : 5MB after initial commit of the 5MB file, then a further 5MB totalling 10MB after second commit to trunk, and finally a third 5MB totalling 15MB after merging in from branch and committing again.)
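The same arithmetic, spelled out as a tiny Python sketch. The numbers are the rough ones from this example; it assumes each stored copy of the file costs its full 5MB (no useful compression between the two different files) and ignores the tiny text files.

```python
FILE_MB = 5

# Commits that end up storing a full copy of the big file's data in trunk.
stored_copies = [
    "initial commit of the 5MB file",
    "second trunk commit (file replaced)",
    "merge from branch, committed in trunk",
]

final_trunk_mb = FILE_MB * len(stored_copies)   # what the merge round trip leaves behind
expected_trunk_mb = FILE_MB * 2                 # original file + one modified version
bloat_mb = final_trunk_mb - expected_trunk_mb   # the unnecessary extra copy

print(final_trunk_mb, expected_trunk_mb, bloat_mb)
```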
We saw how to get around it with the rebase bloat problem. How to get around it with the merge bloat problem?
One way is to avoid the merge-then-merge-back entirely. If trunk has changed and you can't pull the changes into the branch because trunk and branch have diverged, then rebase instead. You might/will end up with branch repository bloat, but I cover how to deal with that in the preceding section on repository bloat caused by the rebase operation.
All a bit tedious? Perhaps. But easily scriptable.
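As a rough idea of what such a script could look like, here is a minimal Python sketch of the rebase-then-push part. The function names and the injectable run parameter are made up for this illustration. It assumes bzr is on your PATH as a directly runnable program (if yours is a batch file on Windows you may need to adjust how it is launched), that the rebase command is available in your Bazaar install as in the example above, and that the rebase finishes without conflicts.

```python
import subprocess


def sync_commands(trunk_location):
    """Commands to run, in order, from inside the branch folder."""
    return [
        ["bzr", "rebase"],                # replay the branch's commits on top of trunk
        ["bzr", "push", trunk_location],  # send the rewritten revisions to trunk
    ]


def sync_branch(branch_dir, trunk_location, run=subprocess.check_call):
    """Run each command in branch_dir; check_call raises if a command fails."""
    for command in sync_commands(trunk_location):
        run(command, cwd=branch_dir)
```

For example, sync_branch("branch", "../trunk") would run both commands in the branch folder and stop at the first one that fails. Removing the leftover bloat from the branch afterwards is a separate step.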
Of course, if your workflow relies on the merge process, you might just have to accept the bloat. Not ideal. You might be able to avoid the bloat by using the merge -c option when merging back into trunk, to "cherry-pick" only the branch revisions that are not themselves merge-from-trunk commits. And there are yet more desperate approaches one could take if needed - e.g. export branch changes to a patch set, delete branch, recreate it from trunk and apply patches!!! Well y'know, it would probably work.......
And maybe I need my head checked, but even with a few little problems like this, I still absolutely love Bazaar. (Yes - relatively little. In practice, does it matter if your repository is twice the size it needs to be? Sometimes yes, usually no. For me, it's a little more critical than for others due to certain peculiar circumstances, and hence my investigations into how to avoid/resolve repository bloat.) Thanks for stopping by! :o)
Thursday, April 14, 2011
Purging stacked branches in Bazaar
Stacked branches are awesome!
Shared repositories only go so far : they don't work so well if the parent and child branches are far away from each other in the file system (nor if they are on different volumes), and they have the weakness that if you create a revision, it lives on forever, even if you later delete the branch associated with that revision. You can't actually get the revision back, not by any way I've found (UPDATE : "bzr heads --all" looks like it lets you find "lost" revisions), but the shared repository's size never goes down - it just keeps accruing more and more data, never letting any of it go. (UPDATE : I'm no longer entirely sure when the repository's size changes - "bzr pack --clean-obsolete-packs" does wonders.)
In contrast, stacked branches can be used at any time both the parent and child branch are simultaneously accessible (even if they're on different hard disks or even one on a URL), and best of all, if you make an experimental branch and decide to kill it, bam! - its history is gone forever and your trunk repository isn't forever bloated by the revisions you decided to nuke.
And they're extremely useful if you want the same library to be in multiple apps (in different Bazaar repositories) and want to be able to edit the source code in each copy of the library independently but have them all closely associated.
And did I mention they save a lot of storage space?
But thence cometh the problem : stacked branches start out tiny, because they aren't carrying the five decades of history that the trunk contains, but after that they grow.
And grow.
What if you just want the stacked branch repositories to stay nice and trim, like they were when you made them?
There doesn't seem to be any built-in feature in Bazaar to do that.
push, pull, merge, do whatever you want - the stacked branch's repository only grows.
So we resort to a little bit of - very effective - skullduggery.
FIRST UP : TRY THIS EXPERIMENTALLY BEFORE RELYING ON IT. It worked for me, but might destroy you and your world and your company's beautiful source code and get you fired. THIS USES UNDOCUMENTED TRICKS. So it could stop working when new versions of Bazaar roll out. I have and accept no responsibility for what happens to you if you try this yourself!
1) Purging the stacked branch history obviously needs to be done at times that the stacked branch is in-sync with the trunk. So make sure you've merged or pushed the branch into the trunk.
2) In the branch, delete all files in these two folders :
.bzr\repository\indices
.bzr\repository\packs
3) Still in the branch, locate this file :
.bzr\repository\pack-names
... and change its content to the following five lines :
B+Tree Graph Index 2
node_ref_lists=0
key_elements=1
len=0
row_lengths=
Voila! Do a bzr status or bzr log and the history is all there - it's just now coming from the stacked-on branch like you wanted all along. You have successfully purged the stacked branch's history.
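Steps 2 and 3 can be sketched in Python as below. This is only an illustration of the manual trick, so all the warnings above apply to it too. The function name purge_stacked_branch is made up, the script does not check that the branch is in-sync with the trunk (step 1 is still up to you), and ending the file with a newline after the fifth line is my assumption.

```python
import os

# The five lines from step 3. The trailing newline is an assumption.
EMPTY_PACK_NAMES = (
    "B+Tree Graph Index 2\n"
    "node_ref_lists=0\n"
    "key_elements=1\n"
    "len=0\n"
    "row_lengths=\n"
)


def purge_stacked_branch(branch_dir):
    """Delete the files in indices and packs, then reset pack-names.

    Returns the relative paths of the files that were deleted.
    """
    repo = os.path.join(branch_dir, ".bzr", "repository")
    removed = []
    for folder in ("indices", "packs"):
        folder_path = os.path.join(repo, folder)
        for name in sorted(os.listdir(folder_path)):
            path = os.path.join(folder_path, name)
            # Only regular files are deleted; the folders themselves stay.
            if os.path.isfile(path):
                os.remove(path)
                removed.append(os.path.join(folder, name))
    # Write bytes so the line endings are not translated on Windows.
    with open(os.path.join(repo, "pack-names"), "wb") as f:
        f.write(EMPTY_PACK_NAMES.encode("ascii"))
    return removed
```

Try it on a throwaway copy of a branch first, e.g. purge_stacked_branch("branch-copy"), and confirm that bzr log still shows the full history afterwards.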