sanmai sanmai - 9 months ago 44
Git Question

How do I reduce the size of a bloated Git repo by non-interactively squashing all commits except for the most recent ones?

My Git repo has hundreds of gigabytes of data, so I'm trying to remove old, outdated commits, because they're making everything larger and slower. I need a solution that's fast; the faster, the better.

How do I squash all commits except for the most recent
ones, and do so without having to manually squash each one in an interactive rebase? Specifically, I don't want to have to use

git rebase -i --root

My repo

I have these commits:

A .. B .. C ... ... H .. I .. J .. K .. L

What I want is this (squashing everything in between

A .. H .. I .. J .. K .. L

There is an answer on how to squash all commits, but I want to keep some of the
more recent commits. I don't want to squash the most recent commits either. (Especially I need to keep the first two commits counting from the top.)

Answer Source

Fastest counting implementation time is almost certainly going to be with grafts and a filter-branch, though you might be able to get faster execution with a handrolled commit-tree sequence working off rev-list output.

Rebase is built to apply changes on different content. What you're doing here is preserving contents and intentionally losing the change history that produced them, so pretty much all of rebase's most tedious and slow work is wasted.

The payload here is, working from your picture,

echo `git rev-parse H; git rev-parse A` > .git/info/grafts  
git filter-branch -- --all

Documentation for git rev-parse and git filter-branch.

Filter-branch is very careful to be recoverable after a failure at any point, which is certainly safest .... but it's only really helpful when recovery by simply redoing it wouldn't be faster and easier if things go south on you. Failures being rare and restarts usually being cheap, the thing to do is to do an un"safe" but very fast operation that is all but certain to work. For that, the best option here is to do it on a tmpfs (the closest equivalent I know on Windows would be a ramdisk like ImDisk), which will be blazing fast and won't touch your main repo until you're sure you've got the results you want.

So on Windows, say T:\wip is on a ramdisk, and note that the clone here copies nothing. As well as reading the docs on git clone's --shared option, do examine the clone's innards to see the real effect, it's very straightforward.

# switch to a lightweight wip clone on a tmpfs
git clone --shared --no-checkout . /t/wip/filterwork
cd !$

# graft out the unwanted commits
echo `git rev-parse $L; git rev-parse $A` >.git/info/grafts
git filter-branch -- --all

# check that the repo history looks right
git log --graph --decorate --oneline --all

# all done with the splicing, filter-branch has integrated it
rm .git/info/grafts

# push the rewritten histories back
git push origin --all --force

There are enough possible variations on what you might be wanting to do and what might be in your repo that almost any of the options on these commands might be useful. The above is tested and will do what it says it does, but that might not be exactly what you want.