J-P - 1 year ago
Git Question

Removing git repository objects entirely from all branches and tags and pushing changes to remote

We did a client migration into a website. Our code was on a separate branch, which was then merged into master and release. Master has been branched several times since for other features, as well. All these branches make the repository slightly more complicated than the examples I've found on the web.

We now realise that the client's original media - mostly images and a big CSV file - was also checked into Git. Although it's only 12MB or so, there are several reasons for removing it (not least that the client's filenames have non-ASCII characters that are playing hell with our Vagrant box's shared folders on OSX.) Here's the size breakdown for the repository:

$ du --max-depth 1 -h
12M ./.git
13M ./modules
2.0M ./themes
27M .

Although the binaries are now present on several branches, as far as I'm aware I should be able to do the following to remove both the binaries and the repository objects corresponding to them:

$ git filter-branch --tree-filter "git rm -rf --ignore-unmatch modules/custom/mymigration/data/photos/*" # Did this with and without "HEAD" argument
[snip lots of output]
$ git reflog expire --expire=now --all
$ git gc --aggressive --prune=now

However, I still have a large .git subfolder:

$ du --max-depth 1 -h
12M ./.git
1.4M ./modules
2.0M ./themes
15M .

The biggest file is .git/objects/pack/pack-....pack . When I verify the .idx file for this:

$ git verify-pack -v .git/objects/pack/pack-53c8077d0590dabcf5366589c3d6594768637f5e.idx | sort -k 3 -n | tail -n 5

I get a long list of objects. If I pipe this into rev-list, and grep for my migration data directory:

$ for i in `git verify-pack -v .git/objects/pack/pack-53c8077d0590dabcf5366589c3d6594768637f5e.idx | sort -k 3 -n | tail -n 5 | awk '{print $1}'`; do
git rev-list --objects --all | \
grep "$i" | \
grep modules/custom/mymigration/data
done
47846536601f0bc3a31093c88768b522a5500c96 modules/custom/mymigration/data/photos/Turkey.jpg
b920e36357d855352f4fdb31c17772d21c01304d modules/custom/mymigration/data/photos/Burger_Top.JPG

As you can see, the photos are still in the pack file.
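The same lookup can be done in a single pass rather than re-running rev-list for every object. The following is only a sketch: largest_blobs is a hypothetical helper name, and it ranks blobs by uncompressed size rather than by the packed size that verify-pack reports.

```shell
# List the N largest blobs reachable from any ref, with their paths.
# largest_blobs is a hypothetical helper; run it inside the repository.
largest_blobs() {
    count="${1:-5}"
    # rev-list prints "<sha> <path>"; cat-file echoes the path back via %(rest)
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob"' |
        sort -k 2 -n -r |
        head -n "$count"
}
```

Running `largest_blobs 5 | grep modules/custom/mymigration/data` should then show whether any of the migration files are still reachable from a branch or tag.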

  • If I push this repository up to a (completely empty) remote, then clone that remote somewhere else entirely, the clone still has a 12MB pack file.

  • Cloning this repository locally with
    git clone file://path/to/old-repos new-repos
    also has the same effect: worse, all my origin branches disappear (as you'd probably expect) so I only have master.

Is there anything I can do to get rid of those packed objects? Does their very continued existence suggest that they're still associated with some git commit object somewhere? I've tried to
but nothing has changed.

Furthermore, if I just "get rid of them", is anything likely to break, if I haven't done the first bit properly? What happens if a file object is deleted that a git commit still refers to?
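One way to test whether something still points at those blobs is to ask each ref in turn whether its history contains a given blob id. This is a minimal sketch, assuming git 2.16 or later (for --find-object); refs_reaching_blob is a hypothetical helper name. It covers every ref, including the backups that filter-branch keeps under refs/original/, but it does not look at reflog entries.

```shell
# Print every ref whose history adds or removes the given blob id.
# refs_reaching_blob is a hypothetical helper; needs git 2.16+ for --find-object.
refs_reaching_blob() {
    blob="$1"
    git for-each-ref --format='%(refname)' |
    while read -r ref; do
        # -n 1 stops at the first matching commit; errors (e.g. a tag that
        # does not point at a commit) are ignored
        if [ -n "$(git log -n 1 --format=%H --find-object="$blob" "$ref" -- 2>/dev/null)" ]; then
            echo "$ref"
        fi
    done
}
```

For example, `refs_reaching_blob 47846536601f0bc3a31093c88768b522a5500c96` would list the refs that still hold Turkey.jpg. Any ref printed here keeps the object alive through gc, so it needs to be rewritten or deleted before pruning can remove the blob.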

Answer Source

The following works, repeatably, in reducing the repository down to around 2.5MB .git and 5.8MB in total. It includes the suggestions made by @jamessan above.

This removes the objects from all branches and pushes those removals to a remote repository. As far as I can tell, that remote repository is then entirely free of these objects (judging by how dramatically the repository size drops).

# Configure the repository to push all existing branches & tags
# when none are explicitly specified
git config --add remote.origin.push '+refs/tags/*:refs/tags/*'
git config --add remote.origin.push '+refs/heads/*:refs/heads/*'

# Make sure all local branches exist, so they get filtered
for remote_branch in `git branch --all | grep -v HEAD | sed -e 's/\*//'`; do
    local_branch=`echo $remote_branch | sed -e 's!remotes/origin/!!'`
    git checkout $local_branch
done

# Prevent git < from complaining about dirty working directory
git update-index -q --ignore-submodules --refresh

# Do the filtering across --all branches and rewrite tags
# Note that this will necessarily remove signatures on tags
git filter-branch -f --tree-filter "git rm -rf --ignore-unmatch modules/custom/mymigration/data/photos/*" --tag-name-filter cat -- --all

# Remove the backed-up refs
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

# Clear out the reflog and garbage-collect
git reflog expire --expire=now --all
git gc --aggressive --prune=now

# Push all changes to origin - pushes tags and branches
git push origin
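
Before pushing, it may be worth confirming that the rewrite did what was intended. The following is a sketch only: path_still_in_history is a hypothetical helper name, and the path in the commented example is the one from the question.

```shell
# Succeed if any ref (branch, tag, refs/original/ backup) still has a
# commit touching the given path. path_still_in_history is a hypothetical helper.
path_still_in_history() {
    path="$1"
    [ -n "$(git log --all -n 1 --format=%H -- "$path")" ]
}

# Example use after filtering and garbage collection:
#   if path_still_in_history modules/custom/mymigration/data/photos; then
#       echo "photos are still referenced somewhere"
#   fi
#   git count-objects -vH   # size-pack is the packed repository size
#   git fsck --full         # reports missing or corrupt objects
```

The fsck step also addresses the question's last worry: if a blob were deleted by hand while a commit still referred to it, git fsck would report it as missing, and checking out that commit would fail.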