Marc Marc - 5 days ago 6
Git Question

How to substitute words in Git history & properly debug related problems?

I'm trying to remove sensitive data like passwords from my Git history. Instead of deleting whole files I just want to substitute the passwords with

removedSensitiveInfo
. This is what I came up with after browsing through numerous StackOverflow topics and other sites.

git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"


When I run this command it seems to be rewriting the history (it shows the commits it's rewriting and takes a few minutes). However, when I check to see if all sensitive data has indeed been removed it turns out it's still there.

For reference this is how I do the check

git grep aSecretPassword1 $(git rev-list --all)


Which shows me all the hundreds of commits that match the search query. Nothing has been substituted.

Any idea what's going on here?

I double checked the regular expression I'm using which seems to be correct. I'm not sure what else to check for or how to properly debug this as my Git knowledge quite rudimentary. For example I don't know how to test whether 1) my regular expression isn't matching anything, 2) sed isn't being run on all files, 3) the file changes are not being saved, or 4) something else.

Any help is very much appreciated.

P.S.
I'm aware of several StackOverflow threads about this topic. However, I couldn't find one that is about substituting words (rather than deleting files) in all (ASCII) files (rather than specifying a specific file or file type). Not sure whether that should make a difference, but all suggested solutions haven't worked for me.

Answer

git-filter-branch is a powerful but difficult to use tool - there are several obscure things you need to know to use it correctly for your task, and each one is a possible cause for the problems you're seeing. So rather than immediately trying to debug them, let's take a step back and look at the original problem:

  • Substitute given strings (ie passwords) within all text files (without specifying a specific file/file-type)
  • Ensure that the updated Git history does not contain the old password text
  • Do the above as simply as possible

There is a tailor-made solution to this problem:

Use The BFG... not git-filter-branch

The BFG Repo-Cleaner is a simpler alternative to git-filter-branch specifically designed for removing passwords and other unwanted data from Git repository history.

Ways in which the BFG helps you in this situation:

  • The BFG is 10-720x faster
  • It automatically runs on all tags and references, unlike git-filter-branch - which only does that if you add the extraordinary --tag-name-filter cat -- --all command-line option (Note that the example command you gave in the Question DOES NOT have this, a possible cause of your problems)
  • The BFG doesn't generate any refs/original/ refs - so no need for you to perform an extra step to remove them
  • You can express you passwords as simple literal strings, without having to worry about getting regex-escaping right. The BFG can handle regex too, if you really need it.

Using the BFG

Carefully follow the usage steps - the core bit is just this command:

$ java -jar bfg.jar  --replace-text replacements.txt  my-repo.git

The replacements.txt file should contain all the substitutions you want to do, in a format like this (one entry per line - note the comments shouldn't be included):

PASSWORD1 # Replace literal string 'PASSWORD1' with '***REMOVED***' (default)
PASSWORD2==>examplePass         # replace with 'examplePass' instead
PASSWORD3==>                    # replace with the empty string
regex:password=\w+==>password=  # Replace, using a regex

Your entire repository history will be scanned, and all text files (under 1MB in size) will have the substitutions performed: any matching string (that isn't in your latest commit) will be replaced.

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Comments