meshy - 21 days ago
Git Question

Add original hash to commit on git rebase (with new root)

I have a codebase that used to be managed with SVN, but is now managed with git. When the code was migrated to git, the history was lost.

I have managed to recover the SVN history, and am now trying to git-rebase the more recent commits over the top.

I have two branches: git-commits, which contains the commits since the migration to git, and svn-commits, which contains the older history. Each branch contains over 3000 commits.

I have found that the following command builds the new history on top of the old (albeit with some manual merge conflict handling):

git rebase git-commits --root --onto svn-commits --preserve-merges


Several of the commits reference commit hashes, and I am aware that these would change when the rebase is done. So that this information is not lost forever, I would like to add the original commit hash of each commit to the newly-rebased commit's message.

This would mean that an original commit like this:

commit aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Author: Boaty McBoatface <boaty@example.com>
AuthorDate: Wed Jul 27 00:00:00 1938 +0000
Commit: Boaty McBoatface <boaty@example.com>
CommitDate: Wed Jul 27 00:00:00 1938 +0000

Reticulate splines

The splines had been derezzed, and needed to be reticulated.


Would become something like

commit bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Author: Boaty McBoatface <boaty@example.com>
AuthorDate: Wed Jul 27 00:00:00 1938 +0000
Commit: Meshy <meshy@example.com>
CommitDate: Wed Nov 16 10:23:31 2016 +0000

Reticulate splines

The splines had been derezzed, and needed to be reticulated.

Original hash: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa


Is this possible? Perhaps with git-filter-branch?

Answer

First, a note: be sure you really want to do this, since git replace (mentioned briefly below) can be used to stitch together the histories in a way that preserves the IDs. It has its own drawbacks too, of course; search for reports from people who have used it.


Yes, you can do this with git filter-branch.

You might, though, want to combine the "rebase new commits atop new conversion" step with the "... and then edit all the new commits to also contain their old IDs" step, because rebase works by copying commits, and filter-branch works by ... copying commits. :-)

All Git commands that do this kind of thing must copy, since the hash ID of each commit is a function of the commit's contents. If the new commit is different from the original commit in any way, it gets a new, different ID.
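To see this in action, here is a minimal sketch in a throwaway repository (the file name, messages, and demo identity are made up for illustration). It rewords a commit without touching its files: the tree ID stays the same, but the commit ID changes.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q

echo spline > splines.txt
git add splines.txt
git commit -q -m 'Reticulate splines'
old_id=$(git rev-parse HEAD)
old_tree=$(git rev-parse 'HEAD^{tree}')

# Change only the message; the snapshot (tree) is untouched.
git commit -q --amend -m 'Reticulate splines (reworded)'
new_id=$(git rev-parse HEAD)
new_tree=$(git rev-parse 'HEAD^{tree}')

echo "commit: $old_id -> $new_id"
echo "tree:   $old_tree -> $new_tree"
```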

The differences between git rebase and git filter-branch lie in which commits are copied and how the copying is performed.

Rebase, when done without --preserve-merges, works by selecting a list of non-merge commits, turning each such commit into a changeset (via subtraction, more or less: child minus parent = delta from parent to child), then adding this delta to the --onto point or to the commits-added-so-far.

When you use --preserve-merges, rebase still selects a list of non-merge commits. Then, where there was a merge commit, rebase re-performs the merge (which is why you must resolve merge conflicts all over again). It must re-merge, because the new base may result in a different merge, and because merges cannot be turned into a single changeset ("child - parent" gives you one delta, but there are at least two parents, hence at least two deltas, and in the general case we cannot preserve both).

Filter-branch uses an entirely different approach. The commits to be filtered are selected regardless of whether they are merges or not. (The actual selection is done by running git rev-list, which is the "plumbing" equivalent of git log.) This complete list of commit IDs is placed into a pile: a sorted, topological-order pile stored in an ordinary file, so that parent commits are always processed before their children.
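The ordering can be previewed by hand. This is only a sketch of the idea: the exact options filter-branch passes to git rev-list are an implementation detail, and the throwaway repository and commit messages below are invented for the demonstration.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q
for n in 1 2 3; do
  git commit -q --allow-empty -m "commit $n"
done

# Oldest first, with every parent listed before its children.
ordered=$(git rev-list --reverse --topo-order HEAD)
printf '%s\n' "$ordered"
```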

Then, for each ID in the list:

  • Extract the original commit a la git checkout, into a temporary tree that has no underlying Git repo.

  • Apply the tree filter to modify the tree. (This modification runs in the temporary directory that holds the temporary tree. That part trips up a lot of people doing their first tree-filter, when they try to access a file like ../../fixed-version. The relative path fails because the temporary tree is not in the repository at all.)

  • Reconstruct a new set of Git tree-and-blob-objects representing the new tree, i.e., the new commit snapshot.

  • Apply the commit message filter to the message.

  • Apply the commit environment filter to the remaining commit metadata (author and committer stuff).

  • Make a new commit using the new message and new tree. Or, if you supply a commit filter, use it to make-or-don't-make the commit; and you can also modify the new commit's parent(s) at this point, using the parent filter.

  • Last, record a pairing: "old commit <oldhash> became new commit <newhash>." (If you skip a commit using a commit filter, the old hash maps instead to its corresponding new ancestor, i.e., the parent that you didn't skip.) This pairing is a map.

This process is extremely slow due to the extract + tree-filter + rebuild part. Therefore, if you don't use a tree filter, git filter-branch skips this part: it's just going to get the original tree back anyway. To let you modify the new commit's contents anyway, filter-branch also lets you specify an index filter (commits always work from the index anyway, so the extract+modify+rebuild just updates the index; if we can update in place, that's much faster). But—here's the key point—for your purposes you don't need to do anything at all to each tree. All you want is to modify the parentage! This will let you preserve your original merges and their source trees, with no re-merging.

Note that the --commit-filter description talks about the map convenience function (shell function). This "map" function uses the map I mentioned above. The default is to automatically map to the new parent of the new copied commit.

Finally, after copying all the commits—and, if you provide a --tag-name-filter, also copying annotated tags and mapping the copies (so if you do have annotated tags, you do want a --tag-name-filter cat here)—the filter-branch command rewrites some references, i.e., branch and tag names. The original references, which will still point to the original commits (and annotated tag objects), are dumped into the refs/original/ name-space. (This must be empty at the start of the process unless you use --force.) The rewritten references point into the new copies. The rewrite uses the same mapping technique, so that if there are skipped commits, the names now point to the retained ancestor commits.

("Some" references? Wait, which references? The answer is in the documentation, but it's a bit mysterious: it talks about positive references. The arguments get passed to git rev-list so that you can filter a specific range of commits, e.g., branch~30..branch or branch ^otherbranch. The "positive" references are the ones that actively select commits, while the "negative" references are the ones that limit commits, so for branch ^otherbranch we have one positive reference, branch, and one negative, the not-otherbranch part. So this rewrites only refs/heads/branch and not refs/heads/otherbranch.)
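A small sketch of the positive/negative distinction, using git rev-list directly in a throwaway repository (the branch names match the example above; the commits are invented). Only the commits reachable from the positive reference and not from the negative one are selected.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/otherbranch
git commit -q --allow-empty -m 'shared 1'
git commit -q --allow-empty -m 'shared 2'
git checkout -q -b branch
git commit -q --allow-empty -m 'only on branch 1'
git commit -q --allow-empty -m 'only on branch 2'

# "branch" is the positive reference, "^otherbranch" the negative one.
selected=$(git rev-list --count branch ^otherbranch)
echo "commits selected: $selected"
git log --format=%s branch ^otherbranch
```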

That was a lot of verbiage, but ... how?

The reason to explain all of the above is to point out how simple the transplant process is, when using git filter-branch, and then to show how to access the map.

First, we only need to explicitly replace one single parent ID. Specifically, we want the parent of the root commit in git-commits to become the existing tip commit of svn-commits:

$ git rev-parse svn-commits
9999999999999...

(that's the desired new parent), and:

$ git rev-list --max-parents=0 git-commits
11111111111111...

(that's the root commit—with any luck there is only one, otherwise, now what?).
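Checking the "only one root" assumption up front is cheap. A sketch, run here against a throwaway repository with a stand-in git-commits branch (in the real repository you would skip the setup lines and run only the last part):

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/git-commits
git commit -q --allow-empty -m 'git: first commit'
git commit -q --allow-empty -m 'git: second commit'

# List every parentless commit reachable from git-commits and count them.
roots=$(git rev-list --max-parents=0 git-commits)
root_count=$(printf '%s\n' "$roots" | wc -l | tr -d ' ')
if [ "$root_count" -eq 1 ]; then
  echo "single root: $roots"
else
  echo "found $root_count roots; each needs its own reparenting decision" >&2
fi
```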

So, we would want a parent filter that says: "if this is commit 1111111... then echo 9999999..., else just echo the arguments back". The default parent arguments are on stdin, as a series of -p <id>s, with the IDs already mapped. Of course, an existing root has no parents, so stdin will have no contents for the one commit we want to change here. Hence:

--parent-filter 'if [ "$GIT_COMMIT" = 11111... ]; then
  echo "-p 999999..."; else cat; fi'

This part of the filter-branch will accomplish our re-parenting. Note that unlike git rebase, all the trees are simply retained intact. We never convert a snapshot to a delta here, we just take it as-is. This means there is no need to re-resolve merge conflicts.
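The parent filter can be tried outside Git first. The sketch below assumes that running the fragment with sh -c, with GIT_COMMIT in the environment and the parent list on stdin, is a fair approximation of how filter-branch evaluates it. All four hash IDs are hypothetical placeholders.

```shell
set -eu
old_root=1111111111111111111111111111111111111111    # hypothetical old root
new_parent=9999999999999999999999999999999999999999  # hypothetical svn-commits tip

# Expand the two IDs now; leave $GIT_COMMIT for the filter to expand later.
parent_filter="if [ \"\$GIT_COMMIT\" = $old_root ]; then echo \"-p $new_parent\"; else cat; fi"

# Old root: stdin is empty, so the filter supplies the new parent.
for_root=$(GIT_COMMIT=$old_root sh -c "$parent_filter" </dev/null)

# Any other commit: the already-mapped parent list passes through unchanged.
for_other=$(printf '%s\n' '-p 2222222222222222222222222222222222222222' |
  GIT_COMMIT=3333333333333333333333333333333333333333 sh -c "$parent_filter")

echo "root:  $for_root"
echo "other: $for_other"
```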

(Side note: you can actually use the name svn-commits in place of the hard-coded 99999... here. You could use a name in place of the hard-coded 11111... as well but we don't have a name. Also, looking up the name each time will add a tiny bit of delay to the filtering. For the one re-parenting to svn-commits, that's one tiny delay; for testing whether this is the old root, though, that would be one tiny delay times 3000 commits.)

(Second side note: you can also do this reparenting via "grafts" or its more modern version, git replace. If a graft or replacement is in force when you run filter-branch, that graft or replacement becomes permanent, since Git simply copies the commits as instructed, with the instructions also following the replacement.)
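For comparison, here is a sketch of the git replace route on its own, in a throwaway repository with stand-in branches (all commits are invented). The branch tip keeps its original ID, yet history walks now continue into svn-commits.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/svn-commits
git commit -q --allow-empty -m 'svn 1'
git commit -q --allow-empty -m 'svn 2'
git checkout -q --orphan git-commits
git commit -q --allow-empty -m 'git 1'
git commit -q --allow-empty -m 'git 2'

root=$(git rev-list --max-parents=0 git-commits)
tip_before=$(git rev-parse git-commits)

# Record a replacement for the old root that has the svn-commits tip as parent.
git replace --graft "$root" "$(git rev-parse svn-commits)"

tip_after=$(git rev-parse git-commits)
total=$(git rev-list --count git-commits)
echo "tip unchanged: $tip_before = $tip_after"
echo "commits now reachable from git-commits: $total"
```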

That still leaves the problem of filtering the commit messages, to add:

Original hash: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

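As shown above, the original hash is in $GIT_COMMIT. The message filter receives the original message on stdin, and whatever it writes to stdout becomes the new message, so we pass the original through with cat and then append our line:

--msg-filter 'cat; echo; echo "Original hash: $GIT_COMMIT"'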
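The message filter can also be tried outside Git. As a sketch, assuming sh -c with GIT_COMMIT in the environment approximates the way filter-branch evaluates the fragment: the filter's stdout becomes the entire new message, so the fragment starts with cat to pass the original text through before appending the trailer. The message text and hash are placeholders.

```shell
set -eu
msg_filter='cat; echo; echo "Original hash: $GIT_COMMIT"'

# Feed a sample message on stdin, as filter-branch would.
new_message=$(printf 'Reticulate splines\n\nThe splines had been derezzed.\n' |
  GIT_COMMIT=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa sh -c "$msg_filter")

printf '%s\n' "$new_message"
```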

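If we wanted to be fancy, we could even use that map convenience function (although the message filter runs before the new commit is made, so for the commit being processed map just echoes the old ID back):

--msg-filter 'cat; echo; echo "new commit $(map $GIT_COMMIT) \
filtered to reparent original commit $GIT_COMMIT"'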

or something silly like that, but there's no good reason to bother ... unless you want to get really fancy, and see if you can detect old hash IDs in the commit message and rewrite them in place. I'm not sure if this is even a good idea, and won't attempt to provide a bit of shell script for it, but note that all [1] of these filters are "eval"-ed as shell fragments. You can invoke other shell scripts from these eval-ed fragments, just remember that all the filtering is going on in a temporary directory.

Run the filtering on the reference git-commits. Once the filtering is done, refs/heads/git-commits will point to the last copied commit, and refs/original/refs/heads/git-commits will point to the original chain (the one rooted at 11111... in the above examples).


[1] Well, almost all. As the documentation says, "with the notable exception of the commit filter, for technical reasons".


Summary

We need two filters, --parent-filter (or a graft or replacement in force), and --msg-filter. The parent filter says "replace the root of the transplanted copy with the tip of the place we're transplanting onto", and this accomplishes our rebase-without-changing-snapshots. The message filter says "this new commit replaces the commit whose ID we expanded at filtering-time from the variable $GIT_COMMIT".
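Putting the two filters together, here is an end-to-end sketch on a throwaway repository. The branch names match the question; the files, messages, and demo identity are invented. FILTER_BRANCH_SQUELCH_WARNING only silences the advisory pause that newer Git versions add. Try something like this on a scratch clone before touching the real history.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1
cd "$(mktemp -d)"
git init -q

# Stand-in for the recovered SVN history.
git symbolic-ref HEAD refs/heads/svn-commits
echo old > file.txt
git add file.txt
git commit -q -m 'svn: last commit'

# Stand-in for the post-migration history, with its own root.
git checkout -q --orphan git-commits
echo new > file.txt
git add file.txt
git commit -q -m 'git: first commit'
echo newer > file.txt
git commit -q -am 'git: second commit'

new_parent=$(git rev-parse svn-commits)
old_root=$(git rev-list --max-parents=0 git-commits)
old_tip=$(git rev-parse git-commits)

# Reparent the old root onto svn-commits and tag each message with its old ID.
git filter-branch \
  --parent-filter "if [ \"\$GIT_COMMIT\" = $old_root ]; then echo \"-p $new_parent\"; else cat; fi" \
  --msg-filter 'cat; echo; echo "Original hash: $GIT_COMMIT"' \
  -- git-commits >/dev/null

git log --format='%h %s%n%b' git-commits
```

Afterwards, refs/original/refs/heads/git-commits still names the old tip, and comparing `git rev-parse 'git-commits^{tree}'` with the old tip's tree confirms that the snapshots were carried over unchanged.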