Matthew Matthew - 2 months ago 10
Git Question

Detailed reason why remote git rebase is so evil

So I come from a centralized VCS background and am trying to nail down our workflow in Git (new company, young code base). One question I can't find a simple yet detailed answer to is what exactly a rebase on a remote branch does. I understand it rewrites the history, and in general should be limited to local branches only.

The workflow I'm currently trying to vet out involves a remote collaboration branch, each dev "owning" one for the purpose of sharing code. (With 2 developers, and at most 3 in the foreseeable future, a feature branch for each project & feature request seems excessive: more overhead than benefit gained.)

Then I came across this answer and tried it and it accomplished what I'd like - a dev commits and pushes often to his own collab branch, when he knows what is approved to be released to staging he can rebase remotely (to squash and perhaps reorganize) before merging into develop.

Enter the original question: if the remote branch is for the purpose of collaboration, someone else is bound to pull it sooner or later. If it is a process/training issue to not have the 'guest developer' commit to that collab branch, what actually happens when the branch owner rebases that remote branch?

Answer

It's not really evil, it's a matter of implementations and expectations.

We start with a tangle of facts:

  • Every Git hash represents some unique object. For our purposes here we need only consider commit objects. Each hash is the result of applying a cryptographic hash function (for Git, specifically, it's SHA-1) to the contents of the object. For a commit, the contents include the ID of the source tree; the name and email address and time/date-stamp of the author and committer; the commit message; and most crucially here, the ID of the parent commit.

  • Changing even just a single bit in the content results in a new, very-different hash ID. The cryptographic properties of the hash function, which serve to authenticate and verify each commit (or other object), also mean that it is, in practice, infeasible for some different object to end up with the same hash ID. Git counts on this for transferring objects between repositories, too.

  • Rebase works (necessarily) by copying commits to new commits. Even if nothing else changes—and usually, the source code associated with the new copies differs from the original source code—the whole point of the rebase is to re-parent some commit chain. For instance, we might start with:

    ...--o--*--o--o--o   <-- develop
             \
              o--o       <-- feature
    

    where branch feature separates from branch develop at commit *, but now we would like feature to descend from the tip commit of develop, so we rebase it. The result is:

    ...--o--*--o--o--o        <-- develop
             \        \
              \        @--@   <-- feature
               \
                o--o          abandoned [used to be feature, now left-overs]
    

    where the two @s are copies of the original two commits.

  • Branch names, like develop, are just pointers pointing to a (single) commit. The things we tend to think of as "a branch", like the two commits @--@, are formed by working backwards from each commit to its parent(s).

  • Branches are always expected to grow new commits. It's perfectly normal to find that develop or master has some new commits added on, so that the name now points to a commit—or the last of many commits—that points back to where the name used to point.

  • Whenever you get your Git to synchronize (to whatever degree) your repository with some other Git and its other repository, your Git and their Git have an exchange of IDs—specifically, hash IDs. Exactly which IDs depends on the direction of the transfer, and any branch names you ask your Git to use.

  • A remote-tracking branch is actually an entity that your Git stores, associated with your repository. Your remote-tracking branch origin/master is, in effect, your Git's place to remember "what the Git at origin said his master was, the last time we talked."
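The first three facts can be checked with a few lines of Python. This is only a sketch: it reproduces the way Git computes SHA-1 object IDs (the object type, the body length, a NUL byte, then the body), but the identity string, the commit messages, and the use of the empty tree for every commit are placeholders chosen for illustration, and the helper names object_id and commit_id are hypothetical.

```python
import hashlib

def object_id(kind, body):
    """SHA-1 object ID as Git computes it: sha1(b"<kind> <size>\\0" + body)."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def commit_id(tree, parent, who, message):
    """Build a minimal commit body and hash it. The parent ID is part of the body."""
    lines = ["tree " + tree]
    if parent is not None:
        lines.append("parent " + parent)
    lines.append("author " + who)
    lines.append("committer " + who)
    body = "\n".join(lines) + "\n\n" + message + "\n"
    return object_id("commit", body.encode())

TREE = object_id("tree", b"")                      # Git's well-known empty tree
WHO = "A Dev <dev@example.com> 1500000000 +0000"   # placeholder identity

fork = commit_id(TREE, None, WHO, "fork point (*)")
develop_tip = commit_id(TREE, fork, WHO, "tip of develop")

original = commit_id(TREE, fork, WHO, "feature work")       # an "o" commit
copy = commit_id(TREE, develop_tip, WHO, "feature work")    # its "@" copy

# Same tree, same author, same message; only the parent differs.
print(original != copy)   # True
```

Because the parent ID is hashed along with everything else, a rebased commit cannot keep its old ID, and every commit stacked on top of it gets a new ID as well.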

So, now we take these seven items, and look at how git fetch works. You might run git fetch origin, for instance. At this point, your Git calls up the Git on origin and asks it about its branches. They say things like master = 1234567 and branch = 89abcde (though the hash values are all exactly 40 characters long, rather than these 7-character ones).

Your Git may already have these commit objects. If so, we are nearly done! If not, it asks their Git to send those commit objects, and also any additional objects your Git needs to make sense of them. The additional objects are any files that go with those commits, and any parent commit(s) those commits use that you do not already have, plus the parents' parents, and so on, until we get to some commit object(s) that you do have. This gets you all the commits and files you need for any and all new history.[1]

Once your Git has all the objects safely stored away, your Git then updates your remote-tracking branches with the new IDs. Their Git just told you that their master is 1234567, so now your origin/master is set to 1234567. The same goes for their branch: it becomes your origin/branch and your Git saves the 89abcde hash.
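As a rough model of those two steps (copy the missing objects, then move the remote-tracking names), here is a sketch in Python. It is not how Git's transfer protocol works internally: commits are modelled as a plain dictionary from ID to parent IDs, files are left out, and the function name fetch and the commit called "root" are assumptions made for the example.

```python
def fetch(local_objects, remote_objects, remote_refs, remote_name="origin"):
    """Copy missing commits into local_objects, then return the new
    remote-tracking names. Objects are {commit_id: [parent_ids]}."""
    for tip in remote_refs.values():
        todo = [tip]
        while todo:
            cid = todo.pop()
            if cid in local_objects:
                continue              # we have it, hence its ancestry too
            local_objects[cid] = list(remote_objects[cid])
            todo.extend(remote_objects[cid])
    # Only once the objects are stored do the remote-tracking names move.
    return {remote_name + "/" + name: tip for name, tip in remote_refs.items()}

# Their repository: master = 1234567 and branch = 89abcde, both on "root".
theirs = {"root": [], "1234567": ["root"], "89abcde": ["root"]}
ours = {"root": []}   # we obtained "root" in some earlier fetch

tracking = fetch(ours, theirs, {"master": "1234567", "branch": "89abcde"})
print(tracking)       # {'origin/master': '1234567', 'origin/branch': '89abcde'}
print(sorted(ours))   # ['1234567', '89abcde', 'root']
```

Note that nothing here touches your own branch names: the fetch only adds objects and updates the origin/* names.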

If you now git checkout branch, your Git uses origin/branch to make a new local label, pointing to 89abcde. Let's draw this:

...--o--*--o--1   <-- master, origin/master
         \
          o--8    <-- branch, origin/branch

(I've shortened 1234567 to just 1 here, and 89abcde to just 8, to get them to fit better.)

To make things really interesting, let's make our own new commit on branch, too. Let's say it gets numbered aaaaaaa...:

...--o--*--o--1    <-- master, origin/master
         \
          o--8     <-- origin/branch
              \
               A   <-- branch

(I shortened aaaaaaa... to just A).

The interesting question, then, is what happens if they—the Git from which you fetch—rebase something. Suppose, for instance, that they rebase branch onto master. This copies some number of commits. Now you run git fetch and your Git sees that they say branch = fedcba9. Your Git checks to see if you have this object; if not, you get it (and its files) and its parent (and that commit's files) and so on until we reach some common point—which will, in fact, be commit 1234567.

Now you have this:

...--o--*--o--1        <-- master, origin/master
         \     \
          \     o--F   <-- origin/branch
           \
            o--8--A    <-- branch

Here I've written F for commit fedcba9, the one origin/branch now points to.

If you come across this later without realizing that the upstream guys rebased their branch (your origin/branch), you might look at this and think that you must have written all three commits in the o--8--A chain, because they're on your branch and not on origin/branch anymore. But the reason they're not on origin/branch is that the upstream abandoned them in favor of the new copies. It's a bit hard to tell that those new copies are, in fact, copies, and that you, too, should abandon those commits.
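One way to see the problem mechanically is to ask whether the old origin/branch is still an ancestor of the new one. The sketch below encodes the last diagram as a parent map and checks this. The labels o1, o2 and o3 for the unnamed commits, and treating * as if it had no parents, are simplifications introduced for the example.

```python
def ancestors(graph, tip):
    """Return tip plus every commit reachable from it through parent links."""
    seen, todo = set(), [tip]
    while todo:
        cid = todo.pop()
        if cid not in seen:
            seen.add(cid)
            todo.extend(graph[cid])
    return seen

# The diagram above as {commit: [parents]}.
graph = {
    "*": [], "o1": ["*"], "1": ["o1"],       # master
    "o2": ["*"], "8": ["o2"], "A": ["8"],    # your branch
    "o3": ["1"], "F": ["o3"],                # the upstream's rebased copies
}
old_origin_branch, new_origin_branch = "8", "F"

# In normal growth the old tip is an ancestor of the new tip.
fast_forward = old_origin_branch in ancestors(graph, new_origin_branch)
print(fast_forward)           # False

# Commits on your branch that are no longer on origin/branch.
only_on_mine = ancestors(graph, "A") - ancestors(graph, new_origin_branch)
print(sorted(only_on_mine))   # ['8', 'A', 'o2']
```

The second result is exactly the misleading picture: three commits look like your own work, although only A is.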


[1] If branches grow in the "normal", "expected" way, it's really easy for your Git and their Git to figure out which commits your Git needs from them: your origin/master tells you where you saw their master last time, and now their master points further down a longer chain. The commits you need are precisely those on their master that come after the tip of your origin/master.

If branches are shuffled around in less-typical ways, it's somewhat harder. In the most general case, they simply have to enumerate all their objects by hash IDs, until your Git tells them that they have reached one you already have. The specific details get further complicated by shallow clones.


It's not impossible

It's not impossible to tell, and since Git version 2.0 or so, there are now built-in tools to let Git figure it out for you. (Specifically, git merge-base --fork-point, which is invoked by git rebase --fork-point, uses your reflog for origin/branch to figure out that the o--8 chain used to be on origin/branch at one point. This only works for the time-period that those reflog entries are retained, but this defaults to at least 30 days, giving you a month to catch up. That's 30 days in your time-line: 30 days from the time you run git fetch, regardless of how long ago the upstream did the rebase.)
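The idea behind the fork-point search can be sketched as follows. This is a deliberately simplified illustration, not Git's actual algorithm: it assumes the reflog is just a newest-first list of commits that origin/branch used to point to, and it picks the first of those that is still in your branch's history. The function name fork_point and the commit labels o1, o2 and o3 are hypothetical.

```python
def ancestors(graph, tip):
    """Return tip plus every commit reachable from it through parent links."""
    seen, todo = set(), [tip]
    while todo:
        cid = todo.pop()
        if cid not in seen:
            seen.add(cid)
            todo.extend(graph[cid])
    return seen

def fork_point(graph, reflog, branch_tip):
    """Newest reflog entry that is part of branch_tip's history, or None."""
    mine = ancestors(graph, branch_tip)
    for old_tip in reflog:        # newest first
        if old_tip in mine:
            return old_tip
    return None

graph = {
    "*": [], "o1": ["*"], "1": ["o1"],
    "o2": ["*"], "8": ["o2"], "A": ["8"],
    "o3": ["1"], "F": ["o3"],
}
reflog = ["F", "8"]   # origin/branch pointed to 8, then moved to F

base = fork_point(graph, reflog, "A")
print(base)                # 8

# Only the commits after the fork point are really yours to copy.
to_replay = ancestors(graph, "A") - ancestors(graph, base)
print(sorted(to_replay))   # ['A']
```

With the fork point known, only A needs to be copied on top of F, and the o--8 chain can be abandoned just as the upstream abandoned it. If the reflog entry for 8 has expired, this sketch returns None and the information is gone.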

What this really boils down to is that if you and your upstream agree, in advance, that some particular set of branch(es) get rebased, you can arrange to do whatever is required in your repository every time they do this. With a more typical development process, though, you won't expect them to rebase, and if they don't—if they never "abandon" a published commit that you have fetched—then there's nothing you need to recover from.