Xendar Xendar - 21 days ago 14
Git Question

Git log ^ algorithm

I am using the "git log ^" command in order to make sure than between two branches (let's say current release and previous release) there is no commit in the previous release that is not part of the current release (we use a record number in the commit message, which is the base for comparison).

I was basically wondering what is the algorithm behind the command. Does anyone know how it works?

I suppose that you can get all the commits of both branches and compare them one by one. However, in case the history is quite long this can be a long process.

But I think there is a smarter way to do this. The idea would be to find the common ancestor, but not sure if this is feasible and how it could be done.

If someone can enlighten me it would be great :)

Answer

From comments, you have noted that:

I was indeed speaking of the caret

and:

.. I should have precised that my purpose would be to do this job using Bitbucket API later on (instead of standard git commands) ...

That will make your job quite difficult, unless the part of the API you use is git clone. :-) Exactly how difficult depends on whether you can get away with just the information about each commit object, or whether you need the equivalent of git cherry.

Kevin's comment is correct: what git log or git rev-list do, to implement the ^branch_exclude branch_include or branch_exclude..branch_include syntaxes from gitrevisions, is to run a breadth and/or depth first search on the commit graph. (Based on the code in revision.c, it's mostly breadth-first.)

This, of course, requires constructing the commit graph, or at least enough of it.

Internally, within Git, each commit is named by its hash ID. Each commit consists of a small text object, which includes the commit's parent hash IDs (one per parent). Those two parts, plus a starting point such as the ID of the HEAD commit or the ID from some branch name, are all we need to do the simplest kinds of traversal, i.e., the standard git rev-list <startpoint> graph walk.

Note that most commits have just one parent: these are ordinary commits. At least one commit in the graph has no parents, and is a "root" commit. The very first commit made in the repository is always a root commit (you can make more roots with git checkout --orphan, or using Git's plumbing commands). Some commits are merges: these have two or more parents.

Simple graph walks

There are many graph walk algorithms (see various books by Sedgewick, Aho, and of course Knuth) but Git uses a very simple one to start with: keep an in-memory data structure (struct commit) for every commit encountered so far, and flag the object once it is "seen". To walk a graph, given some commit hash H, we put H on a queue of "commits to visit". Then, in not-quite-C:

while (there are commits in the queue) {
    struct commit *c = lookup_by_id(remove_first_id_from_queue());
    if (c->flags & SEEN)
        ... we already saw this, so do nothing ...
    else {
        ... print this commit ...
        c->flag |= SEEN; /* now we've seen it! */
        for (h in all C's "parent" hashes)
            append_hash_to_queue(h);
    }
}

The queue is what handles merge commits, and makes this a breadth first search. When we start with an empty queue, and put an ordinary commit into it, we visit that commit and add that commit's single parent to the queue, then immediately remove the parent from the queue and visit the parent, adding its parent to the queue, and so on. This just walks linearly. The process stops when we hit the root commit. But when we hit a merge commit, we add both parents to the queue, then begin walking both "sides", taking turns visiting "first parent" and queueing its parent(s), then "second parent" and queueing its parents, and so on. At some point we will queue a commit that is already in the queue, or has already been seen. When we reach that re-queued or re-seen commit later, we simply skip it.

To implement "commits reachable from identifier include but not from identifier exclude", we could use this same algorithm, but modified: first we walk all commits find-able from exclude, setting the SEEN flag but not printing anything; then we put the hash for include in the queue, walk again, and print the commits we visit this time that we have not already visited via the exclusion.

The first walk, to exclude commits, is what we might call a "negative walk", and the second, to include commits, is what we might call a "positive walk".

(This algorithm is very sub-optimal and hence not what Git uses. If you read the code you will see that specifically-excluded commits actually get UNINTERESTING | BOTTOM flags set, while commits excluded via walks get UNINTERESTING set. There's some interesting (ahem) code in the relevant_commit function and its callers which helps to prune off wide swaths of the graph during the negative walk, without having to walk past them. The trick is that we must know whether there are multiple ways to reach the point we'd like to prune right now. If not, it's safe to mark the one commit, which will avoid traversing it and all its parents during the positive walk.)

Git cherry

What git cherry (or git rev-list --cherry-mark, or any of the various similar items) does is more complicated than just the simple walks above. To implement git cherry or equivalent, we need not just the commit graph, but also the remaining objects in the repository, because now we would like to mark commits that are "in" one set but not "in" another set.

Before we even get there, we need to define the symmetric difference, which in Git syntax is denoted A...B (three dots instead of two). The symmetric difference is, at its core, "include commits that are reachable from either the left side name, or from the right side name, but not from both names." Again, we could use a very simple graph walk algorithm (though Git doesn't, again because this one is too inefficient): do a walk from A, marking each "seen" commit with a "seen on left" flag. Then, do a walk from B, marking each newly seen commit with a "seen on right" flag, separate from the "seen on left" flags (so that we mark all commits that are reachable from both names, with both flags).

Now we do one final traversal of all our stored commit objects (not from the two names, just over the struct commit internal objects), and print only the "seen on left but not on right" commits as left-side commits, and print only the "seen on right but not on left" commits as right-side commits. This gives us the symmetric difference.

To implement git cherry, though, we need to go quite a bit further. For each single-parent commit, we obtain a changeset. (We completely ignore merges, generally, although it's possible to turn a merge into a changeset against one of its parents.) The changeset is basically the result of git diff <parent-id> <commit-id>.

Now we have the hard part: each changeset can be turned into a hash ID, the same way Git turns all Git-repository objects into hash IDs. If we do this carefully, in a way that's not sensitive to line numbering (and ignores the commit message, and author and date and so on), we wind up with what Git calls a patch ID.

To implement git cherry, we mark all the ordinary commits on each "side" with their patch IDs. Then we can easily check, while printing out each commit, whether each left-side patch ID is in the set of all right-side patch IDs, and vice versa. This allows us to find "missing" commits, even if the commit was copied (via git cherry-pick or git rebase or similar).

(To see the algorithm Git actually uses for symmetric differences, see the source. The UNINTERESTING flag is augmented with another flag, BOTTOM, and "boundary" commits—those where the two sides meet, permitting some parent chains to be avoided—get marked with yet another flag, the BOUNDARY flag. You can view these commits by adding --boundary to the flags you give to git rev-list. I find that --boundary is not as useful as it might seem, because breadth first searches from multiple starting points can have "extra" boundaries that would not be necessary in searches done in other orders.)

(Internally, the BOTTOM flag is also used for --ancestry-path, along with a list of "negative references", i.e., the negation in ^X marks the commit to which X refers as both a negative reference and as a "bottom commit", which then goes on this list. The rev-list walk code can then exclude a commit that does not have all of these "bottom" commits as ancestors.)

Comments