Loren Shqipognja - 1 month ago
Git Question

Git Diff of same files in two directories always results in "renamed"

git diff --no-index --no-prefix --summary -U4000 directory1 directory2

This works as expected in that it returns a diff of all the files between the two directories. Files that are added output as expected, files that are deleted also result in the expected diff output.

However, because the diff takes into account the file path as part of the file name, files with the same name, in the two different directories, result in a diff output with the renamed flag instead of changed.


  1. Is there a way to tell git to not take into account the full file path in the diff and only look at the file name, as if the files were originating from the same directory?

  2. Is there a way for git to actually know if a copy of the same file in a different directory was actually renamed? I don't see how, unless it has a way of comparing the files md5s somehow or something (probably a bad guess lol).

  3. Would using branches instead of directories resolve this issue easily and if so what is the branch version of the command listed above?


Answer

There are multiple questions here, whose answers intertwine. Let's start with rename and copy detection, then move on to branches.

Rename detection

However, because the diff takes into account the file path as part of the file name, files with the same name, in the two different directories, result in a diff output with the renamed flag instead of changed.

This is not quite right. (The text below is meant to address both your items 1 and 2.)

Although you are using --no-index (presumably, to make Git work on directories outside the repository), Git's diff code behaves the same way in all cases. In order to diff (compare) two files in two trees, Git must first determine file identity. That is, there are two sets of files: those in the "left side" or source tree (the first directory name), and those in the "right side" or destination tree (the second directory name). Some files on the left are the same file as some files on the right. Some files on the left are different files that have no corresponding right-side file, i.e., they have been deleted. Finally, some files on the right side are new, i.e., they have been created.

Files that are "the same file" need not have the same path name. In this case, those files have been renamed.

Here's how it works in detail. Note that "full path name" is modified somewhat when using git diff --no-index dir1 dir2: the "full path name" is what is left after stripping off the dir1 and dir2 prefixes.

When comparing the left and right side trees, files that have the same full path names are normally automatically considered "the same file". We place all these files into a queue of "files to be diffed", and none will show up as being renamed. Note the word "normally" here—we'll come back to this in a moment.

This leaves us with two remaining lists of files:

  • paths that exist on the left, but not the right: source without destination
  • paths that exist on the right, but not the left: destination without source

Naïvely, we can simply declare that all of these source-side files have been deleted, and all of these destination files have been created. You can instruct git diff to behave this way: set the --no-renames flag to disable rename detection.
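As a rough illustration of this first pass, here is a minimal Python sketch. It is not Git's code: the function names relative_files and first_pass are made up for this example, and it only models the bookkeeping described above (strip the directory prefix, pair identical relative paths, collect the leftovers).

```python
from pathlib import Path


def relative_files(root):
    """Return every file path under root, with the root prefix stripped."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def first_pass(left_paths, right_paths):
    """Split two sets of relative paths into same-name pairs and leftovers."""
    paired = sorted(left_paths & right_paths)       # same path on both sides
    source_only = sorted(left_paths - right_paths)  # source without destination
    dest_only = sorted(right_paths - left_paths)    # destination without source
    return paired, source_only, dest_only


if __name__ == "__main__":
    # Hypothetical path sets standing in for directory1 and directory2.
    left = {"a/b/c.txt", "README", "old.txt"}
    right = {"a/b/c.txt", "README", "new.txt"}
    paired, source_only, dest_only = first_pass(left, right)
    print("paired:", paired)
    print("source only:", source_only)
    print("destination only:", dest_only)
```

With rename detection disabled, everything in source_only would be reported as deleted and everything in dest_only as created.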

Or, Git can go on to use a smarter algorithm: pass --find-renames, or its short form -M, optionally with a similarity threshold, to do this. In Git versions 2.9 and later, rename detection is on by default.

Now, how shall Git decide that a source file has the same identity as a destination file? They have different paths; which right-side file does a/b/c.txt on the left correspond to? It might be d/e/f.bin, or d/e/f.txt, or a/b/renamed.txt, and so on. The actual algorithm is relatively simple, and in the past did not take the final name component into account (I'm not sure if it does now; Git is constantly evolving):

  • If there are source and destination files whose contents match exactly, pair them. Because Git hashes contents, this comparison is very fast. We can compare left-side a/b/c.txt by its hash ID to every file on the right, simply by looking at all of their hash IDs. Therefore, we run through all source files first, finding destination files that match, putting the new pairs into the diff queue and pulling them out of the two lists.

  • For all remaining source and destination files, run an algorithm that is efficient, but unsuitable for producing git diff output, to compute "file similarity". A source file that is at least <threshold> similar to some destination file causes a pairing, and that file-pair is removed from the lists. The default threshold is 50%: if you have enabled rename detection without choosing a particular threshold, two files that are still in the lists by this point, and are at least 50% similar, get paired.

  • Any remaining files are either deleted or created.
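The three steps above can be sketched in Python roughly as follows. This is only an approximation under stated assumptions: Git uses its own similarity metric, and I am substituting difflib.SequenceMatcher purely for illustration. The names content_hash and detect_renames are hypothetical, and the best-score-first pairing order is my own simplification.

```python
import hashlib
from difflib import SequenceMatcher


def content_hash(text):
    """Hash file contents; equal hashes stand in for 'contents match exactly'."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def detect_renames(sources, dests, threshold=0.5):
    """Pair leftover files. sources and dests map path -> contents (str).

    Returns (pairs, deleted, created); pairs holds (src, dst, score) tuples.
    """
    sources, dests = dict(sources), dict(dests)  # work on copies
    pairs = []

    # Step 1: exact matches, found by comparing hashes only.
    by_hash = {}
    for dst in sorted(dests):
        by_hash.setdefault(content_hash(dests[dst]), []).append(dst)
    for src in sorted(sources):
        candidates = by_hash.get(content_hash(sources[src]))
        if candidates:
            dst = candidates.pop(0)
            pairs.append((src, dst, 1.0))
            del sources[src], dests[dst]

    # Step 2: similarity scoring for whatever is left (stand-in metric, not Git's).
    scored = []
    for src, src_text in sources.items():
        for dst, dst_text in dests.items():
            score = SequenceMatcher(None, src_text, dst_text).ratio()
            if score >= threshold:
                scored.append((score, src, dst))
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    used_src, used_dst = set(), set()
    for score, src, dst in scored:
        if src not in used_src and dst not in used_dst:
            pairs.append((src, dst, score))
            used_src.add(src)
            used_dst.add(dst)

    # Step 3: anything still unpaired is deleted (source) or created (destination).
    deleted = sorted(set(sources) - used_src)
    created = sorted(set(dests) - used_dst)
    return pairs, deleted, created
```

Note that step 2 compares every remaining source against every remaining destination, which is why this part gets expensive as the lists grow.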

Now that we have found all pairings, git diff proceeds to diff the paired, same-identity files, and tells us that deleted files are deleted, and newly-created files are created. If the two path names for same-identity files differ, git diff says the file is renamed.

The arbitrary-file-pairing code is expensive (even though the same-name-gives-a-pair code is very cheap), so Git has a limit on how many names go into these pairing source and destination lists. That limit is configured through git config diff.renameLimit. The default has climbed over the years; the exact value depends on your Git version, so check the diff.renameLimit entry in the git config documentation. You can set it to 0 (zero) to make Git use its own internal maximum at all times.

Breaking pairs

Above, I said that normally, files with the same name are paired automatically. This is usually the right thing to do, so it is Git's default. In some cases, however, the left-side file that is named a/b/c.txt is actually not related to the right-side file named a/b/c.txt; it's really related to the right-side a/doc/c.txt, for instance. We can tell Git to break pairings of files that are "too different".

We saw the "similarity index" used above to form pairings of files. This same similarity index can be used to split files: -B20%/60%, for instance. The two numbers need not add up to 100% and you can actually omit either one, or both: there's a default value for each if you set -B mode.

The first number is the point at which a default-already-paired file can be put into the rename detection lists. With -B20%, if the files are at least 20% dissimilar (i.e., at most 80% similar), the file goes into the "source for renames" list. If it never gets taken as a rename, it can re-pair with its automatic destination, but at this point, the second number, the one after the slash, takes effect.

The second number sets the point at which a pairing is definitely broken. With -B/70%, for instance, if the files are at least 70% dissimilar (i.e., at most 30% similar), the pairing is broken. (Of course, if the file was taken away as a rename source, the pairing is already broken.)
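To make the interaction of the two numbers concrete, here is a small Python sketch of the decision as I have described it. It is a toy model, not Git's implementation: the function name classify_pair and the returned labels are invented for this example, and the defaults of 50% and 60% are the ones the git diff documentation gives for -B.

```python
def classify_pair(dissimilarity, n=0.5, m=0.6, taken_as_rename_source=False):
    """Decide the fate of a same-name pair under -B<n>/<m>.

    dissimilarity is a fraction between 0.0 and 1.0.
    n is the first number (eligibility as a rename source).
    m is the second number (the pairing is definitely broken).
    """
    if dissimilarity < n:
        return "paired"      # never enters the rename detection lists
    if taken_as_rename_source:
        return "broken"      # rename detection consumed the source
    if dissimilarity >= m:
        return "broken"      # too different to re-pair
    return "re-paired"       # not taken as a rename, and below the second number


if __name__ == "__main__":
    # -B20%/60% with a few hypothetical dissimilarity values.
    for d in (0.10, 0.30, 0.70):
        print(d, classify_pair(d, n=0.2, m=0.6))
```

For example, with -B20%/60% a pair that is 30% dissimilar is offered to rename detection, and if nothing claims it, it simply re-pairs with its same-named destination.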

Copy detection

Besides the usual pairing and rename detection, you can ask Git to find copies of source files. After running all the usual pairing code, including finding renames and breaking pairs, if you have specified -C, Git will look for "new" (i.e., unpaired) destination files that are actually copied from existing sources. There are two modes for this. With a single -C, Git considers only source files that are modified. With -C given twice, or with --find-copies-harder, Git considers every source file. Here, a source file counts as "modified" when it is already in the paired queue (if it is not, it is not "modified" by definition) and its corresponding destination file has a different hash ID. This is, again, a very low-cost test, which helps keep a single -C option cheap.
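The difference between the two modes is only in which source files are considered as copy origins. A minimal Python sketch of that selection, assuming we already have the paired queue and a hash per path (the function name copy_source_candidates and the data layout are hypothetical):

```python
def copy_source_candidates(paired, source_hashes, dest_hashes, harder=False):
    """Return the source paths that copy detection would look at.

    paired        -- list of (source_path, dest_path) already in the diff queue
    source_hashes -- dict mapping every source path to its content hash
    dest_hashes   -- dict mapping every destination path to its content hash
    harder        -- True models -C -C / --find-copies-harder
    """
    if harder:
        # Every source file is a candidate, modified or not.
        return sorted(source_hashes)
    # Single -C: only paired sources whose destination hash differs ("modified").
    return sorted(
        src for src, dst in paired if source_hashes[src] != dest_hashes[dst]
    )


if __name__ == "__main__":
    # Hypothetical hashes: a.txt is unchanged, b.txt is modified, c.txt is unpaired.
    source_hashes = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3"}
    dest_hashes = {"a.txt": "h1", "b.txt": "h9", "copy.txt": "h2"}
    paired = [("a.txt", "a.txt"), ("b.txt", "b.txt")]
    print(copy_source_candidates(paired, source_hashes, dest_hashes))
    print(copy_source_candidates(paired, source_hashes, dest_hashes, harder=True))
```

The single -C case needs nothing more than a hash comparison per pair, which is why it stays cheap.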

Branches don't matter

Would using branches instead of directories resolve this issue easily and if so what is the branch version of the command listed above?

Branches make no difference here.

In Git, the term branch is ambiguous. See What exactly do we mean by "branch"? For git diff, though, a branch name simply resolves to a single commit, namely the tip commit of that branch.

I like to draw Git's branches like this:

...--o--o--o   <-- branch1
         \
          o--o--o   <-- branch2

The small round os each represent a commit. The two branch names are simply pointers, in Git: they point to one specific commit. The name branch1 points to the rightmost commit on the top line, and the name branch2 points to the rightmost commit on the bottom line.

Each commit, in Git, points back to its parent or parents (most commits have just one parent, while a merge commit is simply a commit with two or more parents). This is what forms the chain of commits that we also call "a branch". The branch name points directly to the tip of a chain.[1]

When you run:

$ git diff branch1 branch2

all that Git does is resolve each name to its corresponding commit. For instance, if branch1 names commit 1234567... and branch2 names commit 89abcde..., this just does the same thing as:

$ git diff 1234567 89abcde

Git's diff takes two trees

Git does not even care that these are commits, really. Git just needs a left side or source tree, and a right side or destination tree. These two trees can come from a commit, because a commit names a tree: the tree of any commit is the source snapshot taken when you made that commit. They can come from a branch, because a branch-name names a commit, which names a tree. One of the trees can come from Git's "index" (aka "staging area" aka "cache"), as the index is basically a flattened tree.[2] One of the trees can be your work-tree. One or both trees can even be outside of Git's control (hence the --no-index flag).
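The chain "branch name names a commit, which names a tree" can be pictured with a toy Python model. Nothing here touches a real repository: the commit IDs are the abbreviated examples used above, and the tree IDs and the function name resolve_to_tree are made up for illustration.

```python
# Toy lookup tables; the tree IDs are hypothetical placeholders.
BRANCHES = {"branch1": "1234567", "branch2": "89abcde"}
COMMITS = {"1234567": "tree-aaa", "89abcde": "tree-bbb"}


def resolve_to_tree(name, branches=BRANCHES, commits=COMMITS):
    """Follow name -> commit -> tree, accepting a commit or tree ID directly."""
    commit = branches.get(name, name)   # a branch name resolves to its tip commit
    return commits.get(commit, commit)  # a commit names a tree; else assume a tree ID


if __name__ == "__main__":
    # Diffing by branch name or by commit ID ends up comparing the same two trees.
    print(resolve_to_tree("branch1"), resolve_to_tree("branch2"))
    print(resolve_to_tree("1234567"), resolve_to_tree("89abcde"))
```

Whichever spelling you use, the diff machinery ends up with the same pair of trees, which is why switching from directories to branches does not change the pairing behavior.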

Of course, Git can just diff two files

If you run git diff --no-index /path/to/file1 /path/to/file2, Git will simply diff the two files, i.e., treat them as a pair. This bypasses all the pairing and rename-detecting code entirely. If no amount of fiddling with options such as --no-renames, --find-renames (with or without a threshold), and -B does the trick, you can explicitly diff file paths, rather than directory (tree) paths. For a large set of files, this will, of course, be painful.
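If you do go the file-by-file route, a script can take some of the pain out of it. The Python sketch below only builds the command lines, one per explicit file pair, so you can inspect them before running anything. The function name per_file_diff_commands and the example paths are hypothetical; the only Git usage assumed is the git diff --no-index form with two paths.

```python
def per_file_diff_commands(file_pairs):
    """Build one 'git diff --no-index' argument list per (left, right) file pair."""
    commands = []
    for left, right in file_pairs:
        # Two explicit file paths: Git treats them as a pair, with no rename detection.
        commands.append(["git", "diff", "--no-index", "--", left, right])
    return commands


if __name__ == "__main__":
    # Hypothetical pairs that you have decided are "the same file".
    pairs = [
        ("directory1/a/b/c.txt", "directory2/a/doc/c.txt"),
        ("directory1/README", "directory2/README"),
    ]
    for argv in per_file_diff_commands(pairs):
        print(" ".join(argv))
```

Each argument list could then be handed to subprocess.run, with the caveat that you are now responsible for deciding which files belong together.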


[1] There can be more commits past that point, but it's still the tip of its chain. Moreover, multiple names can point to a single commit. I draw this situation as:

...--o--o   <-- tip1
         \
          o--o   <-- tip2, tip3

Note that commits that are "behind" more than one branch name are, in fact, on all of those branches. So both bottom-row commits are on both tip2 and tip3 branches, while both top-row commits are on all three branches. Nonetheless, each branch name resolves to one, and only one, commit.

[2] In fact, to make a new commit, Git simply converts the index, just as it stands right now, into a tree using git write-tree, and then makes a commit that names that tree (and that uses the current commit as its parent, and has an author and committer, and a commit message). The fact that Git uses the existing index is why you must git add your updated work-tree files into the index before committing.

There are some convenience short-cuts that let you tell git commit to add files to the index, e.g., git commit -a or git commit <path>. These can be a bit tricky as they don't always produce the index you might expect. See the --include vs --only options to git commit <path>, for instance. They also work by copying the main index to a new, temporary index, and this can have surprising results, because if the commit succeeds, the temporary index is copied back over the regular index.