Martin Wiboe - 26 days ago
Git Question

Why can't I discard local changes caused by a `git clean` filter?

I have a few .resx files in my repository containing string translations for my app. This works well, except for merge conflicts when new strings have been added to the end of the file in separate git branches. KDiff3 doesn't play well with merging XML lists of pairs.

The resx file is basically a list of key/value-pairs with no particular ordering. To avoid merge conflicts, I therefore want to sort the pairs alphabetically prior to committing, and I have used the excellent SortRESX program to do this using a git filter:

git config --global filter.resx.clean SortRESX
git config --global filter.resx.smudge cat


which does the job. However, if I check out an unsorted file it will be sorted immediately by the filter, and git does not allow me to discard those changes -- I'm forced to commit the sorted version of the file before switching branches. How can I discard the changes made by the filter without committing?

Answer

However, if I check out an unsorted file it will be sorted immediately by the filter

This should not happen (I think!). I believe something else is actually happening.

The intent/idea of a clean filter is that it is applied when the file is added to the index, and a smudge filter is applied when the file is extracted from the index into the work-tree (this is why git checkout must first write a file from a commit into the index, before copying it to the work-tree, when you use git checkout <commit> -- <path>). Note that end of line / CRLF transformations are treated as a form of filter (done internally if possible, but done on a pipe from or to your actual user-supplied filter if needed).
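To make the two directions concrete, here is a minimal sketch in a throwaway repository. It assumes `sort` as a stand-in for SortRESX, a hypothetical filter named `demo`, and a plain `strings.txt` in place of a .resx file; none of these names come from the question.

```shell
#!/bin/sh
# Sketch: the clean filter runs on `git add`, the smudge filter on checkout.
# Assumptions: `sort` stands in for SortRESX; "demo" and strings.txt are
# hypothetical names used only for this illustration.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# Define the filter and attach it to the file, local to this repository.
git config filter.demo.clean sort
git config filter.demo.smudge cat
mkdir -p .git/info
echo 'strings.txt filter=demo' > .git/info/attributes

printf 'b=2\na=1\n' > strings.txt
git add strings.txt                      # clean filter runs here
git commit -q -m "sorted by the clean filter"

staged=$(git cat-file -p :strings.txt)   # blob as stored in the index
worktree=$(cat strings.txt)              # file on disk, untouched by `git add`

rm strings.txt
git checkout -q -- strings.txt           # smudge filter (cat) runs here
restored=$(cat strings.txt)

printf 'staged:\n%s\n' "$staged"
printf 'work-tree before checkout:\n%s\n' "$worktree"
printf 'work-tree after checkout:\n%s\n' "$restored"
```

The staged blob comes out sorted while the file on disk stays unsorted until it is checked out again, which is the add-time versus checkout-time split described above.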

(It's possible that there's some code I missed somewhere that runs the clean filter in some extra case. But I don't think so: this part of the Git source is fairly obvious.)

I believe what is happening is more subtle. When Git applies a smudge filter, it automatically marks the cache entry "dirty" in the index. (This code is considerably less obvious, so I could be wrong here.) Because of this marking, when Git goes to check the status of the file, it says: Hmm, this cache entry is marked dirty, I'd best run the clean filter on it and find out for sure. So it runs your clean filter, which sorts the key/value pairs, then compares the result to the underlying blob. These differ, so Git now declares the work-tree entry "truly dirty", even though the original, unsorted work-tree entry actually matches the current commit.

In other words, Git assumes that the equivalent of git cat-file -p <hash-id> | smudge | clean produces the same bits as git cat-file -p <hash-id>, and if it doesn't, you should commit the file, which is actually generally true when you are attempting to normalize line endings as stored in the repository. That doesn't mean that the checked-out copy is sorted; your cat filter (which, incidentally, you can discard: a nonexistent filter means "leave this alone") did not sort the file, and the working tree copy is still unsorted. It's just that Git insists that it should become sorted.
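That round-trip assumption can be checked by hand. The sketch below again assumes `sort` in place of SortRESX and a hypothetical `strings.txt`; it commits an unsorted file with no filter configured, then pipes the stored blob through smudge (`cat`) and clean (`sort`) manually.

```shell
#!/bin/sh
# Sketch: does blob | smudge | clean reproduce the blob?
# Assumptions: `sort` stands in for SortRESX, smudge is `cat`, and
# strings.txt is a hypothetical stand-in for a .resx file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# Commit an unsorted file before any filter exists.
printf 'b=2\na=1\n' > strings.txt
git add strings.txt
git commit -q -m "unsorted file"

blob=$(git rev-parse HEAD:strings.txt)
original=$(git cat-file -p "$blob")
roundtrip=$(git cat-file -p "$blob" | cat | sort)

if [ "$original" = "$roundtrip" ]; then
    roundtrip_matches=yes
else
    roundtrip_matches=no
fi
echo "round trip matches blob: $roundtrip_matches"
```

For a blob that was committed unsorted, the round trip does not match, and that mismatch is what Git ends up reporting as a modification.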

What this means in the end is that the answer to:

How can I discard the changes made by the filter without committing?

is to simply ignore Git's complaints, and check out other commits anyway. You may have to use the --force flag to do this though, which is unsettling at best (and at worst, can cause you to lose changes you intended to keep!). So there's a slightly better (ish) method: temporarily disable the "clean" filter (by editing .gitattributes).

With the filter disabled (or replaced with cat, which does the same thing only slower), Git will now, upon checking status, see that the "dirty" flag is set, and repeat its Hmm, I'd best run the clean filter thing. This time the filter is a no-op, the resulting binary bits match the blob, and Git clears the dirty flag. You can now restore the filter at any point, because now the cache entry is no longer marked dirty, and Git will skip all this testing.
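As a rough sketch of the no-op variant: instead of editing .gitattributes, the example below swaps in `cat` for a single command using git's `-c` option. That is a substitution for the edit described above, not the same procedure. The filter name `demo`, the file `strings.txt`, and `sort` as a stand-in for SortRESX are all assumptions.

```shell
#!/bin/sh
# Sketch: run status once with the clean filter replaced by a no-op.
# Assumptions: `sort` stands in for SortRESX; "demo" and strings.txt are
# hypothetical names. The one-off `git -c` override is used here in place
# of editing .gitattributes.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# Commit an unsorted file before any filter is configured.
printf 'b=2\na=1\n' > strings.txt
git add strings.txt
git commit -q -m "unsorted file"

# Now introduce the filter.
git config filter.demo.clean sort
git config filter.demo.smudge cat
mkdir -p .git/info
echo 'strings.txt filter=demo' > .git/info/attributes

# With the real clean filter, status may list strings.txt as modified.
# Whether it does depends on Git's cached stat data, so it is only printed.
status_with_filter=$(git status --porcelain)

# With the clean filter overridden by `cat`, the content matches the blob.
status_without_filter=$(git -c filter.demo.clean=cat status --porcelain)

echo "with clean filter:   [$status_with_filter]"
echo "with no-op override: [$status_without_filter]"
```

Whether a later plain git status stays quiet once the filter is back depends on the cached index state described above, so the sketch does not assert it.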

(It might be nice to have a way to get Git to try two comparisons before declaring a file "truly dirty": one using the actual, configured clean filter, and then if that says "dirty", one more time using no filter. That would automatically decide that work-tree files based on an "uncleaned" in-repo blob, but which ultimately match that blob anyway, are in fact "not dirty". Of course this would mean you would not be encouraged to fix your line endings, but if this were a user-defined switch, you could set it for old repositories containing unclean objects, the same way you can set merge.renormalize for such repositories.)