Mohsen Laali - 14 days ago
Git Question

Python Git diff parser

I would like to parse git diff output with Python, and I am interested in getting the following information from the diff parser:

  1. Content of deleted/added lines, along with their line numbers.

  2. File name.

  3. Status of the file: whether it was deleted, renamed, or added.

I am using unidiff 0.5.2 for this purpose and I wrote the following code:

from unidiff import PatchSet
import git
import os

commit_sha1 = 'b4defafcb26ab86843bbe3464a4cf54cdc978696'
repo_directory_address = '/my/git/repo'
repository = git.Repo(repo_directory_address)
commit = repository.commit(commit_sha1)
diff_index = commit.diff(commit_sha1+'~1', create_patch=True)
diff_text = reduce(lambda x, y: str(x)+os.linesep+str(y), diff_index).split(os.linesep)
patch = PatchSet(diff_text)
print patch[0].is_added_file

I am using GitPython to generate the Git diff. I received the following error for the above code:

current_file = PatchedFile(source_file, target_file,
UnboundLocalError: local variable 'source_file' referenced before assignment

I would appreciate it if you could help me fix this error.


Finally, I found the solution. The output of GitPython is a little different from the standard git diff output. In standard git diff output the source file line starts with ---, but in GitPython's output it starts with ------, as you can see in the output of the following Python code (this example was generated with the elasticsearch repository):

import git

repo_directory_address = '/your/elasticsearch/repository/address'
revision = "ace83d9d2a97cfe8a8aa9bdd7b46ce71713fb494"
repository = git.Repo(repo_directory_address)
commit = repository.commit(rev=revision)
# Ignore whitespace at the end of lines and blank lines;
# diff_filter='cr' excludes copied and renamed files
diff_index = commit.diff(revision+'~1', create_patch=True, ignore_blank_lines=True,
                         ignore_space_at_eol=True, diff_filter='cr')

print reduce(lambda x, y: str(x)+str(y), diff_index)

Part of the output looks like this:

lhs: 100644 | f8b0ce6c13fd819a02b1df612adc929674749220
rhs: 100644 | b792241b56ce548e7dd12ac46068b0bcf4649195
------ a/core/src/main/java/org/elasticsearch/action/index/
+++ b/core/src/main/java/org/elasticsearch/action/index/
@@ -20,16 +20,18 @@
package org.elasticsearch.action.index;

 import org.elasticsearch.ElasticsearchGenerationException;
+import org.elasticsearch.Version;
 import org.elasticsearch.action.ActionRequestValidationException;
 import org.elasticsearch.action.DocumentRequest;
 import org.elasticsearch.action.RoutingMissingException;
 import org.elasticsearch.action.TimestampParsingException;
 import org.elasticsearch.client.Requests;
+import org.elasticsearch.cluster.metadata.IndexMetaData;
 import org.elasticsearch.cluster.metadata.MappingMetaData;
 import org.elasticsearch.cluster.metadata.MetaData;
 import org.elasticsearch.common.Nullable;
-import org.elasticsearch.common.UUIDs;
+import org.elasticsearch.common.Strings;
 import org.elasticsearch.common.bytes.BytesArray;
 import org.elasticsearch.common.bytes.BytesReference;

As you can see, the source file line in this output starts with ------. To fix the problem, you need to edit the source file regular expression of unidiff 0.5.2, which you can find in /unidiff/, from:

RE_SOURCE_FILENAME = re.compile(
                      r'^--- (?P<filename>[^\t\n]+)(?:\t(?P<timestamp>[^\n]+))?')

to:

RE_SOURCE_FILENAME = re.compile(
                   r'^------ (?P<filename>[^\t\n]+)(?:\t(?P<timestamp>[^\n]+))?')
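An alternative to editing the installed library is to rewrite the ------ lines back to --- before the text reaches the parser. The following is only a minimal sketch of that idea, not something I have run against unidiff itself: the function name normalize_source_headers and the sample path a/core/Example.java are made up for illustration, and the check on the following +++ line is a heuristic that could still misfire on unusual removed lines.

```python
def normalize_source_headers(diff_text):
    """Return the diff as a list of lines, with '------ path' headers
    rewritten to the standard '--- path' form.

    Hypothetical helper for illustration; not part of unidiff or GitPython.
    """
    lines = diff_text.splitlines()
    fixed = []
    for index, line in enumerate(lines):
        next_line = lines[index + 1] if index + 1 < len(lines) else ''
        # Only rewrite a line that is directly followed by a '+++ ' header,
        # so that ordinary removed lines are left alone.
        if line.startswith('------ ') and next_line.startswith('+++ '):
            line = line[3:]
        fixed.append(line)
    return fixed


# Made-up sample shaped like the output shown above.
sample = "\n".join([
    "lhs: 100644 | f8b0ce6c13fd819a02b1df612adc929674749220",
    "rhs: 100644 | b792241b56ce548e7dd12ac46068b0bcf4649195",
    "------ a/core/Example.java",
    "+++ b/core/Example.java",
    "@@ -1,2 +1,2 @@",
    " unchanged",
    "-old",
    "+new",
])
cleaned = normalize_source_headers(sample)
```

The resulting list of lines could then be handed to PatchSet in the same way as the split diff text in the question, which avoids changing the regular expression inside the library.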

PS: If the source file is renamed, GitPython generates a diff that starts with ---. This does not throw an error here, though, because I filtered renamed files out of the diff (diff_filter='cr').
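For the first item in the question (content and line numbers of added/deleted lines), the numbers come from the hunk header, for example @@ -20,16 +20,18 @@ in the output above. The sketch below shows that bookkeeping with the standard library only. It is not how unidiff does it internally, the name changed_lines is made up, and it assumes it is given the hunks of a single file without the ---/+++ header lines.

```python
import re

# Matches headers such as '@@ -20,16 +20,18 @@' and captures both start lines.
HUNK_HEADER = re.compile(r'^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@')


def changed_lines(hunk_lines):
    """Yield (kind, line_number, content) for each added or removed line.

    Hypothetical helper: expects the hunk lines of one file only.
    """
    old_no = new_no = None
    for line in hunk_lines:
        header = HUNK_HEADER.match(line)
        if header:
            # Restart both counters at the positions given by the header.
            old_no, new_no = int(header.group(1)), int(header.group(2))
        elif old_no is None or line.startswith('\\'):
            # Skip text before the first hunk and
            # '\ No newline at end of file' markers.
            continue
        elif line.startswith('+'):
            yield ('added', new_no, line[1:])
            new_no += 1
        elif line.startswith('-'):
            yield ('removed', old_no, line[1:])
            old_no += 1
        else:
            # A context line exists in both versions of the file.
            old_no += 1
            new_no += 1


# Made-up hunk for illustration.
hunk = [
    "@@ -20,3 +20,4 @@",
    " import a;",
    "+import b;",
    "-import c;",
    "+import d;",
    " import e;",
]
result = list(changed_lines(hunk))
```

Added lines are numbered in the new version of the file and removed lines in the old version, which is why the two counters advance separately.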