Lazer Lazer - 1 year ago 80

How does Git save space and is fast at the same time?

I just saw the first Git tutorial at

How does Git store all the versions of all the files, and how can it still be more economical in space than Subversion, which saves only the latest version of the code?

I know this can be done using compression, but that would come at the cost of speed, and yet the tutorial also says that Git is much faster (though its biggest gain comes from the fact that most of its operations are offline).

So, my guess is that

  • Git compresses data extensively

  • It is still faster because
    uncompression + work
    is still faster than
    network_fetch + work

Am I correct? Even close?

Answer Source

I assume you are asking how it is possible for a git clone (full repository + checkout) to be smaller than checked-out sources in Subversion. Or did you mean something else?

This question is answered in the comments

Repository size

First, you should take into account that alongside the checkout (the working version), Subversion stores a pristine copy (the last version) in those .svn subdirectories. The pristine copy is stored uncompressed in Subversion.

Second, git uses the following techniques to make repository smaller:

  • each version of a file is stored only once; this means that if you have only two different versions of some file in 10 revisions (10 commits), git stores only those two versions, not 10.
  • objects (and deltas, see below) are stored compressed; text files used in programming compress really well (around 60% of original size, or 40% reduction in size from compression)
  • after repacking, objects are stored in deltified form, as a difference from some other version; additionally, git tries to order delta chains in such a way that the delta consists mainly of deletions (in the usual case of growing files, this means recency order, newest first); IIRC deltas are compressed as well.
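
The first two points can be sketched in a few lines of Python. This is only an illustration, not Git's actual code: the blob id is computed the way Git's SHA-1 object format does it (SHA-1 over a "blob <size>\0" header plus the content), but ToyObjectStore is a hypothetical in-memory stand-in for the object database, and deltas are left out entirely.

```python
import hashlib
import zlib


def git_blob_id(content):
    """Object id of a blob in Git's SHA-1 format: sha1(b"blob <size>\\0" + content)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory object store keyed by content hash (illustration only)."""

    def __init__(self):
        self.objects = {}  # object id -> zlib-compressed bytes

    def put(self, content):
        oid = git_blob_id(content)
        # Identical content hashes to the same id, so it is stored only once.
        if oid not in self.objects:
            header = b"blob " + str(len(content)).encode("ascii") + b"\0"
            self.objects[oid] = zlib.compress(header + content)
        return oid

    def get(self, oid):
        raw = zlib.decompress(self.objects[oid])
        # Drop the "blob <size>\0" header and return the file content.
        return raw.split(b"\0", 1)[1]


# Ten revisions of one file, but only two distinct contents among them.
v1 = b"int main(void) { return 0; }\n" * 50
v2 = v1 + b"/* one more line */\n"
store = ToyObjectStore()
revisions = [store.put(v1 if i < 6 else v2) for i in range(10)]

stored_bytes = sum(len(blob) for blob in store.objects.values())
print(len(store.objects), "objects,", stored_bytes, "bytes stored")
```

Ten revisions end up as two stored objects, and because the sample text is highly repetitive, the compressed objects are much smaller than the raw content. Real source code will not compress as well as this artificial sample.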

Performance (speed of operations)

First, any operation that involves the network is much slower than a local operation. Therefore, comparing the current state of the working area with some other version, or getting a log (a history), is of course much slower in Subversion, where it involves a network connection and network transfer, than in Git, where it is a local operation. BTW, this is a difference between centralized version control systems (using a client-server workflow) and distributed version control systems (using a peer-to-peer workflow) in general, not only between Subversion and Git.

Second, if I understand it correctly, the bottleneck nowadays is not CPU but IO (disk access). Therefore it is possible that the gain from having to read less data from disk because of compression (and being able to mmap it in memory) outweighs the loss from having to decompress the data.
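
That trade-off can be put into a back-of-the-envelope model. The sketch below is only arithmetic under stated assumptions: the 0.6 compression ratio is the rough figure quoted above, while the disk and decompression throughputs are hypothetical numbers chosen for illustration, not measurements of Git or of any particular machine.

```python
def seconds_reading_raw(size_bytes, disk_bytes_per_s):
    """Time to read uncompressed data straight from disk."""
    return size_bytes / disk_bytes_per_s


def seconds_reading_compressed(size_bytes, ratio, disk_bytes_per_s, inflate_bytes_per_s):
    """Time to read the compressed form, then inflate it.

    ratio is compressed size / original size; inflate_bytes_per_s is the
    decompression rate measured in bytes of *output* per second.
    """
    read = size_bytes * ratio / disk_bytes_per_s
    inflate = size_bytes / inflate_bytes_per_s
    return read + inflate


def break_even_inflate_rate(ratio, disk_bytes_per_s):
    """Decompression rate above which the compressed path is faster."""
    return disk_bytes_per_s / (1.0 - ratio)


# Hypothetical figures, for illustration only.
SIZE = 100e6      # 100 MB of source text
DISK = 100e6      # assumed disk throughput: 100 MB/s
INFLATE = 400e6   # assumed decompression throughput: 400 MB/s of output
RATIO = 0.6       # compressed size is about 60% of the original

raw = seconds_reading_raw(SIZE, DISK)
packed = seconds_reading_compressed(SIZE, RATIO, DISK, INFLATE)
print(f"raw: {raw:.2f} s, compressed: {packed:.2f} s")
print(f"break-even inflate rate: {break_even_inflate_rate(RATIO, DISK) / 1e6:.0f} MB/s")
```

Under these assumed numbers the compressed path wins, and it keeps winning as long as decompression is faster than disk throughput divided by (1 - ratio). With a slow CPU or a very fast disk the conclusion flips, which is why this is only "possible" and not guaranteed.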

Third, Git was designed with performance in mind (see e.g. GitHistory page on Git Wiki):

  • The index stores stat information for files, and Git uses it to decide, without examining file contents, whether the files were modified or not (see e.g. the core.trustctime config variable).
  • The maximum delta depth is limited to pack.depth, which defaults to 50. Git has a delta cache to speed up access. There is a (generated) packfile index for fast access to objects in a packfile.
  • Git takes care not to touch files it doesn't have to. For example, when switching branches or rewinding to another version, Git updates only the files that changed. A consequence of this philosophy is that Git supports only very minimal keyword expansion (at least out of the box).
  • Git uses its own version of the LibXDiff library, nowadays also for diff and merge, instead of calling an external diff / external merge tool.
  • Git tries to minimize latency, which means good perceived performance. For example, it outputs the first page of "git log" as fast as possible, and you see it almost immediately, even if generating the full history would take more time; it doesn't wait for the full history to be generated before displaying it.
  • When fetching new changes, Git checks what objects you have in common with the server, and the server sends only the (compressed) differences in the form of a thin packfile. Admittedly, Subversion can (or perhaps by default it does) also send only differences when updating.
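
The first point in the list, deciding from stat information alone whether a file might have changed, can be sketched as follows. This is a simplified illustration, not Git's index format: ToyIndex and demo are hypothetical names, and only size and modification time are compared, whereas the real index records more fields and has extra handling for files modified within the same timestamp tick.

```python
import os
import tempfile


def stat_signature(path):
    """Cheap fingerprint from file metadata; the file content is never read."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


class ToyIndex:
    """Hypothetical stand-in for the index: path -> stat signature at 'add' time."""

    def __init__(self):
        self.entries = {}

    def add(self, path):
        self.entries[path] = stat_signature(path)

    def maybe_modified(self, path):
        # If the metadata matches, assume the file is unchanged and skip
        # reading and hashing its content.
        return self.entries.get(path) != stat_signature(path)


def demo():
    """Return (modified right after add, modified after appending to the file)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hello.c")
        with open(path, "w") as f:
            f.write("int main(void) { return 0; }\n")
        index = ToyIndex()
        index.add(path)
        before = index.maybe_modified(path)
        with open(path, "a") as f:
            f.write("/* edited */\n")  # the size changes, so the signature changes
        after = index.maybe_modified(path)
        return before, after


print(demo())
```

The point of the design is that a status check costs one stat call per file instead of a full read, which matters a lot in a tree with many thousands of files.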

I am not a Git hacker, and I have probably missed some techniques and tricks that Git uses for better performance. Note, however, that Git relies heavily on POSIX features (such as memory-mapped files) for that, so the gain might not be as large on MS Windows.
