Sean Sean - 19 days ago 6
Git Question

Do bots/spiders clone public git repositories?

I host a few public repositories on GitHub which occasionally receive clones according to traffic graphs. While I'd like to believe that many people are finding my code and downloading it, the nature of the code in some of them makes me suspect that most of these clones are coming from bots or search engine crawlers/spiders. I know myself that if I find a git repository via a search engine, I usually look at the code with my browser and decide if it's useful or not before cloning it.

Does anyone know if cloning git repositories is a standard technique for search engine crawlers, or if my code is just more popular than I think?

Answer

The "Clone or download" button on a repository's GitHub page provides the URL of the repository. If you open that URL in a web browser, you get the HTML page the browser displays. A web spider receives the same page.

However, if you give the URL to a Git client, it can operate on the repository files (clone the repo, pull, push). This is because the Git client uses one of Git's two own protocols that run on top of HTTP.

To use these protocols, the Git client builds URLs from the base URL of the repository and submits HTTP requests to those URLs.

For example, if the Git URL is https://github.com/axiac/code-golf.git, a Git client tries one of the following two requests in order to find more information about the internal structure of the repository:

GET https://github.com/axiac/code-golf.git/info/refs HTTP/1.0

GET https://github.com/axiac/code-golf.git/info/refs?service=git-upload-pack HTTP/1.0

The first one is called the "dumb" protocol (and is no longer supported by GitHub); the second one is called the "smart" protocol. The "dumb" one works with text messages; the "smart" one works with binary string blocks and custom HTTP headers.
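As a rough illustration of how those two discovery URLs are derived from the repository URL, here is a minimal Python sketch. The function name discovery_urls is made up for this example and is not part of Git or of any library; the sketch only builds the strings and sends nothing over the network.

```python
def discovery_urls(repo_url: str) -> dict:
    """Build the two ref-discovery URLs a Git client may request.

    Hypothetical helper for illustration only: it mirrors the two GET
    requests shown above and performs no network access.
    """
    base = repo_url.rstrip("/")
    return {
        # "dumb" protocol: the refs are fetched as a plain file
        "dumb": base + "/info/refs",
        # "smart" protocol: the client names the service it wants to use
        "smart": base + "/info/refs?service=git-upload-pack",
    }


urls = discovery_urls("https://github.com/axiac/code-golf.git")
print(urls["dumb"])
print(urls["smart"])
```

A crawler that never constructs these URLs on purpose has no reason to request them, since no link on the HTML page points at them.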

In order to operate on a Git repository, the Git client must parse the responses received from the server and use the information to create and submit the correct requests for the actions it intends.
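To give an idea of what "parsing the responses" involves, the sketch below decodes the length-prefixed framing (Git calls it "pkt-line") that the "smart" protocol uses: each packet starts with four hexadecimal digits giving its total length, and lengths below 4 (such as 0000, the "flush" packet) mark special packets with no payload. The function name parse_pkt_lines and the sample bytes are assumptions made for this example; a real client does a lot more than this.

```python
from typing import List, Optional


def parse_pkt_lines(data: bytes) -> List[Optional[bytes]]:
    """Split a pkt-line stream into payloads (illustrative sketch only).

    Special packets without a payload (for example the flush packet
    "0000") are returned as None.
    """
    packets: List[Optional[bytes]] = []
    pos = 0
    while pos < len(data):
        # The first 4 bytes are the packet length in hex, prefix included.
        length = int(data[pos:pos + 4], 16)
        if length < 4:
            # Special packet: only the 4-byte prefix, no payload.
            packets.append(None)
            pos += 4
            continue
        packets.append(data[pos + 4:pos + length])
        pos += length
    return packets


# Hand-written sample resembling the start of a smart-protocol response.
sample = b"001e# service=git-upload-pack\n0000"
print(parse_pkt_lines(sample))
```

A generic HTML crawler has no such parser, so even if it received this response it could not do anything useful with it.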

A browser cannot operate on a Git repository because it doesn't know these protocols. A general-purpose web crawler works, more or less, like a browser. It usually doesn't care much about styles, scripts, or the correctness of the HTML, but as far as HTTP is concerned it behaves very much like a browser.

To clone your repo, a web crawler must be specifically programmed to understand the Git transport protocols. Or (better) it can run an external git clone command when it finds a URL that it thinks belongs to a Git repository. In both cases, the crawler must be programmed with this purpose in mind: to clone Git repositories.
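The second approach could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the helper names (looks_like_git_url, clone_command, maybe_clone), the ".git" suffix heuristic, and the choice of a shallow clone. It is not how any particular crawler works, and it needs the git command to be installed.

```python
import subprocess
from urllib.parse import urlparse


def looks_like_git_url(url: str) -> bool:
    """Crude heuristic (an assumption): an http(s) URL whose path ends in .git."""
    parsed = urlparse(url)
    return (parsed.scheme in ("http", "https")
            and parsed.path.rstrip("/").endswith(".git"))


def clone_command(url: str, dest: str) -> list:
    """Build the argument list for a shallow clone of url into dest."""
    return ["git", "clone", "--depth", "1", url, dest]


def maybe_clone(url: str, dest: str) -> bool:
    """Clone url into dest if it looks like a Git repository URL.

    Returns False without running anything when the heuristic rejects the URL.
    """
    if not looks_like_git_url(url):
        return False
    # Runs the external git program; raises CalledProcessError on failure.
    subprocess.run(clone_command(url, dest), check=True)
    return True
```

Someone has to write this branch deliberately, which is why clones in your traffic graph come either from people or from tools built specifically to clone repositories.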

All in all, there is no way a web crawler (or a user with a web browser) can clone a Git repository by mistake.

A web crawler does not even need to clone Git repositories from GitHub or from other web servers that host them. It can get each and every version of all the files in the repository by following the links that the web server (GitHub or another) provides.
