Preventing git push from sending entire repo if not up-to-date

Related question: why does Git send whole repository each time push origin master

The short version: When working with two Git repositories, even if 99% of the commit objects are identical, using git push to send a commit to repository B when origin is set to point to repo A causes all objects (200MB+) to be transferred.

The much longer version: We have a second Git repository set up on our continuous integration server. After we have prepared our commit objects locally, instead of pushing directly to origin/master as one normally would, we instead push our changes to a branch on this second repository. The CI server picks up the new branch, auto-rebases it onto master, runs our integration tests and, if all is well, pushes the branch to origin/master on the master repo.

The CI server also periodically calls git fetch to retrieve the latest copy of origin/master from the master repo, in case someone has gone around the CI process and pushed directly.

This works wonderfully, especially if one does a git fetch; git rebase origin/master before pushing to the CI repo; Git only sends the commit objects that are not already in origin/master. If one skips the fetch/rebase step before pushing, the process still works, but Git appears to send, if not all, then a majority of commit objects to the CI repo -- currently more than 200MB worth. (A fresh clone of our repo clocks in at 225MB.)

Are we doing something wrong? Is there a way to correct this behaviour such that Git only sends the commit objects it needs to form the branch on the CI repo? We can obviously work around the issue by doing a pre-push git fetch; git rebase origin/master, but it feels like we should be able to skip that step, especially because pushing directly to the master repo does not present the same problem.

Our repos are served up by Gitosis 0.2, and our clients are overwhelmingly running msysgit 1.7.3.1-preview.

...auto-rebases it onto master...

I think that is the root of the problem right there. Every time your CI server does this auto-rebase step, it will create a whole new set of commits relative to the nearest common ancestor of the current and the master branch.

The next time you push your code to the CI server, it doesn't actually have all those objects anymore (they're not reachable from any live heads), so it asks your client to send them all again.

You should be able to see this happening by watching the SHA1 commit IDs of the commits you're making. You will probably find that the commit IDs of local commits no longer match the corresponding commit IDs in the rebased branch on the CI server.
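To make that concrete, here is a minimal sketch that reproduces the effect in a throwaway repository. The branch name `topic`, the file names, and the identity settings are made up for illustration; the CI server's actual rebase setup is not shown in the question, so this only mimics it with a local `git rebase`.

```shell
#!/bin/sh
set -e
# Throwaway repository; every name below is illustrative.
work=$(mktemp -d)
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/master   # use "master" whatever the default branch is
git config user.email dev@example.com
git config user.name Dev
git config commit.gpgsign false

echo base > file.txt
git add file.txt
git commit -q -m "base"

# The developer's commit, as it exists on the local machine.
git checkout -q -b topic
echo feature > feature.txt
git add feature.txt
git commit -q -m "feature"
before=$(git rev-parse HEAD)

# Meanwhile master moves on (someone else's change).
git checkout -q master
echo other > other.txt
git add other.txt
git commit -q -m "other change"

# What the CI server does: replay the topic commit on top of master.
git checkout -q topic
git rebase -q master
after=$(git rev-parse HEAD)

echo "before rebase: $before"
echo "after rebase:  $after"

# The content of the change is the same, but the commit object is a new one.
git diff --quiet "$before" "$after" -- feature.txt && echo "feature.txt unchanged"
```

The two IDs differ because the rebased commit has a different parent, even though the change it carries is identical.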

It turns out the simplest solution to this problem is to fetch right before the push:

$ git fetch origin master
$ git push user@host:repo.git HEAD:refs/heads/commit128952690069

In our case, it's important to fetch a specific branch into FETCH_HEAD. That way the user's local branch state is unaffected, but we still receive the most up-to-date set of objects from the main repository, so the following git push will always have the ancestor commit present when Git starts to pack objects.

I did some tooling around with git pack-objects: if one builds a pack file containing the commits <common_ancestor>..HEAD, it only packs as much data as is required:

$ echo $(git merge-base master origin/master)..HEAD | git pack-objects --revs --thin --stdout --all-progress-implied > packfile

However, issuing git push with the repository in the same state causes all objects to get packed and sent.

I suspect what happens is that upon connecting to the Git repo, one receives the SHA of the latest revision in the repo -- if Git does not have the commit object represented by that SHA locally, it cannot run git merge-base to determine the common ancestor; therefore, it must send all the objects to the remote repo. If that commit object does exist, then git merge-base succeeds, and the pack file can be built referencing the common ancestor.
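The condition described above can be checked by hand. The sketch below is only an illustration under assumptions: it builds three throwaway repositories (`main.git`, `ours`, `theirs`, all hypothetical names), lets a colleague push directly to the main repo, and then asks whether the remote's tip commit exists locally before and after `git fetch origin master`. It says nothing about what git push does internally; it only tests whether the commit object is present.

```shell
#!/bin/sh
set -e
# Throwaway repositories; every path and name below is illustrative.
root=$(mktemp -d)
cd "$root"

git init -q --bare main.git
git --git-dir=main.git symbolic-ref HEAD refs/heads/master

seed() {  # hypothetical helper: commit one small file in the current clone
    git config user.email dev@example.com
    git config user.name Dev
    git config commit.gpgsign false
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

# Our clone creates the first commit.
git clone -q main.git ours 2>/dev/null
cd ours
git symbolic-ref HEAD refs/heads/master
seed base
git push -q origin master
cd ..

# A colleague pushes directly to the main repo behind our back.
git clone -q main.git theirs
cd theirs
seed direct-push
git push -q origin master
cd ..

cd ours
local_before=$(git rev-parse master)
tip=$(git ls-remote origin refs/heads/master | cut -f1)

# Is the remote's tip commit available in our object store?
if git cat-file -e "$tip^{commit}" 2>/dev/null; then have_before=yes; else have_before=no; fi

git fetch -q origin master   # brings in the objects; local branches are not moved

if git cat-file -e "$tip^{commit}" 2>/dev/null; then have_after=yes; else have_after=no; fi
local_after=$(git rev-parse master)

echo "remote tip present before fetch: $have_before"
echo "remote tip present after fetch:  $have_after"
```

The local `master` points at the same commit before and after the fetch, which is why fetching right before the push is safe to do without touching the user's work.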

It sounds like your local repositories got out of sync with the CI server repository. The fact that a push from you to the CI server does this means that your local repository has a different set of commit hashes. It could go something like this:

git clone master
(... do work ...)
git push ci branch
(... CI does a rebase that changes all the commit hashes you pushed ...)
(... CI does its testing and pushes to master ...)
(... Now master and CI match but the hashes of all the commits you just pushed
     don't exist anywhere except your local machine ...)
(... do work ...)
git push ci branch

That last push is going to contain the entire set of commits from your first push because the CI's rebase changed all of their hashes and you're still working off the original commits that you created.