Robert Haas: git


Friday, March 02, 2012

The Git Workflow

When the PostgreSQL project decided to migrate to git, we decided not to allow merge commits. A number of people made comments, in a number of different fora, to the effect that we weren't following "the git workflow". While a few commentators seemed to think that was reasonable, many thought that it demonstrated our ignorance and stupidity, and some felt it outright heresy.

So I noted with some interest Junio Hamano's blog post about the forthcoming release of git 1.7.10, which is slated to include a change to the way that merge commits work: users will now be prompted to edit the commit message, rather than just accepting a default one. Actually, what I found most interesting were Linus Torvalds' comments on this change, particularly where he says this: "This change hopefully makes people write merge messages to explain their merges, and maybe even decide not to merge at all when it's not necessary." His comments are quoted more fully in the above-linked blog article; unfortunately I don't know how to link directly to his Google+ post. And Junio Hamano makes this remark: "Merging updated upstream into your work-in-progress topic without having a good reason is generally considered a bad practice. [...] Otherwise, your topic branch will stop being about any particular topic but just a garbage heap that absorbs commits from many sources, both from you to work on a specific goal, and also from the upstream that contains work by others made for random other unfocused purposes."

Keeping Local Git Branches Up To Date

Because I spend most of my time working on the master branch of the PostgreSQL git repository, I prefer to work with just a single clone. The PostgreSQL wiki page Committing with Git describes several ways of using multiple clones and/or git-new-workdir, but my personal preference is to just use one clone. Most of the time, I keep the master branch checked out, but every once in a while I check out one of the back-branches to look at something, or to back-patch. (If you're unfamiliar with the PostgreSQL workflow, note that we do not merge into our official branches; we always rebase, so that there are no merge commits in our official repository. You may or may not like this workflow, but it works for us.)
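For readers who want to follow a similar no-merge policy, here is a minimal sketch of a pre-push check. It is not the PostgreSQL project's tooling: the helper name no_merges_ahead and the default upstream origin/master are assumptions made for illustration. It relies only on git rev-list --merges, which lists merge commits in a range.

```shell
# Sketch: refuse to continue if HEAD contains merge commits that the
# upstream does not have. no_merges_ahead is a hypothetical helper name;
# the default upstream (origin/master) is an assumption.
no_merges_ahead() {
    upstream=${1:-origin/master}
    # Merge commits reachable from HEAD but not from the upstream.
    merges=$(git rev-list --merges "$upstream..HEAD") || return 2
    if [ -n "$merges" ]; then
        echo "merge commits not yet upstream:" >&2
        echo "$merges" >&2
        return 1
    fi
}

# Possible use before pushing:
#   git fetch origin && git rebase origin/master && no_merges_ahead && git push origin master
```

The function returns 0 when the local history is linear relative to the upstream, 1 when it finds a merge commit, and 2 when git itself fails (for example, when the upstream name does not resolve).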

One small annoyance is that "git pull" doesn't leave my clone in the state I want. Say I have the master branch checked out. "git pull" will update all of my remote tracking branches, but it will only update the local branch that I currently have checked out. This is annoying, first of all because if I later type "git log REL9_0_STABLE" I'll only get the commits up to the last time I checked out and pulled that branch, rather than, as I intended, the latest state of the upstream, and secondly because it leads to spurious griping when I later do "git push": it complains that the old branches can't be pushed because it wouldn't be a fast-forward merge. This is of course a little silly: since my branch tip is an ancestor of the tracking branch, it would be more reasonable to conclude that I haven't updated it than to imagine I meant to clobber the origin.
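One way to get the behavior described above is sketched below. This is an illustration under stated assumptions, not necessarily the approach the author settled on: ff_local_branches is a hypothetical helper name, and the sketch assumes every local branch of interest has an upstream configured. It only moves a branch when the local tip is an ancestor of the tracking branch, which is exactly the "I just haven't updated it" case.

```shell
# Sketch: fast-forward every local branch that is behind its upstream.
# ff_local_branches is a hypothetical helper name, not a git command.
ff_local_branches() {
    git fetch --quiet --all || return 1
    # Name of the checked-out branch (empty if HEAD is detached).
    current=$(git symbolic-ref --quiet --short HEAD || true)
    git for-each-ref --format='%(refname:short) %(upstream:short)' refs/heads |
    while read -r branch upstream; do
        # Skip branches with no configured upstream.
        [ -n "$upstream" ] || continue
        # Only act when the local tip is an ancestor of the upstream tip,
        # so the update is a pure fast-forward and nothing local is lost.
        git merge-base --is-ancestor "$branch" "$upstream" || continue
        if [ "$branch" = "$current" ]; then
            # The checked-out branch needs its working tree updated too.
            git merge --quiet --ff-only "$upstream"
        else
            # Other branches only need their ref moved.
            git update-ref "refs/heads/$branch" "$upstream"
        fi
    done
}
```

Branches with local commits that are not yet upstream are left alone, so unpushed work is never overwritten.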

Stupid Git Tricks for PostgreSQL

Even before PostgreSQL switched to git, we had a git mirror of our old CVS repository. So I suppose I could have hacked up these scripts any time. But I didn't get around to it until we really did the switch. Here's the first one. It's a one-liner. For some definition of "one line".

git log --format='%H' --shortstat `git merge-base REL9_0_STABLE master`..master | perl -ne 'chomp; if (/^[0-9a-f]/) { print $_, " "; } elsif (/files? changed/) { s/^\s+//; my @a = split /\s+/; print $a[3] + $a[5], "\n" }' | sort -k2 -n -r | head | cut -d' ' -f1 | while read commit; do git log --shortstat -n 1 $commit | cat; echo ""; done

This will show you the ten "biggest" commits since the REL9_0_STABLE branch was created, according to number of lines of code touched. Of course, this isn't a great proxy for significance, as the output shows. Heavily abbreviated, largest first:

66424a284879b Fix indentation of verbatim block elements (Peter Eisentraut)

9f2e211386931 Remove cvs keywords from all files (Magnus Hagander)

4d355a8336e0f Add a SECURITY LABEL command (Robert Haas)

c10575ff005c3 Rewrite comment code for better modularity, and add necessary locking (Robert Haas)

53e757689ce94 Make NestLoop plan nodes pass outer-relation variables into their inner relation using the general PARAM_EXEC executor parameter mechanism, rather than the ad-hoc kluge of passing the outer tuple down through ExecReScan (Tom Lane)

5194b9d04988a Spell and markup checking (Peter Eisentraut)

005e427a22e3b Make an editorial pass over the 9.0 release notes. (Tom Lane)

3186560f46b50 Replace doc references to install-win32 with install-windows (Robert Haas)

debcec7dc31a9 Include the backend ID in the relpath of temporary relations (Robert Haas)

2746e5f21d4dc Introduce latches. A latch is a boolean variable, with the capability to wait until it is set (Heikki Linnakangas)

Of course, some of these are not-very-interesting commits that happen to touch a lot of lines of code, but a number of them represented significant refactoring work that can be expected to lead to good things down the line. In particular, latches are intended to reduce replication latency and eventually facilitate synchronous replication; and Tom's PARAM_EXEC refactoring is one step towards support for the SQL construct LATERAL().

OK, one more.

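#!/bin/sh
# Branch point between master and the 9.0 stable branch.
BP=`git merge-base master REL9_0_STABLE`
# One pass per distinct author name since the branch point.
git log --format='format:%an' $BP..master | sort -u |
while read author; do
    # printf is used instead of echo "...\c", which is not portable.
    printf '%s: ' "$author"
    git log --author="$author" --numstat $BP..master |
    awk '/^[0-9]/ { P += $1; M += $2 }
         /^commit/ { C++ }
         END { print C " commits, " P " additions, " M " deletions, " (P+M) " total"}'
done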

This one shows you the total number of lines of code committed to 9.1devel, summed up by committer. It has the same problem as the previous script, which is that sometimes you change a lot of lines of code without actually doing anything terribly important. It has a further problem, too: it only takes into account the committer, rather than other important roles, including reporters, authors, and reviewers. Unfortunately, that information can't easily be extracted from the commit logs in a structured way. I would like to see us address that defect in the future, but we're going to need something more clever than git's Author field. Most non-trivial patches, in the form in which they are eventually committed, are the work of more than one person; and, at least IMO, crediting only the main author (if there even is one) would be misleading and unfair in many cases.

I think the most interesting tidbit I learned from playing around with this stuff is that git merge-base can be used to find the branch point for a release. That's definitely handy.
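As an illustration of the merge-base trick, the following is a more readable sketch of the "biggest commits" idea. It is an alternative formulation, not the one-liner from this post: biggest_commits is a hypothetical helper name, and it uses --numstat rather than --shortstat so that awk can sum the per-file counts directly. Binary files, which --numstat reports as "-", are skipped.

```shell
# Sketch: rank commits made on a branch since it diverged from another
# branch, by lines added plus lines deleted.
# biggest_commits is a hypothetical helper name.
# Usage: biggest_commits <old-branch> <new-branch> [count]
biggest_commits() {
    # The branch point is the merge base of the two branches.
    base=$(git merge-base "$1" "$2") || return 1
    git log --format='commit %H' --numstat "$base..$2" |
    awk '
        # A "commit <hash>" line starts a new commit: flush the previous one.
        $1 == "commit" { if (h != "") print n, h; h = $2; n = 0; next }
        # numstat lines look like: added<TAB>deleted<TAB>path
        NF >= 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { n += $1 + $2 }
        END { if (h != "") print n, h }' |
    sort -k1,1nr | head -n "${3:-10}"
}

# Example: biggest_commits REL9_0_STABLE master 10
```

Each output line is the number of lines touched followed by the commit hash, largest first; pass a hash to "git log -n 1" to see the details.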

Friday, September 24, 2010

Enjoying Git

OK, I admit it. This is awesome. I'm still getting used to committing to PostgreSQL with git rather than CVS, but it's sort of like the feeling of being let out of the dungeon. Wow, sunlight, what am I supposed to do about that?

Actually, I've never really been into CVS bashing; it's an OK system for what it does. And compared to RCS, which I actually used once or twice a long time ago, it's positively phenomenal. But git, despite its imperfections, is just a lot better.

There are two major things that caused problems for me when committing to CVS. First, it was painfully slow. Second, since I was doing all of my development work on git, that meant extracting the patch, applying it to CVS, making sure to CVS add/rm any new/deleted files, retyping (or copying) the commit message, and double-checking that I hadn't messed anything up while moving the patch around.

$ git commit

$ git show

$ git push

Nice! I feel like someone gave me an easy button.

Saturday, September 18, 2010

Git Conversion, Take Two

The PostgreSQL project will be making its second attempt to migrate from CVS to git this coming Monday. In a previous blog post, I talked about some of the difficulties we've had getting a clean conversion of our project history to git. I was surprised that a number of people suggested throwing out our development history and just moving the head of each branch to git; and I agree with some of the later comments that this would be a bad idea. I refer back to our development history fairly frequently, for a variety of reasons: to determine when particular features were introduced, to determine what patch last touched a particular area of the code, to see how old a particular bit of code is, and sometimes even to model a new patch on a previous patch that added a similar feature. So I'd find it very painful to lose convenient access to all of that history. Even a somewhat messed-up conversion would be better than no conversion at all.

Fortunately, it looks like we're going to end up with a pretty darn good conversion. Tom Lane spent most of last weekend cleaning up most of the remaining infelicities. The newest conversions are a huge improvement over both our current, incrementally-updated conversion (which is what I use for day to day development) and earlier attempts at a final conversion. Only a handful of minor artifacts remain, mostly because of wacky things that were done in CVS many years ago. Our use of CVS in recent years has been quite disciplined, which is why such a clean conversion is possible.

Sunday, September 12, 2010

So, Why Isn't PostgreSQL Using Git Yet?

Just over a month ago, I wrote a blog posting entitled Git is Coming to PostgreSQL, in which I stated that we planned to move to git sometime in the next several weeks. But a funny thing happened on the way to the conversion. After we had frozen the CVS repository and while Magnus Hagander was in the process of performing the migration, using cvs2git, I happened to notice - just by coincidence - that the conversion had big problems. cvs2git had interpreted some of the cases where we'd back-patched commits from newer branches into older branches as merges, and generated merge commits. This made the history look really weird: the merge commits pulled in the entire history of the branch behind them, with the result that newer commits appeared in the commit logs of older branches, even though we didn't commit them there and the changes were not present there.

Fortunately, Max Bowsher and Michael Haggerty of the cvs2git project were able to jump in and help us out, first by advising us not to panic, and secondly by making it possible to run cvs2git in a way that doesn't generate merge commits. Once this was done, Magnus reran the conversion. The results looked a lot better, but there were still a few things we weren't quite happy with. There were a number of "manufactured commits" in the history, for a variety of reasons. Some of these were the result of spurious revisions in the CVS history of generated files that were removed from CVS many years ago; Max Bowsher figured out how to fix this for us. Others represented cases where a file was deleted from the trunk and then later re-added to a back branch. But because we are running a very old version of CVS (shame on us!), not enough information was recorded in the RCS files that make up the CVS repository to reconstruct the commit history correctly. Tom Lane, again with help from the cvs2git folks, has figured out how to fix this. We also end up with a few spurious branches (which are easily deleted), and there are some other manufactured commits that Tom is still investigating.

In spite of the difficulties, I'm feeling optimistic again. We seem to have gotten past the worst of the issues, and seem to be making progress on the ones that remain. It seems likely that we may decide to postpone the migration until after the upcoming CommitFest is over (get your patches in by September 14!) so it may be a bit longer before we get this done - but we're making headway.

Saturday, August 07, 2010

Git is Coming to PostgreSQL

As discussed at the PGCon 2010 Developer Meeting, PostgreSQL is scheduled to adopt git as its version control system some time in the next few weeks. Andrew Dunstan, who maintains the PostgreSQL build farm, has adapted the build farm code to work with either CVS or git; meanwhile, Magnus Hagander has done a trial conversion so that we can all see what the new repository will look like. My small contribution was to write some documentation for the PostgreSQL committers, which has subsequently been further edited by Heikki Linnakangas (the link here is to his personal web page, whose one complete sentence is one of the funnier things I've read on the Internet).

I don't think the move to git is going to be a radical change; indeed, we're taking some pains to make sure that it isn't. But it will make my life easier in several small ways. First, the existing git clone of the PostgreSQL CVS repository is flaky and unreliable. The back-branches have had severe problems in this area for some time (some don't build), and the master branch (aka CVS HEAD) has periodic issues as well. At present, for example, the regression tests for contrib/dblink fail on a build from git, but pass on a build from CVS. While we might be able to fix (or minimize) these issues by fixing bugs in the conversion code, switching to git should eliminate them. Also, since I do my day-to-day PostgreSQL work using git, it will be nice to be able to commit that way also - it should be both faster (CVS is very slow by comparison) and less error-prone (no cutting and pasting the commit message, no forgetting to add a file in CVS that you already added in git).
