Stefan Hajnoczi: git

Showing posts with label git. Show all posts

Tuesday, February 2, 2021

Keeping a clean git commit history

Does the commit history of your source repository look like this:

f02af91822 docs: fix incorrect subheadings
dd7bee8b38 cli: add --import option
900ca2936a cli: extract move_topic() helper function

Or like this:

7011cc9868 lunch time
a07c82331d resolve code review comments
331d79a8ff more fixes

The first is a clean git commit history where each commit has a clear purpose and is a single logical change. The second is a messy commit history where the commits have no inherent structure:

"more fixes" does not describe clearly what is being fixed and the plural ("fixes") hints it may contain multiple logical changes instead of just one.
"resolve code review comments" contains changes requested by code reviewers in relation to another commit, it's not a self-contained logical change.
"lunch time" is an unfinished commit that was created because the programmer wanted to save their work.

Commit anti-patterns

The example above illustrates several anti-patterns:

Vague commit messages

If the commit message is vague and does not express a clear purpose, then it is hard to know what a commit does from the commit message. If git-log(1) doesn't provide useful information about commits then one has to resort to searching the code diffs. That is very tedious and sometimes it's almost impossible to come up with a good code search query while a clear commit message would have been easy to search. So clear commit messages are the first step towards clean commit history.

Doing too many things in one commit

Commits that make several logical code changes are hard to review and impede backporting fixes to stable branches. For example, a commit that fixes a bug as well as adding a new feature may need to be rewritten for a stable branch. If instead the code had been split into two commits, then the bug fix commit could have been backported easily. Therefore it is good practice to separate distinct bug fixes, features, and other logical code changes into separate commits.

Addressing code review comments

The code review and testing history is usually not useful information once a commit has been merged. For example, if there was a continuous integration (CI) test failure and a pull request needed to be changed, then the change should be made directly to the buggy commit so that the final commit passes the tests. No one needs to know about the code review or testing history once the code is merged and keeping these artifacts makes the commit history unwieldy by spreading a logical code change across multiple incomplete commits.

Saving work

There are valid reasons to temporarily save your work in a commit, but work-in-progress (WIP) commits should be cleaned up before merging them. For example, sometimes people make arbitrary commits to save work at the end of the day. That is fine in a local branch, but those temporary commits can be restructured into clean commits using git-rebase(1). No one else needs to know about temporary commits.

Broken commits

It can be easy to accidentally include a commit that does not build or fails tests if a later commit happens to resolve the issue. Since the later commit hides the issue it may not be apparent when testing the branch. When reordering commits the risk of introducing broken commits increases because those commits were originally written in a different order. I use the git-rebase(1) exec action to build and run tests after every commit to detect broken commits when doing extensive rebases.

Why clean commit history is important

Not all reasons for maintaining a clean commit history are obvious. Unfortunately all the above anti-patterns make commit history less useful so it's interesting to note that if you value any of the following reasons for keeping a clean commit history, then all anti-patterns need to be avoided.

Code review

Reviewers have an easier time reading clean commits than an unstructured series of commits. For example, if there is a broken commit because a function is used before it is defined in a later commit, then that affects code reviewers who read the commits linearly. They will be puzzled by the non-existent function and unable to decide whether it is being used correctly because it has not been defined yet. Although code reviewers could put in extra effort to reread the commits multiple times and try to remember the misordered changes, it's better to let code reviewers spend time on real issues rather than on untangling poorly structured commits.

Capturing the rationale for code changes

When each commit is a single logical change it becomes possible to write good commit descriptions that give the rationale for the code change. Explanations for why a code change is necessary, as well as links to issue trackers, email discussions, etc can be valuable when revisiting the commit history later. If commits contain multiple logical code changes or are incomplete then it is hard to include a good commit description, so the commit history is less useful when referring back to it later on.

Making cherry-picking easy

Many software projects maintain stable branches that still receive bug fixes for some time. This allows development to introduce new features and less mature code while users can run a mature stable release. However, maintaining stable branches can be time-consuming. Maintainers need to identify commits suitable for stable branches and cherry-pick or backport them. This requires clean commit history so that bug fixes can be applied in isolation without dragging in other code changes that do not fit the criteria for stable branches.

Enabling git-bisect(1)

When a bug is observed it may not be clear which commit introduced it. The git-bisect(1) command systematically searches the commit history and identifies the commit that caused the bug. However, git-bisect(1) only works with clean commit history. If there are broken commits then bisection becomes unreliable because some portions of commit history cannot be tested. Poorly structured commits, such as huge changes that do many different things, also make it difficult to identify which line caused the bug even when git-bisect(1) has determined which commit is to blame.

When clean commit history does not matter

The reason I have found that not everyone practices clean commit history is that they may not need any of this. Especially small projects developed by a single author may involve little code review, backporting changes to stable branches, or git-bisect(1). In that case the effort required to split code changes into clean commits and write good commit messages may seem unjustified. Of course this can change but once the commit history is messy there is not much to be done. So it's worth thinking carefully about whether to take shortcuts.

Another factor is poor tooling. Gerrit and GitHub's code review has historically made it hard to practice clean commit history. They were not designed for reviewing commit series and favored anti-patterns like squashing everything into a single commit or adding additional commits to address code review feedback. These are tool limitations and luckily GitHub code review has become better over the years. Tools that encourage you to review a commit series as a single diff are not conducive to clean commit history.

Finally, clean commit history requires proficiency with git-rebase(1) and that you are comfortable with the idea of rewriting your local branch to clean it up before publishing it. It takes a little practice to become competent at reordering, squashing, and splitting commits. The process can be a little scary, although git-reflog(1) makes it possible to undo even the most serious errors where commits were accidentally lost. On a related note, some people falsely believe that a pull or merge request branch should not be rebased. Although it is good practice to avoid rewriting history of branches that other people track, rewriting history and force-pushing a pull request is different. Most of the time no one else will maintain a local branch based on it and therefore force-pushing will not inconvenience anyone. Even if it is necessary to develop branches based on someone else's not-yet-merged branches, one needs to weigh the trade-offs of having to do more work in the short-term with the drawbacks of having a messy commit history forever.

Conclusion

I hope this is a useful summary of why each commit should have a clear purpose and embody a single logical change. For source repositories that are used by more than one person it is especially important to think about commit best practices. Clean commit history facilitates better code review, bug-finding, and maintaining stable branches. Beyond that it also provides a useful form of communication and sharing knowledge about the codebase that is missing when commit history is disregarded.

Friday, December 11, 2020

Building a git forge using git apps

The two previous blog posts about why git forges are von Neumann machines and the Radicle peer-to-peer git forge explored models for git forges. In this final post I want to cover yet another model that draws from the previous ones but has its own unique twist.

Peer-to-peer git apps

I previously showed how applications can be built on centralized git forges using CI/CD functionality for executing code, webhooks for interacting with the outside world, and disjoint branches for storing data.

A more elegant architecture is a peer-to-peer one where instead of many clients and one server there are just peers. Each peer has full access to the data. There is no client/server application code split, instead each peer runs an application for itself.

First, this makes it easier to move the data to new hosting infrastructure or fork a project since all data resides in the git repository. Merge requests, issues, wikis, and even the app settings are all stored in the git repo itself.

Second, this gives more power to the users who can process data however they want without being limited by the server's API. All peers are on equal footing and users don't need permission to alter applications, because they run locally.

Finally, it is easier to develop a local application than a client/server application. Being able to open a file and tweak the code is immediate and less hassle than testing and deploying a server-side application.

Internet peer-to-peer systems typically still require some central point for bootstrapping and this is no exception. A publicly-accessible git repository is still needed so that peers can fetch and push changes. However, in this model the git server does not run application code but "git apps" like merge requests, issue trackers, wikis, etc can still be implemented. Here is how it works...

The anti-application server

The git server is not allowed to run application code in our model, so apps like merge requests won't be processing data on the server side. However, the repository does need some primitives to make peer-to-peer git apps possible. These primitives are access control policies for refs and directories/files.

Peers run applications locally and the git server is "dumb" with the sole job of enforcing access control. You can imagine this like a multi-user UNIX machine where users have access to a shared directory. UNIX file permissions determine how processes can access the data. By choosing permissions carefully, multiple users can collaborate in the shared directory in a safe and controlled manner.

This is an anti-application server because no application code runs on the server side. The server is just a git repository that stores data and enforces access control on git push.

Access control

Repositories that accept push requests need a pre-receive hook (see githooks(5)) that checks incoming requests against the access control policy. If the request complies with the access control policy then the git push is accepted. Otherwise the git push is rejected and changes are not made to the git repository.

The first type of access control is on git refs. Git refs are the namespace where branches and tags are stored in a git repository. If a regular expression matches the ref and the operation type (create, fast-forward, force, delete) then it is allowed. For example, this policy rule allows any user to push to refs/heads/foo but force pushes and deletion are not allowed:

anyone create,fast-forward ^heads/foo$

The operations available on refs include:

Operation	Description
`create-branch`	Push a new branch that doesn't exist yet
`create-tag`	Push a new tag that doesn't exist yet
`fast-forward`	Push a commit that is a descendent of the current commit
`force`	Push a commit or tag replacing the previous ref
`delete`	Delete a ref

What's more interesting is that $user_id is expanded to the git push user's identifier so we can write rules to limit access to per-user ref namespaces:

anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$

This would allow Alice to push her own branches but Alice could not push to Bob's branches.

We have covered how to define access control policies on refs. Access control policies are also needed on branches so that multiple users can modify the same branch in a controlled and safe manner. The syntax is similar but the policy applies to changes made by commits to directories/files (what git calls a tree). The following allows users to create files in a directory but not delete or modify them (somewhat similar to the UNIX restricted deletion or "sticky" bit on world-writable directories):

anyone create-file ^shared-dir/.*$

The operations available on branches include:

Operation	Description
`create-directory`	Create a new directory
`create-file`	Create a new file
`create-symlink`	Create a symlink
`modify`	Change an existing file or symlink
`delete-file`	Delete a file
...

$user_id expansion is also available for branch access control. Here the user can create, modify, and delete files in a per-user directory:

anyone create-file,modify,delete-file ^$user_id/.*$

User IDs

You might be wondering how user identifiers work. Git supports GPG-signed push requests with git push --signed. We can use the GPG key ID as the user identifier, eliminating the need for centralized user accounts. Remember that the GPG key ID is based on the public key. Key pairs are randomly generated and it is improbable that the same key will be generated by two different users. That said, GPG key ID uniqueness has been weak in the past when the default size was 32 bits. Git explicitly enables long 64-bit GPG key IDs but I wonder if collisions could be a problem. Maybe an ID with more bits based on the public key should be used instead, but for now let's assume the GPG key ID is unique.

The downside of this approach is that user IDs are not human-friendly. Git apps can allow the user to assign aliases to avoid displaying raw user IDs. Doing this automatically either requires an external ID issuer like confirming email address ownership, which is tedious for new users, or by storing a registry of usernames in the git repo, which means a first-come-first-server policy for username allocation and possible conflicts when merging from two repositories that don't share history. Due to these challenges I think it makes sense to use raw GPG key IDs at the data storage level and make them prettier at the user interface level.

The GPG key ID approach works well for desktop clients but not for web clients. The web application (even if implemently on the client side) would need access to the private key so it can push to the git repository. Users should not trust remotely hosted web applications with their private keys. Maybe there is a standard Web API that can help but I'm not aware one. More thought is needed here.

The pre-receive git hook checks that signature verification passed and has access to the GPG key ID in the GIT_PUSH_CERT_KEY environment variable. Then the access control policy can be checked.

Access control is a git app

Access control is the first and most fundamental git app. The access control policies that were described above are stored as files in the apps/access-control branch in the repository. Pushes to that branch are also subject to access control checks. Here is the branch's initial layout:

branches/ - access control policies for branches
  owner.conf
groups/ - group definitions (see below)
  ...
refs/ - access control policies for refs
  owner.conf

The default branches/owner.conf access control policy is as follows:

owner create-file,create-directory,modify,delete ^.*$

The default refs/owner.conf access control policy is as follows:

owner create-branch,create-tag,fast-foward,force,delete ^.*$

This gives the owner the ability to push refs and modify branches as they wish. The owner can grant other users access by pushing additional access control policy files or changing exsting files on the apps/access-control branch.

Each access control policy file in refs/ or branches/ is processed in turn. If no access control rule matches the operation then the entire git push is rejected.

Groups can be defined to alias one or more user identifiers. This avoids duplicating access control rules when more than one user should have the same access. There are two automatic groups: owner contains just the user who owns the git repository and anyone is the group of all users.

This completes the description of the access control app. Now let's look at how other functionality is built on top of this.

The merge requests app

A merge requests app can be built on top of this model. The refs access control policy is as follows:

# The data branch contains the titles, comments, etc
anyone modify ^apps/merge-reqs/data$

# Each merge request revision is pushed as a tag in a per-user namespace
anyone create-tag ^apps/merge-reqs/$user_id/[0-9]+-v[0-9]+$

The branch access control policy is:

# Merge requests are per-user and numbered
anyone create-directory ^merge-reqs/$user_id/[0-9]+$

# Title string
anyone create-file,modify ^merge-reqs/$user_id/[0-9]+/title$

# Labels (open, needs-review, etc) work like this:
#
#   merge-reqs/<user-id>/<merge-req-num>/labels/
#     needs-review -> /labels/needs-review
#     ...
#   labels/
#     needs-review/
#       <user-id>/
#         <merge-req-num> -> /merge-reqs/<user-id>/<merge-req-num>
#         ...
#       ...
#     ...
#
# This directory and symlink layout makes it possible to enumerate labels for a
# given merge request and to enumerate merge requests for a given label.
#
# Both the merge request author and maintainers can add/remove labels to/from a
# merge request.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/labels$
anyone create-symlink,delete ^merge-reqs/$user_id/[0-9]+/labels/.*$
maintainers create-symlink,delete ^merge-reqs/[^/]+/[0-9]+/labels/.*$
maintainers create-directory ^labels/[^/]+$
anyone create-symlink,delete ^labels/[^/]+/$user_id/[0-9]+$
maintainers create-symlink,delete ^labels/[^/]+/[^/]+/[0-9]+$

# Comments are stored as individual files in per-user directories. Each file
# contains a timestamp and the contents of the comment. The timestamp can be
# used to sort comments chronologically.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments$
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments/$user_id$
anyone create-file,modify ^merge-reqs/[^/]+/[0-9]+/comments/$user_id/[0-9]+$

When a user creates a merge request they provide a title, an initial comment, apply labels, and push a v1 tag for review and merging. Other users can comment by adding files into the merge request's per-user comments directory. Labels can be added and removed by changing symlinks in the labels directories.

The user can publish a new revision of the merge request by pushing a v2 tag and adding a comment describing the changes. Once the maintainers are satisfied they merge the final revision tag into the relevant branch (e.g. "main") and relabel the merge request from open/needs-review to closed/merged.

This workflow can be implemented by a tool that performs the necessary git operations so users do not need to understand the git app's internal data layout. Users just need to interact with the tool that displays merge requests, allows commenting, provides searches, etc. A natural way to implement this tool is as a git alias so it integrates alongside git's built-in commands.

One issue with this approach is that it uses the file system as a database. Performance and scalability are likely to be worst than using a database or application-specific file format. However, the reason for this approach is that it allows the access control app to enforce a policy that ensures users cannot modify or delete other user's data without running application-specific code on the server and while keeping everything stored in a git repository.

An example where this approach performs poorly is for full-text search. The application would need to search all title and comment files for a string. There is no index for efficient lookups. However, if applications find that git-grep(1) does not perform well they can maintain their own index and cache files locally.

I hope that this has shown how git apps can be built without application code running on the server.

Continuous integration bots

Now that we have the merge requests app it's time to think how a continuous integration service could interface with it. The goal is to run tests on each revision of a merge request and report failures so the author of the merge request can rectify the situation.

A CI bot watches the repository for changes. In particular, it needs to watch for tags created with the ref name apps/merge-reqs/[^/]+/[0-9]+-v[0-9]+.

When a new tag is found the CI bot checks it out and runs tests. The results of the tests are posted as a comment by creating a file in merge-regs/<user-id>/>merge-req-num>/comments/<ci-bot-user-id>/0 on the apps/merge-reqs/data branch. A ci-pass or ci-fail label can also be applied to the merge request so that the CI status can be easily queried by users and tools.

Going further

There are many loose ends. How can non-git users participate on issue trackers and wikis? It might be possible to implement a full peer as a client-side web application using isomorphic-git, a JavaScript git implementation. As mentioned above, the GPG key ID approach is not very browser-friendly because it requires revealing the private key to the web page and since keys are user identifiers using temporary keys does not work well.

The data model does not allow efficient queries. A full copy of the data is necessary in order to query it. That's acceptable for local applications because they can maintain their own indexes and are expected to keep the data for a long period of time. It works less well for short-lived web page sessions like a casual user filing a new bug on the issue tracker.

The git push --signed technique is not the only option. Git also supports signed commits and signed tags. The difference between signed pushes and signed tags/commits is significant. The signed push approach only validates the access control policy when the repository is changed and leaves no audit log for future reference. The signed commit/tag approach keeps the signatures in the git history. Signed commits/tags can be propagated in a peer-to-peer network and each peer can validate the access control policy itself. While signed commits/tags apply the access control policy to each object in the repository, signed pushes apply the access control policy to each change made to the repository. The difference is that it's easy to rebase and include work from different authors with signed pushes. Signed commits/tags require re-signing for rebasing and each commit is validated against its signature, which may be different from the user who is making the push request.

There are a lot of interesting approaches and trade-offs to explore here. This model we've discussed fits closely with how I've seen developers use git in open source projects. It is designed around a "main" repository/server that contributors push their code to. But each clone of the repository has all the data and can be published as a new "main" repository, if necessary.

Although these ideas are unfinished I decided to write them up with the knowledge that I probably won't implement them myself. QEMU is moving to GitLab with a traditional centralized git forge. I don't think this is the right time to develop this idea and try to convince the QEMU community to use it. For projects that have fewer infrastructure requirements it would give their contributors more power than being confined to a centralized git forge.

I hope this was an interesting read for anyone thinking about git forges and building git apps.

Sunday, December 6, 2020

Understanding Peer-to-Peer Git Forges with Radicle

Git is a distributed version control system and does not require a central server. Although repositories are usually published at a well-known location for convenient cloning and fetching of the latest changes, this is actually not necessary. Each clone can have the full commit history and evolve independently. Furthermore, code changes can be exchanged via email or other means. Finally, even the clone itself does not need to be made from a well-known domain that hosts a git repository (see git-bundle(1)).

Given that git itself is already fully decentralized one would think there is no further work to do. I came across the Radicle project and its somewhat psychedelic website. Besides having a website with a wild color scheme, the project aims to offer a social coding experiment or git forge functionality using a peer-to-peer network architecture. According to the documentation the motivation seems to be that git's built-in functionality works but is not user-friendly enough to make it accessible. In particular, it lacks social coding features.

The goal is to add git forge features like project and developer discovery, issue trackers, wikis, etc. Additional, distinctly decentralized functionality, is also touched on involving Ethereum as a way to anchor project metadata, pay contributors, etc. Radicle is still in early development so these features are not yet implemented. Here is my take on the How it Works documentation, which is a little confusing due to its early stage and some incomplete sentences or typos. I don't know whether my understanding actually corresponds to the Radicle implementation that exists today or its eventual vision, because I haven't studied the code or tried running the software. However, the ideas that the documentation has brought up are interesting and fruitful in their own right, so I wanted to share them and explain them in my own words in case you also find them worth exploring.

The git data model

Let's quickly review the git data model because it is important for understanding peer-to-peer git forges. A git repository contains a refs/ subdirectory that provides a namespace for local branch heads (refs/heads/), local and remotely fetched tags (refs/tags/), and remotely fetched branches (refs/remotes/<remote>/). Actually this namespace layout is just a convention for everyday git usage and it's possible to use the refs/ namespace differently as we will see. The git client fetches refs from a remote according to a refspec rule that maps remote refs to local refs. This gives the client the power to fetch only certain refs from the server. The client can also put them in a different location in its local refs/ directory than the server. For details, see the git-fetch(1) man page.

Refs files contain the commit hash of an object stored in the repository's object database. An object can be a commit, tree (directory), tag, or a blob (file). Branch refs point to the latest commit object. A commit object refers to a tree object that may refer to further tree objects for sub-directories and finally the blob objects that make up the files being stored. Note that a git repository supports disjoint branches that share no history. Perhaps the most well-known example of disjoint branches are the GitHub Pages and GitLab Pages features where these git forges publish static websites from the HTML/CSS/JavaScript/image files on a specific branch in the repository. That branch shares no version history with other branches and the directories/files typically have no similarity to the repository's main branch.

Now we have covered enough git specifics to talk about peer-to-peer git forges. If you want to learn more about how git objects are actually stored, check out my article on the repository layout and pack files.

Identity and authority

Normally a git repository has one or more owners who are allowed to push refs. No one else has permission to modify the refs namespace. What if we tried to share a single refs namespace with the whole world and everyone could push? There would be chaos due to naming conflicts and malicious users would delete or change other users' refs. So it seems like an unworkable idea unless there is some way to enforce structure on the global refs namespace.

Peer-to-peer systems have solutions to these problems. First, a unique identity can be created by picking a random number with a sufficient number of bits so that the chance of collision is improbable. That unique identity can be used as a prefix in the global ref namespace to avoid accidental collisions. Second, there needs to be a way to prevent unauthorized users from modifying the part of the global namespace that is owned by other users.

Public-key cryptography provides the primitive for achieving both these things. A public key or its hash can serve as the unique identifier that provides identity and prevents accidental collisions. Ownership can be enforced by verifying that changes to the global namespace are signed with the private key corresponding to the unique identity.

For example, we fetch the following refs from a peer:

<identity>/
  heads/
    main
  metadata/
    signed_refs

This is a simplified example based on the Radicle documentation. Here identity is the unique identity based on a public key. Remember no one else in the world has the same identity because the chance of generating the same public key is improbable. The heads/ refs are normal git refs to commit objects - these are published branches. The signed_refs ref points to an git object that contains a list of commit hashes and a signature generated using the public key. The signature can be verified using the public key.

Next we need to verify these changes to check that they were created with the private key that is only known to the identity's owner. First, we check the signature on the object pointed to by the signed_refs ref. If the signature is not valid we reject these changes and do not store them in our local repository. Next, we look up each ref in heads/ against the list in signed_refs. If a ref is missing from the list then we reject these refs and do not allow them into our local repository.

This scheme lends itself to peer-to-peer systems because the refs can be propagated (copied) between peers and verified at each step. The identity owner does not need to be present at each copy step since their cryptographic signature is all we need to be certain that they authorized these refs. So I can receive refs originally created by identity A from peer B and still be sure that peer B did not modify them since identity A's signature is intact.

Now we have a global refs namespace that is partitioned so that each identity is able to publish refs and peers can verify that these changes are authorized.

Gossip

It may not be clear yet that it's not necessary to clone the entire global namespace. In fact, it's possible that no single peer will ever have a full copy of the entire global namespace! That's because this is a distributed system. Peers only fetch refs that they care about from their peers. Peers fetch from each other and this forms a network. The network does not need to be fully connected and it's possible to have multiple clusters of peers running without full global connectivity.

To bootstrap the global namespace there are seed repositories. Seeds are a common concept in peer-to-peer systems. They provide an entry point for new peers to learn about and start participating with other peers. In BitTorrent this is called a "tracker" rather than a "seed".

According to the Radicle documentation it is possible to directly fetch from peers. This probably means a git-daemon(1) or git-http-backend(1) needs to be accessible on the public internet. Many peers will not have sufficient network connectivity due to NAT limitations. I guess Radicle does not expect every user to participate as a repository.

Interestingly, there is a gossip system for propagating refs through the network. Let's revisit the refs for an identity in the global namespace:

<identity>/
  heads/
    main
  metadata/
    signed_refs
  remotes/
    <another-identity>/
      heads/
        main
        foo
      metadata/
        signed_refs
      remotes/
        ...

We can publish identities that we track in remotes/. It's a recursive refs layout. This is how someone tracking our refs can find out about related identities and their refs.

Thanks to git's data model the commit, tree, and blob objects can be shared even though we duplicate refs published by another identity. Since git is a content-addressable object database the data is stored once even though multiple refs point to it.

Now we not only have a global namespace where anyone can publish git refs, but also ways to build a peer-to-peer network and propagate data throughout the network. It's important to note that data is only propagated if peers are interested in fetching it. Peers are not forced to store data that they are not interested in.

How data is stored locally

Let's bring the pieces together and show how the system stores data. The peer creates a local git repository called the monorepo for the purpose of storing portions of the global namespace. It fetches refs from seeds or direct peers to get started. Thanks to the remotes/ refs it also learns about other refs on the network that it did not request directly.

This git repository is just a data store, it is not usable for normal git workflows. The conventional git branch and git tag commands would not work well with the global namespace layout and verification requirements. Instead we can clone a local file:/// repository from the monorepo that fetches a subset of the refs into the conventional git refs layout. The files can be shared because git-clone(1) supports hard links to local repositories. Thanks to githooks(5) and/or extensible git-push(1) remote helper support it's possible to generate the necessary global namespace metadata (e.g. signatures) when we push from the local clone to the local monorepo. The monorepo can then publish the final refs to other peers.

Building a peer-to-peer git forge

There are neat ideas in Radicle and it remains to be seen how well it will grow to support git forge functionality. A number of challenges need to be addressed:

Usability - Radicle is a middle-ground between centralized git forges and email-based decentralized development. The goal is to be easy to use like git forges. Peer-to-peer systems often have challenges providing a human-friendly interface on top of public key identities (having usernames without centralized user accounts). Users will probably prefer to think in terms of repositories, merge requests, issues, and wikis instead of peers, gossip, identities, etc.
Security - The global namespace and peer-to-peer model is a target that malicious users will attack by trying to impersonate or steal identities, flood the system with garbage, game reputation systems with sockpuppets, etc.
Scalability - Peers only care about certain repositories and don't want to be slowed down by all the other refs in the global namespace. The recursive refs layout seems like it could cause performance problems and maybe users will configure clients to limit the depth to a low number like 3. At first glance Radicle should be able to scale well since peers only need to fetch refs they are interested in, but it's a novel way of using git refs, so we can expect scalability problems as the system grows.
Data model - How will this data model grow to handle wikis, issue trackers, etc? Issue tracker comments are an example of a data structure that requires conflict resolution in a distributed system. If two users post comments on an issue, how will this be resolved without a conflict? Luckily there is quite a lot of research on distributed data structures such as Conflict-free Replicated Data Types (CRDTs) and it may be possible to avoid most conflicts by eliminating concepts like linear comment numbering.
CI/CD - As mentioned in my blog post about why git forges are von Neumann machines, git forges are more than just data stores. They also have a computing model, initially used for Continuous Integration and Continuous Delivery, but really a general serverless computing platform. This is hard to do securely and without unwanted resource usage in a peer-to-peer system. Maybe Radicle will use Ethereum for compute credits?

Conclusion

Radicle is a cool idea and I look forward to seeing where it goes. It is still at an early stage but already shows interesting approaches with the global refs namespace and monorepo data store.

Monday, November 30, 2020

Git Forge Apps: Why git forges are serverless computing providers

In this post I want to explore an idea for a new type of application made possible by the power of git forges like GitLab and GitHub. I don't have a proof-of-concept and in fact we'll discuss hurdles that could make the idea infeasible. But it's an interesting way of thinking that is worth sharing as it offers a new way of looking at and using git forges.

Git forges offer git repository hosting at their core and then add on related features like pull requests, wikis, issue trackers, static website hosting, continuous integration, and more. They are now a long way away from the initial idea of a hosted git repository where you can publish code. They do so much more. They have a von Neumann architecture and are Turing complete. It's possible to do interesting things with that.

A git forge app (GFA) is an application that runs on a git forge. A hosted git repository is not just a place to publish and back up the source code. It's where the app runs. You can view the git forge as a Platform-as-a-Service (PaaS) or serverless computing provider. The app doesn't need to be deployed elsewhere, because the git forge itself is the execution environment.

A GFA must be able to:

Process data.
Store data.
Interact with the outside world.

This is basically what computer applications do. Git forges have become powerful enough to do this.

Imagine the following: you fork a repository and this instantiates the GFA. The GFA is now running under your git forge account. The web interface is available at https://<user>.gitlab.com/<app>/ and HTTP POST requests can also be used to interact with the application. The GFA could be a todo list, an RSS reader, a blog, etc. To the website visitor it appears like any other web application but everything happens within the git forge and no other hosting provider is necessary.

Let's look at how it's possible to use a git forge as the execution environment for an app.

Processing Data

The first order of business is executing code so that the application can process data. A configuration file is placed into a git repository to define runnable jobs. The job execution systems offered by git forges are GitLab CI and GitHub Actions, respectively. When the job is triggered it executes in a temporary environment, for example a Linux container image. It allows code hosted in the git repository to run on a server somewhere for a little while. There is a free allowance that grants a certain number of hours of unpaid execution time. This is how GFAs process data.

For example, say we are building an RSS reader. Our repo looks like this:

rss-reader/
    .gitlab-ci.yml - the job configuration file
    fetch-new-feeds.sh - download RSS feeds and check for new posts

The .gitlab-ci.yml file will contain a scheduled pipeline that runs fetch-new-feeds.sh every 15 minutes.

Storing Data

Applications need to store data persistently. The primary way of doing that in a git forge is by storing data in the git repository itself. Mutable data can be kept on a separate branch called data and can be force-pushed to avoid forever increasing the repository size when we don't need to store previous revisions of the data.

This works because each git branch has an independent commit history. The file namespace it stores is completely separate from other branches. This means it's possible to have the GFA's code on the main branch and to store data on the data branch without upsetting the commit history or files on the main branch.

Data storage is available to jobs because they are allowed to manipulate their own git repository thanks to an authentication token. On GitHub GITHUB_TOKEN allows jobs to push to their git repository. This gives jobs a way of storing data.

There are other forms of data storage besides the git repository. Artifacts are files produced by a job that can be downloaded via a URL. Artifacts are subject to expiry time and file size limitations. Caching is available to temporarily store files between job runs, although this data can be lost at any time.

At this point you may be thinking that this is nice but there is no way to store private data if the git repository is public. Git forges offer a secrets mechanism where variables can be stored privately and only made available during job execution. This can be used to stash an encryption key so that the data is stored in public but is encrypted. Anyone can download the encrypted data but they will not have the key needed to decrypt it. Some applications can also take advantage of client-side encryption and homomorphic encryption.

Interacting with the Outside World

Git forges offer a wealth of ways to interact with the outside world, called triggers. Triggers can run a job when an HTTP POST request is made. POST requests typically need a trigger token for authentication, but that token can be published if the triggered job is safe to be invoked by anyone. Both browser clients and Webhooks can invoke triggers through HTTP POST requests.

For example, imagine our RSS reader app needs an API to mark feeds as read. We define an HTTP POST trigger that runs a mark-feed-read.sh script that updates the feeds stored in the data branch. Since only the git repository owner should be able to invoke this trigger we do not publish the trigger token. Instead it is stored encrypted in the repository and the user provides a passphrase for decrypting this secret.

At this point it is also useful to offer a web interface. This is possible thanks to the static pages hosting feature called GitLab Pages and GitHub Pages. Pushing HTML, JavaScript, CSS, and images to a special branch publishes the static website at https://<user>.gitlab.com/<app>/.

For the most part GFAs are asynchronous, they cannot handle HTTP requests synchronously like a web application that outputs an HTTP response. Incoming HTTP requests simply schedule GFA jobs that run sometime later. There are a few ways around this. The browser can poll for results using XMLHttpRequest or similar techniques. Alternatively, a GFA can fire up a job that runs for several minutes although I haven't investigated if there is a practical way to communicate with a web application running in a job (I guess incoming HTTPS is tricky to achieve).

Triggers offer a lot more fun than what I've already covered. They can respond to the creation and modification of wiki pages, issues, and pull request comments. This means it's possible to use those entities for interacting with the outside world. The GFA could act as a chat bot on its issue tracker, for example.

Limitations

Git forges weren't designed for GFAs. But then they weren't initially designed to be Turing complete and offering ways to interact with the outside world either. That was added incrementally as demand for that functionality grew. Maybe git forges will evolve into full-blown serverless hosting competitors to today's cloud hosting providers.

GFAs are a hack that uses features like static pages, CI jobs, and webhooks in a creative way to run applications on a git forge. Building GFAs that are actually useful is likely to hit some challenges:

No synchronous requests - it is hard to implement search queries and other dynamic behavior in GFAs because they are primarily asynchronous (and slow!). This limitation matters for certain classes of applications. A lot can be done client-side instead to make up for this deficiency. But maybe someone can figure out how to do synchronous requests in GFAs.
Security - I have outlined how to make data publicly available and also how to encrypt it so only the git repository owner can view the plaintext. This is enough for personal web applications, but it's not sufficient for multi-user applications. Maybe git forges offer a form of authentication that works with GFAs so multiple users can store private data in a single GFA instance (the git repository owner could view all users' data but other users could not).
Free usage tiers - job execution time, storage capacity, and request throttling will limit how resource-intensive a GFA can be before it outgrows the free tier and eventually even the paid tier.

The first generation of GFAs could be written from scratch with each job carefully designed. Then a second generation of GFAs could be built on top of frameworks that abstract the tedious git forge integration with standard APIs and data models. For example, a Vue.js frontend could use a key/value store API and the whole thing can be hosted as a GFA.

Even if GFAs don't become a thing because git forges decide there is not enough demand to make them work really well, changing your perspective to think of git forges in this way is valuable. For example, I have a git repo for building a container image that I deploy on my server and that pushes output files to a web server host. All of that can be replaced with a git repository that runs a job and publishes to GitLab Pages. This simplifies things and frees me from maintaining infrastructure.

Conclusion

Git forges offer jobs for processing data, git repositories and artifacts for storing data, and triggers for interacting with the outside world. It is possible to build applications that exist solely as a git repository on a git forge. There is no longer a need to deploy code because the git forge itself is powerful enough to run non-trivial applications. I look forward to how this evolves and whether git forges eventually become full-blown cloud service providers.

Friday, May 13, 2016

Git: Internals of how objects are stored

Git has become the ubiquitous version control tool. A big community has grown around Git over the years. From its origins with the Linux kernel community to GitHub, the popular project hosting service, its core design has scaled. Well it needed to since from day one the Linux source tree and commit history was large by most standards.

I consider Git one of the innovations (like BitTorrent and Bitcoin) that come along every few years to change the way we do things. These tools didn't necessarily invent brand new algorithms but they applied them in a way that was more effective than previous approaches. They have designs worth studying and have something we can learn from.

This post explains the internals of how git show 365e45161384f6dc704ec798828dc927d63e3b22 fetches the commit object for this SHA1. This SHA1 lookup is the main query operation for all object types in Git including commits, tags, and trees.

Repository setup

Let's start with a simple repo:

$ mkdir test && cd test
$ git init .
Initialized empty Git repository in /tmp/test/.git/
$ echo 'Hello world' >test.txt
$ git add test.txt && git commit -s

Here is the initial commit:

$ git show --format=raw
commit 365e45161384f6dc704ec798828dc927d63e3b22
tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700
committer Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700

    Initial checkin
    
    Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>

diff --git a/test.txt b/test.txt
new file mode 100644
index 0000000..802992c
--- /dev/null
+++ b/test.txt
@@ -0,0 +1 @@
+Hello world

Loose objects

The function to look up an object from a binary SHA1 is struct object *parse_object(const unsigned char *sha1) in Git v2.8.2-396-g5fe494c. You can browse the source here. In Git the versioned data and metadata is stored as "objects" identified by their SHA1 hash. There are several types of objects so this function can look up commits, tags, trees, etc.

Git stores objects in several ways. The simplest is called a "loose object", which means that object 365e45161384f6dc704ec798828dc927d63e3b22 is stored zlib DEFLATE compressed in its own file. We can manually view the object like this:

$ openssl zlib -d <.git/objects/36/5e45161384f6dc704ec798828dc927d63e3b22
commit 241tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700
committer Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700

Initial checkin

Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>

There! So this little command-line produces similar output to $ git show --format=raw 365e45161384f6dc704ec798828dc927d63e3b22 without using Git.

Note that openssl(1) is used in this example as a convenient command-line tool for decompressing zlib DEFLATE data. This has nothing to do with OpenSSL cryptography.

Git spreads objects across subdirectories in .git/objects/ by the SHA1's first byte to avoid having a single directory with a huge number of files. This helps file systems and userspace tools scale better when there are many commits.

Despite this directory layout it can still be inefficient to access loose objects. Each object requires open(2)/close(2) syscalls and possibly an access(2) call to check for existence first. This puts a lot of faith in cheap directory lookups in the file system and is not optimal when the number of objects grows.

Pack files

Git's solution to large numbers of objects is found in .git/objects/pack/. Each "pack" consists of an index file (.idx) and the pack itself (.pack). Here is how that looks in our test repository:

$ git repack
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ ls -l .git/object/pack/
total 8
-r--r--r--. 1 stefanha stefanha 1156 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.idx
-r--r--r--. 1 stefanha stefanha  246 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.pack

In order to look up an object from a SHA1, the index file must be used:

The index file provides an efficient way to look up the byte offset into the pack file where the object is located. The index file's 256-entry lookup table takes the first byte of the SHA1 and produces the highest index for objects starting with that SHA1 byte. This lookup table makes it possible to search only objects starting with the same SHA1 byte while excluding all other objects. That's similar to the subdirectory layout used for loose objects!

The SHA1s of all objects inside the pack are stored in a sorted table. From the 1st SHA1 byte lookup we now know the lowest and highest index into the sorted SHA1 table where we can binary search for the SHA1 we are looking for. If the SHA1 cannot be found then the pack file does not contain that object.

Once the SHA1 has been found in the sorted SHA1 table, its index is used with the offsets table. The offset is the location in the pack file where the object is located. There is a minor complication here: offset table entries are only 32-bits so an extension is necessary for 64-bit offsets.

The 64-bit offset extension uses the top-most bit in the 32-bit offset table entry to indicate that a 64-bit offset is needed. This means the 32-bit offset table actually only works for offsets that fit into 31 bits! When the top-most bit is set the remaining bits form an index into the 64-bit offset table where the actual offset is located.

Each object in the pack file has a header describing the object type (commit, tree, tag, etc) and its size. After the header is the zlib DEFLATE compressed data.

Deltas

Although the pack file can contain simple zlib DEFLATE compressed data for objects, it can also contain a chain of delta objects. The chain is ultimately terminated by a non-delta "base object". Deltas avoid storing some of the same bytes again each time a modified object is added. A delta object describes only the changed data between its parent and itself. This can be much smaller than storing the full object.

Think about the case where a single line is added to a long text file. It would be nice to avoid storing all the unmodified lines again when the modified text file is added. This is exactly what deltas do!

In order to unpack a delta object it is necessary to start with the base object and apply the chain of delta objects, one after the other. I won't go into the details of the binary format of delta objects, but it is a memcpy(3) engine rather than a traditional text diff produced by diff(1). Each memcpy(3) operation needs offset and length parameters, as well as a source buffer - either the parent object or the delta data. This memcpy(3) engine approach is enough to delete bytes, add bytes, and copy bytes.

Conclusion

Git objects can be stored either as zlib DEFLATE compressed "loose object" files or inside a "pack file". The pack file reduces the need to access many files and also features deltas to reduce storage requirements.

The pack index file has a fairly straightforward layout for taking a SHA1 and finding the offset into the pack file. The layout is cache-friendly because it co-locates fields into separate tables instead of storing full records in a single table. This way the CPU cache only reads data that is being searched, not the other record fields that are irrelevant to the search.

The pack file format works as an immutable archive. It is optimized for lookups and does not support arbitrary modification well. An advantage of this approach is that the design is relatively simple and multiple git processes can perform lookups without fear of races.

I hope this gives a useful overview of git data storage internals. I'm not a git developer so I hope this description matches reality. Please point out my mistakes in the comments.