Git for Mathematicians (2): The Theory

Part 1: Preliminaries
/post/git-1-preliminaries
Part 2: The Theory
/post/git-2-theory
Part 3a: The Practice
/post/git-3-practice

This post is the second in a series in which I try to explain how to use Git to write papers, with an audience of professional mathematicians in mind. The first part, which was about why one would want to use Git, is here. Let us now dive into the second part, in which I explain a little of what is going on “under the hood” of Git.

While it is not strictly necessary to know all this to use Git, I think that understanding the mechanics helps in actually using it correctly and efficiently. Commands like git push or git pull are actually a bit complex and it is useful to know what words like “commit”, “branch”, “remote”, etc. refer to, especially when there is a conflict between branches.

Note

Of course, I will not be able to explain everything about Git’s inner workings! That is what the reference documentation is for.

Commits

Since this post is intended for mathematicians, I hope that I can get away with some mathematical terminology. For a given Git repository, the history is stored as a rooted directed graph (without directed cycles) with marked vertices. Quite a mouthful, isn’t it? Let me try to explain what this all means.

From all the metadata of a commit, Git computes a commit ID. This commit ID is an SHA-1 hash, which typically looks like this: 4303a91e4e4f5fedceead0d4dfe939471451e65d. Commit IDs depend on all the metadata, including the IDs of the parent commits. The parent commits in turn depend on their own parents, and so on, all the way back to the initial commit. A reference to a commit thus depends on the whole history of the repository up to that point. This is useful to note, as some commands can be used to rewrite history; any such rewrite changes the ID of every rewritten commit and of all the commits that descend from it.
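To make the dependence on the whole history concrete, here is a minimal sketch in Python. It is a toy model, not Git's actual object format: the function names `commit_id` and `build_history`, the author name, and the exact layout of the hashed text are all assumptions made for illustration. The only point is that each ID is a hash of data that includes the parent's ID.

```python
import hashlib


def commit_id(message, author, parents):
    """Toy commit ID: SHA-1 of the metadata, which includes the parent IDs."""
    body = "".join(f"parent {p}\n" for p in parents)
    body += f"author {author}\n\n{message}\n"
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def build_history(messages, author="A. Mathematician"):
    """Return the commit IDs of a linear history with the given messages."""
    ids = []
    parents = []  # the initial commit has no parent
    for message in messages:
        cid = commit_id(message, author, parents)
        ids.append(cid)
        parents = [cid]  # the next commit points back to this one
    return ids


original = build_history([
    "Started working on the paper",
    "Wrote proof of Proposition 3",
    "Posted to arXiv",
])

# Rewrite history: only the message of the very first commit changes.
rewritten = build_history([
    "Started working on the article",
    "Wrote proof of Proposition 3",
    "Posted to arXiv",
])

for old, new in zip(original, rewritten):
    print(old[:8], "->", new[:8])
```

Although the second and third commits carry the same messages in both histories, their IDs differ, because each one hashes the (changed) ID of its parent.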

Staging

Alright, now we know what the history of a Git repository looks like. But in practice, how does one actually append changes into that history?

In addition to the history (which contains the commits), Git has a notion called the “staging area”. As you modify files in a repository, your actual files will diverge from what Git considers to be the latest version of the repository. Before actually committing changes to the history, you need to explicitly “stage” them. Concretely, this means that you select the changes that you want to insert into the history. Once these changes are selected and you are satisfied, you insert them into the history, creating a new commit whose parent is the previous commit, with a message that explains your changes. A typical workflow looks like this:

The modify-stage-commit cycle.
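The cycle can also be sketched as a small Python model. This is only an illustration under simplifying assumptions: the class `ToyRepo`, its attributes, and the file names are hypothetical, and real Git stages changes at a finer granularity than whole files.

```python
class ToyRepo:
    """Toy model of the modify-stage-commit cycle (not how Git stores data)."""

    def __init__(self):
        self.working = {}  # file name -> content as it currently is on disk
        self.index = {}    # the staging area: what the next commit will contain
        self.commits = []  # list of (message, snapshot) pairs, oldest first

    def stage(self, name):
        # Select the current content of one file for the next commit.
        self.index[name] = self.working[name]

    def commit(self, message):
        # Only what was staged enters the history.
        self.commits.append((message, dict(self.index)))


repo = ToyRepo()

# Modify: two files change on disk (hypothetical file names).
repo.working["paper.tex"] = "Proof of Proposition 3."
repo.working["notes.txt"] = "Half-baked idea, not ready to share."

# Stage: select only the change that should enter the history.
repo.stage("paper.tex")

# Commit: record the staged snapshot with a message.
repo.commit("Wrote proof of Proposition 3")

message, snapshot = repo.commits[-1]
print(message, sorted(snapshot))
```

The unstaged file stays in the working copy but is absent from the commit, which is the whole point of staging.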

This notion of staging is useful for various reasons, compared to blindly committing everything that’s changed in your repository:

Git has some features that help you with this:

Pointers

Branches

As we saw before, the history of a Git repository can get pretty complicated. The “latest version” can be difficult to determine: in the image with the forked history (before the merge commit), is the “latest version” the commit labeled “More work on Section 2”, or the one labeled “Even more work on Section 3”?

Another (more “advanced” but related) situation to consider is when you want to start working on a file without touching what is considered the “main” version of the paper. For example, you may have an idea for a new proof of Proposition 3, and you want to start rewriting it, but you may want to be able to easily go back to the “main” version. Moreover, while you are going off on your tangent, you may also want to make changes to the “main” version of the paper and immediately make them available to your coauthors.

Branches are a solution to these questions. A branch is merely a named pointer to a specific commit. Nothing more, nothing less. Every Git repository typically starts with a single branch called master (in the sense of master record). Nowadays, the default branch is sometimes called main. The name is not important.

Whenever you commit something to Git, you are actually doing two things:

  1. you commit the staged changes to the history, creating a new “point in time” with all its associated metadata;
  2. you move the pointer of the current branch to that new point in time.
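These two steps can be written down in a few lines of Python. This is a sketch under obvious simplifications: the dictionaries `commits` and `branches`, the variable `current`, and the made-up IDs `c0`, `c1`, … are illustrative names, not anything Git exposes.

```python
commits = {}                  # commit ID -> (message, tuple of parent IDs)
branches = {"master": None}   # branch name -> ID of the commit it points to
current = "master"            # the branch we are currently on


def commit(message):
    """Create a commit on the current branch and return its (toy) ID."""
    tip = branches[current]
    parents = () if tip is None else (tip,)
    cid = f"c{len(commits)}"            # made-up ID; Git would use a hash
    commits[cid] = (message, parents)   # step 1: new point in the history
    branches[current] = cid             # step 2: move the branch pointer
    return cid


first = commit("Started working on the paper")
second = commit("Wrote proof of Proposition 3")

print(branches)         # master now points to the second commit
print(commits[second])  # whose parent is the first commit
```

Note that the branch itself stores nothing but a single commit ID; everything else is recovered by following parent links.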

For example, in the first image, the master branch was pointing to “Started working on the paper” at first. Then as more commits were added, it moved to the right each time, until it pointed to “Posted to arXiv”, which I drew in green to indicate that it was the commit referenced by the master branch.

In the second image, instead of imagining that two authors have been working on the paper, it’s possible to imagine that a single author has been working with two different branches. One can imagine that the story went this way:

  1. The author wrote the two commits “Started working on the paper” and “Wrote proof of Proposition 3”.
  2. Then, the author had an idea for Section 3, but wasn’t sure that the idea would make its way into the final version of the paper. The author thus decided to create a new branch, named for example super-idea, and wrote the commit “Worked on Section 3”. At that point, the master branch still points to “Wrote proof of Proposition 3”, but super-idea points to the new commit.
  3. Then, the author noticed an important issue that requires immediate fixing in Section 2. The author switches back to master, commits the fix, and calls it “Worked on Section 2”. Some time later, the author commits “More work on Section 2”.
  4. After some more time, the author decides to start working again on Section 3. She switches back to super-idea, and commits “More work on Section 3” then “Even more work on Section 3”.
  5. At this point, the history looks like the second image. The author has a choice:
    • Either she’s happy with the changes to Section 3 and decides to merge the changes into the main branch (taking care of conflicts if any). She calls the appropriate Git command and creates a new commit, called “Merge!” in the third image. The master branch now points to this merge commit. The super-idea branch has become unnecessary: the whole history of the branch is now part of the history of the master branch. She can now safely delete it, or keep it around for sentimental reasons.
    • Or she decides that the changes to Section 3 were not worth it and keeps the master branch as it is. She can delete the super-idea branch. The yellow commits will remain in the Git repository, but will not be accessible from any branch. Git will notice this and eventually delete them to free up space. Or she can keep the branch around, in case a later idea allows the changes to be re-incorporated into the article, but continue working on the master branch in the meantime.
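In graph-theoretic terms, the two outcomes of this story differ in which commits are reachable from a branch. Here is a minimal Python sketch of that check. It uses the commit messages themselves as vertex labels, which is an assumption made for readability (Git uses commit IDs), and the helper names `ancestors` and `unreachable` are hypothetical.

```python
# The forked history of the story: commit -> list of parent commits.
parents = {
    "Started working on the paper": [],
    "Wrote proof of Proposition 3": ["Started working on the paper"],
    "Worked on Section 3": ["Wrote proof of Proposition 3"],
    "Worked on Section 2": ["Wrote proof of Proposition 3"],
    "More work on Section 2": ["Worked on Section 2"],
    "More work on Section 3": ["Worked on Section 3"],
    "Even more work on Section 3": ["More work on Section 3"],
}


def ancestors(commit, parents):
    """All commits reachable from `commit` along parent edges (inclusive)."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def unreachable(parents, branches):
    """Commits that no branch can reach; candidates for eventual cleanup."""
    reachable = set()
    for tip in branches.values():
        reachable |= ancestors(tip, parents)
    return set(parents) - reachable


# Outcome 1: merge, then delete super-idea. The merge commit has two parents.
merged = {**parents,
          "Merge!": ["More work on Section 2", "Even more work on Section 3"]}
after_merge = unreachable(merged, {"master": "Merge!"})

# Outcome 2: delete super-idea without merging.
after_delete = unreachable(parents, {"master": "More work on Section 2"})

print(after_merge)           # nothing is lost
print(sorted(after_delete))  # the three Section 3 commits
```

After the merge, deleting super-idea loses nothing, since every commit is an ancestor of “Merge!”. Without the merge, exactly the three Section 3 commits become unreachable.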

For a concrete example, this is exactly what I did while writing this post. I created a branch (unimaginatively) called git2 and started writing the post. But while writing it, I noticed that I forgot to call the FontAwesome script asynchronously. I switched back to the master branch and committed my change. This allowed me to immediately change my website without putting an incomplete article online. Then, when I was done with this article, I merged the git2 branch into the master branch. Git was smart enough to notice that there was no conflict: the header file that I modified for the script was not modified as part of this article. Thus, Git just merged the two changes gracefully. This resulted in this merge commit.

Remark

There exists a lightweight version of pointers called tags. Tags live separately from branches. There are two main differences between tags and branches: 1. a tag is typically immutable: when you make new commits to a branch, the tag stays where it is; 2. in addition to the name of the tag, one can add a message to a tag, much like a commit message. This is useful to keep track of special points of history. For articles, I have found it useful to create tags such as arXiv-v1, arXiv-v2… for the versions submitted to arXiv, submitted-v1, submitted-v2… for the versions submitted to the journal, etc.
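The difference between a tag and a branch can be sketched in Python as follows. This is a toy model: the dictionaries, the helper names `tag` and `commit`, the IDs `c2` and `c3`, and the tag message are assumptions made for illustration.

```python
branches = {"master": "c2"}  # branch name -> commit it points to (toy IDs)
tags = {}                    # tag name -> commit and optional message


def tag(name, message):
    """Attach a named, annotated marker to the commit master points to."""
    tags[name] = {"commit": branches["master"], "message": message}


def commit(new_id):
    """Committing moves the branch pointer; tags are left untouched."""
    branches["master"] = new_id


tag("arXiv-v1", "Version posted to arXiv")
commit("c3")  # more work after the arXiv submission

print(branches["master"])          # the branch has moved on
print(tags["arXiv-v1"]["commit"])  # the tag still marks the submitted version
```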

Remotes

You may have noticed something while reading the previous section. Suppose that two authors are working on the same article. Let’s say that they start from the same commit (for example, they copied the files around). Then they start working on the article and committing changes to the master branch. At this point, the two authors both have a branch called master, but the two branches refer to different commits! How can they be reconciled?

This is where remotes come in. Remember when I said in the first post that Git is distributed? A remote is just someone else (another user, a server…) that also has a copy of your repository and that you can access, typically through the network. This remote also has a full copy of the history of the repository, and their own branches.
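As a rough picture of the situation, here is a Python sketch in which two authors each hold their own copy of the repository. Everything in it is a simplifying assumption: the names `alice` and `bob`, the toy commit IDs, and the `fetch` helper, which only mimics the idea of retrieving someone else's commits and remembering where their branches point.

```python
def fetch(local, remote, remote_name):
    """Copy the remote's commits and record its branch tips under a prefix."""
    # Commits are immutable and identified by ID, so copying is always safe.
    local["commits"].update(remote["commits"])
    for branch, tip in remote["branches"].items():
        local["branches"][f"{remote_name}/{branch}"] = tip


# Both authors start from the same commit...
shared = {"c0": ("Started working on the paper", [])}

# ...and then each commits to their own master branch.
alice = {
    "commits": dict(shared, c1a=("Worked on Section 2", ["c0"])),
    "branches": {"master": "c1a"},
}
bob = {
    "commits": dict(shared, c1b=("Worked on Section 3", ["c0"])),
    "branches": {"master": "c1b"},
}

# Alice treats Bob's copy as a remote and retrieves his work.
fetch(alice, bob, "bob")

print(alice["branches"])
```

After the call, Alice's own master is unchanged, but her copy also contains Bob's commit and a pointer to his master, so the two lines of work can be merged like any two branches.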

You can essentially do two things with a remote:

This is illustrated in the following diagram:

Working with remotes

Now, what’s a good choice for a remote? Strictly speaking, a remote doesn’t have to be a central server that all your collaborators work with. You could work on your local copy of the repository, then you could meet with your collaborator and exchange commits and merge branches using some flash drive or whatever. As I said, a remote is nothing special: it’s just another copy of the repository.

This is, however, highly impractical. In general, one works with a central server such as GitHub or Bitbucket. Everyone agrees to push to and pull from that central server, which is available 24/7.

Wrapping up

Alright, I hope this helps you in understanding how Git works. In the next post, I hope to be able to explain how this all works in practice. In the meantime, there are some resources online, such as the Pro Git Book, that can be of use.

As you may know, this entire website is hosted in a Git repository on GitHub. If you see anything wrong above, please raise an issue there 🙂.