Difference between different git search commands

I think it helps to start with a clear definition of the word history. The word means different things to different people, and without a definition we could become quite confused.

Git stores one kind of history only. In Git, history is commits; commits are history. There is no real distinction between the two. Yet git log can show you file history—well, sort of—and git blame can show you line history within a file. Using git diff, you can see a sort of change history. What Git does is produce, or synthesize, those histories on demand, using the commits as the actual history.

To see how this works, we must look at the parts of a commit. The most direct way to do that is to look at some actual commit. Try running:

git rev-parse HEAD

and then:

git cat-file -p HEAD

in a repository. Here is an example (with an extra sed to replace @ with space, to possibly cut down on spam a bit) in the Git repository for Git itself:

$ git rev-parse HEAD
08da6496b61341ec45eac36afcc8f94242763468
$ git cat-file -p HEAD | sed 's/@/ /'
tree 27fee9c50528ef1b0960c2c342ff139e36ce2076
parent 07f25ad8b235c2682a36ee76d59734ec3f08c413
author Junio C Hamano <gitster pobox.com> 1570770961 +0900
committer Junio C Hamano <gitster pobox.com> 1570771489 +0900

Eighth batch

Signed-off-by: Junio C Hamano <gitster pobox.com>

This commit’s unique hash ID is 08da6496b61341ec45eac36afcc8f94242763468. This big ugly string of letters and digits represents a unique 160-bit number for this commit, and no other commit. Every commit anyone ever made has one of these, and no two commits ever have the same one.1

My clone of the Git repository for Git has a commit with this hash ID. That hash ID is this commit, in effect. I give Git the hash ID and Git can extract the data you see above as the output of git cat-file -p.

Inside the commit, then, we see these lines:

  • tree plus a big ugly hash ID: this is how Git saves the snapshot for the commit. You don’t need to know a lot more about this at this point—or maybe ever, except that the term tree refers to this kind of saved snapshot.

  • parent plus a big ugly hash ID: this is how Git knows which commit comes before commit 08da6496b61....

  • author and committer lines: these tell you who made the commit and when. They’re typically the same—or almost the same—like this, but if one user writes an early draft of a commit, and some other user actually puts the final version into the repository, you get two different names. (The earlier draft commit is probably still in some clones of the repository, at least for a few months.)

  • There can be a few other header lines (there’s one optional one for encoding, for instance), and then there is a blank line; the remainder of the lines are the commit subject and message body.

The most important part of all of this, for our purpose here, is that each commit stores a snapshot of source code, a log message, and a parent hash ID. Some commits store two parent hash IDs, and at least one stores no parent hash ID.2 The parent hash ID is just a raw hash ID, as you can see from the text above.
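The header-lines-then-message layout described above can be sketched as a small parser. This is only an illustration, not how Git itself reads commits: parse_commit is a hypothetical helper, and it assumes text shaped like the git cat-file -p output shown earlier (it ignores multi-line headers such as signatures).

```python
def parse_commit(text):
    """Split `git cat-file -p <commit>` style text into headers and message.

    Hypothetical helper for illustration only; multi-line headers that
    real commits may carry are not handled.
    """
    # Header lines end at the first blank line; the rest is the message.
    head, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)   # zero, one, or several parents
        else:
            commit[key] = value               # tree, author, committer, ...
    return commit


example = """\
tree 27fee9c50528ef1b0960c2c342ff139e36ce2076
parent 07f25ad8b235c2682a36ee76d59734ec3f08c413
author Junio C Hamano <gitster pobox.com> 1570770961 +0900
committer Junio C Hamano <gitster pobox.com> 1570771489 +0900

Eighth batch

Signed-off-by: Junio C Hamano <gitster pobox.com>
"""

info = parse_commit(example)
print(info["tree"], info["parents"])
```

Keeping the parents in a list covers all three cases: a root commit has an empty list, an ordinary commit has one entry, and a merge has two or more.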

Given the hash ID of any commit, Git can extract that commit’s contents, just as we saw above. That gives Git the hash ID of the snapshot of the commit, which lets Git extract all the files that are in that commit:

  • Every snapshot holds a full copy of every file—well, of every file that’s in that commit.

  • If you add a totally new file and make a new commit, obviously the old commits don’t have the new file. Only the new commit has the new file. The new commits continue to have all the old files though, because every snapshot has every file.

  • If you remove a file and make a new commit, the new commit doesn’t have that file. The old commits still do. The new commit still has every other file.

So, if you give Git two commit hash IDs, Git can extract both snapshots, then compare the two. Most files might be the same, in which case Git can just say nothing about them. Maybe you added a new file in the newer commit, and/or removed an old file. Git can say that the new file is added, and the old file is removed. And, of course, maybe you changed the contents of some file between the old and new commit, in which case Git can tell you that the file was modified. Not only that, Git can compare the old snapshot of that file to the new snapshot of the file, and tell you which lines changed.
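Here is a minimal sketch of that comparison, assuming each snapshot is modeled as a plain dict from path to file text. Real Git compares hash IDs of stored objects rather than file text, and the names compare_snapshots and changed_lines are made up for this example.

```python
import difflib


def compare_snapshots(old, new):
    """Classify paths as added, removed, or modified between two snapshots.

    Each snapshot is modeled as a dict mapping path -> file text.
    Unchanged files are simply not mentioned.
    """
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in old if p in new and old[p] != new[p])
    return added, removed, modified


def changed_lines(old_text, new_text):
    """Return the removed ('-') and added ('+') lines between two texts."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0))
    # The first two entries are the '---' / '+++' file headers; skip them.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


old = {"README": "hello\n", "old.txt": "x\n", "main.c": "int a;\nint b;\n"}
new = {"README": "hello\n", "new.txt": "y\n", "main.c": "int a;\nint c;\n"}

added, removed, modified = compare_snapshots(old, new)
print(added, removed, modified)
print(changed_lines(old["main.c"], new["main.c"]))
```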


1Technically, it is OK for two different commits to have the same hash, as long as they never meet. I like to refer to these as doppelgänger commits. You won’t find any of these in real situations. They can be found by brute force, though, and to head off even this chance, Git is eventually moving to even bigger and uglier hash IDs. Until then, they are SHA-1 checksums.

Here is how that works. The commit I showed above is 280 bytes long, and if you calculate the SHA-1 hash of the string commit 280 followed by an ASCII NUL followed by the bytes of the text above, you get the hash ID:

$ python3
...
>>> import subprocess
>>> data = subprocess.check_output("git cat-file -p HEAD", shell=True)
>>> header="commit {}\0".format(len(data)).encode('ascii')
>>> header
b'commit 280\x00'
>>> import hashlib
>>> h = hashlib.sha1()
>>> h.update(header)
>>> h.update(data)
>>> h.hexdigest()
'08da6496b61341ec45eac36afcc8f94242763468'

which is the hash ID of the commit.

This is why you cannot change any part of any commit. If you try, the data that go into the hash function above change, which means that the hash ID changes. The result is a new commit! The old commit continues to exist, and all you did was add one more new commit to the repository.
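To make the "change anything, get a new ID" point concrete, here is the same calculation wrapped in a function. The name git_object_id is my own, and the two byte strings are invented stand-ins, not real commits.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 of the header '<kind> <size>', a NUL byte, then the object bytes."""
    header = "{} {}\0".format(kind, len(body)).encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# Made-up commit-like data; only the message differs between the two.
original = b"author A U Thor\n\nfirst draft\n"
amended = b"author A U Thor\n\nfinal draft\n"

old_id = git_object_id("commit", original)
new_id = git_object_id("commit", amended)
print(old_id != new_id)   # different bytes in, different hash ID out
```

The same header-plus-data scheme applies to Git's other object types too, which is why the function takes the type name as a parameter.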

2A commit with two or more parents is a merge commit. A commit with no parents is a root commit. We won’t worry about these much here.


Now that we know about commits and hash IDs, let’s look at their connections

We just saw that a commit holds the hash ID of its parent commit. That might not seem like much, but it’s actually almost everything else we need.

Imagine you have a small repository with just three commits in it right now. Those three commits have three big ugly hash IDs, and we have no idea what they are, but we can just pretend their IDs are A, B, and C in that order.

We can draw this repository:

A <-B <-C

Commit C’s hash ID is whatever it really is, and inside C we find B’s actual hash ID. So commit C points to commit B. Git can use the hash ID from C to read commit B. Inside this commit, we find A’s actual hash ID, so B points to A. Git can use A’s hash ID to read it, and since it’s the very first commit, it has no parent and Git knows that it’s the first commit and can stop here.

To find C’s actual hash ID, Git needs a bit of help. This is where a branch name comes in. A name like master just holds the hash ID of the last commit in the chain. In other words, if C is the last commit, the name master holds the actual hash ID of C. We say that the name master points to C, and we can draw that in:

A--B--C   <-- master

The “arrows”—hash IDs—stored inside commits can’t be changed, as we saw in footnote 1, so we can get lazy and draw them as connecting lines. The arrows in branch names, however, do change. To make a new commit, we have Git write out a new snapshot—all the files—and grab our name and email address and so on, and a log message. Git needs to write all of this, plus the hash ID of commit C, into a new commit, which will get a new unique big ugly hash ID that we’ll just call D here. Commit D will point back to C:

A--B--C   <-- master
       \
        D

and now Git will make commit D be the last commit in the chain by writing D’s hash ID (whatever it really is) into the name master, so that master points to D instead of C:

A--B--C--D   <-- master

This is how Git works. Commits hold snapshots, plus parent hash IDs, so commits point to their parents. Branch names point to the last commit and that’s where Git starts: at the end. The last commit points one step back, to its parent. Its parent points one step back, to another earlier commit. That commit points one step back, and so on. We follow each commit, one at a time, and eventually we get to the root commit A and stop.

To make a new branch, we just create a new name, pointing to any existing commit, usually the one we have checked out right now, such as D from master:

A--B--C--D   <-- master, new-branch

Now we need one more thing in our drawing. We were on commit D, and we’re still on commit D, but which branch are we on? We add the name HEAD, attached to one of these branch names, to remember that:

A--B--C--D   <-- master, new-branch (HEAD)

Now if we make a new commit E, Git will update the name to which HEAD is attached, so we’ll get:

A--B--C--D   <-- master
          \
           E   <-- new-branch (HEAD)

If we go back to master, by attaching HEAD to the name master and picking commit D to work from, and make a new commit F, that will write F’s hash ID into master:

           F   <-- master (HEAD)
          /
A--B--C--D
          \
           E   <-- new-branch

Note that no existing commit changes. Note further that commits A through D are now on both branches.3 That’s because Git does not really think very much of branches. It’s the commits that matter. The branch names are only there to find the ends.
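The whole sequence of drawings can be replayed with a toy model. This is a sketch under simplifying assumptions: every commit has at most one parent, commits are named by single letters rather than hash IDs, and the Repo class and its methods are invented for illustration.

```python
class Repo:
    """Toy model: commits point to parents, branch names point to tips."""

    def __init__(self):
        self.parent = {}        # commit -> parent commit (None for the root)
        self.branches = {}      # branch name -> last commit in the chain
        self.head = "master"    # HEAD is attached to a branch *name*

    def commit(self, name):
        # The new commit points back to the current tip (None if first) ...
        self.parent[name] = self.branches.get(self.head)
        # ... and the branch HEAD is attached to now points to the new commit.
        self.branches[self.head] = name

    def branch(self, name):
        self.branches[name] = self.branches[self.head]

    def checkout(self, name):
        self.head = name

    def reachable(self, branch):
        """Walk backwards from the branch tip to the root."""
        found, current = [], self.branches[branch]
        while current is not None:
            found.append(current)
            current = self.parent[current]
        return found

    def branches_containing(self, commit):
        return sorted(b for b in self.branches if commit in self.reachable(b))


repo = Repo()
for letter in "ABCD":
    repo.commit(letter)
repo.branch("new-branch")
repo.checkout("new-branch")
repo.commit("E")
repo.checkout("master")
repo.commit("F")

print(repo.branches)
print(repo.branches_containing("D"))
```

Note that the commits themselves store nothing about branches: which branches contain a commit is computed by walking back from each branch tip.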

Drawing the commits like this, with their interconnections, produces the commit graph. In math/CS theory, a graph is defined as G = (V, E), where V is a set of vertices or nodes, and E is a set of edges connecting the nodes. The nodes here are the commits, and the edges are the one-way arrows that point backwards.4

Starting from these various ends—or, if you give Git a raw hash ID, starting from any commit—Git can always work backwards, towards the beginning of history. In general, that is the sort of thing Git does. As in graph theory and graph algorithms, we call this walking the graph.

Note that when we do walk this graph, we get pairs of commits at a time: from F, we move back to D, so that we have (D, F) as a pair. Then from D we move back to C, so that we have (C, D) as a pair. This repeats, and is all pretty simple, with a graph like this, until we get to the start: there’s nothing before A to pair with. To make it work, we have Git pretend that (_, A) is a pair: Git fakes it, with _ being a sort of fake empty commit: a commit with the empty tree as its snapshot.
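The pairing can be sketched as a small generator. This assumes a simple chain with single-parent commits; commit_pairs and the EMPTY placeholder are illustrative names, with EMPTY standing in for the fake empty commit.

```python
EMPTY = "_"   # stand-in for the fake empty commit that precedes the root


def commit_pairs(parent_of, tip):
    """Yield (parent, child) pairs while walking backwards from tip."""
    child = tip
    while child is not None:
        parent = parent_of[child]
        # The root has no parent, so pair it with the fake empty commit.
        yield (parent if parent is not None else EMPTY, child)
        child = parent


parent_of = {"A": None, "B": "A", "C": "B", "D": "C", "F": "D"}
pairs = list(commit_pairs(parent_of, "F"))
print(pairs)
```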

If we create a merge commit, we have a problem when walking backwards. Consider this little graph bit:

          I--J
         /    \
...--G--H      M--...
         \    /
          K--L

We start at the end as usual, and work our way back to M. But then what? We can go to J or L. As we’ll see in a moment, Git usually does both, but this gets pretty tricky.


3Git is a little weird this way: a commit does not remember which branch you were on when you made it. Many other version control systems do remember, keeping that information around forever. Git, in effect, argues that this information is worse than useless: that it’s just noise, interfering with valuable signal. You may agree or disagree, but that’s what Git does: it records only the backwards-looking chain of hash IDs, not the branch names.

4When the edges are one-way arrows, the graph theory people call them arcs. This kind of graph, with directed edges, is a directed graph. Git further constrains the graph to be devoid of cycles, which makes this a Directed Acyclic Graph or DAG. DAGs have a number of nice properties, and Git depends on them.


Now (finally!) we can answer questions about the various commands

Let’s start with just these Git commands from your list:

  • git log: this walks commit history, displaying commits. It has a lot of options.

  • git rev-list: this is basically just git log in disguise (and/or vice versa). The difference between them is how they are intended to be used: git log is what Git calls porcelain, or a user-oriented command, while git rev-list is what Git calls plumbing, or a command that’s designed to be used to build other commands. Rev-list is a key workhorse for Git, which implements some of the innards of git push and git fetch, for instance.

    In general, you use git rev-list the same way as you use git log, except that git rev-list automatically prints just the hash ID of each commit. This is particularly useful as input (or arguments) to another Git command, one that needs hash IDs.5

  • git grep: this looks at one snapshot, or one at a time, anyway.

  • git diff: this generally looks at two snapshots and compares them. (There are a lot of variations on this theme, because git diff can look at things that aren’t quite snapshots, and also has a few special purpose modes that we won’t get into here.)

To these we can add:

  • gitk: this is an extension that comes with Git, but is not really part of Git. It uses Tcl/Tk to draw a graphical representation of your commits, with more information, and can run various Git commands on commits (or, for showing diffs, pairs of commits). It actually works by running git rev-list in the background, collecting its output, and dynamically updating displayed information until git rev-list finishes. I don’t use it all that much. It’s quite handy sometimes for browsing commits, and might be able to do more than that, but since I don’t really use it, I can’t say that much more about it.

5Note that git rev-list can produce hash IDs for things that are not commits, but by default, it shows only commit hash IDs. By contrast, git log can only really print stuff about commits. Hence, they are related, but definitely not identical, despite being built from a single main-driver source file (with much of the rest of Git linked in, including the git diff engine).


git log

As a graph-walker, git log can do a lot of rather amazing things.

We already noted that it starts at the end(s) and works backwards, and when it does that, it generally gets commits in pairs. Let’s look at the ramifications of these two items:

  • By starting at the end and working backwards, git log can show us each commit’s log message. That’s its default action: show the hash ID, the author and/or committer, and the log message … then move on to the previous commit, and show the hash ID and the author and log message, and move on again.

  • Because it has the parent of each commit in hand as it looks at each commit, git log can invoke git diff on the parent-and-child pair to find the difference in the two snapshots. The difference, if any, between the parent and the child shows what changed in that commit.

  • We can have git log not print some commit(s). This is actually hugely useful. Suppose we have git log walk the history, from the end all the way to the beginning, one commit at a time, looking at commit-pairs. As it does so, we have it invoke git diff on the parent and child. At the root commit, we have it diff the empty tree against the root commit, so that every file is added.

    Meanwhile, we instruct git log not to print anything about the commit unless that diff says that file interesting.ext has changed, or been added or removed. Our git log will walk all the commits it can reach by stepping back one at a time, but it will only tell us about interesting commits: the ones that modified (or created or removed) the interesting file.

    This looks like file history. It isn’t—it’s just selected commit history—but it’s usually exactly what we want when we ask for file history.

  • Or, we can have git log look at the commit message. If the commit message contains some particular word(s), we have it show the commit. Otherwise, we have it not show the commit. This is git log --grep.

  • Or, we can have git log run the parent-vs-child diffs as before, but this time, instead of asking did file interesting.ext change, we ask it: does the diff text, regardless of which file(s) are changed, have some string or regular expression in it? These are git log -G and git log -S.

    The difference between -G and -S is that -G looks for its regular-expression argument anywhere in the changed lines of the diff, while -S looks for a change in the number of occurrences of its argument (a string by default, rather than a regular expression) between the parent and child. Given a source language in which you write func(args), git log -G func will find any diff in which a line calling func was added, removed, or changed, including one where only its arguments changed, while git log -S func will find any place where you added a new call to func or removed an existing call to func, but not one where you went from func(true) to func(false), for instance.
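The -G versus -S distinction is easier to see in a toy model. The two functions below are my own approximations of the behavior described above, applied to one parent/child pair of file texts; they are not Git's implementation, and they ignore details such as multiple files and rename handling.

```python
import difflib
import re


def pickaxe_S(old_text, new_text, needle):
    """Model of -S: did the number of occurrences of needle change?"""
    return old_text.count(needle) != new_text.count(needle)


def pickaxe_G(old_text, new_text, pattern):
    """Model of -G: does any added or removed line match the pattern?"""
    diff = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0))
    # Skip the two file-header entries, then strip the leading '+' or '-'.
    changed = [line[1:] for line in diff[2:] if line.startswith(("+", "-"))]
    return any(re.search(pattern, line) for line in changed)


# A call whose argument changed: same number of "func", but a changed line.
before, after = "x = func(true)\n", "x = func(false)\n"
print(pickaxe_G(before, after, "func"), pickaxe_S(before, after, "func"))

# A brand-new call: both options select this commit.
before2, after2 = "a\n", "a\nfunc(1)\n"
print(pickaxe_G(before2, after2, "func"), pickaxe_S(before2, after2, "func"))
```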

There is much more, including stuff I won’t touch on here, but there’s one important caveat to using git log. Remember that a merge commit, like our example commit M with parents J and L both, has two parents instead of just one. That’s a problem, and git log has many peculiarities as a result.

First, when it comes to diffing, git log generally just gives up. To see what happened in a regular commit, git log diffs the parent commit vs the child commit. A merge has at least two parents, possibly more. There is no easy way to compare all the parents to the child (but see git diff below), so by default, git log doesn’t even try. It just doesn’t diff them at all. This means that all of your “check the diffs” options—git log -G and git log -S, mainly—just don’t do anything here either.

Second, in order to follow both parents, git log uses a priority queue. In fact, it uses this same mechanism to deal with a command like:

git log master feature

where you are telling Git to start its graph-walk from two commits. Git can’t do that, so instead, it puts each commit’s hash ID into a queue. One of these two commits becomes more important than the other, and git log will pick that one next, for its “get parent(s), maybe do some diffs, etc” step.

The priority order for commits depends on git log arguments, such as --date-order, --author-date-order, and --topo-order. Using git log --graph forces --topo-order. These are all a bit complicated and I won’t go into detail here. The important thing to remember is that whenever git log has two commits to show, it still just shows them one at a time:

          I--J   <-- branch1
         /
...--G--H
         \
          K--L   <-- branch2

Running git log branch1 branch2 picks one of the two commits J and L. This one comes out of the queue, which now holds the other commit. Git shows (or doesn’t show) the chosen commit, comparing it with its parent I or K as appropriate. Then it puts that parent into the queue. Now the queue has whichever of J and L it didn’t show, plus the parent of the one it did show. It picks one of those two, shows (or doesn’t show) it, and puts the parent of that commit into the queue. Eventually it takes I or K out of the queue and puts H in. Usually by this time it has shown, or is about to show, the other of I and K. That would put H in again, but since that’s redundant, it doesn’t. So eventually there is only H in the queue: git log takes H out of the queue (which becomes empty), shows H (as compared to its parent G), and then puts G in the queue, which now has just one commit in it.

The same process happens when traversing backwards through a merge: git log puts all the parents into the queue. When one gets to the front of the queue, it gets taken out, shown or skipped as desired, and its parent(s) go into the queue, and the process repeats. A root commit has no parent, so no parent goes into the queue, which lets the queue drain and lets git log stop.
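Here is a minimal sketch of that queue-driven walk. It assumes the priority is simply "newest timestamp first", which is a stand-in for Git's real ordering rules; the walk function, the letter names, and the timestamps are all made up for the example.

```python
import heapq


def walk(parents, dates, starts):
    """Yield commits reachable from starts, newest first.

    parents: commit -> list of parent commits (empty for a root commit)
    dates:   commit -> timestamp (larger means newer)
    """
    seen = set(starts)
    queue = [(-dates[c], c) for c in seen]   # negate: heapq pops smallest
    heapq.heapify(queue)
    while queue:
        _, commit = heapq.heappop(queue)
        yield commit
        for p in parents[commit]:
            if p not in seen:                # H goes in once, not twice
                seen.add(p)
                heapq.heappush(queue, (-dates[p], p))


parents = {"G": [], "H": ["G"], "I": ["H"], "J": ["I"], "K": ["H"], "L": ["K"]}
dates = {"G": 1, "H": 2, "I": 3, "K": 4, "J": 5, "L": 6}

two_tips = list(walk(parents, dates, ["J", "L"]))
print(two_tips)

# The same code handles a merge: M simply contributes two parents.
merged_parents = dict(parents, M=["J", "L"])
merged_dates = dict(dates, M=7)
from_merge = list(walk(merged_parents, merged_dates, ["M"]))
print(from_merge)
```

The seen set is what keeps H from being queued twice, and the loop ends on its own once the root commit G contributes no parents.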

The git log command can do what Git calls History Simplification. This mainly consists of not putting in all parents when traversing merge commits. There are other kinds of history simplification. To learn about them, read the documentation. The stuff on simplification is complicated and hard to explain, and the documentation could use a lot more examples.

If you run git log with no starting-point commit, git log uses HEAD to find its starting commit. Since that’s just one commit, all the complexity of the priority queue vanishes, at least until you hit a merge commit in the history-walk.

git rev-list

The short way to describe this is that it’s like git log except that you never use it, you just feed its output to another Git command. Unlike git log, git rev-list requires a starting point, so to use it like git log, you will generally run git rev-list HEAD. Also, be aware that the documentation for both git log and git rev-list is generated from common source files. This means that some options that only make sense in, or are only allowed in, one of the two commands, leak into the other command’s documentation.

git grep

The git grep command is built to search files, generally as found in commits. However, like git diff below, you can have it use your work-tree or your index. (We have not touched on Git’s index yet; see git diff below.)

You can give git grep a commit identifier. Any identifier will do: a name, like branch-a, resolves to a commit hash ID, which specifies a snapshot. The name HEAD resolves to the hash ID of the commit you have checked out right now. A raw commit hash ID is a commit hash ID, which specifies a snapshot.

The grep command will search the associated files. It has a lot of options; see its documentation.
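As a rough picture of what searching one snapshot means, here is a sketch that models a snapshot as a dict from path to file text. The function grep_snapshot is hypothetical and supports none of the real command's options.

```python
import re


def grep_snapshot(snapshot, pattern):
    """Return (path, line number, line) for every matching line.

    snapshot is modeled as a dict mapping path -> file text.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(snapshot):
        for lineno, line in enumerate(snapshot[path].splitlines(), start=1):
            if regex.search(line):
                hits.append((path, lineno, line))
    return hits


snapshot = {
    "main.c": "int main(void)\n{\n    return func(1);\n}\n",
    "util.c": "int func(int x)\n{\n    return x;\n}\n",
}
hits = grep_snapshot(snapshot, r"func")
print(hits)
```

Searching a different commit just means handing the function a different snapshot; nothing about the search itself involves history.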

git diff

In general, git diff compares two commits. Any two commits will do: just give it two commit hash IDs, and it extracts the snapshot from the left side hash ID and the snapshot from the right side hash ID, and compares those two snapshots.

The output of git diff is a set of instructions: make these changes to this file, for each file. If you take the left-side snapshot and make the changes shown, you’ll get the same file that’s in the right-side snapshot. That’s not necessarily how someone actually changed the files, but it will have the same effect. If the old file and new file are the same, Git does not need to mention the file at all: there’s nothing to change in it.
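The "set of instructions" idea can be demonstrated with Python's difflib, which is not the algorithm Git uses but produces the same kind of result: a list of edits that turns the left side into the right side. The function names here are my own.

```python
import difflib


def make_instructions(old_lines, new_lines):
    """Return edit instructions that turn old_lines into new_lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    # Keep only the non-matching regions: (tag, start, end, replacement lines).
    return [(tag, i1, i2, new_lines[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_instructions(old_lines, instructions):
    """Apply the instructions to a copy of old_lines."""
    result = list(old_lines)
    # Work from the bottom up so that earlier line numbers stay valid.
    for _tag, i1, i2, replacement in reversed(instructions):
        result[i1:i2] = replacement
    return result


left = ["a", "b", "c"]
right = ["a", "x", "c", "d"]
instructions = make_instructions(left, right)
print(instructions)
print(apply_instructions(left, instructions) == right)
```

When the two sides are identical, the instruction list is empty, which corresponds to git diff saying nothing about an unchanged file.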

It’s pretty useful to compare the current commit—the one in HEAD—to what’s in your work-tree, so you can do that. But Git actually builds new commits from what is in the index or staging area.

The index / staging-area—these are two names for the same thing, plus a third one that’s mostly fallen out of use, where it’s called the cache—initially holds a copy6 of each file taken from the HEAD commit, i.e., the one you last checked out. When you change files in your work-tree, that doesn’t affect the index copy. This is why you must constantly git add files: that copies the file from the work-tree into the index. The index then holds the proposed next commit, and when you run git commit, Git turns the index copies of the files into the snapshot copies. Now the new commit matches the index, and we’re back to the situation you had when you checked out the commit that’s now the parent of the new commit you just made: the index copies of all of your files match the committed copies.

So:

  • git diff compares what’s in the index / staging-area—what is in your proposed next commit right now—with what’s in your work-tree. As with comparing two commits, you get a set of instructions that tell you how to change the index copy of each file into the work-tree copy of that file. If two files are the same, no instructions are needed, and git diff says nothing about those two copies.

  • git diff --cached or git diff --staged compares the HEAD commit—what’s currently committed, in other words—to the index / staging-area. That is, if you made a commit right now, this is what would be different. Note that the work-tree is irrelevant here!

  • git diff HEAD compares the HEAD commit to your work-tree.

  • git diff commit compares the given commit to the work-tree. The commit argument can be anything that names a commit, including a branch name or a raw hash ID.

  • git diff commit1 commit2 compares the two given commits.
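The first three forms differ only in which two of the three places (HEAD commit, index, work-tree) get compared. Here is a toy model, with each place represented as a dict from path to file text; changed_paths and the file names are invented for the example, and the comments say which command each comparison is meant to model.

```python
def changed_paths(left, right):
    """Paths whose content differs between two path -> text mappings."""
    return sorted(p for p in left.keys() | right.keys()
                  if left.get(p) != right.get(p))


head = {"a.txt": "one\n", "b.txt": "two\n"}   # snapshot in the HEAD commit
index = dict(head)                             # index starts as a copy of HEAD
worktree = dict(head)                          # and so does the work-tree

worktree["a.txt"] = "one, edited\n"            # edit both files ...
worktree["b.txt"] = "two, edited\n"
index["a.txt"] = worktree["a.txt"]             # ... but "git add" only a.txt

diff_plain = changed_paths(index, worktree)    # models: git diff
diff_cached = changed_paths(head, index)       # models: git diff --cached
diff_head = changed_paths(head, worktree)      # models: git diff HEAD
print(diff_plain, diff_cached, diff_head)

head = dict(index)                             # models: git commit
after_commit = changed_paths(head, index)      # index matches HEAD again
print(after_commit)
```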

Now, git diff has a couple of special syntax tricks. One of them is that the form A..B, which means one thing to git rev-list and git log, means something entirely different to git diff. In fact, it means the same thing as if you replaced the two dots with a space:

  • git diff A..B just means git diff A B.

When you use three dots, though, git diff does something a lot fancier: git diff A...B finds the merge base of A and B and compares that commit to B. This answer is too long to go into the details.

The command git show is pretty closely related to git diff. While it has a lot of other things it can do, its primary effect is to run git diff from the parent of the commit you name, to the commit you name. So git show shows you what changed. Like git log, it first shows the commit hash ID, author, and log message.

The last thing to mention is that git diff—and thus git show—has one last sneaky trick up its sleeve. Remember that we mentioned that git log normally doesn’t try to handle merge commit diffs, because they’re hard. But git diff / git show is willing to work a lot harder. If you git show a merge commit, Git will extract each parent, one at a time, and compare it to the child. Usually there are just two parents, which means it runs two internal git diffs. And then it combines the diffs.

A combined diff is very tricky. In an attempt to be useful, Git drops, from this combined diff, many of the actual differences. Let’s say that merge M has parents J and L, and that the diff from J to M says to change file_J.txt, but not file_L.txt, and the diff from L to M says to change file_L.txt but not file_J.txt. This combined diff will now say nothing about either file. But the diff from J to M says to change file_both.txt, and so does the diff from L to M. The combined diff will usually say something about file_both.txt. I believe the goal of this is to show you just the files where merge had to work harder, but sometimes, this is not what you wanted at all.
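The file-selection part of that rule can be sketched as follows. This models only which files get mentioned, assuming the rule is "show a file only if the merge's copy differs from every parent's copy"; real combined diffs also drop parts within files, and combined_diff_paths is a made-up name.

```python
def combined_diff_paths(merge, parents):
    """Paths whose content in the merge differs from every parent.

    merge and each entry of parents are dicts mapping path -> file text.
    """
    paths = set(merge)
    for snapshot in parents:
        paths |= set(snapshot)
    return sorted(p for p in paths
                  if all(snapshot.get(p) != merge.get(p)
                         for snapshot in parents))


# J-to-M changes file_J.txt only on J's side; L-to-M changes file_L.txt only
# on L's side; file_both.txt differs from the merge result on both sides.
J = {"file_J.txt": "old", "file_L.txt": "new", "file_both.txt": "J version"}
L = {"file_J.txt": "new", "file_L.txt": "old", "file_both.txt": "L version"}
M = {"file_J.txt": "new", "file_L.txt": "new", "file_both.txt": "merged"}

shown = combined_diff_paths(M, [J, L])
print(shown)
```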

The main thing you should remember about a combined diff is that it omits some files. To see which ones, consult the documentation for git diff-tree, which is a plumbing variant of git diff that can produce combined diffs pretty easily.

You can get git log to produce combined diffs using --cc or -c, but remember that these omit files. You can get git log to do fancier things with -m, but I really need to stop writing now. 😀


6Technically, the index holds references to internal, Git-format, frozen and compressed files, the way they do or will appear in the current or next commit, rather than actual copies of the data. But most of the time, you can’t really tell the difference, so you can just think of it as having a full copy of each file and not be all that far off.
