Introduction

Git is hard.

Well, not inherently, but it can often be hard to understand the underlying simplicity, when confronted with the endless stream of metaphors used to explain it and the various trickle-down commands from other developers.

Often, I find myself using git in a ritualistic manner, repeating that which has worked in the past mindlessly, without much of an understanding of what I'm really doing.

I want to actually, truly, understand git.

And it seems to me that if I am to truly understand git — grok it — I have to first grok how a git repository works. And what better way to do that than to put myself… ourselves… in the place of git, and understand what it takes to create a repository.

Creating a Git Repository from Scratch

Let's create a folder for us to work in.

$ mkdir grok
$ cd grok

Now, we have a folder. A folder isn't a git repository, as we can verify.

$ git status
fatal: not a git repository (or any of the parent directories): .git

As the status subcommand tells us, neither the directory we just created nor any of its parent directories is a git repository.

If that wasn't the case for you, that would be because you created the folder grok as a subdirectory of a git repository.

The error seems to mention a .git folder. Turns out, when speaking of a git repository, we are actually referring to this folder in most cases¹. So, in theory, it should be simple enough to create a git repository right, just create a .git folder and we are good to go?

$ mkdir .git
$ git status
fatal: not a git repository (or any of the parent directories): .git

Ohh, apparently not so.

If we were to look up the definition of a repository in the git glossary, we would see the following.

A collection of refs together with an object database containing all objects which are reachable from the refs, possibly accompanied by meta data from one or more porcelains. A repository can share an object database with other repositories via alternates mechanism.

Ahh, simple. Seems like we need to have some refs and an object database with all the objects refs can reach. Also, there might be meta data from porcelains?

Ehh, okay. Don't get overwhelmed!

That text is simply written in such an intimidating way to scare away the tech bros that thought this was a Luke Smith article. Now that they are gone, I'll let you know that this is actually quite simple. But I'm not just gonna explain it in 8 terse paragraphs; instead I'll simply show you that it's simple in a much more verbose way!

First, we need an object database. Well, database sounds a bit hard, we could just make a folder for it instead in our git repository.

$ mkdir .git/objects

Now, we also need a collection of refs. No issue, just make a folder for them.

$ mkdir -p .git/refs/heads

What are heads? Don't worry, we'll get to it. Now we have the bare minimum for a repository though, right?

Well, let's try to find out!

$ git status
fatal: not a git repository (or any of the parent directories): .git

Ohh… we don't. We actually need one more thing before we get ahead of ourselves.

$ echo "ref: refs/" > .git/HEAD

Now, is this finally a repository? Let's see what git thinks the status is.

$ git status
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

Seems like it is! …So now we have a git repository, that was easy.
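For the curious, the same three ingredients can be laid out from Python. This is only a sketch using the standard library: make_minimal_repo is a made-up helper name, and the default HEAD target refs/heads/main is my assumption (the shell command above writes the shorter "ref: refs/").

```python
import tempfile
from pathlib import Path


def make_minimal_repo(root: Path, head_ref: str = "refs/heads/main") -> Path:
    """Create the layout described above: objects/, refs/heads/ and a HEAD file."""
    git_dir = root / ".git"
    (git_dir / "objects").mkdir(parents=True)          # the object "database"
    (git_dir / "refs" / "heads").mkdir(parents=True)   # the collection of refs
    # HEAD is a plain text file that points at a ref.
    (git_dir / "HEAD").write_text(f"ref: {head_ref}\n")
    return git_dir


with tempfile.TemporaryDirectory() as tmp:
    git_dir = make_minimal_repo(Path(tmp))
    layout = sorted(p.relative_to(git_dir).as_posix() for p in git_dir.rglob("*"))
    head_text = (git_dir / "HEAD").read_text()

print(layout)
print(head_text, end="")
```

Nothing here talks to git at all, which is rather the point: at this stage a repository is just two directories and one small text file.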

Committing to a Repository

Creating Objects

Having a hand made git repository is a neat party trick, but it's not worth much if we can't fill it up with all our files.

We likely will want to fill this repository with text files. For the occasion, you can use fortune to generate something, but if you use what I got, you'll be able to compare your hashes to the ones in the article.

"If a listener nods his head when you're explaining your program, wake him up.

That is very wise indeed. We want to add this to the repository. But how?

Well, one way to add it is the following, using the git hash-object subcommand.

$ echo "If a listener nods his head when you're explaining your program, wake him up." | git hash-object --stdin -w
665e95f1674e9466cb429bdfebaf1b8792ef0eec

Okay, two questions now. What is 665e95f1674e9466cb429bdfebaf1b8792ef0eec, and what did hash-object just do?

Let's try to inspect our git repository, it might give us some clues. One way to do this is to show the directory tree structure.

$ tree .git
.git/
├── HEAD
├── objects
│   └── 66
│       └── 5e95f1674e9466cb429bdfebaf1b8792ef0eec
└── refs
    └── heads

4 directories, 2 files

Here, we see that our object “database” in the objects folder has seen some change. There is a folder, 66, with a file that has an equally vexing name, 5e95f1674e9466cb429bdfebaf1b8792ef0eec.

I wonder what is in this file. Let's try to take a look at it.

$ cat .git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec
WKX
  Y$b   `݆!rB-8}ɢ,xSOŋE598}y?

                            %

Ohh, not plaintext. This seems like some sort of binary format, let's take a look and see if we can figure out what the file actually contains. We do this using the file command.

$ file .git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec
.git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec: zlib compressed data

Apparently, it's some data compressed with zlib. That means it's likely DEFLATE compressed (RFC 1951) in a zlib wrapper (RFC 1950)².

There is actually a neat hack we can use to view this data³.

$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec | gzip -dc
blob 78If a listener nods his head when you're explaining your program, wake him up.

gzip: stdin: unexpected end of file

We found our string it seems, but what's this blob and 78 at the start? blob is the git object's type, and 78 is the blob's size in bytes⁴. Between the 78 and our text there is also a null byte, which the terminal simply doesn't show.
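If the gzip hack feels a bit too hacky, the same decompression can be sketched with Python's zlib module. Treat it as an illustration rather than how git itself does it: parse_loose_object is a name I made up, and the compressed bytes are built in memory as a stand-in for the real object file.

```python
import zlib


def parse_loose_object(compressed: bytes):
    """Split a zlib-compressed loose object into (type, size, content)."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\x00")  # the header ends at the first null byte
    obj_type, size = header.decode("ascii").split(" ")
    return obj_type, int(size), content


# Stand-in for the bytes of the object file: same layout, built in memory.
text = b"If a listener nods his head when you're explaining your program, wake him up.\n"
compressed = zlib.compress(b"blob %d\x00" % len(text) + text)

obj_type, size, content = parse_loose_object(compressed)
print(obj_type, size)
print(content.decode(), end="")
```

To try it on the real thing, pass the bytes of .git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec (read in binary mode) to parse_loose_object instead.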

Objects in git have 3 primary types:

blob
tree
commit

When we ran the hash-object subcommand, we created a blob. Regarding the name, that is just the SHA-1 hash (RFC 3174) of the object's content, prefixed with the small header we just saw (type, size and a null byte)⁵. The actual output of the command was this hash, written as 40 hexadecimal characters.

The name of the directory in the object database is the first two characters of the hash, that is 66, and the actual object file's name is the other 38 characters.
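Here is a small sketch of that naming scheme using Python's hashlib. The helper names git_object_id and loose_object_path are my own, and the sketch assumes SHA-1 repositories like the one in this post. If the text matches the article's byte for byte (including the trailing newline that echo adds), the printed id should be the one hash-object reported.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over the header '<type> <size>' plus a null byte, followed by the content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """The first two hex characters name the directory, the other 38 the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


text = b"If a listener nods his head when you're explaining your program, wake him up.\n"
object_id = git_object_id("blob", text)
print(object_id)
print(loose_object_path(object_id))
```

Note that the hash is taken over the uncompressed bytes; the zlib compression only happens when the object is written to disk.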

We probably don't wanna decompress all the objects manually just to inspect them. We can use the hash we got, 665e95f1674e9466cb429bdfebaf1b8792ef0eec, as a way to inspect the newly created object.

Inspecting Objects

We do this with the git cat-file subcommand. To see the type of an object, we use the type flag -t.

$ git cat-file -t 665e95f1674e9466cb429bdfebaf1b8792ef0eec
blob

That means that the type (-t) of the object we made is a blob, as we saw by inspecting it. What if we wanna see the contents of the blob? We use -p for print.

$ git cat-file -p 665e95f1674e9466cb429bdfebaf1b8792ef0eec
If a listener nods his head when you're explaining your program, wake him up.

Neat, that's a lot easier than remembering the gzip hack from earlier!

Also, another useful flag to know is -s, for size.

$ git cat-file -s 665e95f1674e9466cb429bdfebaf1b8792ef0eec
78

Ohh, if it isn't that 78 from earlier. As mentioned, this is the size of the object in bytes.

Index

Okay, enough about the blob. It seems like we have added an object – with some text – into the git repository now. We might wonder if this is reflected in the git status of the repo.

$ git status
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

Ohh, well… so what if we just added a bunch of objects? If git status isn't even aware of these objects, how do we know whether a repo is full of useless stuff? Well, that's actually a little too complicated for now, but an answer that might satisfy you is that we don't really, which is why git gc exists (gc for garbage collection).

Okay, that aside, how do we actually commit this object? Well, first, we have to do something else.

$ git update-index --add --cacheinfo 100644 665e95f1674e9466cb429bdfebaf1b8792ef0eec truth.txt

The long string is the hash of the blob object we created, and truth.txt is a fitting filename.

What about 100644? Looking at man git update-index we see that…

--cacheinfo <mode>,<object>,<path>,
--cacheinfo <mode> <object> <path>

Directly insert the specified info into the index.
For backward compatibility, you can also give these
three arguments as three separate parameters, but
new users are encouraged to use a single-parameter
form.

As you can see, we used the separate parameter form, even though we are encouraged to use the single-parameter one. But also, that 100644 is the mode of the file. Mode here refers to the file system permissions, and is taken from the way UNIX modes work… sorta, with limitations. You can think of it as octal permissions with a file-type prefix: 100 for a regular file, followed by 644 for the permission bits.

But crucially, for blobs in git, we only have 3 modes available⁶:

100644 a normal file
100755 an executable file
120000 a symbolic link
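Since these are ordinary UNIX-style mode numbers, Python's stat module can take them apart. A minimal sketch, where describe_mode is a hypothetical helper of mine and not anything git ships:

```python
import stat


def describe_mode(mode: int) -> str:
    """Name the kind of entry a git blob mode stands for."""
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISREG(mode):
        # Of the permission bits, only "executable or not" matters here.
        return "executable file" if mode & 0o100 else "normal file"
    return "something else"


for mode in (0o100644, 0o100755, 0o120000):
    print(oct(mode)[2:], describe_mode(mode))
```

The leading digits carry the file type and the last three digits the permissions, which is why a regular file and a symbolic link differ at the front of the number rather than the back.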

Let's take a look at the repo structure again.

$ tree .git
.git/
├── HEAD
├── index
├── objects
│   └── 66
│       └── 5e95f1674e9466cb429bdfebaf1b8792ef0eec
└── refs
    └── heads

4 directories, 3 files

Here we notice that there now is an index file. So all we did was add some index file? …well, slow down. Let's do a sanity check with git status.

$ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   truth.txt

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    deleted:    truth.txt

Aha!

So we did something, we staged (added) the truth.txt file… but we also have unstaged changes saying we deleted it?

Well, yes. We have told git that the index should contain a file called truth.txt, and it has the actual blob for this in its object database… but it can't find the file we told it about in the working tree. This makes sense, since we just created the blob from stdin and didn't actually add any files.

Since it can't find it, it thinks we must have deleted it, how presumptuous!

Hmm, while we are at it, let's take a look at .git/index, just to see what that is about.

$ cat .git/index
DIRCf^gNfB  truth.txtqblr33il%

Welp, that's not a textfile either. But what is it?

$ file .git/index
.git/index: Git index, version 2, 1 entries

Ohh, it's a Git index, version 2 with 1 entries. That must be the entry we created! Now that we… wait, you also wanna see what's inside the binary? Fine, if you insist.

$ hexdump -C .git/index
44 49 52 43 00 00 00 02  00 00 00 01 00 00 00 00  |DIRC............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 66 5e 95 f1  67 4e 94 66 cb 42 9b df  |....f^..gN.f.B..|
eb af 1b 87 92 ef 0e ec  00 09 74 72 75 74 68 2e  |..........truth.|
74 78 74 00 c1 b9 71 62  b4 ab c1 c2 6c 72 33 33  |txt...qb....lr33|
92 69 92 99 ac 6c 8e ff                           |.i...l..|

(I removed the offset from the output)

Hmm, it seems like the first 4 bytes, DIRC, are a magic number. Then, after three 00⁷ bytes, a 02. This must be the version number. The next four bytes, 00 00 00 01, look like the number of entries. Then, a whole bunch of 00 padding, some garbled bytes and the truth.txt file name.
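We can check those guesses by unpacking the dumped bytes with Python's struct module. The field layout below (a 12-byte header, then ten 32-bit fields, a 20-byte object id and 16 bits of flags per entry) is my reading of git's index-format documentation for version 2, so take it as a sketch rather than a full parser.

```python
import struct

# The hexdump from above, with the ASCII column removed.
INDEX_HEX = """
44 49 52 43 00 00 00 02  00 00 00 01 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00
00 00 00 00 66 5e 95 f1  67 4e 94 66 cb 42 9b df
eb af 1b 87 92 ef 0e ec  00 09 74 72 75 74 68 2e
74 78 74 00 c1 b9 71 62  b4 ab c1 c2 6c 72 33 33
92 69 92 99 ac 6c 8e ff
"""
data = bytes.fromhex("".join(INDEX_HEX.split()))

# 12-byte header: magic number, version, number of entries (all big-endian).
signature, version, entry_count = struct.unpack(">4sII", data[:12])

# First entry: ten 32-bit stat-like fields, a 20-byte object id, 16 bits of flags.
fields = struct.unpack(">10I20sH", data[12:74])
mode, object_id, flags = fields[6], fields[10], fields[11]
name_length = flags & 0x0FFF  # the low 12 bits of the flags hold the path length
path = data[74:74 + name_length].decode()

print(signature, version, entry_count)
print(oct(mode), object_id.hex(), path)
```

Most of the "padding" turns out to be timestamps, device and inode numbers and the like, all zero here because we inserted the entry with --cacheinfo instead of pointing git at a real file.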

I bet if we added another file, we'd be able to discern a pattern in this. And I wonder how the .git/index looks in e.g. a large project. Hmm, you'll have to attempt the exercises after reading this blog post I guess if you wanna find out :^)

Tree Objects

Back on track: we've now made the index thingy, and the next step is to turn it into a tree object with the write-tree subcommand.

$ git write-tree
a6325f064bac723691f20c0b1ed2bea82a1728fd

git status does not seem to have changed, but if we check what's in the repo's file structure, we'll see something did change.

$ tree .git
.git
├── HEAD
├── index
├── objects
│   ├── 66
│   │   └── 5e95f1674e9466cb429bdfebaf1b8792ef0eec
│   └── a6
│       └── 325f064bac723691f20c0b1ed2bea82a1728fd
└── refs
    └── heads

5 directories, 4 files

We notice the a6325f064bac723691f20c0b1ed2bea82a1728fd sha-1 hash refers to a new object. Interesting, wonder what this is.

Of course, because we are trying to get to the bottom of this object, we're gonna do it the hard way.

$ file .git/objects/a6/325f064bac723691f20c0b1ed2bea82a1728fd
.git/objects/a6/325f064bac723691f20c0b1ed2bea82a1728fd: zlib compressed data

Okay, that's zlib compressed data. We know that, so let's use that same hack to figure out what is inside of this object.

$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/a6/325f064bac723691f20c0b1ed2bea82a1728fd | gzip -dc
tree 37100644 truth.txtf^gNfB
gzip: stdin: unexpected end of file

Ohh. That isn't a blob. That's a tree, one of the 3 object types. That number, 37100644, might be hard to read, but it's actually two things run together: 37, the size in bytes, and 100644, the file mode (the null byte separating them doesn't show up in the terminal). Then, we seem to have the filename, and some weird f^gNfB thing.

That f^gNfB thing we actually have seen before, in the hexdump -C of the .git/index. What it is is left as an exercise to the reader.

Let's look at the contents the canonical way.

$ git cat-file -p a6325f064bac723691f20c0b1ed2bea82a1728fd
100644 blob 665e95f1674e9466cb429bdfebaf1b8792ef0eec    truth.txt
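If you'd like to see how those three columns end up as 37 bytes, here is a sketch that assembles the tree's raw body in Python (mild spoiler for one of the exercises, so skip it if you'd rather work that out yourself). The helpers tree_body and git_object_id are my own names, and sorting entries by plain byte order is an assumption that only holds for a flat list of files like ours.

```python
import hashlib


def tree_body(entries):
    """Serialize (mode, name, object id) triples the way a tree object stores them."""
    body = b""
    # Assumption: byte-wise sorting by name, enough for a flat list of files.
    for mode, name, object_id in sorted(entries, key=lambda e: e[1].encode()):
        # "<mode> <name>", a null byte, then the object id as 20 raw bytes (not hex).
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(object_id)
    return body


def git_object_id(obj_type, content):
    """SHA-1 over '<type> <size>', a null byte, and the content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


body = tree_body([("100644", "truth.txt", "665e95f1674e9466cb429bdfebaf1b8792ef0eec")])
print(len(body))
print(git_object_id("tree", body))
```

The first number printed is the size that shows up in the tree header; the second should match what write-tree printed, provided the entry is byte-for-byte the same as the one in this post.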

An Experiment with Tree Objects

Something that is too important to leave to the exercises is what happens when we have multiple files in a tree object. Let's create another blob.

$ echo "AMOGUS" | git hash-object --stdin -w
f58617716d903fb842b5606a335ff1406b9a21d3

And add it to our index file.

$ git update-index --add --cacheinfo 100644 f58617716d903fb842b5606a335ff1406b9a21d3 amogus.txt

Now, let's look at the repository. We have 3 objects in our objects database.

$ tree .git
.git
├── HEAD
├── index
├── objects
│   ├── 66
│   │   └── 5e95f1674e9466cb429bdfebaf1b8792ef0eec
│   ├── a6
│   │   └── 325f064bac723691f20c0b1ed2bea82a1728fd
│   └── f5
│       └── 8617716d903fb842b5606a335ff1406b9a21d3
└── refs
    └── heads

6 directories, 5 files

Let's dump the index binary.

$ hexdump -C .git/index
44 49 52 43 00 00 00 02  00 00 00 02 00 00 00 00  |DIRC............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 f5 86 17 71  6d 90 3f b8 42 b5 60 6a  |.......qm.?.B.`j|
33 5f f1 40 6b 9a 21 d3  00 0a 61 6d 6f 67 75 73  |3_.@k.!...amogus|
2e 74 78 74 00 00 00 00  00 00 00 00 00 00 00 00  |.txt............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 66 5e 95 f1  67 4e 94 66 cb 42 9b df  |....f^..gN.f.B..|
eb af 1b 87 92 ef 0e ec  00 09 74 72 75 74 68 2e  |..........truth.|
74 78 74 00 54 52 45 45  00 00 00 06 00 2d 31 20  |txt.TREE.....-1 |
30 0a 94 c6 2b 91 4b 84  9d 9a 2e 3c 20 e4 3e 93  |0...+.K....< .>.|
1b 69 3f 19 3d bd                                 |.i?.=.|

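Nothing too spectacular here, we notice that we see both amogus.txt and truth.txt. We also see TREE near the end of the file. That is the index's cached tree extension, where git remembers the tree it last wrote; the -1 means that cache is currently out of date, since we changed the index after running write-tree.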

However, what happens when we run write-tree?

$ git write-tree
aee76412ed220742aeaf02ca1c50519bcea013e1

Let's dump the index again.

$ hexdump -C .git/index
44 49 52 43 00 00 00 02  00 00 00 02 00 00 00 00  |DIRC............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 f5 86 17 71  6d 90 3f b8 42 b5 60 6a  |.......qm.?.B.`j|
33 5f f1 40 6b 9a 21 d3  00 0a 61 6d 6f 67 75 73  |3_.@k.!...amogus|
2e 74 78 74 00 00 00 00  00 00 00 00 00 00 00 00  |.txt............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 66 5e 95 f1  67 4e 94 66 cb 42 9b df  |....f^..gN.f.B..|
eb af 1b 87 92 ef 0e ec  00 09 74 72 75 74 68 2e  |..........truth.|
74 78 74 00 54 52 45 45  00 00 00 19 00 32 20 30  |txt.TREE.....2 0|
0a ae e7 64 12 ed 22 07  42 ae af 02 ca 1c 50 51  |...d..".B.....PQ|
9b ce a0 13 e1 07 2b 99  c5 b6 0e 3d 53 33 c9 21  |......+....=S3.!|
dd b1 75 41 41 84 b1 d8  ec                       |..uAA....|

Hmm, seems like the end grew a bit, after the TREE. Let's try to inspect the contents of the newly written tree.

$ git cat-file -p aee76412ed220742aeaf02ca1c50519bcea013e1
100644 blob f58617716d903fb842b5606a335ff1406b9a21d3    amogus.txt
100644 blob 665e95f1674e9466cb429bdfebaf1b8792ef0eec    truth.txt

Now we seem to have two files inside of the tree.
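Back to the bytes after TREE in the two index dumps for a moment. They can be decoded with a few lines of Python. The layout assumed here (a 4-byte size, then a null-terminated path, an entry count, a space, a subtree count, a newline and, when the count isn't -1, a 20-byte object id) is my reading of git's index-format documentation, and parse_cache_tree_root is a made-up name that only handles the root entry.

```python
def parse_cache_tree_root(extension: bytes):
    """Return (entry_count, subtree_count, object id or None) for the root entry."""
    if extension[:4] != b"TREE":
        raise ValueError("not a TREE extension")
    size = int.from_bytes(extension[4:8], "big")
    payload = extension[8:8 + size]
    _path, _, rest = payload.partition(b"\x00")  # the root entry has an empty path
    counts, _, rest = rest.partition(b"\n")
    entry_count, subtree_count = (int(n) for n in counts.decode().split(" "))
    # A count of -1 marks the cached tree as invalid; no object id follows it.
    object_id = rest[:20].hex() if entry_count >= 0 else None
    return entry_count, subtree_count, object_id


# Bytes copied from the two hexdumps: before and after the second write-tree.
before = bytes.fromhex("54 52 45 45 00 00 00 06 00 2d 31 20 30 0a")
after = bytes.fromhex(
    "54 52 45 45 00 00 00 19 00 32 20 30 0a"
    " ae e7 64 12 ed 22 07 42 ae af 02 ca 1c 50 51 9b ce a0 13 e1"
)

print(parse_cache_tree_root(before))
print(parse_cache_tree_root(after))
```

So the end of the file grew because write-tree filled in the cache: it now covers 2 entries and records the id of the tree we just wrote.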

Time to Commit

So now, it's time to create a commit. Above, when we ran write-tree, we created the tree object aee76412ed220742aeaf02ca1c50519bcea013e1. This includes both of the blobs we made.

So how do we commit that? It's actually really simple, we just use git commit-tree.

$ git commit-tree aee76412ed220742aeaf02ca1c50519bcea013e1 -m "initial commit"
87a1aa833dccca5ea503e9a7ff81c51fe82c85c6

If we now look at the .git repository, we will see a new object, 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6 created. What kind of object is this?

$ git cat-file -t 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6
commit

Unsurprisingly, it's a commit object. If we look what is inside, we see something that may look familiar to people that regularly use git.

$ git cat-file -p 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6
tree aee76412ed220742aeaf02ca1c50519bcea013e1
author Christina Sørensen <christina@cafkafk.com> 1715792150 +0200
committer Christina Sørensen <christina@cafkafk.com> 1715792150 +0200

initial commit
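A commit object is hashed the same way as blobs and trees, just with commit in the header. Here is a sketch that rebuilds the text above and hashes it; build_commit is a hypothetical helper, and whether the printed id equals the one from commit-tree depends on every byte matching (name, email, timestamps, trailing newline), so I make no promises about that.

```python
import hashlib


def build_commit(tree_id: str, identity: str, timestamp: str, message: str) -> bytes:
    """Assemble the body of a parentless commit, as shown by cat-file -p."""
    lines = [
        f"tree {tree_id}",
        f"author {identity} {timestamp}",
        f"committer {identity} {timestamp}",
        "",          # a blank line separates the headers from the message
        message,
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


body = build_commit(
    "aee76412ed220742aeaf02ca1c50519bcea013e1",
    "Christina Sørensen <christina@cafkafk.com>",
    "1715792150 +0200",
    "initial commit",
)
header = b"commit %d\x00" % len(body)
commit_id = hashlib.sha1(header + body).hexdigest()
print(commit_id)
```

Later commits look the same, except that they carry one or more parent lines between the tree line and the author line.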

A fun thing to do now is to try and run git log.

$ git log
fatal: your current branch appears to be broken

Seems like our branch is broken, huh? Let's fix that. Let's quickly remind ourselves of the contents of .git/HEAD.

$ cat .git/HEAD
ref: refs/

This doesn't refer to anything. We can easily fix this, let's make a new branch. But what git subcommand will we use this time? None actually, as it turns out, branches are just files in .git/refs/heads/ that contain the sha-1 hash of some commit. The name of the file becomes the name of the branch.

$ echo 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6 > .git/refs/heads/main

Now we just need to switch to that branch. Instead of doing git switch main we can just change the reference in the repository directory .git.

$ echo "ref: refs/heads/main" > .git/HEAD
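To make the "branches are just files" idea concrete, here is a sketch of how HEAD could be followed to a commit in Python. resolve_head is my own name, it works on a throwaway directory rather than a real repository, and it ignores things real git also handles, such as packed refs.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path):
    """Follow HEAD to a commit id, or return None if the branch has no commit yet."""
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return head  # detached HEAD: the file holds a commit id directly
    ref_file = git_dir / head[len("ref: "):]
    return ref_file.read_text().strip() if ref_file.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    unborn = resolve_head(git_dir)  # refs/heads/main does not exist yet
    (git_dir / "refs" / "heads" / "main").write_text(
        "87a1aa833dccca5ea503e9a7ff81c51fe82c85c6\n"
    )
    resolved = resolve_head(git_dir)

print(unborn)
print(resolved)
```

The first lookup finds nothing because the branch file is missing, which is roughly the state our repository was in before the two echo commands.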

If you're following along, and you have some PS1 git branch feature, you may have noticed something incredible just after running that command.

First, if we run git status now, we see that there no longer are any changes to be committed.

$ git status
On branch main
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    deleted:    amogus.txt
    deleted:    truth.txt

no changes added to commit (use "git add" and/or "git commit -a")

Also, we can now run git log!

$ git log
commit 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6 (HEAD -> main)
Author: Christina Sørensen <christina@cafkafk.com>
Date:   Wed May 15 18:55:50 2024 +0200

    initial commit

Something else that's cool is we can run git log --format=raw, and see output similar to what we got from git cat-file -p on the commit object's sha-1.

$ git log --format=raw
commit 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6
tree aee76412ed220742aeaf02ca1c50519bcea013e1
author Christina Sørensen <christina@cafkafk.com> 1715792150 +0200
committer Christina Sørensen <christina@cafkafk.com> 1715792150 +0200

    initial commit

$ git cat-file -p 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6
tree aee76412ed220742aeaf02ca1c50519bcea013e1
author Christina Sørensen <christina@cafkafk.com> 1715792150 +0200
committer Christina Sørensen <christina@cafkafk.com> 1715792150 +0200

initial commit

But what about those files that are deleted? We can solve that with git checkout like this.

$ git checkout HEAD -- amogus.txt truth.txt

Now if we run git status we get.

$ git status
On branch main
nothing to commit, working tree clean

And just like that, you've created a commit from scratch!!!

Conclusions

From nothing, we have now created a git repository, added files to it, created a branch, and even committed our files.

Is this even remotely efficient? No, this could have been just:

git init
echo <text> > <filename>.txt
git add .
git commit -m "<commit message>"

HOWEVER! Hopefully, you should now have a much deeper understanding of the plumbing beneath git's porcelain, and what's actually going on under the hood: how the git objects work, how branches work, how commits work. At least, getting to this level of git really made my understanding a lot deeper and less superficial, and has helped me internalize things and make other things fall into place, in much more robust ways than they did before.

There are still many open questions to be asked. How does a pull work? What about a push? How do you manually rebase a branch? Those are very interesting, but painful questions to answer, and thus are left as an exercise for the curious reader to explore :3

Hope you got something out of reading this!

Exercises

What does the hexdump -C of the .git/index look like if we add another blob with hash-object and update-index?
Can you discern any patterns from this? What more can you learn about the index file format?
What does the zlib compressed data inside of the tree object look like after we add another blob?
What does the .git/index look like after running update-index with a tree blob in the object database?
What does the .git/index look like after running update-index and write-tree with a tree blob in the object database?
Can you figure out what the mysterious f^gNfB means?
What does the hexdump -C of the .git/index look like inside a larger project?
Can you discern any patterns from this? What more can you learn about the index file format?
Does changing the size of the blob from 78 to some other number inside of the zlib compressed data influence git cat-file -s? You will need to decompress and recompress the data from and into the proper zlib format⁸.

Footnotes

¹ We don't have to put our git repo in .git. We could use the GIT_DIR environment variable, or the --git-dir=<path> flag.
² The zlib wrapper (RFC 1950), unlike the gzip wrapper (RFC 1952), doesn't store the file name and other file system information, which is fine, considering how git manages this elsewhere.
³ From this unix.stackexchange answer. Here, we take the gzip magic number and compression method, and concatenate (the actual reason for cat existing) them with the file. We then pipe the result into gzip, which can now understand and decompress it. Still, we didn't finish the file with the 8 byte footer, so gzip gets confused, but that doesn't matter, we get to see the data regardless.
⁴ https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
⁵ If you're interested in finding out how the hash is generated, start here.
⁶ Read more here: https://git-scm.com/book/sv/v2/Git-Internals-Git-Objects
⁷ The mortal enemy of C has many names: empty bytes, null bytes, nop bytes.
⁸ One approach could be to use the hack, with the additional padding at the end, to extract the file. Then, after changing the number, compress it again and remove the prepended and appended gzip bytes.

Grok Git Repos

A way to deep dive into git internals