Introduction
Git is hard.
Well, not inherently, but it can often be hard to understand the underlying simplicity, when confronted with the endless stream of metaphors used to explain it and the various trickle-down commands from other developers.
Often, I find myself using git in a ritualistic manner, repeating that which has worked in the past mindlessly, without much of an understanding of what I'm really doing.
I want to actually, truly, understand git.
And it seems to me that if I am to truly understand git — grok it — I have to first grok how a git repository works. And what better way to do that than to put myself… ourselves… in the place of git, and understand what it takes to create a repository.
Creating a Git Repository from Scratch
Let's create a folder for us to work in.
Now, we have a folder. A folder isn't a git repository, as we can verify.
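As a minimal sketch, the two steps so far might look like the following. The `mktemp -d` scratch directory is an addition here (it just makes sure we are not already inside some other repository); the folder name `grok` is the one used in this post.

```shell
# Work somewhere that is not already inside a git repository (assumption)
cd "$(mktemp -d)"
mkdir grok
cd grok
# Expected to fail, since there is no repository here yet. git should report:
# fatal: not a git repository (or any of the parent directories): .git
git status || echo "git status failed, as expected"
```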
As the `status` subcommand tells us, neither the directory we just created nor any of its parent directories is a git repository.
If that wasn't the case for you, it is because you created the `grok` folder as a subdirectory of a git repository.
The error mentions a `.git` folder. It turns out that when we speak of a git repository, we are in most cases actually referring to this folder[1]. So, in theory, it should be simple enough to create a git repository, right? Just create a `.git` folder and we are good to go?
Ohh, apparently not so.
If we were to look up the definition of a repository in the git glossary, we would see the following.
A collection of refs together with an object database containing all objects which are reachable from the refs, possibly accompanied by meta data from one or more porcelains. A repository can share an object database with other repositories via alternates mechanism.
Ahh, simple. Seems like we need to have some refs and an object database with all the objects refs can reach. Also, there might be meta data from porcelains?
Ehh, okay. Don't get overwhelmed!
That text is simply written in such an intimidating way to scare away the tech bros that thought this was a Luke Smith article. Now that they are gone, I'll let you know that this is actually quite simple. But I'm not just gonna explain it in 8 terse paragraphs; instead, I'll simply show you that it's simple in a much more verbose way!
First, we need an object database. Well, database sounds a bit hard, we could just make a folder for it instead in our git repository.
Now, we also need a collection of refs. No issue, just make a folder for them.
What are heads? Don't worry, we'll get to it. Now we have the bare minimum for a repository though, right?
Well, let's try to find out!
Ohh… we don't. We actually need one more thing before we get ahead of ourselves.
Now, is this finally a repository? Let's see what git thinks the status is.
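Putting the steps above together, creating a repository by hand might look like this sketch. The exact contents of `.git/HEAD` are an assumption here: a symbolic ref to a branch that doesn't exist yet, with `master` as a placeholder name.

```shell
cd "$(mktemp -d)"            # scratch directory standing in for grok
mkdir .git
# A bare .git folder is not enough for git
git status || echo "not a repository yet"
mkdir .git/objects           # the object "database"
mkdir -p .git/refs/heads     # the collection of refs
# Assumption: HEAD is a symbolic ref to a branch that need not exist yet
echo "ref: refs/heads/master" > .git/HEAD
git status                   # now git treats the directory as a repository
```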
Seems like it is! …So now we have a git repository, that was easy.
Committing to a Repository
Creating Objects
Having a handmade git repository is a neat party trick, but it's not worth much if we can't fill it up with all our files.
We likely will want to fill this repository with text files. For the occasion, you can use fortune to generate something, but if you use what I got, you'll be able to compare your hashes to the ones in the article.
"If a listener nods his head when you're explaining your program, wake him up."
That is very wise indeed. We want to add this to the repository. But how?
Well, one way to add it is the following, using the `git hash-object` subcommand.
$ echo "If a listener nods his head when you're explaining your program, wake him up." | git hash-object --stdin -w
665e95f1674e9466cb429bdfebaf1b8792ef0eec
Okay, two questions now. What is `665e95f1674e9466cb429bdfebaf1b8792ef0eec`, and what did `hash-object` just do?
Let's try to inspect our git repository, it might give us some clues. One way to do this is to show the directory tree structure.
Here, we see that our object “database” in the `objects` folder has seen some change. There is a folder, `66`, with a file that has an equally vexing name, `5e95f1674e9466cb429bdfebaf1b8792ef0eec`.
I wonder what is in this file. Let's try to take a look at it.
Ohh, not plaintext. This seems like some sort of binary format, so let's take a look and see if we can figure out what the file actually contains. We do this using the `file` command.
Apparently, it's some data compressed with zlib. That means it's likely DEFLATE compressed (RFC 1951) in a zlib wrapper (RFC 1950)[2].
There is actually a neat hack we can use to view this data[3].
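A sketch of the hack, assuming GNU `gzip` and a `printf` that understands octal escapes. The scratch setup at the top is only there so the snippet runs on its own; in the `grok` folder you would point it straight at the object file.

```shell
# Scratch repo so the snippet runs on its own (HEAD contents are an assumption)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
# Loose objects live at .git/objects/<first 2 hex chars>/<remaining 38>
obj=".git/objects/$(echo "$blob" | cut -c1-2)/$(echo "$blob" | cut -c3-)"
# Prepend a gzip header (magic number 1f 8b, method 08, then zero bytes) and
# let gzip decompress. It complains about the missing footer, hence 2>/dev/null.
# tr turns the null byte after the header into a newline for readability.
out=$(printf '\037\213\010\000\000\000\000\000' | cat - "$obj" | gzip -dc 2>/dev/null | tr '\0' '\n') || true
printf '%s\n' "$out"
```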
We found our string, it seems, but what are this `blob` and `78` at the start? `blob` specifies the git object's type, and `78` is the blob's size in bytes[4].
Objects in git have 3 primary types:
- blob
- tree
- commit
When we ran the `hash-object` subcommand, we created a blob.
Regarding the name, that is just the SHA-1 hash (RFC 3174) of the object's content together with a small header holding the type and size we just saw[5]. The actual output of the command was this hash. It is 40 characters long.
The name of the directory in the object database is the first two characters of the hash, that is `66`, and the actual object file's name is the other 38 characters.
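To make that a bit more concrete, here is a small sketch that rebuilds the hash by hand, assuming `sha1sum` is available. The header is the type, a space, the size and a null byte, followed by the content; both commands should print the same hash.

```shell
msg="If a listener nods his head when you're explaining your program, wake him up."
# Header "blob 78" + null byte + content (echo appended a newline, hence \n)
manual=$(printf 'blob 78\000%s\n' "$msg" | sha1sum | cut -d' ' -f1)
# For comparison, let git compute the hash (without -w nothing is written)
fromgit=$(echo "$msg" | git hash-object --stdin)
echo "$manual"
echo "$fromgit"
```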
We probably don't wanna decompress all the objects manually just to inspect them. We can use the hash we got, `665e95f1674e9466cb429bdfebaf1b8792ef0eec`, as a way to inspect the newly created object.
Inspecting Objects
We do this with the `git cat-file` subcommand. To see the type of an object, we use the type flag `-t`.
That means that the type (`-t`) of the object we made is a blob, as we saw by inspecting it. What if we wanna see the contents of the blob? We use `-p` for print.
Neat, that's a lot easier than remembering the gzip hack from earlier! Also, another useful flag to know is `-s`, for size.
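All three flags in one go, as a sketch (the scratch setup only makes the snippet self-contained):

```shell
# Scratch repo so the snippet runs on its own (HEAD contents are an assumption)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
git cat-file -t "$blob"   # type of the object: blob
git cat-file -p "$blob"   # print the contents
git cat-file -s "$blob"   # size in bytes: 78
```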
Ohh, if it isn't that `78` from earlier. As mentioned, this is the size of the object in bytes.
Index
Okay, enough about the blob. It seems like we have added an object, with some text, into the git repository now. We might wonder if this is reflected in the `git status` of the repo.
Ohh, well… so what if we just added a bunch of files? If we aren't even aware of these objects, how do we even know if a repo is full of useless stuff? Well, that's actually a little too complicated for now, but an answer that might satisfy you is that we don't really, which is why `git gc` exists (`gc` for garbage collection).
Okay, that aside, how do we actually commit this object? Well, first, we have to do something else.
The long string is the hash of the blob object we created, and `truth.txt` is a fitting filename.
What about `100633`? Looking at `man git update-index` we see that…
--cacheinfo <mode>,<object>,<path>,
--cacheinfo <mode> <object> <path>
Directly insert the specified info into the index.
For backward compatibility, you can also give these
three arguments as three separate parameters, but
new users are encouraged to use a single-parameter
form.
As you can see, we used the separate parameter form, although the man page encourages new users to use the single-parameter form. But also, that `100633` is the mode of the file. Mode here refers to the file system permissions, and is taken from the way UNIX modes work… sorta, with limitations. You can think of this as octal permissions.
But crucially, for blobs in git, we only have 3 modes available[6]:
- 100644 a normal file
- 100755 an executable file
- 120000 a symbolic link
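A sketch of such an `update-index` call, here with mode `100644` from the list above and the single-parameter form the man page recommends; passing the three values as separate arguments should be equivalent. `git ls-files --stage` is an extra here, used only to peek at what ended up in the index.

```shell
# Scratch repo so the snippet runs on its own (HEAD contents are an assumption)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
# --add is needed because truth.txt is not in the index yet
git update-index --add --cacheinfo 100644,"$blob",truth.txt
# List the index entries: mode, object hash, stage number, path
git ls-files --stage
```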
Let's take a look at the repo structure again.
Here we notice that there now is an `index` file. So all we did was add some `index` file? …well, slow down. Let's do a sanity check with `git status`.
Aha!
So we did something, we staged (added) the truth.txt file… but we also have unstaged changes saying we deleted it?
Well, yes. We have told git that the blob object we created is in the working-tree, and it has the actual blob for this in its object database… but apparently can't find the file we told it about. This makes sense, since we just created the blob from stdin and didn't actually add any files.
Since it can't find it, it thinks we must have deleted it. How presumptuous!
Hmm, while we are at it, let's take a look at `.git/index`, just to see what that is about.
Welp, that's not a textfile either. But what is it?
Ohh, it's a `Git index, version 2` with `1 entries`. That must be the entry we created! Now that we… wait, you also wanna see what's inside the binary? Fine, if you insist.
(I removed the offset from the output)
Hmm, it seems like the first 4 bytes, `DIRC`, are a magic number. Then, after three 00[7] bytes, a `02`. This must be the version number. Then, a whole bunch of 00 padding, some garbled bytes and the `truth.txt` file name.
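The first 12 bytes can also be picked apart without a full hexdump. A sketch, assuming `head -c`, `tail -c` and `od` are available: the header is a 4-byte signature, a 4-byte version and a 4-byte entry count.

```shell
# Scratch repo so the snippet runs on its own (HEAD contents are an assumption)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644,"$blob",truth.txt
# The header is 12 bytes: signature, version, number of entries
head -c 4 .git/index; echo                        # the signature, DIRC
head -c 8 .git/index | tail -c 4 | od -An -tx1    # version as 4 big-endian bytes
head -c 12 .git/index | tail -c 4 | od -An -tx1   # entry count as 4 big-endian bytes
```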
I bet if we added another file, we'd be able to discern a pattern in this. And I wonder how the `.git/index` looks in e.g. a large project. Hmm, I guess you'll have to attempt the exercises after reading this blog post if you wanna find out :^)
Tree Objects
Back on track, we've now made the index thingy, and we'll have to write it with the `write-tree` subcommand.
`git status` does not seem to have changed, but if we check what's in the repo's file structure, we'll see something did change.
We notice the `a6325f064bac723691f20c0b1ed2bea82a1728fd` SHA-1 hash refers to a new object. Interesting, wonder what this is.
Of course, because we are trying to get to the bottom of this object, we're gonna do it the hard way.
Okay, that's zlib compressed data. We know that, so let's use that same hack to figure out what is inside of this blob.
Ohh. That isn't a blob. That's a tree, one of the 3 object types. That number, `37100644`, might be hard to read, but it's actually `37`, the size in bytes, and `100644`, the file mode. Then, we seem to have the filename, and some weird `f^gNfB` thing.
That `f^gNfB` thing we actually have seen before, in the `hexdump -C` of the `.git/index`. What it is is left as an exercise to the reader.
Let's look at the contents the canonical way.
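Writing the tree and then inspecting it the canonical way might look like this sketch. The scratch setup is only there to make the snippet self-contained.

```shell
# Scratch repo so the snippet runs on its own (HEAD contents are an assumption)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644,"$blob",truth.txt
tree=$(git write-tree)     # writes the current index out as a tree object
git cat-file -t "$tree"    # type of the object: tree
git cat-file -s "$tree"    # size in bytes: 37
git cat-file -p "$tree"    # one line per entry: mode, type, hash, name
```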
An Experiment with Tree Objects
Something that is too important to leave to the exercises is what happens when we have multiple files in a tree object. Let's create another blob.
And add it to our index file.
Now, let's look at the repository. We have 3 objects in our objects database.
Let's dump the index binary.
Nothing too spectacular here, we notice that we see both amogus.txt and truth.txt. We also see TREE at the end of the file, that must be our tree object.
However, what happens when we run `write-tree`?
Let's dump the index again.
Hmm, seems like the end grew a bit, after the TREE. Let's try to inspect the contents of the newly written tree.
Now we seem to have two files inside of the tree.
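The whole experiment, condensed into a sketch. The content of the second blob is hypothetical (the original text of `amogus.txt` isn't shown here), so the resulting hashes will differ from the ones in this post.

```shell
# Scratch repo so the snippet runs on its own (HEAD contents are an assumption)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644,"$blob",truth.txt
# Hypothetical content for the second file
blob2=$(echo "A second file for the experiment." | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644,"$blob2",amogus.txt
tree=$(git write-tree)
git ls-tree "$tree"        # lists both amogus.txt and truth.txt
```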
Time to Commit
So now, it's time to create a commit. Above, when we ran `write-tree`, we created the `tree` object `aee76412ed220742aeaf02ca1c50519bcea013e1`. This includes both of the blobs we made.
So how do we commit that? It's actually really simple, we just use `git commit-tree`.
If we now look at the `.git` repository, we will see that a new object, `87a1aa833dccca5ea503e9a7ff81c51fe82c85c6`, has been created. What kind of object is this?
Unsurprisingly, it's a commit object. If we look at what is inside, we see something that may look familiar to people that regularly use git.
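Creating and inspecting a commit this way might look like the following sketch. It uses a single file for brevity, and the identity and commit message are placeholders, since a handmade repository has no config to take them from.

```shell
# Scratch repo so the snippet runs on its own (HEAD contents are an assumption)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
# Placeholder identity (assumption), needed to write a commit
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644,"$blob",truth.txt
tree=$(git write-tree)
# commit-tree wraps a tree in a commit object and prints the commit's hash
commit=$(git commit-tree "$tree" -m "Initial commit")
git cat-file -t "$commit"   # type of the object: commit
git cat-file -p "$commit"   # tree, author, committer, blank line, message
```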
A fun thing to do now is to try and run `git log`.
Seems like our branch is broken, huh? Let's fix that. Let's quickly remind ourselves of the contents of `.git/HEAD`.
This doesn't refer to anything. We can easily fix this, let's make a new branch. But what git subcommand will we use this time? None, actually. As it turns out, branches are just files in `.git/refs/heads/` that contain the SHA-1 hash of some commit. The name of the file becomes the name of the branch.
Now we just need to switch to that branch. Instead of doing `git switch main` we can just change the reference in the repository directory `.git`.
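Both steps, as a sketch. The scratch setup, the placeholder identity and the initial `master` in `HEAD` are assumptions to make the snippet runnable on its own; the last two commands are the point.

```shell
# Scratch repo with one commit (setup and identity are assumptions)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644,"$blob",truth.txt
commit=$(git commit-tree "$(git write-tree)" -m "Initial commit")

# A branch is a file under .git/refs/heads/ holding a commit hash
echo "$commit" > .git/refs/heads/main
# Point HEAD at the new branch instead of running git switch main
echo "ref: refs/heads/main" > .git/HEAD
```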
If you're following along, and you have some PS1 git branch feature, you may have noticed something incredible just after running that command.
First, if we run `git status` now, we see that there no longer are any changes to be committed.
Also, we can now run `git log`!
Something else that's cool is we can run `git log --format=raw`, and see output similar to what we got from `git cat-file -p` on the commit object's SHA-1.
But what about those files that are deleted? We can solve that with `git checkout` like this.
Now if we run `git status` we get.
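Restoring the files and checking the status again might look like this sketch. The exact `git checkout` invocation is an assumption (here: restore every path from the index), as are the scratch setup and the placeholder identity.

```shell
# Scratch repo with one committed blob (setup and identity are assumptions)
cd "$(mktemp -d)" && mkdir -p .git/objects .git/refs/heads
echo "ref: refs/heads/main" > .git/HEAD
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
msg="If a listener nods his head when you're explaining your program, wake him up."
blob=$(echo "$msg" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644,"$blob",truth.txt
git commit-tree "$(git write-tree)" -m "Initial commit" > .git/refs/heads/main

git status --short     # reports truth.txt as deleted from the working tree
git checkout -- .      # assumption: restore every path from the index
git status --short     # prints nothing once the working tree is clean
cat truth.txt          # the file now exists on disk
```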
And just like that, you've created a commit from scratch!!!
Conclusions
From nothing, we have now tried to create a git repository, added files to it, created a branch, and even committed our files.
Is this even remotely efficient? No, this could have been just:
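Presumably something along these lines, using the everyday porcelain commands. This is a sketch: the file name is the one from earlier, while the identity and commit message are placeholders.

```shell
cd "$(mktemp -d)"          # scratch directory (assumption)
git init -q
msg="If a listener nods his head when you're explaining your program, wake him up."
echo "$msg" > truth.txt
git add truth.txt
# Placeholder identity passed inline so the snippet needs no global config
git -c user.name="Example" -c user.email="example@example.com" commit -q -m "Initial commit"
git log --oneline
```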
HOWEVER! Hopefully, you should now have a much deeper understanding of the porcelain of git, and what's actually going on under the hood, how the git objects work, how branches work, how commits work. At least, getting to this level of git really made my understanding a lot deeper and less superficial, and has helped me internalize things and make other things fall into place, in much more robust ways than they did before.
There are still many open questions to be asked. How does a pull work? What about a push? How do you manually rebase a branch? Those are very interesting but painful questions to answer, and are thus left as an exercise for the curious reader to explore :3
Hope you got something out of reading this!
Exercises
- What does the `hexdump -C` of the `.git/index` look like if we add another blob with `hash-object` and `update-index`? Can you discern any patterns from this? What more can you learn about the index file format?
- What does the zlib compressed data look like inside of the tree object if we add another blob?
- What does the `.git/index` look like after running `update-index` with a tree blob in the object database?
- What does the `.git/index` look like after running `update-index` and `write-tree` with a tree blob in the object database?
- Can you figure out what the mysterious `f^gNfB` means?
- What does the `hexdump -C` of the `.git/index` look like inside a larger project? Can you discern any patterns from this? What more can you learn about the index file format?
- Does changing the size of the blob from 78 to some other number inside of the zlib compressed data influence `git cat-file -s`? You will need to decompress and recompress the data from and into the proper zlib format[8].
Footnotes
[1] We don't have to put our git repo in `.git`. We could use the `GIT_DIR` environment variable, or the `--git-dir=<path>` flag.
[2] The zlib wrapper (RFC 1950), unlike the gzip wrapper (RFC 1952), doesn't store the file name and other file system information, which is fine, considering how git manages this elsewhere.
[3] From this unix.stackexchange answer. Here, we concatenate the gzip magic number and compression method, and concatenate (the actual reason for cat existing) this with the file. We then pipe it into gzip, which can now understand and decompress it. Still, we didn't finish the file with the 8 byte footer, so gzip gets confused, but that doesn't matter, we get to see the data regardless.
[5] If you're interested in finding out how the hash is generated, start here.
[6] Read more here: https://git-scm.com/book/sv/v2/Git-Internals-Git-Objects
[7] The mortal enemy of C has many names: empty bytes, null bytes, nop bytes.
[8] One approach could be to use the hack, with the additional padding at the end, to extract the file. Then, after changing the number, compress it again and remove the prepended and appended gzip bytes.