There are 4 kinds of entities in Git storage: blob, tree, commit and branch/tag. In this post, we try to understand them in a relational model.
In Git, each object has an OID (Object ID). The ID is the SHA-1 hash of the object's content, prefixed with a short header that records the object's type and size. Since the OID is used frequently in the following sections, we define it as a type.
CREATE TYPE oid AS binary(20); /* a SHA-1 hash is 160 bits, i.e. 20 bytes */
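The OID computation can be sketched in a few lines of Python. This is a minimal illustration, not Git's actual implementation, and the function name `blob_oid` is ours. It assumes the default SHA-1 object format and shows the header that Git hashes together with a blob's content.

```python
import hashlib


def blob_oid(content: bytes) -> str:
    """Return the hex OID Git assigns to a blob with the given content."""
    # Git hashes a small header ("blob <size in bytes>\0") followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    empty = blob_oid(b"")
    print(empty)                      # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    print(len(bytes.fromhex(empty)))  # 20 (bytes in a SHA-1 digest)
```

Two blobs with the same bytes always get the same OID, which is what makes the OID usable as a key.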
Blob is a File
The first object is blob. Blob represents a file in the repository.
CREATE TABLE blob (
The underscore prefix in _oid indicates that it is not physically stored in the object. On the other hand, content is stored in the object and is the only payload of the blob object.
Note that no filename is stored in the blob object, nor any other metadata of the file. In fact, this metadata is stored in the tree object, which we discuss later.
An important difference between this SQL schema and the actual Git storage is that, in Git, the content can only be looked up by _oid, i.e. the SHA-1 hash derived from the content. That means we cannot search for a blob by its content directly.
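To make the "indexed only by _oid" point concrete, here is a small sketch that models the blob table in an in-memory SQLite database. The column names _oid and content come from the text above; the column types and the helper names (`open_store`, `put_blob`, `get_blob`) are assumptions made for illustration, and real Git stores objects as compressed files rather than table rows.

```python
import hashlib
import sqlite3


def blob_oid(content: bytes) -> bytes:
    """Raw 20-byte SHA-1 of the blob header plus content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).digest()


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a blob table keyed by OID."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE blob (_oid BLOB PRIMARY KEY, content BLOB NOT NULL)")
    return db


def put_blob(db: sqlite3.Connection, content: bytes) -> bytes:
    """Store content under its OID; identical content is stored only once."""
    oid = blob_oid(content)
    db.execute("INSERT OR IGNORE INTO blob (_oid, content) VALUES (?, ?)", (oid, content))
    return oid


def get_blob(db: sqlite3.Connection, oid: bytes):
    """Look a blob up by OID, the only key the store offers."""
    row = db.execute("SELECT content FROM blob WHERE _oid = ?", (oid,)).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    db = open_store()
    first = put_blob(db, b"hello\n")
    second = put_blob(db, b"hello\n")  # same content, same OID, no new row
    print(first == second)
    print(get_blob(db, first))
```

Because the OID is derived from the content, writing the same file twice is a no-op, and the only way to read a blob back is to already know its OID.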
Tree is a Directory
The second object is tree. Tree represents a directory in the repository. We can browse the content of a tree object with the git cat-file command.
> git cat-file -p e0504e788345f65315e6a53f992b40f503937618
040000 tree ea7cab952a09f0c8d6c3d74e3f72c011aec794e0 .github
100644 blob 6240da8b10bfc3ab9dc4564c4169453cf143db7f .gitignore
100644 blob 92280800c38a7edb2e5dd3a89602aa4857adcbe2 .prettierrc
040000 tree 3b444578da46570923e19673c309503f6c42752c .vscode
100644 blob f6d4ce044edd974f0d5d752ce8591f0282f86ed0 LICENSE
100644 blob c7cf6d75fc413de1386b1c005da03dd4df1c95be astro.config.mjs
100644 blob 5d3d5a9d14deff76470cf0105a8e233986235cd9 package-lock.json
100644 blob 8c44bc97117ec7b43cd15edb6c562f25db871304 package.json
040000 tree 5cf636ff3d0cf422423e4a09f7bdd7a5636f9851 public
040000 tree 3a470abd49534a41d5231a16bb4f2040b26593f9 src
100644 blob 8358535e3131c2f8fe2f34ce16aaacd333037628 tailwind.config.cjs
100644 blob 6befff5128e6b06c8984595d7b786bee41d02367 tsconfig.json
This content can be described by the following table.
CREATE TABLE tree (
Each tree object has one or more children, so we use a relation table to store the tree. Since a tree object has nothing except its child items, we simply omit the tree entity table.
Each row in the tree table represents a child of a tree object, which can be a blob object or another tree object. Thus the child_oid field is a reference to either blob(_oid) or tree(_oid). However, this type of reference is not supported by SQL, so we just omit the foreign key constraint. Instead, we use the type field to indicate the type of the child object.
path is the filename or subdirectory name. Note that a tree only stores its direct children, so the path is relative to the tree object and does not contain /. Another piece of child metadata, the file mode, is also stored in the tree object.
Git does not track empty directories. Thus a subdirectory never appears as an empty tree object.
If two files have the same content, they will have the same OID. Different tree objects might also share the same child OID. Thus child_oid is not unique.
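Since each tree row only knows its direct children, recovering full file paths means walking the table recursively. The sketch below does this with a recursive query in SQLite. The column names follow the text; the column name mode, the TEXT types, and the short placeholder OIDs such as t_root are assumptions chosen for readability and are not real hashes.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tree (
    _oid      TEXT NOT NULL,  -- OID of the parent tree (the directory)
    mode      TEXT NOT NULL,  -- e.g. '100644' for a file, '040000' for a subdirectory
    type      TEXT NOT NULL CHECK (type IN ('blob', 'tree')),
    child_oid TEXT NOT NULL,  -- OID of the child blob or tree; not unique
    path      TEXT NOT NULL,  -- name relative to the parent, never contains '/'
    PRIMARY KEY (_oid, path)
);
"""

LIST_FILES = """
WITH RECURSIVE walk(full_path, type, child_oid) AS (
    SELECT path, type, child_oid FROM tree WHERE _oid = :root
    UNION ALL
    SELECT walk.full_path || '/' || tree.path, tree.type, tree.child_oid
    FROM walk JOIN tree ON tree._oid = walk.child_oid
    WHERE walk.type = 'tree'
)
SELECT full_path, child_oid FROM walk WHERE type = 'blob' ORDER BY full_path;
"""


def build_example() -> sqlite3.Connection:
    """Load a tiny two-level directory into an in-memory tree table."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    rows = [
        ("t_root", "100644", "blob", "b_readme", "README.md"),
        ("t_root", "040000", "tree", "t_src", "src"),
        ("t_src", "100644", "blob", "b_main", "main.ts"),
        # Same content as the top-level README, so the same child OID appears twice.
        ("t_src", "100644", "blob", "b_readme", "README.md"),
    ]
    db.executemany("INSERT INTO tree VALUES (?, ?, ?, ?, ?)", rows)
    return db


def list_files(db: sqlite3.Connection, root: str):
    """Return (full path, blob OID) pairs for every file under the root tree."""
    return db.execute(LIST_FILES, {"root": root}).fetchall()


if __name__ == "__main__":
    for full_path, oid in list_files(build_example(), "t_root"):
        print(full_path, oid)
```

The full path only exists in the query result. In the table itself, src/main.ts is split across two rows: src in the root tree and main.ts in the src tree.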
Commit is a Pointer to a Tree
We can now talk about the most important object in Git: commit. A commit is a snapshot of the repository, i.e. the repository at a certain time. Thus a commit contains an OID of a tree object, which is the root directory of the repository at that time.
CREATE TABLE commit (
  root_tree_oid oid REFERENCES tree(_oid),
  ... /* other metadata */
);
A commit object contains the OID of the root tree object along with other human-readable information such as the commit message, author and committer. We can use the git cat-file command to browse the content of a commit object.
> git cat-file commit 246f96e522e28bedc4440a7975c2740299f9db1e
author qsliu <email@example.com> 1700130352 +0800
committer qsliu <firstname.lastname@example.org> 1700130352 +0800
post: ci system the users perspective
But wait, what’s the parent in the content of the commit object?
(Cont.) Commit is also a Pointer to Parent Commit(s)
Recall that Git is a version control system. Snapshots of the repository at different times are not enough. To represent the evolution and branches of the repository, we need the relationship between snapshots. Thus a commit object also contains the OID of its parent commit(s). This relationship can be represented by the following table.
CREATE TABLE commit_parent (
  _commit_oid oid REFERENCES commit(_oid),
  parent_commit_oid oid REFERENCES commit(_oid)
);
A commit might have zero, one or more parents.
- If a commit has zero parents, it is a root commit, i.e. the first commit in its history.
- If a commit has one parent, it is a normal commit.
- If a commit has more than one parent, it is a merge commit.
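The three cases above can be read directly off the commit_parent table. The following sketch classifies commits by parent count and lists all ancestors of a commit with a recursive query. The table and column names follow the text; the TEXT types and placeholder OIDs (c1 to c4) are assumptions, and the table name is quoted because COMMIT is a keyword in SQLite.

```python
import sqlite3

SCHEMA = """
CREATE TABLE "commit" (_oid TEXT PRIMARY KEY);
CREATE TABLE commit_parent (
    _commit_oid       TEXT NOT NULL REFERENCES "commit"(_oid),
    parent_commit_oid TEXT NOT NULL REFERENCES "commit"(_oid)
);
"""

CLASSIFY = """
SELECT c._oid,
       CASE COUNT(p.parent_commit_oid)
           WHEN 0 THEN 'root'
           WHEN 1 THEN 'normal'
           ELSE 'merge'
       END AS kind
FROM "commit" AS c
LEFT JOIN commit_parent AS p ON p._commit_oid = c._oid
GROUP BY c._oid
ORDER BY c._oid;
"""

ANCESTORS = """
WITH RECURSIVE ancestors(oid) AS (
    SELECT parent_commit_oid FROM commit_parent WHERE _commit_oid = :start
    UNION  -- UNION (not UNION ALL) visits a commit reachable by two paths only once
    SELECT p.parent_commit_oid
    FROM ancestors JOIN commit_parent AS p ON p._commit_oid = ancestors.oid
)
SELECT oid FROM ancestors ORDER BY oid;
"""


def build_example() -> sqlite3.Connection:
    """History: c1 is the root, c2 and c3 branch from c1, c4 merges c2 and c3."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany('INSERT INTO "commit" VALUES (?)', [("c1",), ("c2",), ("c3",), ("c4",)])
    db.executemany(
        "INSERT INTO commit_parent VALUES (?, ?)",
        [("c2", "c1"), ("c3", "c1"), ("c4", "c2"), ("c4", "c3")],
    )
    return db


def classify(db: sqlite3.Connection):
    """Return (commit OID, kind) pairs, where kind is root, normal or merge."""
    return db.execute(CLASSIFY).fetchall()


def ancestors(db: sqlite3.Connection, start: str):
    """Return the OIDs of every commit reachable through parent links."""
    return [row[0] for row in db.execute(ANCESTORS, {"start": start})]


if __name__ == "__main__":
    db = build_example()
    print(classify(db))
    print(ancestors(db, "c4"))
```

Keeping parents in a separate relation table, rather than a single parent column, is what lets one commit have any number of parents.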
Branch/Tag is an Alias to a Commit
Finally, we discuss branches and tags. However, they are NOT objects and do not have an OID (annotated tags, which Git stores as separate tag objects, are the exception). Instead, they are just aliases to a commit object.
We can browse the structure of the .git/refs directory to see the branches and tags.
> tree refs
│ └── main
│ └── origin
│ ├── HEAD
│ └── main
5 directories, 3 files
> cat refs/heads/main
> cat refs/remotes/origin/HEAD
> cat refs/remotes/origin/main
As we can see, branches and tags are stored in the refs directory, and the path to each file is also the reference name. For example, the branch main can be referenced by refs/heads/main. Each branch or tag file simply contains the OID of a commit object, or a reference to another branch/tag.
The difference between a branch and a tag is that a branch is mutable, while a tag is meant to stay fixed. A branch moves to the new commit whenever we commit a new snapshot on it, while a tag keeps pointing to the same commit.
We can also tell that remote branches are no different from local branches. They are just stored in different directories and updated by different commands.
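Resolving a branch or tag to a commit is therefore little more than reading a file. Here is a minimal sketch, assuming loose ref files only; real repositories may also keep refs in a packed-refs file, which this sketch ignores. The function name `resolve_ref` is ours, and the demo builds a throwaway directory instead of reading a real repository.

```python
import os
import tempfile


def resolve_ref(git_dir: str, name: str, max_depth: int = 5) -> str:
    """Follow a ref file until it yields an OID instead of another ref."""
    for _ in range(max_depth):
        with open(os.path.join(git_dir, name), encoding="ascii") as f:
            value = f.read().strip()
        if value.startswith("ref: "):
            # Symbolic ref: the file points to another ref, so keep following.
            name = value[len("ref: "):]
            continue
        return value
    raise ValueError("too many levels of symbolic refs")


def write_ref(git_dir: str, name: str, value: str) -> None:
    """Helper for the demo: create a loose ref file with the given content."""
    path = os.path.join(git_dir, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="ascii") as f:
        f.write(value + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as git_dir:
        # Sample OID reused from the commit shown earlier in this post.
        oid = "246f96e522e28bedc4440a7975c2740299f9db1e"
        write_ref(git_dir, "refs/heads/main", oid)
        write_ref(git_dir, "refs/remotes/origin/main", oid)
        write_ref(git_dir, "refs/remotes/origin/HEAD", "ref: refs/remotes/origin/main")
        print(resolve_ref(git_dir, "refs/heads/main"))
        print(resolve_ref(git_dir, "refs/remotes/origin/HEAD"))
```

Local and remote branches go through exactly the same code path here. Only the directory differs.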
In this post, we discussed the objects in Git storage and tried to understand them in a relational model. A blob object is the content of a file, while a tree object is a directory. A commit object is a snapshot of the repository, pointing to both the root tree object and its parent commit(s). A branch/tag is a pointer to a commit object but is not an object itself.
The Git Internals - Git Objects chapter in the Pro Git book has a detailed explanation of Git objects.
“But wait, I thought a commit is a diff?”
Most Git clients display a commit as the diff between the commit and its parent commit(s), and usually we only care about the diff. The article Commits are snapshots, not diffs explains why a commit is a snapshot and not a diff. The author also discusses how merge works when a commit is actually a snapshot.
The idea of seeing Git as a database comes from Git’s database internals, where the author sees Git as a key-value database and discusses more about how to query the Git objects.