Git-annex for data management

Background

You probably know what git is: it tracks versions of files and keeps the full history of every file. When a file is recorded in git-annex, the raw data is kept in a separate storage area, and only links to it and the metadata are distributed using regular git. So all clones know about all files, but don’t necessarily have all data. Using git annex get, one can fetch the raw data from another repo and make it available locally.

For example, this is a ls -l of a real git repository which has a small-file.txt and a large-file.dat. You see that the small file is just there, but the large file is a symlink to .git/annex/objects/XX/YY/...:

$ ls -l
lrwxrwxrwx 1 darstr1 darstr1 200 Feb  4 11:08 large-file.dat -> .git/annex/objects/X4/xZ/SHA256E-s10485760--4c95ccee15c93531c1aa0527ad73bf1ed558f511306d848f34cb13017513ed34.dat/SHA256E-s10485760--4c95ccee15c93531c1aa0527ad73bf1ed558f511306d848f34cb13017513ed34.dat
-rw-rw-r-- 1 darstr1 darstr1  21 Feb  4 11:06 small-file.txt

If the repository has the file, the symlink target exists. If the repository doesn’t have the file, it’s a dangling symlink. git add works like normal, while git annex add makes the symlink.
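Because missing content shows up as a dangling symlink, plain shell tests are enough to tell whether a file's data is available locally. The following is a minimal sketch, assuming the symlink-based layout shown in the listing above. The helper name annex_content_present is made up for illustration, and the demonstration uses hand-made symlinks in a scratch directory rather than a real annex.

```shell
# Hypothetical helper: report whether an annexed file's content is present,
# assuming the symlink-based layout described above.
annex_content_present() {
    # -L: the path is a symlink; -e: follows the link and checks the target exists
    [ -L "$1" ] && [ -e "$1" ]
}

# Demonstration with hand-made symlinks (no git-annex needed).
demo_dir=$(mktemp -d)
printf 'payload\n' > "$demo_dir/object"
ln -s "$demo_dir/object" "$demo_dir/present.dat"          # target exists
ln -s "$demo_dir/no-such-object" "$demo_dir/missing.dat"  # dangling symlink

for f in "$demo_dir/present.dat" "$demo_dir/missing.dat"; do
    if annex_content_present "$f"; then
        echo "$(basename "$f"): content is here"
    else
        echo "$(basename "$f"): content is not here"
    fi
done
```

In a real repository, git annex list (shown next) gives the same information for every known repository, not just the local one.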

Now let’s git annex list here. We see there are two repositories, here and demo2. large-file.dat is in both, as you can see by the Xs. (“web” and “bittorrent” are advanced features, not used unless you request… but give you the idea of what you can do):

here
|demo2
||web
|||bittorrent
||||
XX__ large-file.dat

The basic commands to distribute data are git annex get, git annex drop, git annex sync, and so on. The basic principles of git-annex are data integrity and security: it will try very hard to prevent you from using git/git-annex commands to lose the only copy of any data.

Navigating the documentation

The main git-annex site is https://git-annex.branchable.com/ . There are many special topics articles here, but the main reference page is the manual page, which can be a good starting point if you roughly know what you are looking for (and a lot of information is only here). It links to manual pages on every other sub-command and their descriptions. It also lists all configuration options, which are very important to refer to.

Other pages (linked from the main page) can describe broader use cases or introductions to concepts, but you often need to refer to the command manuals anyway.

Basic setup

After you have a git repository, you run git annex init to set up the git-annex metadata. This is run once in each repository in the git-annex network:

$ git init
$ git annex init 'triton cluster'   # give a name to the current repo

Level 1: locally locking and tracking data

You can add small files like normal using git (full content in git), and large files with git annex add, which replaces the file with a symlink to its locked content:

$ git add small-file.txt
$ git annex add large-file.dat
$ git commit           # metadata: commit message, author, etc.

Now, your content is safe: it is a symlink to somewhere in .git/annex/objects and it is almost impossible for you to accidentally lose the data. If you do want to modify a file, first run git annex unlock, and then commit it again when done. The original content is saved until you clean it up (unless you configure otherwise). The largefiles settings determine the behavior of git add: you can set which files should always be committed to the annex (instead of git).
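As a rough sketch of what largefiles rules can look like, the snippet below writes a .gitattributes file carrying annex.largefiles expressions. The 100kb threshold and the *.txt exception are arbitrary choices for illustration, and the file is written to a scratch directory that stands in for the top of a real repository.

```shell
# Sketch: a .gitattributes file carrying annex.largefiles rules.
# The 100kb threshold and the *.txt exception are illustrative assumptions.
repo_dir=$(mktemp -d)     # stand-in for the top of a real repository

cat > "$repo_dir/.gitattributes" <<'EOF'
* annex.largefiles=(largerthan=100kb)
*.txt annex.largefiles=nothing
EOF

cat "$repo_dir/.gitattributes"

# A per-clone alternative (not run here) is the git config setting:
#   git config annex.largefiles 'largerthan=100kb and not include=*.txt'
```

Rules kept in .gitattributes travel with the repository contents, while a git config setting only affects the clone where it is set.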

At this point, git push|pull will only move metadata around (the commit message and the link to .git/annex/objects/AA/BB/HHHHHHHH, where HHHHHHHH is a unique hash of the file contents). This is what is stored in the primary git history itself.
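To see what such a link encodes, the key from the ls -l listing in the Background section can be taken apart with plain shell parameter expansion. This is only a sketch for this particular key: the field names (backend, size, hash) are our reading of it, and other keys may carry additional fields.

```shell
# Key copied from the ls -l listing in the Background section
# (the final path component of the symlink target).
key='SHA256E-s10485760--4c95ccee15c93531c1aa0527ad73bf1ed558f511306d848f34cb13017513ed34.dat'

backend=${key%%-*}    # text before the first "-"
rest=${key#*-s}       # drop everything through the first "-s"
size=${rest%%--*}     # text before "--" is the size in bytes
name=${key#*--}       # text after "--" is the hash plus the extension
hash=${name%%.*}      # strip the extension

echo "backend: $backend"
echo "size:    $size bytes"
echo "hash:    $hash"

# In a real repository the key could be read from the symlink itself:
#   key=$(basename "$(readlink large-file.dat)")
```

Since the key is derived from the file contents, two files with identical contents end up pointing at the same object.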

Structured metadata (arbitrary key/value pairs) can be assigned to any files with git annex metadata (and can be automatically generated when files are first added, such as the date of addition). Files can be filtered and transferred based on this metadata. Structured metadata helps us manage data much better once we get to level 3.

So now, with little work, we have a normal git repository that provides a history (metadata) to other data files, keeps them safe, and can be used like a normal repository.

Relevant commands:

git annex init: activate existing git repo for git-annex.
git annex add: add file to the annex, possibly depending on various rules
git annex unannex: opposite of git annex add
git annex unlock: unlock an annexed file, so that it’s a normal file and can be edited.
git annex lock: opposite of git annex unlock
git annex metadata: show or set per-file metadata
git annex info: info on various things
Configuration annex.largefiles - rules for what should be automatically annexed

Level 2: moving data

Data in one place isn’t enough, so let’s do more. Just like git remotes, git-annex remotes allow moving data around in a decentralized manner. There are two kinds:

Regular git remotes, which work if the git-annex shell tools are installed on the remote side.
Git-annex special remotes, which essentially serve as key-value stores. Options include S3, cloud drives, rsync, and many, many more.

Regular git remotes are set up with git annex init on the remote side. Special remotes are created with git annex initremote. Every remote has a unique name and UUID to manage data locations.

Once the remotes are set up, you can move data around:

$ git annex get data/input1.dat                # get data from any available source
$ git annex copy --to=archive data/input2.dat

You can remove data from a repo, but git-annex will actively connect to other remotes to verify that other copies of the file exist before dropping it:

$ git annex drop data/scratch1.txt

These commands move data around in .git/annex/objects/ and update tracking information on the special git-annex branch so that git-annex knows which remotes have which files - very important to avoid a giant mess!
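The get/drop cycle can be tried safely with two throwaway clones on the same machine. The script below is a sketch under stated assumptions: it only runs if git and git-annex are installed, the repository names first and second and the demo identity are invented for the demonstration, and everything lives in a temporary directory.

```shell
# Sketch: move content between two local clones.
# Runs only if both git and git-annex are installed; otherwise it is skipped.
demo_status=skipped
if command -v git >/dev/null 2>&1 && command -v git-annex >/dev/null 2>&1; then
    # Throwaway identity so commits work on an unconfigured machine.
    export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
    export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
    work=$(mktemp -d)
    demo_status=failed
    (
        git init -q "$work/first" &&
        cd "$work/first" &&
        git annex init first &&
        printf 'some data\n' > input1.dat &&
        git annex add input1.dat &&
        git commit -q -m 'add input1.dat' &&
        git clone -q "$work/first" "$work/second" &&
        cd "$work/second" &&
        git annex init second &&
        git annex get input1.dat &&       # fetch content from the origin remote
        git annex whereis input1.dat &&   # lists the repositories holding the content
        git annex drop input1.dat         # allowed, because origin still has a copy
    ) && demo_status=ok
fi
echo "demo: $demo_status"
```

After the drop, input1.dat in the second clone is a dangling symlink again, while the first clone still holds the data.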

Special remotes can be created like so:

$ git annex initremote NAME type=S3 encryption=shared host=a3s.fi

And enabled in other git repositories to make more links within the repository network:

$ git annex enableremote NAME

Note that special remotes are client-side encrypted unless you set encryption=none, and also chunked to deal with huge files even on remotes which do not support them.

Relevant commands:

git annex get: use available knowledge to get a copy of files from remotes.
git annex drop: delete a file from current repo. By default, make sure other copies exist before doing this.
git annex move: move file contents
git annex copy: copy file contents
git annex list: list of files including where contents are stored
git annex find: list files matching pattern
git annex initremote: initialize a special remote (info will be synced)
git annex enableremote: use synced info to prepare an existing special remote for use.

Level 3: synchronizing data

Moving data is great, but when data becomes Big, manually managing it doesn’t work. Git-annex really shines here. The most basic command is sync --content, which will automatically commit anything new (to git or the annex depending on the largefiles rules) and distribute all data everywhere reachable (including regular git-tracked files). Without --content, it syncs only metadata and regular commits:

$ git annex sync --content

But all data everywhere doesn’t scale to complex situations: we need to somehow define what goes where. And this should be done declaratively. One of the most basic declarations is the minimum number of copies allowed, numcopies. Git-annex won’t let you drop a file from a repository without being very sure that this many copies exist in other repositories. This setting is synced through the entire repository network:

$ git annex numcopies N
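The numcopies rule can be modeled in a few lines of shell. This is a toy model of the decision described above, not git-annex's actual implementation; the function name may_drop and its arguments are hypothetical.

```shell
# Toy model: may a file be dropped from the local repository?
#   $1 = number of copies verified to exist in OTHER repositories
#   $2 = the numcopies setting
may_drop() {
    [ "$1" -ge "$2" ]
}

numcopies=2
for other_copies in 1 2 3; do
    if may_drop "$other_copies" "$numcopies"; then
        echo "other copies=$other_copies numcopies=$numcopies: drop allowed"
    else
        echo "other copies=$other_copies numcopies=$numcopies: drop refused"
    fi
done
```

The point of the model is that the local copy does not count: only copies that can be confirmed elsewhere make a drop safe.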

The next level is preferred content, which specifies what files a given repository wants. git annex sync --content will use these expressions to determine what to send where:

$ git annex wanted . 'include=*.mp3 and (not largerthan=100mb) and exclude=old/*'
$ git annex wanted archive 'anything'
$ git annex wanted cluster 'present or copies=1'
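To make the first expression above concrete, here is a toy shell model of how it classifies files. It is a sketch only: git-annex evaluates these expressions itself, the function name wanted_here and the sample paths are invented, and the model assumes that patterns match the whole relative path and that 100mb means 100 * 1000 * 1000 bytes.

```shell
# Toy model of: include=*.mp3 and (not largerthan=100mb) and exclude=old/*
#   $1 = path relative to the repository top, $2 = size in bytes
# Assumption: "100mb" is taken as 100 * 1000 * 1000 bytes.
wanted_here() {
    case "$1" in
        old/*) return 1 ;;      # exclude=old/*
    esac
    case "$1" in
        *.mp3) ;;               # include=*.mp3
        *) return 1 ;;
    esac
    [ "$2" -le 100000000 ]      # not largerthan=100mb
}

# Hypothetical sample files: path and size in bytes.
while read -r path size; do
    if wanted_here "$path" "$size"; then
        echo "wanted:     $path"
    else
        echo "not wanted: $path"
    fi
done <<'EOF'
music/song.mp3 5000000
old/song.mp3 5000000
music/concert.mp3 250000000
notes.txt 100
EOF
```

Only music/song.mp3 satisfies all three conditions; each of the other files fails exactly one of them.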

Repository groups and standard groups allow you to more easily define rules (the standard groups list lets you see the power of these expressions). Various built-in background processes can watch for new files and run git annex sync --content automatically for you, which can make your data management a fully automatic process. Repository transfer costs can allow git-annex to fetch data from a nearby source rather than a more distant one. Client-side encryption can allow you to use any available storage with confidence.

Relevant commands:

git annex sync [--content]: automatically commit/move data around based on the rules defined below
git annex numcopies: set default number of copies for every annexed file (minimum redundancy level)
git annex trust: mark a repo as being trusted (it won’t lose data so you don’t have to verify contents before deleting locally)
git annex untrust: opposite of git annex trust
git annex wanted: set files which will be automatically synced to a repo.
git annex group: set a repo as part of a group
git annex groupwanted: same as git annex wanted but for groups
git annex required: similar to git annex wanted but prevents you from dropping the content unless you force it
git annex unused: find older versions of files which are no longer referred to in the current version and can be dropped
git annex schedule: manage scheduled jobs (such as periodic consistency checks) that the background processes run
git annex watch: monitor current repo for changes and automatically add and commit them when they happen (git annex assistant does the same and also syncs)
