Matteo Visconti di Oleggio Castello, Ph.D. | Data version control with DataLad and git-annex, a modest example

Git is my Captain’s Log. I can get back to a project that I left months, perhaps years ago, and retrace my steps to get ready to work on that project again. I can rewind and go back in time, then continue on a parallel history. Git works great for code. Except that in science not everything is code. How about data?

Committing binary data to your git history is a big mistake (one that I made too many times). Because git keeps track of the history, even if you delete a file it will still be there. This is great for code, not so great for data. This was my life three years ago:

$ du -sh .git
1.3G     .git

Ouch. Not something you can share on GitHub.

Committing your binary data to git is a big no-no. But we still want to save the data with our project, to make it fully reproducible and shareable. Luckily somebody thought about it, and invented git-annex.

Git-annex is an extension of git, written by Joey Hess, that allows you to commit big files to git. It’s based on a smart trick: whenever you “annex” a file with git-annex, the original file will be moved into .git/annex, and only a symlink to it will be committed to history. Whenever you clone the repository, only the symlink will be copied, and then git-annex will take care of getting the data for you. So, the data will be under version control, just like your code.
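
To make the trick concrete, here is a toy sketch in plain shell. It is not git-annex: the store directory and the file names are made up for illustration, and the real tool keeps content under .git/annex with its own key format. With git-annex itself, the corresponding commands are git annex add to annex a file and git annex get to fetch its content after cloning.

```shell
#!/bin/sh
# Toy illustration of the "annex" idea using only POSIX tools.
# This is NOT git-annex; "store" and "big.dat" are hypothetical names.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
mkdir store

# A stand-in for a large binary file.
printf 'pretend this is a large binary file\n' > big.dat

# Name the stored copy after a checksum of its content.
key=$(cksum < big.dat | awk '{print $1}')

# Move the content away and leave a symlink behind in its place.
mv big.dat "store/$key"
ln -s "store/$key" big.dat

# The working tree now holds only a small pointer...
ls -l big.dat
# ...but reading through it still yields the original content.
cat big.dat
```

Only the symlink would need to go into the git history; the content behind it can live wherever the store lives.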

Oh, and it will also check if it’s really the file that you committed by comparing checksums. Oh, and it will also tell you if you can safely remove it from your laptop by checking if you have enough copies somewhere else. I think that’s pretty cool.
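
Both checks can be mimicked with a few lines of shell. Again, this is only an illustration under made-up names (laptop, backup, verify, safe_drop), not how git-annex is implemented; in git-annex itself the corresponding commands are git annex fsck and git annex drop.

```shell
#!/bin/sh
# Toy versions of the two safety checks, using only POSIX tools.
# NOT git-annex; the directories and helper functions are hypothetical.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
mkdir laptop backup

printf 'precious measurements\n' > laptop/run01.dat
cp laptop/run01.dat backup/run01.dat

# Record a checksum at "commit" time.
recorded=$(cksum < laptop/run01.dat)

# Check 1: does a file still match the recorded checksum?
verify() {
    [ "$(cksum < "$1")" = "$recorded" ]
}

# Check 2: delete the local copy only if a verified copy exists elsewhere.
safe_drop() {
    if [ -f "backup/$1" ] && verify "backup/$1"; then
        rm "laptop/$1"
        echo "dropped laptop/$1"
    else
        echo "refusing to drop laptop/$1: no verified copy elsewhere"
        return 1
    fi
}

verify laptop/run01.dat && echo "checksum ok"
safe_drop run01.dat
```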

Git-annex solves the problem of committing large files to git, but it would be nice to separate the code from the data. One can use git submodules to create separate git repositories for data and code, but submodules are notoriously a pain to deal with. Luckily, DataLad comes to the rescue.

DataLad, created by Yaroslav Halchenko and Michael Hanke, builds on top of git-annex to make data versioning and sharing embarrassingly easy. Plus, it makes git submodules manageable. For example, consider this folder from one of my projects:

~/exp/hauntedhouse_mne (master*) $ tree -L 1
.
├── data
├── docs
├── notebooks
├── singularity
└── src

Everything is under git, but data, docs, and src are three git submodules.

~/exp/hauntedhouse_mne (master*) $ git submodule
+6024620ae841790bb243a78fa90b39e86d628eb4 data (heads/master)
-d23ef6506bebfea912ccb2c0253da00beedf3b47 docs
+e64d5e31be82ba1436f85f91dca3b1e230abe97e src (heads/master)

This would still be OK with git. But under data, I have another submodule called derivatives, where I put the results of the analyses (following BIDS conventions):

~/exp/hauntedhouse_mne/data (master) $ git submodule
 86290b4c5eebf182390538349afb1c699c414467 derivatives (heads/master)

Oh, and also under src, I keep submodules for third-party code that I need:

~/exp/hauntedhouse_mne/src (master) $ git submodule
-c756256692ab50d71f1ebf5eb7824a31ab772237 3rd/jr-tools

Now we have some problems. Keeping all these submodules synced with git would be a big pain. I would need to traverse all the git repositories and commit the changes both in the child and the parent repositories. Not with DataLad.

First, notice how DataLad is aware of the hierarchy of submodules:

~/exp/hauntedhouse_mne (master*) $ datalad ls -Lr
.                  [annex]  master  - 2017-10-07/17:00:39  X  46.6 MB/64.1 MB
data               [annex]  master  - 2017-10-07/17:00:38  OK  52.1 GB/365.1 GB
data/derivatives   [annex]  master  - 2017-10-07/17:00:37  OK  1.0 GB/1.1 GB
docs               not installed
src                [git]  master  - 2017-10-05/17:15:21  OK
src/3rd/jr-tools   not installed

and also notice other cool features like

showing that both data and data/derivatives are git-annex repositories, since they store a lot of data, but src is a git repository because it contains only code;
showing that the parent repository has some untracked changes (notice the “X”);
showing that some of the submodules are not cloned (or installed in DataLad’s terminology);
showing how big the git-annex repositories are, and how much of the data is in there (yep, data potentially contains a lot of data, but I don’t need it all, so I only keep here what I need, and the rest is backed up on another server).

So what happens if I add a new file in the deepest submodule, data/derivatives?

~/exp/hauntedhouse_mne (master*) $ touch data/derivatives/myawesomenewanalysis.txt

If I now traverse the git repositories starting from the deepest one, I see that I should first commit my change in data/derivatives…

~/exp/hauntedhouse_mne/data/derivatives (master*) $ git status
...
Untracked files:
...
myawesomenewanalysis.txt

…then I should commit the change in the data submodule…

~/exp/hauntedhouse_mne/data (master) $ git status
...
Changes not staged for commit:
...
    modified:   derivatives (untracked content)

…and finally commit the change in the parent repository.

~/exp/hauntedhouse_mne (master*) $ git status
...
Changes not staged for commit:
...
    modified:   data (modified content)
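
Spelled out, the manual route is three add/commit pairs, deepest repository first. The sketch below is a dry run that only prints those commands; plan_commits is a hypothetical helper and the commit messages are placeholders.

```shell
#!/bin/sh
# Dry run: print the git commands needed to record one new file by hand,
# walking from the deepest repository up to the top-level one.
# Nothing is executed; plan_commits is a hypothetical helper.
set -eu

plan_commits() {
    file=$1          # path of the new file, relative to the top-level repo
    shift            # remaining arguments: repositories, deepest first
    child=$file
    for repo in "$@"; do
        # Path of the changed item as seen from inside this repository.
        case $repo in
            .) rel=$child ;;
            *) rel=${child#"$repo"/} ;;
        esac
        echo "git -C $repo add $rel"
        echo "git -C $repo commit -m 'Update $rel'"
        # One level up, the thing that changed is this repository itself.
        child=$repo
    done
}

plan=$(plan_commits data/derivatives/myawesomenewanalysis.txt \
    data/derivatives data .)
echo "$plan"
```

Each parent has to be committed after its child, because what the parent records is the child's new commit.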

How about with DataLad? Nothing easier, no need to think about submodules at all, just one command:

~/exp/hauntedhouse_mne (master*) $ datalad add -d . data/derivatives/myawesomenewanalysis.txt
add(ok): data/derivatives/myawesomenewanalysis.txt (file)
save(ok): /home/mvdoc/exp/hauntedhouse_mne/data/derivatives (dataset)
save(ok): /home/mvdoc/exp/hauntedhouse_mne/data (dataset)
save(ok): /home/mvdoc/exp/hauntedhouse_mne (dataset)
action summary:
  add (ok: 1)
  save (ok: 3)

and we can use DataLad to check what changed:

~/exp/hauntedhouse_mne (master*) $ datalad diff --revision @~1 -r
   modified(dataset): data
   modified(dataset): data/derivatives
         added(file): data/derivatives/myawesomenewanalysis.txt

Now there is much more to DataLad than this, and this post is really just scratching the surface. I just wanted to show an example of how I integrate git, git-annex, and DataLad in my everyday workflow. If you want to know more, I strongly recommend the awesome examples on the DataLad website that will show you how to use DataLad to create repositories, publish your data, or use a one-liner to download data from OpenfMRI.

Big thanks to Yarik Halchenko and Michael Hanke for creating this remarkable piece of software, Joey Hess for creating git-annex, and Olivia Guest and Ariel Rokem for inspiring me to write this post.