Reproducible C++ builds by logging git hashes
Sometimes I am in the difficult situation where I have written a program which writes some kind of output to disk, and I want to remember which version of my program produced this output. This is really common for me at the moment due to my research, which always seems to involve a lot of trial and error algorithm design. I think that similar problems exist in all kinds of other areas, but particularly during rapid development, because once software has been properly deployed and versioned it’s quite trivial to just put the version number in the logs.
For a slightly more, but not very, concrete example: I’m working on an algorithm implementation right now. I won’t say too much about the details yet, but it takes a number of configuration options. These, of course, I can quite easily write to the log file. The program also has lots of implementation details that can be tweaked, really dozens of things I could change, and I keep coming up with new ideas I want to try. This means that I end up with a folder full of outputs generated by code that probably doesn’t even exist anymore, and which I wouldn’t be able to reproduce purely by running the current version of the program with whatever configuration options are specified in the log file.
This also isn’t the first time I’ve had a very similar problem. I assume (hope) it’s not just me, so I thought I’d write up the solution I came up with.
Git commit hashes
As you likely know, git commits are identified by their hashes, which are long strings of hexadecimal digits, such as b5a994c260105b7cc979aead986532b51c37df75. Specifically, each one is 40 characters long, and is the SHA-1 hash of the commit object, which in turn covers the file tree, the parent commits, and the commit metadata.
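Since the format is so regular, it is easy to sanity-check a string that is supposed to be a commit hash, for example one read back out of an old log file. Here is a minimal sketch, assuming the default SHA-1 object format; the function name looks_like_commit_hash is made up for illustration:

```cpp
#include <cctype>
#include <string_view>

// Hypothetical helper: true if `s` is exactly 40 hexadecimal digits,
// i.e. it has the shape of a full SHA-1 commit hash.
bool looks_like_commit_hash(std::string_view s) {
    if (s.size() != 40) return false;
    for (unsigned char c : s) {
        if (!std::isxdigit(c)) return false;
    }
    return true;
}
```

This only checks the shape of the string, not whether such a commit actually exists in the repository.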
My idea is pretty simple: make the program write the current commit’s hash to the log file. Then, given any log file, I can see the commit used to generate it, and go back in the git history to see exactly what my code was doing at that point.
Basic Implementation
How to integrate this into the logs? A super easy but incorrect approach would be to invoke git from my program directly, retrieve the hash of the current commit, and write it to the log file. This doesn’t work though, because that will give us the git commit state at runtime, whereas we want to know what commit was used when compiling the code.
What we actually need to do is integrate the commit hash into the build system. Since I’m writing my code in C++, the natural way to implement compile-time information like this is to #define it, so let’s start by writing a script which builds a C++ header file to do just that:
#!/usr/bin/bash
commit_hash=$(git rev-parse HEAD)
echo "#pragma once"
echo "#define GIT_COMMIT_HASH \"${commit_hash}\""
Fairly simple: we’re just defining a macro GIT_COMMIT_HASH as a string literal containing whatever git rev-parse HEAD prints, which is the hash of the currently checked-out commit. I will say, there’s probably a “proper” C++ way to define compile-time literals like this, with “proper” type checking, or something, but #define is good enough.
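For the curious, one candidate for the “proper” way might be a typed constant wrapped around the macro. This is only a sketch, assuming C++17; the name kGitCommitHash and the "unknown" fallback are hypothetical and not part of my actual setup:

```cpp
#include <string_view>

// Fallback so this sketch compiles on its own; in a real build the
// generated header would define GIT_COMMIT_HASH instead.
#ifndef GIT_COMMIT_HASH
#define GIT_COMMIT_HASH "unknown"
#endif

// Hypothetical typed wrapper around the macro (C++17 inline variable).
inline constexpr std::string_view kGitCommitHash = GIT_COMMIT_HASH;

// Compile-time sanity check: the hash should never be empty.
static_assert(!kGitCommitHash.empty(), "GIT_COMMIT_HASH must not be empty");
```

The rest of the program can then use kGitCommitHash like any other constant, and the static_assert catches an empty hash at compile time rather than in a log file weeks later.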
The final step is to run this script at every compilation. For reference (because I always forget), CMAKE_BINARY_DIR is the build directory, which for me is /build; and CMAKE_SOURCE_DIR is the root of the cmake project, i.e. where the top-level CMakeLists.txt is. I appended the following to my CMakeLists.txt:
add_custom_target(git_info ALL
    COMMAND scripts/gen_git_info.sh > ${CMAKE_BINARY_DIR}/git_info.h
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
    COMMENT "Generating git info header"
)
add_dependencies(my_program git_info)
include_directories(${CMAKE_BINARY_DIR})
This target just tells cmake that I want to run the specified command on every build, which runs our script and writes the output to a new header file in the build directory. Since I’m writing the header file to the build directory, I then have to add that as an include directory for my program. Of course, if you’re using a plain makefile, you need some other method. It’s probably even simpler: maybe make a phony target which runs the script to produce git_info.h, and make it a dependency of your program.
From C++ it’s then very simple:
#include <iostream>
#include <string>
#include "git_info.h"
// Somewhere early in main():
std::string git_info = GIT_COMMIT_HASH;
std::cout << "git_commit_hash: " << git_info << "\n";
Nice! In my particular program, I redirect stdout to my log file, so this is sufficient for me…
…Almost.
Uncommitted Code?
Most of you will have noticed by now that there is a problem here: all of this assumes that I’m always compiling from committed code. That isn’t always the case during rapid development, although I can constrain myself to only run “proper” experiments using code that I’ve actually committed. Still, I don’t want to confuse myself by incorrectly thinking that the output of a “not proper” experiment came from exactly the code in a certain commit.
The fix I chose is the simplest possible option: I will append “-dirty” to the commit hash if the compiled code has not been committed:
#!/usr/bin/bash
commit_hash=$(git rev-parse HEAD)
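# Compare the working tree against HEAD, so that staged changes also count
# as dirty (untracked files are still ignored)
dirty=$(git diff --quiet HEAD || echo "-dirty")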
echo "#pragma once"
echo "#define GIT_COMMIT_HASH \"${commit_hash}${dirty}\""
And just to make everything extra explicit (since I don’t want to forget to commit code if I want to run a “proper” experiment), I can add the following:
// std::string::ends_with requires C++20
if (git_info.ends_with("dirty")) {
    std::cout << "note: you're running a build with non-committed changes, "
                 "which may limit reproducibility\n";
}
The way my code works, this cout runs before stdout starts being redirected to a log file, so I can see this warning on the command-line, allowing me to quickly stop and recompile if I want to. Or I can just run it anyway, if I don’t care.
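If you are stuck on a standard older than C++20, or want the bare hash and the dirty flag as separate values, the same check can be written by hand. A minimal sketch, assuming C++17; the GitInfo struct and parse_git_info function are hypothetical names, not something from my program:

```cpp
#include <string>
#include <string_view>

// Hypothetical result type: the bare hash plus whether the build was dirty.
struct GitInfo {
    std::string hash;
    bool dirty;
};

// Splits a value like "<hash>-dirty" into its parts. Uses compare()
// rather than ends_with(), so it also compiles as C++17.
GitInfo parse_git_info(std::string_view value) {
    constexpr std::string_view suffix = "-dirty";
    const bool dirty =
        value.size() >= suffix.size() &&
        value.compare(value.size() - suffix.size(), suffix.size(), suffix) == 0;
    if (dirty) value.remove_suffix(suffix.size());
    return GitInfo{std::string(value), dirty};
}
```

Calling parse_git_info(GIT_COMMIT_HASH) would then give a hash that can be pasted straight into git checkout, with the dirty state kept separately.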
Improvements
This system works pretty nicely for me. It doesn’t have to be that professional, because it’s just a research project. I doubt anyone else will look at the log files, let alone the code. However, it definitely could be improved.
Mainly, it would be nice to actually record which files are dirty, in the case of a dirty build. This would mean defining another macro containing a list of those files. It could even be defined as a C++ vector, for easy printing!
On a similar note, I only care about changes which modify source code. My repository has a few other files, like some Python scripts to plot the output, and also some configuration files for other things. If those are changed, I don’t want to claim that it’s a dirty build. Working around this would be a bit more work, but if I did have a list of dirty files, I could just check if any of those are in the src/ or include/ directories, for example.
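Given such a list, the check itself would be short. Again just a sketch, assuming the paths are relative to the repository root with forward slashes; the function names are hypothetical:

```cpp
#include <algorithm>
#include <string>
#include <string_view>
#include <vector>

// True if `path` starts with `prefix` (a C++17-friendly starts_with).
bool has_prefix(std::string_view path, std::string_view prefix) {
    return path.substr(0, prefix.size()) == prefix;
}

// True if any dirty file lives under src/ or include/.
bool source_is_dirty(const std::vector<std::string>& dirty_files) {
    return std::any_of(dirty_files.begin(), dirty_files.end(),
                       [](const std::string& path) {
                           return has_prefix(path, "src/") ||
                                  has_prefix(path, "include/");
                       });
}
```

With this, a changed plotting script on its own would no longer mark the build as dirty, but any change under src/ or include/ still would.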
Finally, it could save even richer information, like diffs of the dirty files (so that I could reproduce dirty builds), and even library version numbers.
But for now, this is good enough.