Git for Data: Not a Silver Bullet

File this under "things written by Michael Kaminsky that I couldn't possibly have said better myself." (This has become a large folder over the years.)

Here's his conclusion:

I’m broadly sympathetic to the goals that people who are working on “git for data” projects have. However, I continue to believe that it’s important to keep code separate from data and that if your data system is deterministic and append-only, then you can achieve all of your goals by using version-control for your code and then selectively applying transformations to subsets of the data to re-create the data state at any time. The motto remains: Keep version control for your code, and keep a log for your data.

I have always been slightly confused by the "git for data" concept but never dug in deeply. It just doesn't mirror my experience of doing large-scale production data work! That is not at all to say that it doesn't have an important role to play in certain ML workflows, but I think it's important to have a clear understanding of where it's relevant and where Kaminsky's approach (which is what I've always practiced) is preferable.
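To make "keep a log for your data" a bit more concrete, here is a minimal sketch of the idea as I read it: an append-only list of events plus a deterministic function that rebuilds the state as of any timestamp. The `Event` schema, the `balances` function, and the sample rows are all hypothetical and made up for illustration; they are not taken from Kaminsky's post or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One immutable row in the append-only log (hypothetical schema)."""
    recorded_at: datetime
    account: str
    amount: int


def balances(log, as_of):
    """Deterministic transformation: the same log and as_of always
    produce the same state. This function is the part that lives in
    version control."""
    state = {}
    for event in log:
        # Only consider rows that had been recorded by the cutoff.
        if event.recorded_at <= as_of:
            state[event.account] = state.get(event.account, 0) + event.amount
    return state


# Rows are only ever appended, never updated or deleted.
log = [
    Event(datetime(2021, 1, 1), "a", 100),
    Event(datetime(2021, 1, 2), "b", 50),
    Event(datetime(2021, 1, 3), "a", -30),
]

# Re-create the data state at two different points in time.
state_jan_2 = balances(log, as_of=datetime(2021, 1, 2))
state_jan_3 = balances(log, as_of=datetime(2021, 1, 3))
```

In this toy version, getting an older view of the data means re-running the transformation with an earlier `as_of`, and getting an older version of the logic means checking out an earlier commit of `balances`. Real pipelines are obviously messier, but the separation is the same.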

Very open to being told I'm wrong here! Just hit reply.

