File this under "things written by Michael Kaminsky that I couldn't have possibly said better myself." (This has become a large folder over the years.)
Here's his conclusion:
I’m broadly sympathetic to the goals that people who are working on “git for data” projects have. However, I continue to believe that it’s important to keep code separate from data and that if your data system is deterministic and append-only, then you can achieve all of your goals by using version-control for your code and then selectively applying transformations to subsets of the data to re-create the data state at any time. The motto remains: Keep version control for your code, and keep a log for your data.
I have always been slightly confused by the "git for data" concept but never dug in deeply. It just doesn't mirror my experience of doing large-scale production data work! That is not to say it has no important role to play in certain ML workflows, but I think it's important to be clear about where it's relevant and where Kaminsky's approach (which is what I've always practiced) is preferable.
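For anyone who wants to see the "keep a log for your data" idea in concrete terms, here is a minimal sketch in Python. It is my own toy illustration, not code from Kaminsky's post: the `Event` record, the `balances` transformation, and the `state_as_of` helper are all hypothetical names, and the in-memory list stands in for whatever append-only table you actually use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One immutable row in a hypothetical append-only log."""
    recorded_at: datetime
    account: str
    amount: int


def balances(events):
    """Deterministic transformation: same input events, same output."""
    totals = {}
    for event in events:
        totals[event.account] = totals.get(event.account, 0) + event.amount
    return totals


def state_as_of(log, cutoff):
    """Re-create the derived state as it stood at `cutoff`."""
    return balances(e for e in log if e.recorded_at <= cutoff)


# Corrections are appended as new rows; existing rows are never updated.
LOG = [
    Event(datetime(2024, 1, 1), "a", 100),
    Event(datetime(2024, 1, 2), "b", 50),
    Event(datetime(2024, 1, 3), "a", -30),
]

print(state_as_of(LOG, datetime(2024, 1, 2)))
print(state_as_of(LOG, datetime(2024, 1, 3)))
```

The transformation code lives in version control, so checking out an older commit and replaying the log up to a chosen cutoff gives you back the data state from that point, with no separate versioning layer for the data itself.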
Very open to being told I'm wrong here! Just hit reply.