Application databases are generally designed to only track current state. (...) But, as analysts, we not only care about the current state (how many users are using feature “X” as of today), but also the historical state. (...) To accomplish these use cases we need a data model that tracks historical state.
In this post, I’ll show how you can create these data models using modern ETL tooling like PySpark and dbt (data build tool).
This post does an excellent job of tackling what can be a very dry subject. I want to add two things:
- Capturing historical state in the warehouse is a superpower most data folks don't know that they have! It has saved my ass multiple times. If you're not familiar with dbt's snapshots feature, check it out.
- It's neat to see Shopify using dbt, and just as neat to see how much simpler this use case is to accomplish in dbt than in the corresponding PySpark code!
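
If "tracking historical state" feels abstract, here is a minimal sketch of the idea in plain Python. It is not how dbt or the PySpark code in the post implements it: the function names (`apply_snapshot`, `state_as_of`) and the `valid_from` / `valid_to` columns are hypothetical, chosen only for illustration. Each run compares the current source rows with the open history records, closes any record whose values changed, and appends a new version. Deleted source rows are not handled here.

```python
from datetime import date

# Hypothetical bookkeeping columns that mark when a version of a row was valid.
META = ("valid_from", "valid_to")


def apply_snapshot(history, current_rows, as_of, key="id"):
    """Return a new history after comparing it with the current source rows."""
    history = [dict(r) for r in history]  # copy so the caller's list is not mutated
    # Open records are the versions that are still valid (valid_to is None).
    open_rows = {r[key]: r for r in history if r["valid_to"] is None}
    for row in current_rows:
        prev = open_rows.get(row[key])
        if prev is not None:
            prev_values = {k: v for k, v in prev.items() if k not in META}
            if prev_values == row:
                continue  # nothing changed, keep the open record as-is
            prev["valid_to"] = as_of  # close the outdated version
        # New key or changed values: append a fresh open version.
        history.append({**row, "valid_from": as_of, "valid_to": None})
    return history


def state_as_of(history, when):
    """Reconstruct what the source table looked like on a given date."""
    return [
        {k: v for k, v in r.items() if k not in META}
        for r in history
        if r["valid_from"] <= when and (r["valid_to"] is None or when < r["valid_to"])
    ]


# Two snapshot runs: user 1 upgrades from "free" to "paid", user 2 signs up.
history = apply_snapshot([], [{"id": 1, "plan": "free"}], date(2024, 1, 1))
history = apply_snapshot(
    history,
    [{"id": 1, "plan": "paid"}, {"id": 2, "plan": "free"}],
    date(2024, 2, 1),
)
january_view = state_as_of(history, date(2024, 1, 15))
```

The source table only ever holds one row per user, but `history` keeps every version, so `state_as_of` can answer "what plan was user 1 on in mid-January?" long after the application database has overwritten that value.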