I spend a lot of time thinking about data pipelines. One of my main gripes about the leading tools is how heavyweight they are, which slows iteration and discourages experimentation. This article hits the nail on the head:
Regular pipeline tools like Airflow and Luigi are good for representing static and fault tolerant workflows. A huge portion of their functionality is created for monitoring, optimization and fault tolerance. These are very important and business critical problems. However, these problems are irrelevant to data scientists’ daily lives.
Yes!! Pipelines need to be fast, responsive, and flexible (the less configuration, the better) so that users can quickly make adjustments and experiment. This article introduces a more agile pipeline tool called DVC and walks through its use in collaborative data science settings. It frames everything around your productivity as a data scientist, and it turns out that tooling like this really can change how, and how quickly, you work.
Highly recommended 👍👍
Full disclosure: at Fishtown Analytics, we're building a tool called dbt that empowers SQL analysts with the same type of workflow. I use dbt all day, every day, which is why I'm so bullish on this approach.