Airbnb | Scaling a Mature Data Pipeline: Managing Overhead

medium.com

I found this post fascinating. The author's point is that mature data engineering pipelines carry meaningful overhead from everything that isn't actually executing computation: spinning up environments, disk I/O, etc. The scale of the problem the Airbnb Payments team experienced was striking to me: two hours of overhead to run a DAG that processed almost no data. Yikes.

Regardless of what type of data processing you're doing, overhead is the enemy. Every bit of overhead introduces friction and should consistently be a focus for optimization. We deal with this every day in the maintenance of dbt: our compilation time is the overhead in the system, as each interactive run today requires a full recompilation of the entire project. This is a problem for the workflow, and is one of the reasons we've spent a tremendous amount of time over the past several months building partial parsing: the ability to re-parse only the parts of the DAG that have changed since the last compilation. This cuts the time-to-first-model-build by 90-95%. (If you're a dbt user, this is going into GA within the next month-ish and it's going to be a big deal.)
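This isn't dbt's actual implementation, but the core idea behind partial parsing is roughly this: checksum every project file, diff against the state saved from the last parse, and re-parse only what changed. A minimal sketch in Python, where the cache filename and `models` directory are made up for illustration:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical location for the saved parse state from the previous run.
MANIFEST = Path("target/partial_parse_state.json")

def file_checksum(path: Path) -> str:
    """Content hash used to detect whether a file changed since the last parse."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def changed_files(project_dir: str) -> list[Path]:
    """Return only the files whose contents changed since the last run."""
    old = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current = {str(p): file_checksum(p) for p in Path(project_dir).rglob("*.sql")}
    # Persist the new state so the next invocation can diff against it.
    MANIFEST.parent.mkdir(exist_ok=True)
    MANIFEST.write_text(json.dumps(current))
    return [Path(p) for p, digest in current.items() if old.get(p) != digest]

# Only the changed files get re-parsed; everything else is reused from cache.
for path in changed_files("models"):
    print(f"re-parsing {path}")
```

The payoff is that the common case (you edited one or two models) pays parse cost proportional to the edit, not to the size of the whole project.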

The post walks through the solutions the Payments team implemented to address the issue, and I think they're interesting. IMO, though, every team's solution will look different; the bigger point is that overhead is the silent killer. Measure it, crush it.
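On the "measure it" front: even crude instrumentation that separates overhead from real compute will tell you where the time goes. A trivial sketch, where the stage functions are stand-ins rather than any real pipeline API:

```python
import time

def timed(label, fn):
    """Run one pipeline stage and report how long it took."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Stand-in stages for illustration; a real pipeline would do actual work here.
def setup_environment():
    time.sleep(0.5)  # pretend env/cluster spin-up

def run_computation():
    time.sleep(0.1)  # pretend the (small) actual workload

timed("overhead: environment setup", setup_environment)
timed("compute", run_computation)
```

If the first number dwarfs the second, you have an overhead problem, not a compute problem.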

Read more...
