Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and Efficiency

Imagine yourself in the role of a data-inspired decision maker staring at a metric on a dashboard, about to make a critical business decision but pausing to ask a question: “Can I run a check myself to understand what data is behind this metric?”
Now, imagine yourself in the role of a software engineer responsible for a micro-service which publishes data consumed by a few critical customer-facing services (e.g. billing). You are about to make structural changes to the data and want to know who and what downstream of your service will be impacted.

The two scenarios described above are huge problems in at-scale data-driven companies, and they're both problems of data lineage. Data lineage (the ability to understand where data comes from and where it goes) is an extremely hot area at the moment, and it's a hard problem when your infrastructure looks like Netflix's does (the image above). Lyft just published a post about their internal Amundsen lineage tool, and there was quite a buzz around it at last month's Data Council conference. The importance of this problem is why we built dbt Docs late last year, and it's why we're very focused on going deeper in this area.

Anyway, this is a fascinating topic that is currently playing out in real time, and this post outlines Netflix's current approach.

