Data’s Inferno: 7 Circles of Data Testing Hell with Airflow

Real data behaves in many unexpected ways that can break even the most well-engineered data pipelines. To catch as much of this weird behaviour as possible before users are affected, the ING Wholesale Banking Advanced Analytics team has created 7 layers of data testing that they use in their CI setup and Apache Airflow pipelines to stay in control of their data.

This may be the best post I have ever seen on data integrity testing. I believe the reliability of data engineering is probably the single biggest area of weakness in sophisticated data organizations today, and almost every organization could learn a lot from the CI practices outlined in this post.


