Uber: Monitoring Data Quality at Scale with Statistical Modeling


Conventional wisdom says to use some variant of statistical modeling to explain away anomalies in large amounts of data. However, with Uber facilitating 14 million trips per day, the scale of the associated data defies this conventional wisdom. With tens of thousands of tables in our pipelines, it is not possible for us to manually assess the quality of each piece of back-end data.
To this end, we recently launched Uber’s Data Quality Monitor (DQM), a solution that leverages statistical modeling to tie together disparate elements of data quality analysis. Based on historical data patterns, DQM automatically locates the most destructive anomalies and alerts table owners to check the source, without flagging so many errors that owners become overwhelmed.
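The Uber post describes the idea at a high level rather than in code, but the core pattern — score a table metric against its own history and surface only the largest deviations, so owners aren't flooded with alerts — can be sketched in a few lines. This is a minimal illustration with made-up function names, metrics, and thresholds, not Uber's actual method:

```python
import statistics

def anomaly_score(history, current, threshold=3.0):
    """Score the current value of a table metric (e.g. daily row count)
    against its historical mean, in standard deviations.
    Returns (z_score, is_anomaly)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = (current - mean) / stdev if stdev else 0.0
    return z, abs(z) > threshold

def top_anomalies(scored_tables, k=3):
    """Alert on only the k most severe anomalies across all tables,
    to keep the alert volume manageable for owners."""
    return sorted(scored_tables, key=lambda t: abs(t[1]), reverse=True)[:k]

# Hypothetical daily row counts for one table; today's load dropped sharply.
row_counts = [10_120, 9_980, 10_050, 10_200, 9_940, 10_010, 10_090]
z, is_anomaly = anomaly_score(row_counts, 4_500)
# A large negative z-score flags the drop as an anomaly worth alerting on.
```

A real system would track many metrics per table (row counts, null rates, freshness) and use models more robust than a plain z-score, but the alert-budget idea in `top_anomalies` is the part that keeps owners from being overwhelmed.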

Data quality and data catalogs are, IMO, the most interesting areas in data right now. The lack of a good catalog and a good data quality management system becomes a problem precisely because of the scale of an organization's investment in data: the more mature an organization gets and the more data assets it accumulates, the more these become recurring themes.

This post from Uber is a good walkthrough of a DQM system: what it can do and how it works. Something like this is coming to your team in the not-terribly-distant future.
