How Materialize and other databases optimize SQL subqueries This is a post that I had queued up to read for the past couple of months and am only now getting back to. It's one of the deeper blog posts on the link between sql you wrote and explain plan your database ran.

This is one of the biggest areas that I see new analytics engineers struggle with and is probably the deepest that an AE has to go on a purely technical / CS fundamentals continuum. In fact, you can skip this knowledge and just kinda cross your fingers that the optimizer will give you good results for a little while...but if you want to truly feel confident traversing any dataset, this is knowledge you need.

The post focuses on how different database engines optimize correlated subqueries. Here's just the tip of the iceberg to give you a taste:

The easiest way to execute this is to run the subquery once for each row in the outer query, but this is potentially very inefficient. Databases rely on being able to collect, reorder and batch operations to reduce interpreter overhead and optimize memory access patterns. Running the same query many many times in a nested loop reduces that optimization freedom.

The author is a true expert in the field and is quite good at making somewhat arcane concepts accessible IMO.


Want to receive more content like this in your inbox?