Using AWK and R to Parse 25TB

The post Nick references in the tweet above both describes an impressive project and is impressive in its own right. He pipelines together a bunch of tools to achieve that result, none of which I would ever think to recommend, but his final result was 🔥🔥🔥
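A core trick in pipelines like this is using AWK to partition a huge delimited file in a single streaming pass, writing each row to a per-key output file instead of loading anything into memory. The snippet below is a minimal sketch of that pattern, not Nick's actual code; the file name `snps.tsv` and the key-in-column-1 layout are hypothetical.

```shell
# Hypothetical input: a small tab-separated file whose first column is the
# partition key (e.g. a chromosome name).
printf 'chr1\t100\nchr2\t200\nchr1\t300\n' > snps.tsv

# Stream the file once; AWK keeps each output file open after the first write,
# so every row with the same key lands in the same "part_<key>.tsv" file.
awk -F '\t' '{ print > ("part_" $1 ".tsv") }' snps.tsv

cat part_chr1.tsv
```

Because AWK processes the input line by line, this scales to files far larger than RAM, which is what makes it viable at the multi-terabyte sizes the post deals with.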

His summary references the novelty of the pipeline he built, with all of the pros and cons that implies:

The final solution is bespoke and almost assuredly not the optimal one. At risk of sounding unbearably cheesy, this was about the journey. I want others to realize that these solutions don't pop fully formed into people's heads but are a product of trial and error.

Nick indicated in a Twitter thread that AWS was a hard requirement, but were that not true, I think BigQuery could have saved some hassle here. I find that the ability to select the best tool across cloud providers is more often a pain point today than it was 2-4 years ago. Cloud tools have differentiated (comparing Athena to BigQuery is almost silly) while organizations have taken on more cloud lock-in. Data engineers end up confined to a subset of tools and have to get creative.

