Apache Arrow: The Hidden Champion of Data Analytics

maximilianmichels.com

Arrow is used by open-source projects like Apache Parquet, Apache Spark, pandas, and many commercial or closed-source services. It provides the following functionality: a) in-memory computing, b) a standardized columnar storage format, c) an IPC and RPC framework for data exchange between processes and nodes respectively.

I've been following Apache Arrow for a long time (2016!) and have been / continue to be bullish on it. This post is a great intro if you're not familiar, but it also has some excellent performance data in it that was new to me.

IBM measured a 53x speedup in data processing by Python and Spark after adding support for Arrow in PySpark

...heh, wow.

Read more...
Linkedin

Want to receive more content like this in your inbox?