Apache Arrow: The Hidden Champion of Data Analytics


Arrow is used by open-source projects like Apache Parquet, Apache Spark, and pandas, as well as many commercial or closed-source services. It provides the following functionality: a) in-memory computing, b) a standardized columnar storage format, and c) an IPC and RPC framework for data exchange between processes and nodes, respectively.
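To make "columnar" a bit more concrete, here is a rough sketch in plain Python of the difference between a row layout and a column layout. This is not Arrow's API or its actual memory format (the real thing also carries validity bitmaps, offsets, and schema metadata); the field names `id` and `price` and the helper `column_sum` are made up for illustration.

```python
from array import array

# Row-oriented layout: one Python object per record.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented layout: one contiguous, typed buffer per field.
# Illustrative only; this is NOT the Arrow format itself.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}


def column_sum(cols, name):
    """Aggregate one field by scanning only that field's buffer."""
    return sum(cols[name])


total = column_sum(columns, "price")

# A memoryview exposes the column's buffer without copying it,
# and slicing the view does not copy either.
price_view = memoryview(columns["price"])
tail = price_view[1:]

# The raw bytes of a column can be handed to another consumer and
# reinterpreted with the same type code, with no per-row parsing.
payload = columns["price"].tobytes()
restored = array("d")
restored.frombytes(payload)
```

The idea this is meant to hint at: when every tool agrees on one typed column layout, an aggregate touches only the column it needs, and handing data to another process is closer to passing a buffer than to serializing records one by one.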

I've been following Apache Arrow for a long time (since 2016!) and remain bullish on it. This post is a great intro if you're not familiar with Arrow, and it also includes some excellent performance data that was new to me.

IBM measured a 53x speedup in data processing by Python and Spark after adding support for Arrow in PySpark.

...heh, wow.


