Apache Arrow and the "10 Things I Hate About pandas"


In this post I hope to explain as concisely as I can some of the key problems with pandas's internals and how I've been steadily planning and building pragmatic, working solutions for them.

Wes McKinney knows a thing or two about pandas: he started writing it in his spare time almost a decade ago and continues to be deeply involved. In this post, he tells a story that starts with pandas but goes further, eventually leading him to pull together a coalition around an Apache project called Arrow.

The motivating force behind the story is performance at scale. Wes explains how a number of early design decisions in pandas still plague the project today. Arrow aims to be a "columnar data middleware" that offers zero-copy data access between systems like Impala, Kudu, Spark, and Parquet. In his own words: "I strongly feel that Arrow is a key technology for the next generation of data science tools."

I first mentioned Arrow here almost two years ago, when the project was just getting started. It has made a ton of headway since then, and it's well worth checking out if it's new to you.

