Introducing Apache Arrow Flight: A Framework for Fast Data Transport

This post introduces Arrow Flight, a framework for building high performance data services. We have been building Flight over the last 18 months and are looking for developers and users to get involved.

I've been interested in Apache Arrow for years: Wes McKinney's next project after pandas has a tremendous amount of promise to move the entire data processing landscape forward. This post announces the project's latest addition: Arrow Flight, a mechanism for moving data quickly.

This sounds somewhat boring relative to the things that data analysts and scientists more commonly think about. Why does this matter?

The speed of data transport is a bit like a "law of physics" for data processing: it determines how all downstream data applications are built. For example, slow data transport is one of the primary reasons that the industry is currently moving towards ingesting all organizational data into a single warehouse/lake and doing all processing from there.

With Arrow (and Arrow Flight), though, this "data locality" constraint begins to relax. All of a sudden, you can think about building applications differently. Dremio is a product that's been built for this paradigm (and uses Arrow under the hood). It enables data processing against non-local and heterogeneous data stores at speed. The article itself is fantastic if you'd like to dig into how the technology actually works.

Keep an eye on this space—I really think this is one of the most interesting projects in all of data right now.

