Arrow Native Storage

May 28, 2021

I saw a research paper fly by that talked about an Arrow-native storage system. I understood Arrow to be an in-memory columnar format that avoids the cost of serializing and deserializing data as it moves between systems. My initial reaction to the title was surprise: I had never thought of Arrow as something that could be used at the storage layer. The Arrow project's FAQ even goes as far as saying that Parquet is complementary to Arrow.

After digging into the paper, I found that the researchers still use Parquet at the storage layer. They have the storage layer do the work of reading the data from Parquet and converting it into Arrow upon access. It seemed like a novel approach to the problem they articulate: client machines, and client CPUs in particular, are going to be the bottleneck when working with large data volumes.

I could see this being a feature on S3. If you have Parquet files saved in your S3 bucket, you could access them in Arrow format. It might be interesting for the right use case.