Pandas Replatforms onto Arrow
Performance bump or schism in scientific computing?

The most popular data science tool, Pandas (a NumFOCUS project), is switching backends from NumPy (NumFOCUS) to Arrow (Apache).
However, the rest of the NumFOCUS scientific computing toolset (e.g. sklearn and scipy) is built on top of NumPy.
🏎️ Performance
You’ll hear a lot of people talking about the performance gains, but is that the real takeaway here?
On one hand, yes, there are real improvements in memory utilization. On the other hand, when it comes to raw speed, NumPy itself is already fast; think 15x-20x faster than Pandas. Although it’s great that the new Arrow-based backend provides performance improvements without the need to leave the Pandas API, upon closer inspection we’ll see that this change is about much more than low-level serialization speeds and feeds.
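For the curious, opting in really is a one-liner. A minimal sketch, assuming pandas ≥ 2.0 with pyarrow installed (the CSV file name is hypothetical):

```python
import pandas as pd

# Ask the IO readers for Arrow-backed dtypes instead of NumPy-backed ones
df = pd.read_csv("data.csv", dtype_backend="pyarrow")  # hypothetical file

# Or request an Arrow-backed dtype explicitly at construction time
s = pd.Series(["a", "b", None], dtype="string[pyarrow]")
print(s.dtype)  # string[pyarrow]
```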

Data Structures
🧮 Pandas has always been an outlier compared to the rest of the NumFOCUS ecosystem in that it focuses on human-readable, 2D/tabular DataFrames.
- Great at representing simple data (e.g. big spreadsheets) as DataFrames
- Strong emphasis on columnar data where each column can have its own heterogeneous type.
- Similar to a Python dictionary where the keys are column names and the values are the columns’ data (see the snippet after this list)
- Commonly used file formats for persisting dataframes include: csv, tsv, and the highly efficient, typed, columnar format known as Parquet
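To make the dict-of-columns mental model concrete, here’s a minimal sketch (the file name is hypothetical):

```python
import pandas as pd

# A DataFrame is conceptually a dict of typed columns
df = pd.DataFrame({
    "name":  ["ada", "grace"],    # string column
    "age":   [36, 45],            # int64 column
    "score": [0.92, 0.87],        # float64 column
})
print(df.dtypes)  # each column keeps its own type

# Typed, columnar persistence via Parquet (requires an engine like pyarrow)
df.to_parquet("people.parquet")
```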
🧊 NumPy provides a fantastic interface for working with multi-dimensional (n-dimensional) arrays.
- Great at representing complex data: time series (3D), images (3D), video (4D), or chunking what would normally be 2D data into multiple subsets
- Arrays behave more like nested Python lists
- The entire array bears the same homogeneous type. This is great for deep learning, where all data eventually gets encoded to float type, but extremely cumbersome for columnar/typed tabular use cases (see the snippet after this list).
- The main file format for persisting arrays is npy
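Here’s that homogeneous-typing behavior in action (the file name is hypothetical):

```python
import numpy as np

# n-dimensional data with a single dtype: frames x height x width x channels
video = np.zeros((10, 64, 64, 3), dtype=np.float32)

# Mixing types forces everything into one common dtype -- awkward for tables
mixed = np.array([[1, "a"], [2, "b"]])
print(mixed.dtype)  # a string dtype: the integers got coerced too

# npy persists exactly one homogeneous array
np.save("video.npy", video)
```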
So it always felt like a mismatch that Pandas was built on top of NumPy.
🧮 Meanwhile, Arrow is best known for its high-performance, in-memory tables for working with 2D/tabular data. It also provides the pyarrow interface that is widely used as a go-between (read/write) for Pandas dataframes and Parquet files.
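That go-between role looks like this in practice; a minimal sketch, assuming pyarrow is installed (the file name is hypothetical):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# pandas -> Arrow in-memory table -> Parquet file, and back again
table = pa.Table.from_pandas(df)
pq.write_table(table, "frame.parquet")
roundtrip = pq.read_table("frame.parquet").to_pandas()
```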
So the Pandas migration to Arrow doesn’t really come as a surprise.
💸 Follow the Money
This shift isn’t purely about technology.
- In 2022, Voltron Data raised $110M in venture funding to build an enterprise data platform for Arrow
- For context, this is about 100x NumFOCUS’ annual funding
Now, I’m not saying this to paint Voltron in a negative light. On the contrary, I’m a firm believer that the best way to support meaningful progress is a good business model. And if you actually look at the people involved in the company, you’ll find many prominent open source contributors. I am, however, saying that this represents a shift in the way things have historically been built in the open source data science ecosystem.
🧱 Implications
Pandas is extremely popular. I’ll spare you the charts about StackOverflow tags and developer survey responses. It is by far the most widely used NumFOCUS project from an end-user perspective.
So what are the implications of Pandas switching away from NumPy while the rest of the projects are still built on NumPy?
Will the other projects replatform on Arrow?
Maybe. Most of these projects are really mature and this would be a major, low-level rewrite. These projects typically talk about that kind of work in terms of how much funding it will take.
For example, the Dask companies (Coiled and Saturn Cloud) can presumably budget for some refactoring and largely continue using whatever API Pandas exposes. However, Dask also parallelizes sklearn, which utilizes NumPy heavily. Is sklearn going to replatform as a whole?
Will interoperability suffer?
In many cases, libraries like sklearn require input to be in NumPy format. However, 2D operations will often make exceptions for Pandas DataFrames, given that it’s easy to convert to NumPy under the hood. It will probably be just as easy to convert Arrow-based in-memory formats to NumPy arrays. However, will lower-level dtype functionality break? Although there appears to be backward compatibility in the new Pandas-Arrow typing, will that always be the case?
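A quick sketch of both the easy path and the dtype edge case, assuming pandas ≥ 2.0 with pyarrow installed:

```python
import numpy as np
import pandas as pd

# An Arrow-backed column converts to NumPy easily enough...
s = pd.Series([1, 2, None], dtype="int64[pyarrow]")

# ...but Arrow's nullable int64 has no exact NumPy equivalent, so the
# safest move is an explicit cast; the implicit default can vary by version.
arr = s.to_numpy(dtype="float64", na_value=np.nan)
print(arr)  # [ 1.  2. nan]
```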
At the same time, what would incentivize a data scientist building a new tool on Arrow-based functionality to ensure that it stays compatible with other NumFOCUS projects… nothing? That’s not to say that an Arrow-based ecosystem would replace the PyData stack overnight, but they do already have Pandas… Why would Voltron want to help competitors like Coiled and Saturn Cloud?
What about NumPy’s multi-dimensional array functionality?
Well, what about tensors? It seems like most of the new development in the ndim data space is being driven by deep learning teams like PyTorch (Facebook) and TensorFlow (Google). Aside from a few experimental sklearn methods, NumFOCUS favors tree-based machine learning over deep learning, so it doesn’t really make sense for them to invest in and compete with FB/Google in that space (just because you’re open source doesn’t mean you’re not a product).
Although it’s not what they’re known for, Arrow does have Tensor and Array classes.
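A minimal sketch of that Tensor class, assuming pyarrow is installed:

```python
import numpy as np
import pyarrow as pa

# Wrap an n-dimensional NumPy array in Arrow's Tensor class
ndarray = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
tensor = pa.Tensor.from_numpy(ndarray)

print(tensor.shape)             # (2, 3, 4)
print(tensor.to_numpy().sum())  # round-trips back to NumPy
```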
Looking forward, will we see more Arrow development on ndim formats?
Will other languages see more use in data science?
Arrow places high emphasis on being a language-agnostic, highly performant format.
So if performance is the real goal, will this be an opportunity for other non-Python projects based on high-performance languages in the short term? Or will this make Python more defensible in the long term?
Recently, we’ve seen the rise of Polars, which is written in Rust and whose API many people seem to prefer over the traditional Pandas API.
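A taste of that API, and of Arrow as the common interchange layer, assuming polars and pyarrow are installed:

```python
import polars as pl

# Polars is Arrow-native: its DataFrame is columnar Arrow memory underneath
df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# The expression-style API people tend to prefer over classic Pandas idioms
result = df.filter(pl.col("x") > 1).select("y")

# Hand the same data to anything Arrow-aware, no Python-specific format needed
table = df.to_arrow()
```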
Or will this be a watershed moment for Julia?
🤔 Reflection
Is Arrow to blame for hijacking Pandas?
Is Pandas to blame for wedging itself into the NumPy-based community in the first place?
Personally, and this may be naive, I feel like NumPy is to blame for not adapting.
NumPy should have embraced columnar use cases. I don’t think it would have been that difficult to handle heterogeneously typed data.
Thinking critically, all dimensions of data are manifestations of 2D data.
- 1D is just a 2D table with a single column.
- 3D is just a group of 2D tables.
- 4D is just a group of 3D groups of 2D tables.

If we assume that the column order and typing are uniform in each 2D array (which is almost always the case), then all data is columnar.
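The claim is easy to demonstrate with a reshape:

```python
import numpy as np

# A 4D array (say, a batch of videos) viewed as a stack of 2D tables
arr4d = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Collapse every leading dimension; keep the trailing "columns" axis
as_2d = arr4d.reshape(-1, arr4d.shape[-1])
print(as_2d.shape)  # (24, 5): one tall table with 5 uniform columns
```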
Heterogeneous typing could have been supported by implementing some kind of DataFrame-like view, for example:
- A 3D structure where each dtype is stored in its own 2D array, with operations routed to the right typed block (sketched below).
- Or a low-performance hack where the 2D array is rotated behind the scenes (so that the typed rows become columns), again routing operations accordingly.
Anything!
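To make the first idea concrete, here’s a toy sketch; every name in it is hypothetical, and it’s roughly the per-dtype block trick that Pandas’ internal BlockManager already pulls off on top of NumPy:

```python
import numpy as np

class TypedTable:
    """Toy sketch: group columns by dtype into per-dtype 2D NumPy blocks."""

    def __init__(self, columns):
        by_dtype = {}
        self.route = {}  # column name -> (dtype key, column index in block)
        for name, values in columns.items():
            arr = np.asarray(values)
            cols = by_dtype.setdefault(arr.dtype, [])
            self.route[name] = (arr.dtype, len(cols))
            cols.append(arr)
        # one homogeneous 2D array per dtype: (n_rows, n_cols_of_that_dtype)
        self.blocks = {dt: np.column_stack(cols) for dt, cols in by_dtype.items()}

    def column(self, name):
        dt, i = self.route[name]      # route the operation...
        return self.blocks[dt][:, i]  # ...to the right typed block

t = TypedTable({"age": [36, 45], "score": [0.92, 0.87], "name": ["ada", "grace"]})
print(t.column("name"), t.column("age").dtype)  # strings and int64 coexist
```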
The more I think about it, the more I begin to view Pandas 2.0 as a schism, not only in data science but in Python more broadly.
In my mind, that’s a net negative for the next several years. What makes the Python data science community beautiful is NOT speed. It’s the fact that everything works together without low-level tinkering, and that it arose in a unified open source ecosystem.
However, in the long run, will the change to Arrow prove necessary, given that Python data science reached product maturity years ago and hasn’t done enough about performance?