Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy



In early 2016, Apache Arrow established a standard for columnar in-memory analytics to redefine the performance and interoperability of most Big Data technologies. Since then, implementations in Java, C++, Python, GLib, Ruby, Go, JavaScript and Rust have been added. Although Apache Arrow (``pyarrow``) is already known to many Python-Pandas users for reading Apache Parquet files, its main benefit is cross-language interoperability. With feather and PySpark, you can already benefit from this between Python and R or Java via the filesystem or network. While they improve data sharing and remove serialization overhead, data still needs to be copied as it is passed between processes.

In the 0.23 release of Pandas, the concept of ExtensionArrays was introduced. They allow the extension of Pandas DataFrames and Series with custom, user-defined types. The most prominent example is ``cyberpandas``, which adds an IP dtype that is backed by the appropriate representation using NumPy arrays. These ExtensionArrays are not limited to arrays backed by NumPy but can use arbitrary storage as long as they fulfill a certain interface. Using Apache Arrow, we can implement ExtensionArrays that are of the same dtype as the built-in types of Pandas but whose memory management is not tied to Pandas' internal BlockManager. In addition, Apache Arrow has a much wider set of efficient types that we can also expose as an ExtensionArray. These types include a native string type as well as arbitrarily nested types such as ``list of …`` or ``struct of (…, …, …)``.

To show the real-world benefits of this, we take the example of a data pipeline that pulls data from a relational store, transforms it and then passes it into a machine learning model. A typical setup nowadays most likely involves a data lake that is queried with a JVM-based query engine.
The machine learning model is then normally implemented in Python using popular frameworks like CatBoost or TensorFlow.

While these query engines sometimes provide Python clients, their performance is normally not optimized for large result sets. In the case of a machine learning model, we will do some feature transformations and possibly aggregations with the query engine but feed as many rows as possible into the model. This leads to result sets of more than a million rows. In contrast to the Python clients, these engines often come with efficient JDBC drivers that can cope with result sets of this size, but the conversion from Java objects to Python objects in the JVM bridge then slows things down again. In our example, we will show how to use Arrow to retrieve a large result in the JVM and then pass it on to Python without running into these bottlenecks.