Strongly typed datasets in a weakly typed world



We at Blue Yonder use Pandas quite a lot in our daily data science and engineering work. This choice, together with Python as the underlying programming language, gives us flexibility, a feature-rich interface, and access to a large community and ecosystem. When it comes to persisting data and exchanging it with different software stacks, we rely on Parquet datasets (Hive tables). During the write process, there is a shift from a rather weakly typed world to a strongly typed one. For example, Pandas may convert integers to floats in many operations without asking, but Parquet files and the schema information stored alongside them dictate very precise types. The type situation may get even more 'colorful' when datasets are written by multiple code versions or different software solutions over time. This raises important questions regarding type compatibility.

This talk will first present an overview of types at different layers (such as NumPy, Pandas, Arrow and Parquet) and the transitions between these layers. The second part of the talk will present type compatibility cases we have seen, and why and how we think they should be handled. At the end there will be a Q&A, which can be seen as the start of a potentially longer RFC process to align different software stacks (such as Hive and Dask) so that they handle types in a similar way.