Using Pandas and Dask to work with large columnar datasets in Apache Parquet


Apache Parquet is an efficient, binary columnar data format. It uses techniques such as row groups, compression of pages within column chunks, and dictionary encoding of columns to store data in a CPU- and I/O-efficient way. Index hints and statistics make it possible to skip quickly over chunks of irrelevant data, which enables efficient queries on large amounts of data.

Apache Parquet files can be read into Pandas DataFrames with either of two libraries, fastparquet and Apache Arrow (pyarrow). While Pandas is mostly used to work with data that fits into memory, Dask allows us to work with data larger than memory and even larger than local disk space. Data can be split up into partitions and stored in cloud object storage systems like Amazon S3 or Azure Storage.

Using metadata from the partition filenames, Parquet column statistics and dictionary filtering allows selective queries to run faster because they do not have to read all of the data. This talk will show how to use partitioning, row group skipping and general data layout to speed up queries on large amounts of data.

