DataFrames: scaling up and out

Loading

Follow to receive video recommendations   a   A


By Ondřej Kokeš

DataFrames have become ubiquitous when it comes to fast analyses of complex data. They go beyond SQL by not adhering to a strict schema and offer a rich API, where you chain methods, which fosters exploratory analytics. While newcomers to Python usually learn about pandas early on, they sometimes struggle as their underlying data grow in size. Given the in-memory nature of pandas' storage system, one can usually only scale up. I'd like to outline several workflows for adapting to the ever-increasing size of datasets: Changing application logic to handle streams rather than loading the whole dataset into memory. Actually scaling up – locally by buying more memory and/or faster disk drives, or by deploying servers in the cloud and SSH tunneling to remote Jupyter instances. Scaling your data source and utilizing pandas' SQL connector. This will help in other areas as well (e.g. direct connections in BI). Using a distributed DataFrame engine – Dask or PySpark. These scale from laptops to large clusters, using the very same API the whole way through. I will cover the various differences between these approaches and will outline their set of upsides (e.g. scaling and performance) and downsides (DevOps difficulties, cost).



Editors Note:

I would like to work with open source projects to create a branch of the tree with all of the best videos for your open source project. Please send me an email if you are interested.