Adapting from Spark to Dask: what to expect

Until very recently, Apache Spark was the de facto standard framework for batch data processing. For Python developers, diving into Spark is challenging because it requires learning the Java infrastructure, memory management, and configuration management. The multiple layers of indirection also make debugging harder, especially once the PySpark wrapper enters the equation. With Dask emerging as a pure Python framework for parallel computing, Python developers may be looking at it with new hope, wondering whether it could work for them in place of Spark. In this talk, I use a data aggregation example to highlight the important differences between the two frameworks and to make clear how involved the switch may be. Note: in case it is unclear, there is no Java of any kind in this talk. All the code and examples use Python (PySpark).
