Until very recently, Apache Spark has been a de facto standard choice of a framework for batch data processing. For Python developers, diving into Spark is challenging, because it requires learning the Java infrastructure, memory management, configuration management. The multiple layers of indirection also make it harder to debug things, especially when throwing the Pyspark wrapper into the equation.
With Dask emerging as a pure Python framework for parallel computing, Python developers might be looking at it with new hope, wondering if it might work for them in place of Spark. In this talk, I’m using a data aggregation example to highlight the important differences between the two frameworks, and make it clear how involved the switch may be.
Note: Just in case it's unclear, there's no Java of any kind in this talk. All the code / examples use Python (PySpark).
I am looking for editors/curators to help with branches of the tree. Please send me an email if you are interested.