Scikit-learn has traditionally centered its data model around NumPy arrays. However, in an important subset of scikit-learn's use cases, the original data in the machine learning pipeline is tabular: heterogeneously typed and labeled. Meanwhile, pandas has become very popular and is increasingly used to represent such tabular data, but scikit-learn does not always play well with heterogeneous DataFrames.
This talk will give an overview of the challenges and current bottlenecks when working with tabular data in scikit-learn. It will then show the ongoing developments in scikit-learn to improve this situation and highlight some third-party libraries that try to ease those problems.