We discuss our experience with dimension reduction on large datasets. We quantify how the performance of our public-sentiment models degrades, in a controlled way, under transformations that reduce the number of features in the dataset. This feature reduction speeds up our real-time data science tools and helps counter the curse of dimensionality. We outline the Python workflow that both produces these transformations and validates their quality at scale in the AWS ecosystem, and we detail our programming and design choices, touching on the scikit-learn API, configuration versus code, SQL templatization, and our open-source API client.
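The core validation idea described above can be sketched with scikit-learn. This is a minimal, illustrative example, not the production workflow: the model, dataset, dimensions, and tolerance are all hypothetical stand-ins, and the real pipeline runs at scale on AWS.

```python
# Sketch: measure how much model quality drops when a dimension-reduction
# transformation shrinks the feature space. All names and numbers here are
# illustrative, not the production configuration.
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a high-dimensional sentiment feature matrix.
X, y = make_classification(
    n_samples=500, n_features=100, n_informative=20, random_state=0
)

# Baseline: model quality on the full feature set.
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()

# Reduced pipeline: project to far fewer components, then fit the same model.
reduced = make_pipeline(
    TruncatedSVD(n_components=20, random_state=0),
    LogisticRegression(max_iter=1000),
)
reduced_score = cross_val_score(reduced, X, y, cv=3).mean()

# Accept the transformation only if the quality drop stays within a tolerance.
drop = baseline - reduced_score
print(f"baseline={baseline:.3f} reduced={reduced_score:.3f} drop={drop:.3f}")
```

Wrapping the reduction and the estimator in a single pipeline is what makes this validation honest: the projection is refit on each training fold, so the cross-validated score of the reduced model is directly comparable to the baseline.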