Safe Handling Instructions for Missing Data



In machine learning tasks, missing data is commonly handled by simply removing observations with missing values, or by replacing each missing value with the mean of its feature. To show why this is problematic, we apply listwise deletion and mean imputation to artificially created datasets with missing values, and we compare the resulting models against ones fitted with full information. Unless quite strong independence assumptions are met, we observe large biases in the resulting coefficients and an increase in the model's prediction error. We conclude by repeating the experiment on a real dataset and showing the appropriate diagnostic and correction steps for handling missing values. Link to the GitHub repo with the code demonstrated in this video:
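To make the comparison described above concrete, here is a minimal sketch in Python with NumPy. It is not the code from the talk or its repository. The synthetic data (two correlated features, true coefficients of 1.0 each), the 40% missingness rate, and the helper names `fit_ols` and `compare_strategies` are all assumptions chosen for illustration. The sketch hides values of the second feature in two ways, once completely at random and once depending on the outcome, and then fits a linear model after listwise deletion and after mean imputation.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares slopes for y ~ intercept + X (the intercept is dropped)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]


def compare_strategies(missing_mask, X, y):
    """Treat column 1 of X as missing where missing_mask is True.

    Returns the slopes obtained after listwise deletion and after
    mean imputation (hypothetical helper, for illustration only).
    """
    keep = ~missing_mask
    # Listwise deletion: fit only on rows with no missing value.
    listwise = fit_ols(X[keep], y[keep])
    # Mean imputation: fill the gaps with the mean of the observed values.
    X_imputed = X.copy()
    X_imputed[missing_mask, 1] = X[keep, 1].mean()
    mean_imputed = fit_ols(X_imputed, y)
    return listwise, mean_imputed


rng = np.random.default_rng(0)
n = 20_000

# Assumed data-generating process: two features with correlation 0.6,
# and an outcome with true coefficients (1.0, 1.0) plus unit noise.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

full = fit_ols(X, y)

# Scenario A: x2 is missing completely at random in about 40% of rows.
mcar_mask = rng.random(n) < 0.4
# Scenario B: x2 is missing whenever the outcome is in its top 40%.
outcome_mask = y > np.quantile(y, 0.6)

mcar_listwise, mcar_mean = compare_strategies(mcar_mask, X, y)
dep_listwise, dep_mean = compare_strategies(outcome_mask, X, y)

print("full data:                     ", full.round(2))
print("random gaps, listwise deletion:", mcar_listwise.round(2))
print("random gaps, mean imputation:  ", mcar_mean.round(2))
print("outcome-driven gaps, listwise: ", dep_listwise.round(2))
print("outcome-driven gaps, mean imp.:", dep_mean.round(2))
```

Under these assumptions, listwise deletion stays close to the full-data coefficients only when the gaps are completely random. Mean imputation distorts the coefficients even in that case, because the imputed rows break the correlation between the two features, and both strategies drift further when missingness depends on the outcome.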

Editor's Note:

I would like to work with open source projects to create a branch of the tree with all of the best videos for your open source project. Please send me an email if you are interested.