Missing data can arise from many sources, including human error, technical failures, and data corruption. It is important to address missing values before training ML models, as most algorithms cannot handle them directly; most scikit-learn estimators will raise an error if missing values are present in your training data. Sometimes, with a large enough dataset, we can simply drop the records that contain missing values with little impact on the resulting model, but this isn't always viable. Thankfully, scikit-learn provides several strategies for imputing missing values, allowing practitioners to fill gaps with values estimated from the available data. This recipe introduces three of the most commonly used methods for imputing missing values in a dataset with scikit-learn.
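As a quick preview of the idea, scikit-learn's SimpleImputer can replace missing entries with a column statistic. The following minimal sketch (the data and the choice of the mean strategy are illustrative, not part of this recipe's dataset) shows a NaN being filled with the mean of the observed values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single toy column with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Replace NaNs with the column mean
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled.ravel())  # the NaN becomes the mean of 1, 2, and 4
```

The same fit/transform pattern applies to the other imputers covered later, so an imputer can be dropped into a preprocessing pipeline like any other transformer.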
Getting ready
To begin, we will create a toy dataset composed of random, quantitative data, 10 features, and several missing data values randomly spread throughout. We will...
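A toy dataset matching that description might be generated as follows (a sketch; the row count, random seed, and fraction of missing entries are illustrative assumptions, not values from this recipe):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility (arbitrary choice)
n_rows, n_features = 100, 10

# Random quantitative data with 10 features
X = rng.normal(size=(n_rows, n_features))

# Scatter missing values throughout (~5% of entries; fraction is an assumption)
mask = rng.random(X.shape) < 0.05
X[mask] = np.nan

print(np.isnan(X).sum(), "missing values out of", X.size)
```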