Introduction to Outlier and Novelty Detection
Practitioners new to ML systems often assume that the data used during training will resemble what the model encounters in production. However, real-world data can contain rare or previously unseen observations (or, as stated previously, malicious data designed to inhibit proper model training). Such observations are typically categorized as either outliers or novelties. Outliers are data points in the training set that deviate significantly from the other observations, while novelties are previously unseen data points that appear only at prediction time. Detecting both is essential for preventing misleading predictions and ensuring robustness, particularly in applications such as fraud detection, industrial monitoring, and medical diagnostics.
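To make the distinction concrete, here is a minimal sketch using scikit-learn's LocalOutlierFactor, which supports both modes. The synthetic data, the variable names, and the choice of n_neighbors=20 are assumptions made purely for illustration; this is one of several estimators that could be used here, not a prescribed method.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical training data: a tight cluster around the origin...
X_clean = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
# ...plus one far-away point that contaminates the training set.
X_train = np.vstack([X_clean, [[8.0, 8.0]]])

# Outlier detection: the estimator is fit on data that may already
# contain outliers and labels those same training points.
outlier_detector = LocalOutlierFactor(n_neighbors=20)
train_labels = outlier_detector.fit_predict(X_train)  # 1 = inlier, -1 = outlier

# Novelty detection: the estimator is fit on training data assumed to be
# clean and is then asked about observations it has never seen.
novelty_detector = LocalOutlierFactor(n_neighbors=20, novelty=True)
novelty_detector.fit(X_clean)
X_new = np.array([[0.1, -0.2], [6.0, -6.0]])  # one typical point, one unusual point
new_labels = novelty_detector.predict(X_new)  # 1 = inlier, -1 = novelty

print("Label of the contaminating training point:", train_labels[-1])
print("Labels of the new observations:", new_labels)
```

Note that with novelty=False (the default) the estimator only exposes fit_predict on the training data, whereas with novelty=True it exposes predict for new data; this mirrors the difference between the two tasks.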
In this recipe, we’ll explore the purpose and context of outlier and novelty detection within ML pipelines. We’ll also introduce scikit-learn tools and algorithms that allow us to identify...