What Is It?
Anomaly Detection identifies points that are statistical outliers or have extreme values for combinations of features. Use Anomaly Detection to identify rare events, label them within the plot, and notify relevant stakeholders.
Why Is This Important?
Anomaly Detection is especially useful when extreme data points are not visually obvious, which might be the case with datasets containing a large number of features. For example, if you are looking at sensor data to analyze system failures, you could use Anomaly Detection to identify points where several sensors had extreme values that led to a system failure. You could also use Anomaly Detection to automatically flag any clinical testing results that fall outside the acceptable range.
There are two Anomaly Detection routines available in Explore: Threshold-based and Standard Deviation-based. The Threshold method is well-suited for finding data points with extreme values across any number of different features, while the Standard Deviation method will help find data points that are statistical outliers for one or a few features.
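To make the Standard Deviation idea concrete, here is a minimal sketch of one common interpretation: flag a point when any selected feature lies more than a chosen number of standard deviations from that feature's mean. This is not Explore's internal implementation; the function name `std_dev_anomalies`, the `n_std` parameter, and the default of 3 are assumptions made for illustration.

```python
import numpy as np

def std_dev_anomalies(X, n_std=3.0):
    """Flag rows where any feature is more than n_std standard deviations
    from its mean. Hypothetical helper, not part of Explore."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a single feature as one column
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = np.inf  # constant features can never be flagged
    z = np.abs((X - mean) / std)
    # Missing values become z = 0, so they are never flagged on their own
    return (np.nan_to_num(z, nan=0.0) > n_std).any(axis=1)
```

Because each feature is checked independently, this kind of rule works best with one or a few features, which matches the guidance above.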
Steps for using Anomaly Detection:
- Open the Anomaly Detection panel by clicking the icon in the toolbar or selecting Anomaly Detection from the Data Analytics menu.
- Determine which method you would like to use: Threshold or Standard Deviation.
- Add the features you would like to use to determine outliers by dragging and dropping them into the Add Features area or using the Input All button.
- (Optional) Remove unwanted features by hovering over the feature and clicking the "x" button.
- Tip: Remove features used to calculate the target and sparse features (with many missing values). Click the red pound sign to remove sparse features.
- Click the "Run" button.
Anomalies are automatically shown with halos. A new binary feature called “(X) Anomaly Result” is generated, where X corresponds to the number of times you have run Anomaly Detection. If you would like to save this anomaly flag as a new column in your dataset, you can drag the newly created Anomaly Result feature into the Feature List.
To view just the points flagged as anomalies, right-click anywhere on the plot (not on a data point) and select “Show Only Haloed”.
Threshold Method Technical Details
The Threshold method runs PCA (Principal Component Analysis) to generate three principal components from the input features. Each point's distance from the centroid of that three-dimensional PCA space is then computed, and the top N% of points by distance are flagged as anomalies.
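The procedure described above can be sketched outside of Explore as follows. This is a rough illustration, not Explore's actual code: the function name `threshold_anomalies`, the `top_percent` parameter, the standardization step, and the use of Euclidean distance are all assumptions, since the text does not specify them.

```python
import numpy as np

def threshold_anomalies(X, top_percent=5.0, n_components=3):
    """Flag the top_percent of rows farthest from the centroid of a
    3-component PCA projection. Hypothetical helper, not part of Explore."""
    X = np.asarray(X, dtype=float)

    # Assumption: standardize features so no single scale dominates the PCA
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - mean) / std

    # PCA via SVD: rows of vt are the principal directions
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    k = min(n_components, vt.shape[0])
    scores = Z @ vt[:k].T  # coordinates in PCA space

    # Assumption: Euclidean distance from the centroid of the PCA space
    centroid = scores.mean(axis=0)
    dist = np.linalg.norm(scores - centroid, axis=1)

    # Flag the N% of points with the largest distances (at least one point)
    n_flag = max(1, int(np.ceil(len(dist) * top_percent / 100.0)))
    flags = np.zeros(len(dist), dtype=bool)
    flags[np.argsort(dist)[-n_flag:]] = True
    return flags, dist
```

Note that a rule of this kind always flags a fixed share of points, so N acts as a sensitivity setting rather than a statistical test: the returned flags play the role of the binary Anomaly Result feature.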