What Is It?
Anomaly Detection identifies points that are statistical outliers or have extreme values for combinations of features. Use Anomaly Detection to identify rare events, label them within the plot, and notify relevant stakeholders.
Why Is This Important?
Anomaly Detection is especially useful when extreme data points are not visually obvious, which is often the case with datasets containing a large number of features. For example, if you are looking at sensor data to analyze system failures, you could use Anomaly Detection to identify points where several sensors had extreme values that led to a system failure. You could also use Anomaly Detection to automatically flag any clinical testing results that fall outside of the acceptable range.
How?
There are two Anomaly Detection routines available in Explore: Threshold-based and Standard Deviation-based. The Threshold method is well-suited for finding data points with extreme values across any number of different features, while the Standard Deviation method will help find data points that are statistical outliers for one or a few features.
Steps for using Anomaly Detection:
- Open the Anomaly Detection panel by clicking the icon in the toolbar or selecting Anomaly Detection from the Data Analytics menu.
- Determine which method you would like to use:
  - Threshold (default)
    - Points are ranked by how far they are from the "average" point, and the top N% (the farthest points) are returned (see Threshold Method Technical Details below).
    - Choose the value of N with the slider. The default threshold is 1%.
  - Standard Deviation
    - Points that fall more than N standard deviations from the mean of the selected input features are labeled as outliers.
    - Choose the value of N from the dropdown, which ranges from 0.5 to 5.
    - If “And” is selected, only points that are more than N standard deviations from the mean for all input features are identified as outliers. Note that rows with missing values are always identified as non-outliers.
    - If “Or” is selected, points that are more than N standard deviations from the mean for one or more of the input features are identified as outliers, even if the point has missing values for some of the input features.
    - You can also select ‘+’, ‘-’, or ‘+/-’ to determine which extremes are considered: upper extremes (‘+’), lower extremes (‘-’), or both (‘+/-’).
- Add the features you would like to use to determine outliers by dragging and dropping them into the Add Features area or using the Input All button.
- (Optional) Remove unwanted features by hovering over the feature and clicking the "x" button.
- Tip: Remove any features that were used to calculate the target, as well as sparse features (those with many missing values). Click the red pound sign to remove sparse features.
- Click the "Run" button.
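To make the Standard Deviation options above more concrete, here is a minimal Python sketch of the rule as described. It is not Explore's implementation: the function name `std_dev_outliers` and its parameters are hypothetical, and the use of the population standard deviation and a strict "greater than N" comparison are assumptions. Missing values are represented as NaN.

```python
import numpy as np

def std_dev_outliers(X, n_std=2.0, mode="or", direction="+/-"):
    """Flag rows more than n_std standard deviations from each feature's mean.

    Hypothetical helper for illustration only (not part of Explore).
    X is a 2-D array of shape (rows, features); NaN marks a missing value.
    """
    X = np.asarray(X, dtype=float)
    mean = np.nanmean(X, axis=0)   # per-feature mean, ignoring missing values
    std = np.nanstd(X, axis=0)     # population std (an assumption)

    with np.errstate(invalid="ignore", divide="ignore"):
        z = (X - mean) / std       # NaN wherever the value is missing
        if direction == "+":
            extreme = z > n_std            # upper extremes only
        elif direction == "-":
            extreme = z < -n_std           # lower extremes only
        elif direction == "+/-":
            extreme = np.abs(z) > n_std    # both extremes
        else:
            raise ValueError("direction must be '+', '-', or '+/-'")

    # Comparisons against NaN are False, so a missing value never counts as
    # extreme. Under "and" a row with any missing value is therefore never an
    # outlier; under "or" it can still be flagged by its observed features.
    if mode == "and":
        return extreme.all(axis=1)
    if mode == "or":
        return extreme.any(axis=1)
    raise ValueError("mode must be 'and' or 'or'")
```

For example, `std_dev_outliers(data, n_std=3, mode="or", direction="+")` would return a boolean array marking rows in which at least one feature lies more than three standard deviations above its mean.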
Additional Details
Anomalies are automatically shown with halos. A new (binary) feature is generated called “(X) Anomaly Result”, where X corresponds to the number of times you have run Anomaly Detection. If you would like to save this anomaly flag as a new column in your dataset, you can drag the newly created Anomaly Result feature into the Feature List.
To view just the points flagged as anomalies, right-click anywhere on the plot (not on a data point) and select “Show Only Haloed”.
Threshold Method Technical Details
The Threshold method runs PCA (Principal Component Analysis) to generate three principal components from the input features. Points are then ranked by their distance from the centroid of that three-dimensional PCA space, and the top N% (the points farthest from the centroid) are returned.
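The procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than Explore's actual code: the function name `threshold_anomalies` is hypothetical, features are standardized before PCA, distance is Euclidean, the number of flagged points is rounded up, and the input is assumed to have no missing values.

```python
import numpy as np

def threshold_anomalies(X, top_percent=1.0, n_components=3):
    """Flag the top_percent of rows farthest from the centroid in PCA space.

    Hypothetical helper for illustration only (not part of Explore).
    Returns (flags, distances) for a 2-D array X with no missing values.
    """
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]

    # Standardize each feature (an assumption; the source does not say
    # whether features are scaled before PCA).
    std = X.std(axis=0)
    std[std == 0] = 1.0            # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std

    # PCA via SVD: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    scores = Z @ Vt[:k].T          # coordinates in the k-dimensional PCA space

    # Euclidean distance of every point from the centroid of the PCA space.
    centroid = scores.mean(axis=0)
    distances = np.linalg.norm(scores - centroid, axis=1)

    # Keep the top N% of points by distance (at least one, rounded up).
    n_flag = max(1, int(np.ceil(n_rows * top_percent / 100.0)))
    flags = np.zeros(n_rows, dtype=bool)
    flags[np.argsort(distances)[-n_flag:]] = True
    return flags, distances
```

With the default `top_percent=1.0`, a 200-row dataset would yield two flagged rows, mirroring the 1% default of the slider.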