# Anomaly Detection

Anomaly Detection identifies statistical outliers for combinations of features. There are two anomaly detection routines available in VIP: threshold-based and standard-deviation-based.

Threshold-based anomaly detection will find the desired percentage of most extreme values based on the input features. The threshold method works by selecting the points with the largest distance from the top principal components (weighted by the explained variance; see PCA).

In standard-deviation-based anomaly detection, the algorithm identifies data that is more than N standard deviations away from the mean of each input feature. See the advanced controls section below to adjust how the outliers are determined.

#### Threshold Anomaly Detection in VIP

To access threshold-based anomaly detection, click on the icon on the toolbar. Select Threshold as the Method. Next, select the features you would like to use to determine outliers by dragging them into the Features field. You can also click on Input All to use all the features. Hovering on a feature and clicking on the X will let you remove it from the list, and you can remove all features with many missing values by clicking on the “Remove sparse features” button next to Input All. The trash can icon allows you to remove all input features to start fresh. This routine works with numerical features but does not allow features where all the values are the same. Anomalies are automatically shown with halos. When the process runs, it will generate a new (binary) feature which will be listed as “(X) Anomaly Result” where X corresponds to the number of times you’ve run anomaly detection. You can use this new feature to visually identify outliers or simply filter the data.

#### Threshold Advanced Controls

The threshold-based anomaly detection tool ranks the data points by distance from the top principal components for the input features. The **top N%** of points by this distance rank are labeled as outliers. You can choose the value of N with the slider; the default threshold is 1%. Please note that rows with missing values are excluded before outliers are identified, so the selected percentage is applied only to the points that remain after those rows are removed. Please also note that if several points share the same distance rank at the cutoff, the number of anomalies returned may be slightly more than the requested percentage.
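The ranking described above can be illustrated with a short Python sketch. This is not VIP's implementation. It assumes one plausible reading of "distance from the top principal components": standardize the features, project onto the top components, and weight each squared coordinate by that component's explained-variance ratio. The function name `threshold_anomalies` and the `n_components` parameter are hypothetical.

```python
import numpy as np

def threshold_anomalies(X, percent=1.0, n_components=2):
    """Flag roughly the top `percent`% of complete rows by a PCA-based score.

    Illustrative only: the scoring formula is an assumption, not VIP's code.
    """
    X = np.asarray(X, dtype=float)
    # Rows with missing values are excluded before ranking.
    complete = ~np.isnan(X).any(axis=1)
    Xc = X[complete]
    # Standardize each feature (a constant feature would give std == 0).
    Z = (Xc - Xc.mean(axis=0)) / Xc.std(axis=0)
    # PCA via SVD of the standardized data.
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    k = min(n_components, len(s))
    ratio = s ** 2 / np.sum(s ** 2)          # explained-variance ratio
    proj = Z @ vt[:k].T                      # coordinates on the top k components
    score = np.sqrt((proj ** 2 * ratio[:k]).sum(axis=1))
    # Requested count; ties at the cutoff are kept, so the result can
    # slightly exceed the requested percentage.
    n_flag = max(1, int(np.ceil(len(score) * percent / 100.0)))
    cutoff = np.sort(score)[-n_flag]
    flags = np.zeros(len(X), dtype=bool)
    flags[complete] = score >= cutoff
    return flags

# 99 ordinary rows, one extreme row, and one row with a missing value.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=(99, 3)),
                  [50.0, 50.0, 50.0],
                  [np.nan, 0.0, 0.0]])
flags = threshold_anomalies(data, percent=1.0)
print(np.flatnonzero(flags))
```

In this example the 1% threshold is applied to the 100 complete rows only, so the row with the missing value can never be flagged.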

#### Standard Deviation Anomaly Detection in VIP

To access standard-deviation-based anomaly detection, click on the icon on the toolbar. Select Standard Deviation as the Method. Next, select the features you would like to use to determine outliers by dragging them into the Features field. You can also click on Input All to use all the features, although this is not recommended for the Standard Deviation method; Input All is better suited to the Threshold method described above. Hovering on a feature and clicking on the X will let you remove it from the list, and you can remove all features with many missing values by clicking on the “Remove sparse features” button next to Input All. The trash can icon allows you to remove all input features to start fresh. This routine works with numerical features but does not allow features where all the values are the same. Anomalies are automatically shown with halos. When the process runs, it will generate a new (binary) feature which will be listed as “(X) Anomaly Result” where X corresponds to the number of times you’ve run anomaly detection. You can use this new feature to visually identify outliers or simply filter the data.

#### Standard Deviation Advanced Controls

Data that falls more than N standard deviations from the mean for the selected input features will be labeled as outliers. You can choose the value of N by using the dropdown, which ranges from 0.5 to 5. The default value of N is 2.

When selecting features to include in the Anomaly Detection, you can choose “And” or “Or” to determine the style of outlier detection. Select “And” to identify outliers that are outside the Nth standard deviation for **all** input features. If “And” is selected, rows with missing values will always be identified as non-outliers. Select “Or” to identify outliers that are outside the Nth standard deviation for **any** input feature, even if that row contains missing values in other columns.

You can also select ‘+’, ‘-’, or ‘+/-’ to determine what side of the distributions to pull from when determining the outliers. When ‘+’ is selected, the data that is **above** the Nth standard deviation for the input features will be labeled as outliers. When ‘-’ is selected, the data that is **below** the Nth standard deviation for the input features will be labeled as outliers. When ‘+/-’ is selected, data that is **above or below** the Nth standard deviation for the input features will be labeled as outliers.
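The interaction between N, the “And”/“Or” choice, and the ‘+’, ‘-’, ‘+/-’ choice can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not VIP's code: the function name `std_anomalies` and its parameters are hypothetical, and it assumes the mean and (population) standard deviation of each feature are computed while ignoring missing values.

```python
import numpy as np

def std_anomalies(X, n=2.0, mode="or", side="+/-"):
    """Flag rows lying more than `n` standard deviations from the mean.

    mode: "and" requires every feature to be outlying, "or" requires any.
    side: "+" (above only), "-" (below only), or "+/-" (either side).
    Illustrative only; the NaN handling is an assumption.
    """
    X = np.asarray(X, dtype=float)
    mean = np.nanmean(X, axis=0)   # per-feature mean, ignoring missing cells
    std = np.nanstd(X, axis=0)     # per-feature std, ignoring missing cells
    z = (X - mean) / std
    with np.errstate(invalid="ignore"):
        if side == "+":
            out = z > n
        elif side == "-":
            out = z < -n
        else:
            out = np.abs(z) > n
    # A comparison against NaN is False, so a missing cell never counts as
    # outlying: "and" rejects rows with missing values, "or" can still flag them.
    return out.all(axis=1) if mode == "and" else out.any(axis=1)

nan = np.nan
# Rows 0-8 and 11 are ordinary, row 9 is high in both features,
# row 10 is high in the first feature and missing in the second.
data = np.array([[1.0, 1.0]] * 9 + [[10.0, 10.0], [10.0, nan], [1.0, 1.0]])

or_rows = np.flatnonzero(std_anomalies(data, n=2, mode="or"))
and_rows = np.flatnonzero(std_anomalies(data, n=2, mode="and"))
low_rows = np.flatnonzero(std_anomalies(data, n=2, mode="or", side="-"))
print(or_rows, and_rows, low_rows)
```

With this toy data, “Or” flags rows 9 and 10, “And” flags only row 9 because row 10 has a missing value, and ‘-’ flags nothing because no value lies far below its feature mean.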