Clustering is an unsupervised machine learning technique that groups data points by similarity. Different clustering algorithms use different distance functions to measure how similar, or how far apart, two points are.
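To illustrate why the choice of distance function matters, here is a minimal Python sketch comparing two common distances. It is not taken from VIP; the function names and the sample points are made up for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of absolute per-coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical data: one query point and two candidate neighbours.
query = (0.0, 0.0)
candidates = [(3.0, 3.0), (0.0, 5.0)]

nearest_euclidean = min(candidates, key=lambda p: euclidean(query, p))
nearest_manhattan = min(candidates, key=lambda p: manhattan(query, p))

print("nearest by Euclidean distance:", nearest_euclidean)
print("nearest by Manhattan distance:", nearest_manhattan)
```

The two distances disagree about which candidate is closer to the query point, so an algorithm built on one can group the same data differently from an algorithm built on the other.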
Clustering in VIP
The underlying algorithm used in VIP is k-means, a common and highly effective spatial clustering routine. To access Clustering, click on the Clustering icon in the toolbar.
Next, select the number of clusters (between 2 and 16) that you would like to generate. If you are not sure how many to choose, you can leave the field blank and have VIP determine the number of clusters automatically. Each candidate set of clusters is scored using clustering distortion error: the average distance between each data point and the centroid of the cluster it is assigned to.
Next, select the features you would like to use to determine the clusters by dragging them to the Features field. You can also click on Input All to use all the features. To remove a single feature from the list, hover on it and click on the X. To remove all features with many missing values, click on the “Remove sparse features” button next to Input All. The trash can icon removes all input features so that you can start fresh.
Clustering works with numerical features, but it does not accept features whose values are all the same.
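The distortion score described above can be sketched in a few lines of Python. This is an illustrative sketch, not VIP's implementation: the function names, the deterministic farthest-point seeding, and the sample points are all assumptions, and the text does not say how VIP turns the scores into a final cluster count.

```python
import math

def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, max_iter=100):
    """Plain k-means; seeding is farthest-point (an assumption, chosen to be deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Add the point that is farthest from every seed chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move each centroid to the mean of its members; keep it if it has none.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def distortion(points, centroids, labels):
    """Average distance between each point and its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) for p, lab in zip(points, labels)) / len(points)

# Hypothetical data: three well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]

scores = {}
for k in range(2, 5):
    centroids, labels = kmeans(points, k)
    scores[k] = distortion(points, centroids, labels)
    print(f"k={k}: distortion={scores[k]:.3f}")
```

Distortion generally keeps shrinking as clusters are added, so it is usually not minimized outright. A common heuristic, not confirmed for VIP, is to choose the count after which the score stops improving sharply.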
The clusters are calculated automatically, and the generated pie chart shows how many points belong to each cluster. The computed clusters are also mapped to the Color dimension. If you would like to save the clustering result, you can drag the newly created clustering result feature into the Feature List and use it on any of the other dimensions.
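The breakdown shown in the pie chart amounts to counting the points in each cluster. As a rough sketch, assuming the saved clustering result has been exported as a plain list of cluster labels (the list and the function name below are hypothetical):

```python
from collections import Counter

def cluster_breakdown(labels):
    """Map each cluster label to (point count, share of all points)."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: (n, n / total) for cluster, n in sorted(counts.items())}

# Hypothetical cluster labels, one per data point.
labels = [0, 0, 1, 2, 1, 0, 2, 0]

breakdown = cluster_breakdown(labels)
for cluster, (n, share) in breakdown.items():
    print(f"cluster {cluster}: {n} points ({share:.0%})")
```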