What Is It?
Clustering is an unsupervised machine learning technique that groups data points by the similarity of their numerical feature values. The computed clusters are automatically mapped to the Color dimension and can be saved as a feature.
Why Is This Important?
Use Clustering to quickly reveal distinct groups within your data. Clustering is highly applicable in a variety of use cases:
- Identify customer segments within sales data.
- Categorize new products based on a set of product features to price products for the market.
- Characterize target patient populations for a new health program.
- Categorize IT tickets to identify common problems and spot opportunities to automate.
- Improve the performance of recommender systems.
How?
Explore uses the k-means algorithm for Clustering. Clustering works with numerical features only and does not allow constant features (features where every value is the same), since a constant feature contributes nothing to the distance between points.
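For readers who want to see the idea outside of Explore, here is a minimal sketch of k-means in plain Python. It is not Explore's implementation. The function names, the sample data, and the choice to seed centroids from the first k rows (real implementations usually seed randomly or with k-means++) are assumptions made for illustration. The sketch also filters out constant features before clustering, which is one way to satisfy the restriction described above.

```python
def drop_constant_features(rows):
    """Keep only the columns whose values are not all identical."""
    keep = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
    return [[r[j] for j in keep] for r in rows], keep


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=100):
    """Basic k-means (Lloyd's algorithm). Returns (labels, centroids)."""
    # Assumption: seed centroids from the first k rows so the run is repeatable.
    centroids = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(r, centroids[c]))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical data: the third column is constant and is filtered out.
points = [
    [1.0, 1.1, 5.0], [0.9, 1.0, 5.0], [1.2, 0.8, 5.0],
    [8.0, 8.2, 5.0], [8.1, 7.9, 5.0], [7.9, 8.0, 5.0],
]
usable, kept = drop_constant_features(points)
labels, centroids = kmeans(usable, k=2)
print(kept)    # indices of the columns that were kept
print(labels)  # one cluster label per row
```

The first three rows end up in one cluster and the last three in the other. The label numbers themselves are arbitrary.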
Steps for running Clustering:
- Open the Clustering panel by clicking the icon in the toolbar or selecting Clustering from the Data Analytics menu.
- (Optional) Set the number of clusters. By default, the Clustering routine in Explore automatically determines the best number of clusters using the elbow method. To specify an exact number instead, enter a value between 2 and 16 in the Number of Clusters input.
- Select the features you would like to use to determine the clusters by either dragging and dropping them into the Add Features area or using the Input All button.
- (Optional) Remove unwanted features by hovering over the feature and clicking the "x" button.
- Tip: Removing sparse features (those with many missing values) produces more complete results. To remove sparse features, click the red pound sign.
- Click the "Run" button.
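The elbow method mentioned in the steps above can be sketched as follows. The exact criterion Explore uses is not documented here, so this shows one common heuristic: compute the within-cluster sum of squares (inertia) for each candidate number of clusters, then pick the point on that curve that lies farthest from the straight line joining its two ends. The function names and the inertia values are hypothetical. The 2 to 16 range comes from the Number of Clusters input.

```python
MIN_CLUSTERS, MAX_CLUSTERS = 2, 16  # range accepted by the Number of Clusters input


def pick_elbow(inertias, k_min=MIN_CLUSTERS):
    """Return the k whose point lies farthest from the chord joining
    the first and last points of the inertia curve.

    inertias[i] is the inertia obtained with k = k_min + i clusters.
    """
    n = len(inertias)
    if n < 3:
        return k_min  # too few points to form an elbow
    x1, y1 = 0, inertias[0]
    x2, y2 = n - 1, inertias[-1]
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(inertias):
        # Distance from (i, y) to the chord, up to a constant factor
        # that does not affect which point is farthest.
        distance = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if distance > best_distance:
            best_index, best_distance = i, distance
    return k_min + best_index


def resolve_cluster_count(inertias, requested=None):
    """Use the requested count if one is given, otherwise fall back to the elbow."""
    if requested is None:
        return pick_elbow(inertias)
    if not MIN_CLUSTERS <= requested <= MAX_CLUSTERS:
        raise ValueError(
            f"Number of clusters must be between {MIN_CLUSTERS} and {MAX_CLUSTERS}"
        )
    return requested


# Hypothetical inertia values for k = 2, 3, ..., 8.
inertias = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0, 85.0]
print(resolve_cluster_count(inertias))               # -> 4
print(resolve_cluster_count(inertias, requested=5))  # -> 5
```

In this example the curve drops steeply until k = 4 and then flattens, so 4 is chosen as the elbow.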
Additional Details
Each run generates a new feature called "(X) Cluster Result", where X is the number of times you have run Clustering, and the computed clusters are automatically mapped to the Color dimension. To save the resulting clusters as a new column in your dataset, drag the newly created Cluster Result feature into the Feature List.
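If you work with an exported copy of the data outside of Explore, the same naming convention can be reproduced by hand. The sketch below assumes a table stored as a dictionary of column name to list of values. That representation and the function name are hypothetical and are not part of Explore.

```python
def add_cluster_result(table, labels, run_number):
    """Return a copy of the table with the cluster labels added as a new column.

    The column name follows the "(X) Cluster Result" pattern, where X is
    the run number.
    """
    if any(len(column) != len(labels) for column in table.values()):
        raise ValueError("labels must contain one entry per row")
    name = f"({run_number}) Cluster Result"
    updated = dict(table)  # shallow copy so the original table is not modified
    updated[name] = list(labels)
    return updated, name


# Hypothetical three-row table and the labels from a second Clustering run.
table = {"age": [30, 41, 25], "spend": [120.0, 560.0, 95.0]}
updated, name = add_cluster_result(table, [0, 1, 0], run_number=2)
print(name)           # -> (2) Cluster Result
print(updated[name])  # -> [0, 1, 0]
```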