What Is It?
Principal Component Analysis (PCA) is a dimensionality reduction technique that is frequently used when a dataset has a very large number of features. PCA builds a small set of new features, called principal components, as linear combinations of the original features. The components are chosen to capture as much of the variance in the dataset as possible.
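To make the idea concrete, here is a minimal sketch of PCA in Python with NumPy. It is not the application's implementation. The function name `pca` and the synthetic dataset are assumptions made for illustration. The sketch centers the data, finds the principal directions with a singular value decomposition, and reports the fraction of variance each component captures.

```python
import numpy as np

def pca(X, n_components=3):
    """Project X (rows = samples, columns = features) onto its top components.

    Illustrative helper only; the name and signature are hypothetical.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = Xc @ Vt[:n_components].T            # data in component space
    return scores, Vt[:n_components], ratio[:n_components]

# Synthetic data (an assumption): 10 features driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

scores, components, ratio = pca(X, n_components=3)
print(scores.shape)   # (200, 3)
print(ratio.round(3)) # share of variance captured by each component
```

Because only two hidden factors drive this synthetic dataset, the first two components should account for nearly all of the variance, which is the situation in which reducing many columns to a few components loses little information.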
Why Is This Important?
PCA is important for reducing the complexity of a problem by condensing information from many different features into just a few representative features. Using PCA, you can arrive at insights even for extremely complex datasets. One example is an IoT use case in which you examine data from thousands of different sensors to assess the efficiency of a system. PCA would help by transforming your dataset of thousands of columns into just a few components so that you can rapidly characterize the system's inefficiencies. PCA also makes it easier to analyze gene expression data, for example to understand which genes out of hundreds are expressed by different common pathogens.
How?
Unlike Smart Mapping, which is a supervised machine learning routine, PCA is an unsupervised machine learning routine. This means that you do not need a target feature in mind when running PCA, which can be very helpful for getting a broader understanding of the trends and relationships in your data.
Steps for running PCA:
- Open the PCA panel by clicking the icon in the toolbar or selecting PCA from the Data Analytics menu.
- By default, PCA will look for 3 principal components in your data, since the first 3 components often capture much of the variance in large datasets.
- Tip: You can adjust the number of components by entering a number from 1 to 10.
- Select the features on which you would like to run PCA by either dragging and dropping them into the Add Features area or using the Input All button.
- (Optional) Remove unwanted features by hovering over the feature and clicking the "x" button.
- Tip: Removing sparse features (with many missing values) will produce more complete results. Click the red pound sign to remove sparse features.
- Click the "Run" button.
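For readers who want to reproduce a similar workflow outside the application, the steps above can be sketched in Python with NumPy. Everything here is an assumption made for illustration: the function `run_pca_workflow`, the `max_missing_fraction` threshold used to decide what counts as a sparse feature, the choice to drop rows that still contain missing values, and the standardization step. None of it is taken from the application's internals.

```python
import numpy as np

def run_pca_workflow(data, feature_names, selected,
                     n_components=3, max_missing_fraction=0.2):
    """Hypothetical stand-in for the panel steps: select, prune, run."""
    if not 1 <= n_components <= 10:
        raise ValueError("n_components must be between 1 and 10")

    # Select the features to include (the "Add Features" step).
    cols = [feature_names.index(name) for name in selected]
    X = np.asarray(data, dtype=float)[:, cols]

    # Remove sparse features: too many missing values (assumed threshold).
    keep = np.isnan(X).mean(axis=0) <= max_missing_fraction
    kept_names = [name for name, k in zip(selected, keep) if k]
    X = X[:, keep]

    # Assumption: drop any row that still has a missing value.
    X = X[~np.isnan(X).any(axis=1)]

    # Assumption: standardize so no feature dominates through its units.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Run PCA via SVD and keep the requested number of components.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    return X @ Vt[:k].T, kept_names

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 5))
data[:60, 4] = np.nan   # feature "e" is 60% missing, so it counts as sparse
data[0, 0] = np.nan     # a single missing value in feature "a"
names = ["a", "b", "c", "d", "e"]

scores, kept = run_pca_workflow(data, names, selected=names, n_components=3)
print(kept)          # ['a', 'b', 'c', 'd']
print(scores.shape)  # (99, 3)
```

Dropping the sparse feature first keeps 99 of the 100 rows. Dropping incomplete rows without removing it would have discarded 60 rows, which illustrates why removing sparse features produces more complete results.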
Additional Details
A visualization will be generated to show you the first three principal components on the X, Y and Z axes. You will also find a visualization of the key features identified as strongly influencing the first three principal components by toggling through the Suggested Mappings.
You can view the aggregate ranking of feature importance by selecting the options menu at the top of the PCA panel and selecting "Show Feature Importance". This gives an idea of which features contribute the most to differences between points in your dataset.
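The exact formula behind the aggregate ranking is not stated here. One common heuristic, shown below purely as an assumption, sums each feature's absolute loadings across the retained components and weights each component by its share of the total variance. The function name `aggregate_importance` and the synthetic data are hypothetical.

```python
import numpy as np

def aggregate_importance(X, feature_names, n_components=3):
    """Rank features by variance-weighted absolute loadings (a heuristic)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    weights = S[:k]**2 / np.sum(S**2)          # variance share per component
    importance = np.abs(Vt[:k]).T @ weights    # one score per feature
    order = np.argsort(importance)[::-1]       # highest score first
    return [(feature_names[i], float(importance[i])) for i in order]

# Synthetic data (an assumption): four independent features whose spread
# differs by design, so "f0" varies the most and "f3" the least.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) * np.array([10.0, 1.0, 1.0, 0.1])
names = ["f0", "f1", "f2", "f3"]

ranking = aggregate_importance(X, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this unscaled example the feature with the widest spread ranks first, and the nearly constant feature ranks last. If features are standardized beforehand, the ranking instead reflects correlation structure rather than raw scale.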
To view the feature importance per principal component, select the options menu at the top of the PCA panel, hover over "Show Importance By Component", then select the component you would like to investigate. Successive components capture the largest possible variance not accounted for by previous components.
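As a rough illustration of what a per-component view might contain, the sketch below ranks features by the absolute value of their loading on a single component, and checks that component variances never increase from one component to the next. Using absolute loadings as "importance" is an assumption, as are the helper names `fit_components` and `importance_by_component` and the synthetic data.

```python
import numpy as np

def fit_components(X):
    """Return principal directions (rows of Vt) and their variances."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt, S**2 / (X.shape[0] - 1)

def importance_by_component(Vt, feature_names, component):
    """Rank features by absolute loading on one component (1-based)."""
    loadings = np.abs(Vt[component - 1])
    order = np.argsort(loadings)[::-1]
    return [(feature_names[i], float(loadings[i])) for i in order]

# Synthetic data (an assumption): f0 and f1 share one strong signal,
# f2 is an independent moderate signal, f3 is nearly constant.
rng = np.random.default_rng(3)
base = rng.normal(size=500)
X = np.column_stack([
    3 * base + 0.1 * rng.normal(size=500),
    3 * base + 0.1 * rng.normal(size=500),
    2 * rng.normal(size=500),
    0.1 * rng.normal(size=500),
])
names = ["f0", "f1", "f2", "f3"]

Vt, variances = fit_components(X)
print(variances.round(2))                        # non-increasing
print(importance_by_component(Vt, names, 1)[:2]) # f0 and f1 lead
print(importance_by_component(Vt, names, 2)[:1]) # f2 leads
```

The first component is dominated by the two correlated features, and the second picks up the independent feature that the first left unexplained. The directions are mutually orthogonal, which is why each component can only describe variance that earlier components did not.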