Principal Component Analysis (PCA) is a data analysis technique for dimensionality reduction, frequently used when a dataset has a very large number of features. The first principal component captures the largest possible amount of variance in the dataset; each successive component captures the largest possible variance not accounted for by the previous components. In other words, PCA transforms the data into new features that are uncorrelated with one another, each formed as a linear combination of the input features.
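To make the idea concrete, here is a minimal sketch of PCA using NumPy's singular value decomposition. It illustrates the general technique and is not a description of how VIP computes components internally; the function name `pca` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components=3):
    """Illustrative PCA: returns (scores, components, explained_variance_ratio)."""
    X = np.asarray(X, dtype=float)
    # Center each feature so that variance is measured around the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Each score column is a linear combination of the centered input features.
    scores = Xc @ components.T
    variance = S**2 / (X.shape[0] - 1)
    ratio = variance[:n_components] / variance.sum()
    return scores, components, ratio

# Synthetic data (an assumption for this sketch): 5 features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

scores, components, ratio = pca(X, n_components=3)
print(ratio)  # fraction of total variance captured by each component
print(np.round(np.corrcoef(scores, rowvar=False), 6))  # off-diagonals are ~0
```

The correlation matrix of the scores is numerically the identity, which is the "uncorrelated new features" property described above.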
PCA in VIP
To access PCA, click on the PCA icon in the toolbar. Drag the features you would like to include in the routine into the Features field, or click on Input All to use all the features. To remove a feature from the list, hover over it and click on the X; to remove all features with many missing values, click on the “Remove sparse features” button next to Input All. Then enter the number of principal components you would like to compute. This is set to 3 by default to allow intuitive spatial visualization. PCA accepts only numerical features and does not allow features whose values are all the same.
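If you prepare data outside VIP, the same input rules can be checked ahead of time. The sketch below is a hedged illustration only: the helper `usable_features` and the `max_missing_fraction` threshold of 0.5 are hypothetical, and the threshold VIP uses for "Remove sparse features" is not stated here.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def usable_features(columns, max_missing_fraction=0.5):
    """Return names of features that are numerical, not sparse, and not constant.

    columns: dict mapping feature name -> list of values.
    max_missing_fraction: hypothetical sparsity threshold (an assumption).
    """
    keep = []
    for name, values in columns.items():
        present = [v for v in values if not is_missing(v)]
        if not present:
            continue  # nothing to analyze
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
            continue  # PCA works with numerical features only
        if len(values) - len(present) > max_missing_fraction * len(values):
            continue  # too many missing values ("sparse")
        if min(present) == max(present):
            continue  # all values are the same
        keep.append(name)
    return keep

columns = {
    "height": [1.7, 1.8, 1.6, 1.9],
    "label": ["a", "b", "c", "d"],
    "constant": [5, 5, 5, 5],
    "sparse": [1.0, None, None, None],
    "weight": [60, 72, None, 80],
}
selected = usable_features(columns)
print(selected)
```

Here only `height` and `weight` pass all three checks.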
Clicking Run computes the principal components; note that rows with missing values are ignored. VIP then suggests two different plots. The first is based on the principal components. The second is based on the most relevant original features among those used as input, choosing the most relevant feature for each of the first three components. Each component has a score corresponding to the variance in the input variables that the component captures. Click Apply to visualize the first 3 components, or the most relevant original features, on X, Y and Z. If you have a feature on Color, it will persist in the new visualization; all other dimensions will be reset when you click Apply.
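The following sketch mirrors these steps outside VIP: it drops rows with missing values, computes a variance score per component, and picks one original feature per component. Two points are assumptions rather than documented VIP behavior: the score is shown as a fraction of total variance, and "most relevant" is taken to mean the feature with the largest absolute weight in the component. The function name `run_pca_summary` and the data are hypothetical.

```python
import numpy as np

def run_pca_summary(X, feature_names, n_components=3):
    """Return (rows_used, variance_ratio, loadings, top_feature_per_component)."""
    X = np.asarray(X, dtype=float)
    # Ignore any row that contains a missing value (NaN).
    complete = X[~np.isnan(X).any(axis=1)]
    Xc = complete - complete.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Score per component: share of total variance it captures.
    ratio = (S**2 / np.sum(S**2))[:n_components]
    loadings = Vt[:n_components]
    # Assumed relevance rule: largest absolute weight within each component.
    top = [feature_names[i] for i in np.argmax(np.abs(loadings), axis=1)]
    return complete.shape[0], ratio, loadings, top

# Synthetic data (an assumption): four independent features with different spreads.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) * np.array([10.0, 3.0, 1.0, 0.1])
X[3, 1] = np.nan   # two rows with missing values
X[10, 2] = np.nan

n_used, ratio, loadings, top = run_pca_summary(X, ["a", "b", "c", "d"])
print(n_used, np.round(ratio, 3), top)
```

This sketch only centers the features and does not rescale them, so features with larger spreads dominate the first components. Under the assumed relevance rule, the same original feature could also be selected for more than one component.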
You can select individual components to view the relative importance of each input variable for that component.
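One common way to express the relative importance of each input variable for a component is through its squared weight in that component, normalized to sum to 1. The sketch below uses that definition as an assumption; it is not necessarily the measure VIP displays, and the weights and feature names are made up for the example.

```python
def relative_importance(loadings, feature_names):
    """Rank features by their share of squared weight in one component."""
    total = sum(w * w for w in loadings)
    shares = {name: (w * w) / total for name, w in zip(feature_names, loadings)}
    # Sort from most to least important.
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)

# Hypothetical weights of three input variables in a single component.
ranking = relative_importance([0.8, -0.6, 0.0], ["age", "income", "score"])
for name, share in ranking:
    print(f"{name}: {share:.2f}")
```

The sign of a weight indicates direction only, which is why the ranking uses squared values.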