Smart Sampling suggests a subsample of data for further analysis. It can take into account similarity across data points, parity with the original dataset, and anomalous data points.
Customers with large datasets want to use all of their data for analysis. However, currently available tools have limitations that prevent this, and there is no standard approach to selecting subsamples of data.
Running Smart Sampling:
- From the Virtualitics Predict Home page, open the Smart Sampling Flow.
- (Optional) Enter a Flow Label and/or Flow Notes.
- Click the Start Flow button.
- Select an existing distributed compute platform connection from the dropdown menu and then click the Update button.
- Select a data source connection from the dropdown menu and then click the Update button.
- Enter a table name or SQL query for the Dataset Connection and then click the Update button.
- The Dataset Preview card appears and updates with the first 100 rows of data.
- Click the Next button to go to the Dataset Filtering page.
- Select the columns to keep from the dropdown menu and then click the Update button.
- Select the columns to filter on from the dropdown menu and then click the Update button.
- Select values for each of the columns chosen in the previous step and then click the Update button.
- The Filtered Dataset Preview card appears and updates with the first 100 rows of data.
- Click the Next button to go to the Sampling Configuration page.
- Select the units for the sample size (number of rows or percent to keep) and then click the Update button.
- Use the slider or enter a value for the desired sample size.
- Select an available algorithm from the dropdown menu and then click the Update button.
- Select any additional algorithm-specific parameters from the available dropdown(s).
- Select the output format(s) for the dataset.
- (Optional) Provide an output name for the dataset, then click the Next button.
- Select an action for each output format that was selected:
  - Download the Dataset: select Cancel or Download.
  - Intelligent Exploration: click the Explore Data button, then Submit.
  - Save to Database: select an output data source connection from the dropdown menu, then click the Update button.
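To make the filtering and sample-size steps above more concrete, here is a minimal pandas sketch of what they amount to conceptually. It is not the Virtualitics Predict implementation: the function names `filter_dataset` and `resolve_sample_size`, the unit labels `"rows"` and `"percent"`, and the rounding rule are all assumptions made for illustration.

```python
import math

import pandas as pd


def filter_dataset(df, keep_columns, filters):
    """Keep rows that match every filter, then keep only the chosen columns.

    `filters` maps a column name to the collection of allowed values.
    (Hypothetical helper; not part of any Virtualitics API.)
    """
    out = df
    for column, allowed in filters.items():
        out = out[out[column].isin(allowed)]
    return out[keep_columns]


def resolve_sample_size(n_rows, value, units):
    """Translate a sample-size setting into a row count.

    `units` is "rows" or "percent" (assumed labels). Percentages are
    rounded up, and the result is clamped to the rows available.
    """
    if units == "rows":
        size = int(value)
    elif units == "percent":
        size = math.ceil(n_rows * value / 100)
    else:
        raise ValueError(f"unknown units: {units!r}")
    return max(0, min(size, n_rows))
```

For example, filtering a table to two regions and asking for 50 percent of the remaining rows would call `filter_dataset(df, ["region", "units"], {"region": ["east", "north"]})` followed by `resolve_sample_size(len(filtered), 50, "percent")`.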
Additional Details
The following algorithms are available:
- Random Sampling: This sampling method is the fastest option. Use this when you want a sample quickly and a random sample is sufficient.
- Spatial Sampling: This sampling method takes longer to run than random sampling because it distributes the sample spatially across the selected columns. Use this when you want to prioritize sample points that provide spatial coverage over the selected columns.
- Validation-Based Sampling: This sampling method takes longer to run than random sampling because it performs custom-built validation of the sample. Use this when you want the sample and the original dataset to have similar relationships to a target (or KPI) and a similar proportion of anomalies.
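The trade-offs between these options can be illustrated with a short Python sketch. This article does not describe how Virtualitics Predict implements any of the algorithms, so everything below is a stand-in: `spatial_sample` uses greedy farthest-point selection as one possible way to get spatial coverage, and `parity_report` uses feature-to-target correlations and a z-score anomaly rule as one possible way to compare a sample with the original dataset. All function names, parameters, and the anomaly definition are assumptions.

```python
import numpy as np
import pandas as pd


def random_sample(df, n, seed=0):
    """Uniform random sample of n rows without replacement."""
    return df.sample(n=n, random_state=seed)


def spatial_sample(df, columns, n, seed=0):
    """Greedy farthest-point selection over min-max scaled columns.

    Assumed stand-in for "spatial coverage": each new point is the row
    farthest from everything picked so far.
    """
    n = min(n, len(df))
    values = df[columns].to_numpy(dtype=float)
    span = values.max(axis=0) - values.min(axis=0)
    span[span == 0] = 1.0  # avoid dividing by zero on constant columns
    scaled = (values - values.min(axis=0)) / span

    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(scaled)))]
    # Distance from every row to its nearest already-chosen row.
    nearest = np.linalg.norm(scaled - scaled[chosen[0]], axis=1)
    nearest[chosen[0]] = -np.inf  # never pick the same row twice
    while len(chosen) < n:
        pick = int(nearest.argmax())
        chosen.append(pick)
        nearest = np.minimum(nearest, np.linalg.norm(scaled - scaled[pick], axis=1))
        nearest[pick] = -np.inf
    return df.iloc[chosen]


def parity_report(df, sample, target, features, z_threshold=3.0):
    """Compare a sample with the full dataset (assumed validation criteria).

    Reports the largest gap in feature-to-target correlation and the share
    of rows whose target lies more than z_threshold standard deviations
    from the full-dataset mean.
    """
    full_corr = df[features].corrwith(df[target])
    sample_corr = sample[features].corrwith(sample[target])
    mean, std = df[target].mean(), df[target].std()

    def anomaly_share(frame):
        return float(((frame[target] - mean).abs() > z_threshold * std).mean())

    return {
        "max_corr_gap": float((full_corr - sample_corr).abs().max()),
        "anomaly_share_full": anomaly_share(df),
        "anomaly_share_sample": anomaly_share(sample),
    }
```

In this sketch the random sample is a single vectorized call, while the spatial sample recomputes distances once per selected row, which mirrors why a coverage-oriented method takes longer. A validation-style workflow could draw several candidate samples and keep the one whose `parity_report` values are closest to the full dataset.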