Persistence in Virtualitics Predict is handled by the StoreInterface object.
Two types of objects can be saved through the StoreInterface: Assets and Outputs. Both are key components of data persistence, but they have several key differences, outlined below.
Assets
Assets are a form of data persistence that enable information to flow across different Flows.
Assets may be used to pass information between Flows, whereas Outputs may only be accessed within a single Flow.
There are numerous types of Assets, including Datasets and Models.
Assets are saved with a Label, which can be used to look up the Asset from storage. Additionally, Assets are assigned an Alias, which serves as a version identifier to distinguish between multiple Assets that share the same Label.
Finally, Assets automatically save additional metadata, such as when the Asset was created.
Using Assets
Assets are useful when you want to save results, models, or Python objects that are costly to reproduce or recompute. These Assets can then be made available in other Flows and Dashboards.
Asset Examples
Here are some examples of when you may wish to use an Asset:
Example #1: Update a Lookup Table
Suppose you regularly run a Flow that requires a lookup table which is saved as an Asset.
If you need to make updates to the lookup table, you can simply upload the new table and use the same Asset label to replace the existing Asset.
The next time you run your Flow, it can be configured to use this latest version of the lookup table.
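The versioning idea behind this example can be sketched in a few lines of Python. Note that the dict below is only a stand-in for the platform's Label-keyed Asset storage, and the function names are illustrative, not the StoreInterface API:

```python
# Sketch of Example #1: re-uploading a lookup table under the same Label.
# asset_store stands in for the platform's Asset storage, keyed by Label.
asset_store = {}

def upload_lookup_table(label, table):
    # Uploading with an existing Label supersedes the previous Asset,
    # so every later Flow run that fetches by Label sees the new version.
    asset_store[label] = table

def run_flow(label):
    # The Flow fetches whatever table the Label currently points at.
    lookup = asset_store[label]
    return lookup["US"]

upload_lookup_table("country-lookup", {"US": "United States"})
# Later: upload a corrected table under the same Label.
upload_lookup_table("country-lookup", {"US": "United States of America"})
```

Because the Label is the lookup key, no Flow code changes when the table is replaced.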
Example #2: Reusing a Machine Learning Model
Suppose you built a Flow that makes use of a machine learning model that takes roughly five hours to train. You need to frequently use the model for inference with different datasets.
It does not make sense to retrain this model each time you need to use it. Instead, you can save the machine learning model as an Asset (of type Model). Then, in a different Flow, this Model Asset can be fetched and immediately used for inference.
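The train-once, reuse-everywhere pattern can be sketched as below. The Asset class here is a minimal stand-in, and save_asset/get_asset are assumed method names on a duck-typed store (so the sketch is self-contained); in a real Flow you would use the SDK's StoreInterface and Asset classes:

```python
# Sketch of Example #2: persist a trained model as a Model Asset, then
# fetch it from a different Flow for immediate inference.
# Asset is a minimal stand-in; save_asset/get_asset are assumed method
# names, not verified SDK signatures.

class Asset:
    def __init__(self, obj, type, label, name):
        self.obj, self.type, self.label, self.name = obj, type, label, name

def training_flow(store, model):
    # Run once (e.g., the five-hour training job), then persist the model.
    store.save_asset(Asset(model, type="MODEL", label="demand-model", name="v1"))

def inference_flow(store, features):
    # A separate Flow: fetch the saved Model Asset -- no retraining needed.
    asset = store.get_asset(label="demand-model")
    return asset.obj.predict(features)
```

The expensive step (training) runs in one Flow; every other Flow pays only the cost of fetching the Asset.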
As an added benefit, Assets allow for improved governance over modeling and results by providing monitoring options through the Asset Management Console in the Virtualitics AI Platform.
Example #3: Persisting a Custom Python Class Object
Let’s say you create a custom Python class (MyClass). You would like to persist instances of MyClass between Flows. For example, MyClass could be a class that is used to generate/store reports.
To save a MyClass object as an Asset, set type to AssetType.OTHER when instantiating the Asset. Refer to the example code below to see how to do this.
my_class_obj = MyClass()
my_asset = Asset(my_class_obj, type=AssetType.OTHER, label="my label", name="my name")
Creating Assets
To add and access Assets:
- In the left navigation bar in Virtualitics Predict, click Assets. This will open the Asset Management Console.
- If you do not yet have Assets, a button will appear to Upload an Asset.
- Enter a Label for your Asset, which can help categorize Assets of similar nature or scope.
- Enter a unique Name for your Asset.
- Enter a Description.
- Choose an Asset Type to further help categorize your Asset.
- Upload an Asset File.
- Click Create.
It is also possible to save intermediate objects (datasets, models) in a Flow Step using Outputs.
Reviewing Assets
To review any Assets that have already been created:
- In the left navigation bar in Virtualitics Predict, click Assets. This will open the Asset Management Console.
- By default, you will be able to see All Assets. Assets are listed in a table with relevant identifiers to help you differentiate. Alternatively, click My Assets to view any Assets that you've created.
- For more information about each Asset, click the Expand button to the left of the Asset name.
- Each Asset also has an associated Actions button. Click the Actions button to download or delete the Asset.
- Customize your dashboard using the buttons above the table:
- Columns: Toggle on or off the display of certain columns
- Filter: Filter down to specific Assets using Columns, Operator, and Value
- Density: Toggle the width and spacing between rows in the table for visibility
- Export: Export a table of your Assets in .csv format
Outputs
Outputs are a form of data persistence that enable information to flow between Steps within a Flow.
Outputs can be considered temporary objects that are only useful within a single Flow.
For example, if you apply a series of preprocessing operations to an input DataFrame in a Flow, you may want to save the resulting DataFrame as an Output (and use it in subsequent Steps) so that you do not have to apply the same set of operations to the input DataFrame multiple times.
You likely would not want to persist the DataFrame outside of the Flow (in which case you would save it as an Asset instead).
Each Output is saved with a name that uniquely identifies the Output.
Using Outputs
Outputs are useful when you have intermediate objects in a Flow (i.e., objects created between the Flow's input and output that are used by more than one Step).
If you have multiple data processing stages in a Flow (e.g., data cleaning, feature engineering, modeling), we recommend implementing each stage as an independent Step in the Flow.
To persist objects between Steps in the Flow, save/load objects as Outputs.
Output Example
Here’s an example of when you may wish to use an Output:
- Let’s say you have written a Flow that consists of the following Steps:
  - Data Load
  - Data Processing
  - Feature Engineering
  - Modeling
  - Report Generation (Results)
- Assume that your input data is a CSV that is loaded/saved as a DataFrame.
- At the end of each of the intermediate Steps (“Data Processing”, “Feature Engineering”, “Modeling”), you have a new DataFrame that you would like to use as input to the subsequent Step.
- You would save each DataFrame using the StoreInterface.save_output() method.
- You would load a DataFrame in the subsequent Step using the StoreInterface.get_input() method.
- These DataFrames would not be accessible outside of the current Flow you are running.
Code Example
Refer to the code below to see how to use Outputs within Flow Steps.
import pandas as pd

from predict_backend.flow.step import Step
from predict_backend.store.store_interface import StoreInterface


def apply_preprocessing(data: pd.DataFrame) -> pd.DataFrame:
    """Apply preprocessing operations.

    :param data: Input DataFrame
    :return: DataFrame containing preprocessed data
    """
    # Apply preprocessing operations (pp_df is the name of the
    # resulting DataFrame)
    ...
    return pp_df


class DataUpload(Step):
    def run(self, flow_metadata):
        pass


class CreateOutput(Step):
    def run(self, flow_metadata):
        # Get store interface/current page
        store_interface = StoreInterface(**flow_metadata)
        page = store_interface.get_page()

        # Get data (assume it was uploaded by the user in a previous step;
        # data_upload_step is the instantiated DataUpload Step)
        data = store_interface.get_element_value(
            data_upload_step.name,
            "Example Dataset"
        )

        # Apply preprocessing operations
        pp_data = apply_preprocessing(data)

        # Save pp_data as output from current step
        store_interface.save_output(pp_data, "Preprocessed Data")
        ...

        # Update page
        store_interface.update_page(page)


class UseOutputAsInput(Step):
    def run(self, flow_metadata):
        # Get store interface/current page
        store_interface = StoreInterface(**flow_metadata)
        page = store_interface.get_page()

        # Get pp_data from previous step
        pp_data = store_interface.get_input("Preprocessed Data")
        ...

        # Update page
        store_interface.update_page(page)


# Add code for Flow below this line
For more information about Assets and Outputs, see the Virtualitics SDK documentation.