Persistence
Persistence in the Virtualitics AI Platform is handled by the StoreInterface
object.
Two types of objects may be saved via the StoreInterface: Assets and Outputs. We provide an overview of each and explain the differences between them below.
Assets
Assets are a form of data persistence that enable information to flow across different Flows.
- Assets may be used to pass information between Flows, whereas Outputs may only be accessed within a single Flow.
There are numerous types of Assets, including Datasets and Models.
- Any pickle-able Python object can be turned into an Asset.
Assets are saved with a label, which is effectively a key used to look up the Asset from storage.
Additionally, Assets are assigned an alias, which serves as a version to help distinguish if there are multiple Assets with the same label.
Finally, Assets store additional metadata, such as when the Asset was created.
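For orientation, here is a minimal sketch of how a label, name, and type come together when constructing an Asset in code. The import path and the alias keyword are assumptions for illustration; the constructor form itself matches Example #3 below.
# Sketch only: the import path and the "alias" keyword are assumptions;
# consult the platform API reference for the exact signature.
from predict_backend.validation.asset import Asset, AssetType

lookup_df = ...  # any pickle-able Python object (e.g., a pandas DataFrame)
asset = Asset(
    lookup_df,
    type=AssetType.DATASET,   # assumed enum member for a Dataset Asset
    label="zip code lookup",  # the key used to fetch this Asset later
    name="zip code lookup",
    # alias="v2",             # assumed keyword for the version-style alias
)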
Why are Assets useful and when would I use them?
Assets are useful when you want to save results, models, or Python objects that are costly to reproduce or recompute.
- These can be made available in other Flows and dashboards using Assets.
Here are a few examples of when you may wish to use an Asset:
Example #1: Update a Lookup Table
- Suppose you regularly run a Flow that requires a lookup table which is saved as an Asset.
- If you need to make updates to the lookup table, you can simply upload the new table and use the same Asset label to replace the existing Asset.
- Next time you run your Flow, it can be configured to use this latest version of the lookup table, as sketched below.
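Inside a Step, fetching the latest version by label might look like the following sketch; the get_asset method and the .object attribute are assumed names for illustration, not confirmed StoreInterface API, and "zip code lookup" is a hypothetical label.
# Hedged sketch: get_asset and .object are assumed names.
store_interface = StoreInterface(**flow_metadata)
lookup_asset = store_interface.get_asset(label="zip code lookup")
lookup_df = lookup_asset.object  # the wrapped lookup table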
Example #2: Reusing a Machine Learning Model
- Suppose you built a Flow that makes use of a machine learning model that takes roughly five hours to train.
- You need to frequently use the model for inference with different datasets.
- It does not make sense to retrain this model each time you need to use it.
- Instead, you can save the machine learning model as an Asset (of type Model).
- Then, in a different Flow, this Model Asset can be fetched and immediately used for inference (see the sketch after this example).
- As an added benefit, Assets allow for improved governance over modeling and results by providing monitoring options through the Asset Management Console in the Virtualitics AI Platform.
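A hedged sketch of this save-once, reuse-everywhere pattern follows. AssetType.MODEL, save_asset, get_asset, and the .object attribute are assumed names for illustration, and trained_model/new_data are hypothetical variables.
# In the training Flow: wrap and persist the expensive model (sketch only;
# save_asset is an assumed StoreInterface helper).
model_asset = Asset(trained_model, type=AssetType.MODEL,
                    label="demand model", name="demand model")
store_interface.save_asset(model_asset)

# In a different Flow: fetch the Model Asset and use it for inference.
model_asset = store_interface.get_asset(label="demand model")
predictions = model_asset.object.predict(new_data)  # .object is assumed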
Example #3: Persisting a Custom Python Class Object
- Let’s say you create a custom Python class (MyClass). You would like to persist instances of MyClass between Flows.
- For example, MyClass could be a class that is used to generate/store reports.
- To save a MyClass object as an Asset, set type to AssetType.OTHER when instantiating the Asset. Refer to the example code below to see how to do this.
# MyClass is the custom class defined above; Asset and AssetType are
# provided by the platform (import path may vary by version).
my_class_obj = MyClass()
my_asset = Asset(my_class_obj, type=AssetType.OTHER, label="my label", name="my name")
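To persist my_asset from within a Step, you would then hand it to the StoreInterface. A minimal sketch, assuming a save_asset-style method (the exact method name may vary across platform versions):
store_interface = StoreInterface(**flow_metadata)  # inside a Step's run()
store_interface.save_asset(my_asset)               # assumed method name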
How do I use Assets?
The easiest way to add and access Assets is through the Asset Management Console.
The Asset Management Console in the Virtualitics AI Platform.
Users may upload datasets or models as Assets from the Asset Management Console. Here, a user uploads a lookup table to map zip codes to geo-coordinates. Once this Asset has been uploaded, it may be accessed within any Flow in this Virtualitics AI Platform instance. Assets may also be deleted from the Asset Management Console.
It is also possible to save intermediate objects (datasets, models) in a Flow Step using Outputs.
To learn how to use Assets in your code, see the related knowledge-base articles.
Outputs
Outputs are a form of data persistence that enable information to flow between Steps within a Flow.
- Note that Assets should be used to pass information between Flows.
- Think of Outputs as temporary objects that are only useful within a single Flow.
- For example, if you apply a series of preprocessing operations to an input DataFrame in a Flow, you may want to save the resulting DataFrame as an Output (and use it in subsequent Steps) so that you do not have to apply the same set of operations to the input DataFrame multiple times.
- You likely would not need to persist the DataFrame outside of the Flow (if you did, you would save it as an Asset instead).
- Any pickle-able Python object may be saved as an Output (e.g., a pandas Series/DataFrame, dict, int, or float).
- Each Output is saved with a name that uniquely identifies the Output.
Why are Outputs useful and when would I use them?
Outputs are useful when you have intermediate objects in a Flow (i.e., objects created between the Flow's input and output that are used by more than one Step).
- Recall that if you have multiple data processing stages in a Flow (e.g., data cleaning, feature engineering, modeling), we recommend writing each stage as an independent Step in the Flow.
- To persist objects between Steps in the Flow, save/load objects as Outputs.
Here’s an example of when you may wish to use an Output:
- Let’s say you have written a Flow that consists of the following Steps:
- Data Load
- Data Processing
- Feature Engineering
- Modeling
- Report Generation (Results)
- Assume that your input data is a CSV that is loaded/saved as a DataFrame.
- At the end of each intermediate Step (“Data Processing”, “Feature Engineering”, “Modeling”), you have a new DataFrame that you would like to use as input to the subsequent Step.
- You would save each DataFrame using the StoreInterface.save_output() method.
- You would load a DataFrame in the subsequent Step using the StoreInterface.get_input() method.
- These DataFrames would not be accessible outside of the current Flow you are running.
How do I use Outputs?
Refer to the code below to see how to use Outputs within Flow Steps:
import pandas as pd

from predict_backend.flow.step import Step
from predict_backend.store.store_interface import StoreInterface


def apply_preprocessing(data: pd.DataFrame) -> pd.DataFrame:
    """Apply preprocessing operations.

    :param data: Input DataFrame
    :return: DataFrame containing preprocessed data
    """
    # Apply preprocessing operations here (elided); pp_df is the
    # resulting DataFrame
    ...
    return pp_df


class DataUpload(Step):
    def run(self, flow_metadata):
        pass


class CreateOutput(Step):
    def run(self, flow_metadata):
        # Get store interface/current page
        store_interface = StoreInterface(**flow_metadata)
        page = store_interface.get_page()
        # Get data (assume it was uploaded by the user in a previous step;
        # data_upload_step is the DataUpload Step instance created in the
        # Flow wiring below)
        data = store_interface.get_element_value(
            data_upload_step.name,
            "Example Dataset"
        )
        # Apply preprocessing operations
        pp_data = apply_preprocessing(data)
        # Save pp_data as output from the current Step
        store_interface.save_output(pp_data, "Preprocessed Data")
        ...
        # Update page
        store_interface.update_page(page)


class UseOutputAsInput(Step):
    def run(self, flow_metadata):
        # Get store interface/current page
        store_interface = StoreInterface(**flow_metadata)
        page = store_interface.get_page()
        # Get pp_data from the previous Step
        pp_data = store_interface.get_input("Preprocessed Data")
        ...
        # Update page
        store_interface.update_page(page)


# Add code for Flow below this line
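For completeness, here is a hedged sketch of the Flow wiring that the code above assumes; the Step constructor arguments and the Flow registration call are assumptions for illustration, not confirmed platform API.
# Hypothetical wiring sketch -- constructor arguments and registration
# methods are assumptions; consult the platform API reference.
data_upload_step = DataUpload(...)        # referenced by CreateOutput above
create_output_step = CreateOutput(...)
use_output_step = UseOutputAsInput(...)
# The Steps would then be registered with a Flow object in execution order.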