You may need to preprocess your data to improve its quality and readiness for analysis. Many of the functionalities listed below will require loading data into a Jupyter notebook with Python.
The primary tool for processing data tables in Python is a Pandas DataFrame. Data can be read into a Pandas DataFrame from a multitude of data sources:
Click on a topic below to learn more.
Unstacking Data
Virtualitics Explore Desktop currently allows only one feature to be plotted on an axis. To subvert this, we advise to unstack features that you would like to plot on the same axis. For example, consider the daily sales data of four different store branches as shown in the image below.
To compare the sales of each branch, one would have to convert each column to an individual row with a categorical identifier. Hence, the DataFrame below with a shape of 4 columns and 12 rows being converted to the dataframe on the right with 48 rows and 1 column. The column used to identify the Branch has been added as an additional index (this would be a MultiIndex DataFrame).
We can achieve this by loading the data in from a CSV file, unstacking the data frame columns, renaming the new columns, and removing the “Daily Sales -” prefix from the Branch column, all in four lines of code:
df = pd.read_csv('branch_sales.csv', index_col='Date') #read in data with Date as index
df = df.unstack().reset_index() #unstack columns and reset the index
df.columns = ['Branch', 'Date', 'Daily Sales'] #set appropriate columns names
df.Branch = df.Branch.str.replace('Daily Sales -','') #remove prefix from Branch column
We can then load this data into Virtualitics Explore Desktop using the Virtualitics Python API to create the line plot below showing the daily sales for our four different store branches, as shown below.
Pivoting Data
If you need to spread data across more columns, this can be done with a pivot. Consider the Health Indicators dataset (for the US) from the World Health Organization, as shown below.
Given how this data is structured, we should spread the fields from the indicator column to multiple different columns. This will create a column for each unique indicator and make it easier to visualize our features. As a general rule, different fields should not be plotted on the same axis when they cannot be compared. For example, GDP/capita (current US$) and Income share held by lowest 20% should not be plotted on the same axis as one is a dollar-value metric ($) and the other is a percent (%), as shown below.
To accomplish this, we can build a simple pivot with the year as the index, the indicators as the columns and the value column to fill the values in those columns:
df = pd.read_csv('health_indicators', index_col='year')
df = df.pivot(index = 'year', columns = 'indicator', values = 'value')
Merging Datasets
Virtualitics Explore Desktop does not currently support merging/joining data from different sources. However, merging data from multiple data sources is rather straightforward with Python.
Consider two datasets: one containing patient data and another containing data from a clinical trial. In a Python environment, we can load in the data from two separate .csv files:
patient_data = pd.read_csv('patient_data.csv')
patient_data = pd.read_csv('patient_data.csv')
Using the Merge function, we can join the two dataframes on a column they have in common, patient_id. Alternatively, if we have the patient_id set as the index, we could use the Join function as a shortcut to merge directly on the index.
patient_data.merge(clinical_trial_data, on='patient_id')
Sorting Values
Currently, values within an axis are sorted alphabetically and cannot be changed within Virtualitics Explore Desktop.
Consider the graph of chocolate bar sales across different locations, shown below.
Given the alphabetical sorting on the Y axis, the data from the other cities is highly overshadowed by the height of the data in Houston. We can sort these axes by adding a prefix to each of the locations using the code below:
df.Location = df.Location.replace('Los Angeles', '1 - Los Angeles') df.Location = df.Location.replace('New York', '2 - New York') df.Location = df.Location.replace('Houston', '3 - Houston')
Here we are using the Replace function from the Pandas Python package to replace all instances of one text string in the “Location” column with another.
Alternatively, we could achieve this in Virtualitics Explore Desktop using the Find and Replace tool.
By simply adding a numerical prefix, Virtualitics Explore Desktop will now sort the locations in a way that is much more aesthetically pleasing.
Reformatting Datetime Values
Datetime values can often be difficult to work with, given the multitude of different formats that are used.
Virtualitics Explore Desktop can recognize a range of datetime formats, but we find that our clients occasionally have difficulty getting their dates in a format that works with Virtualitics Explore Desktop. Let’s take a look at an example involving Branch Sales Data.
We can create a Pandas DateTime Object by using the pd.to_datetime() function. This will convert our data stored as Strings to a datetime Object containing attributes like year, day of the week, and timestamp (See Figure 1 below).
Storing our dates as objects will encapsulate all of the information we have about the date in a format that can be universally understood when loaded into Virtualitics Explore Desktop. Additionally, the Pandas to_datetime() function can infer the datetime format provided. For getting started with processing datetime data, try using the code below:
df.date = pd.to_datetime(df.date, infer_datetime_format=True)
Where “date” is the name of the column in your dataframe containing your date data. This code will convert the date column in your dataframe to a datetime object and uses infer_datetime_format to guess the datetime format you are using. You can then load your data into Virtualitics Explore Desktop as usual with:
from virtualitics import api VIP = api.VIP() VIP.load_data(df)
or if your data is already loaded into Virtualitics Explore Desktop, simply:
from virtualitics import api VIP = api.VIP() VIP.add_column(df.date)
If the infer_datetime_format parameter is not capturing the format you are using, you can call the format parameter and supply the syntax of the date.
Consider a date input with the format as shown below in Figure 2.
The code we would use to convert these date strings to datetime objects is as such:
df.date = pd.to_datetime(df.date, format="%m/%d/%Y %I:%M:%S %p")
With the format syntax indicated with the table shown below.
More information on string to datetime format syntax can be found here.
Generating Timedeltas
Generating TimeDeltas can be very useful for feature engineering within Virtualitics Explore Desktop. Using datetime objects in Python can simplify this process incredibly quickly. Consider the data relating to purchase orders below.
We are interested in understanding the time difference between when the Purchase Order is approved and issued. We will first convert each column to a datetime object:
df['Purchase Order Approval Date'] = pd.to_datetime(df['Purchase Order Approval Date'],
infer_datetime_format = True) df['Purchase Order Issue Date'] =
pd.to_datetime(df['Purchase Order Issue Date'],infer_datetime_format = True)
Now we can find the TimeDelta between these two date fields by simply subtracting the two datetime objects:
df['Purchase Order Approval Date'] - df['Purchase Order Issue Date']
You will notice that performing a subtraction has converted a Series of datetime objects to a Series of TimeDelta objects. We are interested in just obtaining the number of days between each date. Let’s apply a brief function to extract the “days” attribute from the timedelta objects and add this to our data:
df[‘timedelta’] = (df['Purchase Order Approval Date'] - df['Purchase Order Issue Date']).apply(lambda x: x.days)
Series.apply enables us to apply a function (in this case, we have defined one using the lambda keyword, where x is the input to the function) to extract the days from the Series of timedelta objects. We added this new feature to the original dataframe (df) as “timedelta”.
Previous Article |