Related Categories

DataFrame Manipulation

Loading in Data

Many of the functionalities listed below, will require loading data into a Jupyter notebook with Python. The primary tool for processing data tables in Python is a Pandas DataFrame. Data can be read into a DataFrame from a multitude of data sources:

Unstacking Data

Currently, only one feature can be plotted on an axis in VIP. To subvert this, we advise users to unstack features that they would like to plot on the same axis. For example, consider the daily sales data of four different store branches listed below:

In order to compare the sales of each branch, one would have to convert each column to an individual row with a categorical identifier. Hence, the DataFrame on the left with a shape of 4 columns and 12 rows being converted to the dataframe on the right with 48 rows and 1 column. The column used to identify the Branch has been added as an additional index (this would be a MultiIndex DataFrame)

We can achieve this by loading the data in from a CSV file, unstacking the data frame columns, renaming the new columns, and removing the “Daily Sales -” prefix from the Branch column, all in four lines of code:

df = pd.read_csv('branch_sales.csv', index_col='Date') #read in data with Date as index
df = df.unstack().reset_index() #unstack columns and reset the index

df.columns = ['Branch', 'Date', 'Daily Sales'] #set appropriate columns names

df.Branch = df.Branch.str.replace('Daily Sales -','') #remove prefix from Branch column

We can then load this data into VIP using the Virtualitics Python API to create the line plot below, showing the daily sales for our four different store branches:

 

Pivoting Data

Alternatively, if we needed to spread the data across more columns, this can be done with a pivot. Consider the Health Indicators dataset (for the US) from the World Health Organization:

Given how this data is structured, we should spread the fields from the indicators column to multiple different columns. This will create a column for each unique indicator and make it easier to visualize our features. As a general rule, when different fields cannot be compared they should not be plotted on the same axis. For example, GDP/capita (current US$) and Income share held by lowest 20%  should not be plotted on the same axis as one is a dollar-value metric ($) and the other is a percent (%).

To accomplish this, we can build a simple pivot with the year as the index, the indicators as the columns and the value column to fill the values in those columns:

df = pd.read_csv('health_indicators', index_col='year')
df = df.pivot(index = 'year', columns = 'indicator', values = 'value')

Merging Datasets

VIP does not currently support merging/joining data from different sources.  However, merging data from multiple data sources is rather straightforward with Python. Consider two datasets: one containing patient data and another containing data from a clinical trial. In a Python environment, we can load in the data from two separate .csv files:

patient_data = pd.read_csv('patient_data.csv')

trial_data = pd.read_csv('clinical_trial_data.csv')

Using the Merge function, we can join the two dataframes on a column they have in common, patient_id. Alternatively, if we have the patient_id set as the index, we could use the Join function as a shortcut to merge directly on the index.

patient_data.merge(clinical_trial_data, on='patient_id')