from virtualitics import api
import pandas as pd
data = pd.read_csv('../../../Virtualitics/SampleData/eCommerce Travel Data.csv')
In this notebook, we will analyze an eCommerce dataset. Each row of data contains information about someone that visited the website and may or may not have spent some money on the selection of travel packages that this website hosts. Suppose we are a marketing agency and we are tasked with identifying cohorts of users to incentivize to visit the site again.
vip = api.VIP()
Setting up WebSocket connection to: ws://localhost:12345/api Connection Successful! Initializing session.
vip.load_data(data, "monthly_website_sales")
Data set loaded with name: 'monthly_website_sales (3)'
'monthly_website_sales (3)'
data.head(n=3)
User ID | Location | Hobby | Language | Gender | Age | Married | Kids | Pets | First visit to site (months ago) | Last visit to site (months ago) | Device used | Page visits | Number of visits last month | Household Income (USD) | Time spent (seconds) | Amount spent (USD) | Number of products viewed | Frequency of visits last week | Sources | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 413613 | Urban | Meditation | French | Female | 17 | No | Yes | Yes | 10 | 2 | Tablet | 7 | 219 | 38407 | 2205 | 2060.898000 | 12 | 6 | App |
1 | 463272 | Urban | Beach Activities | German | Female | 15 | Yes | Yes | Yes | 30 | 7 | Desktop | 15 | 120 | 51905 | 2949 | 2224.381362 | 32 | 30 | App |
2 | 437160 | Suburb | Beach Activities | French | Female | 48 | No | No | No | 27 | 3 | Tablet | 10 | 62 | 78967 | 443 | 353.346000 | 2 | 4 | App |
Lets use VIP's Smart Mapping routine to identify the key drivers of users who spent money on the website.
# Here we are running smart mapping with a target set to "Amount spent (USD)"
# Instead of specifying which features to include as input to smart mapping, we specify which
# features to exclude from input.
vip.smart_mapping(data["Amount spent (USD)"],
exclude=["User ID"])
SmartMapping Rank | Feature | Correlated Group | |
---|---|---|---|
0 | 1 | Frequency of visits last week | None |
1 | 2 | Sources | None |
2 | 3 | Number of products viewed | None |
3 | 4 | Married | None |
4 | 5 | Household Income (USD) | None |
Immediately we can start to see how users with heavy spending and lower spending are separated. Its also very interesting that 'Sources', which tracks how the user visited the website, is such a strong driver of spending. Lets map 'Sources' to shape and see if that shows us anything interesting!
print(vip.local_history[0])
####################################### Dataset: monthly_website_sales (3) Plot Name: None # Plot Type: SCATTER_PLOT # Mapped Dimensions: X: Frequency of visits last week Y: Number of products viewed Z: Household Income (USD) Color: Amount spent (USD) Size: Time spent (seconds) Playback: Sources # Plot Settings: X Range Min: 0.0 Y Range Min: 0.0 Z Range Min: 35000.0 X Range Max: 69.0 Y Range Max: 50.0 Z Range Max: 450000.0 X Limit Min: 0.0 Y Limit Min: 0.0 Z Limit Min: 35000.0 X Limit Max: 69.0 Y Limit Max: 50.0 Z Limit Max: 450000.0 X Limit Link: False Y Limit Link: False Z Limit Link: False ColorType: COLOR_BIN Color Bin Count: 4 Color Bin Dist: EQUAL_BINS Color Inverted: False Color Palette Id: 0 X Normalization: None Y Normalization: None Z Normalization: None Size Normalization: None Trend Lines: HIDE Scatter Plot Point Mode: SHOW #######################################
sm_result = vip.local_history[0] # grabs the VipPlot object from the local history
sm_result.playback = data['Sources']
sm_result.size_scale = 2.0
vip.show(sm_result)
# Click on the 'Play' button in VIP to see how the 'Sources' feature effects sales.
Note: 'Trend Lines are not currently supported.'
Its pretty clear that the only 'Sources' value that had substantial sales was the 'App'. Good to know - this makes it easier for us to send notifications to the user about potential deals we can offer.
vip.insights() # opens the insights panel :)
Insight |
---|
Out of 848 records, 42% are <sprite index=69 color=#04D1FFFF> [0 : 124.83] vs. 24% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 16 and 42</b> <i><sprite index=112> Number of products viewed</i> is <b>between 2 and 30</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 62,582 and 450,000</b> |
Out of 652 records, 82% are <sprite index=69 color=#04D1FFFF> [0 : 124.83] vs. 21% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 0 and 1</b> <i><sprite index=112> Number of products viewed</i> is <b>between 0 and 31</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,017 and 56,989</b> |
Out of 599 records, 83% are <sprite index=69 color=#04D1FFFF> [0 : 124.83] vs. 22% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 2 and 53</b> <i><sprite index=112> Number of products viewed</i> is <b>between 0 and 1</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,048 and 450,000</b> |
Out of 591 records, 31% are <sprite index=69 color=#04D1FFFF> [0 : 124.83] vs. 25% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 16 and 41</b> <i><sprite index=112> Number of products viewed</i> is <b>between 2 and 9</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,003 and 62,353</b> |
Out of 566 records, 85% are <sprite index=69 color=#04D1FFFF> [0 : 124.83] vs. 22% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 0 and 1</b> <i><sprite index=112> Number of products viewed</i> is <b>between 0 and 41</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 57,548 and 450,000</b> |
Out of 1,252 records, 39% are <sprite index=69 color=#71FFAAFF> [125.0528 : 516.15256] vs. 23% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 4 and 15</b> <i><sprite index=112> Number of products viewed</i> is <b>between 2 and 12</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,016 and 62,939</b> |
Out of 841 records, 50% are <sprite index=69 color=#71FFAAFF> [125.0528 : 516.15256] vs. 23% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 2 and 3</b> <i><sprite index=112> Number of products viewed</i> is <b>between 2 and 49</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,007 and 450,000</b> |
Out of 618 records, 35% are <sprite index=69 color=#71FFAAFF> [125.0528 : 516.15256] vs. 24% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 16 and 43</b> <i><sprite index=112> Number of products viewed</i> is <b>between 10 and 16</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,216 and 62,408</b> |
Out of 598 records, 43% are <sprite index=69 color=#71FFAAFF> [125.0528 : 516.15256] vs. 24% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 16 and 68</b> <i><sprite index=112> Number of products viewed</i> is <b>between 2 and 9</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,031 and 62,211</b> |
Out of 2,610 records, 38% are <sprite index=69 color=#FFCA4DFF> [516.334 : 1,074.57] vs. 21% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 4 and 15</b> <i><sprite index=112> Number of products viewed</i> is <b>between 13 and 50</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,000 and 57,896</b> |
Out of 1,251 records, 27% are <sprite index=69 color=#FFCA4DFF> [516.334 : 1,074.57] vs. 25% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 4 and 15</b> <i><sprite index=112> Number of products viewed</i> is <b>between 2 and 12</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,005 and 62,859</b> |
Out of 1,167 records, 33% are <sprite index=69 color=#FFCA4DFF> [516.334 : 1,074.57] vs. 24% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 4 and 15</b> <i><sprite index=112> Number of products viewed</i> is <b>between 2 and 12</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 62,949 and 450,000</b> |
Out of 602 records, 29% are <sprite index=69 color=#FFCA4DFF> [516.334 : 1,074.57] vs. 25% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 16 and 69</b> <i><sprite index=112> Number of products viewed</i> is <b>between 17 and 50</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,017 and 62,449</b> |
Out of 561 records, 35% are <sprite index=69 color=#FFCA4DFF> [516.334 : 1,074.57] vs. 24% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 4 and 15</b> <i><sprite index=112> Number of products viewed</i> is <b>between 13 and 50</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 57,974 and 405,570</b> |
Out of 2,611 records, 42% are <sprite index=69 color=#DC2E3EFF> [1,074.822 : 10,452.75] vs. 20% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 4 and 15</b> <i><sprite index=112> Number of products viewed</i> is <b>between 13 and 50</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,003 and 57,914</b> |
Out of 1,166 records, 39% are <sprite index=69 color=#DC2E3EFF> [1,074.822 : 10,452.75] vs. 23% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 4 and 15</b> <i><sprite index=112> Number of products viewed</i> is <b>between 2 and 12</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 62,981 and 450,000</b> |
Out of 599 records, 28% are <sprite index=69 color=#DC2E3EFF> [1,074.822 : 10,452.75] vs. 25% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 16 and 68</b> <i><sprite index=112> Number of products viewed</i> is <b>between 17 and 50</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 35,020 and 62,065</b> |
Out of 554 records, 36% are <sprite index=69 color=#DC2E3EFF> [1,074.822 : 10,452.75] vs. 24% in the rest of the data. When: <i><sprite index=111> Frequency of visits last week</i> is <b>between 4 and 15</b> <i><sprite index=112> Number of products viewed</i> is <b>between 13 and 50</b> <i><sprite index=113> Household Income (USD)</i> is <b>between 58,096 and 350,799</b> |
We are specifically interested in finding users that spent a lot of money on the website. From the dropdown menu in VIP, lets click on the RED color to look at the top quartile of spenders.
# The first insight shows a region where users view a lot of products.
# Also, a lot of the datapoints have a lower value for household income.
# Click on 'Select' to filter down to this region of the data.
data['LowIncomeBrowsers'] = vip.get_visible_points()
# The second insight for the top quartile shows a region where users view fewer products
# on the site and have higher incomes compared to our first insight. Click on select
# to filter down to this region of the data.
data["ExpressShoppers"] = vip.get_visible_points()
Lets make an index column to track our known heavy spenders and check some simple statistics!
data['spenders'] = data['LowIncomeBrowsers'] + data['ExpressShoppers']
data[data['spenders'] == 0]["Amount spent (USD)"].describe()
count 7223.000000 mean 578.622430 std 826.127380 min 0.000000 25% 39.430404 50% 315.840000 75% 750.468379 max 10182.390837 Name: Amount spent (USD), dtype: float64
data[data['LowIncomeBrowsers'] == 1]["Amount spent (USD)"].describe()
count 2611.000000 mean 1267.006418 std 1115.234537 min 0.064000 25% 569.639000 50% 865.074000 75% 1661.876058 max 10452.754782 Name: Amount spent (USD), dtype: float64
data[data['ExpressShoppers'] == 1]["Amount spent (USD)"].describe()
count 1166.000000 mean 1162.452620 std 1096.830963 min 0.092536 25% 479.118500 50% 776.132000 75% 1543.196064 max 7682.371263 Name: Amount spent (USD), dtype: float64
It looks like our "low income browsers" and "express shoppers" have nearly double the 'Amount Spent' on our products. Lets see if clustering will give us more information about our user base.
data['user_clusters'] = vip.clustering(exclude=["User ID", "Amount spent (USD)"])
Note: 'Some features were not included because: String features cannot be used (Location, Hobby, Language, Gender, Married, Kids, Pets, Device used, Sources).'
The red and blue clusters are separated on 'Time spent' and 'Products Viewed' but if we look closely we can see that the green cluster is strongly correlated with size, which currently is mapped with 'Household Income'. This is a really useful insight in that these individuals have high enough income that it outweighs their behavior in 'Time Spent' and 'Products Viewed'! We should try to identify our richest customers and send them customized marketing materials.
# First lets add our index column for our spenders to VIP
vip.add_column(data["spenders"])
# Now lets highlight those spenders with the Shape dimension!
sm_result.shape = "spenders"
vip.show(sm_result)
Note: 'New column, spenders, successfully added to the current dataset.' Note: 'Trend Lines are not currently supported.'
We can use VIP's Anomaly Detection tool to identify users with extremely high household income. We may be able to target these users with a special deal.
data['IsRich'] = vip.anomaly_detection(
features=[data['Household Income (USD)']], plus_minus="+", stdev=3.5)
Through our analysis, we've identified 3 cohorts of users that are likely to spend money on the website.
Lets make some subsets of the data based on the 3 cohorts we've identified.
rich_people = data[data['IsRich'] == "1 - Outlier"]
express_shoppers = data[data['ExpressShoppers'] == 1]
browsers = data[data['LowIncomeBrowsers'] == 1]
We will not run the next cell here, but it is shown to give inspiration on how business decision can be taken from the insights identified in this analysis.
app_notification.private_client_offer(rich_people["User ID"])
app_notification.send_lightning_deal(express_shoppers['User ID'])
app_notification.send_package_deals(browsers["User ID"])