Model Visualization and Explainability

Model explainability remains a hurdle to the widespread adoption and understanding of machine learning. In this notebook, we train and visualize a neural network that predicts credit card defaults from credit usage, payment history, and some demographic information. The goal is to explore how VIP can be used to visualize the output of complex machine learning models and then to explain the results in terms of the input features.

In the final portion of the notebook, we also visualize the results of a grid search over hyperparameters to determine which combinations perform best for a gradient boosting machine (GBM).

In [1]:
import pandas as pd
import numpy as np
In [2]:
# Import API
from virtualitics import api
vip=api.VIP()
Setting up WebSocket connection to: ws://localhost:12345/api
Connection Successful! Initializing session.

Import Data and Preprocess

Data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

This dataset contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. All amounts are given in New Taiwan dollars (NT$).

In [3]:
# Import data from a local file
path_to_data = '../data/default of credit card clients.csv'
df = pd.read_csv(path_to_data)

# Uncomment next two lines to import data from UCI machine learning repository instead
# link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
# df = pd.read_excel(link, header=1)

df.drop('ID', axis=1, inplace=True)

# Rename categorical class values with descriptions
sex = {1:'Male', 2:'Female'}
education = {1:'graduate', 2:'university', 3:'high school', 4:'other', 5:'other', 6:'other', 0:'other'}
marriage = {1:'married', 2:'single', 3:'other', 0:'other'}

df['SEX'] = [sex[x] for x in df['SEX']]
df['EDUCATION'] = [education[x] for x in df['EDUCATION']]
df['MARRIAGE'] = [marriage[x] for x in df['MARRIAGE']]

# Define target and categoricals
target = 'default payment next month'
features = list(df.columns)[:-1]
categoricals = ['SEX', 'EDUCATION', 'MARRIAGE']

df.head()
Out[3]:
LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 20000 Female university married 24 2 2 -1 -1 -2 ... 0 0 0 0 689 0 0 0 0 1
1 120000 Female university single 26 -1 2 0 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 90000 Female university single 34 0 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 50000 Female university married 37 0 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 50000 Male university married 57 -1 0 -1 0 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 24 columns

See Kaggle for further explanation of the dataset: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/
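
The preprocessing cell above recodes the categorical columns with list comprehensions, which raise a KeyError on any code that is missing from the dictionary. As a minimal sketch of an alternative, assuming you would rather label unexpected codes as 'other' than stop, pandas `Series.map` followed by `fillna` does the same recoding. The `recode` helper and the `toy` frame are hypothetical and only illustrate the pattern; they are not part of the original notebook or of VIP.

```python
import pandas as pd

# Same code-to-label dictionaries as in the preprocessing cell.
education = {1: 'graduate', 2: 'university', 3: 'high school',
             4: 'other', 5: 'other', 6: 'other', 0: 'other'}
marriage = {1: 'married', 2: 'single', 3: 'other', 0: 'other'}


def recode(series, mapping, default='other'):
    """Map integer codes to labels; codes absent from `mapping` become `default`."""
    return series.map(mapping).fillna(default)


# Hypothetical toy data, not rows from the real dataset.
# The codes 9 and 7 are deliberately absent from the dictionaries.
toy = pd.DataFrame({'EDUCATION': [1, 2, 3, 9], 'MARRIAGE': [1, 2, 0, 7]})
toy['EDUCATION'] = recode(toy['EDUCATION'], education)
toy['MARRIAGE'] = recode(toy['MARRIAGE'], marriage)
print(toy)
```

Whether silently bucketing unknown codes is preferable to failing fast depends on how much you trust the source file; the list-comprehension version is the stricter of the two.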

Load and Visualize Data in VIP

In [4]:
vip.load_data(df, 'credit_card_defaults')
In [5]:
vip.smart_mapping(target, features)
vip.normalize()
   SmartMapping Rank  Feature  Correlated Group
0                  1    PAY_2              None
1                  2    PAY_0              None
2                  3    PAY_3              None
3                  4    PAY_5              None
4                  5    PAY_4              None
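
This notebook does not describe how SmartMapping computes its ranking, so the following is not a reimplementation of it. As a rough, independent cross-check, one could rank the numeric features by their absolute Pearson correlation with the target using pandas alone. The function name `rank_by_abs_correlation` and the synthetic `toy` data are assumptions made for illustration, and the resulting order need not agree with the SmartMapping output above.

```python
import numpy as np
import pandas as pd


def rank_by_abs_correlation(df, target, features):
    """Rank numeric features by absolute Pearson correlation with the target.

    Non-numeric features are skipped rather than encoded.
    """
    numeric = df[features].select_dtypes(include='number')
    corr = numeric.corrwith(df[target]).abs()
    ranked = corr.sort_values(ascending=False)
    return pd.DataFrame({
        'Rank': range(1, len(ranked) + 1),
        'Feature': ranked.index,
        'AbsCorrelation': ranked.values,
    })


# Synthetic toy data (not the credit card dataset).
rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
toy = pd.DataFrame({
    'strong': signal + 0.1 * rng.normal(size=n),  # closely tracks the signal
    'weak': signal + 3.0 * rng.normal(size=n),    # mostly noise, some signal
    'noise': rng.normal(size=n),                  # unrelated to the target
    'label': ['a'] * n,                           # non-numeric, so skipped
    'target': (signal > 0).astype(int),
})

ranking = rank_by_abs_correlation(
    toy, 'target', ['strong', 'weak', 'noise', 'label'])
print(ranking)
```

On the real data the call would be `rank_by_abs_correlation(df, target, features)`; the categorical columns SEX, EDUCATION, and MARRIAGE would be left out because they hold strings after preprocessing.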