The Case Study and the Knime Dashboard

The Case Study is based on a hypothetical bank credit risk data set.
The data set contains 26 variables collected on 10,000 observations:

13 numerical
13 categorical, including the target variable “Esito Negativo”.

The goal of the Case Study is to analyze the data and to predict the Target Variable.

Click here to see the list of the variables of this Case Study, their possible codes and their English Translation.

This Case Study is developed with Knime Analytics Platform. Knime Analytics Platform is open-source software for building analytical models, developing ETL processes and producing reports.

The aim of the Knime application, the dashboard, is to interactively visualize the data and the results of the analysis of the bank credit risk data set.

The analysis ends with a comparison of four predictive models, in which Gradient Boosting achieves the best accuracy, ahead of the Random Forest and Decision Tree models.

In particular, this demo uses Knime to generate a web app with dynamic pages that follow different Knime workflows and show different charts depending on the selected parameters. Furthermore, it integrates models and code written in Python and R.

The Knime app is composed of five pages:

TABLE VIEW: The page is composed of the following tabs:
TABLE VIEW SELECTION: lets you choose which data set to display, training or test.
COLUMN SELECTION TYPE: lets you choose which of the available columns to display.
DATASET VIEW: shows the values of the columns of the selected data set.

QUALITATIVE VARIABLES: The page is composed of the following tabs:
CORRELATION TABLE BETWEEN QUALITATIVE VARIABLES
PIE CHART AND FREQUENCY TABLE
CONDITIONAL BAR CHART AND FREQUENCY TABLE

QUANTITATIVE VARIABLES: The page is composed of the following tabs:
CORRELATION INDEX AND SCATTER PLOT
SUMMARY STATISTICS
HISTOGRAM AND CONDITIONAL BOX-PLOT

PREDICTIVE MODEL: The page is composed of the following tabs:
MODEL SELECTION
CONFUSION MATRIX
SUMMARY STATISTICS
HISTOGRAM AND CONDITIONAL BOX-PLOT

MODEL COMPARISON: The page is composed of the following tabs:
MODEL FITTING INDEXES
ROC CURVE COMPARISON

The following models were tested to predict the default indicator:

LOG = Logistic Regression Model
DT = Decision Tree Model
RF = Random Forest Model
GB = Gradient Boosting Model

The five pages of the app are described in detail below.

First tab: Table view

The first page displays the values of each column of the data set.

You can choose the variables of interest, set the number of observations displayed on the screen and sort the rows by several columns at once. You can also download the data in CSV format.

TABLE VIEW: The page is composed of the following tabs:

TABLE VIEW SELECTION: lets you choose which data set to display, training or test.

COLUMN SELECTION TYPE: lets you choose which of the available columns to display.

DATASET VIEW: shows the values of the columns of the selected data set and lets you page through and sort the data.
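
The column selection, multi-level sort and CSV download described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Knime workflow itself; the sample rows, the "id" and "age" columns and the helper names are hypothetical, and only "esineg" comes from the data set.

```python
import csv
import io
from operator import itemgetter

# Hypothetical sample rows; only the "esineg" column name comes from the Case Study.
rows = [
    {"id": 3, "age": 41, "esineg": 0},
    {"id": 1, "age": 35, "esineg": 1},
    {"id": 2, "age": 41, "esineg": 1},
]


def select_columns(rows, columns):
    """Keep only the requested columns, in the requested order."""
    return [{c: r[c] for c in columns} for r in rows]


def sort_multi(rows, keys):
    """Sort by several columns; keys is a list of (column, descending) pairs."""
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a multi-level sort.
    for column, descending in reversed(keys):
        result.sort(key=itemgetter(column), reverse=descending)
    return result


def to_csv(rows):
    """Serialize the rows to CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Show three columns, sorted by age (descending) and then by id (ascending).
view = sort_multi(
    select_columns(rows, ["age", "esineg", "id"]),
    [("age", True), ("id", False)],
)
csv_text = to_csv(view)
print(csv_text)
```

Sorting once per key, from the least to the most significant, keeps the helper short and relies only on the stability of Python's sort.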

Click on the video below to see the dashboard displaying the values of each column of a selected data set, together with a quick view of the component data flow that generates and controls the dashboard.

Click on each picture to enlarge it, or on each video to see it.

Second tab: Qualitative variables

The second page lets you explore the distributions of the categorical variables of the data set and the correlation measures between them.

First, select a pair of variables to analyze. You can also download the resulting correlation table in CSV format. Second, select one of the categorical variables listed under the first chart to produce a pie chart and a frequency distribution.

QUALITATIVE VARIABLES: The page is composed of the following tabs:

CORRELATION TABLE BETWEEN QUALITATIVE VARIABLES: calculates the correlation between pairs of variables of the training data set.

COLUMN SELECTION: lets you choose which of the available columns to display.

PIE CHART AND FREQUENCY TABLE: shows the distribution of the selected qualitative variable with a pie chart and a frequency table.

CONDITIONAL BAR CHART AND FREQUENCY TABLE: shows the distribution of the target variable "esineg" across the values of the selected qualitative variable, with a bar chart and a contingency table.
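
As a rough illustration of what the frequency table and the contingency table contain, the following Python sketch counts the values of one categorical variable and cross-tabulates them against the target. The category labels and observations are hypothetical, the function names are ours, and the dashboard computes these tables inside Knime rather than with this code.

```python
from collections import Counter

# Hypothetical observations: (category value, esineg value) pairs.
data = [
    ("owner", 0),
    ("renter", 1),
    ("owner", 0),
    ("renter", 0),
    ("owner", 1),
]


def frequency_table(values):
    """Return {value: (absolute count, relative frequency)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.items()}


def contingency_table(pairs):
    """Return {category: Counter({target value: count})}."""
    table = {}
    for category, target in pairs:
        table.setdefault(category, Counter())[target] += 1
    return table


freq = frequency_table(category for category, _ in data)
cont = contingency_table(data)

for category, (count, share) in freq.items():
    print(category, count, f"{share:.0%}", dict(cont[category]))
```

The relative frequencies are what the pie chart displays, while each row of the contingency table corresponds to one group of bars in the conditional bar chart.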

Click on the video below to see the dashboard displaying the frequency tables and charts of the qualitative variables, together with a quick view of the component data flow that generates and controls this dashboard.

Third tab: Quantitative variables

The third page lets you explore the distributions of the numerical variables of the data set and the correlation measures between them.

The charts and statistical tables are driven by two selections: a pair of variables used as the X and Y axes, and a single variable whose distribution is analyzed with respect to the response variable "esineg". You can also download the resulting correlation table in CSV format.

QUANTITATIVE VARIABLES: The page is composed of the following tabs:

AXES SELECTION: lets you select the two variables to be used as the X and Y axes in the charts that follow.

CORRELATION INDEX AND SCATTER PLOT: shows the Pearson correlation index and a scatter plot of the two selected numeric variables of the training data set.

COLUMN SELECTION: lets you choose which of the available columns to display.

SUMMARY STATISTICS: shows basic statistics of the selected quantitative variable, such as the positional indexes (mean, median...) and the variability indexes (standard deviation, variance...), plus other statistics and counts.

HISTOGRAM AND CONDITIONAL BOX-PLOT: shows two charts. The first is a histogram that distributes the numerical values into several bins. The second is a conditional box-plot that splits the values of the selected column into two box-plots, one for each possible value of the target variable "esineg".
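
For reference, the Pearson correlation index shown in the CORRELATION INDEX AND SCATTER PLOT tab is the covariance of the two variables divided by the product of their standard deviations. A minimal Python sketch, with hypothetical input values and not taken from the Knime workflow, could look like this:

```python
import math


def pearson(xs, ys):
    """Pearson correlation index between two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations; the 1/n factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for the two variables chosen as X and Y axes.
x_axis = [1, 2, 3, 4]
y_axis = [2, 4, 6, 8]
print(pearson(x_axis, y_axis))
```

A value close to 1 or -1 means the points of the scatter plot lie near a straight line, while a value close to 0 indicates no linear relationship.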

Click on the video below to see the dashboard displaying the statistics, correlation indexes and charts of the quantitative variables with respect to the target variable "esineg", together with a quick view of the component data flow that generates and controls this dashboard.

Fourth tab: Predictive models

The fourth page lets you analyze the results of each model.

To analyze the results and performance of a model, select it from the list box.

PREDICTIVE MODELS: The page is composed of the following tabs:

MODEL SELECTION: lets you select the model whose predictive performance you want to analyze.

CONFUSION MATRIX: shows the confusion matrix, which cross-tabulates the instances of each actual class against the instances of each predicted class, together with the concordant and discordant percentages for the values of the target variable. It is computed on the test data set.

SUMMARY STATISTICS: shows basic statistics of the selected quantitative variable, such as the positional indexes (mean, median...) and the variability indexes (standard deviation, variance...), plus other statistics and counts.

HISTOGRAM AND CONDITIONAL BOX-PLOT: shows two charts. The first is a histogram that distributes the numerical values into several bins. The second is a conditional box-plot that splits the values of the selected column into two box-plots, one for each possible value of the target variable "esineg".
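
To make the confusion matrix and the concordant and discordant percentages concrete, here is a small Python sketch. It assumes a binary target coded 0/1 (an assumption about "esineg"), uses hypothetical labels and predictions, and is not the code that the dashboard runs.

```python
def confusion_matrix(actual, predicted):
    """Count the four actual/predicted combinations for a 0/1 target."""
    matrix = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            matrix["tp"] += 1  # actual positive, predicted positive
        elif a == 0 and p == 1:
            matrix["fp"] += 1  # actual negative, predicted positive
        elif a == 0 and p == 0:
            matrix["tn"] += 1  # actual negative, predicted negative
        else:
            matrix["fn"] += 1  # actual positive, predicted negative
    return matrix


def concordance(matrix):
    """Return the (concordant, discordant) shares of all predictions."""
    total = sum(matrix.values())
    concordant = (matrix["tp"] + matrix["tn"]) / total
    return concordant, 1 - concordant


# Hypothetical test-set labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
concordant, discordant = concordance(matrix)
print(matrix, f"concordant={concordant:.0%}", f"discordant={discordant:.0%}")
```

The concordant share is the accuracy of the model on the test data set, and the discordant share is its error rate.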

Fifth tab: Model comparison

The fifth page compares all the models produced by the dashboard, including the naive model.

The comparison relies mainly on the fitting indexes of each model and on the ROC curves.

MODEL COMPARISON: The page is composed of the following tabs:

MODEL FITTING INDEXES: shows the fitting indexes of all the predictive models together.

ROC CURVE COMPARISON: overlays the ROC curves of the four models in a single chart. Based on our results, the Gradient Boosting model with a learning rate of 0.15 provides the best predictive performance.
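
The sketch below shows one common way to build the points of a ROC curve from predicted scores and to summarize each curve by its area under the curve (AUC), so that models can be ranked. It is an illustration only: the labels and scores are hypothetical, "model_a", "model_b" and the constant-score "baseline" are placeholder names rather than the models of the Case Study, and the dashboard may compute its curves differently.

```python
def roc_points(labels, scores):
    """Return the (false positive rate, true positive rate) points of a ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    # Lower the threshold step by step, from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Observations with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical test labels and scores for three placeholder scorers.
labels = [1, 1, 0, 1, 0, 0]
scores_by_model = {
    "model_a": [0.9, 0.8, 0.3, 0.7, 0.4, 0.2],
    "model_b": [0.6, 0.4, 0.5, 0.7, 0.3, 0.2],
    "baseline": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}

auc_by_model = {
    name: auc(roc_points(labels, scores))
    for name, scores in scores_by_model.items()
}
best_model = max(auc_by_model, key=auc_by_model.get)
print(auc_by_model, best_model)
```

A scorer that cannot separate the classes, such as the constant baseline here, follows the diagonal of the chart, and the further a curve rises above that diagonal, the better the model ranks the observations.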

Click on the video below to see the comparison of all four models.
