Preprocessing Profiling

Preprocessing Profiling is a tool for evaluating the results of different preprocessing techniques on tabular datasets. When a dataset with missing values is received, a machine learning algorithm is tested with a series of possible imputation techniques. When the dataset has no missing values, missing values are created at random (for demonstration purposes only). The test results are then displayed in an organized report with various visualizations, e.g., Nullity Matrix, Classification Report, Confusion Matrix, and others explained further in this document.
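The idea of creating missing values at random and then filling them in can be sketched with pandas and NumPy. This is only an illustration, not the tool's own code: the helper name inject_missing, the fraction, the seed and the toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def inject_missing(df, fraction=0.1, seed=0):
    """Return a copy of df with roughly `fraction` of its numeric cells set to NaN.

    Hypothetical helper for illustration; not part of Preprocessing Profiling.
    """
    rng = np.random.RandomState(seed)
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # Draw one uniform number per numeric cell; cells below the threshold go missing.
    mask = rng.rand(len(out), len(numeric_cols)) < fraction
    values = out[numeric_cols].to_numpy(dtype=float, copy=True)
    values[mask] = np.nan
    out[numeric_cols] = values
    return out


# Made-up toy data in the spirit of the Iris dataset.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "petal_length": [1.4, 1.4, 6.0, 5.1, 5.9, 5.8],
    "species": ["setosa", "setosa", "virginica", "virginica", "virginica", "virginica"],
})

with_gaps = inject_missing(toy, fraction=0.3, seed=42)

# Two simple imputation strategies: fill each column with its mean or its median.
mean_filled = with_gaps.fillna(with_gaps.mean(numeric_only=True))
median_filled = with_gaps.fillna(with_gaps.median(numeric_only=True))
```

Only the numeric columns receive gaps here, so the class column stays intact for the classifier.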

Installation

Preprocessing Profiling can be installed with pip:

pip install https://github.com/DAVINTLAB/preprocessing-profiling/archive/master.zip

Usage

Preprocessing Profiling will return its report in the form of a page written in HTML like this one.

Getting started

The use of Jupyter Notebook is recommended as it can make the experience more interactive. The first step is to import the necessary libraries.

import pandas as pd
from preprocessing_profiling import ProfileReport

A pandas dataframe will serve as the dataset that will be used to generate the report. In this example, we are using the Iris dataset.

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", encoding='UTF-8')
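Note that the UCI iris.data file has no header row, so a plain read_csv call uses the first record as the column names. If named columns are preferred, they can be supplied explicitly. The sketch below uses a few inlined rows in the same layout so that it runs offline; the column names are an assumption chosen for readability, and whether the report needs named columns is not stated here.

```python
import io

import pandas as pd

# Hypothetical column names; the raw file does not define any.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# Three rows in the same comma-separated layout as iris.data.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# header=None keeps the first record as data; names assigns the column labels.
df_named = pd.read_csv(sample, header=None, names=IRIS_COLUMNS)
```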

In Jupyter Notebook, simply calling the report will display it.

ProfileReport(df)

Otherwise, a file can be written.

ProfileReport(df).to_file(outputfile = "./path/to/file.html")

Visualizations

Nullity matrix

This matrix from missingno is a way of visualizing the distribution of missing values. It is particularly useful in identifying patterns of the missing values in the data. Missing values are displayed in white and regular entries are displayed in black.

[Figure: nullity matrix example]
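The data behind such a matrix is a boolean mask with one cell per entry. A minimal sketch with pandas, on made-up values (missingno itself draws the picture with its matrix function):

```python
import numpy as np
import pandas as pd

# Made-up toy data with a few gaps.
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3, 5.8],
    "petal_length": [1.4, 1.4, np.nan, np.nan],
})

# True where a value is missing (a white cell in the plot), False otherwise.
nullity = toy.isna()

# Simple summaries that help spot patterns before looking at the plot.
missing_per_column = nullity.sum()
rows_with_gaps = nullity.any(axis=1)
```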

Classification report

The classification report will display the main machine learning classification metrics. The precision, recall, f1-score and support of each individual class can be seen.

[Figure: classification report example]
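For reference, these four metrics can be computed by hand from the actual and predicted labels. The sketch below uses only the standard library and made-up labels; the function name per_class_metrics is hypothetical and is not part of the tool.

```python
from collections import Counter


def per_class_metrics(actual, predicted):
    """Compute precision, recall, f1-score and support for every class."""
    pairs = Counter(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    report = {}
    for label in labels:
        true_positives = pairs[(label, label)]
        # How often this class was predicted, and how often it actually occurs.
        predicted_as = sum(n for (_, p), n in pairs.items() if p == label)
        support = sum(n for (a, _), n in pairs.items() if a == label)
        precision = true_positives / predicted_as if predicted_as else 0.0
        recall = true_positives / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1-score": f1,
            "support": support,
        }
    return report


# Made-up labels: two virginica samples are mistaken for versicolor.
actual = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4
predicted = ["setosa"] * 3 + ["versicolor"] * 3 + ["versicolor", "versicolor", "virginica", "virginica"]

report = per_class_metrics(actual, predicted)
```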

Confusion matrix

The confusion matrix shows the predictions in a matrix. In this matrix, the rows represent the actual classes and the columns represent the predictions that were made. The main diagonal contains the correct predictions and is shown in blue. Every other prediction is shown in red. In the example below, we can see two instances where the class was virginica but the algorithm classified them as versicolor.

[Figure: confusion matrix example]
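The same layout (rows for actual classes, columns for predictions) can be reproduced with pandas.crosstab. The labels below are made up to mirror the situation described above, with two virginica samples classified as versicolor.

```python
import pandas as pd

# Made-up labels for illustration.
actual = pd.Series(
    ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4, name="actual"
)
predicted = pd.Series(
    ["setosa"] * 3 + ["versicolor"] * 3
    + ["versicolor", "versicolor", "virginica", "virginica"],
    name="predicted",
)

labels = sorted(set(actual) | set(predicted))

# Rows are actual classes, columns are predicted classes.
matrix = pd.crosstab(actual, predicted).reindex(
    index=labels, columns=labels, fill_value=0
)

# Correct predictions sit on the main diagonal; everything else is an error.
correct = sum(matrix.loc[label, label] for label in labels)
errors = int(matrix.to_numpy().sum()) - correct
```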

Error distribution

In this stacked bar chart, the distribution of the classes for each prediction can be seen. The actual classes are color coded and each stack represents one of the possible classes for prediction.

[Figure: error distribution example]
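The numbers behind each stack are the counts of actual classes among the samples that received a given prediction. A small standard-library sketch on made-up labels (the function name stacks_by_prediction is hypothetical):

```python
from collections import Counter, defaultdict


def stacks_by_prediction(actual, predicted):
    """Map each predicted class to the counts of actual classes it contains."""
    stacks = defaultdict(Counter)
    for a, p in zip(actual, predicted):
        stacks[p][a] += 1
    return {p: dict(counts) for p, counts in stacks.items()}


# Made-up labels: two virginica samples are mistaken for versicolor.
actual = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4
predicted = ["setosa"] * 3 + ["versicolor"] * 3 + ["versicolor", "versicolor", "virginica", "virginica"]

stacks = stacks_by_prediction(actual, predicted)
```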

Flow of classes

This Sankey diagram shows the flow between the actual classes (left) and the predicted classes (right). Correct predictions are displayed in blue and incorrect ones are displayed in yellow.

[Figure: flow of classes example]
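A Sankey diagram is driven by a list of links, each with a source, a target and a value. The sketch below derives such links from made-up labels; the extra "correct" flag and the function name sankey_links are assumptions for illustration, and how the tool prepares its own data is not described here.

```python
from collections import Counter


def sankey_links(actual, predicted):
    """Build one link per (actual, predicted) pair, weighted by its count."""
    flows = Counter(zip(actual, predicted))
    return [
        {"source": a, "target": p, "value": n, "correct": a == p}
        for (a, p), n in sorted(flows.items())
    ]


# Made-up labels: two virginica samples are mistaken for versicolor.
actual = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4
predicted = ["setosa"] * 3 + ["versicolor"] * 3 + ["versicolor", "versicolor", "virginica", "virginica"]

links = sankey_links(actual, predicted)
```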

Multiple strategy flow of classes

This is a variation of the flow of classes in which, instead of a single strategy being covered, all the strategies are displayed side by side. This format makes it easier to compare different strategies.

[Figure: multiple strategy flow of classes example]

Matrix of nullity + class prediction error

This is a variation of the nullity matrix in which the prediction errors are color-coded. This visualization makes it easier to identify correlations between the missing values and the prediction errors.

[Figure: matrix of nullity + class prediction error example]
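The kind of relationship this view helps reveal can also be checked numerically, for example by comparing the error rate of rows that had missing values with that of complete rows. A minimal pandas sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up features with gaps, plus actual and predicted classes.
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3, 5.8, np.nan, 6.5],
    "petal_length": [1.4, 1.4, np.nan, 5.1, 5.9, 5.8],
})
actual = pd.Series(["setosa", "setosa", "virginica", "virginica", "virginica", "virginica"])
predicted = pd.Series(["setosa", "setosa", "versicolor", "virginica", "versicolor", "virginica"])

# Label each row by whether it contains at least one missing value.
row_kind = toy.isna().any(axis=1).map({True: "has missing", False: "complete"})

# True where the prediction is wrong.
is_error = actual != predicted

# Share of wrong predictions within each group of rows.
error_rate = is_error.groupby(row_kind).mean()
```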

Other functionalities

It is possible to choose the model that will be tested.

ProfileReport(df, model = "MLPClassifier")

A scikit-learn classifier instance is also accepted.

from sklearn.svm import SVC
svc = SVC(gamma = 'auto')
ProfileReport(df, model = svc)
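To get a feel for what testing a classifier against several imputation techniques involves, the following sketch does it by hand with scikit-learn. It is not the tool's implementation: the three SimpleImputer strategies, the 10% missing rate, the seed and the use of 5-fold cross-validation are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Iris ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

# Knock out about 10% of the feature values at random (assumed rate and seed).
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan

# Score the same classifier once per imputation strategy.
scores = {}
for strategy in ("mean", "median", "most_frequent"):
    pipeline = make_pipeline(SimpleImputer(strategy=strategy), SVC(gamma="auto"))
    scores[strategy] = cross_val_score(pipeline, X_missing, y, cv=5).mean()
```

Putting the imputer inside the pipeline means it is fitted on the training folds only, which avoids leaking information from the held-out fold.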

Performance

Overview of the Results

[Figures: overview of the performance test results]

More details about the tests performed can be found here.

Documentation

The documentation can be found here.

Dependencies

Python 3 is required in order to run Preprocessing Profiling. Also, the following Python libraries are used:

Library      Version
pandas       0.23.4
numpy        1.15.4
matplotlib   3.0.2
jinja        2.10.0
sklearn      0.21.2

Internet access is necessary to load the JavaScript libraries. The following JavaScript libraries are used:

Library      Version
d3           5.9.7
d3 array     1.2.4
d3 path      1.0.7
d3 shape     1.3.5
d3 sankey    0.12.1
jquery       3.4.1
bootstrap    3.3.6

Backlog

This is the first version of the tool. We have already identified several items for the backlog of future releases, summarized in the list below.

  • A new feature that allows users to upload their dataset and then shows every field with its corresponding type and whether the column has missing values. Next, the tool should allow the user to select the desired imputation strategy for each field.
  • Usability enhancements are also planned to increase the interactivity of the visualizations, particularly for "Flow of classes" and "Matrix of nullity + class prediction error" visualizations.
  • Fixes for issues with downloading/saving the report in the Google Chrome and Jupyter Notebook versions.
  • A new horizontal menu fixed on the top of the report.

Citation

Please refer to this work by citing the dissertation indicated below.

Milani, A. M. P. Preprocessing profiling model for visual analytics. http://tede2.pucrs.br/tede2/handle/tede/9007. 2019.

About the Authors

We are members of the Data Visualization and Interaction Lab (DaVInt) at PUCRS:

  • Isabel H. Manssour -- Professor Coordinator of DaVInt -- 2017-current.
  • Alessandra M. P. Milani -- Master Student in Computer Science -- 2017-2019.
  • Lucas A. Loges -- Undergraduate Student in Computer Science -- 2019-current.

More information can be found here.
