Preprocessing Profiling

Preprocessing Profiling is a tool for evaluating the results of different preprocessing techniques on tabular datasets. When a dataset with missing values is received, a machine learning algorithm is tested against a series of possible imputation techniques. When the received dataset has no missing values, missing values are generated at random, for demonstration purposes only. The results are then displayed in an organized report with several visualizations, e.g., nullity matrix, classification report, confusion matrix, and others explained further in this document.
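The idea described above can be illustrated with a minimal sketch built on scikit-learn. This is not the tool's own implementation: the helper names inject_missing and compare_imputers, the 10% missing rate, and the choice of KNeighborsClassifier are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def inject_missing(X, fraction, seed=0):
    """Return a float copy of X with roughly `fraction` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    X_missing = X.astype(float).copy()
    mask = rng.random(X_missing.shape) < fraction
    X_missing[mask] = np.nan
    return X_missing


def compare_imputers(X, y, strategies=("mean", "median", "most_frequent")):
    """Train the same classifier once per imputation strategy; return accuracy per strategy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    scores = {}
    for strategy in strategies:
        imputer = SimpleImputer(strategy=strategy)
        # Fit the imputer on the training split only, then apply it to both splits.
        X_train_filled = imputer.fit_transform(X_train)
        X_test_filled = imputer.transform(X_test)
        model = KNeighborsClassifier().fit(X_train_filled, y_train)
        scores[strategy] = accuracy_score(y_test, model.predict(X_test_filled))
    return scores


# Iris has no missing values, so some are created at random (hypothetical 10% rate).
X, y = load_iris(return_X_y=True)
X_missing = inject_missing(X, fraction=0.1)
scores = compare_imputers(X_missing, y)
print(scores)
```

Each entry in scores maps an imputation strategy to the accuracy obtained with it, which is the kind of per-strategy comparison the report presents visually.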

Installation
Usage
Performance
Documentation
Dependencies
Backlog
Citation
About the Authors

Installation

Preprocessing Profiling can be installed by running pip install https://github.com/DAVINTLAB/preprocessing-profiling/archive/master.zip.

Usage

Preprocessing Profiling will return its report in the form of a page written in HTML like this one.

Getting started

The use of Jupyter Notebook is recommended as it can make the experience more interactive. The first step is to import the necessary libraries.

import pandas as pd
from preprocessing_profiling import ProfileReport

A pandas dataframe will serve as the dataset that will be used to generate the report. In this example, we are using the Iris dataset.

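# iris.data has no header row, so the column names are supplied explicitly.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None, names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"], encoding='UTF-8')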

In Jupyter Notebook, simply calling the report will display it.

ProfileReport(df)

Otherwise, a file can be written.

ProfileReport(df).to_file(outputfile = "./path/to/file.html")

Visualizations

Nullity matrix

This matrix from missingno is a way of visualizing the distribution of missing values. It is particularly useful in identifying patterns of the missing values in the data. Missing values are displayed in white and regular entries are displayed in black.
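The data behind a nullity matrix is simply a boolean mask of the dataframe. The sketch below uses only pandas and a small made-up dataframe (df_example and its values are hypothetical); it does not reproduce the missingno rendering itself.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with a few missing cells.
df_example = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 4.6],
    "sepal_width": [3.5, 3.0, np.nan, np.nan],
    "species": ["setosa", "setosa", None, "setosa"],
})

# True where a value is present, False where it is missing.
# A nullity matrix draws one cell per entry of this mask.
present = df_example.notna()

# Number of missing cells in each column.
missing_per_column = df_example.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = df_example.isna().any(axis=1)

print(present)
print(missing_per_column)
```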

Classification report

The classification report will display the main machine learning classification metrics. The precision, recall, f1-score and support of each individual class can be seen.
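As a rough illustration of these metrics, scikit-learn can compute them directly from the actual and predicted labels. The label lists below are made up for the example and are not output of the tool.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

labels = ["setosa", "versicolor", "virginica"]
# Hypothetical actual and predicted classes for seven samples.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica", "versicolor"]

# Text table with precision, recall, f1-score and support for each class.
print(classification_report(y_true, y_pred, labels=labels))

# The same per-class metrics as arrays, ordered like `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)
```

Support is the number of samples that actually belong to each class, so it does not depend on the predictions.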

Confusion matrix

The confusion matrix shows the predictions in a matrix. In this matrix, the rows represent the actual classes and the columns represent the predictions that were made. The main diagonal contains the correct predictions and is shown in blue. Every other prediction is shown in red. In the example below, we can see two instances where the class was virginica but the algorithm classified them as versicolor.
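The layout described above (rows for actual classes, columns for predictions) matches the convention of scikit-learn's confusion_matrix. The sketch below uses made-up labels in which two virginica samples are predicted as versicolor.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["setosa", "versicolor", "virginica"]
# Hypothetical labels: two virginica samples are misclassified as versicolor.
y_true = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4
y_pred = ["setosa"] * 3 + ["versicolor"] * 3 + ["versicolor"] * 2 + ["virginica"] * 2

# Rows are actual classes, columns are predicted classes, both ordered like `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# The main diagonal holds the correct predictions; everything else is an error.
correct = int(np.trace(cm))
incorrect = int(cm.sum() - correct)

print(cm)
```

Here the entry in the virginica row and versicolor column is 2, which corresponds to the two misclassified samples.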

Error distribution

In this stacked bar chart, the distribution of the classes for each prediction can be seen. The actual classes are color coded and each stack represents one of the possible classes for prediction.
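The counts behind such a chart can be sketched with a pandas cross-tabulation. The labels below are hypothetical, and the layout (one row per predicted class, one column per actual class) is an assumption about how the stacks are organized.

```python
import pandas as pd

# Hypothetical actual and predicted classes for ten samples.
actual = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4
predicted = ["setosa"] * 3 + ["versicolor"] * 3 + ["versicolor"] * 2 + ["virginica"] * 2

# One row per predicted class (one stack), one column per actual class (one colored segment).
distribution = pd.crosstab(
    pd.Series(predicted, name="predicted"),
    pd.Series(actual, name="actual"),
)

print(distribution)
```

With matplotlib installed, distribution.plot(kind="bar", stacked=True) would draw a comparable stacked bar chart.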

Flow of classes

This Sankey diagram shows the flow between the actual classes (left) and the predicted classes (right). Correct predictions are displayed in blue and incorrect ones are displayed in yellow.
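A Sankey diagram is driven by a list of weighted links. The sketch below shows one way to derive such links from actual and predicted labels; the function name flow_links and the link fields are hypothetical and are not taken from the tool's code.

```python
from collections import Counter


def flow_links(actual, predicted):
    """Count each (actual, predicted) pair and return one link per distinct pair."""
    counts = Counter(zip(actual, predicted))
    return [
        {"source": a, "target": p, "value": n, "correct": a == p}
        for (a, p), n in sorted(counts.items())
    ]


# Hypothetical labels: two virginica samples flow into the versicolor prediction.
actual = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4
predicted = ["setosa"] * 3 + ["versicolor"] * 3 + ["versicolor"] * 2 + ["virginica"] * 2

links = flow_links(actual, predicted)
for link in links:
    print(link)
```

The correct flag is what a renderer could use to pick the color of each link.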

Multiple strategy flow of classes

A variation of the flow of classes where, instead of a single strategy being covered, all the strategies are displayed side by side. This format favors comparisons between different strategies.

Matrix of nullity + class prediction error

A variation of the nullity matrix where the prediction errors are color-coded. This visualization makes it easier to identify correlations between missing values and prediction errors.
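The relationship this visualization is meant to reveal can also be checked numerically. The sketch below compares the error rate of rows with and without missing values; all data and names in it are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical features with missing cells, plus actual and predicted classes.
features = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3, 5.8, np.nan, 6.7],
    "petal_length": [1.4, 1.3, np.nan, 5.1, 4.5, 5.7],
})
actual = pd.Series(["setosa", "setosa", "virginica", "virginica", "versicolor", "virginica"])
predicted = pd.Series(["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"])

summary = pd.DataFrame({
    # True for rows that had at least one missing value before imputation.
    "has_missing": features.isna().any(axis=1),
    # True for rows where the prediction differs from the actual class.
    "is_error": actual != predicted,
})

# Share of misclassified rows, split by whether the row had missing values.
error_rate = summary.groupby("has_missing")["is_error"].mean()
print(error_rate)
```

A clearly higher error rate among rows with missing values would suggest that the imputation strategy is hurting the predictions for those rows.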

Other functionalities

It is possible to choose the model that will be tested.

ProfileReport(df, model = "MLPClassifier")

A scikit-learn classifier instance is accepted too.

from sklearn.svm import SVC
svc = SVC(gamma = 'auto')
ProfileReport(df, model = svc)

Performance

Overview of the Results

More details about the tests performed can be found here.

Documentation

The documentation can be found here.

Dependencies

Python 3 is required in order to run Preprocessing Profiling. Also, the following Python libraries are used:

Library	Version
pandas	0.23.4
numpy	1.15.4
matplotlib	3.0.2
jinja	2.10.0
sklearn	0.21.2

Internet access is necessary to load the JavaScript libraries. The following JavaScript libraries are used:

Library	Version
d3	5.9.7
d3 array	1.2.4
d3 path	1.0.7
d3 shape	1.3.5
d3 sankey	0.12.1
jquery	3.4.1
bootstrap	3.3.6

Backlog

This is the first version of the tool. However, we have already identified several items for the backlog of future releases. These items are summarized in the list below.

A new feature to let users upload their dataset and then see every field with its corresponding type and whether the column contains missing values. The tool should then let the user select the desired imputation strategy for each field.
Usability enhancements are also planned to increase the interactivity of the visualizations, particularly for "Flow of classes" and "Matrix of nullity + class prediction error" visualizations.
Issues with downloading/saving the report in Google Chrome and the Jupyter Notebook versions.
A new horizontal menu fixed on the top of the report.

Citation

Please refer to this work by citing the dissertation indicated below. Milani, A. M. P. Preprocessing profiling model for visual analytics. http://tede2.pucrs.br/tede2/handle/tede/9007. 2019.

About the Authors

We are members of the Data Visualization and Interaction Lab (DaVInt) at PUCRS:

Isabel H. Manssour -- Professor Coordinator of DaVInt -- 2017-current.
Alessandra M. P. Milani -- Master Student in Computer Science -- 2017-2019.
Lucas A. Loges -- Undergraduate Student in Computer Science -- 2019-current.

More information can be found here.
