Summer-Project-Advanced-and-Interpretable-Unsupervised-Learning

Background

The ILS algorithm was created based on the general definition of a cluster and the quality of a clustering result. ILS offers users a way to optimize hyper-parameters, such as the number of clusters. Although clustering was not the intended purpose of ILS, its clustering performance is considerably better than that of some popular methods. ILS is an ideal method for guiding feature selection and for use in materials informatics.

Why ILS? ILS does not require the number of clusters, or a threshold separation between points, to be supplied at the outset. ILS is consistently able to identify the correct number, size, and type of clusters, regardless of the complexity of the distribution of points (including the null case and the chain problem).

Python Code Example

    from ILS_class import ILS
    from sklearn.datasets import make_blobs

    # create a 2-feature dataset of 500 points in 4 blobs
    X, y = make_blobs(n_samples=500, centers=4, n_features=2, random_state=185)
    ils = ILS(n_clusters=4, min_cluster_size=50)    # initialise ILS object
    ils.fit(X)    # run ILS algorithm
    ils.plot_labels()    # plot the fitted dataset X, one colour per cluster (2-feature datasets only)
    ils.coloured_rmin()    # plot the Rmin plot, coloured by cluster
    print(ils.labels)    # cluster label of each sample point

How does ILS for clustering work

Step 1: initialization. Initialize one labeled point and apply ILS to obtain the ordered minimum-distance Rmin(i) plot.
Step 2: cluster extraction. The number of clusters is then extracted automatically by identifying the peaks (caused by density drops between clusters) that divide the plot into n regions.
Step 3: iterative relabeling. Relabel one point in each region (preferably at the minimum) and run ILS again to obtain a fully labeled data set with n clusters defined.
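
Step 1 can be sketched in plain NumPy as follows. This is a minimal illustration, not the code in ILS_class.py: the function name rmin_ordering, the seed argument, and the use of Euclidean distance are assumptions made for the example. Starting from one labeled point, it repeatedly labels the unlabeled point closest to the labeled set and records that distance as Rmin(i).

```python
import numpy as np

def rmin_ordering(X, seed=0):
    """Hypothetical helper: spread from one seed point and record Rmin.

    Returns (order, rmin): order[i] is the index of the i-th point labeled,
    rmin[i] is its distance to the closest already-labeled point.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # distance from every point to its closest labeled point so far
    dist = np.linalg.norm(X - X[seed], axis=1)
    labeled = np.zeros(n, dtype=bool)
    labeled[seed] = True
    order, rmin = [seed], [0.0]
    for _ in range(n - 1):
        # pick the unlabeled point closest to the labeled set
        masked = np.where(labeled, np.inf, dist)
        nxt = int(np.argmin(masked))
        order.append(nxt)
        rmin.append(float(masked[nxt]))
        labeled[nxt] = True
        # the new point may now be the closest labeled neighbour of others
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(order), np.array(rmin)

# two well-separated groups on a line (assumed toy data)
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
order, rmin = rmin_ordering(X, seed=0)
print(order, rmin)
```

In this toy data the single large Rmin value marks the jump between the two groups, which is the kind of peak Step 2 looks for.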

Parameter Selection

min_cluster_size

This value should be an underestimate of the minimum cluster size. Excessively small values will lead to poor performance.

n_clusters (optional, no default)

Optional parameter that specifies the number of clusters to be found. There is no guarantee that the algorithm will find them. If the user is certain of the number of clusters, they should specify the number and also lower the significance parameter described below.

significance (default = 2.56)

The significance is the number of standard deviations by which a potential segmentation point should exceed the mean of its surroundings.
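
The significance rule can be illustrated with a short sketch. It is not the peak-finding method used in this repository: the function name find_peaks_by_significance, the window argument that defines the "surroundings", and the choice to exclude the candidate point from its own statistics are all assumptions.

```python
import numpy as np

def find_peaks_by_significance(rmin, significance=2.56, window=5):
    """Hypothetical helper: flag indices whose Rmin value exceeds the mean
    of the surrounding values by `significance` standard deviations."""
    rmin = np.asarray(rmin, dtype=float)
    peaks = []
    for i in range(len(rmin)):
        lo, hi = max(0, i - window), min(len(rmin), i + window + 1)
        # neighbours on both sides, excluding the candidate itself
        surroundings = np.concatenate([rmin[lo:i], rmin[i + 1:hi]])
        if len(surroundings) < 2:
            continue
        threshold = surroundings.mean() + significance * surroundings.std()
        if rmin[i] > threshold:
            peaks.append(i)
    return peaks

# assumed toy Rmin curve with one clear density drop at index 5
rmin = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 8.0, 1.0, 0.9, 1.1, 1.0, 1.05])
peaks = find_peaks_by_significance(rmin)
print(peaks)
```

Raising significance makes the threshold stricter, so fewer points qualify as segmentation points; lowering it has the opposite effect.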

Manual Segmentation

If the user wants to specify segmentation manually then they can use the manual segmentation function.

View the rmin plot to identify regions to segment

    ils = ILS().initial_spread(X)
    ils.plot_rmin()

Once the user has seen the plot, they pass in the indices (x1, x2, x3, ...) at which the plot should be split into the desired segments

    ils.manual_segmentation([x1, x2, x3])
    ils.plot_labels()

For examples, see the Testing/ILS_tests_plots.ipynb notebook.

Semi-supervised Learning, Label Spreading

If the user already has some labelled points then they can perform the spreading once with ILS.label_sprd_semi_sup(labelled_points, unlabelled_points).

An example is shown in Testing/ILS_tests_plots.ipynb
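
To illustrate the idea of spreading labels from points that are already labelled, here is a minimal NumPy sketch. It does not call or reproduce ILS.label_sprd_semi_sup; the function name spread_labels, the convention that -1 marks an unlabelled point, and the use of Euclidean distance are assumptions. It recomputes all distances on each pass, so it is only suitable for small examples.

```python
import numpy as np

def spread_labels(X, labels):
    """Hypothetical helper: repeatedly give the unlabelled point that is
    closest to any labelled point the label of that nearest neighbour.

    labels: integer array in which -1 marks an unlabelled point.
    """
    X = np.asarray(X, dtype=float)
    labels = np.array(labels, dtype=int)  # copy, so the input is untouched
    if not (labels >= 0).any():
        raise ValueError("at least one labelled point is required")
    while (labels < 0).any():
        lab_idx = np.flatnonzero(labels >= 0)
        unl_idx = np.flatnonzero(labels < 0)
        # pairwise distances, shape (n_unlabelled, n_labelled)
        d = np.linalg.norm(X[unl_idx, None, :] - X[None, lab_idx, :], axis=2)
        u, l = np.unravel_index(np.argmin(d), d.shape)
        labels[unl_idx[u]] = labels[lab_idx[l]]
    return labels

# assumed toy data: two columns of points, one labelled point in each
X = np.array([[0, 0], [0, 1], [0, 2], [10, 0], [10, 1], [10, 2]])
initial = [0, -1, -1, 1, -1, -1]
result = spread_labels(X, initial)
print(result)
```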

Changing Parameters

If the user is not satisfied with the clustering, they can change the parameters they initially passed and perform the clustering again without repeating the initial spreading step.

Clustering Performance/Trouble Shooting

Identifying quality of clustering.

If the user already has a suggested clustering, it is better to use the ILS_Evaluation class to check its quality. See the ILS_Evaluation section.

The quality of the clustering can be judged from the coloured distance plot, in which the colour of each point indicates the cluster it belongs to. A good clustering result shows only a small amount of colour mixing.

Good clustering

Poor Clustering

Large differences in cluster size

When a small cluster is significantly smaller than another cluster, approximately one tenth the size or less, the segmentation method may detect multiple clusters within the large cluster.

If you believe the clusters are well separated, increase the significance level to around 3.4, for example ILS_object = ILS(significance = 3.4).

Alternatively, specify your desired segmentation manually, as described under Manual Segmentation.

Low density cluster connected to high density cluster

In these cases the segmentation method may have correctly segmented the distance plot but the spreading has not performed well. (Need to add another git repository)

Note: The most subjective step is separating the peaks from the noise; add current peak finding method over here

Tradeoffs

The current weakness of ILS is its scaling with the number of points (as opposed to the number of dimensions). Because the ILS algorithm runs the iterative label spreading method twice (a first run to generate labels and a second run to check the labeling results of the first), run time grows with the size of the dataset.

Testing and evaluation

Unit tests are implemented in this project. The aim is a readable, maintainable, and trustworthy test set for evaluating the clustering results of ILS. The test datasets cover low- and high-dimensional data, such as blobs, circles and moons, as well as more complex artificial and real-world datasets of higher dimension. Bokeh and Matplotlib are used to visualize the results for evaluation.

After clustering, the result can be evaluated from the re-coloured dataset and the re-coloured Rmin plot. These are produced by calling either .coloured_rmin and .plot_labels, or .rainbow_rmin.

Functions: .coloured_rmin and .plot_labels let users evaluate results from the dataset and Rmin plots, coloured by cluster.

Function: .rainbow_rmin returns the clustering result and the Rmin plot coloured by labelling sequence. In both plots, colours range from red to purple following the labelling sequence, which helps users who are interested in the order in which the points of the dataset were labelled. This lets users compare and evaluate clustering results in more depth, in order to find the most appropriate clustering algorithm.

However, the current Bokeh plotting methods have limitations. For higher-dimensional datasets, for example three-dimensional ones, they cannot represent the scatter in a properly three-dimensional way. Future work is expected to provide a plotting method other than Bokeh that gives users a more intuitive way to evaluate the clustering result.
