Skip to content

Commit 889966d

Browse files
x006amueller
authored andcommitted
Updates for DBSCAN clsutering docs
Please enter the commit message for your changes. Lines starting
1 parent 38868a6 commit 889966d

File tree

1 file changed

+45
-11
lines changed

1 file changed

+45
-11
lines changed

doc/modules/clustering.rst

Lines changed: 45 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -428,17 +428,51 @@ enable only merging of neighboring pixels on an image, as in the
428428
DBSCAN
429429
======
430430

431-
The :class:`DBSCAN` algorithm clusters data by finding core points which have
432-
many neighbours within a given radius. After a core point is found, the cluster
433-
is expanded by adding its neighbours to the current cluster and recursively
434-
checking if any are core points. Formally, a point is considered a core point
435-
if it has more than min_points points which are of a similarity greater than
436-
the given threshold eps. This is shown in the figure below, where the color
437-
indicates cluster membership and large circles indicate core points found by
438-
the algorithm. Moreover, the algorithm can detect outliers, indicated by black
439-
points below. The outliers are defined as points which do not belong to
440-
any current cluster and do not have enough close neighbours to start a
441-
new cluster.
431+
The :class:`DBSCAN` algorithm views clusters as areas of high density
432+
separated by areas of low density. Due to this rather generic view, clusters
433+
found by DBSCAN can be any shape, as opposed to k-means which assumes that
434+
clusters are convex shaped. The central component to the DBSCAN is the concept
435+
of *core samples*, which are samples that are in areas of high density. A
436+
cluster is therefore a set of core samples, each highly similar to each other
437+
and a set of non-core samples that are similar to a core sample (but are not
438+
themselves core samples). There are two parameters to the algorithm,
439+
`min_points` and `eps`, which define formally what we mean when we say *dense*.
440+
A higher `min_points` or lower `eps` indicate higher density necessary to form
441+
a cluster.
442+
443+
More formally, we define a core sample as being a sample in the dataset such
444+
that there exists `min_samples` other samples with a similarity higher than
445+
`eps` to it, which are defined as *neighbors* of the core sample. This tells
446+
us that the core sample is in a dense area of the vector space. A cluster
447+
is a set of core samples, that can be built by recursively by taking a core
448+
sample, finding all of its neighbors that are core samples, finding all of
449+
*their* neighbors that are core samples, and so on. A cluster also has a
450+
set of non-core samples, which are samples that are neighbors of a core sample
451+
in the cluster but are not themselves core samples. Intuitively, these samples
452+
are on the fringes of a cluster.
453+
454+
Any core sample is part of a cluster, by definition. Further, any cluster must
455+
have at least `min_samples` points in it, following the definition of a core
456+
sample. For any sample that is not a core sample, and does not have a
457+
similarity higher than `eps` to a core sample, it is considered an outlier by
458+
the algorithm.
459+
460+
The algorithm is non-deterministic, however the core samples themselves will
461+
always belong to the same clusters (although the labels themselves may be
462+
different). The non-determinism comes from deciding on which cluster a
463+
non-core sample belongs to. A non-core sample can be have a similarity higher
464+
than `eps` to two core samples in different classes. Following from the
465+
triangular inequality, those two core samples would be less similar than
466+
`eps` from each other -- else they would be in the same class. The non-core
467+
sample is simply assigned to which ever cluster is generated first, where
468+
the order is determined randomly within the code. Other than *that*, the
469+
algorithm is deterministic, making the results relatively stable between
470+
iterations on the same data.
471+
472+
In the figure below, the color indicates cluster membership, with large circles
473+
indicating core samples found by the algorithm. Smaller circles are non-core
474+
samples that are still part of a cluster. Moreover, the outliers are indicated
475+
by black points below.
442476

443477
.. |dbscan_results| image:: ../auto_examples/cluster/images/plot_dbscan_1.png
444478
:target: ../auto_examples/cluster/plot_dbscan.html

0 commit comments

Comments
 (0)