@@ -428,17 +428,51 @@ enable only merging of neighboring pixels on an image, as in the
428428DBSCAN
429429======
430430
431- The :class: `DBSCAN ` algorithm clusters data by finding core points which have
432- many neighbours within a given radius. After a core point is found, the cluster
433- is expanded by adding its neighbours to the current cluster and recursively
434- checking if any are core points. Formally, a point is considered a core point
435- if it has more than min_points points which are of a similarity greater than
436- the given threshold eps. This is shown in the figure below, where the color
437- indicates cluster membership and large circles indicate core points found by
438- the algorithm. Moreover, the algorithm can detect outliers, indicated by black
439- points below. The outliers are defined as points which do not belong to
440- any current cluster and do not have enough close neighbours to start a
441- new cluster.
431+ The :class: `DBSCAN ` algorithm views clusters as areas of high density
432+ separated by areas of low density. Due to this rather generic view, clusters
433+ found by DBSCAN can be any shape, as opposed to k-means which assumes that
434+ clusters are convex shaped. The central component to the DBSCAN is the concept
435+ of *core samples *, which are samples that are in areas of high density. A
436+ cluster is therefore a set of core samples, each highly similar to each other
437+ and a set of non-core samples that are similar to a core sample (but are not
438+ themselves core samples). There are two parameters to the algorithm,
439+ `min_points ` and `eps `, which define formally what we mean when we say *dense *.
440+ A higher `min_points ` or lower `eps ` indicate higher density necessary to form
441+ a cluster.
442+
443+ More formally, we define a core sample as being a sample in the dataset such
444+ that there exists `min_samples ` other samples with a similarity higher than
445+ `eps ` to it, which are defined as *neighbors * of the core sample. This tells
446+ us that the core sample is in a dense area of the vector space. A cluster
447+ is a set of core samples, that can be built by recursively by taking a core
448+ sample, finding all of its neighbors that are core samples, finding all of
449+ *their * neighbors that are core samples, and so on. A cluster also has a
450+ set of non-core samples, which are samples that are neighbors of a core sample
451+ in the cluster but are not themselves core samples. Intuitively, these samples
452+ are on the fringes of a cluster.
453+
454+ Any core sample is part of a cluster, by definition. Further, any cluster must
455+ have at least `min_samples ` points in it, following the definition of a core
456+ sample. For any sample that is not a core sample, and does not have a
457+ similarity higher than `eps ` to a core sample, it is considered an outlier by
458+ the algorithm.
459+
460+ The algorithm is non-deterministic, however the core samples themselves will
461+ always belong to the same clusters (although the labels themselves may be
462+ different). The non-determinism comes from deciding on which cluster a
463+ non-core sample belongs to. A non-core sample can be have a similarity higher
464+ than `eps ` to two core samples in different classes. Following from the
465+ triangular inequality, those two core samples would be less similar than
466+ `eps ` from each other -- else they would be in the same class. The non-core
467+ sample is simply assigned to which ever cluster is generated first, where
468+ the order is determined randomly within the code. Other than *that *, the
469+ algorithm is deterministic, making the results relatively stable between
470+ iterations on the same data.
471+
472+ In the figure below, the color indicates cluster membership, with large circles
473+ indicating core samples found by the algorithm. Smaller circles are non-core
474+ samples that are still part of a cluster. Moreover, the outliers are indicated
475+ by black points below.
442476
443477.. |dbscan_results | image :: ../auto_examples/cluster/images/plot_dbscan_1.png
444478 :target: ../auto_examples/cluster/plot_dbscan.html
0 commit comments