Updates for DBSCAN clustering docs

x006 · amueller · commit 889966d4b180 · 2012-12-24T09:28:52.000+01:00
diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst
@@ -428,17 +428,51 @@ enable only merging of neighboring pixels on an image, as in the
 DBSCAN
 ======
 
-The :class:`DBSCAN` algorithm clusters data by finding core points which have
-many neighbours within a given radius. After a core point is found, the cluster
-is expanded by adding its neighbours to the current cluster and recursively
-checking if any are core points. Formally, a point is considered a core point
-if it has more than min_points points which are of a similarity greater than
-the given threshold eps. This is shown in the figure below, where the color
-indicates cluster membership and large circles indicate core points found by
-the algorithm. Moreover, the algorithm can detect outliers, indicated by black
-points below. The outliers are defined as points which do not belong to
-any current cluster and do not have enough close neighbours to start a
-new cluster.
+The :class:`DBSCAN` algorithm views clusters as areas of high density
+separated by areas of low density. Due to this rather generic view, clusters
+found by DBSCAN can be any shape, as opposed to k-means, which assumes that
+clusters are convex shaped. The central component of DBSCAN is the concept
+of *core samples*, which are samples that are in areas of high density. A
+cluster is therefore a set of core samples, each close to each other, and
+a set of non-core samples that are close to a core sample (but are not
+themselves core samples). There are two parameters to the algorithm,
+`min_samples` and `eps`, which define formally what we mean when we say *dense*.
+A higher `min_samples` or lower `eps` indicates a higher density necessary to
+form a cluster.
+
+More formally, we define a core sample as being a sample in the dataset such
+that there exist `min_samples` other samples within a distance of `eps` of
+it, which are defined as *neighbors* of the core sample. This tells
+us that the core sample is in a dense area of the vector space. A cluster
+is a set of core samples that can be built recursively by taking a core
+sample, finding all of its neighbors that are core samples, finding all of
+*their* neighbors that are core samples, and so on. A cluster also has a
+set of non-core samples, which are samples that are neighbors of a core sample
+in the cluster but are not themselves core samples. Intuitively, these samples
+are on the fringes of a cluster.
+
+Any core sample is part of a cluster, by definition. Further, any cluster must
+have at least `min_samples` points in it, following the definition of a core
+sample. Any sample that is not a core sample, and is not within a distance
+of `eps` of any core sample, is considered an outlier by
+the algorithm.
+
+The algorithm is non-deterministic, but the core samples themselves will
+always belong to the same clusters (although the labels themselves may be
+different). The non-determinism comes from deciding which cluster a
+non-core sample belongs to. A non-core sample can be within a distance of
+`eps` of two core samples in different clusters. Those two core samples must
+be more than `eps` apart from each other, or else they would be in the same
+cluster; by the triangle inequality, they are at most twice `eps` apart. The
+non-core sample is simply assigned to whichever cluster is generated first,
+where the order is determined randomly within the code. Other than *that*, the
+algorithm is deterministic, making the results relatively stable between
+runs on the same data.
+
+In the figure below, the color indicates cluster membership, with large circles
+indicating core samples found by the algorithm. Smaller circles are non-core
+samples that are still part of a cluster. Outliers are indicated by black
+points.
 
 .. |dbscan_results| image:: ../auto_examples/cluster/images/plot_dbscan_1.png
         :target: ../auto_examples/cluster/plot_dbscan.html