Understanding DBSCAN Algorithm and Implementation From Scratch - by Andrewngai - Towards Data Science
What is DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a commonly
used unsupervised clustering algorithm proposed in 1996. Unlike the better-known
k-means, DBSCAN does not require the number of clusters to be specified in advance.
It detects the number of clusters automatically from the input data and the parameters.
More importantly, DBSCAN can find arbitrarily shaped clusters that k-means cannot,
for example a cluster surrounded by a different cluster.
DBSCAN can also handle noise and outliers. Outliers are identified and marked without
being assigned to any cluster, so DBSCAN can also be used for anomaly detection
(outlier detection).
Before we take a look at the pseudocode, we first need to understand some basic
concepts and terms: Eps, MinPts, directly density-reachable, density-reachable,
density-connected, core point and border point.
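First of all, there are two parameters we need to set for DBSCAN: Eps, the radius of the
neighborhood around a point, and MinPts, the minimum number of points that neighborhood
must contain for the point to count as a core point.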
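The remaining terms can be stated compactly in terms of these two parameters. The block below is a sketch of the standard definitions; the notation (D for the dataset, dist for the distance function, N_Eps(p) for the neighborhood of p) is an assumption introduced here for illustration.

```latex
\begin{align*}
&N_{Eps}(p) = \{\, q \in D \mid \operatorname{dist}(p, q) \le Eps \,\} \\
&q \text{ is directly density-reachable from } p \iff q \in N_{Eps}(p) \ \text{and}\ |N_{Eps}(p)| \ge MinPts \\
&q \text{ is density-reachable from } p \iff \exists\, p_1 = p, \dots, p_n = q \text{ with each } p_{i+1} \text{ directly density-reachable from } p_i \\
&p \text{ and } q \text{ are density-connected} \iff \exists\, o \in D \text{ such that both } p \text{ and } q \text{ are density-reachable from } o
\end{align*}
```

Note that N_Eps(p) always contains p itself, because dist(p, p) = 0.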
Density-reachable example
https://towardsdatascience.com/understanding-dbscan-algorithm-and-implementation-from-scratch-c256289479c5 2/10
28/3/2021 Understanding DBSCAN Algorithm and Implementation from Scratch | by Andrewngai | Towards Data Science
Density-connected example
Finally, a point is a core point if it has at least a specified number of points (MinPts)
within Eps, counting the point itself. These are the points in the interior of a cluster.
A border point has fewer than MinPts points within Eps but lies in the neighborhood of a
core point. We can also define an outlier (noise) point as any point that is neither a
core point nor a border point.
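As a minimal sketch of these three definitions, the snippet below labels every point as core, border or noise. The function name classify_points and the toy coordinates are hypothetical and chosen only for illustration; they are not the data used later in this article.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    A neighborhood includes the point itself, because its distance to itself is 0.
    """
    # Eps-neighborhood of every point, stored as lists of indices
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_pts points in their neighborhood
    core = {i for i, n in enumerate(neighborhoods) if len(n) >= min_pts}

    labels = []
    for i, n in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in n):
            # Not core, but inside the neighborhood of a core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data (assumption, not the article's points)
toy = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0), (5.0, 5.0)]
labels = classify_points(toy, eps=1.0, min_pts=3)
print(labels)
# ['core', 'core', 'core', 'border', 'noise']
```

The fourth point has only one neighbor besides itself, so it is not a core point, but that neighbor is a core point, which makes it a border point.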
Now, let’s take a look at how the DBSCAN algorithm actually works. Here is the
pseudocode.
1. Arbitrarily select a point p
2. Retrieve all points density-reachable from p with respect to Eps and MinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable from p and DBSCAN visits the
next point of the database
5. Continue the process until all of the points have been processed
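The procedure above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation given later in this article: the name dbscan_sketch, the use of label 0 for noise, and the toy coordinates are all hypothetical.

```python
import math
from collections import deque

def dbscan_sketch(points, eps, min_pts):
    """Return one label per point: 0 for noise, 1, 2, ... for clusters."""
    n = len(points)
    neighborhoods = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighborhoods]
    labels = [0] * n  # 0 means "not assigned"; points left at 0 are noise
    cluster = 0

    for p in range(n):
        # Only an unassigned core point can start a new cluster
        if labels[p] != 0 or not is_core[p]:
            continue
        cluster += 1
        labels[p] = cluster
        frontier = deque([p])
        # Collect every point that is density-reachable from p
        while frontier:
            current = frontier.popleft()
            for q in neighborhoods[current]:
                if labels[q] == 0:
                    labels[q] = cluster
                    # Border points join the cluster but are not expanded
                    if is_core[q]:
                        frontier.append(q)
    return labels

# Hypothetical toy data: two dense groups and one isolated point
toy = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0),
       (5.0, 5.0), (5.5, 5.0), (5.0, 5.5), (9.0, 9.0)]
result = dbscan_sketch(toy, eps=1.0, min_pts=3)
print(result)
# [1, 1, 1, 1, 2, 2, 2, 0]
```

Only core points are pushed onto the queue, which mirrors step 4: nothing is density-reachable from a border point, so the search does not continue through it.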
Example
Consider the following 9 two-dimensional data points:
Use the Euclidean distance with Eps = 1 and MinPts = 3. Find all core points, border
points and noise points, and show the final clusters using the DBSCAN algorithm. Let’s
show the result step by step.
N(x9) = {x9}
If the size of N(p) is at least MinPts, then p is said to be a core point. Here MinPts
is 3, so p is a core point when N(p) contains at least 3 points. Thus the core points
are: {x1, x2, x3, x5, x7, x8}
Then, according to the definition of border points: a point p is a border point if it is
not a core point but N(p) contains at least one core point. N(x4) = {x4, x8} and
N(x6) = {x6, x5}. Here x8 and x5 are core points, so both x4 and x6 are border points.
The remaining point, x9, is a noise point.
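The border and noise check for the three non-core points can be reproduced with a few set operations. The core set and the neighborhoods below are copied from the text above; the helper name classify_non_core is hypothetical.

```python
# Values copied from the worked example in the text
core_points = {"x1", "x2", "x3", "x5", "x7", "x8"}
neighborhoods = {
    "x4": {"x4", "x8"},
    "x6": {"x6", "x5"},
    "x9": {"x9"},
}

def classify_non_core(neighborhood, core_points):
    """A non-core point is a border point if N(p) contains a core point."""
    return "border" if neighborhood & core_points else "noise"

result = {p: classify_non_core(n, core_points) for p, n in neighborhoods.items()}
print(result)
# {'x4': 'border', 'x6': 'border', 'x9': 'noise'}
```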
3. Here x1 is a core point, so a cluster is formed. We have Cluster_1: {x1, x2, x3, x7}
4. Next, we choose x5 and retrieve all points density-reachable from x5: {x8, x4, x6}
5. Here x5 is a core point, so a cluster is formed. We have Cluster_2: {x5, x4, x8, x6}
6. Next, we choose x9. x9 is a noise point, and noise points do NOT belong to any cluster.
Python Implementation
Here is some sample code to implement DBSCAN from scratch in Python 3. I have also
added a visualization of the points, with all outliers marked in blue.
import numpy as np
import collections
import matplotlib.pyplot as plt
import queue
import scipy.io as spio

# Labels for the different point groups
NOISE = 0
UNASSIGNED = 0  # points never assigned to a cluster keep label 0 and count as noise
core = -1
edge = -2


# Find all neighbor points within the given radius (the point itself is included)
def neighbor_points(data, pointId, radius):
    points = []
    for i in range(len(data)):
        # Euclidean distance using the L2 norm
        if np.linalg.norm(data[i] - data[pointId]) <= radius:
            points.append(i)
    return points


# DBSCAN algorithm
def dbscan(data, Eps, MinPt):
    # Initialize all point labels to unassigned
    pointlabel = [UNASSIGNED] * len(data)
    pointcount = []
    # Initialize lists for core and non-core points
    corepoint = []
    noncore = []

    # Find the neighbors of every point
    # (the original passed the global `train` here instead of `data`)
    for i in range(len(data)):
        pointcount.append(neighbor_points(data, i, Eps))

    # Find all core points, edge points and noise
    for i in range(len(pointcount)):
        if len(pointcount[i]) >= MinPt:
            pointlabel[i] = core
            corepoint.append(i)
        else:
            noncore.append(i)

    # A non-core point with a core point in its neighborhood is an edge point
    for i in noncore:
        for j in pointcount[i]:
            if j in corepoint:
                pointlabel[i] = edge
                break

    # Start assigning points to clusters
    cl = 1
    # Use a queue to hold neighboring core points and visit the neighbors' neighbors
    for i in range(len(pointlabel)):
        q = queue.Queue()
        if pointlabel[i] == core:
            pointlabel[i] = cl
            for x in pointcount[i]:
                if pointlabel[x] == core:
                    q.put(x)
                    pointlabel[x] = cl
                elif pointlabel[x] == edge:
                    pointlabel[x] = cl
            # Stop when all points in the queue have been checked
            while not q.empty():
                neighbors = pointcount[q.get()]
                for y in neighbors:
                    if pointlabel[y] == core:
                        pointlabel[y] = cl
                        q.put(y)
                    elif pointlabel[y] == edge:
                        pointlabel[y] = cl
            cl = cl + 1  # move to the next cluster

    return pointlabel, cl


# Plot the final result
def plotRes(data, clusterRes, clusterNum):
    nPoints = len(data)
    scatterColors = ['black', 'green', 'brown', 'red', 'purple', 'orange', 'yellow']
    for i in range(clusterNum):
        if i == 0:
            # Plot all noise points in blue
            color = 'blue'
        else:
            color = scatterColors[i % len(scatterColors)]
        x1 = []
        y1 = []
        for j in range(nPoints):
            if clusterRes[j] == i:
                x1.append(data[j, 0])
                y1.append(data[j, 1])
        plt.scatter(x1, y1, c=color, alpha=1, marker='.')


# Load data
raw = spio.loadmat('DBSCAN.mat')
train = raw['Points']

# Set Eps and MinPts
epss = [5, 10]
minptss = [5, 10]
# Find all clusters and outliers for each setting and print the results
for eps in epss:
    for minpts in minptss:
        print('Set eps = ' + str(eps) + ', Minpoints = ' + str(minpts))
        pointlabel, cl = dbscan(train, eps, minpts)
        plotRes(train, pointlabel, cl)
        plt.show()
        print('number of clusters found: ' + str(cl - 1))
        counter = collections.Counter(pointlabel)
        print(counter)
        outliers = pointlabel.count(NOISE)
        print('number of outliers found: ' + str(outliers) + '\n')
Thanks for reading. I look forward to hearing your questions and thoughts. If you
want to learn more about Data Science and Cloud Computing, you can find me on
LinkedIn.
Reference
https://github.com/NSHipster/DBSCAN
https://en.wikipedia.org/wiki/DBSCAN