Understanding DBSCAN Algorithm and Implementation From Scratch - by Andrewngai - Towards Data Science

The document summarizes the DBSCAN clustering algorithm in three paragraphs: 1) It introduces DBSCAN and explains that it can automatically detect the number of clusters, find clusters of arbitrary shapes, handle noise and outliers, and is used for anomaly detection. 2) It describes the key concepts of DBSCAN including the Eps and MinPts parameters, and defines core points, border points, and outliers. 3) It outlines the steps of the DBSCAN algorithm to cluster points based on density connectivity, and provides a Python implementation with visualization of an example dataset.


28/3/2021 Understanding DBSCAN Algorithm and Implementation from Scratch | by Andrewngai | Towards Data Science

Understanding DBSCAN Algorithm and Implementation from Scratch
DBSCAN Algorithm Step by Step, Python Implementation, and Visualization.

Andrewngai Jun 9, 2020 · 5 min read

What is DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a commonly
used unsupervised clustering algorithm proposed in 1996. Unlike the better-known
k-means, DBSCAN does not require you to specify the number of clusters in advance; it
detects the number of clusters from the input data and its parameters. More
importantly, DBSCAN can find arbitrarily shaped clusters that k-means cannot, for
example a cluster that is surrounded by a different cluster.

DBSCAN vs K-means, credit


https://towardsdatascience.com/understanding-dbscan-algorithm-and-implementation-from-scratch-c256289479c5 1/10

DBSCAN can also handle noise and outliers. Outliers are identified and marked
without being assigned to any cluster, so DBSCAN can also be used for
anomaly detection (outlier detection).

Before we look at the pseudocode, we first need to understand some basic
concepts and terms: Eps, MinPts, directly density-reachable, density-reachable,
density-connected, core point, and border point.

First of all, there are two parameters we need to set for DBSCAN: Eps and MinPts.

Eps: Maximum radius of the neighborhood

MinPts: Minimum number of points required in the Eps-neighborhood of a point

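Next is the concept of directly density-reachable. The Eps-neighborhood of a point q
is N_Eps(q) = {p ∈ D | dist(p, q) ≤ Eps}, where D is the dataset. A point p is directly
density-reachable from a point q w.r.t. Eps and MinPts if p ∈ N_Eps(q) and
|N_Eps(q)| ≥ MinPts. Building on this, a point p is density-reachable from q if there is
a chain of points from q to p in which each point is directly density-reachable from the
one before it, and two points are density-connected if both are density-reachable from
some common point o. The two examples below, with MinPts = 5 and Eps = 1, illustrate
density-reachable and density-connected.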
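As a minimal sketch of the definition of directly density-reachable, the check can be written in a few lines of Python. The function names and the toy points are hypothetical and chosen only for illustration (they are not the points from the figures), and math.dist assumes Python 3.8 or later.

```python
import math


def eps_neighborhood(points, q, eps):
    """Indices of all points within distance eps of points[q] (q itself included)."""
    return [i for i, p in enumerate(points) if math.dist(p, points[q]) <= eps]


def directly_density_reachable(points, p, q, eps, min_pts):
    """True if points[p] is directly density-reachable from points[q]."""
    neighbors = eps_neighborhood(points, q, eps)
    # Two conditions: p lies in q's neighborhood, and that neighborhood is dense enough.
    return p in neighbors and len(neighbors) >= min_pts


# Hypothetical toy data: a tight group of five points plus one point further out.
toy = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (1.4, 0.0)]

# Point 5 is directly density-reachable from point 1 (point 1 has a dense neighborhood)...
print(directly_density_reachable(toy, 5, 1, eps=1.0, min_pts=5))  # True
# ...but not the other way around: the neighborhood of point 5 holds only two points.
print(directly_density_reachable(toy, 1, 5, eps=1.0, min_pts=5))  # False
```

The two calls show that the relation is not symmetric: it only holds in both directions when both points have dense neighborhoods.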

Density-reachable example

Density-connected example

Finally, a point is a core point if its Eps-neighborhood contains at least MinPts points
(counting the point itself, as in the example later on). Core points lie in the interior
of a cluster. A border point has fewer than MinPts points within Eps but lies in the
neighborhood of a core point. An outlier (noise) point is any point that is neither a
core point nor a border point.

Core point, Border point, Outlier Point examples

Now, let’s take a look at how the DBSCAN algorithm actually works. Here is the
pseudocode.

1. Arbitrarily select a point p

2. Retrieve all points density-reachable from p based on Eps and MinPts

3. If p is a core point, a cluster is formed

4. If p is a border point, no points are density-reachable from p and DBSCAN visits the
next point of the database

5. Continue the process until all of the points have been processed

If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is
the number of database objects. Otherwise, the complexity is O(n²).
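To illustrate why an index helps, here is a rough sketch of one very simple spatial index: a uniform grid whose cells are Eps wide. This is an assumption made for illustration only. The article does not say which index to use, the helper names are hypothetical, and a plain grid does not by itself guarantee the O(n log n) bound (tree-based indexes are the usual choice for that). It only shows how a neighborhood query can avoid comparing against every point.

```python
import math
import random
from collections import defaultdict


def build_grid(points, eps):
    """Bucket 2-D point indices by grid cell; every cell is eps wide."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(math.floor(x / eps), math.floor(y / eps))].append(i)
    return grid


def grid_neighbors(points, grid, q, eps):
    """Eps-neighborhood of points[q], scanning only the 3x3 block of cells around it."""
    cx = math.floor(points[q][0] / eps)
    cy = math.floor(points[q][1] / eps)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for i in grid.get((cx + dx, cy + dy), ()):
                if math.dist(points[i], points[q]) <= eps:
                    found.append(i)
    return sorted(found)


def brute_neighbors(points, q, eps):
    """Reference implementation: compare points[q] against every point."""
    return [i for i, p in enumerate(points) if math.dist(p, points[q]) <= eps]


# Hypothetical random data, used only to compare the two query strategies.
rng = random.Random(0)
sample = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
grid = build_grid(sample, eps=1.0)
same = all(
    grid_neighbors(sample, grid, q, 1.0) == brute_neighbors(sample, q, 1.0)
    for q in range(len(sample))
)
print("grid and brute-force neighborhoods agree:", same)
```

Any point within Eps of the query differs from it by at most Eps in each coordinate, so it must fall in the query's cell or one of the eight adjacent cells; that is why scanning the 3x3 block is enough.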

Example
Consider the following 9 two-dimensional data points:

x1(0,0), x2(1,0), x3(1,1), x4(2,2), x5(3,1), x6(3,0), x7(0,1), x8(3,2), x9(6,3)

Use the Euclidean distance with Eps = 1 and MinPts = 3. Find all core points, border
points, and noise points, and show the final clusters produced by the DBSCAN algorithm.
Let’s work through the result step by step.

Example Data Visualization

First, calculate N(p), the Eps-neighborhood of each point p:

N(x1) = {x1, x2, x7}

N(x2) = {x2, x1, x3}

N(x3) = {x3, x2, x7}

N(x4) = {x4, x8}



N(x5) = {x5, x6, x8}

N(x6) = {x6, x5}

N(x7) = {x7, x1, x3}

N(x8) = {x8, x4, x5}

N(x9) = {x9}

If the size of N(p) is at least MinPts, then p is a core point. Here MinPts is 3, so
every point whose neighborhood contains at least 3 points is a core point. The core
points are therefore: {x1, x2, x3, x5, x7, x8}

Next, by the definition of border points, a point p is a border point if it is not a
core point but N(p) contains at least one core point. N(x4) = {x4, x8} and
N(x6) = {x6, x5}; since x8 and x5 are core points, both x4 and x6 are border points.
The remaining point, x9, is a noise point.
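The neighborhoods and the core/border/noise split above can be double-checked with a short script. This is only a verification sketch using set operations on the nine example points; the variable names are chosen for illustration, and math.dist assumes Python 3.8 or later.

```python
import math

# The nine example points from the text, keyed by name.
points = {
    "x1": (0, 0), "x2": (1, 0), "x3": (1, 1), "x4": (2, 2), "x5": (3, 1),
    "x6": (3, 0), "x7": (0, 1), "x8": (3, 2), "x9": (6, 3),
}
EPS, MIN_PTS = 1, 3

# N(p): every point within EPS of p, including p itself.
neighborhoods = {
    name: {other for other, q in points.items() if math.dist(p, q) <= EPS}
    for name, p in points.items()
}

# Core: neighborhood has at least MIN_PTS points.
core = {name for name, nbrs in neighborhoods.items() if len(nbrs) >= MIN_PTS}
# Border: not core, but at least one core point in the neighborhood.
border = {
    name for name, nbrs in neighborhoods.items()
    if name not in core and nbrs & core
}
# Noise: everything else.
noise = set(points) - core - border

for name in points:
    print(f"N({name}) = {sorted(neighborhoods[name])}")
print("core:  ", sorted(core))    # ['x1', 'x2', 'x3', 'x5', 'x7', 'x8']
print("border:", sorted(border))  # ['x4', 'x6']
print("noise: ", sorted(noise))   # ['x9']
```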

Now, let’s follow the pseudocode to produce the clusters.

1. Arbitrarily select a point p; here we choose x1.

2. Retrieve all points density-reachable from x1: {x2, x3, x7}

3. x1 is a core point, so a cluster is formed: Cluster_1: {x1, x2, x3, x7}

4. Next, we choose x5 and retrieve all points density-reachable from x5: {x8, x4, x6}

5. x5 is a core point, so a cluster is formed: Cluster_2: {x5, x4, x8, x6}

6. Next, we choose x9. x9 is a noise point, and noise points do NOT belong to any cluster.

7. All points have now been processed, so the algorithm stops here.
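The walk-through above can be reproduced with a compact sketch of the cluster-expansion step. This is a simplified illustration written for this example, not the full implementation from the article: the function name is hypothetical, points are visited in index order rather than arbitrarily, and label 0 is used for noise.

```python
import math
from collections import deque


def dbscan_labels(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters and 0 for noise."""
    n = len(points)
    nbrs = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in nbrs]
    labels = [0] * n
    cluster = 0
    for start in range(n):
        # Only an unlabeled core point can start a new cluster.
        if not is_core[start] or labels[start] != 0:
            continue
        cluster += 1
        labels[start] = cluster
        frontier = deque([start])
        while frontier:
            current = frontier.popleft()
            for j in nbrs[current]:
                if labels[j] == 0:
                    labels[j] = cluster
                    # Border points join the cluster but are not expanded further.
                    if is_core[j]:
                        frontier.append(j)
    return labels


# x1..x9 from the example, in order.
example = [(0, 0), (1, 0), (1, 1), (2, 2), (3, 1), (3, 0), (0, 1), (3, 2), (6, 3)]
labels = dbscan_labels(example, eps=1, min_pts=3)
print(labels)  # [1, 1, 1, 2, 2, 2, 1, 2, 0]
```

Reading the labels by position gives Cluster_1 = {x1, x2, x3, x7}, Cluster_2 = {x4, x5, x6, x8}, and x9 as noise, which matches the steps above.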

Final DBSCAN Cluster Result

Python Implementation
Here is some sample code that implements DBSCAN from scratch in Python 3. I have also
added a visualization of the points, with all outliers marked in blue.

import collections
import queue

import matplotlib.pyplot as plt
import numpy as np
import scipy.io as spio

# Labels for the different point groups
NOISE = 0       # final label for points that belong to no cluster
UNASSIGNED = 0  # initial label; points never claimed by a cluster stay noise
core = -1
edge = -2


# Find all neighbor points within the given radius of point pointId
def neighbor_points(data, pointId, radius):
    points = []
    for i in range(len(data)):
        # Euclidean distance using the L2 norm
        if np.linalg.norm(data[i] - data[pointId]) <= radius:
            points.append(i)
    return points


# DBSCAN algorithm
def dbscan(data, Eps, MinPt):
    # Initialize all point labels to unassigned
    pointlabel = [UNASSIGNED] * len(data)
    pointcount = []
    # Initialize lists for core and non-core points
    corepoint = []
    noncore = []

    # Find the neighbors of every point
    # (uses the data argument, not the global variable train)
    for i in range(len(data)):
        pointcount.append(neighbor_points(data, i, Eps))

    # Find all core points
    for i in range(len(pointcount)):
        if len(pointcount[i]) >= MinPt:
            pointlabel[i] = core
            corepoint.append(i)
        else:
            noncore.append(i)

    # A non-core point with a core point in its neighborhood is an edge point;
    # the remaining non-core points keep label 0 (noise)
    for i in noncore:
        for j in pointcount[i]:
            if j in corepoint:
                pointlabel[i] = edge
                break

    # Start assigning points to clusters
    cl = 1
    # Use a queue to hold neighboring core points and visit the neighbors' neighbors
    for i in range(len(pointlabel)):
        q = queue.Queue()
        if pointlabel[i] == core:
            pointlabel[i] = cl
            for x in pointcount[i]:
                if pointlabel[x] == core:
                    q.put(x)
                    pointlabel[x] = cl
                elif pointlabel[x] == edge:
                    pointlabel[x] = cl
            # Stop when every point in the queue has been checked
            while not q.empty():
                neighbors = pointcount[q.get()]
                for y in neighbors:
                    if pointlabel[y] == core:
                        pointlabel[y] = cl
                        q.put(y)
                    elif pointlabel[y] == edge:
                        pointlabel[y] = cl
            cl = cl + 1  # move to the next cluster

    # cl is one more than the number of clusters found
    return pointlabel, cl


# Plot the final result
def plotRes(data, clusterRes, clusterNum):
    nPoints = len(data)
    scatterColors = ['black', 'green', 'brown', 'red', 'purple', 'orange', 'yellow']
    for i in range(clusterNum):
        if i == 0:
            # Plot all noise points in blue
            color = 'blue'
        else:
            color = scatterColors[i % len(scatterColors)]
        x1 = []
        y1 = []
        for j in range(nPoints):
            if clusterRes[j] == i:
                x1.append(data[j, 0])
                y1.append(data[j, 1])
        plt.scatter(x1, y1, c=color, alpha=1, marker='.')


# Load data
raw = spio.loadmat('DBSCAN.mat')
train = raw['Points']

# Set Eps and MinPts
epss = [5, 10]
minptss = [5, 10]
# Find all clusters and outliers under each setting and print the results
for eps in epss:
    for minpts in minptss:
        print('Set eps = ' + str(eps) + ', Minpoints = ' + str(minpts))
        pointlabel, cl = dbscan(train, eps, minpts)
        plotRes(train, pointlabel, cl)
        plt.show()
        print('number of cluster found: ' + str(cl - 1))
        counter = collections.Counter(pointlabel)
        print(counter)
        outliers = pointlabel.count(NOISE)
        print('number of outliers found: ' + str(outliers) + '\n')

Thanks for reading. I look forward to hearing your questions and
thoughts. If you want to learn more about Data Science and Cloud Computing, you
can find me on LinkedIn.
