K Means Clustering

This document analyzes mall customer data using k-means clustering. It loads the data, checks for missing values, encodes the Genre column as 0/1, fits k-means with 5 clusters, and then refits for k from 1 to 9 to compare the sum of squared distances (SSD). The SSD drops by about 60% going from 1 to 2 clusters, and by about 30% and 28% going from 2 to 3 and from 3 to 4; later drops are smaller (about 19% from 4 to 5), which suggests that most of the gain comes from the first few clusters.

Uploaded by

Walid Sassi

9/1/23, 2:11 PM Mall_kmean - Jupyter Notebook

In [1]:

import pandas as pd

In [5]:

ml = pd.read_csv("mall_kmeans.csv")

In [6]:

ml.head()

Out[6]:

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40

In [8]:

ml.isnull().sum()

Out[8]:

CustomerID 0
Genre 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64

In [9]:

ml.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              200 non-null    int64
 1   Genre                   200 non-null    object
 2   Age                     200 non-null    int64
 3   Annual Income (k$)      200 non-null    int64
 4   Spending Score (1-100)  200 non-null    int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


In [10]:

ml.Genre.value_counts()

Out[10]:

Female 112
Male 88
Name: Genre, dtype: int64

In [11]:

# Encode Genre as 0/1 and assign the result back to the column
ml['Genre'] = ml['Genre'].map({'Female': 0, 'Male': 1})

In [14]:

ml.select_dtypes(include='object').columns

Out[14]:

Index([], dtype='object')
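The encoding step above can be sketched on a small stand-in frame. The frame `demo` is hypothetical and only copies three columns of the five rows shown by `ml.head()`; `Series.map` is used here as one possible way to apply the mapping, assuming the column contains only the two labels shown.

```python
import pandas as pd

# Stand-in for the first five rows shown by ml.head(); not the full dataset.
demo = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Genre": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
})

# Map the two labels to integers; a label missing from the mapping would become NaN.
demo["Genre"] = demo["Genre"].map({"Female": 0, "Male": 1})

# After encoding, no text columns should remain.
remaining_text_cols = list(demo.select_dtypes(include="object").columns)
print(demo["Genre"].tolist(), remaining_text_cols)
```

An empty list of remaining text columns corresponds to the empty `Index` shown in Out[14].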

In [15]:

from sklearn.cluster import KMeans

In [111]:

kmeans_ml = KMeans(n_clusters=5)

In [112]:

# Note: ml still includes CustomerID, so the identifier is used as a feature here.
kmeans_ml.fit(ml)

Out[112]:

KMeans(n_clusters=5)

In [113]:

kmeans_ml.labels_

Out[113]:

array([2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4,
2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4,
2, 4, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 3, 1, 3, 1, 3,
1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3,
1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3,
1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3,
1, 3])
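The labels above change in step with the row order because the fit includes `CustomerID`, and unscaled columns with large ranges dominate the distance. As a hedged alternative, not what this notebook does, the sketch below drops the identifier, standardizes the remaining features, and fixes `random_state`. The frame `toy` and its values are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the mall data (hypothetical values, not the real file).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "CustomerID": np.arange(1, 61),
    "Annual Income (k$)": np.concatenate([rng.normal(20, 3, 30), rng.normal(90, 3, 30)]),
    "Spending Score (1-100)": np.concatenate([rng.normal(80, 4, 30), rng.normal(20, 4, 30)]),
})

# Leave the identifier out and put the remaining features on a common scale.
features = toy.drop(columns=["CustomerID"])
scaled = StandardScaler().fit_transform(features)

# random_state and n_init are set explicitly so repeated runs give the same labels.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
toy["cluster"] = model.labels_
print(toy.groupby("cluster")[["Annual Income (k$)", "Spending Score (1-100)"]].mean())
```

Whether scaling is appropriate depends on the question being asked; here it simply keeps one column from outweighing the others.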


In [114]:

set(kmeans_ml.labels_)

Out[114]:

{0, 1, 2, 3, 4}

In [115]:

kmeans_ml.cluster_centers_

Out[115]:

array([[ 92.53030303,   0.42424242,  42.72727273,  57.75757576,
         49.46969697],
       [164.        ,   0.52777778,  40.80555556,  87.91666667,
         17.88888889],
       [ 33.34285714,   0.37142857,  45.31428571,  31.8       ,
         30.31428571],
       [162.        ,   0.46153846,  32.69230769,  86.53846154,
         82.12820513],
       [ 25.16666667,   0.41666667,  25.83333333,  26.95833333,
         77.79166667]])

In [116]:

len(kmeans_ml.cluster_centers_)

Out[116]:

5

In [117]:

centroid_df = pd.DataFrame(kmeans_ml.cluster_centers_)

In [118]:

centroid_df.columns = ml.columns

In [119]:

centroid_df

Out[119]:

   CustomerID     Genre        Age  Annual Income (k$)  Spending Score (1-100)
0   92.530303  0.424242  42.727273           57.757576               49.469697
1  164.000000  0.527778  40.805556           87.916667               17.888889
2   33.342857  0.371429  45.314286           31.800000               30.314286
3  162.000000  0.461538  32.692308           86.538462               82.128205
4   25.166667  0.416667  25.833333           26.958333               77.791667
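For a converged fit, each centroid row is the per-column mean of the customers assigned to that cluster (up to the convergence tolerance). That is why the `Genre` column holds fractions such as 0.42: it is the share of customers coded 1 (Male) in that cluster. A minimal sketch of that relationship, using hypothetical points and labels rather than the mall data:

```python
import numpy as np

# Hypothetical 2-D points and cluster labels, chosen only for illustration.
points = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]])
labels = np.array([0, 0, 1, 1])

# A centroid is the column-wise mean of the points assigned to that cluster.
centroids = np.array([points[labels == k].mean(axis=0) for k in np.unique(labels)])
print(centroids)
```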


In [120]:

kmeans_ml.score(ml)  # negative of the sum of squared distances to the nearest centroid

Out[120]:

-157141.33959373957
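The score is negative because `KMeans.score` returns the opposite of the k-means objective, the sum of squared distances (SSD) from each sample to its nearest centroid. The sketch below computes that quantity by hand with NumPy; the points and centroids are hypothetical, and `score_like` is only a name used here to mirror the sign convention.

```python
import numpy as np

# Hypothetical points and centroids, for illustration only.
points = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]])
centroids = np.array([[2.0, 3.0], [11.0, 12.0]])

# Squared Euclidean distance from every point to every centroid, shape (4, 2).
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point contributes the squared distance to its nearest centroid.
ssd = sq_dist.min(axis=1).sum()
score_like = -ssd  # same sign convention as KMeans.score
print(ssd, score_like)
```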

In [94]:

lst = []
for k in range(1,10):
    kmeans_ml = KMeans(n_clusters=k)
    kmeans_ml.fit(ml)
    score = kmeans_ml.score(ml)  # negative SSD for this k
    lst.append(score)
    print("cluster over are",k, "cluster left are",len(range(1,10))-k)
    print("____________________")

C:\Users\MR.GODHADE\anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:1036: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
  warnings.warn(

cluster over are 1 cluster left are 8
____________________
cluster over are 2 cluster left are 7
____________________
cluster over are 3 cluster left are 6
____________________
cluster over are 4 cluster left are 5
____________________
cluster over are 5 cluster left are 4
____________________
cluster over are 6 cluster left are 3
____________________
cluster over are 7 cluster left are 2
____________________
cluster over are 8 cluster left are 1
____________________
cluster over are 9 cluster left are 0
____________________
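The loop above refits the model for each k and records the score. A minimal sketch of the same idea on synthetic data follows, reading the SSD directly from the fitted model's `inertia_` attribute and fixing `random_state` so the numbers are repeatable. The data `X`, the dictionary `ssd_by_k`, and the range of k are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups (hypothetical, for illustration).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

ssd_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid.
    ssd_by_k[k] = model.inertia_

for k, ssd in ssd_by_k.items():
    print(k, round(ssd, 1))
```

On data like this the SSD falls sharply up to k = 3 and only slightly afterwards, which is the "elbow" the plot in this notebook is meant to reveal.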

In [121]:

import numpy as np

In [122]:

lst = np.round(np.abs(lst))

In [123]:

cluster_num = list(range(1,10))


In [124]:

import matplotlib.pyplot as plt

In [125]:

plt.plot(cluster_num,lst, marker ="*")
plt.grid()

In [126]:

lst

Out[126]:

array([975512., 387066., 271385., 195401., 157621., 122608., 103233.,
        86004.,  77299.])

In [127]:

(975512 - 387066)*100/975512 #60% drop in ssd when k changes from 1 to 2
(387066 - 271397)*100/387066 #29% drop in ssd when k changes from 2 to 3
(271397 - 195401)*100/271397 #28% drop in ssd when k changes from 3 to 4
(195401 - 157506)*100/195401 #19% drop in ssd when k changes from 4 to 5
(157506 - 122630)*100/195401 #17% drop from k = 5 to 6, measured against the k = 4 ssd (195401); against the k = 5 ssd (157506) it is about 22%

Out[127]:

17.848424521880645


In [128]:

(387066 - 271397)*100/387066

Out[128]:

29.88353407429224

In [129]:

(271397 - 195401)*100/271397

Out[129]:

28.001783365328283

In [130]:

(195401 - 157506)*100/195401

Out[130]:

19.393452438830916
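The percentage cells above can be condensed into one pass over the SSD values. The sketch below uses the rounded values printed in Out[126]; the hand-typed cells use slightly different numbers for some k (for example 271397 instead of 271385), presumably from a different run, so the results differ in the decimals. The names `ssd` and `drops` are introduced here for illustration.

```python
# SSD values for k = 1..9, copied from the rounded `lst` output above.
ssd = [975512, 387066, 271385, 195401, 157621, 122608, 103233, 86004, 77299]

# Relative drop when going from k to k + 1, as a percentage of the SSD at k.
drops = [(prev - curr) * 100 / prev for prev, curr in zip(ssd, ssd[1:])]

for k, drop in enumerate(drops, start=1):
    print(f"k={k} -> k={k + 1}: {drop:.1f}% drop")
```

Dividing each difference by the SSD at the smaller k keeps every step on the same footing, which avoids the mixed denominators that are easy to introduce when typing the expressions by hand.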

In [131]:

colormap = np.array(['Red','Green','Blue','Yellow','Black'])

In [140]:

kmeans_ml.labels_

Out[140]:

array([2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4,
2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4,
2, 4, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 3, 1, 3, 1, 3,
1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3,
1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3,
1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3,
1, 3])


In [139]:

colormap[kmeans_ml.labels_]

Out[139]:

array(['Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black',
'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black',
'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black',
'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black',
'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black',
'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Black', 'Blue', 'Blue',
'Blue', 'Blue', 'Blue', 'Black', 'Blue', 'Blue', 'Blue', 'Blue',
'Blue', 'Blue', 'Red', 'Blue', 'Red', 'Red', 'Red', 'Red', 'Red',
'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red',
'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red',
'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red',
'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red',
'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red',
'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red',
'Red', 'Red', 'Red', 'Red', 'Yellow', 'Red', 'Yellow', 'Red',
'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow',
'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green',
'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow',
'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green',
'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow',
'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green',
'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow',
'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green',
'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow',
'Green', 'Yellow', 'Green', 'Yellow', 'Green', 'Yellow', 'Green',
'Yellow', 'Green', 'Yellow'], dtype='<U6')
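The expression `colormap[kmeans_ml.labels_]` relies on NumPy integer-array indexing: each label is used as a position in `colormap`, so the result has one color name per customer. A small sketch with hypothetical labels:

```python
import numpy as np

colormap = np.array(['Red', 'Green', 'Blue', 'Yellow', 'Black'])
labels = np.array([2, 4, 2, 0, 3, 1])  # hypothetical cluster labels

# Indexing with an integer array picks colormap[label] for every label.
colors = colormap[labels]
print(colors)
```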

In [133]:

plt.scatter(ml['Age'],ml['Annual Income (k$)'], c = colormap[kmeans_ml.labels_])

Out[133]:

<matplotlib.collections.PathCollection at 0x1c7ec6cefa0>


In [134]:

ml

Out[134]:

     CustomerID  Genre  Age  Annual Income (k$)  Spending Score (1-100)
0             1      1   19                  15                      39
1             2      1   21                  15                      81
2             3      0   20                  16                       6
3             4      0   23                  16                      77
4             5      0   31                  17                      40
..          ...    ...  ...                 ...                     ...
195         196      0   35                 120                      79
196         197      0   45                 126                      28
197         198      1   32                 126                      74
198         199      1   32                 137                      18
199         200      1   30                 137                      83

200 rows × 5 columns


In [136]:

plt.scatter(ml['Age'],ml['Spending Score (1-100)'], c = colormap[kmeans_ml.labels_])
plt.xlabel('Age')
plt.ylabel('Spending Score')

Out[136]:

Text(0, 0.5, 'Spending Score')


In [137]:

plt.scatter(ml['Annual Income (k$)'],ml['Spending Score (1-100)'], c = colormap[kmea
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score')

Out[137]:

Text(0, 0.5, 'Spending Score')
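The scatter plots above color every point in a single call, which leaves no legend to say which color belongs to which cluster. One possible variation, sketched under the assumption that a legend is wanted, is to draw one scatter per cluster. The arrays `income`, `spending`, and `labels` below are hypothetical values, not the mall data, and the `Agg` backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical income/spending values and labels, for illustration only.
income = np.array([15, 16, 17, 60, 62, 64, 120, 126, 137])
spending = np.array([39, 81, 6, 50, 48, 52, 79, 28, 83])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
colormap = np.array(['Red', 'Green', 'Blue'])

fig, ax = plt.subplots()
for k in np.unique(labels):
    mask = labels == k
    # One scatter call per cluster so each cluster gets its own legend entry.
    ax.scatter(income[mask], spending[mask], c=colormap[k], label=f"cluster {k}")
ax.set_xlabel('Annual Income (k$)')
ax.set_ylabel('Spending Score')
ax.legend()
```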

In [ ]:
