# k-Means Clustering

Goal: Partition data into **k** clusters based on nearest means.

The idea behind k-Means is to take data that has no formal classification to it and determine if there are any natural clusters (groups of related objects) within the data.

k-Means assumes that there are **k** centers within the data. The data points closest to these *centroids* are grouped together. k-Means doesn't tell you what each of those groups represents, but it helps you discover which clusters potentially exist.

## The algorithm

The k-Means algorithm is really quite simple at its core:

1. Choose **k** random points to be the initial centers
2. Repeat the following two steps until the *centroids* reach convergence:
  1. Assign each point to its nearest *centroid*
  2. Update each *centroid* to the mean of the points assigned to it

Convergence is reached when the *centroids* stop changing (or, in practice, when they move less than some small distance).

This brings about a few of the parameters that are required for k-Means:

- **k**: The number of *centroids* to attempt to locate.
- **convergence distance**: The threshold below which the *centroids* are considered to have stopped moving; once the total movement in an update step falls under it, the algorithm stops.
- **distance function**: There are a number of distance functions that can be used, but most commonly the Euclidean distance function is adequate. However, it can make convergence harder to reach in higher dimensions. A sketch of two common choices appears right after this list.
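
To make the distance function concrete, here is a minimal sketch of the two most common choices, Euclidean and Manhattan distance, written against plain `[Double]` points. The function names are illustrative and are not part of the implementation shown later, which works on a `Vector` type and its own `euclidean` helper.

```swift
// Euclidean distance: straight-line distance between two points.
func euclideanDistance(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "Points must have the same dimension")
    var sum = 0.0
    for i in 0..<a.count {
        let d = a[i] - b[i]
        sum += d * d
    }
    return sum.squareRoot()
}

// Manhattan distance: sum of the absolute per-dimension differences.
// Sometimes behaves better than Euclidean distance in high dimensions.
func manhattanDistance(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "Points must have the same dimension")
    var sum = 0.0
    for i in 0..<a.count {
        sum += abs(a[i] - b[i])
    }
    return sum
}
```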

This is what the algorithm would look like in Swift:

```swift
func kMeans(numCenters: Int, convergeDist: Double, points: [Vector]) -> [Vector] {
    var centerMoveDist = 0.0
    let zeros = [Double](repeating: 0.0, count: points[0].length)

    // Pick the initial centroids at random from the data set.
    var kCenters = reservoirSample(points, k: numCenters)

    repeat {
        var cnts = [Double](repeating: 0.0, count: numCenters)
        var newCenters = [Vector](repeating: Vector(d: zeros), count: numCenters)

        // Assignment step: add each point to the running sum of its nearest centroid.
        for p in points {
            let c = nearestCenter(p, centers: kCenters)
            cnts[c] += 1
            newCenters[c] += p
        }

        // Update step: each new centroid is the mean of its assigned points.
        // A centroid with no assigned points keeps its previous position.
        for idx in 0..<numCenters {
            if cnts[idx] > 0 {
                newCenters[idx] /= cnts[idx]
            } else {
                newCenters[idx] = kCenters[idx]
            }
        }

        // Total distance the centroids moved during this iteration.
        centerMoveDist = 0.0
        for idx in 0..<numCenters {
            centerMoveDist += euclidean(kCenters[idx], newCenters[idx])
        }

        kCenters = newCenters
    } while centerMoveDist > convergeDist

    return kCenters
}
```
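
The function above relies on a `Vector` type and a few helpers (`euclidean`, `nearestCenter`, and `reservoirSample`) that aren't shown here. The following is a minimal sketch of what they might look like, just so the code compiles; the concrete definitions used alongside the real implementation may differ.

```swift
struct Vector {
    var data: [Double]
    var length: Int { return data.count }

    init(d: [Double]) {
        data = d
    }

    // Element-wise accumulation, used to sum up the points in a cluster.
    static func += (lhs: inout Vector, rhs: Vector) {
        for i in 0..<lhs.data.count {
            lhs.data[i] += rhs.data[i]
        }
    }

    // Element-wise division by a scalar, used to turn a sum into a mean.
    static func /= (lhs: inout Vector, rhs: Double) {
        for i in 0..<lhs.data.count {
            lhs.data[i] /= rhs
        }
    }
}

// Euclidean distance between two vectors of equal length.
func euclidean(_ a: Vector, _ b: Vector) -> Double {
    var sum = 0.0
    for i in 0..<a.length {
        let d = a.data[i] - b.data[i]
        sum += d * d
    }
    return sum.squareRoot()
}

// Index of the centroid closest to the given point.
func nearestCenter(_ point: Vector, centers: [Vector]) -> Int {
    var bestIndex = 0
    var bestDistance = Double.greatestFiniteMagnitude
    for (index, center) in centers.enumerated() {
        let distance = euclidean(point, center)
        if distance < bestDistance {
            bestDistance = distance
            bestIndex = index
        }
    }
    return bestIndex
}

// Choose k of the points uniformly at random to act as the initial centroids.
func reservoirSample(_ points: [Vector], k: Int) -> [Vector] {
    precondition(k > 0 && k <= points.count, "Need at least k points to sample from")
    var sample = Array(points[0..<k])
    for i in k..<points.count {
        let j = Int.random(in: 0...i)
        if j < k {
            sample[j] = points[i]
        }
    }
    return sample
}
```

With these in place, a call would look something like `let centers = kMeans(numCenters: 3, convergeDist: 0.001, points: data)`.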

## Example

These examples are contrived to show how k-Means finds clusters. The clusters are easy to spot by eye: there is one in the lower left corner, one in the upper right corner, and maybe one in the middle.

In all of these examples the squares represent the data points and the stars represent the *centroids*.

#### Good Clustering

This first example shows k-Means finding all three clusters:



The selection of initial centroids found the lower left cluster (indicated by red) and did pretty well on the center and upper right clusters.

#### Bad Clustering

The next two examples highlight the unpredictability of k-Means and how it does not always find the best clustering.



As you can see in this one, the initial *centroids* were all a little too close together, and the blue one never reached a good position. Adjusting the convergence distance should improve the result.



In this example, the blue cluster never managed to separate from the red cluster and ended up stuck in the lower region.

## Performance

The first thing to recognize is that finding an optimal k-Means clustering is an NP-hard problem. The selection of the initial *centroids* has a big effect on how the resulting clusters end up, so trying to find an exact (optimal) solution is not practical -- even in two-dimensional space.
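
Because the outcome depends so heavily on the initial *centroids*, a common workaround (not part of the implementation above) is to run k-Means several times with different random starts and keep the run whose clusters are tightest. A minimal sketch, assuming the `kMeans`, `euclidean`, and `nearestCenter` functions shown earlier:

```swift
// Total distance from every point to its nearest centroid -- lower is tighter.
func clusteringCost(centers: [Vector], points: [Vector]) -> Double {
    var cost = 0.0
    for p in points {
        cost += euclidean(p, centers[nearestCenter(p, centers: centers)])
    }
    return cost
}

// Run k-Means `restarts` times and keep the cheapest clustering found.
func bestOfKMeans(restarts: Int, numCenters: Int, convergeDist: Double, points: [Vector]) -> [Vector] {
    precondition(restarts > 0, "Need at least one run")
    var best = kMeans(numCenters: numCenters, convergeDist: convergeDist, points: points)
    var bestCost = clusteringCost(centers: best, points: points)
    for _ in 1..<restarts {
        let candidate = kMeans(numCenters: numCenters, convergeDist: convergeDist, points: points)
        let cost = clusteringCost(centers: candidate, points: points)
        if cost < bestCost {
            best = candidate
            bestCost = cost
        }
    }
    return best
}
```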

As seen from the steps above, the complexity really isn't that bad: it is often considered to be on the order of **O(kndi)**, where **k** is the number of *centroids*, **n** is the number of **d**-dimensional vectors, and **i** is the number of iterations needed for convergence.

The amount of data has a linear effect on the running time of k-Means, but tuning how far you allow the *centroids* to move before declaring convergence can have a big impact on how many iterations are performed. As a general rule, **k** should be relatively small compared to the number of vectors.

Often, as more data is added, certain points will lie on the boundary between two *centroids*, causing those centroids to bounce back and forth between iterations; the **convergence distance** then needs to be tuned (increased) so the loop still terminates.

## See Also

[K-Means Clustering on Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering)

*Written by John Gill*