What are clustering algorithms?

What is clustering?

Clustering is a method by which a large set of data is grouped into clusters, that is, smaller sets of similar data.

Example:

The example below demonstrates clustering balls by color. There are 10 balls in total, in three different colors. We want to cluster the balls into three groups, one per color.

Balls of the same color are clustered into one group, as shown below:

Thus, clustering means grouping data, or dividing a large data set into smaller data sets whose members are similar in some way.
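
The ball example can be sketched in a few lines of Python. This is only an illustration: the list `balls`, the particular colors, the count of each color, and the helper `group_by_color` are all assumptions made up for the sketch. Here "similarity" simply means "has the same color".

```python
from collections import defaultdict

# Hypothetical data: 10 balls in three colors (the exact colors and
# per-color counts are assumed for illustration).
balls = ["red", "blue", "green", "red", "blue",
         "red", "green", "blue", "red", "green"]

def group_by_color(items):
    """Put the index of each ball into the cluster for its color."""
    clusters = defaultdict(list)
    for index, color in enumerate(items):
        clusters[color].append(index)
    return dict(clusters)

clusters = group_by_color(balls)
for color, members in clusters.items():
    print(color, members)
```

Real clustering problems rarely come with a ready-made label such as color; the groups have to be inferred from a similarity measure instead.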

What is a clustering algorithm?

A clustering algorithm attempts to find natural groups of components (or data) based on some similarity. It also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster.
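
The membership step described above might look roughly like the following sketch. It assumes the centroids are already known and uses the Euclidean distance; the function names `nearest_centroid` and `assign_points` and the sample data are hypothetical, not taken from any particular algorithm or library.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def assign_points(points, centroids):
    """Count how many points fall into each centroid's cluster."""
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest_centroid(p, centroids)] += 1
    return counts

# Assumed sample data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]

counts = assign_points(points, centroids)
print(counts)
```

The resulting list of counts, paired with the centroids, corresponds to the kind of output summary mentioned above: one centroid per cluster plus the number of components assigned to it.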


Definition:

The centroid of a cluster is a point whose parameter values are the means of the parameter values of all the points in the cluster.
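
As a small sketch of this definition, the centroid can be computed by averaging each coordinate separately. The function name `centroid` and the sample points are assumptions chosen for illustration.

```python
def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

# Assumed sample cluster of three 2-D points.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
c = centroid(cluster)
print(c)
```

Note that the centroid need not coincide with any actual point in the cluster; it is just the average position.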

What is the common metric for clustering techniques?

Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as the square root of the sum of the squared differences of their coordinates: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ...)
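
A minimal Python sketch of the Euclidean metric is given here for illustration. The function name `euclidean_distance` is an assumption, and the sketch requires both points to have the same number of coordinates.

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(d)
```

For Python 3.8 and later, the standard library function `math.dist` computes the same quantity.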