Clustering is a method by which a large set of data is grouped into smaller clusters of similar data.
Example:
Balls of the same color are clustered into a group, as shown below:
Thus, clustering means grouping data, or dividing a large data set into smaller data sets whose members are similar in some way.
A clustering algorithm attempts to find natural groups of components (or data) based on some similarity. It also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster.
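The membership step described above can be sketched in Python. This is only an illustration: the function name `assign_to_nearest`, the sample points, and the two centroids are hypothetical, and the centroids are given up front rather than found by an algorithm.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    labels = []
    for p in points:
        # Euclidean distance from this point to every centroid
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical sample data, chosen only for illustration
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]

labels = assign_to_nearest(points, centroids)
# Number of components in each cluster, as in the algorithm's output
counts = [labels.count(k) for k in range(len(centroids))]
print(labels, counts)
```

With this data, the first two points fall into the first cluster and the last two into the second, so each cluster has two components.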
Definition:
The centroid of a cluster is the point whose parameter values are the means of the parameter values of all the points in the cluster.
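As a small illustration of this definition, the sketch below averages each parameter (coordinate) over the points of one cluster. The function name `centroid` and the sample cluster are assumptions made for the example.

```python
def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    dims = len(points[0])
    # Average the i-th parameter over all points, for every parameter i
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

# Hypothetical cluster of three two-dimensional points
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
center = centroid(cluster)
print(center)  # (3.0, 4.0)
```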
Generally, the distance between two points is used as the metric for assessing the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ...)
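The Euclidean metric sums the squared differences of corresponding coordinates and takes the square root of the total. A minimal Python sketch follows; the function name `euclidean_distance` and the sample points are chosen only for illustration.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum of squared coordinate differences, then the square root
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

d = euclidean_distance((1.0, 2.0), (4.0, 6.0))
print(d)  # 5.0
```

In Python 3.8 and later, the standard library function `math.dist(p, q)` computes the same quantity.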